If you're stuck with Mongo in legacy infrastructure and it doesn't make sense to refactor/architect it away, I suggest TokuMX. It's allowed us to kick the can on this problem for at least another year: almost no lock contention, far more compact on disk (even cheap disk space adds up), and what seems to be a growing set of users.
I'm optimistic that pg9.4 will be our migration path. But regardless, tokumx has given us the breathing room
to defer the decision.
I was considering TokuMX because it seems to scale better than stock MongoDB, but to be honest I'd love to move off this database entirely. I'm somewhat unfamiliar with Postgres: does it have a storage/querying system comparable to Mongo's?
If you want to store & query JSON [1], Postgres 9.3 is great! Plus you can index expressions, meaning you get cool things like fast JSON lookups and case-insensitive searches, which are hard in Mongo: you would either have to do a slow regexp lookup or save a lower-cased version in your application logic.
CREATE INDEX ON members ((lower(my_json_data->>'email')));
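To illustrate how that index gets used (the `members` table and `my_json_data` column here are hypothetical, just matching the index definition above):

```sql
-- Hypothetical schema matching the expression index above
CREATE TABLE members (
    id serial PRIMARY KEY,
    my_json_data json NOT NULL
);

CREATE INDEX ON members ((lower(my_json_data->>'email')));

-- Case-insensitive lookup: because the WHERE clause uses the same
-- expression as the index definition, the planner can use the index
-- instead of a sequential scan.
SELECT id, my_json_data->>'email' AS email
FROM members
WHERE lower(my_json_data->>'email') = 'alice@example.com';
```

Note the WHERE clause has to repeat the exact indexed expression (`lower(my_json_data->>'email')`) for the index to be usable.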
As the communities grow, more people learn about MongoDB's limitations and feel the need to switch. I like your "ruby" example because what MongoDB faces is really similar: easy to start with, hard to go... well... "web-scale" (by the way, is this expression becoming a word, or has it already?)
I think that is the point. I would tweak the definition a little and say "Legacy is anything that's not fashionable anymore, but working and still being used just because it's working."
Yahoo isn't the only large company to instruct its employees to avoid mongodb due to it being AGPL. In my opinion, the only people using mongodb in a commercial setting are those who are paying 10gen for a commercial license or those that don't know they are violating the AGPL license.
Most people in a commercial setting aren't modifying the MongoDB source code. AGPL does not require that any software that communicates with AGPL software be AGPL'd.
But yes, if you're using MongoDB in a commercial endeavor, and you modify the source code, and you're using AGPL version, you do need to share your changes to the MongoDB source code.
As I understand it, the drivers are a point of contention, and the parent's parent explains it well.
Technically, according to the AGPL, MongoDB's database drivers' licenses (Apache) are incompatible with the AGPL, and the drivers technically should be licensed under the AGPL. Now the parent says that should be fine for the official drivers because MongoDB isn't going to sue themselves, but the issue is for community drivers, like the Node driver or the Golang driver. Since the AGPL states that any software built for the exclusive use of the accompanying software must be AGPL, it follows that community drivers should be AGPL as well.
To me that means not only can you not modify the database, you cannot modify the drivers either. And I'm also unsure whether any application that links those drivers must be AGPL as well. And if your web application must be AGPL, it also means that the source of whatever service you are providing must be available as well. So it doesn't just affect corporations that want to modify Mongo; it affects everyone who wants to use Mongo (with a community driver, at least).
>>> The Affero GPL is designed to close the so-called "application service provider loophole" in the GPL, which lets ASPs use GPL code without distributing their changes back to the open source community. Under the AGPL, if you use code in a web service, you are required to open source it.
I have no idea why you were downvoted, and I can't reply to your child post, but I wanted to point out that you're both agreeing with one another: if you're not changing the source, there shouldn't be an issue. Most deployments AFAIK don't change the source, so no issue, but if you do, the AGPL does indeed require you to share your changes back even if you're not distributing the software.
> the only people using mongodb in a commercial setting are those who are paying 10gen for a commercial license or those that don't know they are violating the AGPL license.
It is not clear to me from this message exactly what the problem is. Are we now discouraged from using MongoDB as a database (from startups to universities), from writing a MongoDB driver based on an existing language driver such as pymongo, or from writing on top of MongoDB's database drivers?
Yeah, that definitely helps clarify the issue, I think.
As long as OpenStack only uses Apache-licensed code >>from MongoDB Inc.<< and diligently avoids using any open source contributions from any community contributor to the MongoDB ecosystem, you remain compliant with your CLA.
I wouldn't have known about the conflict between the AGPL-licensed database, MongoDB, Inc.'s licensed code, and community code. I guess every ORM built on pymongo and used in a commercial setting is in that trouble zone then?
This is a huge bummer. Definitely an alert for those looking to use MongoDB in commercial settings, as the parent said.
This is actually really interesting. Due to its nature, using Mongo as a data source can easily become something very deeply ingrained in your application.
Whereas if MySQL shipped with a similar restriction, you could easily flip the connection strings and have it mostly working on Postgres or something else.
I would love to see a use case from a large deployment. MongoDB is trivial at small scale, and it is only when you get to large deployments that it really needs some TLC. If they can somehow make large deployments simple, it might make Mongo a viable contender again for humongous data (if you have a data schema that would play nicely with it).
I would love to see a solid technical reason to choose MongoDB over any of the other NoSQL DBs (CouchDB, Riak, Redis, etc.) other than "it's popular".
Geolocation, specifically GeoJSON. That's the main reason why I chose it (I started working on my app while it was at 2.0). When 2.4 came out with better geospatial indices (albeit basic compared to PostgreSQL+PostGIS) and GeoJSON support, I moved to using GeoJSON, and I am happy so far.
The website/app is at https://rwt.to , and an example route search is: from "Milky Way, Johannesburg" to "O.R. Tambo International Airport".
I should note that I've had a look at GeoCouch and it didn't fit my use case: I'm not doing trivial 'find my 3 places near [y,x]' queries, but traversing a pseudo-network of routes to calculate directions. Neo4j also wouldn't have worked in my case. TokuMX is based on MongoDB 2.2 as far as I'm aware, so it's out too.
That's a very good reason, and the first real one I have heard, thanks man.
Also, I used to work for MapBox, and I know we did one project on mongo which I was not involved in, and afterwards we built everything with CouchDB (which is how I got acquainted with it).
For the geo stuff we actually used a lot of sqlite and to a lesser extent spatialite. We would pre-calculate things and build them into the rendered tiles in mbtiles format, or stream the point/polygon data from the couch database for realtime client-side compositing.
But yeah, routing is pretty high level stuff. I think they are only now putting the finishing touches on their openstreetmap driven routing system many years later.
I would consider "It's easy to get started with" a valid technical reason.
Of all the "We moved from MongoDB to Cassandra/Riak/etc. and gained massively!" posts, I've rarely seen - and it's possible that this is selection bias - companies start with the other NoSQL options.
I want to say that, unlike MongoDB, the others actually force you to think about your data and actively decide how you are going to store it. With MongoDB you can pretty much add an index on anything, but with Cassandra (maybe Riak/Dynamo too) you only get one free index before you have to denormalize and write application code to keep your performance.
Then lastly, MongoDB is good enough for most use cases. We didn't see major performance issues until we started constantly writing data to it (high write/low read) (basically we were wrestling with lock contention). I'd wager for a significant amount of MongoDB deployments, not only is Mongo easy to use, but fast enough too.
So while the other NoSQLs are (probably) more complicated and likely more performant, MongoDB, to me, hits a sweet spot of ease of use and performance that is good enough for most applications out there.
However, considering other "raw" technical aspects like performance, durability and scaling I've never seen anything that has shown MongoDB to be a leader.
"I've rarely seen - and its possible that this is selection bias - companies start with the other NoSQL options."
It seems like everyone starts with Mongo, because everyone starts with Mongo.
This means you don't have a deluge of posts from people moving from the other databases, because:
a) there are much fewer of them
b) they chose them for solid technical reasons (not just because everyone does this)
So as for your perception that other NOSQL databases are "probably" more complicated, you should know that complexity is an objective measure. I think that mongo is definitely a lot more objectively complex than couchdb, and from what I have read around the subject, many of the other NoSQL databases.
What Mongo could well be is 'easier', which is relative. It seems like it's more familiar to certain programmers, which is kind of echoed by the fact that there's an incredibly popular object relational mapper (mongoose), that is being used with what is supposedly a non-relational database.
It's from a very insightful presentation by the creator of the Clojure language, and I only wrote a summary because I got sick of trying to get people to watch an hour-long video before trying to discuss systems on this level.
Mongoose brings a bit to the table: it adds schema validation, which Mongo doesn't inherently have and which should be part of the application anyhow. I feel that's the biggest reason to use Mongoose over the straight Mongo driver in many cases.
I've used Mongo in a couple projects where it was a great fit. The scale wasn't huge, but having pre-shaped data for a mostly read scenario was great. I've found that it works really nicely for a lot of situations, and would definitely be a consideration.
I find that document databases work best when your data is read far more than written to, and when you can shape your data structures for simple key reads in most cases combined with indexed searches. I would consider the use of ElasticSearch or RethinkDB in most cases where you might look at MongoDB. It really depends on your needs here.
Riak and Couch offer other advantages, and like anything it really depends. Cassandra is another nice option for larger scalability, but everything has a cost.
Mongo is very reasonable, and to be honest, if you don't need more than a single server for your needs, it's really easy to get up and running quickly, and development tooling is decent enough, and the concepts are pretty easy to get up to speed with.
I can't speak for everyone, but to me MongoDB is far easier than any other NoSQL engine I've looked into. The reason I said "probably" is that I can't speak for every NoSQL database out there.
We had a 5-node cluster in Mongo that we moved to Cassandra last summer. While our experience with Cassandra has by and large been more performant and cost-effective than with MongoDB, getting set up with Cassandra was not as easy as with MongoDB. With MongoDB you can literally start throwing data into your database, then add an index after the fact. With Cassandra we had to make sure our data was modeled correctly and decide where we would denormalize. Riak, from what I remember, has a similar data model to Cassandra, and Redis isn't something you just "start up and go" with (mainly because it's an in-memory store).
So I know for a fact that Cassandra, Riak, Dynamo, and Redis are far more complex than MongoDB. Cassandra even requires you to run a "repair" command periodically, and that alone makes it more ops work than MongoDB. We can even throw HBase in there too, as it requires ZooKeeper nodes, NameNodes, and all that Hadoop goodness.
Now, none of these databases are hard to use, but compared to them, Mongo is a cakewalk. You literally spin it up, throw JSON inside, and get JSON back. There is no query writing, and in most cases very little ops management. If a query is slow, you can usually fix that by adding an index or moving to SSDs; only once you have exhausted those options do you really have to consider anything else.
FullContact also has a similar story: http://www.fullcontact.com/blog/mongo-to-cassandra-migration...
tl;dr: Mongo was great for getting the product up and iterating quickly, but then they moved once they thought they needed to. It's my opinion that it's far easier to get started with MongoDB than it is with Postgres/MySQL.
Lastly, damn the technical reasons for why it's so popular: Mongo/10gen used to be a huge marketing engine around ~1.6/1.8. They captured a lot of developer mindshare, and I'd attribute that to why it's so popular now as well. It wasn't much longer after that when the naysayers and those hurt by the initial hype came out of the woodwork and we got the now-infamous "MongoDB is webscale" video.
It allows you to query JSON documents in a way similar to SQL. Redis is key/value and sits in RAM; Couch requires complicated design documents to query and is better as a key/value store; I'm unfamiliar with Riak.
I think Trello uses Mongo primarily for production. Technically it's feasible, but I've found it to be more trouble than it's worth to scale: too many machines are required per shard. I'm currently looking into RethinkDB as a replacement now though.
I think it wasn't 'ready' at that point in time, and the json based query language was closer to what we needed.
The real problem was that the data was being imported in bulk by the user, from a many-megabyte CSV. It would grind CouchDB to a halt trying to build views, so having Elasticsearch be a separate process that could work through it made a lot of sense.
Thanks for answering. I have used Elasticsearch before and was very impressed by it. Now I am trying to evaluate couchdb-lucene to see if it can prove to be a good alternative.
I think a lot of the reasons I've seen come down to business reasons, not technical ones. Someone wants an app fast, like now, and MongoDB is fast to set up and get running with.
I guess I'll be able to confirm once I'm forced to build something in it, but I don't think it can really be faster to set up and get running with than CouchDB.
Usually the first thing you need to do is write a REST layer on top of it, and with CouchDB that part is already done for you.
Obviously there's certain kinds of data I wouldn't put in Couch, or any kind of NoSQL database.
You need to know what the right tool for the job is, but I just want to figure out when that tool is Mongo.
Why would that tool ever be Mongo? I think Mongo is a thing because people coming from Rails ORM libraries feel like "wow, I can jump on the NoSQL bandwagon just by using a library that feels kind of like the ActiveRecord I'm used to."
It's only popular because there's less of a conceptual gap between mongo and the relational database tools that a lot of people are used to.
CouchDB, on the other hand, requires you to actually learn and use map/reduce, which is a pain for people who don't feel like learning something new. But CouchDB is MUCH MUCH better in a lot of ways, and Mongo is pretty much fundamentally flawed in my opinion.
I do wish Rackspace luck with their offering, though. I think it was smart of them to create this MongoDB product for one simple reason: a good number of people are already using MongoDB, so it makes sense to help them get the most out of it.
I managed one of the largest MongoDB installations and I can safely say running it at scale is extremely difficult.
They've made a few changes, like not hardcoding the maximum number of connections and shards anymore, which helps but overall the big problems like database-level locking are serious problems even a year later.
The reasons for choosing it were very simple - the lead developer was familiar with JSON and liked using it for queries, and he liked the "schema-less" nature of document storage. No consideration was given to performance or scaling issues, it was purely a comfort level decision.
I'm sorry, but this seems a bit of a lazy question.
There are many large-scale deployments of MongoDB - a simple Google search will yield you results.
Off the top of my head - FourSquare, Stripe, ServerDensity, eBay (non-site) etc.
MongoDB (the company) also uses it for MMS - their cloud-based monitoring system, which probably handles hundreds of thousands of metrics every second from tens of thousands of hosts.
So yes, there is a lot of FUD about "it doesn't scale" etc.
Most of the FUD seems to originate from people not reading the manual, and completely misconfiguring things, and wondering why it doesn't work.
To be fair, most competing products (Riak, Couch etc.) will scale enough for most people. So this is sort of a red herring. (And by the point that you are as big as FourSquare, the assumption is you'll probably hire engineers who will read the manual =) ).
So the decision boils down to other things - how easy is the query language, do you need GeoJSON support, do you need aggregations, how mature is the overall ecosystem etc.
And that's why people are picking MongoDB - not really the "WOAH, LOOK AT THE OPS PER SECOND!" numbers.
> Most of the FUD seems to originate from people not reading the manual, and completely misconfiguring things, and wondering why it doesn't work.
Most of the FUD comes from the deceptive marketing 10gen used to promote MongoDB. It now has a well-deserved bad reputation that will never go away, no matter how many startups choose it.
For example, there was noise before about how MongoDB was allegedly tweaking benchmarks.
The funny thing was, from what I've read, they've always had a policy of never ever publishing official benchmarks. Their line was, read the manual, and try it with your own data.
Maybe you misunderstood what I was asking. I know MongoDB can scale out to very large sizes, but it becomes more of an effort to scale it out correctly (3 config servers and all shards replicated). My question is: can I just turn a dial at Rackspace and have my seven servers all provisioned and configured correctly, since that takes the dead-simple single-node MongoDB deployment and extends it to a sharded, replicated cluster?
Seems a bit expensive assuming each shard gets you three more servers. Based on RS pricing you could build your own instances with 1TB SSDs for like $500 each, so for the same price that you'd get 100GB x 3 shards you could do 1TB x 3 shards on your own. I guess with the RS pricing you also get managed backups but IMO that's not worth the price difference. If you figure you build your own, 3 shards x 3 RS members x 1TB is 9 servers at $500 each or $4500. With this service... if 100GB is $1599 you have to imagine 1TB has to be at least on the order of $10k, so you're looking at $30k total. $25k a month for managed backups and infrastructure just doesn't seem worth it to me. Maybe what they're using is more powerful than the instances we're on but I still have trouble reconciling the pricing. And if that 1TB is 1TB total and not 1TB per shard/replica set member then the pricing looks way, way worse.
That's an awesome link, thanks. Does a lot to clarify where the pricing disconnect is between spinning up my own servers in the RS cloud and using ObjectRocket.
IMO RS's ObjectRocket pages should do a better job showing what you're getting on top of just deploying to a bunch of RS cloud servers with SSDs in them.
Mason55, I work for Rackspace. The ObjectRocket service does not run on the cloud servers you get from RS. It's actually a different architecture just for MongoDB: flash drives all over, containers, pretty much tuned for MongoDB across the stack.
While we don't describe the full details of the architecture, you make a very valid recommendation, which we will take into consideration to make sure it is clear what you are getting for your money. Thanks.
Sort of ... but there are pretty serious limitations.
1. You are locked into Rackspace as your provider. MongoHQ, by contrast, supports multiple cloud providers.
2. They force you to shard. This increases operational and application complexity ... and may not offer any real advantages for the amount of data that you have.