SQL, short for “Structured Query Language”, is the dominant database technology in the web development game – and has been for years. Since “years” translates to “eons” in web development, many people felt it was time for something new, and being especially creative people, they came up with NoSQL, the current hype.
But what is NoSQL, why is it everybody’s darling at the moment, and how do I use it, especially if I’ve been working with SQL up to now? In this post, I’ll try to share what little knowledge I’ve gathered while working with several NoSQL technologies here at ICANS during the last couple of months.
First of all, what classifies a storage technology as NoSQL? From a pragmatic point of view, any storage that doesn’t support SQL is in fact a NoSQL storage. The greatest difference between No- and SQL is that the latter relies upon relations between datasets (you may have heard the term “relational database” before). These relations are at the very heart of the database systems we have grown accustomed to, and are absolutely necessary for SQL to work in the way we expect it to. Think about joins and correlated subqueries – relations are really what the language is all about. When in ancient times the elders of the internet decided to create databases, they recognized the fact that in reality many datasets are interconnected and even meaningless on their own. The classic textbook SQL example is that of an ordering system for a shop, with customer, product, and order entities, all interconnected by various relations. And in fact it is hard to imagine an order that is not in any way related to a customer or a product. Relational databases and SQL offer an ideal toolset to tackle problems of this sort.
However, these “relational” benefits come at a cost: you need to predefine the data structure before you actually have any data, and it becomes very difficult to change it later on. You also need to define very precisely what each data record will look like, and all records of a type will have the same data fields. All in all, this “schema” (as the data structure is commonly called in SQL circles) is very stiff and inflexible. Also, due to the great number of interconnections between datasets, many relational databases are hard to scale out across several machines.
People found that relational databases are often used in environments where the relational aspects of the data don’t really matter. So why put up with all the downsides of relational systems when you don’t need relations? This is where NoSQL comes in: the main idea is to use a set of tools that fit your specific requirements, and not to blindly make use of the toolbox you’ve become accustomed to.
Now, given that NoSQL mainly defines itself by lack of SQL support, what technologies are commonly labeled “NoSQL”? Document-based databases spring to mind, where the restriction of a fixed schema is lifted, and data records are very free in their structural definition, even within collections. Bigtable, Google’s distributed storage system, and the related MapReduce programming model free you from scalability restrictions by allowing you to shard data over thousands of machines. Or even Solr, ElasticSearch and the whole Lucene-based search engine lot, which allow for very performant execution of complex queries – all of these technologies allow you to bypass the restrictions of traditional relational databases. You could even go as far as to call your computer’s file system a NoSQL storage, but luckily, the respective marketing departments haven’t found that out yet :)
NoSQL for the SQL-minded Dev
When I first came across NoSQL-based databases here at ICANS, I found it incredibly hard to let go of the traditional way of thinking about web-app related problems. Looking at data, I would automatically create models of entities with relations, map out joins and define indexes for efficient queries – all without really checking whether all this was necessary. However, the resulting data models would not really fit a NoSQL storage, querying would seem awkward and the expected performance boosts simply did not set in. I think this is largely due to inexperience with the new tools in the shed, but also due to the fact that we’ve been working with a single tool for so long. Everything looks like a nail when all you’ve got is a hammer…
So what I’ll try to outline in the next couple of paragraphs are a few techniques I’ve come across while working with a set of NoSQL tools. Developing software systems is artisanship, and as an artisan, you need to master your tools!
MongoDB – Do’s and Don’ts
MongoDB is a document-based storage engine, which means it is designed for use cases where the storage item’s internal structure may vary and inter-object relationships are simple. It provides scalability via a transparent sharding mechanism, and allows for high-availability setups through replication (replica sets, the successor to plain master/slave replication).
But the real difference is the way you structure your data in such a database. When it comes to relations, you have two options to choose from in Mongo. The first is to reference objects by their IDs, similar to what you would do in an RDBMS. Be aware, though, that any joins you want will have to be done in your application: you cannot easily get referenced objects out of the DB with a single call!
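To make the “join in your application” part a bit more tangible, here is a minimal sketch in plain Python. The two collections are hypothetical in-memory stand-ins (a dict and a list), and names like `author_id` and `posts_with_authors` are made up for illustration; with a real driver each lookup would be a query against the database instead.

```python
# Hypothetical in-memory stand-ins for two MongoDB collections.
authors = {
    1: {"_id": 1, "name": "Alice"},
    2: {"_id": 2, "name": "Bob"},
}
posts = [
    {"_id": 10, "title": "Hello NoSQL", "author_id": 1},
    {"_id": 11, "title": "Bucketing", "author_id": 2},
    {"_id": 12, "title": "Redis tricks", "author_id": 1},
]


def posts_with_authors(posts, authors):
    """Resolve each post's author reference in application code."""
    # Collect the distinct referenced ids first, so that a real database
    # would need one extra lookup in total instead of one per post.
    wanted = {post["author_id"] for post in posts}
    found = {author_id: authors[author_id] for author_id in wanted if author_id in authors}
    # Attach the resolved author (or None for a dangling reference).
    return [{**post, "author": found.get(post["author_id"])} for post in posts]


joined = posts_with_authors(posts, authors)
for post in joined:
    print(post["title"], "by", post["author"]["name"])
```

The point is simply that the second lookup and the stitching together happen on your side of the wire, not in the database.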
The other option is embedded documents: think of a blog post and its attached comments. Mongo allows you to store all comments within the blog post itself, and is still able to run queries over all comments. This allows for a lot of flexibility, but comes at a cost: any fetch of a blog post will also load all attached comments.
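As a rough picture of what such a document looks like, here is a sketch using plain Python dicts in place of a real collection. The field names (`comments`, `author`, `text`) and the helper functions are assumptions of mine, not anything Mongo prescribes.

```python
# Hypothetical blog posts with their comments embedded in the post document.
posts = [
    {
        "_id": 10,
        "title": "Hello NoSQL",
        "comments": [
            {"author": "carol", "text": "Nice intro"},
            {"author": "dave", "text": "What about joins?"},
        ],
    },
    {
        "_id": 11,
        "title": "Redis tricks",
        "comments": [
            {"author": "carol", "text": "BRPOP is neat"},
        ],
    },
]


def titles_commented_by(posts, author):
    """Return the titles of posts with at least one comment by `author`."""
    return [
        post["title"]
        for post in posts
        if any(comment["author"] == author for comment in post["comments"])
    ]


def comment_count(post):
    """A fetched post always carries all of its comments along."""
    return len(post["comments"])


print(titles_commented_by(posts, "carol"))
print(comment_count(posts[0]))
```

Querying “across” comments is just a filter on the nested list here; the flip side is that `posts[0]` never arrives without its full comment list.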
A technique that can come in handy is bucketing: instead of storing all comments within a single document, you create documents that represent common query paths. For instance, you could create a so-called bucket for all comments attached to a blog post that were created on a specific day. This allows you to load small subsets of data that would otherwise come in large numbers.
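Here is a small sketch of how such day buckets could be built, again in plain Python without a database. The bucket id format (`"<post id>:<day>"`) and the function name `bucket_comments_by_day` are my own invention for this example.

```python
from collections import defaultdict
from datetime import datetime


def bucket_comments_by_day(post_id, comments):
    """Group a post's comments into one bucket document per calendar day."""
    per_day = defaultdict(list)
    for comment in comments:
        day = comment["created"].date().isoformat()
        per_day[day].append(comment)
    # One document per (post, day); the key doubles as a hypothetical _id.
    return {
        f"{post_id}:{day}": {"post_id": post_id, "day": day, "comments": items}
        for day, items in per_day.items()
    }


comments = [
    {"author": "carol", "text": "First!", "created": datetime(2012, 6, 13, 9, 0)},
    {"author": "dave", "text": "Second", "created": datetime(2012, 6, 13, 17, 30)},
    {"author": "erin", "text": "Late to the party", "created": datetime(2012, 6, 14, 8, 15)},
]
buckets = bucket_comments_by_day(10, comments)

# Loading "all comments on post 10 from June 13th" is now a single small fetch.
print(len(buckets["10:2012-06-13"]["comments"]))
```

Whether a day is the right bucket size depends entirely on how you query and how many comments you expect, so treat the granularity as something to tune.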
There is one thing to keep in mind when working with Mongo: even though it sounds like you don’t need to specify your data’s structure (and despite the official website saying otherwise), Mongo isn’t “schemaless”. It is true that you don’t have to provide a database structure or field definitions, but this doesn’t free you from thinking about your data structure before implementing it, and from coming up with typical use cases and query paths.
Redis – you can do that with a key value store?
Redis is an in-memory key-value store. In this regard it is very similar to memcached, and in fact you can use it in most cases where you would normally use memcached. But Redis provides a lot more features: first of all it allows for disk persistence, which makes it a viable tool for actual storage.
But the really cool parts of Redis aren’t immediately obvious. For instance lists: a simple datatype, allowing you to store values in an ordered fashion. But in combination with the “blocking right pop” command BRPOP, you can use lists to create simple job queues, with multiple workers attached to one queue.
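The pattern itself is easy to sketch without a Redis server. In the example below, Python’s standard `queue.Queue` is only a stand-in for the Redis list: `put` plays the role of a push onto the list and the blocking `get` plays the role of BRPOP. The worker names and the squaring “job” are made up for illustration.

```python
import queue
import threading

jobs = queue.Queue()            # stand-in for the Redis list holding the jobs
results = []
results_lock = threading.Lock()
STOP = None                     # sentinel telling a worker to shut down


def worker(name):
    while True:
        job = jobs.get()        # blocks until a job arrives, like BRPOP
        if job is STOP:
            break
        with results_lock:
            results.append((name, job * job))


# Several workers attached to the same queue.
workers = [threading.Thread(target=worker, args=(f"worker-{i}",)) for i in range(3)]
for thread in workers:
    thread.start()

for number in range(10):        # the producer pushes jobs onto the list
    jobs.put(number)
for _ in workers:               # one stop signal per worker
    jobs.put(STOP)
for thread in workers:
    thread.join()

print(sorted(value for _, value in results))
```

Each job is handed to exactly one worker, which is the property that makes the blocking pop useful for queues; which worker gets which job is not deterministic.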
Another often overlooked feature set is the bit operations on string values. In most high-level or interpreted languages you don’t often come across bit masks anymore, but they are incredibly memory efficient, and bit operations are generally extremely fast as they map directly to CPU instructions. The memory efficiency becomes especially interesting when you have to work with large amounts of data, for instance in analytics.
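For example, imagine you would like to track whether a user was logged in on a specific day. What you could do in Redis is set a bit on a date-specific key, at the offset that corresponds to the numeric user ID. For a user with the ID 1234 (picked here purely for illustration), that would be:
SETBIT loggedIn:2012-06-13 1234 1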
The BITCOUNT operation will give you the number of logged-in users per day.
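Redis string values can be up to 512 MB in size, which is 2^32 bits, so unless you have more than four billion users on your site, this will work smoothly. Keep in mind that Redis allocates the string up to the highest offset you set, so this works best with dense, numeric user IDs.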
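If you want to see the mechanics without a Redis server at hand, here is a small in-memory sketch. The `Bitmap` class is a hypothetical stand-in I wrote for this post, not part of any Redis client; it only mimics what SETBIT and BITCOUNT do, including the fact that Redis counts bit offsets from the most significant bit of each byte.

```python
class Bitmap:
    """Tiny in-memory imitation of a Redis string used as a bitmap."""

    def __init__(self):
        self._bytes = bytearray()

    def setbit(self, offset, value):
        byte_index, bit_index = divmod(offset, 8)
        # Grow the buffer with zero bytes up to the requested offset.
        if byte_index >= len(self._bytes):
            self._bytes.extend(b"\x00" * (byte_index + 1 - len(self._bytes)))
        mask = 0x80 >> bit_index  # offset 0 is the most significant bit
        if value:
            self._bytes[byte_index] |= mask
        else:
            self._bytes[byte_index] &= ~mask & 0xFF

    def bitcount(self):
        return sum(bin(byte).count("1") for byte in self._bytes)

    def size_in_bytes(self):
        return len(self._bytes)


logged_in = Bitmap()            # plays the role of loggedIn:2012-06-13
logged_in.setbit(1234, 1)       # user 1234 logged in
logged_in.setbit(7, 1)          # user 7 logged in
logged_in.setbit(1234, 1)       # a second login on the same day changes nothing

print(logged_in.bitcount())       # number of distinct users that day
print(logged_in.size_in_bytes())  # memory used for this day
```

One bit per user means a day’s worth of logins for user IDs up to one million fits into about 125,000 bytes.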
There are tons of tools out there that are labeled NoSQL. Each of them is tailored to a very specific need, and will give you great flexibility when used correctly. But you should always check whether the tool you are looking at will actually suit your needs. Working with NoSQL requires you to think about data in a different way than in the RDBMS world, but once you get the hang of it, it’s a lot of fun, trust me on that ;)