Non-relational bear-traps

Problems people run into when they use a non-relational store (particularly when they don't have strong domain knowledge)

One of the most common objections many people have to relational databases is something along the lines of "SQL databases make you design your database up front".

People often feel (and I agree) that at the start of the project they know the least about what their future requirements will be - both in terms of the data they will want to keep and in their access patterns. After all, everyone has been in the situation where some extra, unforeseen part of the domain has popped up and you suddenly need to record it somewhere.

The confounding thing is that making a fundamental change to either an access pattern or data layout is in fact considerably harder in a non-relational database than it is in a relational database. There are a few key problems that crop up repeatedly.

Problem 1: Difficulty changing primary key

Changing the primary key of a table is a surprisingly common activity. In truth, it's pretty easy to pick something that initially looks like it will be unique but which later turns out to not be unique. For example, I was recently working on some software for tracking electricity meters and it turns out that individual meters can move from one property to another sometimes. That means you can't use a premises id to refer to a meter you need to use the meter serial number directly. A small lesson in the domain caused a material change to the schema.

Such situations are common, particularly because at the start of a project you often just need to have some minimal data layout for the prototype to work before a subject matter expert turns up and explains to you how things really work.

Unfortunately, in many non-relational database systems the primary key is "special". For example, Dynamo-style systems will use the primary key to decide which of the partitions the record will go on. In key-value stores the primary key is the only key and so the sole handle on a record. There are other patterns1.

This specialness tends to cause problems when you want to change that key. Often, changing the key entails a full rewrite of your data - to be re-organised under your new choice of key. It's usually difficult to do a full rewrite without planned downtime or severely degraded performance. In fact, many non-relational databases provide no built-in mechanism at all to change primary key - so you'll need to do that yourself.

So when working with an non-relational database, don't make a mistake with your primary key. You'll need to design that part up front.

Problem 2: Difficulty changing access pattern

Changing primary keys in a non-relational database is, if relatively common in new code-bases, a bit less common in established code-bases. One problem that does affect established code-bases is a change of access pattern - aka, a new query.

Let's imagine your database is backing an e-commerce website and that there is a new query in which are going to need to check whether the user has something in their shopping basket and if so how many items they have. This is so that the web front-end can display a basket with a number next to it in the navbar of your site. You are going to need to do this on every page load so it needs to be extremely fast. You only have a budget of 500ms to load the whole page and your query is just one of the things that need to be calculated as part of that.

Well, the basket item count probably isn't a part of the primary key of your user data table so you now need to have a secondary index. For a surprising number of non-relational stories this is either not directly possible (for example in key-value stores) or performs badly. For example, in many database systems your secondary index cannot be used to answer a query on it's own but instead is limited to pointing at a database page where the relevant row is held. This isn't what you really want - what you want is a "covering" index.

A covering index is an index that holds all the data required to answer a particular query. If you had an index that simply held the user id and the items in their basket that would allow you to resolve the query immediately upon looking up that user id in the index - without having to conduct a secondary (and typically slower) lookup in the main database.

Covering indexes are essential to fast database queries as doing an index lookup and then returning straight away is always faster than doing an index lookup and then a main database lookup before returning. The difference can often be an order of magnitude (or more if the index is small enough to be held in memory when the database isn't).

So when working with a non-relational database, make sure you know about the access patterns up front. If your database doesn't support secondary indices then you probably need to have just one main access pattern and make that access pattern your primary key.

Problem 3: Inability to manage the hacks added to fix problems 1 and 2

Usually the two issues above are in fact surmountable. They are surmounted with hacks.

Perhaps the primary key is successfully change via an application level function which writes to the existing database in the existing format and a new database in the desired format. This works ok when on the "happy path" but is fairly fragile: errors in writing to the second database necessitate cleanup/rollback of the change made in the first database (which, let's be honest, you probably won't code).

Perhaps the inability to add a secondary index is resolved by adding a cache somewhere which is updated when a new row is written. If all you have a key-value store, everything tends to look like a hash table.

The trouble is that while you can add a few of these hacks to the database they become hard to manage. Your application code now has to be a little clever about how the database is used in existing code and every new place in the code where you do anything to do with the database has to be clever too. This eventually becomes difficult to manage and error-prone.

One of the operative purposes of a relational database is that there is one way to lay out your data and you can do it fairly mechanically. You do not have to be a rocket scientist to layout your data in 3NF. Secondary school pupils do it in their Information Technology qualifications. Once you've done the mechanical work of normalising to 3NF you can then use controlled denormalisation (aka: secondary indices) to duplicate data for your various access patterns. A primary key is hopefully just a normal b-tree index with affectations.

Relational databases were created as a reaction against programmers having to know their primary keys or access patterns ahead of time. If you're worried that your understanding of the domain is limited you should be using a relational database - they intentionally provide the most flexibility.


Further reading

  1. One relational MySQL: in the InnoDB backend for MySQL the primary key is special. InnoDB uses an index-organised table. What that means is that effectively the rows are stored on disk in order of primary key. Primary key changes (at least historically) required a full table rewrite. I think that index-organised tables are a great option but a terrible default.