The troublesome "Active Record" pattern

March 2020

How inappropriate data access patterns massively slow down programs and how the same problems arise with RESTful APIs.

Most SQL databases make roughly the same implementation choices. MySQL, Oracle, DB2 and PostgreSQL all use heaps and b-trees to represent data tables. While database implementations are similar, application access patterns can differ.

A home truth of software development is that one pattern, Active Record, is completely unworkable.

For the avoidance of doubt, this article is about the Active Record-the-pattern and not about the Ruby library which later took the same name but which apparently no longer follows the pattern¹.

The Active Record pattern, described

"Object relational mappers" (ORMs) exist to bridge the gap between the programmers' friend (the object), and the database's primitive (the relation). The reasons for these differing models are as much cultural as functional: programmers like objects because they encapsulate the state of a single thing in a running program. Databases like relations because they better suit whole-dataset constraints and efficient access patterns for the entire dataset.

The simplest and most intuitive manner of translation is for programmers to just map a single class of objects onto a single table. Here's an indicative code snippet:

# Create our book object
book = Book(
    isbn="978-0099466031",
    title="The Name of the Rose",
    author="Umberto Eco")
# Record it to our database
book.save()

# ...later, get a book by ISBN
book1 = Book.get("978-0099466031")
# And print the title
print(book1.title)

The Active Record pattern of data access is marked by:

A whole-object basis
Access by key (mostly primary key)

Problems arise

This pattern of access is simple and intuitive but causes huge problems in even small projects. Retrieving whole rows is hugely wasteful when only part of the row is required to resolve a user request. The issue becomes pronounced when: retrieving sub-parts of the data (projection), consulting multiple tables (joins) or digesting the dataset (aggregation).

To get all ISBNs of all deceased Italian authors in the Active Record pattern:

isbns = []
# for each author
for author in Author.get(nationality="Italian", state="deceased"):
    # Get their books
    for book in Book.get(author=author.name):
        # And append the ISBN to our list
        isbns.append(book.isbn)

Notice that:

The entire author object for each dead Italian had to be retrieved
The entire book object for each book had to be retrieved
Nested for loops prelude much parallelism

The above code snippet read intermediate data about the author into memory as well as completely irrelevant non-ISBN data about the book. For n authors with m books each:

n*m queries were performed
- with all the overhead that implies²
n*m serialisations by the database
- probably in a fairly quick runtime like C
n*m deserialisations done by the program
- probably in quite a slow runtime

For a dataset with just a thousand or so authors the above nested for loop is significantly slower than the equivalent SQL:

SELECT isbn
FROM book
JOIN author using(author_name)
WHERE author.nationality = 'Italian'
AND author.state = 'deceased'

The above SQL:

resolves the same question in a single query
serialises/deserialises only the ISBNs
given sufficient indexes on author_name, nationality and state it will be able to avoid a direct read of the author table
- most people would have these on a schema by default
given a composite index (author_name and isbn) on book it would also be able to avoid a direct read of the author table
- not the kind of thing you put on a schema by default but easily added if it helps a common query

Given that each of these four improvements would probably increases performance by a factor of ten over the above Active Record style it wouldn't surprise me if the query was ten thousand times faster than the Active Record style access³.

The available mitigations don't work well in practice

Most practical Active Record style ORMs will have some method of more efficient access - but these methods will be idiosyncratic and they will breach the central abstractions of whole-object basis and access by key. Likely they will require increased understanding of the specific ORM being used and they will probably be a lot less appealing to most developers than the primary affordance of the library: the Active Record pattern.

One common response to this is to contend that Active Record is suitable as a first implementation with the plan of retreating to a more efficient mechanism later. Unfortunately Active Record access patterns break down almost immediately - typically as soon as there is a requirement to: "count all entries where...", "get all distinct..." or "return true if there are any entries such that..." will entail such a slow implementation that it will cause production problems.

The second problem is that once the Active Record pattern is adopted refactoring is not usually easy. Problem queries are only rarely presented as apparently as nested for loops right next to each other. Often single conceptual queries span abstraction boundaries, are in different parts of a tall call-stack or are just separated by program context⁴. Once you design your abstractions in (wilful) ignorance of what is an effective access pattern it's very hard to row back.

The viable alternative: first class queries

The only workable alternative is to make queries⁵ first class objects.

This doesn't always have to be done via SQL literals. It can be done by composing objects: many non-Active Record ORMs offer these "querybuilders".

Here's an example of creating an EXISTS query (normally very quick but hard to express in the Active Record pattern) with a querybuilder:

def wrote_in_english(session: DBSession, author_name: str) -> bool:
    """Returns whether the author wrote in English"""
    exists: bool = (
        # our subquery for a matching row - never read into program memory
        session.query(Author)
        .join(Book)
        .filter(Book.language=="english")
        .filter(Author.author_name==author_name)
    ).exists()  # just get a bool
    return exists  # and return it

The difference here is that you aren't working on either a whole-object basis or on a key-based access basis. Any element of data can be retrieved, without having also to retrieve intermediate or irrelevant data, and queries can be based on any criteria.

Also necessary: first class transactions

For read-only operations, first class queries are enough to allow for effective data access patterns. For write operations another first class object is required: the transaction or "database session". Here is some sample problem code:

book = Book.get(isbn="978-0099466031")
book.copies_in_stock -= 1
book.save()  # WRONG! irreparable data loss here

The problem in the above code is that any other program which incremented the sales number for that book between the get and the save had its addition implicitly deleted when our save was applied.

These "lost update" bugs are extremely common as it is not intuitive that if you plan to update a row based on what you've read from it you need to lock it prior to the read. In the above case the read is partly disguised by the decrement operator, -=.

The majority of Active Record style data access patterns in the wild are non-transactional, even when working on single columns in individual rows. The "lost update" problem is actually much wider than just SQL ORMs and really applies to any store which doesn't provide for locking data that has been read prior to an update. Programs naively reading and writing to key-value stores without a transaction/lock facility are often riddled with lost update bugs.

What's required is something like this:

with session:  # our transaction object
    book = book_repository.get(session, isbn="978-0099466031")  # row lock starts
    book.copies_in_stock -= 1
    book_repository.save(session, book)  # OK - row lock still held
    session.commit()  # explicit commits preferred for clarity

Note that the session is controlled explicitly by the code that is organising the update. This is the essential complexity of doing a data update and cannot be "abstracted" away (that is, "removed" to make things simpler).

The analogous problem with RESTful APIs

REST APIs are also, unfortunately, based on the same principles as the Active Record pattern. Instead of a whole-object basis, REST operates on a whole-"resource" basis. Access is again done by key: URL.

RESTful API design consequently suffers the same problems as Active Record ORMs. The problem is most severe when fundamentalist RESTafarians get involved, as they advocate using hypertext-style links between API endpoints - HATEOAS ("Hypermedia As The Engine Of Application State"). Under HATEOAS, applications "browse" the API, moving from link to link to accomplish their goal.

HATEOAS is of course totally inimical to effective access to data. The largest class of API clients is separated from the server by high latency network links: mobile phones. Latency over a good 3G link is commonly 300ms. The latency from continental Europe to California is an additional 140ms. This means that accessing multiple API endpoints to resolve a single user request can take seconds - an appreciable annoyance to any user.

Bandwidth to transfer irrelevant or intermediate data is at an even higher premium over the internet as instead of the gigabit networking common in datacentres the effective bandwidth from phones or home computers is often less than a megabyte per second.

When using REST inside the datacentre you don't have high latency links or bandwidth limitations. However the requirements for data manipulation are typically highest inside the datacentre and few datasets are so small that it's effective to repeatedly read large fractions of them into program memory. One telltale sign of the problem: services that spend a large percentage of their CPU time parsing and emitting JSON.

However, REST APIs do have one option for mitigation that ORMs don't: they can vary their endpoints to suit their clients' needs. Making endpoints coarser, to allow resolving a whole user request in a single round-trip, is a key tactic. Another tactic is offering alternative "views"⁶ of whole sets of data via different endpoints.

There has been a reaction against REST. One new alternative is GraphQL which is a query language and server that is able to combine REST endpoints into a single result - making the query a first class object. The second is a growing movement for RPC-style APIs - for example Slack's API is based on RPC instead of REST.

My recommendations

Take data access patterns very seriously! Poor patterns of access can make software unusable quickly. With bad abstractions it's easy to turn something that should be logarithmic time into polynomial time. This change in complexity class will stop your program working as the dataset grows.

From ORMs: demand first class queries and transactions. Avoid Active Record style access patterns whether in ORMs or elsewhere.

Use REST judiciously: only where the desired API surface really is just simple state exchange with either no prospect of lost update or where they don't matter. Always consider whether RPC is a better option.