The argument against clearing the database between tests

Some reasons why you might not want to remove data from the database between automated tests: speed, correctness, data growth issues and parallelism advantages

I'm of the school of thought that most useful "unit"1 tests should involve the database. Consequently I don't mock out, fake or stub the database in tests that I write.

On every project, I had a small piece of test harness code that cleans the database between tests2:

def clean_session:
    for table in all_tables():
        db.session.execute("truncate %s;" %
    return db.session

The reason for this was that it seemed obvious that each test should start with a completely clean slate in order to make it a fair test. All other data should be deleted so that nothing from other tests that can get conflated and cause the test to go wrong somehow - either spuriously passing or failing.

Recently I've come to the conclusion that it can (but not always) make sense to run the tests with a dirty database - not only not cleaning between individual tests but also not cleaning between whole test runs - and keeping all the old test data around on a (near)3 permanent basis.

The required mentality change

Upturning the base assumption that "the database is clean" in a test requires a small adjustment to your mindset.

You can't write tests that assume that data it has created is the only thing present, such as this one:

def test_adding_a_user():
    # won't work, assumes the only user is the one added above
    assert db_session.query(User).count() == 1

Instead each test needs to be written to assert only on the basis of data it has created (or has not created). It should handle data created by others. A corrected example:

def test_adding_a_user_2(session):
    user = make_user()
    # this is safe, doesn't assume no other users exist
    assert db_session.query(User).get(user.user_id) is not None

This isn't an easy change to make in existing tests. The assumption of clean data is tricky to refactor away. There are reasons to consider it though.

The advantages

It's quicker

Tearing down data between tests or schemas between test runs is not free of (computational) charge.

The time taken to clean the database is usually proportional to the number of tables and while this cost is small to begin with it can grow over time4.

When you have a large number of tests this per-test overhead becomes a problem. My personal experience is that the problem starts to get serious when you have around one hundred tests. For the typical test suite tearing down data can take anywhere from one to ten percent of total runtime, depending on how efficiently it's done.

There are ways to be quicker, here are a few:

That all aside, the fact remains that tearing down the database is never as fast as not tearing it down.

More realistic data shape

One perennial problem with code that uses data is that when the volume of data grows the performance can change considerably: when there are just three rows in a table a logarithmic time operation (fast) is indistinguishable from a polynomial time operation (slow) 5.

It's a sad fact that the majority of tests and indeed most development time is spent with the database in an empty or nearly-empty state. As a result there is a loss of feedback. Without the daily experience of running with realistic data sets, detecting a slow data access pattern requires thoughtful analysis and/or experience.

When your tests don't clean the database your test database will slowly fill with data which is, although not a complete match with the shape of production, presumably along similar lines. Sometimes you will notice problem queries before finding out in production and without doing analysis.

Better yet, when your tests don't assume they're starting from a clean sheet, you can run your tests with a dump from production loaded into your database. This can help confirm that data access patterns you're using will work when the database, as a whole, is at production size.

More realistic data

Testing-by-example (as opposed to property-based testing) is based on the programmer coming up with examples of inputs and asserting that when given those inputs the program produces the right outputs - or at least - the right sort of outputs. There is also the implicit assertion (present in all automated tests) that the program does not raise an exception or crash.

Hopefully, at least some of the residual data your tests leave behind is highly contrived and contains lots of "bad" data and error cases. For example: users that have started the order process but who haven't gotten as far as entering their email address, users who have been marked as duplicates, users whose names contain semi-colons and so on.

Running your tests in the presence of all this realistic wonky data can help tease out real bugs. If one of your tests inserts customer names based on the Big List of Naughty Strings then you might find more bugs in other areas when other tests exercise different parts of the same system.

Of course - you shouldn't rely on such "atmospheric" bad data as an aid to correctness. Each time you find a new bug based on bad data left lying around from another test or loaded in from production a new, specific, test should be added for that condition. However having a load of crap data loaded can help uncover issues that might not otherwise have been uncovered at the development stage.

Test parallelism becomes a smaller step

Test parallelism is often discussed but my experience is that relatively few projects ever implement it, even though quite a few would benefit.

The problem is as follows. At first your tests are fast because there aren't many of them and so most teams put off parallel tests until "later". When "later" arrives, the build is now slow and test parallelism would help but it's usually not easy to adapt existing serial tests to run in parallel.

This is because tests typically manipulate state in odd ways and if each test assumes they are the only process manipulating state they tend to need complete isolation to be able to run in parallel - separate database instances, separate S3 test doubles - even separate filesystems occasionally. Complete isolation is expense, hassle and more moving parts (8 SQL databases for an 8-process test suite is no fun).

Tests written to assume the presence of irrelevant data are much easier to parallelise - usually it can all be done within one environment. This makes it easier to do and so much more likely to happen.

Disadvantages (and some mitigations)

Debugging is harder

When a test is failing for a reason that isn't understood the debugging method is simple: run only that test, in isolation, and narrow it down until the issue is understood.

This is more difficult when the database is full of background data - you aren't starting from a "clean" state. I don't think this is an insurmountable problem - in this case, just change to running with a clean database until you can diagnose the problem. Perhaps it's worth backing up the original dataset so you can refer to it later.

Running tests designed for "dirty" datasets with clean datasets is not a problem - it's easy to switch back to using a clean dataset. Going in the other direction is much harder.

Precondition clashes

Some tests have very different preconditions - one test might require that a certain type of data is absent while other tests will add this data (or will have added it in previous runs).

These types of tests are difficult to adapt but hopefully fairly rare. If they can't be reworked and really are essential they can be run against a second instance of the data store in question that is cleaned between test runs.

How far I've gotten

I haven't used this technique in many places and have only use it myself for a short time. The idea for it came to me when I saw a traditional, PHP-style, hand-crafted test database that wasn't being reset between tests (but which is reset between test runs). While I'm not a fan of that approach it got me thinking. I've since tried not tearing down the database on a side project, to some success.

I'm ready now to try this strategy in more places, bolstered by the knowledge that if tests are written for a dirty database it is very easy to change them to running with a clean database later (usually you don't have to do anything).


  1. Whatever "unit" means - the distinction between a unit test and an integration test is one for the philosophers at this point. 

  2. It's best to do this before each test because doing it afterward can interfere with debugging failing tests. If a test fails it's quite useful to be able to browse whatever database state it had when it failed. 

  3. Maybe not strictly forever - you might trim it occasionally. 

  4. One handy (postgres-specific?) tip: DELETE is faster than TRUNCATE for small tables.

    The real clean_session looks like this:

    sut: ModuleType # sut is system code
    def clean_session(app, config):
        for table in reversed(sut.Base.metadata.sorted_tables):
            sut.db.session.execute("delete from %s;" %
        return sut.db.session

  5. In fact, when the dataset is small, things can get very confusing: the polynomial operation can be faster. This is because with small datasets the constant factors in algorithms matter more and so slower algorithms can do better on small inputs.