S3 is files, but not a filesystem

"Deep" modules, mismatched interfaces - and why SAP is so painful

[Image: a box labelled "CAL'S MISC"]
My very own "object store"

Amazon S3 is the original cloud technology: it came out in 2006. "Objects" were popular at the time and S3 was labelled an "object store", but everyone really knows that S3 is for files. S3 is a cloud filesystem, not an object-whatever.

I think the idea that S3 is really "Amazon Cloud Filesystem" is a bit of a load-bearing fiction. It's sort of true: S3 can store files. It's also a very useful belief: it gets people to adopt S3, a fundamentally good technology, which they otherwise might not. But it's false: S3 is not a filesystem and can't stand in for one.

What filesystems are about, and module "depth"

The unix file API is pretty straightforward. There are just five basic functions. They don't take many arguments.

Here are (the Python versions of) these five basic functions:

# open a file (here for both reading and writing, in binary mode)
file = open(filepath, "r+b")  # returns a file object

# read from that file (moving the position forward)
file.read(100)  # returns up to 100 bytes

# write to that file (moving the position forward)
file.write(b"hello, world")

# move the position to byte 94
file.seek(94)

# close the file
file.close()

Well, perhaps I should add an asterisk: I am simplifying a bit. There are loads more calls than that. But still, those five calls are the irreducible nub of the file API. They're all you need to read and write files.

Those five functions handle a lot of concerns for you: buffering, caching, permissions, deciding where the bytes actually end up on the underlying device, and coping with hardware quirks like wear-levelling on SD cards.

Even though the file API handles all those concerns, it doesn't expose them to you. A narrow interface handling a large number of concerns - that makes the unix file API a "deep" module.

[Diagram: a deep module vs a shallow module]

Deep modules are great: you benefit from their features - like wear-levelling on SD cards - without bearing the psychic toll of thinking about any of it as you save a jpeg to your phone. Happy days.

But if the file API is "deep", what sorts of things are "shallow"?

A shallow module would have a relatively large API surface in proportion to what it's handling for you. One hint these days that a module is shallow is that the interface to it is YAML. YAML appears to be a mere markup language but in practice is a reusable syntax onto which almost any semantics can be plonked.

YAML often works as the "programming language of DevOps", and a programming language is about the widest interface possible. Examine your YAML micro-language closely. Does it offer a looping construct? If so, it's likely Turing complete.

But sometimes it is hard to package something up nicely with a bow on top. SQL ORMs are inherently a leaky abstraction. You can't use them without some understanding of SQL. So being shallow isn't inherently a criticism. Sometimes a shallow module is the best that can be done. But all else equal, deeper is better.

What S3 is about (it is deep too)

The unix file API was in place by the early 1970s. The interface has been retained, for compatibility, while the guts have been re-implemented many times.

But Amazon S3 does not re-implement the unix filesystem API.

It has a wholly different arrangement and the primitives are only partly compatible. Here's a brief description of the calls that are analogous to the above five basic unix calls:

# Read (part of) an object
GetObject(Bucket, Key, Range=None) # contents come back as the HTTP response body

# Write an (entire) object
PutObject(Bucket, Key) # contents are sent as the HTTP request body

# er, that's it!

Two functions versus five. That's right, the S3 API is simpler than the unix file API. There is one additional concept ("buckets") but I think when you net it out, S3's interface-to-functionality ratio is even better than the unix file API's.
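
For the curious, here's roughly what those two calls look like from Python via boto3, the usual AWS SDK. This is just a sketch - the bucket and key names are made up:

import boto3

s3 = boto3.client("s3")

# write an (entire) object
s3.put_object(Bucket="my-bucket", Key="hello.txt", Body=b"hello, world")

# read (part of) the object back - here, just the first five bytes
response = s3.get_object(Bucket="my-bucket", Key="hello.txt", Range="bytes=0-4")
response["Body"].read()  # returns b"hello"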

But something is missing. While you can read part of an object using the Range argument to GetObject, you can't overwrite part of one: overwrites have to replace the whole object.

That sounds minor but actually scopes S3 to a subset of the old use-cases for files.

Filesystem software, especially databases, can't be ported to Amazon S3

Databases of all kinds need a place to put their data. Generally, that place has ended up being various files on the filesystem. Postgres maintains two or three files per table, plus loads of others for bookkeeping. SQLite famously stores everything in a single file. MySQL, MongoDB, Elasticsearch - whatever - they all store data in files.

Crucially, these databases overwhelmingly rely on the ability to do partial overwrites. They store data in "pages" (e.g. 4 or 8 kilobytes long) in "heap" files, where writes are done page by page. There might be thousands of pages in a single file. Pages are overwritten as necessary to store whatever data is required. That means partial overwrites are absolutely essential.

[Diagram: a database heap file]
A heap file is full of pages (and empty slots). Pages are overwritten individually as necessary.
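
To make that concrete, here is a minimal sketch of the kind of in-place page write a database does constantly. The 8 KiB page size and the function name are just illustrative:

PAGE_SIZE = 8192  # e.g. Postgres' default page size

def overwrite_page(path, page_number, new_page):
    # overwrite exactly one page, in place, leaving the rest of the file alone -
    # trivial with the unix file API, impossible with the S3 API
    assert len(new_page) == PAGE_SIZE
    with open(path, "r+b") as file:
        file.seek(page_number * PAGE_SIZE)
        file.write(new_page)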

Some software projects start with a dream of storing their data in a 'simple' way by combining two well tested technologies: Amazon S3 and SQLite (or DuckDB). After all, what could be simpler and more straightforward? Sadly, they go together like oil and water.

When your SQLite database is kept in S3, each write suddenly becomes a total overwrite of the entire database. While S3 can do big writes fast, even it isn't fast enough to make that strategy work for any but the smallest datasets. And you're jettisoning all the transactional integrity that the database authors have painstakingly implemented: rewriting the whole database file on every change throws all of that away. On S3, the last write wins.
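
In case the pain isn't obvious, here's a sketch of what "updating" a SQLite database kept in S3 actually involves. The bucket, key and table names are invented:

import sqlite3

import boto3

s3 = boto3.client("s3")

# 1. download the entire database file
s3.download_file("my-bucket", "app.sqlite", "/tmp/app.sqlite")

# 2. make one tiny change locally
with sqlite3.connect("/tmp/app.sqlite") as conn:
    conn.execute("UPDATE users SET name = ? WHERE id = ?", ("Cal", 1))

# 3. upload the entire database file again - a full overwrite, and if anyone
#    else did the same in the meantime, their version is silently clobbered
s3.upload_file("/tmp/app.sqlite", "my-bucket", "app.sqlite")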

What S3 is good at and what it is bad at

The joy of S3 is that bandwidth ("speed") for reads and writes is extremely, extremely high. It's not hard to find examples online of people who have written to or read from S3 at over 10 gigabytes per second. In fact I once saturated a financial client's office network with a set of S3 writes.

But the lack of partial overwrites isn't the only problem. There are a few more.

S3 has no rename or move operation. Renaming is CopyObject and then DeleteObject. CopyObject takes time proportional to the size of the file(s). This comes up fairly often when someone has written a lot of files to the wrong place - moving them back is very slow.
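
A "move" therefore looks something like the following sketch (the bucket and key names are, again, invented):

import boto3

s3 = boto3.client("s3")

# there is no rename: copy to the new key (slow - proportional to object size)...
s3.copy_object(
    Bucket="my-bucket",
    CopySource={"Bucket": "my-bucket", "Key": "wrong/place/data.csv"},
    Key="right/place/data.csv",
)
# ...then delete the original
s3.delete_object(Bucket="my-bucket", Key="wrong/place/data.csv")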

And listing files is slow. While the joy of Amazon S3 is that you can read and write at extremely, extremely high bandwidths, listing out what is there is much, much slower. Slower than a slow local filesystem.
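
Listing is also paged, at most 1,000 keys per request, so enumerating a big bucket means a long series of round trips. A sketch, with the bucket and prefix made up:

import boto3

s3 = boto3.client("s3")

# each round trip returns at most 1,000 keys
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my-bucket", Prefix="logs/"):
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"])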

But S3 is much lower maintenance than a filesystem. You just name the bucket, name the key and the cloud elves will sort out everything else. This is worth a lot, because setting up backups, replicating offsite and provisioning (which, remember, is for IO ops as well as capacity) are pure drudge-work.

Module depth is even more important across organisations

In retrospect it is not a surprise that S3 was the first popular cloud API. If deep APIs are helpful in containing the complexity between different modules within a single system (like your computer), they are even more helpful in containing the complexity of an interaction between two different businesses, where the costs of interacting are so much higher.

Consider a converse example. Traditionally when one business wants to get its computers working with those of another they call it "integration". It is a byword for suffering. Imagine you are tasked with integrating some Big Enterprise software horror into your organisation. Something like SAP. Is SAP a deep module? No. The tragedy of SAP is that almost your entire organisation has to understand it. Then you have to reconcile it with everything you're doing. At all times. SAP integration projects are consequently expensive, massive and regularly fail.

There isn't much less complexity in S3 than there is in a SAP installation. Amazon named it the "Simple Storage Service" but the amount of complexity in S3 is pretty frightening. Queuing theory, IO contention, sharding - the list of problems just goes on and on, in addition to all the stuff I listed above that filesystems deal with. (And can you believe they do it all on-prem?)

The "simple" in S3 is a misnomer. S3 is not actually simple. It's deep.


Other notes

I don't mean to suggest in any way via this article that S3 is not overpriced for what it is. To rephrase a famous joke about hedge funds, it sometimes seems like The Cloud is a revenue model masquerading as a service model.

The concept of deep vs shallow modules comes from John Ousterhout's excellent book, A Philosophy of Software Design. The book is effectively a list of ideas on software design. Some are real hits with me, others not, but it is well worth reading overall. Credit to him for keeping it succinct.

A few databases are explicitly designed from the start to use the S3 API for storage. Snowflake was. So it's possible - but not transparently. But Snowflake is one of the few I'm aware of (and they made this decision very early, at least by 2016). If you know of others - let me know by email.

It isn't just databases that struggle on S3. Many file formats assume that you'll be able to seek around cheaply and are less performant on S3 than on disk. Zipfiles are a key example.
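
To illustrate with zipfiles: the zipfile module relies on seeking (the index lives at the end of the archive, and each member is read by jumping to its offset). On a local disk that's free; against S3 the simple approach is to download the whole object first. A sketch, with invented names:

import io
import zipfile

import boto3

s3 = boto3.client("s3")

# on a local disk, zipfile seeks straight to the one member it needs
with zipfile.ZipFile("archive.zip") as zf:
    report = zf.read("report.csv")

# against S3, the simple approach is to pull down the whole object,
# even if you only wanted one small member out of it
body = s3.get_object(Bucket="my-bucket", Key="archive.zip")["Body"].read()
with zipfile.ZipFile(io.BytesIO(body)) as zf:
    report = zf.read("report.csv")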

Other stuff about S3 that is a matter for regret

I genuinely like S3, so I did not want to create the wrong impression by including a laundry list of complaints in the middle of the post. But anyway, here are the other major problems I didn't mention above:

  1. The S3 API is only available as XML. JSON was around in 2006 but XML was still dominant and so it's probably not a surprise that Amazon picked XML originally. It is a surprise that Amazon never released a JSON version though - particularly when they made the switch from SOAP to REST, which would have been a good time.

  2. It's also a matter for regret that Amazon gave up on maintaining the XSD schema as this is one of the key benefits of XML for APIs. The canonical documentation is just a website now.

  3. Criminally, Amazon - like many cloud service providers - have never produced any kind of local test environment. In Python, the more diligent test with the moto library. moto is maintained by volunteers, which is weird given that it's a testing tool for a commercial offering.

  4. Amazon S3 does support checksums. For whatever reason they are not turned on by default. Amazon makes many claims about durability. I haven't heard of people having problems but equally: I've never seen these claims tested. I am at least a bit curious about these claims.

  5. For years Amazon S3 held one other trap for the unwary: eventual consistency. If you read a file, then overwrote it, you might read it back and find it hadn't changed yet. Particularly because it only happened sometimes, for short periods of time, this caused all manner of chaos. Other implementers of S3 didn't copy this property and a few years ago Amazon fixed it in their implementation.