Sceptre - I Built a Database So I Could Finally Understand One

00 - the itch

There is a list. Most programmers carry it quietly, somewhere between imposter syndrome and genuine curiosity. It doesn't live on a resume. It lives in the back of your head.

Build a compiler. Write an OS. Implement a database from scratch.

These aren't things you build to get a job. They're things you build because a question keeps nagging you - do I actually understand what's happening underneath all this abstraction? Or have I just been calling functions this whole time?

I'd been writing code for a while now and am pretty comfortable. Comfortable enough, actually, that comfortable started feeling like a warning sign. I could wire up a Postgres instance, write migrations, tune an index, get things working. But if you asked me what an index actually was - not the concept, but the physical thing on disk, as bytes - I'd have given you a decent-sounding answer with a hollow center.

That hollow center kept me up at night more than I'd like to admit.

So I built Sceptre. A small embedded relational database, written in Go, from the storage layer up to SQL queries. Not to replace SQLite. Not to ship anywhere. Just to make the hollow center solid.

Repo: vennictus/sceptre

This is that story. Let's go for a walk.

01 - where it started

A book, and then a question

It started with a book - Build Your Own Database From Scratch in Go. Genuinely one of the best technical things I've read. Not because it hands you a finished database, but because it hands you the vocabulary. Pages. B+ trees. Key comparison. The stuff you need to even begin thinking clearly about how storage works.

I followed along. I built what the book asked me to build. And then I finished it, closed the laptop, and sat with the familiar post-project question: now what?

The easy answer was to move on. Feel smart for a few days, add it to GitHub, continue with life. But something about that felt wrong. I had a foundation. And I kept thinking - what if I kept going and made it mine? Not toward performance. Not toward production readiness. Toward something different: transparency.

What if I built a database you could actually look inside? Where query plans would tell you what they were going to do before they did it. Where you could inspect raw pages on disk and understand what you were seeing. Where crashing the database at different points had predictable, explainable recovery behavior.

That became the design identity of Sceptre. Transparency over performance. Inspectability over speed. One call, and it shapes every decision that follows.

I didn't try to build a SQLite replacement. I tried to build a database whose moving parts I could inspect and explain.

Enough context. Let's get into how it actually works - starting from the very bottom.

02 - the foundation

What even is a database file?

Before trees, indexes, queries - there's a more basic question worth sitting with: what is a database file? Like, literally, what is it?

The naive answer is: it's bytes. You append a row when someone inserts. You scan through when someone queries. Simple enough.

This works fine for a toy. But the moment your use case grows even slightly - you need to jump directly to one row without reading everything before it, or you delete something and want to reuse that space, or your process crashes mid-write and you need to recover cleanly - the "just bytes" model falls apart. There's no address system. No way to say "give me the thing at position X." No way to reclaim space without reorganizing everything.

The solution databases landed on is splitting the file into fixed-size pages. In Sceptre, each page is 4096 bytes - the same size as a memory page on most systems, not a coincidence. Each page gets a stable ID: page 0, page 1, page 2, and so on. The whole file becomes a numbered sequence of equally-sized slots.

Think of it like a notebook where every page is the same size. You can say "page 7" and know exactly where in the notebook that is - no scanning required. You can tear out a page and hand it to someone, and they can read it without the rest of the book. That stability - every page addressable, every page self-contained - is the foundation everything else rests on.

an example database file as a numbered sequence of fixed-size pages

an example database file as a numbered sequence of fixed-size pages

Pages 0 and 1 are reserved before anything else gets a chance to claim space. They're not for tree data. They're for something that has to come first - the root of the entire database. Without them, you have a pile of addressable bytes with no way in.

Which raises the obvious question: why two of them?

03 - durability begins here

Two meta pages, and why one isn't enough

Before I explain the two meta pages, let me explain what a meta page even does, because it's carrying a lot of weight here.

The meta page is the root of everything. It contains the page ID of the B+ tree root, the page ID of the freelist, how many pages exist in the file, the format version, and a checksum. From just that small header, you can reconstruct the entire database. Follow the root page ID to the tree, follow the freelist page ID to the freelist. Everything is reachable.

Which means if you lose the meta page - or worse, if it gets partially overwritten - you lose the database. Not the data bytes, they might still be sitting on disk somewhere, but any ability to find them.

Imagine you only had one. You start writing an updated meta page - new root, updated page count - and your process crashes halfway through. Now page 0 is a partial write. The old data is gone. The new data is incomplete. There's no going back.

Two slots solve this with one elegant rule: always write to the slot that isn't currently active. Every write gets a higher generation number. On startup, both slots are read, both are validated, and whichever valid one has the higher generation wins.

commit sequence

// step by step
1. write new tree pages to disk
2. write new freelist pages to disk
3. fsync          ← make sure those pages are actually durable
 
4. publish new meta to the inactive slot and fsync it  ← the commit point
 
// crash before line 4  →  old meta wins, old version intact
// crash after line 4 succeeds  →  new meta wins, new version visible
// torn/corrupt meta write      →  checksum rejects it, old meta wins

That's the whole trick. The new data pages can exist on disk before the meta page is updated - they're just unreachable, invisible to anyone opening the database. The moment the new meta page is written, the new version becomes real. Before that moment, the old version is untouched.

The crash boundary is completely unambiguous. No partial state to reason about. Either the commit finished or it didn't, and the answer is always clear.

This was the first deep thing Sceptre taught me: durability isn't something you add at the end. It's a property of the write protocol. You design the sequence of writes around crash points from the beginning, or you don't have durability at all.

04 - structure

The tree that holds everything

So we have pages. We have a root page that anchors them. The question now is: what data structure actually lives on those pages? How do you store keys and values in a way that supports looking up a specific row, scanning a range of rows, iterating everything in order?

You could reach for a hash map. Hash maps are fast for point lookups - "give me the row where id equals 42." But the moment someone asks "give me every row where age is between 25 and 35," a hash map has no answer. It has to look at everything. Order doesn't exist in a hash map, and a database needs order constantly.

The data structure databases landed on is the B+ tree. It keeps keys sorted at all times. A point lookup descends from the root to the right leaf. A range scan seeks to the start of the range and walks forward - you never touch anything outside it.

In Sceptre, the B+ tree is purely a byte-key store. It knows nothing about tables, columns, types, or SQL. It only knows one thing: here is a key, here is a value, keep them sorted. That's the entire contract.

internal nodes guide the search · leaves hold the actual data

internal nodes guide the search · leaves hold the actual data

Each layer is completely ignorant of everything above it. The tree doesn't care what the bytes represent. The table layer doesn't care how the tree is arranged on disk. That clean separation is what makes the whole thing navigable - you reason about one layer at a time.

layer contract

// each layer only knows what it needs to
B+ tree:   "here is a byte key and a byte value, keep it sorted"
table:     "this byte key represents row users(id=42)"
SQL:       "this row satisfies WHERE age = 31"

But here's the thing the tree doesn't know: how to compare those bytes meaningfully. It compares them as raw unsigned bytes, left to right. And that creates a problem the next layer has to solve.

05 - the subtle part

Making bytes sort like real values

The B+ tree compares raw bytes. That's the rule. Which means if you hand it a signed 64-bit integer encoded naively, the sort order will be wrong - silently, with no error anywhere.

Here's why. Computers store signed integers in two's complement. Negative numbers have their highest bit set to 1. Which means in a raw unsigned byte comparison, negative numbers look "larger" than positive ones. -1, encoded naively, would sort after every positive number. Your integers would be completely out of order and you'd never know unless you checked.

The fix is one operation: flip the sign bit, then write big-endian bytes.

encoding.go

// turning a signed int64 into sortable bytes
encoded = uint64(value) XOR (1 << 63)
 
// before: -1 looks "huge" in unsigned space
// after:  -1 maps near-zero, positives map high
// the full int64 range now sorts correctly as raw bytes

One XOR, and the tree compares integers correctly without knowing anything about integers.

Strings have a different problem. Indexes often cover multiple columns - (city, age), say. Which means you need to concatenate two encoded values into one key. But if you just smash them together, how does anything know where the city ends and the age begins? A city value that happens to contain bytes that look like an age value creates ambiguity.

string encoding

// encoding a byte string so it's unambiguous in a composite key
literal 0x00  →  0x01 0x01     ← escaped
literal 0x01  →  0x01 0x02     ← escaped
end of value  →  0x00          ← terminator
 
// 0x00 now unambiguously means "this value ends here"
// concatenate as many values as you want, decode left to right

This is the subtlest lesson in the whole project. The B+ tree's byte comparison isn't a limitation - it's the foundation you build ordering on top of. Get the encoding wrong, and your database silently returns results in the wrong order. No crash, no error, just wrong data.

Now that the encoding is solid, we can talk about what a row key actually looks like.

06 - rows as keys

A row is just a carefully designed key

We have a tree that stores sorted byte keys. We have an encoding layer that turns typed values into those bytes correctly. Now the question is: what does a row actually look like inside the tree?

Every row gets a key built from three things, concatenated together:

row key

// row key structure
0x10                    ← "this is a row, not an index entry"
<4-byte table prefix>  ← "this row belongs to the users table"
<encoded primary key>  ← "specifically, id = 42"

The table prefix is a unique numeric ID assigned to each table when it's created. Every row from the same table shares that prefix, which means they're physically grouped together in the B+ tree. A full table scan becomes: seek to the start of the prefix, read forward until the prefix changes. You never accidentally read another table's rows.

The primary key columns live in the key rather than the value because they define both the row's identity and its sort order. Every other column - name, age, city, whatever - lives in the value payload. When you decode a row, you pull the primary key out of the key bytes and everything else out of the value bytes, then combine them.

This makes primary key lookups nearly free. Encode the target ID, construct the row key, do a single point lookup in the tree. One operation, no scan. The structure of the key is the optimization.

But users don't always query by primary key. They want rows where age equals 31, or city equals Delhi, or email matches a specific address. For those queries, without something extra, you're back to scanning every row in the table. That extra thing is the secondary index.

07 - looking up by anything else

Secondary indexes: the key inside the key

A secondary index on age sounds simple - just store the age value as a key, and point it at the row. But there's an immediate problem: multiple rows can have the same age. If you use age alone as a key, you get collisions. The second row with age 31 overwrites the first.

The fix is to include the primary key in the index key as well:

index key

// secondary index key structure
0x11                      ← "this is an index entry"
<4-byte index prefix>    ← "this belongs to users_age"
<encoded age value>      ← the thing we're indexing
<encoded primary key>   ← makes every entry unique + points back to the row
 
// the value for an index entry is empty
// the key itself contains everything needed

The primary key suffix solves two problems at once. It makes every index entry unique - two rows with age 31 get distinct keys because their primary keys differ. And it tells the engine exactly how to fetch the base row once the index lookup finds a match.

An index lookup on age = 31 works like this: encode 31, construct the prefix, seek the iterator to the first matching key, walk forward while keys stay inside the prefix, decode the primary key from each entry, fetch the base row. That's the whole operation.

One more thing about indexes: when you create one on a table that already has rows, you can't just update the schema. The index has to be backfilled - every existing row needs an index entry created for it, in the same atomic commit that updates the schema. Otherwise you'd have an index that only covers rows inserted after it was created. The consistency checker later validates this: if the schema says an index exists, every single row must have a corresponding entry.

Building this made me realise how much of what a database does is just careful key design. The tree is almost incidental - the real work is deciding what the keys look like.

We now have rows, we have indexes, we have a tree that keeps them sorted. What we don't have yet is any of this surviving a restart. That's next.

08 - making it permanent

Writes that survive a restart

Up to this point, we've described the shape of the data. Now comes the harder question: how does a new shape become permanent?

Making writes durable is where things get genuinely interesting, and where the two meta page trick from section 03 pays off fully.

Sceptre uses copy-on-write commits. During a commit, you never overwrite the active tree pages. Instead, you write a completely fresh set of pages for the new state, then atomically switch the meta page to point at them.

commit protocol

// what a commit actually does
1. apply all mutations to the in-memory tree
2. snapshot the new tree state
3. remap snapshot onto fresh page IDs
4. write new tree pages to disk
5. write new freelist pages to disk
6. fsync         ← "please actually persist these"
 
7. publish new meta page  ← the moment the new version becomes real
 
// crash before step 7: old meta still valid, old version visible
// crash after step 7:  new meta is durable, new version visible

Before step 7, the new pages exist on disk but nobody can find them. The old meta page is still active, pointing at the old tree. The new pages are ghosts - real bytes on disk, completely invisible.

Step 7 flips the switch. If it writes successfully, the new version is the database. If the process crashes mid-write, the old meta page survives and wins on reopen. There is no in-between state to deal with.

For this design, I don't need a write-ahead log to recover the latest committed root. You open the database, read both meta pages, pick the valid one with the higher generation, and you're done. The commit either happened or it didn't.

The cost is real - every commit writes a fresh remapped tree, not just the changed pages. That's more I/O than an in-place update strategy. For a learning project, that's a fine tradeoff. The mental model you get in exchange is worth every extra write.

But copy-on-write creates a different problem. Every commit retires the old tree pages. What happens to them?

09 - nothing is wasted

The freelist: giving old pages a second life

After a commit, the previous version's tree pages are no longer reachable from the new meta root. They're just sitting there - taking up space, contributing nothing. If you never reclaimed them, the file would grow with every single commit, forever.

The freelist is the solution. It's a linked list of pages stored inside the database file itself, tracking page IDs that are safe to reuse. When old tree and freelist pages become unreachable after a commit, they get retired into the next freelist so future commits can reclaim them.

The allocator has one priority rule: use a page from the freelist if one is available, otherwise extend the file by one page at the tail. Reusable pages always come first.

allocation

// what happens to old pages after a commit
new commit finishes
  → old tree pages become unreachable
  → old freelist pages become unreachable
  → all of them retire into the next freelist
 
next commit needs a page
  → check freelist first
  → if empty, extend file tail

In practice, once a database reaches a stable size - inserts and deletes roughly balanced - the file stops growing. Every delete frees pages that future inserts reuse. The size stabilizes.

There's one invariant the consistency checker enforces hard: no page can appear in both the reachable tree and the freelist at the same time. That would mean the same physical page is considered live data and also up for grabs. The next commit would silently overwrite live data. Finding that in production would not be a good day.

With durability and space reuse in place, we have a working database. Rows go in, rows survive restarts, space gets reclaimed. What we don't have yet is any intelligence about how to retrieve those rows efficiently. That's the planner's job.

10 - queries

The planner: deciding how much to scan

Every SELECT has to answer the same question: how do I get the candidate rows? The naive answer is always available - scan the entire table, check every row against the WHERE clause, return the ones that match. It works. It's just slow when the table is large and you only want a few rows.

The planner's job is to find a better answer when one exists. Sceptre's planner is deliberately simple - no cost estimation, no statistics, no query trees. It recognises patterns and maps them to access paths.

access paths

// four options, in priority order
primary_key_lookup      ← equality on all PK columns → single point lookup
secondary_index_lookup  ← equality on all indexed columns → index seek
primary_key_range       ← range on single-column PK → bounded scan
table_scan              ← everything else → read the whole table

The conditions that match the access path get consumed. Whatever's left becomes the residual filter - evaluated row by row after candidates are loaded.

the planner splits the WHERE clause - access path conditions vs residual filter

the planner splits the WHERE clause - access path conditions vs residual filter

The question you want to ask of any query plan is always: how many rows does the access path force me to examine before the filter even runs? The lower that number, the better the plan.

The best way to see this in action is to actually measure it - which is exactly what the next layer is for.

11 - the payoff

Seeing the work

This is the section the whole project was pointed at.

Every concept from the previous sections - pages, the meta page, the B+ tree, encoding, indexes, copy-on-write, the planner - these all exist in production databases. PostgreSQL does all of this and more. But in PostgreSQL, you can't easily watch them work. You submit a query, rows come back, the machinery is somewhere below you doing something opaque.

Sceptre is built to show the machinery. You can inspect the raw meta page state. You can walk every physical page in the file and see what type it is, what it holds, whether it's reachable or free. And you can run queries that show you exactly what happened.

explain-analyze

// without an index on age
$ sceptre explain-analyze app.db "select * from users where age = 31"
 
plan:          table_scan
rows scanned:  10000
rows matched:  12
rows returned: 12
 
// create the index, run the same query
$ sceptre explain-analyze app.db "select * from users where age = 31"
 
plan:          secondary_index_lookup (users_age)
rows scanned:  12
rows matched:  12
rows returned: 12

Ten thousand rows scanned versus twelve. The database is doing 830 times less work and you can see the exact number. Not an estimate. Not a relative improvement. The actual rows touched.

This is the thing I wanted from the beginning. Not just that the index helps - I already knew that in the abstract - but being able to watch it happen with real numbers attached. The difference between knowing something conceptually and watching it play out in front of you is larger than I expected.

The consistency checker sits alongside this. It validates every invariant the database is supposed to maintain - unique table prefixes, every indexed row has an index entry, every index entry points to a real row, no page appears in both the tree and the freelist, all page IDs are within the file's bounds. If anything is wrong, it tells you what and why.

A good database isn't just about writing data correctly. It's about knowing what "correct" means and being able to check it.

The crash test suite is the last piece. It injects failures at each stage of the commit protocol - after pages are written, after they're synced, after the meta page is published - then reopens the database and checks whether the expected version is visible. The results always match exactly what the copy-on-write design predicts. Before meta publish, old version. After meta publish, new version. Every time, deterministically.

Running those tests for the first time and watching them pass felt different from passing a unit test. It felt like proof. Like the design was actually correct, not just probably correct.

Sceptre deliberately stops short of being a production database. It does not have joins, MVCC, a WAL, a cost-based optimizer, or SQL transaction statements. That is not the point. The point was to build one vertical slice deeply enough that every layer became inspectable.

12 - what it actually changed

The hollow center, filled

I can list the concrete things I now understand. Why B+ trees won over hash maps for storage engines. Why pages are 4096 bytes. What fsync is actually buying you and why skipping it is a lie you're telling yourself about durability. What an index scan does at the byte level. Why two's complement encoding breaks sort order and how one XOR fixes it.

But the more useful change is harder to describe.

When I add an index to a query now, I'm not hoping it helps. I understand the mechanism - the access path that gets chosen, the rows that stop being scanned, the encoding that makes the seek possible. When I think about crash recovery, I'm not trusting the database to handle it. I understand the write protocol that makes it work, and I know what would break if you violated it.

The abstraction became load-bearing. That's the only way I know how to say it. Before, Postgres was a black box I trusted. Now it's a set of design decisions I understand - decisions I happen to know work, because I made similar ones myself and watched them succeed and fail.

The core lesson of Sceptre, if I had to compress it into one thing, is that a database is a series of translations:

the chain

SQL statement
  → parsed into an AST
  → turned into a query plan
  → mapped to table operations
  → encoded as byte keys and values
  → written into a B+ tree
  → snapshotted onto disk pages
  → made permanent by a meta page publish

Each translation has rules. Each rule creates invariants. When something goes wrong in a real database - wrong results, corruption, a recovery that didn't recover - it's because one of those rules got violated somewhere. Understanding the rules is the only way to know where to look.

I still use Postgres. I still call standard libraries. I'm not writing storage engines for production. But something changed in how I sit with these systems. They're not opaque anymore. They're just layers, each doing one thing, each earnable from first principles if you're willing to build them yourself.

That's what was on the list all along. Not the database. The understanding you can't unfeel once you have it.