January 29, 2025

Building an LSM Tree Storage Engine from Scratch in Go

Go · Databases · Systems

I wanted to understand how databases actually store data on disk. Not the SQL layer, not the query optimizer — the raw bytes. So I built an LSM tree from scratch.

What is an LSM Tree?

LSM stands for Log-Structured Merge Tree. It's the storage engine behind LevelDB, RocksDB, Cassandra, and many modern databases. The key insight: sequential writes are fast, random writes are slow.

Instead of updating data in place (like B-trees), LSM trees batch writes in memory and flush them sequentially to disk.

The Architecture

My implementation has four components (a sketch of how they fit together follows the list):

1. Write-Ahead Log (WAL) — Every write goes here first. If the process crashes, we replay the WAL to recover.

2. Memtable — An in-memory sorted map (I used a red-black tree). Writes accumulate here until the memtable reaches a size threshold.

3. SSTables — When the memtable is full, it flushes to disk as a Sorted String Table. These are immutable files with sorted key-value pairs.

4. Compaction — Background process that merges SSTables to reclaim space and improve read performance.
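
Roughly, one top-level type ties these together. This is an illustrative sketch, not the exact code from the repo:

// DB is the storage engine: a WAL for durability, a memtable for
// recent writes, and a stack of immutable SSTables on disk.
type DB struct {
    wal           *WAL       // append-only log, replayed on startup after a crash
    memtable      *Memtable  // in-memory sorted map (red-black tree)
    memtableLimit int        // flush threshold, in bytes
    sstables      []*SSTable // immutable on-disk tables, oldest first
}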

The Write Path

Put(key, value)
    │
    ▼
Write to WAL (durability)
    │
    ▼
Insert into Memtable
    │
    ▼
Memtable full? ──yes──► Flush to SSTable

The WAL is append-only, so writes are sequential and fast. The memtable keeps everything sorted in memory.
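
In code, Put is only a few steps. A simplified sketch (error handling trimmed; memtableLimit, flushMemtable, and NewMemtable are illustrative names):

// Put records the write in the WAL first, then updates the memtable.
// If the memtable has grown past the threshold, it is flushed to a new
// SSTable and replaced with an empty one.
func (db *DB) Put(key, value []byte) error {
    // 1. Durability: append to the log before touching anything else.
    if err := db.wal.Append(key, value); err != nil {
        return err
    }
    // 2. Visibility: insert into the sorted in-memory map.
    db.memtable.Put(key, value)

    // 3. Flush when the memtable is full.
    if db.memtable.Size() >= db.memtableLimit {
        sst, err := db.flushMemtable() // write a sorted, immutable file
        if err != nil {
            return err
        }
        db.sstables = append(db.sstables, sst)
        db.memtable = NewMemtable()
    }
    return nil
}

Flushing synchronously keeps the sketch single-threaded, which matches the rest of this post; a production engine would hand the frozen memtable to a background goroutine instead.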

The Read Path

Reading is trickier. The data might be in the memtable or any SSTable:

Get(key)
    │
    ▼
Check Memtable ──found──► return
    │
    not found
    ▼
Check SSTables (newest to oldest) ──found──► return
    │
    not found
    ▼
return nil

Each SSTable has an index block, so we binary search rather than scanning the whole file.
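
Sketched in Go, with each SSTable's index block loaded into memory as sorted (first key, offset) pairs. readBlock is an assumed helper that decodes a single data block:

import (
    "bytes"
    "os"
    "sort"
)

// SSTable is one immutable on-disk file plus its index block, which is
// small enough to keep in memory.
type SSTable struct {
    file  *os.File
    index []indexEntry // sorted by key
}

type indexEntry struct {
    key    []byte // first key in the data block
    offset int64  // byte offset of the block in the file
}

// Get checks the memtable, then SSTables from newest to oldest.
func (db *DB) Get(key []byte) ([]byte, bool) {
    if v, ok := db.memtable.Get(key); ok {
        return v, true
    }
    for i := len(db.sstables) - 1; i >= 0; i-- { // newest flush wins
        if v, ok := db.sstables[i].Get(key); ok {
            return v, true
        }
    }
    return nil, false
}

// Get on one SSTable: binary-search the index for the last block whose
// first key is <= the target, then read and scan only that block.
func (t *SSTable) Get(key []byte) ([]byte, bool) {
    pos := sort.Search(len(t.index), func(i int) bool {
        return bytes.Compare(t.index[i].key, key) > 0
    }) - 1
    if pos < 0 {
        return nil, false // key sorts before everything in this table
    }
    return t.readBlock(t.index[pos].offset, key)
}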

SSTable Format

I kept it simple:

[data block][data block][...][index block][footer]

Data block: [key_len][key][value_len][value]...
Index block: [key][offset] pairs for binary search
Footer: offset to index block
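
Writing one data-block entry is just a pair of length-prefixed writes. A sketch assuming 4-byte little-endian length prefixes (varints would be more compact):

import (
    "encoding/binary"
    "io"
)

// writeEntry appends one [key_len][key][value_len][value] record to w.
func writeEntry(w io.Writer, key, value []byte) error {
    var lenBuf [4]byte
    binary.LittleEndian.PutUint32(lenBuf[:], uint32(len(key)))
    if _, err := w.Write(lenBuf[:]); err != nil {
        return err
    }
    if _, err := w.Write(key); err != nil {
        return err
    }
    binary.LittleEndian.PutUint32(lenBuf[:], uint32(len(value)))
    if _, err := w.Write(lenBuf[:]); err != nil {
        return err
    }
    _, err := w.Write(value)
    return err
}

The index block and footer follow the same idea: while flushing, record the first key and byte offset of each data block, append those pairs after the data, and finish with a fixed-size footer holding the index block's offset.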

Compaction

Without compaction, you'd have thousands of SSTables and reads would be slow. Compaction merges multiple SSTables into one, removing deleted keys and old versions.

I implemented simple size-tiered compaction: when you have N SSTables of similar size, merge them into one larger SSTable.
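
The merge itself is an ordinary merge of sorted runs: the newer table wins on duplicate keys, and tombstones are dropped on the way out. A simplified two-way version (the real compaction is a k-way merge across every table in the tier):

import "bytes"

// entry is one key-value pair streamed out of an SSTable in key order.
type entry struct {
    key, value []byte
    deleted    bool // tombstone left behind by a delete
}

// mergeTwo merges two sorted runs; on equal keys the newer run wins.
// Dropping tombstones is only safe when no table outside this merge
// could still hold an older version of the key.
func mergeTwo(newer, older []entry) []entry {
    out := make([]entry, 0, len(newer)+len(older))
    i, j := 0, 0
    for i < len(newer) || j < len(older) {
        var e entry
        switch {
        case j == len(older): // older run exhausted
            e, i = newer[i], i+1
        case i == len(newer): // newer run exhausted
            e, j = older[j], j+1
        case bytes.Compare(newer[i].key, older[j].key) <= 0:
            e = newer[i]
            i++
            if bytes.Equal(e.key, older[j].key) {
                j++ // same key: discard the older version
            }
        default:
            e, j = older[j], j+1
        }
        if !e.deleted {
            out = append(out, e) // deleted keys finally disappear here
        }
    }
    return out
}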

What I Learned

Durability is hard. Appending to the WAL isn't enough: until you fsync, the bytes sit in the OS page cache, and a crash or power failure can lose writes that looked successful. I learned the difference between "written" and "durable."
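
The fix is to end every append with Sync. This sketch assumes the WAL wraps a plain *os.File and reuses the writeEntry encoder from the SSTable section:

import "os"

// WAL is an append-only log file (illustrative sketch).
type WAL struct {
    file *os.File
}

// Append encodes one record and forces it to stable storage. Without
// Sync (fsync under the hood), the bytes can sit in the OS page cache
// and vanish on power loss even though Write returned no error.
func (w *WAL) Append(key, value []byte) error {
    if err := writeEntry(w.file, key, value); err != nil {
        return err
    }
    return w.file.Sync() // the line that turns "written" into "durable"
}

Syncing once per batch of records instead of once per record is the usual way to claw back throughput, at the cost of a small window of writes that can be lost.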

File formats matter. Spent hours debugging because I forgot to track byte offsets correctly. Off-by-one errors in binary formats are painful.

Reads vs writes is a tradeoff. LSM trees optimize for writes. B-trees optimize for reads. There's no free lunch.

What I Skipped

Production LSM trees have bloom filters (skip SSTables that definitely don't have a key), compression, leveled compaction, and concurrent access. Mine is single-threaded and uncompressed.

But that's fine. The goal was understanding the core pattern, not building a production database.

The Code

The full implementation is on GitHub. About 1500 lines of Go with no dependencies beyond the standard library.

If you want to understand database internals, build one. There's no substitute for writing the bytes yourself.

Thanks for reading. Feel free to reach out via email with any questions.