Building a database (part 1): Memory tables and writing to disk

In almost every project I’ve built, a database was a central component. I’d often read about what specific database engines excelled at and where they fell short, but I never fully understood why. I think the best way to learn more about a topic is to build something myself.

In this and the following posts I will implement a simple database engine from scratch. My intent is to document the process, which means each new post will add incremental improvements. In this first post, I’ll go over database basics and implement some building blocks.

This post is the first part of building a simple database engine from scratch. It covers:

  • Requirements for a simple database
  • How to store data on disk
  • Key components of a database engine (WAL, Memtable, SSTables)
  • First steps in implementation (in Go)

All the code from this part can be found here.

In follow-up posts I will optimize the database and expand its functionality.

Requirements

Before starting with anything, it is important to determine the database requirements. Clear requirements can direct the design while also preventing unnecessary work.

  • Data written must be persistent (not lost on system restarts).
  • The database must be durable (data must never be lost once the database reports it as inserted, even if the system crashes during a write).
  • Insertion must be atomic.
  • There should be a basic schema mechanism to define the structure of the data.
  • There should be a way of interfacing with the database (inserting, reading, updating and deleting data).
  • There should be a primary index to speed up finding data.

Out of scope:

  • Relations: this won’t be a relational database.

Writing data to disk

A core part of a database is being able to store data persistently (on disk) and being able to search it.

There are many ways of achieving this. Modern databases often use one of two data structures to store data: B-Trees or SSTables. B-Trees are very good at searching through data quickly, but updating a B-Tree stored on disk is inefficient. SSTables excel at fast writes but are less optimized for reading; however, read speeds can be improved with optimization techniques like bloom filters and indexes. SSTables also lend themselves better to distributed systems, as they are easier to reason about. They are used to power popular databases like CassandraDB. B-Trees are often used in relational databases, or in databases where reading is just as important as writing, or more so; examples are PostgreSQL and MySQL.

This project will use an architecture that is heavily inspired by CassandraDB’s SSTable architecture.

SSTables in practice

SSTables (Sorted String Tables) are effectively sorted lists of entries that can be stored on disk. Because the list is sorted, it can be searched efficiently. When data is written to the database, it is initially stored in an in-memory data structure. When that in-memory structure reaches a threshold size, it is written to disk as an SSTable. An SSTable is immutable and cannot be updated, so mutations to existing records must be written to a new SSTable: updated records become new entries with the same key in the memtable, which are later flushed to new SSTables, and deleted records likewise become new entries that are tombstoned (marked as deleted).

Imagine a key-value table written and stored on disk.

Table 1:

Key, Deleted, Value

1, False, "a"
2, False, "b"
3, False, "c"
6, False, "d"

At this point the user adds the record (4, ‘a2’), removes the record with id 2 and updates record 6 to the value ‘d2’. These changes are written to a new table on disk:

Table 2:

Key, Deleted, Value

2, True,  "b"
4, False, "a2"
6, False, "d2"

When the database is queried, we first look in table 2 and then in table 1. This ensures the database always returns the most up-to-date information. A downside of this approach is that if a queried key is not in the database at all, every table has to be searched.

Compaction

The SSTable method of storing data introduces a lot of potential duplication. To mitigate this, tables are compacted in the background over time. For example, the two tables above could be compacted into the following single table:

Table 3:

Key, Deleted, Value

1, False, "a"
3, False, "c"
4, False, "a2"
6, False, "d2"

Because both input tables are already sorted, they can be merged into a single sorted table in one pass, using the merge step of merge sort.
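
As an illustration, here is a minimal sketch of that merge in Go, using the Entry struct defined later in this post and assuming both tables fit in memory as sorted slices (real compaction would stream from disk). It drops tombstones, which is only safe when no older table remains, as in Table 3 above:

func compact(newer, older []Entry) []Entry {
	merged := []Entry{}
	appendLive := func(e Entry) {
		if !e.deleted { //Drop tombstones; no older table can resurrect the key
			merged = append(merged, e)
		}
	}

	i, j := 0, 0
	for i < len(newer) && j < len(older) {
		switch bytes.Compare(newer[i].id, older[j].id) {
		case -1:
			appendLive(newer[i])
			i++
		case 1:
			appendLive(older[j])
			j++
		default: //Same key in both tables: the newer entry shadows the older one
			appendLive(newer[i])
			i++
			j++
		}
	}
	for ; i < len(newer); i++ {
		appendLive(newer[i])
	}
	for ; j < len(older); j++ {
		appendLive(older[j])
	}
	return merged
}

Running this with Table 2 as the newer input and Table 1 as the older input produces exactly Table 3.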

SSTables enable quick writes, as the database only ever appends to a file, which is much quicker than writing at a random location in a file.

Blocks

It’s impossible to predict the length of an entry, as it depends on what the user is storing. This means that when multiple entries are written to an SSTable, we cannot know where entries start and end without scanning the file from the beginning. To address this, SSTables are typically structured into blocks of fixed size, which opens the door to many search optimizations. Let’s investigate how these blocks are structured.

First we determine a static size for each block, typically around 2KB. We adhere to the following rules when writing to a block:

  • Each block always starts with a new entry to ensure entries are never split across block boundaries.
  • If an entry cannot be added to a block without exceeding the remaining space in the block, the entry is added to a new block. The current block is padded with empty bytes. Padding is important because we want blocks of fixed size, but this also wastes some space on disk.
  • To keep things simple: if an entry is larger than the block size it will never fit, and the database will return an error.

The image below shows a possible block layout on disk with a 10 byte block size. The first 2 entries are placed in the first block, which leaves 2 bytes left. The next entry is 6 bytes and does not fit. The current block is padded with 2 bytes (marked in blue), and the new entry is added to the next block.

Image displaying multiple blocks

In a later part, we can use the predictable block structure to build an index that records which keys are in which block. This will greatly increase search speed.

Database design

The SSTable provides a structure for storing data on disk that is optimized for insertion. However, to have a working database we need a few more components. In this section we cover the high-level database design:

  • A memory table that stores recent operations on the database

  • A WAL (Write Ahead Log) to guarantee persistence of the memory table in case the computer shuts down

  • SSTables that store ordered rows.

When a user inserts a new record, the action is written to the WAL and inserted into the memory table (memtable). The memory table is ordered by the primary key, but the WAL is ordered by time. If the process crashes, all data in the memtable is lost. After restarting, the database process can replay the WAL to restore the memtable.

As more data is inserted over time, the memtable grows. When it reaches a certain size (e.g. 2MB), it is written to an SSTable on disk. Over time the SSTables are compacted in the background.
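
As a sketch, the flush trigger could look like this (the function name, the size tracking and the file path are illustrative and not part of the post’s code; it uses the SSTable writer built later in this post):

const memtableFlushThreshold = 2 << 20 //2MB, as mentioned above

func maybeFlush(memtable *Memtable, approxSizeBytes int, sstablePath string) error {
	if approxSizeBytes < memtableFlushThreshold {
		return nil //Keep accumulating in memory
	}

	writer := newSSTableWriterFromPath(sstablePath)
	return writer.writeFromMemtable(memtable)
}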

Image displaying database design

Structuring files and tables

Each table in the database is given a separate directory with SSTables. The database also maintains a separate memtable per table. The WAL however is unified and contains the operations on all tables.

Of course the database needs to keep track of which tables exist. When the database is initialized, it will create a ‘bookkeeping’ table to store this information.
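
A possible on-disk layout could then look like this (the directory and file names are illustrative, not prescribed by anything in this post):

data/
	wal              (single WAL shared by all tables)
	bookkeeping/     (table created by the database itself)
		sstable-0
	users/           (a user-defined table)
		sstable-0
		sstable-1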

Database operations

Inserting data

If a user inserts a record, we write the insertion to the WAL and insert the row in the memtable in its appropriate ordered location.

Updating data

If a user updates a record, we update the data in the memtable; if the record is not in there (because it has been flushed to disk), it is simply added to the memtable as if it were an insertion.

Reading data

When a user queries a specific primary key, the database engine first searches the memtable. If the key isn’t there, the engine searches the most recent SSTable, then each older SSTable, until the key is found. This means that if the key is not in the database at all, everything has to be searched. This part can be sped up later with an indexing strategy.

Deleting data

Like updating, we write a new version of the record with its deleted flag set to true.

A first implementation

Let’s start implementing some of the basic building blocks of the database. I have decided to use Golang, as I’m familiar with it and it’s easy to read.

Memory table

First we implement the memtable component. We need the memtable to be able to support different types of table structures later on, so let’s design it with that in mind.

We’ll maintain an ordered list of entries. Each entry has a primary key, a deleted flag and a value. For now both the primary key and the value are encoded as byte slices, which allows us to store any type of data in the database; later on we might change this to better support our schema mechanism.

type Memtable struct {
	entries *list.List
}

type Entry struct {
	id      []byte
	value   []byte
	deleted bool
}

The memtable has functions for insert, update and get:

func newMemtable() Memtable {
	return Memtable{
		entries: list.New(),
	}
}

func (m *Memtable) get(id []byte) *Entry {
	for e := m.entries.Front(); e != nil; e = e.Next() {
		if bytes.Equal(id, e.Value.(Entry).id) {
			entry := e.Value.(Entry)
			return &entry
		}
	}
	return nil
}

func (m *Memtable) update(id []byte, value []byte) {

	entry := Entry{
		id:      id,
		value:   value,
		deleted: false,
	}

	for e := m.entries.Front(); e != nil; e = e.Next() {
		if bytes.Equal(e.Value.(Entry).id, id) {
			e.Value = entry
			return
		}
	}

	//Not in the memtable (e.g. already flushed to disk): add it as a new entry
	m.insert(entry)
}

func (m *Memtable) insert(entry Entry) {
	for e := m.entries.Front(); e != nil; e = e.Next() {
		//Insert before the first existing entry with a larger id to keep the list sorted
		if bytes.Compare(entry.id, e.Value.(Entry).id) < 0 {
			m.entries.InsertBefore(entry, e)
			return
		}
	}

	m.entries.PushBack(entry)
}
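
The operations section above also described deletes as tombstones. That isn’t part of the code yet, but a sketch following the same pattern as update could look like this:

func (m *Memtable) delete(id []byte) {
	tombstone := Entry{id: id, deleted: true}

	for e := m.entries.Front(); e != nil; e = e.Next() {
		if bytes.Equal(e.Value.(Entry).id, id) {
			e.Value = tombstone //Key still in memory: overwrite it in place
			return
		}
	}

	m.insert(tombstone) //Key already flushed to disk: record a new tombstone entry
}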

Serializing the data

In order to store the data on disk and in the WAL, entries need to be serialized. We’ll convert an entry into a byte array with the following format: one byte for the id length, the id itself, one byte for the deleted flag, one byte for the value length, and finally the value.

Image displaying entry layout on disk

func (entry *Entry) serialize() (int, []byte) {
	var header bytes.Buffer
	var content bytes.Buffer

	//A single length byte limits ids (and values below) to 255 bytes
	header.WriteByte(byte(len(entry.id)))
	header.Write(entry.id)

	if entry.deleted {
		content.WriteByte(1)
	} else {
		content.WriteByte(0)
	}

	content.WriteByte(byte(len(entry.value)))
	content.Write(entry.value)

	//Named so it doesn't shadow the bytes package
	data := append(header.Bytes(), content.Bytes()...)

	return len(data), data
}
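
As a quick sanity check of the format: an entry with id "a", value "xy" and deleted set to false serializes to six bytes, [0x01, 'a', 0x00, 0x02, 'x', 'y'] (id length, id, deleted flag, value length, value). Note that a single length byte limits ids and values to 255 bytes each, which is fine for now.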

To read it back into the data structure:

func (entry *Entry) deserialize(entryBytes []byte) error {
	buf := bytes.NewBuffer(entryBytes)

	idLen, err := mustReadByte(buf)
	if err != nil {
		return err
	}

	id, err := mustReadN(buf, int(idLen))
	if err != nil {
		return err
	}
	entry.id = id

	deletedByte, err := mustReadByte(buf)
	if err != nil {
		return err
	}
	entry.deleted = deletedByte == 1

	valueLen, err := mustReadByte(buf)
	if err != nil {
		return err
	}

	value, err := mustReadN(buf, int(valueLen))
	if err != nil {
		return err
	}
	entry.value = value

	return nil
}
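
The mustReadByte and mustReadN helpers aren’t shown in the post; minimal versions matching the current signatures could look like this (they wrap the buffer and turn short reads into errors; note that the signatures change to *bufio.Reader later in this post). io.ReadFull comes from the standard io package:

func mustReadByte(buf *bytes.Buffer) (byte, error) {
	return buf.ReadByte()
}

func mustReadN(buf *bytes.Buffer, n int) ([]byte, error) {
	//io.ReadFull returns an error if fewer than n bytes are available
	out := make([]byte, n)
	if _, err := io.ReadFull(buf, out); err != nil {
		return nil, err
	}
	return out, nil
}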

To test whether the serialization is working, we write some unit tests.

func TestSerializeDeserializeStr(t *testing.T) {
	e := Entry{
		id:      []byte("abc"),
		value:   []byte("abcd"),
		deleted: false,
	}

	_, s := e.serialize()
	e1 := Entry{}
	e1.deserialize(s)

	if !reflect.DeepEqual(e, e1) {
		t.Errorf("Deserialized struct does not match original.\n Expected \n%v \n got \n%v", e, e1)
	}
}

func TestSerializeDeserialize(t *testing.T) {
	e := Entry{
		id:      intToBytes(368),
		value:   intToBytes(368),
		deleted: false,
	}

	_, s := e.serialize()
	e1 := Entry{}
	e1.deserialize(s)

	if !reflect.DeepEqual(e, e1) {
		t.Errorf("Deserialized struct does not match original.\n Expected \n%v \n got \n%v", e, e1)
	}
}
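
The intToBytes helper used here isn’t shown in the post. A minimal version could encode the int as 8 big-endian bytes (using encoding/binary); big-endian matters, because it makes byte-wise comparison with bytes.Compare agree with numeric ordering, which keeps the memtable sorted correctly:

func intToBytes(i int) []byte {
	b := make([]byte, 8)
	binary.BigEndian.PutUint64(b, uint64(i))
	return b
}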

Creating the SSTable

When the memtable is full, we want to store it on disk. This is where the SSTable comes in. To build the table, we serialize each entry in the list and write it into the block structure defined earlier. For now we’ll only implement a single table, so bookkeeping will be very simple.

First we’ll create a custom SSTableWriter and SSTableReader. These use a bufio Reader/Writer to access the files. bufio batches small reads and writes, which is much quicker than accessing files directly, and it also allows us to work with in-memory buffers, which is helpful for testing.

type SSTableWriter struct {
	buffer          *bufio.Writer
	currentBlockLen int      //The length of the current block we're writing to
	file            *os.File //Kept so the file can be closed later; nil when writing to an in-memory buffer
}

type SSTableReader struct {
	buffer    *bufio.Reader
	rawBuffer *bytes.Buffer
	file      *os.File
}

func newSSTableReader(rawBuffer *bytes.Buffer) SSTableReader {
	return SSTableReader{
		buffer:    bufio.NewReader(rawBuffer),
		rawBuffer: rawBuffer,
	}
}

func newSSTableReaderFromPath(path string) SSTableReader {
	fd, err := os.Open(path)
	panicIfErr(err)
	return SSTableReader{
		buffer: bufio.NewReader(fd),
		file:   fd,
	}
}

func newSSTableWriter(buffer *bufio.Writer) SSTableWriter {
	return SSTableWriter{
		buffer:          buffer,
		currentBlockLen: 0,
	}
}

func newSSTableWriterFromPath(path string) SSTableWriter {
	fd, err := os.OpenFile(path, os.O_CREATE|os.O_RDWR, 0644)
	panicIfErr(err)
	return SSTableWriter{
		buffer:          bufio.NewWriter(fd),
		currentBlockLen: 0,
		file:            fd,
	}
}
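
The panicIfErr helper and the config holding the block size (referenced by the reader and writer below) aren’t shown in the post; minimal stand-ins could look like this. The 100-byte block size is an assumption that matches the writer’s original hard-coded value and keeps tests small; a real setup would use something closer to the 2KB mentioned earlier:

//Sketch: a package-level config; the final code may structure this differently
var config = struct {
	BlockSize int
}{
	BlockSize: 100, //Small block size, convenient for testing
}

func panicIfErr(err error) {
	if err != nil {
		panic(err)
	}
}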

Next we’ll make a function for writing a single entry to the buffer:

func (writer *SSTableWriter) writeSingleEntry(entry *Entry) error {
	blockSize := config.BlockSize //Shared with the reader below

	size, serializedEntry := entry.serialize()
	//Check whether there is enough space left in the current block
	if size > (blockSize - writer.currentBlockLen) {
		if size > blockSize {
			//Will never fit in any block
			return errors.New("entry larger than max block size")
		}

		//Pad the remainder of the current block with zero bytes
		padding := blockSize - writer.currentBlockLen
		_, err := writer.buffer.Write(make([]byte, padding))
		panicIfErr(err)

		writer.currentBlockLen = 0
	}

	writer.currentBlockLen += size
	_, err := writer.buffer.Write(serializedEntry)
	panicIfErr(err)
	writer.buffer.Flush()
	return nil
}

And finally a function for iterating through the memtable and writing each entry:

func (writer *SSTableWriter) writeFromMemtable(memtable *Memtable) error {
	for e := memtable.entries.Front(); e != nil; e = e.Next() {
		entry := e.Value.(Entry)
		if err := writer.writeSingleEntry(&entry); err != nil {
			return err
		}
	}
	return nil
}

Searching in the SSTable

We can now do a very simple lookup in the table. This implementation is just to try things out; later we’ll replace it with a faster solution. It would be nice to reuse the deserialize function we wrote earlier, but that function takes a byte slice. It would be better if it took a buffer, so we can lend our reader’s buffer to it. This change is actually very easy: the function already uses a buffer internally!

The first few lines of the function right now are:

func (entry *Entry) deserialize(entryBytes []byte) error {
	buf := bytes.NewBuffer(entryBytes)

Let’s change that to:

func (entry *Entry) deserialize(buf *bufio.Reader) error {

We also update the mustReadN and mustReadByte function signatures:

func mustReadN(buf *bufio.Reader, n int) ([]byte, error)
func mustReadByte(buf *bufio.Reader) (byte, error)

(I also changed some of the tests to support this new function signature, but you can check out the final code to see those (minor) changes.)

This is great: now we can properly reuse the deserialize function. Let’s move on to the search function. We’ll create a function that reads the next entry from an SSTable. To avoid loading the entire file into memory, and to read efficiently from disk, we use the bufio package again.

func (reader *SSTableReader) readNextEntry() (Entry, error) {
	for { //A valid entry never starts with a zero byte (its id length is at least 1), so a zero byte must be block padding. Skip until a new entry or EOF
		idSize, err := reader.buffer.Peek(1)

		if err != nil {
			return Entry{}, err
		}

		if idSize[0] == byte(0) {
			reader.buffer.ReadByte() //Consume the padding byte
			continue
		}

		reader.buffer.Peek(config.BlockSize) //Pre-fill the read buffer with the rest of the block
		break
	}

	entry := Entry{}
	if err := entry.deserialize(reader.buffer); err != nil {
		return Entry{}, err
	}

	return entry, nil
}

func scanSSTable(buffer *bufio.Reader, searchId []byte) (*Entry, error) {
	reader := SSTableReader{buffer: buffer}
	for {
		entry, err := reader.readNextEntry()
		if err == io.EOF {
			//Reached the end of the table without finding the key
			return nil, nil
		}
		if err != nil {
			return nil, err
		}
		if !bytes.Equal(entry.id, searchId) {
			continue
		}

		return &entry, nil
	}
}

To wrap up, I’ll also write a unit test to check that the SSTable works as expected.

func TestSSTable(t *testing.T) {
	memtable := newMemtable()

	for id := range 3000 {
		memtable.insertRaw(intToBytes(id), intToBytes(id))
	}

	buffer := bytes.Buffer{}
	writer := newSSTableWriter(bufio.NewWriter(&buffer))
	writer.writeFromMemtable(&memtable)

	reader := bufio.NewReader(&buffer)

	for id := range 3000 {
		result, _ := scanSSTable(reader, intToBytes(id))

		entry := Entry{
			intToBytes(id),
			intToBytes(id),
			false,
		}

		if !reflect.DeepEqual(&entry, result) {
			t.Errorf("Result does not match query. \nExpected: \n%v\n Got:\n %v", entry, result)
			return
		}
	}
}
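
The test uses an insertRaw convenience method that the post doesn’t show; presumably it just wraps insert:

func (m *Memtable) insertRaw(id []byte, value []byte) {
	m.insert(Entry{id: id, value: value, deleted: false})
}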

Conclusion

In this post we went over some fundamental concepts of database design and implemented the basic building blocks. In the next post we’ll expand the database by implementing the WAL and managing multiple SSTable files on disk.