Databases
Scale your data layer. Learn about B-Trees vs LSM-Trees, replication models, the CAP/PACELC theorems, sharding strategies, and distributed database challenges.
Database Architecture
Scaling the application layer is simple—you spin up more stateless servers behind a load balancer. However, scaling the data layer is one of the most difficult challenges in distributed systems. Databases must maintain state, ensure consistency, and prevent data loss, even during hardware crashes and network partitions.
1. Storage Engines: B-Trees vs. LSM-Trees
How a database writes data to the disk determines its performance characteristics. The two dominant engine designs are:
B-Trees (Read-Optimized)
B-Trees organize data into fixed-size pages (typically 4KB) on disk. They maintain a sorted, balanced tree structure.
- How it works: Writes modify pages in place. If a page is full, it splits into two.
- Pros: Fast, constant-time random reads ($O(\log N)$). Great for relational systems with complex index queries (e.g., PostgreSQL, MySQL).
- Cons: Slow writes. Modifying a record requires reading the page, modifying it, and writing it back to disk (random disk I/O), plus keeping indexes in sync.
Log-Structured Merge-Trees (LSM-Trees - Write-Optimized)
LSM-Trees append all writes to an in-memory sorted buffer (MemTable) and a sequential append-only log (Write-Ahead Log).
- How it works: When the MemTable is full, it is flushed to disk as a sorted, immutable file called a Sorted String Table (SSTable). Background threads merge and deduplicate SSTables (Compaction).
- Pros: Blazing fast writes. All writes are sequential disk appends. Perfect for write-heavy workloads (e.g., Cassandra, RocksDB, ScyllaDB).
- Cons: Slower reads. Reading requires checking the MemTable and scanning multiple SSTables (optimized using Bloom filters).
2. Database Replication Models
Replication is the practice of keeping copies of the same data on multiple physical machines. This provides high availability (if one node dies, read from another) and increases read capacity.
1. Leader-Follower (Master-Slave) Replication
All write operations go to a single designated node (the Leader). The leader writes the data locally and broadcasts the change log to all Followers (Read Replicas). Clients read from any node.
- Synchronous Replication: The leader waits for the follower to confirm the write before responding to the client.
- Trade-off: Guaranteed consistency, but writes are slow (latency of the slowest replica).
- Asynchronous Replication: The leader responds to the client immediately after writing locally and replicates to followers in the background.
- Trade-off: Fast writes, but followers may lag. A read immediately after a write might return stale data (Eventual Consistency).
2. Multi-Leader (Active-Active) Replication
Multiple database nodes act as leaders, accepting write operations. They synchronize changes with each other asynchronously.
- Pros: Survival of local data center outages; handles writes close to users globally.
- Cons: Extremely complex conflict resolution (e.g., what if two leaders update the same row at the same time?).
3. Leaderless (Dynamo-style) Replication
There is no central leader. Clients write to and read from multiple nodes in parallel. It uses quorums to guarantee consistency.
3. Distributed Consistency: CAP vs. PACELC Theorems
When you distribute a database across multiple servers, you are governed by the laws of network physics.
The CAP Theorem
The CAP theorem states that in a distributed data store, you can guarantee at most two of the following three characteristics:
- Consistency (C): Every read receives the most recent write or an error.
- Availability (A): Every non-failing node returns a non-error response (without guarantee that it contains the latest write).
- Partition Tolerance (P): The system continues to operate despite arbitrary packet loss or network splits.
[!IMPORTANT] Network partitions are inevitable. Therefore, in system design, you must choose between Consistency (CP) or Availability (AP) during a network partition:
- CP (Choose Consistency): Reject writes or delay reads to guarantee data correctness across nodes. Used in financial systems (e.g., Spanner, etcd).
- AP (Choose Availability): Accept writes on any reachable node. Replicas will sync when the partition heals. Used in social media feeds (e.g., DynamoDB, Cassandra).
The PACELC Theorem
The CAP theorem only applies when there is a partition. PACELC expands CAP to describe normal operations:
- If there is a Partition, choose between Availability or Consistency.
- Else (during normal operations), choose between Latency or Consistency.
| System | Classification | Behavior |
|---|---|---|
| MongoDB | PA/EL | During partition, chooses availability. Otherwise, prioritizes low latency. |
| Spanner | PC/EC | Prioritizes consistency at all times, accepting latency penalties. |
4. Horizontal Partitioning: Database Sharding
When your database size exceeds the storage of a single machine, or when write traffic saturates the disk controller, you must partition your database horizontally across multiple servers (Shards).
Sharding Strategies
1. Range-Based Sharding
Data is partitioned based on ranges of a key value (e.g., Shard A stores users starting with A-G, Shard B stores H-P, etc.).
- Pros: Easy to query ranges of data sequentially.
- Cons: Severe hot-spot issues. If 80% of users have last names starting with S, Shard C will crash under load while Shards A and B sit idle.
2. Hash-Based Sharding
A hash function is applied to the shard key, and the modulo of the number of shards determines the target database.
Shard ID = Hash(Shard Key) % Number of Shards- Pros: Even data distribution; prevents hot spots.
- Cons: Adding or removing a shard invalidates the modulo calculation, requiring you to migrate up to 90% of your data to new servers (solved by Consistent Hashing).
3. Directory-Based Sharding
A lookup service (or routing registry) stores the mapping between shard keys and physical database servers.
- Pros: Complete flexibility. You can move individual user records between shards without rehashing.
- Cons: The directory database is a single point of failure and adds network latency to every query.
5. Major Challenges of Sharded Databases
Sharding is an architectural one-way street. Once you shard, operations that were simple in a monolithic database become highly complex:
- Cross-Shard Joins: You cannot perform a
JOINquery across tables located on different shards. You must fetch the data from each shard separately and join them in application memory, which is slow and resource-intensive. - Distributed Transactions: Maintaining ACID transactions across multiple databases requires protocols like 2-Phase Commit (2PC) or Saga patterns, which introduce significant latency and network overhead.
- Resharding: When database size grows, you must split an existing shard into two. This requires copying gigabytes of data over the network while the database is live, causing CPU spikes and latency degradation.
6. How to Select a Shard Key
The choice of the shard key determines the success of your architecture. A bad shard key is nearly impossible to change later.
- High Cardinality: Choose a key with millions of unique values (e.g.,
user_id,tenant_id). Do not shard bystatus(e.g.,active/inactive) orgender, as you will end up with only two unbalanced shards. - Even Distribution: Avoid keys that create hot spots. For example, using a
timestampas a shard key means 100% of today's writes will hit the newest shard, while yesterday's shards remain idle. - Matches Query Patterns: Shard by the field most frequently used in your query filters. If 99% of your queries are
SELECT * WHERE user_id = ?, thenuser_idis your ideal shard key. Every query will hit a single database instead of broadcasting to all shards (Scatter-Gather query).