Key-Value Stores

Deep dive into distributed key-value store architecture. Learn about LSM-Trees, consistent hashing, replication quorums, vector clocks, and Gossip protocols.

Key-Value Stores

A Key-Value Store is a non-relational database that stores data as a collection of key-value pairs. Keys are unique strings or hashes, and values can be simple types (strings, integers) or complex blobs (JSON, serialized objects). Key-value stores provide sub-millisecond, constant-time $O(1)$ lookups and high write throughput, serving as a critical building block for distributed systems.


1. High-Level Distributed Architecture

To scale a key-value store beyond a single machine, we partition keys across a ring of coordinator nodes using consistent hashing, replicating each key to multiple physical nodes for high availability.


2. Under the Hood: The LSM-Tree Storage Engine

Key-value stores optimized for high write performance (like Cassandra or RocksDB) use a Log-Structured Merge-Tree (LSM-Tree) engine instead of modifying files in place.

The Write Path (Blazing Fast)

  1. Write-Ahead Log (WAL): The write is appended to a sequential log on disk. If the server crashes, this log is replayed to recover lost memory data.
  2. MemTable: The data is written to an in-memory sorted data structure (typically a Red-Black Tree or Skip List).
  3. Flush: When the MemTable reaches a threshold (e.g., 64MB), it is flushed to disk as an immutable SSTable (Sorted String Table).

The Read Path (Optimized)

  1. Check the MemTable. If the key is there, return it.
  2. If not, check the Bloom Filter (an in-memory, space-efficient probabilistic data structure that can tell if a key is definitely not in the SSTables).
  3. If the Bloom filter indicates the key might exist, scan the SSTables from newest to oldest.

3. Distributed Partitioning & Consistent Hashing

Consistent hashing maps keys to nodes on a circular hash space (the "Hash Ring"), preventing massive data migration when nodes are added or removed.

  • Virtual Nodes (Vnodes): Instead of mapping a physical server to a single token on the ring, each server is assigned multiple tokens (e.g., 256 virtual nodes). This ensures that if Node A has double the hardware capacity of Node B, Node A can be assigned twice as many virtual nodes, balancing the database load proportionally.

4. Tunable Consistency (Dynamo-style)

In distributed key-value stores, you can tune consistency on a per-request basis by configuring three variables:

  • $N$: Replication Factor (number of nodes storing a copy of the data).
  • $W$: Write Quorum (number of replica nodes that must acknowledge a write before it is marked successful).
  • $R$: Read Quorum (number of replica nodes that must respond to a read query before returning the result to the client).
Code
W + R > N  ──► Strong Consistency (Guarantees reading the latest write)
W + R <= N ──► Eventual Consistency (Fast, but read may return stale data)

Common Quorum Configurations (where $N = 3$)

Configuration$W$$R$CharacteristicsUse Case
Balanced Quorum$2$$2$Strong consistency. Decent read/write speed. Can tolerate $1$ node failure.Default transactional system.
Write-Optimized$1$$3$Extremely fast writes. Slow reads (must query all nodes).IoT metrics logging.
Read-Optimized$3$$1$Extremely fast reads. Slow writes (must wait for all replicas).Product Catalog caching.

5. Conflict Resolution: Vector Clocks

In a masterless system (e.g., Cassandra or DynamoDB), network partitions can result in concurrent writes to different replicas, creating conflicts.

To track causality and detect concurrent updates, distributed stores use Vector Clocks. A vector clock is a list of [node, counter] pairs associated with a record.

  • Causally Dependent: If Clock 1 has all counters greater than or equal to Clock 2 (e.g., [A: 2, B: 1] vs. [A: 1, B: 1]), Clock 1 is newer and overwrites Clock 2.
  • Concurrent (Conflict): If one clock has a higher counter on Node A but a lower counter on Node B (e.g., [A: 2, B: 1] vs. [A: 1, B: 2]), a conflict has occurred. The system returns both versions to the application client to resolve (e.g., merging shopping cart items).

6. Anti-Entropy & Node Synchronization

Distributed nodes must detect when data drifts between replicas (due to temporary network drops) without scanning the entire database.

Gossip Protocol

Nodes continuously exchange cluster state metadata (e.g., node health, token ring membership, vector clocks) by periodically pinging a few random nodes. This decentralizes membership management and ensures that membership updates propagate exponentially across the cluster.

Merkle Trees

To resolve differences between two replica databases, nodes compare Merkle Trees (cryptographic hash trees).

  • Nodes compare the Root Hash. If it matches, the databases are identical.
  • If the root hash differs, they compare the children. They traverse down the tree to isolate the exact leaf keys that differ, only replicating the modified records over the network.

7. Memory Management & Eviction Policies

When acting as a cache (like Redis), the data size may exceed physical RAM, requiring an Eviction Policy to purge older data:

PolicyFull NameMechanismBest For
LRULeast Recently UsedEvicts the key that has not been accessed for the longest time.General-purpose caching.
LFULeast Frequently UsedEvicts the key with the lowest access count.Retaining "hot" items (e.g., homepage banners).
FIFOFirst In, First OutEvicts the oldest key in the store, regardless of access patterns.Time-series stream buffers.
Volatile-TTLVolatile Time-To-LiveEvicts keys with the shortest remaining TTL first.Temporary session tokens.
No-EvictionNo EvictionRejects new writes with an error once memory is full.Critical database stores where loss is unacceptable.