Key-Value Store
Overview
Designing a distributed key-value store (like Redis or DynamoDB) is a staple system design question at companies building large-scale systems. Amazon, Google, and Meta all operate massive KV stores, and understanding partitioning, replication, consistency models, and failure handling is essential. The question tests whether you can reason about the CAP theorem, eventual versus strong consistency, and how to scale storage horizontally. It matters in interviews because KV stores underpin nearly every high-performance system (caching, session storage, feature flags), and the trade-offs you discuss (consistency vs. availability, read vs. write scaling) apply broadly across distributed systems.
Requirements
Functional
- Put(key, value) and Get(key) operations
- Delete(key) support
- Support for key expiration (TTL)
- Range queries or prefix scans (optional)
- Atomic compare-and-set or conditional writes
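The functional surface above is small enough to sketch end to end. The snippet below is a single-node, in-memory illustration (class and method names are our own, not from any particular store): TTL is enforced lazily on read, and compare-and-set is atomic here only because a single thread runs it.

```python
import time
from typing import Any, Optional

class KVStore:
    """Single-node, in-memory sketch of the functional API."""

    def __init__(self) -> None:
        # key -> (value, absolute expiry time or None)
        self._data: dict[str, tuple[Any, Optional[float]]] = {}

    def put(self, key: str, value: Any, ttl: Optional[float] = None) -> None:
        expiry = time.monotonic() + ttl if ttl is not None else None
        self._data[key] = (value, expiry)

    def get(self, key: str) -> Optional[Any]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expiry = entry
        if expiry is not None and time.monotonic() >= expiry:
            del self._data[key]          # lazy TTL eviction on read
            return None
        return value

    def delete(self, key: str) -> bool:
        return self._data.pop(key, None) is not None

    def compare_and_set(self, key: str, expected: Any, new: Any) -> bool:
        # Atomic only in this single-threaded sketch; a real store needs
        # per-key locking or consensus on the owning replica.
        if self.get(key) != expected:
            return False
        self.put(key, new)
        return True
```

In a distributed setting, each of these calls additionally routes through partitioning and replication, covered below.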
Non-Functional
- Low latency — single-digit ms for get/put
- High throughput — millions of ops per second
- Durability — data survives node failures
- Scalability — petabyte-scale storage
Capacity Estimation
Assume ~1 trillion keys at 1KB avg value ≈ 1PB of raw data (note: 1B keys at 1KB would be only ~1TB). Traffic: 100K writes/s, 1M reads/s. Replication 3x → 3PB total storage.
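The arithmetic is worth spelling out, since 1 billion keys at 1KB is only about 1TB; petabyte scale implies on the order of a trillion keys. A quick back-of-envelope check, using the traffic figures assumed above:

```python
keys = 10**12            # ~1 trillion keys (petabyte scale implies this order)
avg_value = 1_000        # 1 KB average value, in bytes
replication = 3

raw = keys * avg_value
total = raw * replication
print(f"raw: {raw / 1e15:.0f} PB, replicated: {total / 1e15:.0f} PB")  # → raw: 1 PB, replicated: 3 PB

# Bandwidth at the stated traffic rates:
write_bw = 100_000 * avg_value      # 100 MB/s of incoming writes (before replication)
read_bw = 1_000_000 * avg_value     # 1 GB/s of reads (before any caching)
```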
Architecture Diagram
Component Deep Dive
API Layer
Handles client connections, routes requests to appropriate partition. Supports REST or custom binary protocol.
Partitioning Service
Consistent hashing or range-based partitioning. Maps key to partition/shard. Handles rebalancing on scale-out.
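A minimal consistent-hash ring with virtual nodes might look like the following (names and the choice of MD5 are illustrative; production systems typically use faster hashes such as Murmur or xxHash):

```python
import bisect
import hashlib

class HashRing:
    """Consistent-hash ring with virtual nodes (sketch)."""

    def __init__(self, nodes: list[str], vnodes: int = 100) -> None:
        self._entries: list[tuple[int, str]] = []   # sorted (hash, node) pairs
        self._hashes: list[int] = []
        for node in nodes:
            self.add_node(node, vnodes)

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node: str, vnodes: int = 100) -> None:
        # Each physical node owns many points on the ring, which spreads
        # load evenly and smooths rebalancing when nodes join or leave.
        for i in range(vnodes):
            self._entries.append((self._hash(f"{node}#{i}"), node))
        self._entries.sort()
        self._hashes = [h for h, _ in self._entries]

    def lookup(self, key: str) -> str:
        # A key is owned by the first ring point at or after its hash,
        # wrapping around to the start of the ring.
        idx = bisect.bisect_right(self._hashes, self._hash(key)) % len(self._entries)
        return self._entries[idx][1]
```

Lookups are O(log V) in the total number of virtual nodes, and adding a node only reassigns the key ranges adjacent to its ring points.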
Storage Node
Stores key-value pairs. Uses LSM tree or B-tree. Handles compaction, garbage collection.
Replication Layer
Replicates data across nodes. Leader-follower or multi-leader. Handles failover.
Gossip / Membership
Cluster membership, failure detection. Nodes discover each other and detect failures.
Coordinator
Handles cross-partition transactions if needed. Coordinates quorum reads/writes.
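Quorum reads and writes are the standard Dynamo-style mechanism here: with N replicas, writing to W and reading from R guarantees overlap whenever R + W > N, so a read always intersects the latest acknowledged write. A sketch (plain version numbers stand in for vector clocks or timestamps):

```python
import random

N, W, R = 3, 2, 2                       # R + W > N → quorums always overlap
replicas = [dict() for _ in range(N)]   # each replica: key -> (value, version)

def quorum_write(key, value, version):
    # Acknowledge after W replicas accept; the rest catch up later
    # via anti-entropy / read repair.
    for i in random.sample(range(N), W):
        replicas[i][key] = (value, version)

def quorum_read(key):
    # Contact R replicas and return the highest-versioned value seen.
    sampled = random.sample(replicas, R)
    found = [rep[key] for rep in sampled if key in rep]
    return max(found, key=lambda t: t[1])[0] if found else None

quorum_write("k", "v1", 1)
quorum_write("k", "v2", 2)
# Because R + W > N, every read quorum includes at least one replica
# holding version 2, so every read returns "v2":
assert all(quorum_read("k") == "v2" for _ in range(100))
```

Dropping to W=1, R=1 trades this guarantee for lower latency and higher availability, which is exactly the eventual-consistency trade-off discussed below.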
Database Design
Each storage node uses an LSM-tree (LevelDB, RocksDB) or B-tree. Partition key determines shard. Replication via WAL + async/sync replica. Metadata (partition map) in ZooKeeper or etcd.
API Design
| Method | Path | Description |
|---|---|---|
| PUT | /v1/keys/{key} | Store key-value. Body: {value, ttl?}. Returns 200. |
| GET | /v1/keys/{key} | Retrieve value. Returns 200 + value or 404. |
| DELETE | /v1/keys/{key} | Delete key. Returns 200. |
| GET | /v1/keys?prefix={p} | List keys with prefix (if supported). |
Scalability & Trade-offs
- Consistency vs availability: Strong consistency (linearizable) requires quorum and adds latency; eventual consistency allows higher availability.
- LSM vs B-tree: LSM has better write throughput; B-tree has better read latency and no compaction stalls.
- Partitioning: Consistent hashing minimizes rebalancing; range partitioning allows efficient range queries.
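The rebalancing point is easy to demonstrate: with naive modulo placement (node = hash(key) % N), growing a cluster from 4 to 5 nodes remaps roughly 80% of all keys, whereas consistent hashing moves only the ~1/5 that the new node takes over. A quick check of the modulo case:

```python
import hashlib

def h(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

keys = [f"key-{i}" for i in range(10_000)]
before = {k: h(k) % 4 for k in keys}   # 4-node cluster, modulo placement
after = {k: h(k) % 5 for k in keys}    # add a 5th node
moved = sum(before[k] != after[k] for k in keys) / len(keys)
print(f"{moved:.0%} of keys moved")    # ≈ 80%
# Consistent hashing would instead move only ~1/5 of keys: just the
# ranges adjacent to the new node's points on the ring.
```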
Related System Designs
- Distributed Cache (Memcached, Redis Cluster)
- File Storage (Dropbox, Google Drive)
- URL Shortener (TinyURL)