Distributed Message Queue (Kafka)
Overview
Designing a Distributed Message Queue (Kafka, RabbitMQ) tests your understanding of pub/sub, partitioning, replication, and consumer groups. The core challenges include: ordering guarantees (per-partition), durability (replicated log), and scaling consumers. This design matters in interviews because message queues are the backbone of event-driven architecture—used in notification systems, data pipelines, and microservices. Demonstrating you understand topics, partitions, offsets, consumer groups, and how to achieve exactly-once semantics shows you can design systems for reliable, scalable message delivery. Companies like LinkedIn (Kafka), Uber, and Netflix rely on these systems at massive scale.
Requirements
Functional
- Produce messages to topics
- Consume messages (push or pull)
- Message ordering per partition
- Consumer groups for parallel consumption
- Retention and replay (read from offset)
- Exactly-once semantics (optional)
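The exactly-once requirement above is usually met with broker-side deduplication. The following is a toy sketch of Kafka's producer-id + sequence-number idea (function name and shape are illustrative, not the real API): a retried send carrying an already-seen sequence number is acknowledged without being appended again.

```python
def produce_idempotent(log, last_seq, producer_id, seq, value):
    """Broker-side dedup sketch: each producer numbers its sends;
    a retry with a stale sequence number is dropped (but still acked),
    so producer retries cannot create duplicates in the log."""
    if last_seq.get(producer_id, -1) >= seq:
        return False            # duplicate retry: ack without appending
    last_seq[producer_id] = seq
    log.append(value)
    return True
```

Pairing this with transactional offset commits is what turns at-least-once delivery into end-to-end exactly-once processing.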
Non-Functional
- High throughput — millions of messages/sec
- Durability — replicated, persisted
- Low latency — produce ack <5ms
- Scalability — add partitions, add consumers
Capacity Estimation
Assume 1M msg/sec at 1KB average = 1GB/sec ingest. 7-day retention = ~600TB raw, ~1.8PB with replication factor 3. 1000 partitions per topic.
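The estimate above is worth being able to reproduce on the spot; a quick back-of-the-envelope check:

```python
# Back-of-the-envelope check of the capacity numbers above.
msg_rate = 1_000_000            # messages/sec
msg_size = 1_000                # bytes (1 KB average)
retention_days = 7
replication = 3

ingest_per_sec = msg_rate * msg_size                        # bytes/sec
raw_retained = ingest_per_sec * 86_400 * retention_days     # bytes over 7 days
total_with_replicas = raw_retained * replication

print(f"ingest: {ingest_per_sec / 1e9:.1f} GB/sec")         # 1.0 GB/sec
print(f"raw retained: {raw_retained / 1e12:.0f} TB")        # ~605 TB
print(f"with RF=3: {total_with_replicas / 1e15:.2f} PB")    # ~1.81 PB
```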
Architecture Diagram
Component Deep Dive
Broker
Stores partitions (log segments). Handles produce/consume. Leader for some partitions, replica for others.
ZooKeeper / KRaft
Cluster metadata, leader election, partition assignment. Kafka is moving to KRaft (no ZooKeeper).
Producer
Sends messages to a broker. Chooses the partition by key hash, explicit assignment, or round-robin. Waits for ack: acks=0 (no wait), acks=1 (leader persisted), acks=all (all in-sync replicas persisted).
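Partition selection can be sketched as below. Kafka's default partitioner hashes keys with murmur2; this sketch uses stdlib md5 instead, which preserves the property that matters: every message with the same key lands on the same partition (keeping per-key ordering), while keyless messages are spread round-robin.

```python
import hashlib

def choose_partition(key, num_partitions, round_robin_state):
    """Producer-side partition selection sketch. Keyed messages hash
    deterministically to one partition; keyless messages rotate
    through partitions via a mutable counter (round_robin_state[0])."""
    if key is not None:
        digest = hashlib.md5(key).digest()
        return int.from_bytes(digest[:4], "big") % num_partitions
    round_robin_state[0] = (round_robin_state[0] + 1) % num_partitions
    return round_robin_state[0]
```

Note the trade-off this encodes: changing `num_partitions` remaps keys, which is why adding partitions to a keyed topic breaks per-key ordering for in-flight data.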
Consumer
Pulls from broker. Consumer group shares partitions. Commits offset for resume.
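The pull-and-commit cycle looks roughly like this; `fetch`, `process`, and `commit` are hypothetical callables standing in for broker RPCs and application logic. Committing *after* processing gives at-least-once delivery: a crash between process and commit replays a batch rather than losing it.

```python
def consume_loop(fetch, process, commit, start_offset=0):
    """Minimal pull-based consume loop. fetch(offset, max_messages)
    returns the next batch from the partition; commit(offset) records
    the next offset to read so a restarted consumer resumes there."""
    offset = start_offset
    while True:
        batch = fetch(offset, max_messages=100)
        if not batch:
            break
        for msg in batch:
            process(msg)
        offset += len(batch)
        commit(offset)      # persist progress only after processing
    return offset
```

Committing *before* processing flips the guarantee to at-most-once (crashes skip messages instead of replaying them).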
Log Storage
Append-only log per partition. Segmented by size/time. Replicated across brokers.
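A toy in-memory version of a segmented log shows why segmentation helps: retention becomes "delete whole old segments," and offset lookup is a binary search over segment base offsets. (Real brokers roll segments by bytes or time and keep them as files; this sketch keeps everything in lists.)

```python
import bisect

class SegmentedLog:
    """Append-only log split into fixed-size segments, indexed by
    each segment's base offset (the offset of its first record)."""
    def __init__(self, segment_size=1024):
        self.segment_size = segment_size
        self.segments = [[]]        # each segment is a list of records
        self.base_offsets = [0]     # first offset in each segment

    def append(self, record):
        if len(self.segments[-1]) == self.segment_size:
            # roll a new segment once the active one is full
            self.base_offsets.append(self.base_offsets[-1] + self.segment_size)
            self.segments.append([])
        self.segments[-1].append(record)
        return self.base_offsets[-1] + len(self.segments[-1]) - 1

    def read(self, offset):
        # binary-search for the segment containing this offset
        i = bisect.bisect_right(self.base_offsets, offset) - 1
        return self.segments[i][offset - self.base_offsets[i]]
```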
Coordinator
Assigns partitions to consumers in group. Handles rebalance on join/leave.
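The assignment step can be sketched as a range-style strategy: on each rebalance the coordinator sorts the members, gives each a contiguous block of partitions, and spreads any remainder over the first few. (Kafka also ships round-robin and sticky strategies; this is just the simplest one.)

```python
def assign_partitions(consumers, num_partitions):
    """Range-style partition assignment run on every rebalance.
    Sorting makes the result deterministic across coordinators."""
    members = sorted(consumers)
    per, extra = divmod(num_partitions, len(members))
    assignment, start = {}, 0
    for i, member in enumerate(members):
        count = per + (1 if i < extra else 0)   # remainder goes to first members
        assignment[member] = list(range(start, start + count))
        start += count
    return assignment
```

A consequence visible in the sketch: with more consumers than partitions, the extra consumers get empty assignments and sit idle, which is why partition count caps consumer-group parallelism.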
Database Design
No traditional DB. Log segments on disk. Metadata (partition assignment, offsets) in ZooKeeper or KRaft. Consumer offsets in the `__consumer_offsets` topic.
API Design
| Method | Path | Description |
|---|---|---|
| POST | /topics/{topic}/messages | Produce. Body: `{key?, value, partition?}`. Returns offset. |
| GET | /topics/{topic}/messages | Consume. Query: `partition`, `offset`, `limit`. Returns messages. |
| POST | /consumer/commit | Commit offset. Enables resume after restart. |
Scalability & Trade-offs
- Pull vs push: Pull gives consumer control (backpressure); push is simpler, lower latency. Kafka uses pull.
- Ordering vs parallelism: Ordering per partition; more partitions = more parallelism but no cross-partition order.
- Retention: Longer retention enables replay but increases storage. Tiered storage (hot/cold) reduces cost.
Related System Designs
- Notification System
- URL Shortener (TinyURL)
- Rate Limiter