Chat/Messaging System (WhatsApp)
Overview
Designing a chat/messaging system (WhatsApp, Slack, or iMessage) is a challenging system design question that tests real-time delivery, offline support, and consistency. The core challenges are delivering messages with low latency, handling millions of concurrent connections (WebSockets or long polling), storing and syncing message history, and supporting group chats with fan-out. The design matters in interviews because it combines WebSockets, message queues, databases, and caching, and it requires careful thinking about message ordering, idempotency, and read receipts. Companies like Meta, Google, and Slack build these systems at massive scale, so walking through the full flow from sender to recipient demonstrates senior-level systems design skills.
Requirements
Functional
- Send and receive 1:1 and group messages
- Message delivery status (sent, delivered, read)
- Offline message storage and sync when user comes online
- Message history and search
- Media attachments (images, files)
- Typing indicators and presence
Non-Functional
- Low latency: message delivery under 100 ms
- High availability: 99.99% uptime
- Consistency: messages in order, no duplicates
- Scalability: millions of concurrent connections
Capacity Estimation
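Assume 500M users and 100B messages/day, which averages about 1.2M messages/sec (100B / 86,400 s). Each connection server holds about 50K concurrent connections. At 1 KB per message, storage grows by 100B * 1 KB = 100 TB/day, or roughly 36.5 PB/year before replication.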
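The arithmetic behind these estimates can be checked with a few lines of Python. This is only a sketch: the message volume, message size, and connections-per-server figures come from the text, while `ASSUMED_PEAK_CONNECTIONS` (50M sockets open at once) is an illustrative assumption that the text does not state.

```python
# Back-of-the-envelope capacity numbers.
MESSAGES_PER_DAY = 100_000_000_000      # 100B messages/day (from the text)
AVG_MESSAGE_BYTES = 1_000               # ~1 KB per message (from the text)
CONNECTIONS_PER_SERVER = 50_000         # from the text
ASSUMED_PEAK_CONNECTIONS = 50_000_000   # assumption: 10% of 500M users online at once

SECONDS_PER_DAY = 24 * 60 * 60

# Average write rate (peak traffic would be higher).
avg_msgs_per_sec = MESSAGES_PER_DAY / SECONDS_PER_DAY

# Raw message storage, before replication or compression.
storage_tb_per_day = MESSAGES_PER_DAY * AVG_MESSAGE_BYTES / 10**12
storage_pb_per_year = storage_tb_per_day * 365 / 1000

# Ceiling division: servers needed to hold the assumed peak connections.
connection_servers = -(-ASSUMED_PEAK_CONNECTIONS // CONNECTIONS_PER_SERVER)

print(f"{avg_msgs_per_sec:,.0f} msg/sec on average")
print(f"{storage_tb_per_day:.0f} TB/day, {storage_pb_per_year:.1f} PB/year")
print(f"{connection_servers} connection servers")
```

With these inputs the storage figure is 100 TB per day, so a year of messages is on the order of tens of petabytes rather than 100 TB.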
Architecture Diagram
Component Deep Dive
Connection Manager
Maintains WebSocket/long-poll connections. Routes messages to correct connection. Load balanced.
Message Service
Receives messages, validates, stores, publishes to queue. Handles idempotency.
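One common way to get idempotent sends is to have the client attach its own unique key to each message and have the service deduplicate on it. The sketch below assumes that approach; `client_msg_id`, `StoredMessage`, and the in-memory dictionary are hypothetical stand-ins for whatever the real service and database would use.

```python
import itertools
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class StoredMessage:
    message_id: int
    chat_id: str
    sender_id: str
    content: str


class MessageService:
    """In-memory sketch: validate, deduplicate, then store a message."""

    def __init__(self) -> None:
        self._next_id = itertools.count(1)
        # (sender_id, client_msg_id) -> message already accepted for that key
        self._by_key: Dict[Tuple[str, str], StoredMessage] = {}

    def send(self, sender_id: str, chat_id: str,
             client_msg_id: str, content: str) -> StoredMessage:
        if not content:
            raise ValueError("message content must not be empty")
        key = (sender_id, client_msg_id)
        existing = self._by_key.get(key)
        if existing is not None:
            # A retry of a send that already succeeded: return the original
            # message instead of storing a duplicate.
            return existing
        message = StoredMessage(next(self._next_id), chat_id, sender_id, content)
        self._by_key[key] = message
        return message
```

A client that times out and retries with the same `client_msg_id` gets back the same `message_id`, so the recipient never sees the message twice.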
Message Queue
Kafka. Fan-out to online users' connection managers. Persists for offline delivery.
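The fan-out decision itself is small and can be sketched independently of Kafka. The function below is an in-memory illustration, not a Kafka client: `fan_out`, `inboxes`, and `pending` are hypothetical names, with `inboxes` standing in for live connections and `pending` for the persisted backlog of offline users.

```python
from collections import defaultdict
from typing import DefaultDict, Iterable, List, Set


def fan_out(message: str,
            members: Iterable[str],
            sender_id: str,
            online: Set[str],
            inboxes: DefaultDict[str, List[str]],
            pending: DefaultDict[str, List[str]]) -> List[str]:
    """Deliver to online members now; hold the message for offline members.

    Returns the user ids that received the message immediately.
    """
    delivered = []
    for user_id in members:
        if user_id == sender_id:
            continue  # the sender already has the message
        if user_id in online:
            inboxes[user_id].append(message)
            delivered.append(user_id)
        else:
            pending[user_id].append(message)
    return delivered


inboxes: DefaultDict[str, List[str]] = defaultdict(list)
pending: DefaultDict[str, List[str]] = defaultdict(list)
fan_out("hello", ["alice", "bob", "carol"], "alice", {"bob"}, inboxes, pending)
```

In this example bob is online and receives the message right away, while carol's copy waits in `pending` until she reconnects.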
Message Store
Cassandra/Scylla. Stores messages by chat_id, message_id. Supports range queries for history.
Presence Service
Tracks user online/offline status. Redis with heartbeat. Informs connection manager.
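Heartbeat-based presence amounts to "a user is online if a heartbeat arrived within the last N seconds". The sketch below models that rule in memory; `PresenceTracker` and the 30-second default are assumptions for illustration. With Redis, the same effect is typically achieved by writing a key with an expiry on every heartbeat, so that the key disappears when heartbeats stop.

```python
import time
from typing import Callable, Dict


class PresenceTracker:
    """In-memory stand-in for heartbeat keys that expire after a TTL."""

    def __init__(self, ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._last_seen: Dict[str, float] = {}

    def heartbeat(self, user_id: str) -> None:
        # Each heartbeat refreshes the user's expiry window.
        self._last_seen[user_id] = self._clock()

    def is_online(self, user_id: str) -> bool:
        last = self._last_seen.get(user_id)
        return last is not None and self._clock() - last < self._ttl
```

The TTL trades accuracy for load: a shorter TTL detects disconnects faster but requires more frequent heartbeats from every connected client.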
Media Store
Object store (S3) for attachments. Messages store URLs.
Sync Service
For offline users: on connect, fetches messages since last_seen. Handles conflict resolution.
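A simple way to implement "fetch messages since last_seen" is to treat `last_seen` as the id of the last message the client has, and return everything after it. The sketch assumes message ids within a chat increase over time and that the chat log is kept sorted by id; `messages_since` is a hypothetical helper name.

```python
import bisect
from typing import List, Tuple

Message = Tuple[int, str]  # (message_id, content)


def messages_since(chat_log: List[Message], last_seen_id: int) -> List[Message]:
    """Return messages with message_id greater than last_seen_id.

    chat_log must be sorted by message_id in ascending order.
    """
    ids = [message_id for message_id, _ in chat_log]
    start = bisect.bisect_right(ids, last_seen_id)
    return chat_log[start:]


log = [(1, "hi"), (2, "are you there?"), (5, "ping")]
missed = messages_since(log, last_seen_id=2)  # [(5, "ping")]
```

Because the cursor is a message id rather than a wall-clock timestamp, a reconnecting client does not depend on its clock agreeing with the server's.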
Database Design
- Messages: chat_id (partition key), message_id (clustering key), sender_id, content, created_at
- User_chats: user_id, chat_id, last_read
- Storage: Cassandra for messages; Redis for presence; MySQL for user metadata
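Partitioning by chat_id and ordering by message_id is what makes history reads cheap: one chat's messages sit together in sorted order, so a page of history is a single contiguous slice. The in-memory model below illustrates that access pattern only; `MessageStore` is hypothetical and is not Cassandra client code.

```python
import bisect
from collections import defaultdict
from typing import DefaultDict, List, Optional, Tuple

Row = Tuple[int, str, str]  # (message_id, sender_id, content)


class MessageStore:
    """One sorted list per chat_id, mimicking a partition ordered by message_id."""

    def __init__(self) -> None:
        self._partitions: DefaultDict[str, List[Row]] = defaultdict(list)

    def insert(self, chat_id: str, message_id: int,
               sender_id: str, content: str) -> None:
        # insort keeps the partition sorted even if writes arrive out of order.
        bisect.insort(self._partitions[chat_id], (message_id, sender_id, content))

    def history(self, chat_id: str, before: Optional[int] = None,
                limit: int = 50) -> List[Row]:
        """Return up to `limit` rows with message_id < before, newest first."""
        rows = self._partitions.get(chat_id, [])
        # (before,) sorts just ahead of any row whose message_id equals `before`.
        end = len(rows) if before is None else bisect.bisect_left(rows, (before,))
        start = max(0, end - limit)
        return rows[start:end][::-1]
```

Passing the oldest message_id from one page as `before` for the next call walks backwards through the chat without skipping or repeating rows.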
API Design
| Method | Path | Description |
|---|---|---|
| POST | /api/messages | Send message. Body: {chat_id, content, attachments?}. Returns message_id. |
| GET | /api/chats/{id}/messages?before=&limit= | Get message history. Paginated. |
| POST | /api/messages/{id}/read | Mark as read. Updates read receipt. |
| GET | /api/chats | List user's chats with last message. |
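Behind the mark-as-read endpoint, delivery status is naturally a one-way progression: sent, then delivered, then read. A minimal sketch of that rule, assuming the three states from the requirements and a hypothetical `advance` helper, treats status as an ordered enum that never moves backwards.

```python
from enum import IntEnum


class Status(IntEnum):
    SENT = 1
    DELIVERED = 2
    READ = 3


def advance(current: Status, incoming: Status) -> Status:
    """Apply a status update, ignoring any update that would move backwards."""
    return max(current, incoming)


status = Status.SENT
status = advance(status, Status.READ)       # READ
status = advance(status, Status.DELIVERED)  # still READ: the late update is ignored
```

Making the update monotonic means that receipts arriving late or out of order cannot downgrade a message from read back to delivered.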
Scalability & Trade-offs
- WebSocket vs long polling: WebSocket is efficient for real-time; long polling works everywhere, simpler fallback.
- Store-and-forward vs direct: Store ensures delivery when offline; direct is lower latency when both online.
- Consistency: Causal consistency for chat is usually enough; full linearizability is costly.
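One inexpensive way to get in-order, duplicate-free display without global coordination is to give each chat its own sequence numbers and let the receiving client hold back messages that arrive early. This is a sketch of that idea rather than a full causal-consistency protocol; `InOrderBuffer` and the per-chat `seq` starting at 1 are assumptions.

```python
from typing import Dict, List


class InOrderBuffer:
    """Releases messages for one chat strictly in sequence order."""

    def __init__(self) -> None:
        self._next_seq = 1
        self._held: Dict[int, str] = {}

    def receive(self, seq: int, payload: str) -> List[str]:
        """Accept one message; return the messages that are now displayable."""
        if seq < self._next_seq or seq in self._held:
            return []  # duplicate delivery: already shown or already buffered
        self._held[seq] = payload
        ready = []
        # Drain every message that is next in line.
        while self._next_seq in self._held:
            ready.append(self._held.pop(self._next_seq))
            self._next_seq += 1
        return ready
```

If message 2 arrives before message 1, it is held until 1 arrives and then both are released together; a redelivered message 1 is dropped.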