Web Crawler
Overview
Designing a Web Crawler (like Googlebot) tests your understanding of distributed crawling, politeness policies, URL deduplication, and the scale and unpredictability of the web. Search companies and content aggregators rely on crawlers, and the question reveals how you think about rate limiting, robots.txt, duplicate detection, and frontier management. It matters in interviews because it combines BFS-style graph traversal (URLs as nodes, links as edges) with distributed systems, and it requires careful handling of politeness (don't overwhelm servers), prioritization (crawl important pages first), and fault tolerance (billions of URLs, many of which will fail).
Requirements
Functional
- Fetch web pages given seed URLs
- Extract and follow links from pages
- Respect robots.txt and crawl-delay
- Deduplicate URLs (avoid crawling same page twice)
- Store raw HTML and extracted metadata
- Support prioritization (e.g., sitemap first)
Non-Functional
- Politeness — rate limit per domain
- Scalability — billions of pages
- Fault tolerance — handle timeouts, 404s, redirects
- Freshness — recrawl periodically
Capacity Estimation
Assume 1B pages to crawl at 1,000 pages/sec, so one full pass takes roughly 12 days. The frontier holds up to 1B URLs. A Bloom filter for dedup needs ~1GB. Storage at 100KB/page × 1B pages ≈ 100TB.
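A quick back-of-envelope check of the figures above (the inputs are the assumed numbers from this section, not measurements):

```python
# Capacity estimation sanity check using the assumed figures above.
pages = 1_000_000_000          # 1B pages
rate = 1_000                   # pages/sec across the whole fleet
seconds = pages / rate
days = seconds / 86_400        # ~11.6 days for one full pass

page_size = 100 * 1024         # 100 KB average page
storage_tib = pages * page_size / 1024**4   # ~93 TiB of raw HTML

# Bloom filter sized for 1B URLs at ~1% false positives needs ~9.6 bits/URL.
bloom_gib = pages * 9.6 / 8 / 1024**3       # ~1.1 GiB
print(round(days, 1), round(storage_tib), round(bloom_gib, 1))  # 11.6 93 1.1
```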
Architecture Diagram
Component Deep Dive
URL Frontier
Priority queue of URLs to crawl. Per-domain queues for politeness. Politeness delay between same-domain requests.
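A minimal sketch of that structure, assuming in-memory queues and a fixed politeness delay (a real frontier is persistent and sharded across workers):

```python
import heapq
import time
from collections import defaultdict, deque
from urllib.parse import urlparse

class Frontier:
    """Per-domain frontier sketch: one FIFO queue per domain, plus a heap of
    (next_allowed_time, domain) pairs that enforces a politeness delay
    between requests to the same domain. Illustrative only."""

    def __init__(self, delay_sec=1.0):
        self.delay = delay_sec
        self.queues = defaultdict(deque)   # domain -> pending URLs
        self.ready = []                    # heap of (next_allowed_ts, domain)

    def add(self, url):
        domain = urlparse(url).netloc
        if not self.queues[domain]:
            heapq.heappush(self.ready, (time.monotonic(), domain))
        self.queues[domain].append(url)

    def next_url(self):
        """Return a URL whose domain is past its politeness delay, else None."""
        while self.ready:
            ts, domain = self.ready[0]
            if ts > time.monotonic():
                return None                # nothing polite to crawl yet
            heapq.heappop(self.ready)
            q = self.queues[domain]
            if not q:
                continue
            url = q.popleft()
            if q:                          # reschedule the domain after the delay
                heapq.heappush(self.ready, (time.monotonic() + self.delay, domain))
            return url
        return None
```

With a nonzero delay, two URLs on the same domain are handed out at least `delay_sec` apart, while URLs on other domains interleave freely.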
URL Filter / Dedup
Bloom filter or distributed set. Checks if URL already crawled. Normalizes URLs (remove fragments, lowercase).
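Normalization matters because the same page is reachable under many spellings. A sketch of a few common canonicalization rules (real crawlers apply many more, e.g. sorting query parameters):

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url: str) -> str:
    """Canonicalize a URL before the dedup check: lowercase scheme and host,
    drop the fragment, strip default ports, trim a trailing slash."""
    scheme, netloc, path, query, _frag = urlsplit(url)
    scheme = scheme.lower()
    netloc = netloc.lower()
    if scheme == "http" and netloc.endswith(":80"):
        netloc = netloc[:-3]
    elif scheme == "https" and netloc.endswith(":443"):
        netloc = netloc[:-4]
    if path.endswith("/") and path != "/":
        path = path[:-1]
    return urlunsplit((scheme, netloc, path or "/", query, ""))
```

For example, `HTTP://Example.com:80/a/#top` and `http://example.com/a` normalize to the same string, so they count as one frontier entry.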
Fetcher
HTTP client pool. Fetches pages, handles redirects, timeouts. Respects robots.txt.
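The robots.txt check can be done with the standard-library parser. This sketch assumes the robots.txt body has already been fetched; a real fetcher retrieves it once per host and caches it (commonly for ~24 hours) before any page requests:

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against an already-fetched robots.txt body."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)
```

`RobotFileParser` also exposes `crawl_delay(user_agent)`, which the frontier can use to set the per-domain politeness delay instead of a global default.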
Parser
Extracts links, metadata, content. Outputs new URLs to frontier, content to storage.
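Link extraction can be sketched with the standard-library HTML parser; relative hrefs are resolved against the page's base URL before they go back to the frontier (a production parser would also pull the title, meta tags, and visible text):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute outlinks from one HTML page (stdlib-only sketch)."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            # Skip same-page fragments and javascript: pseudo-links.
            if href and not href.startswith(("#", "javascript:")):
                self.links.append(urljoin(self.base, href))
```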
URL Store
Persistent storage for crawled URLs and metadata. Used for dedup and recrawl scheduling.
Content Store
Stores raw HTML or extracted text. Distributed file store (S3, HDFS).
Database Design
URL metadata in Cassandra/DynamoDB: url (PK), status, last_crawled, next_crawl. Content in object store. Bloom filter in Redis for fast dedup check.
API Design
| Method | Path | Description |
|---|---|---|
| POST | /api/crawl | Submit seed URLs. Body: {urls[], priority?}. Returns job_id. |
| GET | /api/crawl/{job_id} | Get crawl job status and stats. |
| GET | /api/pages/{url_hash} | Retrieve stored page content (internal). |
Scalability & Trade-offs
- BFS vs DFS: BFS discovers broadly; DFS goes deep. For web, BFS with prioritization is common.
- Politeness vs speed: Per-domain rate limits reduce load on servers but slow overall crawl.
- Dedup: Bloom filter is space-efficient but has false positives; exact set is accurate but expensive.
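The Bloom filter trade-off can be shown with a toy implementation, assuming ~10 bits per URL for ~1% false positives (sketch-quality hashing; production systems use a tuned library or a Redis module):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter illustrating the dedup trade-off: a few bits per URL
    instead of storing every URL, at the cost of occasional false positives."""

    def __init__(self, size_bits: int, num_hashes: int = 7):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive k bit positions from slices of one SHA-256 digest.
        d = hashlib.sha256(item.encode()).digest()
        for i in range(self.k):
            yield int.from_bytes(d[4 * i:4 * i + 4], "big") % self.size

    def add(self, item: str):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str):
        # May return True for an item never added (false positive),
        # but never False for an item that was added.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```

A false positive means a URL is wrongly skipped as "already crawled", which is usually acceptable; the exact-set alternative avoids that at far higher memory cost.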
Related System Designs
- URL Shortener (TinyURL)
- Rate Limiter (Infrastructure)
- Key-Value Store (Storage)