Web Crawler

Medium Data Pipeline

Overview

Designing a Web Crawler (like Googlebot) tests your understanding of distributed crawling, politeness policies, URL deduplication, and handling the scale and unpredictability of the web. Search companies and content aggregators rely on crawlers, and the question reveals how you think about rate limiting, robots.txt, duplicate detection, and frontier management. This design matters in interviews because it combines BFS/DFS graph traversal with distributed systems—URLs as nodes, links as edges—and requires careful handling of politeness (don't overwhelm servers), prioritization (important pages first), and fault tolerance (billions of URLs, many failures).

Requirements

Functional

  • Fetch web pages given seed URLs
  • Extract and follow links from pages
  • Respect robots.txt and crawl-delay
  • Deduplicate URLs (avoid crawling the same page twice)
  • Store raw HTML and extracted metadata
  • Support prioritization (e.g., sitemap first)

Non-Functional

  • Politeness — rate limit per domain
  • Scalability — billions of pages
  • Fault tolerance — handle timeouts, 404s, redirects
  • Freshness — recrawl periodically

Capacity Estimation

Assume 1B pages to crawl at 1,000 pages/sec, so a full pass takes roughly 11.6 days. The frontier holds up to 1B URLs. Bloom filter for dedup (1B entries at ~1% false positives): ~1.2GB. Storage at 100KB/page: ~100TB of raw HTML.
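These numbers can be sanity-checked with quick arithmetic; the Bloom filter sizing assumes a ~1% false-positive rate, a common choice:

```python
import math

PAGES = 1_000_000_000          # 1B pages
RATE = 1_000                   # pages/sec
PAGE_SIZE = 100_000            # 100KB average page

crawl_days = PAGES / RATE / 86_400
storage_tb = PAGES * PAGE_SIZE / 1e12

# Bloom filter: bits per element = -ln(p) / ln(2)^2  (~9.6 at p = 1%)
bits = PAGES * (-math.log(0.01) / math.log(2) ** 2)
bloom_gb = bits / 8 / 1e9

print(f"{crawl_days:.1f} days, {storage_tb:.0f}TB, {bloom_gb:.1f}GB")
# -> 11.6 days, 100TB, 1.2GB
```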

Architecture Diagram

Seed URLs → URL Frontier → Fetcher Pool (with DNS Resolver) → HTML Parser. Parsed content goes to the Content Store; extracted links pass through the Bloom Filter and URL Store for dedup before re-entering the URL Frontier.

Component Deep Dive

URL Frontier

A priority queue of URLs to crawl, partitioned into per-domain queues so a politeness delay can be enforced between successive requests to the same domain.
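A minimal sketch of the per-domain queueing (single-process and in-memory; a production frontier would be sharded and persistent):

```python
import heapq
import time
from collections import deque
from urllib.parse import urlsplit

class Frontier:
    """Toy frontier: one FIFO queue per domain, plus a min-heap that
    releases a domain only after its politeness delay has elapsed."""

    def __init__(self, delay=1.0):
        self.delay = delay        # seconds between hits to the same domain
        self.queues = {}          # domain -> deque of URLs
        self.ready = []           # min-heap of (next_allowed_ts, domain)

    def add(self, url, now=None):
        now = time.time() if now is None else now
        domain = urlsplit(url).netloc
        if domain not in self.queues:
            self.queues[domain] = deque()
            heapq.heappush(self.ready, (now, domain))
        self.queues[domain].append(url)

    def next_url(self, now=None):
        """Return a URL whose domain is past its delay, or None."""
        now = time.time() if now is None else now
        if not self.ready or self.ready[0][0] > now:
            return None
        _, domain = heapq.heappop(self.ready)
        url = self.queues[domain].popleft()
        if self.queues[domain]:   # re-arm the domain after the delay
            heapq.heappush(self.ready, (now + self.delay, domain))
        else:
            del self.queues[domain]
        return url
```

The explicit `now` parameter is only there so the delay can be exercised with a fake clock instead of real sleeps.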

URL Filter / Dedup

A Bloom filter or distributed set that checks whether a URL was already crawled. URLs are normalized first: fragments removed, scheme and host lowercased, default ports and trailing slashes stripped.
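A sketch of the normalization step plus a toy Bloom filter (a Python int serves as the bit set here; a real deployment would use a properly sized bit array in Redis or similar):

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Canonicalize a URL before the dedup check: lowercase scheme and
    host (paths stay case-sensitive), drop fragments, default port,
    and trailing slash."""
    parts = urlsplit(url)
    netloc = parts.netloc.lower()
    if parts.scheme.lower() == "http" and netloc.endswith(":80"):
        netloc = netloc[:-3]
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), netloc, path, parts.query, ""))

class BloomFilter:
    """Minimal Bloom filter: k bit positions derived from one SHA-256."""

    def __init__(self, num_bits=1 << 20, num_hashes=7):
        self.m, self.k, self.bits = num_bits, num_hashes, 0

    def _positions(self, item):
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.k):
            # slice 4 bytes of the digest per hash function
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all(self.bits >> pos & 1 for pos in self._positions(item))
```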

Fetcher

A pool of HTTP clients that fetches pages, follows redirects, and enforces timeouts. Respects robots.txt rules, including crawl-delay.
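A sketch using the standard library's robots.txt parser; the user-agent string and the per-host cache policy are illustrative assumptions:

```python
import urllib.parse
import urllib.request
import urllib.robotparser

class Fetcher:
    """Caches robots.txt per host and checks can_fetch before requesting."""

    USER_AGENT = "example-crawler/0.1"   # hypothetical UA string

    def __init__(self, timeout=10):
        self.timeout = timeout
        self.robots = {}                 # host -> RobotFileParser

    def allowed(self, url):
        host = urllib.parse.urlsplit(url).netloc
        if host not in self.robots:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(f"http://{host}/robots.txt")
            try:
                rp.read()
            except OSError:
                rp.allow_all = True      # unreachable robots.txt: allow
            self.robots[host] = rp
        return self.robots[host].can_fetch(self.USER_AGENT, url)

    def fetch(self, url):
        if not self.allowed(url):
            return None
        req = urllib.request.Request(url, headers={"User-Agent": self.USER_AGENT})
        with urllib.request.urlopen(req, timeout=self.timeout) as resp:
            return resp.read()           # urlopen follows redirects itself
```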

Parser

Extracts links, metadata, content. Outputs new URLs to frontier, content to storage.
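Link extraction can be sketched with the standard library's HTML parser; relative hrefs are resolved against the page URL before being queued:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects absolute hrefs from <a> tags; in the pipeline, these go
    back to the frontier while page content goes to the content store."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # resolve relative links against the page URL
                self.links.append(urljoin(self.base_url, href))

p = LinkExtractor("http://example.com/dir/page.html")
p.feed('<a href="/about">About</a> <a href="next.html">Next</a>')
print(p.links)
# -> ['http://example.com/about', 'http://example.com/dir/next.html']
```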

URL Store

Persistent storage for crawled URLs and metadata. Used for dedup and recrawl scheduling.

Content Store

Stores raw HTML or extracted text. Distributed file store (S3, HDFS).

Database Design

URL metadata in Cassandra/DynamoDB: url (PK), status, last_crawled, next_crawl. Content in object store. Bloom filter in Redis for fast dedup check.
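The row shape and recrawl scheduling can be sketched as follows; field names come from the schema above, while the 7-day interval is an assumed default:

```python
import time

def make_record(url, status, crawl_interval=7 * 86_400):
    """Build one URL-metadata row; next_crawl drives recrawl scheduling."""
    now = int(time.time())
    return {
        "url": url,                 # partition key
        "status": status,           # e.g. 200, 404, "redirect"
        "last_crawled": now,
        "next_crawl": now + crawl_interval,
    }

row = make_record("http://example.com/", 200)
```

A scheduler would periodically scan for rows with `next_crawl` in the past and push those URLs back into the frontier.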

API Design

Method  Path                    Description
POST    /api/crawl              Submit seed URLs. Body: {urls[], priority?}. Returns job_id.
GET     /api/crawl/{job_id}     Get crawl job status and stats.
GET     /api/pages/{url_hash}   Retrieve stored page content (internal).

Scalability & Trade-offs
