Case Study 3: Social Media Feed (Twitter)
A news feed for a social network with posts, likes, retweets, and follows.
Step 1: Requirements gathering
Functional requirements
- Post tweets: Text (280 chars), images, videos.
- Follow/Unfollow: Users can follow and unfollow other users.
- News feed: Shows tweets from the people the user follows.
- Like/Retweet: Interactions with tweets.
- Search: Find tweets and users.
- Trending: Trending hashtags.
Non‑functional requirements
- Low latency: Feed load < 200ms.
- High availability: 99.9% uptime.
- Scalability: Hundreds of millions of users and hundreds of millions of tweets per day.
- Consistency: The feed should be relatively fresh (eventual consistency is acceptable).
Scale estimation
- Users: 500 million DAU.
- Tweets per day: 500 million.
- Follows per user: 200 on average.
- Read/Write ratio: ~100:1 (reads far outnumber writes).
- Celebrities: A few users have more than 100M followers.
Step 2: Estimation
Traffic estimates
- Tweet writes: 500M / 86400 ≈ 6,000 RPS (average).
- Feed reads: 500M DAU * 10 feeds/day / 86400 ≈ 60,000 RPS.
- Peak RPS: ~5x average → 30,000 write, 300,000 read RPS.
Storage estimates
- Tweet size: 1 KB (text + metadata).
- Daily storage: 500M * 1 KB ≈ 500 GB.
- Media storage: 500M * 100 KB ≈ 50 TB per day.
- 5 years: 500 GB * 365 * 5 ≈ 900 TB (text), 50 TB * 365 * 5 ≈ 90 PB (media).
Bandwidth estimates
- Upload: 6,000 RPS * 1 KB ≈ 6 MB/s (text).
- Download: 60,000 RPS * 100 tweets * 1 KB ≈ 6 GB/s.
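The estimates above can be reproduced with a few lines of arithmetic. The sketch below is only a calculator over the inputs stated in this section (DAU, tweets per day, 10 feed loads per user per day, a 5x peak factor, 1 KB per tweet, 100 KB of media per tweet, 100 tweets per feed load). The constant and function names are illustrative, and decimal units (1 GB = 10^6 KB) are an assumption.

```python
# Back-of-envelope capacity calculator. All inputs come from the text above;
# the names and the use of decimal units are assumptions made for this sketch.
SECONDS_PER_DAY = 86_400
DAU = 500_000_000
TWEETS_PER_DAY = 500_000_000
FEED_LOADS_PER_USER_PER_DAY = 10
PEAK_FACTOR = 5
TWEET_SIZE_KB = 1
MEDIA_SIZE_KB = 100
TWEETS_PER_FEED_LOAD = 100
RETENTION_YEARS = 5


def estimate() -> dict:
    """Return unrounded traffic, storage, and bandwidth estimates."""
    write_rps = TWEETS_PER_DAY / SECONDS_PER_DAY
    read_rps = DAU * FEED_LOADS_PER_USER_PER_DAY / SECONDS_PER_DAY
    text_gb_per_day = TWEETS_PER_DAY * TWEET_SIZE_KB / 1e6    # KB -> GB
    media_tb_per_day = TWEETS_PER_DAY * MEDIA_SIZE_KB / 1e9   # KB -> TB
    days = 365 * RETENTION_YEARS
    return {
        "write_rps": write_rps,
        "read_rps": read_rps,
        "peak_write_rps": write_rps * PEAK_FACTOR,
        "peak_read_rps": read_rps * PEAK_FACTOR,
        "text_gb_per_day": text_gb_per_day,
        "media_tb_per_day": media_tb_per_day,
        "text_tb_total": text_gb_per_day * days / 1e3,      # GB -> TB
        "media_pb_total": media_tb_per_day * days / 1e3,    # TB -> PB
        "upload_mb_per_s": write_rps * TWEET_SIZE_KB / 1e3,  # KB/s -> MB/s
        "download_gb_per_s": read_rps * TWEETS_PER_FEED_LOAD * TWEET_SIZE_KB / 1e6,
    }


if __name__ == "__main__":
    for name, value in estimate().items():
        print(f"{name}: {value:,.1f}")
```

The exact values land slightly below the rounded figures used in the text (for example about 5,787 write RPS rather than 6,000), which is the usual rounding applied in back-of-envelope work.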
Step 3: High-level design
Main components
┌──────────┐     ┌─────────────┐     ┌──────────────┐
│  Client  │ ──→ │    Load     │ ──→ │    Tweet     │
│          │     │  Balancer   │     │   Service    │
└──────────┘     └─────────────┘     └──────────────┘
                                            │
              ┌─────────────────────────────┼─────────────────────────────┐
              ▼                             ▼                             ▼
       ┌──────────────┐              ┌──────────────┐              ┌──────────────┐
       │    Search    │              │   Database   │              │    Cache     │
       │  (Elastic)   │              │ (Cassandra)  │              │   (Redis)    │
       └──────────────┘              └──────────────┘              └──────────────┘
                                            │
                                            ▼
                                     ┌──────────────┐
                                     │     Feed     │
                                     │  Generator   │
                                     └──────────────┘
Technology selection
- Load Balancer: AWS ELB or NGINX.
- App Services: Microservices (Tweet, User, Feed, Search).
- Database: Cassandra for tweets, MySQL for user data.
- Cache: Redis for feeds, trending, and counts.
- Search: Elasticsearch for full-text search.
- Media: S3 + CDN.
Step 4: Detailed design
Database Schema
Table: tweets
| Column | Type | Description |
|---|---|---|
| tweet_id | BIGINT | Primary key (Snowflake ID) |
| user_id | BIGINT | Author ID (indexed) |
| content | TEXT | Tweet text |
| media_urls | JSON | Optional media links |
| created_at | TIMESTAMP | Time (clustering key) |
| like_count | INT | Denormalized count |
| retweet_count | INT | Denormalized count |
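The `tweet_id` column is described as a Snowflake ID, meaning a 64-bit integer that sorts roughly by creation time. The sketch below is a minimal single-process generator assuming the commonly cited layout of 41 bits of millisecond timestamp, 10 bits of worker ID, and 12 bits of per-millisecond sequence. The class name, the epoch constant, and the choice to raise an error when the clock moves backwards are assumptions for illustration, not part of the original design.

```python
import threading
import time

# Assumed layout: 41-bit timestamp | 10-bit worker ID | 12-bit sequence.
WORKER_BITS = 10
SEQUENCE_BITS = 12
SEQUENCE_MASK = (1 << SEQUENCE_BITS) - 1
CUSTOM_EPOCH_MS = 1_288_834_974_657  # assumed custom epoch (Nov 2010)


class SnowflakeGenerator:
    """Generates time-ordered 64-bit IDs that are unique per worker."""

    def __init__(self, worker_id: int, epoch_ms: int = CUSTOM_EPOCH_MS):
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id does not fit in 10 bits")
        self._worker_id = worker_id
        self._epoch_ms = epoch_ms
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            now = int(time.time() * 1000)
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & SEQUENCE_MASK
                if self._sequence == 0:
                    # Sequence exhausted for this millisecond: wait for the next.
                    while now <= self._last_ms:
                        now = int(time.time() * 1000)
            else:
                self._sequence = 0
            self._last_ms = now
            timestamp_part = (now - self._epoch_ms) << (WORKER_BITS + SEQUENCE_BITS)
            return timestamp_part | (self._worker_id << SEQUENCE_BITS) | self._sequence


def created_at_ms(snowflake_id: int, epoch_ms: int = CUSTOM_EPOCH_MS) -> int:
    """Recover the creation time (Unix ms) embedded in an ID."""
    return (snowflake_id >> (WORKER_BITS + SEQUENCE_BITS)) + epoch_ms
```

Because the timestamp occupies the high bits, sorting by `tweet_id` approximates sorting by `created_at`, which is convenient when tweet IDs are used as feed entries or pagination cursors.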
Table: follows
| Column | Type | Description |
|---|---|---|
| follower_id | BIGINT | User who follows |
| followee_id | BIGINT | User being followed |
| created_at | TIMESTAMP | Follow time |
Primary key: (follower_id, followee_id)
Table: user_feed (Redis)
| Key | Type | Description |
|---|---|---|
| feed:{user_id} | Sorted Set | Tweet IDs with score = timestamp |
Feed Generation Strategies
1. Pull Model (Lazy)
- When a user requests the feed, query tweets from all of their followees.
- Pros: Simple, always up to date.
- Cons: Slow for users who follow many people.
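-- Conceptual SQL only: Cassandra does not support JOINs, so a real
-- implementation would fetch the followee list first, then query per followee.
SELECT t.* FROM tweets t
JOIN follows f ON t.user_id = f.followee_id
WHERE f.follower_id = :user_id
ORDER BY t.created_at DESC
LIMIT 100;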
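At the application level, the pull model amounts to a k-way merge of the followees' timelines at read time. The sketch below assumes hypothetical in-memory stand-ins for the `follows` and `tweets` tables: `follows` maps a user to the IDs they follow, and `timelines` maps an author to a list of `(created_at, tweet_id)` tuples already sorted newest first.

```python
import heapq
from itertools import islice


def pull_feed(user_id, follows, timelines, limit=100):
    """Build a feed at read time by merging every followee's timeline.

    follows:   dict user_id -> iterable of followee IDs (hypothetical store)
    timelines: dict author_id -> list of (created_at, tweet_id), newest first
    """
    followee_timelines = [
        timelines.get(followee_id, []) for followee_id in follows.get(user_id, ())
    ]
    # heapq.merge with reverse=True expects each input sorted in descending
    # order and lazily yields one merged descending stream.
    merged = heapq.merge(*followee_timelines, reverse=True)
    return [tweet_id for _, tweet_id in islice(merged, limit)]


follows = {1: [2, 3]}
timelines = {
    2: [(30, 203), (10, 201)],
    3: [(20, 302)],
}
print(pull_feed(1, follows, timelines))  # [203, 302, 201]
```

The read cost grows with the number of followees, since each one contributes a timeline lookup, which is why this model is slow for users who follow many accounts.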
2. Push Model (Pre-computed)
- When a user posts a tweet, push it into the feed cache of every follower.
- Pros: Feed loads are very fast.
- Cons: High storage cost, write amplification for celebrities.
On tweet:
    for each follower_id in followers[user_id]:
        # ZADD takes the score (timestamp) before the member (tweet_id)
        redis.zadd("feed:" + follower_id, timestamp, tweet_id)
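The fan-out-on-write loop above can be sketched without a running Redis server. `FeedCache` below is a hypothetical in-memory stand-in that mimics one sorted set per user (member = tweet ID, score = timestamp) and trims each feed to a fixed length; the cap of 50 entries and all names are assumptions for illustration.

```python
from collections import defaultdict

MAX_FEED_LEN = 50  # assumed cap on cached entries per user


class FeedCache:
    """In-memory stand-in for one Redis sorted set per user (feed:{user_id})."""

    def __init__(self, max_len=MAX_FEED_LEN):
        self._feeds = defaultdict(dict)  # user_id -> {tweet_id: timestamp}
        self._max_len = max_len

    def add(self, user_id, tweet_id, timestamp):
        feed = self._feeds[user_id]
        feed[tweet_id] = timestamp          # member -> score, like ZADD
        if len(feed) > self._max_len:       # drop the oldest entry when full
            oldest = min(feed, key=feed.get)
            del feed[oldest]

    def latest(self, user_id, limit=20):
        """Return up to `limit` tweet IDs, newest first."""
        feed = self._feeds.get(user_id, {})
        ranked = sorted(feed.items(), key=lambda item: item[1], reverse=True)
        return [tweet_id for tweet_id, _ in ranked[:limit]]


def fan_out(cache, followers, author_id, tweet_id, timestamp):
    """Push a new tweet to every follower's feed; returns the write count."""
    writes = 0
    for follower_id in followers.get(author_id, ()):
        cache.add(follower_id, tweet_id, timestamp)
        writes += 1
    return writes
```

The return value of `fan_out` makes the write amplification visible: one tweet costs as many cache writes as the author has followers.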
3. Hybrid Model (Recommended)
- Normal users: Push model.
- Celebrities (> 1M followers): Pull model.
- Feed = Merge(precomputed_feed, celebrity_tweets).
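A minimal sketch of the hybrid read path follows, assuming the precomputed feed and each celebrity timeline are lists of `(timestamp, tweet_id)` tuples sorted newest first. The 1M-follower threshold comes from the text; the function names and the de-duplication step are assumptions for illustration.

```python
import heapq

CELEBRITY_THRESHOLD = 1_000_000  # follower count above which tweets are pulled


def is_celebrity(follower_count, threshold=CELEBRITY_THRESHOLD):
    """Decide at write time whether to skip fan-out for this author."""
    return follower_count > threshold


def hybrid_feed(precomputed, celebrity_timelines, limit=20):
    """Merge the pushed feed with pulled celebrity timelines, newest first.

    precomputed:         list of (timestamp, tweet_id), newest first
    celebrity_timelines: list of such lists, one per followed celebrity
    """
    merged = heapq.merge(precomputed, *celebrity_timelines, reverse=True)
    feed, seen = [], set()
    for _, tweet_id in merged:
        if len(feed) >= limit:
            break
        if tweet_id in seen:  # guard against a tweet appearing in both sources
            continue
        seen.add(tweet_id)
        feed.append(tweet_id)
    return feed


pushed = [(50, 5), (10, 1)]
pulled = [[(40, 4)], [(60, 6), (20, 2)]]
print(hybrid_feed(pushed, pulled, limit=4))  # [6, 5, 4, 2]
```

The read cost here is one cache lookup plus one timeline lookup per followed celebrity, which stays small as long as a user follows only a handful of very large accounts.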
API Design
POST /api/v1/tweets
{
"content": "Hello, world!",
"media_urls": ["https://..."]
}
GET /api/v1/feed?cursor=xxx&limit=20
Response:
{
"tweets": [
{
"tweet_id": 123456,
"user": { "id": 1, "name": "User" },
"content": "Hello!",
"created_at": "2024-01-01T12:00:00Z",
"like_count": 100,
"retweet_count": 50
}
],
"next_cursor": "yyy"
}
POST /api/v1/tweets/{id}/like
POST /api/v1/tweets/{id}/retweet
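The `cursor` and `next_cursor` fields of `GET /api/v1/feed` can be implemented as an opaque token that encodes the position of the last item returned. The sketch below assumes the cursor is a URL-safe base64 encoding of `[created_at, tweet_id]`; that encoding, the function names, and the in-memory `feed` list are illustrative assumptions rather than the API's defined behaviour.

```python
import base64
import json


def encode_cursor(created_at, tweet_id):
    """Pack the last-seen position into an opaque URL-safe token."""
    raw = json.dumps([created_at, tweet_id]).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_cursor(cursor):
    created_at, tweet_id = json.loads(base64.urlsafe_b64decode(cursor.encode("ascii")))
    return created_at, tweet_id


def get_feed_page(feed, cursor=None, limit=20):
    """Return one page of a feed.

    feed: list of (created_at, tweet_id) tuples sorted newest first.
    """
    if cursor is not None:
        boundary = decode_cursor(cursor)
        # Keep only items strictly older than the last item already served.
        feed = [item for item in feed if item < boundary]
    page = feed[:limit]
    has_more = len(feed) > limit
    next_cursor = encode_cursor(*page[-1]) if has_more else None
    return {"tweets": [tweet_id for _, tweet_id in page], "next_cursor": next_cursor}
```

Including the tweet ID as a tie-breaker keeps pages stable when several tweets share the same timestamp, and unlike offset-based paging the result does not shift when new tweets arrive at the head of the feed.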
Data Flow
Post Tweet:
- Client POST /tweets.
- Validate, store in Cassandra.
- Lookup followers (from cache).
- If normal user: Push to followers’ feed cache.
- If celebrity: Store only, pull on read.
- Update search index (async via Kafka).
- Return tweet.
Load Feed:
- Client GET /feed.
- Load precomputed feed from Redis (50 tweets).
- Fetch recent tweets from celebrities (pull).
- Merge & sort by timestamp.
- Return feed.
- Cache result.
Step 5: Bottlenecks & optimization
Single Point of Failure
- Database: Cassandra replication factor 3.
- Cache: Redis Cluster (built-in replica failover), or primary/replica Redis with Sentinel.
- Feed Generator: Multiple instances.
Scalability Bottlenecks
- Celebrity fan-out: Justin Bieber (100M followers) → 100M writes!
- Solution: Hybrid model, celebrity tweets pulled on read.
- Feed cache memory: 500M users * 50 tweets * 8 bytes ≈ 200 GB.
- Solution: Only cache active users, LRU eviction.
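The memory estimate and the LRU idea above can be sketched as follows. `feed_cache_bytes` reproduces the raw-ID arithmetic from the text (it ignores per-entry overhead of the actual data structure), and `LRUFeedCache` is a hypothetical in-process illustration of evicting the least recently active user's feed. In a real deployment the cache server's own eviction policy would typically do this job.

```python
from collections import OrderedDict


def feed_cache_bytes(users, tweets_per_feed, bytes_per_id=8):
    """Raw payload size of the feed cache, ignoring data-structure overhead."""
    return users * tweets_per_feed * bytes_per_id


class LRUFeedCache:
    """Keeps feeds only for the most recently active users."""

    def __init__(self, max_users):
        self._max_users = max_users
        self._feeds = OrderedDict()  # user_id -> list of tweet IDs

    def get(self, user_id):
        feed = self._feeds.get(user_id)
        if feed is not None:
            self._feeds.move_to_end(user_id)  # mark as recently used
        return feed

    def put(self, user_id, tweet_ids):
        self._feeds[user_id] = list(tweet_ids)
        self._feeds.move_to_end(user_id)
        while len(self._feeds) > self._max_users:
            self._feeds.popitem(last=False)  # evict least recently used


print(feed_cache_bytes(500_000_000, 50) / 1e9)  # 200.0 (GB)
```

On a cache miss for an evicted user, the feed would be rebuilt on demand with the pull path, so eviction trades memory for a slower first load after inactivity.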
Performance Optimization
- Pagination: Cursor-based pagination (using the timestamp).
- Denormalization: Store like_count and retweet_count in the tweet row.
- Async updates: Like/retweet counts are updated asynchronously via Kafka.
- CDN: Cache media files, static content.
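The asynchronous count update in the list above can be illustrated by aggregating like events in batches before touching the denormalized `like_count`. In the sketch below a `deque` stands in for the Kafka topic and a `Counter` stands in for the `tweets` table; both, along with the class and method names, are assumptions made so the example runs without external services.

```python
from collections import Counter, deque


class LikeCountAggregator:
    """Buffers like events and applies them to counts in batches."""

    def __init__(self):
        self._events = deque()         # stand-in for a Kafka topic
        self.like_counts = Counter()   # stand-in for tweets.like_count

    def publish_like(self, tweet_id):
        """Called on the request path: enqueue only, no database write."""
        self._events.append(tweet_id)

    def flush(self, max_batch=1000):
        """Consume up to max_batch events; returns the number of row updates."""
        batch = Counter()
        for _ in range(min(max_batch, len(self._events))):
            batch[self._events.popleft()] += 1
        for tweet_id, delta in batch.items():
            # One write per tweet in the batch instead of one write per like.
            self.like_counts[tweet_id] += delta
        return len(batch)
```

Between a like and the next flush the stored count is stale, which is the eventual-consistency cost of taking the counter update off the request path.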
Step 6: Trade-offs
Consistency vs Availability
- AP system: Eventual consistency for the feed and counts.
- Like counts may lag by a few seconds.
- The feed may miss new tweets for a few seconds.
Latency vs Throughput
- Low latency: Precomputed feed (push model).
- High throughput: Batch feed updates every 5-10s.
Push vs Pull Trade-off
| Model | Write Cost | Read Cost | Best For |
|---|---|---|---|
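| Pull | O(1) | O(M) | Celebrity authors |
| Push | O(N) | O(1) | Normal users |
| Hybrid | O(N) for normal authors, O(1) for celebrities | O(1) + O(C) | Mixed |
N = the author's followers, M = the reader's followees, C = celebrities the reader follows (C ≤ M)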
Cost vs Performance
- Managed Cassandra (DataStax): Expensive, but auto-scaling.
- Self-hosted: Cheaper, but requires ops expertise.
Conclusion
A social media feed is a classic read-heavy system design problem, with these challenges:
- Feed generation strategy: Push vs Pull vs Hybrid.
- Celebrity problem: Fan-out amplification.
- Real-time requirements: The feed must stay fresh.
- Scale: Hundreds of millions of users and hundreds of millions of tweets per day.