Case Study 3: Social Media Feed (Twitter)

A news feed for a social network with posts, likes, retweets, and follows.


Step 1: Requirements gathering

Functional requirements

  • Post tweets: Text (up to 280 characters), images, videos.
  • Follow/Unfollow: Users can follow and unfollow other users.
  • News feed: Shows tweets from the people the user follows.
  • Like/Retweet: Interactions with tweets.
  • Search: Find tweets and users.
  • Trending: Trending hashtags.

Non‑functional requirements

  • Low latency: Feed loads in < 200 ms.
  • High availability: 99.9% uptime.
  • Scalability: Hundreds of millions of users and hundreds of millions of tweets per day.
  • Consistency: The feed should be relatively fresh; eventual consistency is acceptable.

Scale estimation

  • Users: 500 million DAU.
  • Tweets per day: 500 million.
  • Follows per user: 200 on average.
  • Read/Write ratio: ~100:1 (far more reads than writes).
  • Celebrities: A few users have > 100M followers.

Step 2: Estimation

Traffic estimates

  • Tweet writes: 500M / 86400 ≈ 6,000 RPS (average).
  • Feed reads: 500M DAU * 10 feeds/day / 86400 ≈ 60,000 RPS.
  • Peak RPS: ~5x average → 30,000 write, 300,000 read RPS.

Storage estimates

  • Tweet size: 1 KB (text + metadata).
  • Daily storage: 500M * 1 KB ≈ 500 GB.
  • Media storage: 500M * 100 KB ≈ 50 TB per day.
  • 5 years (1,825 days): 500 GB * 1,825 ≈ 900 TB (text), 50 TB * 1,825 ≈ 90 PB (media).

Bandwidth estimates

  • Upload: 6,000 RPS * 1 KB ≈ 6 MB/s (text).
  • Download: 60,000 RPS * 100 tweets * 1 KB ≈ 6 GB/s.
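
The arithmetic in this step can be checked with a few lines of Python. This is only a back-of-envelope sketch: the variable names are illustrative, 1 KB is taken as 1,000 bytes, and a year as 365 days.

```python
SECONDS_PER_DAY = 86_400

# Inputs taken from the scale estimation above.
dau = 500_000_000
tweets_per_day = 500_000_000
feeds_per_user_per_day = 10
tweets_per_feed = 100
tweet_bytes = 1_000        # 1 KB of text + metadata
media_bytes = 100_000      # 100 KB of media per tweet
peak_factor = 5
days_in_5_years = 365 * 5

# Traffic
write_rps = tweets_per_day / SECONDS_PER_DAY
read_rps = dau * feeds_per_user_per_day / SECONDS_PER_DAY

# Storage
daily_text_bytes = tweets_per_day * tweet_bytes
daily_media_bytes = tweets_per_day * media_bytes
five_year_text_bytes = daily_text_bytes * days_in_5_years
five_year_media_bytes = daily_media_bytes * days_in_5_years

# Bandwidth
upload_bytes_per_s = write_rps * tweet_bytes
download_bytes_per_s = read_rps * tweets_per_feed * tweet_bytes

print(f"write RPS: {write_rps:,.0f} avg, {write_rps * peak_factor:,.0f} peak")
print(f"read RPS:  {read_rps:,.0f} avg, {read_rps * peak_factor:,.0f} peak")
print(f"text:  {daily_text_bytes / 1e9:,.0f} GB/day, {five_year_text_bytes / 1e12:,.1f} TB in 5 years")
print(f"media: {daily_media_bytes / 1e12:,.0f} TB/day, {five_year_media_bytes / 1e15:,.2f} PB in 5 years")
print(f"upload: {upload_bytes_per_s / 1e6:,.1f} MB/s, download: {download_bytes_per_s / 1e9:,.1f} GB/s")
```

The exact averages come out slightly below the rounded figures in the text (about 5,800 write RPS and 58,000 read RPS), which is within normal tolerance for capacity planning.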

Step 3: High‑level design

Main components

┌──────────┐     ┌─────────────┐     ┌──────────────┐
│  Client  │ ──→ │ Load        │ ──→ │ Tweet        │
│          │     │ Balancer    │     │ Service      │
└──────────┘     └─────────────┘     └──────────────┘
                                            │
              ┌─────────────────────────────┼─────────────────────────────┐
              ▼                             ▼                             ▼
       ┌──────────────┐            ┌──────────────┐            ┌──────────────┐
       │    Search    │            │   Database   │            │    Cache     │
       │  (Elastic)   │            │ (Cassandra)  │            │   (Redis)    │
       └──────────────┘            └──────────────┘            └──────────────┘
                                            │
                                            ▼
                                   ┌──────────────┐
                                   │  Feed        │
                                   │  Generator   │
                                   └──────────────┘

Technology selection

  • Load Balancer: AWS ELB or NGINX.
  • App Services: Microservices (Tweet, User, Feed, Search).
  • Database: Cassandra for tweets, MySQL for user data.
  • Cache: Redis for feeds, trending, counts.
  • Search: Elasticsearch for full-text search.
  • Media: S3 + CDN.

Step 4: Detailed design

Database Schema

Table: tweets

Column          Type        Description
tweet_id        BIGINT      Primary key (Snowflake ID)
user_id         BIGINT      Author ID (indexed)
content         TEXT        Tweet text
media_urls      JSON        Optional media links
created_at      TIMESTAMP   Time (clustering key)
like_count      INT         Denormalized count
retweet_count   INT         Denormalized count
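
The schema only states that tweet_id is a Snowflake ID. As a rough illustration of why such IDs sort by creation time, here is a minimal sketch that assumes the commonly described Snowflake layout (41-bit millisecond timestamp, 10-bit worker ID, 12-bit per-millisecond sequence). The class name, the `worker_id` parameter, and the helper `snowflake_timestamp_ms` are hypothetical, not part of the design above.

```python
import threading
import time

TWITTER_EPOCH_MS = 1288834974657  # custom epoch of Twitter's original Snowflake


class SnowflakeGenerator:
    """Generates 64-bit IDs: 41-bit ms timestamp | 10-bit worker | 12-bit sequence."""

    def __init__(self, worker_id: int, epoch_ms: int = TWITTER_EPOCH_MS):
        if not 0 <= worker_id < 1024:
            raise ValueError("worker_id must fit in 10 bits")
        self.worker_id = worker_id
        self.epoch_ms = epoch_ms
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    @staticmethod
    def _now_ms() -> int:
        return time.time_ns() // 1_000_000

    def next_id(self) -> int:
        with self._lock:
            now_ms = self._now_ms()
            if now_ms < self._last_ms:
                # Clock went backwards: reuse the last timestamp to stay monotonic.
                now_ms = self._last_ms
            if now_ms == self._last_ms:
                self._sequence = (self._sequence + 1) & 0xFFF
                if self._sequence == 0:
                    # 4096 IDs already issued in this millisecond: wait for the next one.
                    while now_ms <= self._last_ms:
                        now_ms = self._now_ms()
            else:
                self._sequence = 0
            self._last_ms = now_ms
            return ((now_ms - self.epoch_ms) << 22) | (self.worker_id << 12) | self._sequence


def snowflake_timestamp_ms(snowflake_id: int, epoch_ms: int = TWITTER_EPOCH_MS) -> int:
    """Recovers the creation time (Unix ms) embedded in an ID."""
    return (snowflake_id >> 22) + epoch_ms


generator = SnowflakeGenerator(worker_id=7)
sample_ids = [generator.next_id() for _ in range(10_000)]
```

Because the timestamp occupies the high bits, sorting by tweet_id approximates sorting by created_at, which is convenient for feed ordering and cursors.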

Table: follows

Column        Type        Description
follower_id   BIGINT      User who follows
followee_id   BIGINT      User being followed
created_at    TIMESTAMP   Follow time
Primary Key: (follower_id, followee_id)

Table: user_feed (Redis)

Key              Type         Description
feed:{user_id}   Sorted Set   Tweet IDs with score = timestamp

Feed Generation Strategies

1. Pull Model (Lazy)

  • When a user requests the feed, query tweets from all of their followees.
  • Pros: Simple, real-time.
  • Cons: Slow for users who follow many people.
SELECT t.* FROM tweets t
JOIN follows f ON t.user_id = f.followee_id
WHERE f.follower_id = :user_id
ORDER BY t.created_at DESC
LIMIT 100

2. Push Model (Pre-computed)

  • When a user posts a tweet, push it into the feed cache of every follower.
  • Pros: Feed loads are very fast.
  • Cons: Uses more storage; write amplification for celebrities.
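On tweet (Python, redis-py style client):
def on_tweet(redis, followers, user_id, tweet_id, timestamp):
    # Fan-out on write: add the tweet ID to each follower's feed sorted set.
    for follower_id in followers[user_id]:
        # redis-py's zadd takes a {member: score} mapping.
        redis.zadd(f"feed:{follower_id}", {tweet_id: timestamp})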

3. Hybrid Model

  • Normal users: Push model.
  • Celebrities (> 1M followers): Pull model.
  • Feed = Merge(precomputed_feed, celebrity_tweets).
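
The three rules above (push for normal users, pull for celebrities, merge on read) can be sketched in memory as follows. This is an illustration under stated assumptions, not the production design: the `HybridFeed` class and its method names are hypothetical, plain dicts stand in for Cassandra and Redis, and the celebrity threshold is lowered in the demo so the behavior is visible.

```python
import heapq
from collections import defaultdict


class HybridFeed:
    def __init__(self, celebrity_threshold=1_000_000, feed_len=50):
        self.celebrity_threshold = celebrity_threshold
        self.feed_len = feed_len
        self.followers = defaultdict(set)          # author -> users following them
        self.followees = defaultdict(set)          # user -> authors they follow
        self.tweets_by_author = defaultdict(list)  # author -> [(ts, tweet_id)], newest first
        self.feed_cache = defaultdict(list)        # user -> [(ts, tweet_id)], newest first

    def follow(self, follower, followee):
        self.followers[followee].add(follower)
        self.followees[follower].add(followee)

    def is_celebrity(self, user):
        return len(self.followers[user]) > self.celebrity_threshold

    @staticmethod
    def _insert(entries, entry):
        entries.append(entry)
        entries.sort(reverse=True)  # keep newest first

    def post(self, author, tweet_id, ts):
        entry = (ts, tweet_id)
        self._insert(self.tweets_by_author[author], entry)
        if self.is_celebrity(author):
            return  # pull model: store only, no fan-out
        for follower in self.followers[author]:
            feed = self.feed_cache[follower]
            self._insert(feed, entry)
            del feed[self.feed_len:]  # cap the precomputed feed

    def load(self, user, limit=20):
        # Merge the precomputed feed with tweets pulled from followed celebrities.
        sources = [self.feed_cache[user]]
        sources += [self.tweets_by_author[a]
                    for a in self.followees[user] if self.is_celebrity(a)]
        feed, seen = [], set()
        for _, tweet_id in heapq.merge(*sources, reverse=True):
            if tweet_id in seen:
                continue  # e.g. pushed before the author crossed the threshold
            seen.add(tweet_id)
            feed.append(tweet_id)
            if len(feed) == limit:
                break
        return feed


demo = HybridFeed(celebrity_threshold=2)
for fan in ("alice", "bob", "carol"):
    demo.follow(fan, "star")       # "star" has 3 followers, above the demo threshold
demo.follow("alice", "dave")       # "dave" is a normal user
demo.post("dave", tweet_id=1, ts=100)
demo.post("star", tweet_id=2, ts=200)
demo.post("dave", tweet_id=3, ts=300)
alice_feed = demo.load("alice")
bob_feed = demo.load("bob")
print(alice_feed, bob_feed)
```

In the demo, the celebrity tweet never enters any feed cache; it only appears when a follower's feed is loaded and merged.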

API Design

POST /api/v1/tweets
{
  "content": "Hello, world!",
  "media_urls": ["https://..."]
}

GET /api/v1/feed?cursor=xxx&limit=20

Response:
{
  "tweets": [
    {
      "tweet_id": 123456,
      "user": { "id": 1, "name": "User" },
      "content": "Hello!",
      "created_at": "2024-01-01T12:00:00Z",
      "like_count": 100,
      "retweet_count": 50
    }
  ],
  "next_cursor": "yyy"
}

POST /api/v1/tweets/{id}/like
POST /api/v1/tweets/{id}/retweet
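
The `cursor` and `next_cursor` values in the feed endpoint are opaque to the client. One possible encoding, shown here as a sketch, is the (timestamp, tweet_id) pair of the last tweet returned, serialized as URL-safe base64 JSON. The function names and the cursor format are assumptions for illustration; the API above does not specify them.

```python
import base64
import json


def encode_cursor(ts, tweet_id):
    """Packs the position of the last returned tweet into an opaque string."""
    raw = json.dumps([ts, tweet_id]).encode()
    return base64.urlsafe_b64encode(raw).decode()


def decode_cursor(cursor):
    ts, tweet_id = json.loads(base64.urlsafe_b64decode(cursor.encode()))
    return ts, tweet_id


def get_feed_page(entries, cursor=None, limit=20):
    """entries: list of (ts, tweet_id) tuples sorted newest first."""
    if cursor is not None:
        last_seen = decode_cursor(cursor)
        # Keep only entries strictly older than the last one the client saw.
        entries = [e for e in entries if e < last_seen]
    page = entries[:limit]
    has_more = len(entries) > limit
    next_cursor = encode_cursor(*page[-1]) if has_more else None
    return {"tweets": [tweet_id for _, tweet_id in page], "next_cursor": next_cursor}


timeline = [(ts, 1000 + ts) for ts in range(10, 0, -1)]  # 10 tweets, newest first
page1 = get_feed_page(timeline, limit=4)
page2 = get_feed_page(timeline, cursor=page1["next_cursor"], limit=4)
page3 = get_feed_page(timeline, cursor=page2["next_cursor"], limit=4)
```

Unlike offset-based paging, a cursor keyed on position stays stable when new tweets are inserted at the top of the feed between requests.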

Data Flow

Post Tweet:

  1. Client POST /tweets.
  2. Validate, store in Cassandra.
  3. Lookup followers (from cache).
  4. If normal user: Push to followers’ feed cache.
  5. If celebrity: Store only, pull on read.
  6. Update search index (async via Kafka).
  7. Return tweet.

Load Feed:

  1. Client GET /feed.
  2. Load precomputed feed from Redis (50 tweets).
  3. Fetch recent tweets from celebrities (pull).
  4. Merge & sort by timestamp.
  5. Return feed.
  6. Cache result.

Step 5: Bottlenecks & optimization

Single Point of Failure

  • Database: Cassandra replication factor 3.
  • Cache: Redis Cluster (built-in replica failover), or replicated Redis managed by Sentinel.
  • Feed Generator: Multiple instances.

Scalability Bottlenecks

  • Celebrity fan-out: Justin Bieber (100M followers) → 100M writes!
    • Solution: Hybrid model, celebrity tweets pulled on read.
  • Feed cache memory: 500M users * 50 tweets * 8 bytes ≈ 200 GB.
    • Solution: Only cache active users, LRU eviction.
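
The "only cache active users, LRU eviction" idea can be sketched with an `OrderedDict`. This is an in-memory stand-in for the Redis feed cache; the `FeedCacheLRU` class and its methods are hypothetical names, and a real deployment would more likely rely on key expiry or Redis eviction policies.

```python
from collections import OrderedDict

# Worst case from the text: every user cached, 50 tweet IDs of 8 bytes each.
full_cache_bytes = 500_000_000 * 50 * 8


class FeedCacheLRU:
    def __init__(self, max_users, feed_len=50):
        self.max_users = max_users
        self.feed_len = feed_len
        self._feeds = OrderedDict()  # user_id -> tweet IDs, newest first

    def get(self, user_id):
        """Returns the cached feed, or None on a miss (caller rebuilds via pull)."""
        feed = self._feeds.get(user_id)
        if feed is not None:
            self._feeds.move_to_end(user_id)  # reading marks the user as active
        return feed

    def put(self, user_id, tweet_ids):
        """Stores a rebuilt feed and evicts the least recently used users."""
        self._feeds[user_id] = list(tweet_ids)[: self.feed_len]
        self._feeds.move_to_end(user_id)
        while len(self._feeds) > self.max_users:
            self._feeds.popitem(last=False)

    def push(self, user_id, tweet_id):
        """Fan-out write: only touches feeds that are already cached."""
        feed = self._feeds.get(user_id)
        if feed is not None:
            feed.insert(0, tweet_id)
            del feed[self.feed_len:]


cache = FeedCacheLRU(max_users=2, feed_len=3)
cache.put(1, [10, 9])
cache.put(2, [20])
cache.get(1)          # user 1 is now the most recently active
cache.put(3, [30])    # evicts user 2
cache.push(1, 11)
cache.push(1, 12)     # feed is capped at feed_len
cache.push(2, 21)     # no-op: user 2 is not cached
```

Skipping the push for uncached users also reduces fan-out writes, at the cost of a slower first feed load when an inactive user returns.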

Performance Optimization

  • Pagination: Cursor-based pagination (using the timestamp).
  • Denormalization: Store like_count and retweet_count in the tweet row.
  • Async updates: Like/retweet counts are updated asynchronously via Kafka.
  • CDN: Cache media files, static content.
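
To illustrate the asynchronous count updates, the sketch below aggregates like/retweet events in a batch before touching the denormalized counters, so many events for the same tweet collapse into one write. A `queue.Queue` stands in for the Kafka topic and a dict for the tweets table; both, along with the function name `drain_and_apply` and the event shape, are assumptions for illustration.

```python
from collections import Counter
from queue import Empty, Queue


def drain_and_apply(events, counts, max_batch=1000):
    """Drains up to max_batch events and applies one aggregated delta per counter.

    events: Queue of (tweet_id, field, delta) tuples.
    counts: dict of tweet_id -> {"like_count": int, "retweet_count": int}.
    Returns the number of counter writes performed.
    """
    deltas = Counter()
    for _ in range(max_batch):
        try:
            tweet_id, field, delta = events.get_nowait()
        except Empty:
            break
        deltas[(tweet_id, field)] += delta
    for (tweet_id, field), delta in deltas.items():
        row = counts.setdefault(tweet_id, {"like_count": 0, "retweet_count": 0})
        row[field] += delta
    return len(deltas)


events = Queue()
for event in [(1, "like_count", 1), (1, "like_count", 1), (1, "like_count", 1),
              (1, "like_count", -1), (2, "retweet_count", 1)]:
    events.put(event)

counts = {}
writes = drain_and_apply(events, counts)
print(writes, counts)
```

Five events result in two counter writes here, which is the source of the few seconds of delay in displayed counts.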

Step 6: Trade‑offs

Consistency vs Availability

  • AP system: Eventual consistency for the feed and counts.
  • Like counts may lag by a few seconds.
  • The feed may be missing new tweets for a few seconds.

Latency vs Throughput

  • Low latency: Precomputed feed (push model).
  • High throughput: Batch feed updates every 5-10 s.

Push vs Pull Trade-off

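Model    Write cost                                   Read cost   Best for
Pull     O(1)                                         O(M)        Celebrities
Push     O(N)                                         O(1)        Normal users
Hybrid   O(N) for normal users, O(1) for celebrities  O(1 + C)    Mixed

N = followers of the author, M = followees of the reader, C = celebrities the reader follows. Costs are per tweet written and per feed read.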

Cost vs Performance

  • Managed Cassandra (DataStax): More expensive, but with auto-scaling.
  • Self-hosted: Cheaper, but requires ops expertise.

Conclusion

A social media feed is a classic read-heavy system design problem, with these challenges:

  • Feed generation strategy: Push vs Pull vs Hybrid.
  • Celebrity problem: Fan-out amplification.
  • Real-time requirements: The feed must be fresh.
  • Scale: Hundreds of millions of users and hundreds of millions of tweets per day.
