Back from FAST, I'm continuing my OSDI'20 series this week. I couldn't post at my usual time because I was swamped with meetings all week; on Friday I had only a 30-minute break the entire day, so I'm updating our weekly paper reading on Saturday instead. I now understand why The Morning Paper was discontinued, and I do fear this series won't last long either... Anyway, today we will look at a less technical paper that I think is valuable to researchers: an analysis of large-scale in-memory caching at Twitter.
How are in-memory caches deployed at Twitter?
- One cache cluster per service (100s of clusters in total)
- A cache cluster is composed of containers
- In total they serve billions of QPS and consume 100s of TB of DRAM
- Method
- Week-long traces from one instance of each cluster (open sourced)
- 700 billion requests, 80 TB in size
- Focus on 54 clusters in this paper
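Since the traces are open sourced, statistics like a cluster's write ratio are easy to reproduce by streaming a trace file. Here is a minimal sketch; the CSV column order and the file name are assumptions for illustration, so check the released trace format before relying on it:

```python
import csv

# Minimal sketch: compute the write ratio of one cluster's trace.
# The column order (timestamp, key, key_size, value_size, client_id,
# op, ttl) is an assumption, not the documented trace format.
OP_COLUMN = 5

def write_ratio(path):
    reads, writes = 0, 0
    with open(path, newline="") as f:
        for row in csv.reader(f):
            op = row[OP_COLUMN]
            if op in ("get", "gets"):
                reads += 1
            elif op in ("set", "add", "replace", "cas", "append", "prepend"):
                writes += 1
    total = reads + writes
    return writes / total if total else 0.0

# e.g. write_ratio("cluster01.csv") > 0.3 would flag a write-heavy cluster
```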
How are cache clusters used?
- Use cases
- Caching for storage (KVStore)
- Caching for computation
- Intermediate data with no persistence
- Workloads
- 35% of clusters are write-heavy with more than 30% writes (challenges: scalability, tail latency)
- Object sizes are small
- In 24% of clusters, the mean object size is < 100 bytes
- Median: 230 bytes
- Keys are large relative to values: for 15% of requests, value size <= key size; for 50%, value size <= 5x key size (an argument for key compression)
- Object sizes change over time, sometimes very suddenly and with no clear pattern
- Metadata size is large: 56 bytes per object
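A quick back-of-the-envelope calculation shows why 56 bytes of per-object metadata matters when objects are this small (the field breakdown in the comment is a plausible guess, not the paper's exact layout):

```python
# Rough sketch of per-object memory overhead. A typical item header
# carries LRU pointers, a hash-chain pointer, sizes, flags, and an
# expiry timestamp, which is how ~56 bytes can add up (the exact
# fields are an assumption, not the paper's layout).
METADATA_BYTES = 56

def overhead(key_size, value_size):
    object_bytes = key_size + value_size
    return METADATA_BYTES / (object_bytes + METADATA_BYTES)

print(f"{overhead(30, 200):.0%}")  # 230-byte object (the median): ~20%
print(f"{overhead(30, 70):.0%}")   # 100-byte object: ~36%
```

So at the median object size, roughly a fifth of cache memory goes to metadata, which is why trimming object metadata matters for these workloads.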
How long before objects expire? Time-to-live (TTL)
- Objects are set to expire to limit inconsistency/staleness, to force refresh of computed results, etc.
- TTLs are usually short (< 20 mins); only 25% of TTLs are > 2 days
- Bounded working set size (3-6GB) -> No need for huge cache size
- The paper surveys multiple TTL expiration methods, all of which are inefficient (a sketch of the simplest one, lazy expiration, is below)
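For context, lazy expiration only checks an object's TTL when the object is read, so short-TTL objects can occupy memory long after expiring. A minimal sketch (a hypothetical class, not Twemcache's actual code):

```python
import time

class LazyTTLCache:
    """Minimal sketch of lazy TTL expiration: expired entries are only
    reclaimed when a read happens to touch them, so short-TTL objects
    can linger and waste memory (the inefficiency the paper points out)."""

    def __init__(self):
        self._store = {}  # key -> (value, expiry_timestamp)

    def set(self, key, value, ttl_seconds):
        self._store[key] = (value, time.time() + ttl_seconds)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expiry = entry
        if time.time() >= expiry:  # expired: only now is memory reclaimed
            del self._store[key]
            return None
        return value
```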
Other statistics
- Miss ratios are small, and so are their variations over time
- Request spikes are not always caused by hot keys
- Most workloads are Zipfian with large alpha, with only small deviations
- The best eviction policy is workload dependent, and FIFO performs similarly to LRU (see the simulation sketch below)
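To see the FIFO-vs-LRU point concretely, here is a small simulation on a synthetic Zipfian trace (the key count, alpha, and cache size are made-up parameters, not values from the paper):

```python
import random
from collections import OrderedDict

def zipf_trace(n_keys=10_000, n_requests=200_000, alpha=1.2, seed=42):
    """Sample a request trace where key popularity follows a Zipf law."""
    rng = random.Random(seed)
    weights = [1 / (rank ** alpha) for rank in range(1, n_keys + 1)]
    return rng.choices(range(n_keys), weights=weights, k=n_requests)

def miss_ratio(trace, cache_size, policy):
    cache = OrderedDict()
    misses = 0
    for key in trace:
        if key in cache:
            if policy == "lru":          # refresh recency on a hit
                cache.move_to_end(key)
        else:
            misses += 1
            cache[key] = True
            if len(cache) > cache_size:  # evict the oldest entry
                cache.popitem(last=False)
    return misses / len(trace)

trace = zipf_trace()
for policy in ("fifo", "lru"):
    print(policy, round(miss_ratio(trace, cache_size=1000, policy=policy), 3))
```

With a heavily skewed distribution, most hits come from a small set of hot keys that both policies keep resident, which is why the miss ratios come out close.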
Comment
This is another no-innovation-but-high-value paper. I will definitely read it again carefully and come back with more notes as I find more interesting things to log. The traces may be useful for my own future papers as well.