Common Mistakes
- Fetching all the records into memory
- 50M records (max 5 × 5 GB of data) → not a good idea to hold all of this in memory
- Even if we can, what if two requests come in parallel?
- 2x memory usage, won’t scale
- Optimization: Offload median calculation to the database itself
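The offload can be sketched with SQLite's classic LIMIT/OFFSET median trick (table and column names here are illustrative, not from the actual project); the same idea maps onto `percentile_cont` in databases like PostgreSQL:

```python
import sqlite3

# Illustrative schema: a single records(value) table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (value REAL)")
conn.executemany("INSERT INTO records VALUES (?)",
                 [(v,) for v in (3, 1, 4, 1, 5)])

# Median computed entirely inside the database: sort, skip to the
# middle, and average the one or two middle rows. The application
# never loads the full dataset into memory.
row = conn.execute("""
    SELECT AVG(value) FROM (
        SELECT value FROM records
        ORDER BY value
        LIMIT 2 - (SELECT COUNT(*) FROM records) % 2
        OFFSET (SELECT (COUNT(*) - 1) / 2 FROM records)
    )
""").fetchone()
print(row[0])  # 3.0 for the sample data above
```

The `LIMIT 2 - n % 2` picks one row for odd counts and two for even counts, so the `AVG` handles both cases.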
- Not using caching
- Caching is often an easy performance win; consider Redis or Memcached
- This avoids repeating expensive calculations. Cache invalidation is tricky in general, but in our case we can simply cache the response until the next ingestion
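A minimal sketch of the cache-until-next-ingestion pattern; an in-process dict stands in for Redis/Memcached (SET on a miss, DEL on ingestion), and `compute_median_in_db` is a hypothetical stub for the real query:

```python
calls = {"db": 0}
cache = {}

def compute_median_in_db():
    calls["db"] += 1          # count how often we actually hit the database
    return 42.0               # placeholder for the expensive query result

def get_median():
    # Serve from cache when possible; compute and store on a miss.
    if "median" not in cache:
        cache["median"] = compute_median_in_db()
    return cache["median"]

def on_ingestion_complete():
    # New data makes the cached median stale; drop it so the next
    # request recomputes against the fresh data.
    cache.pop("median", None)

get_median(); get_median()    # second call is a cache hit
on_ingestion_complete()       # ingestion invalidates the cache
get_median()                  # recomputes exactly once
print(calls["db"])            # 2
```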
- Using an OLTP database for an analytical workload
- Not using a proper index
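Concretely, "a proper index" here means an index on the column the median query sorts by, so the database can walk the index instead of re-sorting all 50M rows on every request. A sketch with SQLite (index and table names are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (value REAL)")
# Index on the column the median query orders by; with it, the
# ORDER BY can walk the index instead of sorting the whole table.
conn.execute("CREATE INDEX idx_records_value ON records (value)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT value FROM records ORDER BY value"
).fetchall()
print(plan)  # SQLite reports a scan using the index, not a temp sort
```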
- Calculating the exact median
- Calculating the exact median over 50M records can be costly; we can approximate it instead
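One possible approximation is reservoir sampling: keep a fixed-size uniform random sample of the stream and return the sample's median. A sketch (function and parameter names are illustrative):

```python
import random

def approx_median(stream, sample_size=10_000):
    """Estimate the median by keeping a uniform random sample
    (reservoir) of the stream and taking the sample's median."""
    reservoir = []
    for i, x in enumerate(stream):
        if i < sample_size:
            reservoir.append(x)
        else:
            j = random.randint(0, i)
            if j < sample_size:
                reservoir[j] = x  # keep x with probability sample_size/(i+1)
    reservoir.sort()
    mid = len(reservoir) // 2
    if len(reservoir) % 2:
        return reservoir[mid]
    return (reservoir[mid - 1] + reservoir[mid]) / 2

random.seed(0)
est = approx_median(range(1_000_000), sample_size=5_000)
# est lands close to the true median (499999.5) while holding only
# 5,000 values in memory instead of the full dataset
```

Memory stays bounded by `sample_size` regardless of the input size; error shrinks as the sample grows.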
My Solution
- Details
- Setup
- Ingestion
- Median
- Caching
- HTTP Server
- Benchmark Tool
- CSV Generation Script
- Summary