With over a billion users, it comes as no surprise that WeChat manages extremely large data volumes. In some cases, single tables grow by trillions of records daily, and queries regularly scan over a billion records.
WeChat's business scenarios demand rapid end-to-end response times, with a P90 query latency target of under 5 seconds and data freshness requirements ranging from seconds to minutes. This complexity is compounded by queries that often span more than 50 dimensions and 100 metrics at a time.
WeChat's legacy data architecture consisted of a Hadoop-based data lake alongside a variety of data warehouses. This resulted in significant operational overhead and data governance challenges, including:
- Juggling multiple systems across separate real-time and batch analytics pipelines
- Maintaining data ingestion pipelines for data warehouses
- Governance problems from maintaining multiple copies of the same data
- Managing incompatible APIs of different systems
- Challenges in standardizing data analysis processes