
Scaling AI/ML Infrastructure at Uber: Optimizing for Efficiency and Growth

This blog post explores Uber’s journey in scaling its AI/ML infrastructure to support a rapidly evolving landscape of applications. As Uber’s models have grown in complexity, from XGBoost to deep learning and generative AI, efficient and adaptable infrastructure has become essential.

Optimizing Existing Infrastructure

The blog details several initiatives undertaken to maximize the utilization of existing on-premises infrastructure:

  • Unified Federation Layer: A centralized workload scheduler (Michelangelo Job Controller) streamlines job allocation across distributed Kubernetes clusters, improving resource utilization and shielding engineers from infrastructure complexity (a minimal scheduling sketch follows this list).
  • Network Upgrades for LLM Training: Increased network bandwidth and improved congestion control nearly double training speed for large language models (LLMs).
  • Memory Upgrades: Doubling memory per GPU worker improves allocation rates and accommodates training of models with larger memory footprints.
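
The blog doesn't describe the Job Controller's internals, but the core idea of a federation layer is a single scheduling entry point that places each job on whichever Kubernetes cluster can accommodate it. The Python sketch below illustrates that idea only; the `Cluster` and `GpuJob` types, the cluster names, and the capacity-based placement heuristic are hypothetical, not Uber's actual implementation.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Cluster:
    name: str
    gpu_type: str    # e.g., "A100" or "H100"
    free_gpus: int   # GPUs currently unallocated in this Kubernetes cluster

@dataclass
class GpuJob:
    job_id: str
    gpu_type: str
    gpus_needed: int

def place_job(job: GpuJob, clusters: List[Cluster]) -> Optional[Cluster]:
    """Pick a cluster for a job: filter by GPU type and free capacity,
    then prefer the cluster with the most headroom (a simple heuristic)."""
    candidates = [c for c in clusters
                  if c.gpu_type == job.gpu_type and c.free_gpus >= job.gpus_needed]
    if not candidates:
        return None  # no cluster fits; a real controller would queue the job
    best = max(candidates, key=lambda c: c.free_gpus)
    best.free_gpus -= job.gpus_needed  # reserve capacity for the placed job
    return best

# One scheduling decision across two hypothetical clusters.
clusters = [Cluster("dca1-k8s", "A100", free_gpus=16),
            Cluster("phx2-k8s", "A100", free_gpus=64)]
chosen = place_job(GpuJob("llm-train-001", "A100", gpus_needed=8), clusters)
print(chosen.name if chosen else "queued")  # -> phx2-k8s
```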

Building New Infrastructure

The blog also highlights Uber’s approach to building new cloud-based infrastructure:

  • Price-Performance Evaluations: Rigorous testing of candidate cloud SKUs across diverse workloads (deep learning, serving, LLMs) ensures cost-effective selection based on feasibility, cost, throughput, and latency (a simplified cost comparison follows this list).
  • Leveraging Latest Hardware: Adopting Nvidia H100 GPUs, with their high TFLOPS and memory bandwidth, helps meet the stringent latency requirements of generative AI applications.
  • HW/SW Co-design: The blog emphasizes the importance of co-designing hardware and software to fully leverage hardware capabilities. This is demonstrated by the significant performance improvements achieved through framework optimizations for LLM serving.
  • LLM Training Efficiency with Memory Offload: Offloading optimizer states, parameters, and gradients from GPU memory to CPU memory or NVMe devices makes it possible to train larger LLMs while improving training efficiency and MFU (model FLOPs utilization); a sample offload configuration appears below.
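
To make the price-performance idea concrete, here is a minimal sketch of the kind of comparison described above: filter SKUs by a latency SLO (feasibility), then rank the survivors by cost per unit of throughput. The SKU names, prices, throughputs, and latency budget are made-up placeholders, not figures from the blog.

```python
# Hypothetical benchmark results per cloud SKU for one representative workload:
# hourly price, measured throughput, and p99 latency. All numbers are placeholders.
skus = [
    {"sku": "gpu-sku-a", "usd_per_hour": 12.0, "tokens_per_sec": 4800, "p99_ms": 180},
    {"sku": "gpu-sku-b", "usd_per_hour": 20.0, "tokens_per_sec": 9500, "p99_ms": 120},
    {"sku": "gpu-sku-c", "usd_per_hour": 7.5,  "tokens_per_sec": 2100, "p99_ms": 400},
]

LATENCY_BUDGET_MS = 250  # workload-specific SLO; SKUs above it are infeasible

def cost_per_million_tokens(sku: dict) -> float:
    tokens_per_hour = sku["tokens_per_sec"] * 3600
    return sku["usd_per_hour"] / tokens_per_hour * 1_000_000

# Feasibility filter first, then rank by price-performance.
feasible = [s for s in skus if s["p99_ms"] <= LATENCY_BUDGET_MS]
for s in sorted(feasible, key=cost_per_million_tokens):
    print(f'{s["sku"]}: ${cost_per_million_tokens(s):.3f} per 1M tokens')
```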
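The summary above doesn't name a specific offload framework; DeepSpeed's ZeRO stage 3 offload is one widely used implementation of this technique, and the sketch below shows roughly what such a configuration can look like. This is an assumption for illustration, not Uber's actual setup, and the batch size, precision, and NVMe path are placeholders.

```python
# Minimal DeepSpeed-style ZeRO stage 3 config with offload (illustrative only).
# It would be passed as the config argument of deepspeed.initialize().
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                # partition optimizer states, gradients, and parameters
        "offload_optimizer": {     # keep optimizer states and updates in CPU RAM
            "device": "cpu",
            "pin_memory": True,
        },
        "offload_param": {         # spill parameters to NVMe when CPU RAM is not enough
            "device": "nvme",
            "nvme_path": "/mnt/nvme0",
        },
    },
}
```

Offloading trades PCIe and NVMe bandwidth for GPU memory headroom, so it tends to pay off when the freed memory enables larger batches or model shards than would otherwise fit.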

Conclusion

The blog concludes with three key takeaways:

  1. Evolving Workloads: The need for workload-specific solutions and efficiency metrics to address the diverse requirements of various AI models.
  2. Collaborative Design: The importance of HW/SW co-design across all system layers for optimal infrastructure efficiency.
  3. Open Source Collaboration: A call for industry partnerships and engagement in open-source optimization to accelerate advancements in AI infrastructure scaling.

This blog post offers valuable insights for companies building and scaling their AI/ML infrastructure. By employing a combination of optimization techniques, workload-specific solutions, and collaborative design principles, organizations can ensure efficient and adaptable infrastructure to support the ever-growing demands of AI applications.

Reference: "Scaling AI/ML Infrastructure at Uber," Uber Engineering Blog.

