Engineering Blog

Effortless Scalability: Orchestrating Large Language Model Inference with Kubernetes

In the dynamic landscape of AI/ML, deploying and orchestrating large open-source inference models on Kubernetes has become paramount. This talk delves into the intricacies of automating the deployment of heavyweight models like Falcon and Llama 2, leveraging Kubernetes Custom Resource Definitions (CRDs) to manage large model files seamlessly through container images. The deployment is streamlined with an HTTP server facilitating inference calls using the model library. This session will explore eliminating manual tuning of deployment parameters to fit GPU hardware by providing preset configurations. Learn how to auto-provision GPU nodes based on specific model requirements, ensuring optimal utilization of resources. We’ll discuss empowering users to deploy their containerized models effortlessly by allowing them to provide a pod template in the workspace custom resource inference field. The controller dynamically, in turn, creates deployment workloads utilizing all GPU nodes.

Reference

https://www.youtube.com/@cncf

Previous Post