Scaling Kubernetes to 7,500 nodes

Scaling a Kubernetes cluster to this size (7,500 nodes) is rarely done and requires special care, but the payoff is a simple infrastructure that lets OpenAI’s machine learning teams scale up quickly without changing their code. Since its earlier post on scaling to 2,500 nodes, OpenAI has continued to grow its infrastructure and has learned many additional lessons along the way. This post summarizes those lessons so that others in the Kubernetes community can benefit from them, and it concludes with challenges OpenAI still faces and plans to tackle next.

Workload

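OpenAI’s large machine learning jobs span many nodes and run most efficiently when each pod has full access to the hardware of the node it runs on. The jobs are scheduled with Kubernetes, but they are demanding workloads, especially the largest ones, which use MPI. If one participant in such a job fails, the whole job halts and must be restarted, so jobs checkpoint regularly and resume from the last checkpoint to limit the disruption. OpenAI relies very little on Kubernetes load balancing and has no need for A/B testing. Pods communicate directly with one another using MPI over SSH rather than through service endpoints, which keeps the setup simple. Storage needs vary from job to job, and datasets and checkpoints have to be read and written efficiently. The workloads themselves change constantly: researchers iterate quickly and new usage patterns can emerge at any time, so the infrastructure has to adapt with them.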

Networking

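As the number of nodes and pods in its clusters grew, OpenAI found that Flannel had difficulty delivering the required throughput, so it switched to the native pod networking of its cloud provider and the matching CNI plugins, which gives pods host-level network throughput. This alias-based IP addressing also copes with the very large number of IP addresses that can be in use at once on the largest clusters, where route-based pod networking ran into limits. Avoiding encapsulation keeps the networking setup simple: a VPN or tunnel can be added without extra adapters, and there is no ambiguity about the source and destination of packets. OpenAI also tags traffic with iptables on each host to track network usage, which helps researchers understand their traffic patterns and find the bottlenecks that slow down their experiments. The following iptables mangle rules label traffic that crosses the cluster boundary, meaning anything with a source or destination outside 10.0.0.0/8. The INPUT and OUTPUT rules cover traffic to and from the host itself, and the FORWARD rules cover traffic to and from pods: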


iptables -t mangle -A INPUT ! -s 10.0.0.0/8 -m comment --comment "iptables-exporter openai traffic=internet-in"
iptables -t mangle -A FORWARD ! -s 10.0.0.0/8 -m comment --comment "iptables-exporter openai traffic=internet-in"
iptables -t mangle -A OUTPUT ! -d 10.0.0.0/8 -m comment --comment "iptables-exporter openai traffic=internet-out"
iptables -t mangle -A FORWARD ! -d 10.0.0.0/8 -m comment --comment "iptables-exporter openai traffic=internet-out"

Once a rule is in place, iptables keeps counters of the packets and bytes that match it. These counters can be inspected with iptables itself:

% iptables -t mangle -L -v
Chain FORWARD (policy ACCEPT 50M packets, 334G bytes)
 pkts bytes target     prot opt in     out     source               destination
....
1253K  555M            all  --  any    any     anywhere            !10.0.0.0/8           /* iptables-exporter openai traffic=internet-out */
1161K 7937M            all  --  any    any    !10.0.0.0/8           anywhere             /* iptables-exporter openai traffic=internet-in */
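
The listing above can also be read programmatically. The following is a minimal sketch, not OpenAI's exporter: it parses text in the format shown and sums packets and bytes per `traffic=` label. The function names are hypothetical, and the sketch assumes that the K/M/G suffixes printed by `iptables -v` are decimal multiples, so the totals are approximations.

```python
import re
from collections import defaultdict

# Assumption: abbreviated counters use decimal multiples (K = 1000).
_SUFFIX = {"": 1, "K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}
_COUNTER = re.compile(r"^(\d+)([KMGT]?)$")
_LABEL = re.compile(r"/\* iptables-exporter openai traffic=(\S+) \*/")


def parse_counter(text):
    """Convert an abbreviated counter such as '555M' into an integer."""
    match = _COUNTER.match(text)
    if match is None:
        raise ValueError(f"unrecognized counter: {text!r}")
    return int(match.group(1)) * _SUFFIX[match.group(2)]


def traffic_totals(listing):
    """Sum packets and bytes per traffic label found in rule comments."""
    totals = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for line in listing.splitlines():
        label = _LABEL.search(line)
        fields = line.split()
        if label is None or len(fields) < 2:
            continue  # chain headers, column headers, unlabeled rules
        totals[label.group(1)]["packets"] += parse_counter(fields[0])
        totals[label.group(1)]["bytes"] += parse_counter(fields[1])
    return dict(totals)


SAMPLE = """\
Chain FORWARD (policy ACCEPT 50M packets, 334G bytes)
 pkts bytes target     prot opt in     out     source               destination
1253K  555M            all  --  any    any     anywhere            !10.0.0.0/8           /* iptables-exporter openai traffic=internet-out */
1161K 7937M            all  --  any    any    !10.0.0.0/8           anywhere             /* iptables-exporter openai traffic=internet-in */
"""

if __name__ == "__main__":
    for label, counts in sorted(traffic_totals(SAMPLE).items()):
        print(label, counts["packets"], counts["bytes"])
```

Adding the `-x` flag to the iptables command prints exact counters instead of abbreviated ones, which would avoid the rounding that this sketch has to accept.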

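OpenAI feeds these counters into its monitoring system with an open-source Prometheus exporter called iptables-exporter. One unusual aspect of the network model is that the node, pod, and service CIDR ranges are fully exposed to researchers. The clusters are arranged in a hub-and-spoke model: researchers connect to the hub and can reach any individual cluster from there, but the clusters are isolated from one another. Traffic that arrives from outside a cluster passes through a “NAT” host, which gives researchers flexibility in the network configurations they can use for their experiments.

Because the API servers and etcd are critical to a healthy cluster, OpenAI watches the load on them closely and runs them on dedicated nodes, separate from the rest of the cluster, for reliability. One significant source of load was watches on Endpoints, where each node added or removed triggered updates to every watcher; switching to EndpointSlices reduced that load substantially. More generally, OpenAI is cautious about anything that makes API server requests grow with cluster size, and where every node needs to observe changes it places a caching service between the nodes and the API server. Adding many nodes at once during autoscaling can also overload the API servers, so node additions are smoothed out over time.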
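
The post does not say how node additions are smoothed during autoscaling. As one possible illustration, the sketch below spreads a burst of new nodes into fixed-size batches with a delay between batches. The function name, batch size, and interval are all hypothetical.

```python
def stagger_joins(node_names, batch_size, interval_seconds):
    """Return (node, delay in seconds) pairs that spread joins into batches.

    Nodes in the same batch share a delay; each later batch waits
    interval_seconds longer than the one before it.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        (name, (index // batch_size) * interval_seconds)
        for index, name in enumerate(node_names)
    ]


if __name__ == "__main__":
    new_nodes = [f"node-{i}" for i in range(5)]
    for name, delay in stagger_joins(new_nodes, batch_size=2, interval_seconds=5):
        print(f"{name}: join after {delay}s")
```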

Time-series metrics with Prometheus and Grafana

OpenAI uses Prometheus to collect time-series metrics and Grafana to display them in graphs and dashboards. As the number of collected metrics grew, Prometheus began to struggle, crashing when it ran out of memory. OpenAI resolved the crashes and memory problems by dropping metrics that nobody used and by limiting expensive queries, and it is still working to improve the performance of its monitoring system.

Healthchecks

Operating at this scale, OpenAI relies on automation to identify and eliminate malfunctioning nodes within the cluster. Over time, they have developed various health check systems for this purpose.

Passive healthchecks

Some health checks are passive and run continuously on every node. They monitor basic system resources such as the network, disks, and GPUs. GPU problems show up in several ways: some can be cleared by resetting the GPU, while others require the hardware to be replaced. Where possible OpenAI also acts proactively, for example by restarting a virtual machine before an anticipated disruption. When a check starts failing, the node is cordoned so that no new pods are scheduled on it. For more serious failures, OpenAI also attempts to evict the running pods so that they can move elsewhere, and if the node has not been vacated after a set period, it is shut down in line with OpenAI’s policy.
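
The escalation described above can be sketched as a small decision function. This is an illustration under stated assumptions rather than OpenAI's implementation: the severity levels, action names, and grace period are hypothetical, and a real system would carry out the actions through the Kubernetes API and the cloud provider.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Severity(Enum):
    HEALTHY = 0
    DEGRADED = 1  # stop scheduling new pods on the node
    SERIOUS = 2   # also ask running pods to leave


@dataclass
class NodeState:
    severity: Severity
    running_pods: int
    failing_since: Optional[datetime] = None


def plan_actions(node, now, grace_period):
    """Return the ordered remediation steps for one node."""
    if node.severity is Severity.HEALTHY:
        return []
    actions = ["cordon"]
    if node.severity is Severity.SERIOUS:
        if node.running_pods > 0:
            actions.append("evict-pods")
        expired = (
            node.failing_since is not None
            and now - node.failing_since >= grace_period
        )
        # Terminate once the node is empty or the grace period has run out.
        if node.running_pods == 0 or expired:
            actions.append("terminate-node")
    return actions


if __name__ == "__main__":
    node = NodeState(Severity.SERIOUS, running_pods=3,
                     failing_since=datetime(2021, 1, 1))
    print(plan_actions(node, datetime(2021, 1, 2), timedelta(days=3)))
```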

Active GPU tests

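Not every GPU problem produces a visible error, so OpenAI also runs active tests that exercise the GPUs. A short set of tests runs when a node boots, while the node is still restricted from accepting regular workloads, and further tests run periodically afterwards. Once the boot-time tests pass, the restriction is removed and the node becomes available for general use.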
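
A minimal sketch of the boot-time gate follows, assuming the restriction is implemented as a node taint. The taint key and the test names are hypothetical, and the test callables stand in for real GPU exercises.

```python
PREFLIGHT_TAINT_KEY = "example.com/preflight"  # hypothetical taint key


def run_preflight(tests):
    """Run each named test callable and return the names that failed.

    A test fails if it returns a falsy value or raises an exception.
    """
    failed = []
    for name, test in tests.items():
        try:
            ok = bool(test())
        except Exception:
            ok = False
        if not ok:
            failed.append(name)
    return failed


def taints_after_preflight(taints, failed_tests):
    """Drop the preflight taint only when every test passed."""
    if failed_tests:
        return list(taints)
    return [t for t in taints if t["key"] != PREFLIGHT_TAINT_KEY]


if __name__ == "__main__":
    node_taints = [{"key": PREFLIGHT_TAINT_KEY, "effect": "NoSchedule"}]
    failed = run_preflight({"memory-pattern": lambda: True})
    print(taints_after_preflight(node_taints, failed))
```

Keeping the taint in place on failure means that a node with a suspect GPU never becomes schedulable, which matches the intent described above.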

Quotas & resource usage

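As OpenAI’s clusters grew, researchers found it increasingly difficult to get all of the capacity they had been allocated. Traditional job scheduling systems offer many features for sharing capacity fairly among competing teams that Kubernetes lacks, so OpenAI took inspiration from those systems and built several such capabilities in a Kubernetes-native way.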

Team taints

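A service called “team-resource-manager” runs in each cluster and manages how capacity is divided among teams. It taints the appropriate number of nodes for each team so that only that team’s jobs are scheduled on them. Because the mechanism is built on taints and tolerations, scheduling stays flexible: lower-priority pods, for example, can be allowed to tolerate any team’s taint, which lets teams borrow one another’s capacity without heavyweight coordination.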
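
The reconciliation that such a service performs can be sketched as a pure function. This is an illustration, not the code of team-resource-manager: it assumes nodes are marked with one team taint each, and the function name, input shapes, and team names are hypothetical.

```python
def reconcile_team_taints(allocations, node_teams):
    """Compute which nodes need a team taint added, changed, or removed.

    allocations: {team: number of nodes the team should own}
    node_teams:  {node name: current team, or None if untainted}
    Returns {node name: new team, or None to clear the taint} for the
    nodes that must change; nodes that are already correct are omitted.
    """
    changes = {}
    owned = {team: [] for team in allocations}
    free = []
    for node in sorted(node_teams):
        team = node_teams[node]
        if team in owned:
            owned[team].append(node)
        else:
            if team is not None:
                changes[node] = None  # taint for a team with no allocation
            free.append(node)

    # Release surplus nodes first so other teams can receive them.
    for team, nodes in owned.items():
        for node in nodes[allocations[team]:]:
            changes[node] = None
            free.append(node)

    # Then top up teams that are below their allocation.
    for team in sorted(allocations):
        shortfall = allocations[team] - min(len(owned[team]), allocations[team])
        for _ in range(shortfall):
            if not free:
                break
            changes[free.pop(0)] = team
    return changes


if __name__ == "__main__":
    current = {"n1": "alpha", "n2": None, "n3": "beta", "n4": "beta"}
    print(reconcile_team_taints({"alpha": 2, "beta": 1}, current))
```

Nodes that already carry the right taint are left alone, so repeated runs converge instead of reshuffling nodes between teams.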

CPU & GPU balloons

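OpenAI uses cluster-autoscaler to scale its clusters dynamically, with “min size” set to zero and “max size” set to the available capacity. Left alone, cluster-autoscaler scales down when it sees idle nodes, which is undesirable here. To prevent this, OpenAI runs a balloon Deployment of low-priority pods on both CPU-only and GPU hosts. The balloon pods occupy resources so that the nodes are not considered idle, and because they are low priority they can be evicted immediately to make room for real work. Pod anti-affinity spreads the balloon pods evenly across nodes. Earlier versions of the Kubernetes scheduler had a performance problem with pod anti-affinity, which has been corrected since Kubernetes 1.18.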
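
A balloon Deployment along these lines could be described with manifests such as the ones built below. This is a sketch under stated assumptions, not OpenAI's configuration: the names, the priority value of -1, the pause image, and the resource figures are all hypothetical choices for illustration.

```python
import json


def balloon_manifests(name, replicas, resources):
    """Build a PriorityClass and a Deployment for placeholder pods."""
    labels = {"app": name}
    priority_class = {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": f"{name}-priority"},
        "value": -1,  # below the default pod priority of 0
        "globalDefault": False,
        "description": "Placeholder pods that real workloads may preempt.",
    }
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "priorityClassName": f"{name}-priority",
                    "affinity": {
                        "podAntiAffinity": {
                            # At most one balloon pod per node.
                            "requiredDuringSchedulingIgnoredDuringExecution": [
                                {
                                    "labelSelector": {"matchLabels": labels},
                                    "topologyKey": "kubernetes.io/hostname",
                                }
                            ]
                        }
                    },
                    "containers": [
                        {
                            "name": "pause",
                            "image": "registry.k8s.io/pause:3.9",
                            "resources": {
                                "requests": resources,
                                "limits": resources,
                            },
                        }
                    ],
                },
            },
        },
    }
    return [priority_class, deployment]


if __name__ == "__main__":
    for manifest in balloon_manifests("cpu-balloon", 4, {"cpu": "1"}):
        print(json.dumps(manifest, indent=2))
```

The anti-affinity term keyed on `kubernetes.io/hostname` is what forces one balloon pod per node, and the negative priority is what lets ordinary pods preempt the balloons.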

Gang scheduling

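OpenAI’s experiments often involve one or more StatefulSets, each running a different part of the training effort. Some of them need every member to be scheduled before any training can begin, but by default Kubernetes does not prioritize completing one StatefulSet over another. If two experiments each request the whole cluster, Kubernetes might schedule half of each, leaving both deadlocked and unable to make progress. Kubernetes 1.18 introduced a plugin architecture for the core scheduler, and OpenAI now uses the Coscheduling plugin built on it to schedule such groups of pods together.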
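
The difference between pod-by-pod and all-or-nothing placement can be shown with a toy model. This sketch is not the Coscheduling plugin's algorithm; it assumes one pod per node, and the function and experiment names are hypothetical.

```python
def schedule_gangs(free_nodes, gangs):
    """Admit each gang only if every one of its pods fits.

    free_nodes: number of whole nodes currently available
    gangs:      list of (gang name, pods required) in queue order
    Returns (admitted gang names, nodes left over).
    """
    admitted = []
    for name, pods_required in gangs:
        if pods_required <= free_nodes:
            admitted.append(name)
            free_nodes -= pods_required
        # Otherwise the gang waits; no partial placement is made.
    return admitted, free_nodes


def schedule_pod_by_pod(free_nodes, gangs):
    """Contrast case: place pods round-robin with no gang awareness."""
    required = dict(gangs)
    placed = {name: 0 for name in required}
    while free_nodes > 0 and any(placed[n] < required[n] for n in placed):
        for name in placed:
            if free_nodes > 0 and placed[name] < required[name]:
                placed[name] += 1
                free_nodes -= 1
    return placed


if __name__ == "__main__":
    queue = [("exp-a", 100), ("exp-b", 100)]
    print(schedule_pod_by_pod(100, queue))  # each experiment gets half
    print(schedule_gangs(100, queue))       # one experiment runs in full
```

In the pod-by-pod case neither experiment reaches its full size, so neither can start; with gang admission one experiment runs to completion and the other waits its turn.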

Unsolved problems

Metrics

At OpenAI’s scale, Prometheus’s built-in TSDB storage engine has been a source of difficulty: it is slow to compact and takes a long time to come back up after a restart. OpenAI is migrating to a different Prometheus-compatible storage and query engine.

Pod network traffic shaping

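As OpenAI scales up its clusters, each pod is allotted a certain amount of Internet bandwidth. In aggregate that bandwidth has become substantial, and researchers can now unintentionally put significant strain on external resources, such as the sites that host datasets for download and software packages to install.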

Conclusions

OpenAI has found Kubernetes to be an exceptionally flexible platform for its research needs, able to scale up to meet its most demanding workloads. There are still areas where it needs improvement, and OpenAI’s Supercomputing team continues to explore how far Kubernetes can scale.