Scaling Kubernetes to 2,500 Nodes for Deep Learning at OpenAI

OpenAI, a pioneer in artificial intelligence, pushes the boundaries of Kubernetes by scaling it to manage massive deep learning workloads. While managing bare VMs remains an option for the largest tasks, Kubernetes shines for its rapid iteration cycles, reasonable scalability, and reduced development overhead. This blog dives into OpenAI’s journey building a 2,500-node Kubernetes cluster on Azure, sharing the challenges overcome and the solutions implemented along the way.

Challenges Emerged while Scaling

etcd

Scaling to 2,500 nodes in Kubernetes posed challenges for OpenAI. Beyond 500 nodes, timeouts occurred, temporarily addressed by adding Kube masters. Suspecting etcd issues, write latency spikes were observed despite using high-performance hardware. Benchmarking revealed etcd’s limited IOPS utilization due to latency, resolved by relocating the etcd directory to local SSDs. Beyond 1,000 nodes, high commit latency resurfaced, and Prometheus was employed to monitor apiservers. Default settings in Fluentd and Datadog processes querying apiservers were identified as root causes, now resolved by adjusting their polling behavior.

An additional adjustment involved segregating Kubernetes Events into a distinct etcd cluster to prevent any adverse impact on the performance of the primary etcd instances during spikes in Event creation. To implement this, the –etcd-servers-overrides flag was configured.

After surpassing 1,000 nodes, reaching etcd’s 2GB storage limit resulted in write failures, initiating a cascade of failures. All Kube nodes failed health checks, leading the autoscaler to terminate all workers. To address this, the max etcd size was increased using the –quota-backend-bytes flag. Furthermore, the autoscaler now incorporates a sanity check to prevent actions that would terminate more than 50% of the cluster.

Kube masters

OpenAI employs colocation of kube-apiserver, kube-controller-manager, and kube-scheduler processes on the same machines. To ensure high availability, a minimum of 2 masters is maintained, and the –apiserver-count flag is set according to the number of running apiservers to avoid confusion in Prometheus monitoring.

The primary use of Kubernetes at OpenAI is as a batch scheduling system, leveraging the autoscaler for dynamic cluster scaling. This approach reduces costs for idle nodes while ensuring low latency during rapid iterations. To diverge from the default kube-scheduler policy of evenly spreading load among nodes, OpenAI adopts a policy that facilitates the termination of unused nodes and expedites the scheduling of large pods. OpenAI transitioned to the subsequent policy:


{
"kind" : "Policy",
"apiVersion" : "v1",
"predicates" : [
  {"name" : "GeneralPredicates"},
  {"name" : "MatchInterPodAffinity"},
  {"name" : "NoDiskConflict"},
  {"name" : "NoVolumeZoneConflict"},
  {"name" : "PodToleratesNodeTaints"}
  ],
"priorities" : [
  {"name" : "MostRequestedPriority", "weight" : 1},
  {"name" : "InterPodAffinityPriority", "weight" : 2}
  ]
}

OpenAI heavily depends on KubeDNS for service discovery. However, after introducing a new scheduling policy, reliability issues surfaced, particularly affecting specific KubeDNS pods. The new policy caused certain machines to host more than 10 copies of KubeDNS, leading to hotspots and exceeding the ~200QPS limit per Azure VM for external domain lookups.

To resolve this, OpenAI implemented an anti-affinity rule for their KubeDNS pods.


affinity:
 podAntiAffinity:
   requiredDuringSchedulingIgnoredDuringExecution:
   - weight: 100
     labelSelector:
       matchExpressions:
       - key: k8s-app
         operator: In
         values:
         - kube-dns
     topologyKey: kubernetes.io/hostname

Docker image pulls

OpenAI’s Dota project on Kubernetes faced delays in pod initialization, notably with a 17GB game image taking up to 30 minutes to pull on new nodes. They addressed this by modifying the kubelet’s –serialize-image-pulls flag from true to false and transitioning Docker to overlay2, expediting the process. Performance was further improved by relocating the Docker root to an instance-attached SSD.

Despite optimizations, cryptic “rpc error: code = 2 desc = net/http: request canceled” messages indicated pod failures due to canceled pulls resulting from slow progress or a backlog. To resolve this, they adjusted kubelet’s –image-pull-progress-deadline to 30 minutes and Docker daemon’s max-concurrent-downloads to 10, allowing parallel pulls.

Another Docker pull challenge emerged from the Google Container Registry, where the default kubelet pull from gcr.io was crucial for new containers. To prevent node incapacity from failed pulls, OpenAI preloaded the Docker image in the machine image for Kubernetes workers by using docker image save -o /opt/preloaded_docker_images.tar and docker image load -i /opt/preloaded_docker_images.tar and whitelisted common OpenAI-internal images like Dota image for enhanced performance.

Networking

As OpenAI’s experiments grow, they increasingly rely on complex distributed systems with heavy network dependence. Initially, they faced networking challenges, evident when Kube pods using Flannel showed lower throughput (~2Gbit/s) compared to direct machine-to-machine connections (10-15Gbit/s). Similar benchmarks by Machine Zone suggested an inherent issue rather than a configuration problem. To address this, users can disable Flannel for their pods using two settings: hostNetwork: true and dnsPolicy: ClusterFirstWithHostNet.

ARP cache

Despite DNS optimizations, OpenAI encountered sporadic resolution issues. An engineer reported that the command “nc -v” to their Redis server took over 30 seconds to establish a connection. The problem was traced to the kernel’s ARP stack. Initial investigation of the Redis pod’s host revealed significant network issues—communication delays on any port and failed DNS resolution via the local dnsmasq daemon. The “dig” command displayed a cryptic failure message socket.c:1915: internal_send: 127.0.0.1#53: Invalid argument. The dmesg log provided more details: “neighbor table overflow!” revealing that the ARP cache had reached its capacity. ARP maps a network address (e.g., IPv4) to a physical address (e.g., MAC address). OpenAI addressed this by adjusting settings in /etc/sysctl.conf, ensuring a simple resolution.


net.ipv4.neigh.default.gc_thresh1 = 80000
net.ipv4.neigh.default.gc_thresh2 = 90000
net.ipv4.neigh.default.gc_thresh3 = 100000

Reference

https://openai.com/research/scaling-kubernetes-to-2500-nodes