Engineering Blog


Scaling Kubernetes to 7,500 nodes

Scaling a Kubernetes cluster to 7,500 nodes is a rare feat that demands careful consideration, but it offers the benefit of a simple infrastructure that lets OpenAI’s machine learning teams scale rapidly without altering their code. Following its previous update on scaling to 2,500 nodes, OpenAI has further developed its infrastructure, imparting valuable lessons…

Connecting Kernel Panics to Kubernetes Pods: Keeping Track of Lost Nodes at Netflix

With a dedicated effort to enhance user experience on the Titus container platform, Netflix delved into the issue of “orphaned” pods – those left incomplete without a clear final status. Although this may not be a concern for Service job owners, it holds significant importance for Batch users. This blog post provides insights into our…

Slack’s journey to reliable and scalable cron execution at scale

Slack started with the classic “one box, one crontab” approach for cron jobs. Initially, it worked fine, but as the platform grew, so did the number of scripts and their processing demands. This led to several issues. Building a Better Way: Introducing Chronos. Facing these challenges, Slack opted for a custom solution: Chronos. Here’s a…

A “Krispr” Approach to Kubernetes Infrastructure: Keeping Pods Fresh and Rolling Out Updates Smoothly

Introduction: In the demanding world of modern service-oriented architectures, maintaining fresh and up-to-date infrastructure is crucial for optimal performance and security. Airbnb, with its hundreds of services relying on Kubernetes, faced challenges in efficiently updating shared infrastructure components within its platform. Its existing approach, heavily dependent on service owner upgrades, led to version fragmentation, complexity,…

Get ready for KubeCon + CloudNativeCon Europe – Cloud Innovation

The KubeCon + CloudNativeCon Europe event is just around the corner! This is your chance to dive into the world of Kubernetes and cloud-native technologies, meet industry experts, and connect with fellow enthusiasts. Important Reminder: Reserve your spot now to lock in the current rates and save before prices go up! Don’t miss out on this…

Scaling Kubernetes to 2,500 Nodes for Deep Learning at OpenAI

OpenAI, a pioneer in artificial intelligence, pushes the boundaries of Kubernetes by scaling it to manage massive deep learning workloads. While managing bare VMs remains an option for the largest tasks, Kubernetes shines for its rapid iteration cycles, reasonable scalability, and reduced development overhead. This blog dives into OpenAI’s journey building a 2,500-node Kubernetes cluster…

Overusing getters and setters

Encapsulation hides the values or state of a structured data object, preventing unauthorized parties from accessing them directly. Golang has no built-in support for getters and setters, so using them is optional. There are a few advantages to using getters and setters even in Golang, and they are mentioned below:…
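
As a rough illustration of the encapsulation idea in this excerpt, here is a minimal Go sketch. The `Account` type, its `balance` field, and the non-negative rule are hypothetical names and assumptions invented for this example; they are not taken from the post itself.

```go
package main

import (
	"errors"
	"fmt"
)

// Account is a hypothetical type used only for illustration.
// The balance field is unexported (lowercase), so code in other
// packages cannot read or write it directly.
type Account struct {
	balance int
}

// Balance is the getter. By Go convention the "Get" prefix is omitted.
func (a *Account) Balance() int {
	return a.balance
}

// SetBalance is the setter. It validates the input before storing it,
// which is one reason a setter can be worth having.
func (a *Account) SetBalance(v int) error {
	if v < 0 {
		return errors.New("balance cannot be negative")
	}
	a.balance = v
	return nil
}

func main() {
	acct := &Account{}

	if err := acct.SetBalance(100); err != nil {
		fmt.Println("unexpected error:", err)
	}
	fmt.Println(acct.Balance())

	// An invalid update is rejected and the stored value is unchanged.
	if err := acct.SetBalance(-5); err != nil {
		fmt.Println("rejected:", err)
	}
	fmt.Println(acct.Balance())
}
```

When a field needs no validation or invariant, exporting it directly is often the simpler choice in Go, which is where overuse of getters and setters tends to creep in.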

Avoid any Type in TS (anti-pattern)

What are types in TS? Types in TS help us understand which methods and properties are associated with a given value or variable in a program, which helps us analyze our code for existing errors and prevent further ones. For example, a value that is assigned the type string tells us that the value…