Engineering Blog

Connecting Kernel Panics to Kubernetes Pods: Keeping Track of Lost Nodes at Netflix

As part of an effort to improve the user experience on the Titus container platform, Netflix investigated “orphaned” pods: pods that never finished and were left without a clear final status. Service job owners may not care much, but for Batch users a missing final status matters a great deal. This post covers our exploration of kernel panics, how we connected them to Kubernetes (k8s), and what that means for Titus operators, which explains where many orphaned pods come from.

Where Orphaned Pods Originate

Orphaned pods appear when the underlying k8s node object disappears: once the node is gone, the pods bound to it are garbage collected. To preserve a record, Titus runs a custom controller that stores the history of Pod and Node objects. That history alone, however, did not explain why a pod was lost, so users were still left without a satisfying answer.

Root of Lost Nodes

Nodes can vanish for many reasons, especially in cloud environments. To trace the cause, Titus annotates pods with a termination reason, such as hardware failure or preemption by a higher-priority job, so that users can see why their pod ended.
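To make the annotation idea concrete, here is a minimal sketch of the patch body such a controller might send to a pod. The annotation keys and the `termination_patch` helper are hypothetical; the post does not name the keys Titus actually uses.

```python
import json
from typing import Optional

# Hypothetical annotation keys; the real Titus keys are not given in the post.
REASON_KEY = "example.com/pod-termination-reason"
DETAIL_KEY = "example.com/pod-termination-detail"


def termination_patch(reason: str, detail: Optional[str] = None) -> dict:
    """Build a JSON merge patch that records why a pod was terminated."""
    annotations = {REASON_KEY: reason}
    if detail:
        # Only attach the free-form detail when one is available.
        annotations[DETAIL_KEY] = detail
    return {"metadata": {"annotations": annotations}}


patch = termination_patch(
    "kernel_panic", "Kernel panic - not syncing: Fatal exception"
)
print(json.dumps(patch, indent=2))
```

A body shaped like this could be submitted as a merge patch against the pod by whichever k8s client the controller uses.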

Capturing Kernel Panics

To capture kernel panics, Titus configured netconsole, a Linux kernel module that sends kernel log messages over UDP. The idea draws on the “last gasp” UDP packet approach used by Google Spanner. With this configuration, a Linux server can send one final message as it panics, and that message identifies the reason for the panic.
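The receiving side of that last message can be sketched as a small UDP listener. This is an illustration under assumptions, not the Titus implementation: the function names are hypothetical, the port is assumed to be 6666 (the customary netconsole target port), and detection relies only on the `Kernel panic - not syncing:` prefix that the Linux kernel prints when it panics.

```python
import socket
from typing import Optional

# Prefix the Linux kernel prints at the start of a panic message.
PANIC_MARKER = "Kernel panic - not syncing:"


def extract_panic_reason(datagram: bytes) -> Optional[str]:
    """Return the panic reason if the datagram contains a panic line, else None."""
    text = datagram.decode("utf-8", errors="replace")
    for line in text.splitlines():
        idx = line.find(PANIC_MARKER)
        if idx >= 0:
            # Everything after the marker is the human-readable reason.
            return line[idx + len(PANIC_MARKER):].strip()
    return None


def listen(host: str = "0.0.0.0", port: int = 6666) -> None:
    """Block forever, printing the sender and reason for each panic seen."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((host, port))
        while True:
            data, (src_ip, _src_port) = sock.recvfrom(65535)
            reason = extract_panic_reason(data)
            if reason is not None:
                print(f"kernel panic from {src_ip}: {reason}")
```

Calling `listen()` starts the loop. The sender's IP address is kept because it is one plausible way to tie a panic back to a specific machine.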

Connecting to Kubernetes

A k8s controller listens for netconsole UDP packets and watches them for kernel panics. When it detects one, it looks up the associated k8s node object, annotates and deletes the pods bound to that node, and then annotates and deletes the node itself. Because this cleanup happens immediately, the pods no longer have to wait for garbage collection.
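The sequence above can be sketched against an in-memory stand-in for the k8s API. Everything here is illustrative: `InMemoryCluster`, `handle_panic`, and the annotation key are hypothetical, and matching a panic to a node by the packet's source IP is an assumption, since the post does not say how the lookup is done.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

REASON_KEY = "example.com/termination-reason"  # hypothetical annotation key


@dataclass
class Pod:
    name: str
    node_name: str
    annotations: Dict[str, str] = field(default_factory=dict)


@dataclass
class Node:
    name: str
    internal_ip: str
    annotations: Dict[str, str] = field(default_factory=dict)


class InMemoryCluster:
    """Stand-in for a k8s API client, so the flow can run without a cluster."""

    def __init__(self, nodes: List[Node], pods: List[Pod]) -> None:
        self.nodes = {n.name: n for n in nodes}
        self.pods = {p.name: p for p in pods}
        self.deleted: List[str] = []  # deletion order, for inspection

    def node_by_ip(self, ip: str) -> Optional[Node]:
        return next((n for n in self.nodes.values() if n.internal_ip == ip), None)

    def pods_on(self, node_name: str) -> List[Pod]:
        return [p for p in self.pods.values() if p.node_name == node_name]

    def delete_pod(self, name: str) -> None:
        del self.pods[name]
        self.deleted.append(f"pod/{name}")

    def delete_node(self, name: str) -> None:
        del self.nodes[name]
        self.deleted.append(f"node/{name}")


def handle_panic(cluster: InMemoryCluster, src_ip: str, reason: str) -> bool:
    """Annotate and delete the pods on the panicked node, then the node itself."""
    node = cluster.node_by_ip(src_ip)
    if node is None:
        return False  # packet did not come from a known node
    message = f"kernel panic: {reason}"
    for pod in cluster.pods_on(node.name):
        pod.annotations[REASON_KEY] = message  # annotate before deleting
        cluster.delete_pod(pod.name)
    node.annotations[REASON_KEY] = message
    cluster.delete_node(node.name)
    return True


pod_a = Pod("batch-job-1", "node-a")
pod_b = Pod("batch-job-2", "node-a")
pod_c = Pod("service-1", "node-b")
cluster = InMemoryCluster(
    [Node("node-a", "10.0.0.1"), Node("node-b", "10.0.0.2")],
    [pod_a, pod_b, pod_c],
)
handled = handle_panic(cluster, "10.0.0.1", "Fatal exception")
print(cluster.deleted)
```

Pods are handled before the node so that each pod is annotated with the panic reason before it is deleted, rather than being garbage collected later with no explanation.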

Conclusion

Marking a job as failed because of a kernel panic is not a fully satisfying outcome. Still, the added observability lets Titus at Netflix identify and address kernel panics promptly, which makes for a more robust and reliable container platform.

Reference

https://netflixtechblog.com/kubernetes-and-kernel-panics-ed620b9c6225
