Rubix: Palantir’s Move to Kubernetes

In January 2017, Palantir started the Rubix project to rebuild its cloud architecture around Kubernetes. Because the majority of cloud instances were dedicated to computation, the core objective was a secure, scalable, and intelligent scheduling and execution engine for Spark and other distributed compute frameworks. Rubix has now been rolled out to the majority of the fleet, and this series of blog posts will cover the challenges of operating compute clusters on top of Kubernetes. Let’s begin!

Background

Palantir Foundry is a data management platform. It supports data integration by letting users author and execute transformation code, and it enables ad-hoc analytics. Foundry runs on distributed compute frameworks such as Apache Spark and supports collaborative, multi-tenant work. As use of the platform shifted from exploratory analysis to real-time decision-making, predictable performance became a priority, which in turn put the deployment infrastructure in focus.

Rubix

In 2017, as Kubernetes emerged as the standard deployment platform for modern PaaS systems, Palantir decided to migrate its in-house deployment infrastructure to Kubernetes. Because the design of Kubernetes resembled that of Palantir’s existing systems, the decision seemed straightforward for applications and services.

However, the goal was a unified deployment substrate for both applications and compute clusters. The critical question was whether Kubernetes-backed compute clusters could meet two requirements: (1) multi-tenant security in the presence of user-authored code, and (2) predictable performance. Palantir evaluated Kubernetes against the alternatives available at the time, such as Apache YARN and Mesos, with these two requirements in mind.

Security

In Palantir’s initial cloud architecture, there were two methods for executing user-provided code: Spark applications ran on Apache YARN, while other types of user-authored code ran in an in-house container solution. As containerization technology matured, Palantir wanted its security benefits for all user code within Foundry, not just the code managed by the in-house solution. At the time, YARN’s container support was limited, which prompted Palantir to look at Kubernetes for its security features. Kubernetes offered two compelling benefits:

It provides a comprehensive set of features for running diverse workloads in containers. Mechanisms such as pod security contexts mirrored the features previously implemented with Palantir’s container solution.
Kubernetes’ security concepts govern both built-in resources like pods and resources managed by extensions, such as Palantir’s Spark-on-Kubernetes implementation. This allowed Palantir to establish a unified and consistent approach to security across all types of first-party and user-authored code.
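
To make the pod security context idea concrete, here is a minimal sketch in Python that builds a Pod manifest with a restrictive security context. The field names (runAsNonRoot, allowPrivilegeEscalation, readOnlyRootFilesystem, capabilities) are standard Kubernetes fields, but the helper build_pod_manifest, the container name, the image, and the chosen values are illustrative assumptions, not Palantir’s actual configuration.

```python
import json


def build_pod_manifest(name: str, image: str, uid: int = 1000) -> dict:
    """Return a Pod manifest (as a dict) with a restrictive security context.

    Hypothetical helper for illustration; the values are assumptions,
    not settings taken from the Rubix project.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Pod-level settings apply to every container in the pod.
            "securityContext": {
                "runAsNonRoot": True,
                "runAsUser": uid,
                "fsGroup": uid,
            },
            "containers": [
                {
                    "name": "user-code",
                    "image": image,
                    # Container-level settings tighten what user code can do.
                    "securityContext": {
                        "allowPrivilegeEscalation": False,
                        "readOnlyRootFilesystem": True,
                        "capabilities": {"drop": ["ALL"]},
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    # Placeholder image name; JSON is valid input wherever YAML manifests are accepted.
    manifest = build_pod_manifest("example-job", "example.invalid/user-code:latest")
    print(json.dumps(manifest, indent=2))
```

Because the same securityContext fields apply to any pod, whether it is created directly or by an extension such as a Spark-on-Kubernetes integration, a single policy of this shape can cover every kind of workload.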

Predictable performance and cost

Palantir’s customers wanted jobs to run consistently without paying for unnecessary resources. This meant moving away from fixed cluster sizes to dynamic ones and ensuring that resources were used efficiently. Kubernetes, along with other platforms, offered a way to adjust performance and cost as needed.
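
As a rough illustration of what dynamic sizing means, the sketch below estimates how many nodes a cluster needs from the resource requests of pending jobs. It is a deliberately simplified model: the names Resources and nodes_needed, the node sizes, and the bounds are all hypothetical, and it ignores bin-packing and instance startup time, which a real autoscaler has to account for.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cpu_cores: float
    memory_gib: float


def nodes_needed(pending, node_capacity, min_nodes=0, max_nodes=100):
    """Estimate the node count required to fit all pending requests.

    Sums the requested CPU and memory, sizes the cluster by whichever
    resource is the bottleneck, and clamps to [min_nodes, max_nodes].
    """
    if node_capacity.cpu_cores <= 0 or node_capacity.memory_gib <= 0:
        raise ValueError("node capacity must be positive")
    total_cpu = sum(r.cpu_cores for r in pending)
    total_mem = sum(r.memory_gib for r in pending)
    by_cpu = math.ceil(total_cpu / node_capacity.cpu_cores)
    by_mem = math.ceil(total_mem / node_capacity.memory_gib)
    return max(min_nodes, min(max_nodes, max(by_cpu, by_mem)))


if __name__ == "__main__":
    # Five hypothetical jobs, each requesting 4 cores and 16 GiB.
    jobs = [Resources(cpu_cores=4, memory_gib=16)] * 5
    node = Resources(cpu_cores=16, memory_gib=64)
    print(nodes_needed(jobs, node))
```

With a fixed-size cluster this number is a constant chosen up front; with a dynamic one it is recomputed as demand changes, so customers pay only for the capacity their jobs actually request.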

However, making this switch wasn’t easy. Palantir had to solve various technical challenges, such as adjusting network sizes and minimizing the time it takes to start new compute instances. The team also improved the Kubernetes scheduler so that jobs ran smoothly. The next blog post will cover Palantir’s work on Spark-on-Kubernetes and the challenges of scheduling jobs in more detail.