Engineering Blog


Railyard: Accelerating Machine Learning Model Training Using Kubernetes

Stripe leverages machine learning in services like Radar and Billing, making millions of predictions daily across diverse models trained on billions of data points. To simplify model training, they developed Railyard, an API and job manager built on Kubernetes that lets teams train models independently and at scale. Railyard’s API prioritizes flexibility and ease of use, supporting Python workflows packaged with Docker and run on Kubernetes. Scaling the Kubernetes cluster lets them handle tens of thousands of models, underscoring the importance of centralized ML infrastructure for scaling effectively and letting teams focus on their own tasks. More insights on their ML architecture will be shared in the future.

Effective machine learning infrastructure for organizations

After running Railyard in production for 18 months and training thousands of models, here are our key findings:

  • Build a generic API, not tied to any single machine learning framework. Railyard proved more versatile than expected: teams used it for applications well beyond classifiers, such as time series forecasting and word2vec embeddings.
  • A fully managed Kubernetes cluster reduces operational burden across an organization. While Railyard interacts directly with the Kubernetes API, another team manages the cluster, leveraging their expertise to ensure reliable operation.
  • The Kubernetes cluster offers excellent flexibility for scaling both vertically and horizontally. We can easily adjust cluster size to accommodate increased model training or incorporate new compute resources.
  • Centralized tracking of model state and ownership simplifies observation and debugging of training jobs. Shifting from searching for job outputs to identifying job IDs streamlines performance monitoring across the cluster.
  • Building an API for model training enables us to use it everywhere. Teams integrate our API into various services, schedulers, or task runners, including training models through Airflow task definitions within broader data job workflows.
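Because Railyard exposes a plain JSON API, any scheduler or task runner can submit training jobs. A minimal sketch of what such a client call might look like, assuming a hypothetical `/train` endpoint and internal host (neither is specified in the post):

```python
import json
import urllib.request

def build_train_request(host: str, payload: dict) -> urllib.request.Request:
    """Build an HTTP POST submitting a training job to Railyard.

    The endpoint path and host are illustrative assumptions; the post
    only describes Railyard as a JSON API.
    """
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=f"{host}/train",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# An Airflow task (or any other runner) could simply call
# urllib.request.urlopen(build_train_request(...)) in its callable.
request = build_train_request(
    "https://railyard.internal.example",
    {"model_name": "fraud_prediction_model", "owner": "ml-infra"},
)
```

This is what makes the "use it everywhere" point work: the integration surface is a single HTTP call, so every scheduler that can make one can train models.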

The Railyard architecture

Railyard, a Scala service, manages job history and state in Postgres and exposes a JSON API. It orchestrates job execution through the Kubernetes API across our diverse instance types: standard tasks run on high-CPU instances, data-intensive jobs on high-memory instances, and deep learning on GPU instances. We bundle Python training code with Subpar, a Google library for building standalone Python executables, into a Docker container pushed to AWS ECR. Upon an API request, Railyard launches the corresponding job, streaming logs to S3. Each job involves multiple steps, including data fetching, model training, and serialization of results to S3 and Postgres, all accessible via the Railyard API.
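The job steps above (fetch data, train, serialize results) can be sketched as a simple pipeline. The function names and step boundaries here are illustrative, not Railyard's actual internals:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class JobResult:
    job_id: str
    model_path: str   # e.g. an S3 URI for the serialized model
    metadata: dict    # summary stats that would be recorded in Postgres

def run_training_job(job_id: str,
                     fetch_data: Callable[[], list],
                     train: Callable[[list], object],
                     serialize: Callable[[object], str]) -> JobResult:
    """Run the high-level steps of a Railyard-style job:
    fetch features, train the model, then persist the result."""
    rows = fetch_data()            # step 1: pull features (e.g. from S3)
    model = train(rows)            # step 2: run the Python training code
    model_path = serialize(model)  # step 3: write the model artifact
    return JobResult(job_id, model_path, {"n_rows": len(rows)})

# Stand-in callables to show the flow end to end.
result = run_training_job(
    job_id="job-123",
    fetch_data=lambda: [{"charge_amount": 12.0}, {"charge_amount": 3.5}],
    train=lambda rows: {"trained_on": len(rows)},
    serialize=lambda model: "s3://bucket/models/job-123.pkl",
)
```

Keeping each step behind a narrow function boundary is also what lets the same orchestration serve classifiers, forecasting, and embedding jobs alike.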

Railyard’s API design

The Railyard API enables users to define machine learning model training requirements, such as data sources and model parameters. In its design, we aimed to balance generality across multiple training frameworks with clarity and brevity for users.

Through collaboration with various internal teams, we explored different use cases. Some required ad-hoc training with SQL-based feature fetching, while others needed programmatic API calls for frequent S3-based feature retrieval. Our design iterations spanned from a custom DSL approach, integrating scikit-learn components directly into the API, to empowering users to write Python classes for their training code with defined input and output interfaces.

Ultimately, we found that a DSL-based approach was too restrictive, leading us to adopt a hybrid model. Our API offers flexibility for specifying data sources, filters, features, labels, and training parameters, while the core training logic remains in Python, providing users with greater adaptability and control.
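One way to picture "Python classes with defined input and output interfaces" is a small base class that every workflow implements. The interface below is a hypothetical sketch, not Railyard's actual contract:

```python
from abc import ABC, abstractmethod

class TrainingWorkflow(ABC):
    """Hypothetical sketch of a Railyard-style workflow interface:
    the platform supplies the fetched and filtered data, the workflow
    owns the core training logic and returns a serializable model."""

    @abstractmethod
    def train(self, features: list, labels: list, custom_params: dict):
        """Train and return a model object."""

class StripeFraudModel(TrainingWorkflow):
    # The workflow_name in the API request ("StripeFraudModel") would
    # select this class; the training logic here is a trivial stand-in.
    def train(self, features, labels, custom_params):
        threshold = custom_params.get("min_child_weight", 50)
        return {"threshold": threshold, "n_examples": len(features)}

model = StripeFraudModel().train(
    features=[[1.0, 2.0], [3.0, 4.0]],
    labels=[0, 1],
    custom_params={"min_child_weight": 50, "learning_rate": 0.02},
)
```

The hybrid split falls out of this shape: the API owns what data arrives and with which parameters, while everything inside `train` stays arbitrary Python.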

Here is a sample API request to the Railyard service:

{
  // What does this model do?
  "model_description": "A model to predict fraud",
  // What is this model called?
  "model_name": "fraud_prediction_model",
  // What team owns this model?
  "owner": "machine-learning-infrastructure",
  // What project is this model for?
  "project": "railyard-api-blog-post",
  // Which team member is training this model?
  "trainer": "robstory",
  "data": {
    "features": [
      {
        // Columns we’re fetching from Hadoop Parquet files
        "names": ["created_at", "charge_type", "charge_amount",
                  "charge_country", "has_fraud_dispute"],
        // Our data source is S3
        "source": "s3",
        // The path to our Parquet data
        "path": "s3://path/to/parquet/fraud_data.parq"
      }
    ],
    // The canonical date column in our dataset
    "date_column": "created_at",
    // Data can be filtered multiple times
    "filters": [
      // Filter out data before 2018-01-01
      {
        "feature_name": "created_at",
        "predicate": "GtEq",
        "feature_value": {
          "string_val": "2018-01-01"
        }
      },
      // Filter out data after 2019-01-01
      {
        "feature_name": "created_at",
        "predicate": "LtEq",
        "feature_value": {
          "string_val": "2019-01-01"
        }
      },
      // Filter for charges greater than $10.00
      {
        "feature_name": "charge_amount",
        "predicate": "Gt",
        "feature_value": {
          "float_val": 10.00
        }
      },
      // Filter for charges in the US or Canada
      {
        "feature_name": "charge_country",
        "predicate": "IsIn",
        "feature_value": {
          "string_vals": ["US", "CA"]
        }
      }
    ],
    // We can specify how to treat holdout data
    "holdout_sampling": {
      "sampling_function": "DATE_RANGE",
      // Split holdout data from 2018-10-01 to 2019-01-01
      // into a new dataset
      "date_range_sampling": {
        "date_column": "created_at",
        "start_date": "2018-10-01",
        "end_date": "2019-01-01"
      }
    }
  },
  "train": {
    // The name of the Python workflow we're training
    "workflow_name": "StripeFraudModel",
    // The list of features we're using in our classifier
    "classifier_features": [
      "charge_type", "charge_amount", "charge_country"
    ],
    "label": "is_fraudulent",
    // We can include hyperparameters in our model
    "custom_params": {
      "objective": "reg:linear",
      "max_depth": 6,
      "n_estimators": 500,
      "min_child_weight": 50,
      "learning_rate": 0.02
    }
  }
}
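To make the filter semantics concrete, here is a sketch of how predicates like `GtEq`, `LtEq`, `Gt`, and `IsIn` from the request above could be applied to rows. The real filtering happens inside Railyard's data-fetching step; this implementation is purely illustrative:

```python
# Illustrative predicate implementations matching the names used in
# the API request; Railyard's real evaluation logic is not shown here.
PREDICATES = {
    "GtEq": lambda value, target: value >= target,
    "LtEq": lambda value, target: value <= target,
    "Gt":   lambda value, target: value > target,
    "IsIn": lambda value, targets: value in targets,
}

def apply_filters(rows, filters):
    """Keep only the rows that satisfy every filter in the request."""
    def keep(row):
        for f in filters:
            target = next(iter(f["feature_value"].values()))
            if not PREDICATES[f["predicate"]](row[f["feature_name"]], target):
                return False
        return True
    return [row for row in rows if keep(row)]

filters = [
    {"feature_name": "charge_amount", "predicate": "Gt",
     "feature_value": {"float_val": 10.00}},
    {"feature_name": "charge_country", "predicate": "IsIn",
     "feature_value": {"string_vals": ["US", "CA"]}},
]
rows = [
    {"charge_amount": 12.50, "charge_country": "US"},  # kept
    {"charge_amount": 8.00, "charge_country": "US"},   # dropped: amount
    {"charge_amount": 25.00, "charge_country": "FR"},  # dropped: country
]
kept = apply_filters(rows, filters)
```

Filters compose with AND semantics here, matching the request's "data can be filtered multiple times" comment.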

We learned some valuable lessons while designing this API.

Building ML infrastructure

Establishing a shared machine learning infrastructure empowers teams at Stripe to work autonomously and focus on their individual ML objectives. Over the past year, Railyard has facilitated the training of numerous models across various applications, from forecasting to deep learning. This setup has led to the development of robust features for model evaluation and the creation of services to optimize hyperparameters efficiently.

Link to the Article

https://stripe.com/blog/railyard-training-models
