AWS Lambda Under the Hood

Mike Danilov covers how Lambda is built and how the team had to modify its architecture to support 10 GiB payloads.

AWS Lambda Overview

Lambda is a serverless compute service that runs your code on demand without requiring you to manage servers. It supports many programming languages, scales rapidly to match demand, and serves millions of customers generating trillions of invocations per month. Configuration is straightforward, and Lambda offers two invocation models: synchronous, where requests are executed immediately, and asynchronous, where requests are queued. This article focuses on synchronous invocations and on the key tenets that guide Lambda's technical decisions and design choices: availability, efficiency, scalability, security, and performance, all in service of reliable, fast, and secure execution of your code.

Invoke Request Routing

This overview focuses on Lambda's invoke routing and compute infrastructure. Invoke request routing links Lambda's microservices and underpins its availability and scalability. To illustrate, imagine helping Alice run her code in the cloud. We start with a configuration service and a frontend to accept requests. In an on-demand system, however, requests can stall while sandboxes initialize, so we add a placement service to create sandboxes as needed. To speed things up further, we introduce a worker manager that reuses existing sandboxes, aiming to eliminate cold starts.

In production, Lambda's routing spans multiple availability zones, with a frontend load balancer distributing traffic. Instead of a worker manager, Lambda uses an assignment service backed by reliable, distributed storage that tracks sandbox state across the region. This improves availability and fault tolerance, particularly during availability-zone events. Combined with a leader-follower architecture, the distributed storage enables rapid failovers, underscoring how consistent state management improves both the reliability and the performance of the system.
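The reuse-or-create decision described above can be sketched as follows. All class and method names here are illustrative stand-ins, not Lambda's actual internal API; the real assignment service is a distributed, replicated system rather than an in-process map.

```python
# Sketch of the invoke-routing decision: reuse a warm sandbox when one
# exists, otherwise ask the placement service to create one (a cold start).
# Names are illustrative, not Lambda's actual internals.

class Placement:
    """Creates new execution environments on demand."""

    def __init__(self):
        self.counter = 0

    def create_sandbox(self, function_id):
        self.counter += 1
        return f"{function_id}-sb{self.counter}"

class AssignmentService:
    """Tracks idle (warm) sandboxes per function across the region."""

    def __init__(self):
        self.warm = {}  # function_id -> list of idle sandbox ids

    def acquire_sandbox(self, function_id, placement):
        idle = self.warm.get(function_id, [])
        if idle:
            return idle.pop()  # warm start: reuse an existing sandbox
        return placement.create_sandbox(function_id)  # cold start

    def release_sandbox(self, function_id, sandbox_id):
        # Invoke finished; the sandbox is idle and available for reuse.
        self.warm.setdefault(function_id, []).append(sandbox_id)
```

The key property is that a released sandbox is handed back to the next invoke of the same function, so only the first request (or a burst beyond warm capacity) pays the cold-start cost.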

Compute Fabric

Now, let’s turn our attention to the compute fabric, the backbone of our code execution infrastructure:

At its core, we have the worker fleet, composed of EC2 instances where execution environments are spun up and code is run. Managing this fleet is the capacity manager, tasked with optimizing its size based on demand and ensuring the health of its workers by replacing any unhealthy instances promptly.

Recall the placement service from our discussion on invoke routing. It’s responsible for creating execution environments, but before it does so, it selects the most suitable worker based on real-time signals analyzed by our data science team, which aids both placement and capacity management in making informed decisions.
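A toy version of that worker-selection step is sketched below. The real placement service uses richer real-time signals than these; `healthy` and `utilization` are illustrative stand-ins for signals that are not public.

```python
# Toy worker-selection step: among healthy workers, pick the one with
# the most spare capacity. The real placement service weighs many more
# real-time signals; "healthy" and "utilization" are illustrative.

def pick_worker(workers):
    """workers: list of dicts like
    {"id": ..., "healthy": bool, "utilization": float in [0, 1]}."""
    candidates = [w for w in workers if w["healthy"]]
    if not candidates:
        # Nothing suitable: the capacity manager must add instances.
        raise RuntimeError("no healthy workers available")
    return min(candidates, key=lambda w: w["utilization"])
```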

Considering data isolation: when Bob asks to run his code alongside Alice's, we explore separating their processes using operating-system boundaries to prevent interference. Loading Bob's code into the same runtime process as Alice's might seem intuitive at first, but it poses security risks, leading us to conclude that only virtual-machine isolation provides adequate protection in a multi-tenant compute system.

Initially, Lambda utilized EC2 VMs for isolation, but this resulted in resource wastage and performance issues. Collaborating with the EC2 team, we developed Firecracker, a virtualization technology tailored for serverless compute. With Firecracker, each sandbox is encapsulated within a microVM, ensuring robust isolation and efficient resource utilization.

Transitioning to Firecracker significantly reduced overhead, but challenges remain, especially regarding code size and startup time. To address this, we explore the use of VM snapshots to minimize overhead, allowing for rapid VM resumption. However, we must ensure strict memory isolation to prevent security threats, leading us to develop a copy-on-read mechanism.
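The copy-on-read idea can be illustrated with a small sketch. This is a conceptual model only (the real mechanism operates on guest memory pages inside the hypervisor); the class and field names are hypothetical.

```python
# Conceptual sketch of copy-on-read: every resumed microVM gets its own
# private copy of a snapshot page on first read, so no page backing the
# shared snapshot is ever writable by a tenant. Names are illustrative.

class CopyOnReadMemory:
    def __init__(self, shared_snapshot):
        self.shared = shared_snapshot  # read-only pages: {page_no: bytes}
        self.private = {}              # this VM's own mutable copies

    def read_page(self, page_no):
        if page_no not in self.private:
            # Copy on first read; the VM never touches the shared page again,
            # so its later writes cannot leak to (or from) other VMs.
            self.private[page_no] = bytearray(self.shared[page_no])
        return self.private[page_no]
```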

Furthermore, to maintain uniqueness among cloned VMs, we implement measures to restore randomness and identifiers at various system layers, collaborating with communities and introducing patches where necessary to ensure that resumed VMs are distinct and secure.

Snapshot Distribution

Let’s address snapshot distribution. Snapshots, often sizable at up to 30 gigabytes, pose challenges for efficient retrieval. To expedite the process, we draw inspiration from video streaming, splitting large snapshots into smaller, 512-kilobyte chunks. By adopting this approach, we only download essential chunks to resume a VM and initiate the invoke, with the remainder fetched on demand. This strategy offers two key benefits: it spreads out download time and minimizes unnecessary data transfer, enhancing overall efficiency.
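The chunking strategy above can be sketched as follows. The 512 KiB chunk size comes from the talk; the eager/on-demand split and all names are illustrative simplifications.

```python
CHUNK_SIZE = 512 * 1024  # 512 KiB chunks, as described above

def chunk_snapshot(snapshot: bytes) -> list:
    """Split a large snapshot into fixed-size chunks."""
    return [snapshot[i:i + CHUNK_SIZE]
            for i in range(0, len(snapshot), CHUNK_SIZE)]

class LazySnapshot:
    """Eagerly fetch only the chunks needed to resume the VM;
    fetch everything else on demand. Illustrative sketch."""

    def __init__(self, chunk_keys, fetch, resume_set):
        self.chunk_keys = chunk_keys
        self.fetch = fetch              # callable: key -> chunk bytes
        self.cache = {}
        for i in resume_set:            # eager: minimal set to start the VM
            self.cache[i] = fetch(chunk_keys[i])

    def read_chunk(self, i):
        if i not in self.cache:         # lazy: downloaded only if touched
            self.cache[i] = self.fetch(self.chunk_keys[i])
        return self.cache[i]
```

Chunks that the resumed function never touches are never downloaded, which is where the bandwidth savings come from.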

Here’s how it works: when a process accesses VM memory mapped to the snapshot file, it retrieves data from memory if available; otherwise, it fetches the required chunk. This process is facilitated by our indirection layer, which checks local and distributed caches before retrieving from S3 if necessary. To optimize cache hit rates, we aim to share common chunks across functions. By creating layered incremental snapshots encrypted with different keys, we ensure security while maximizing cache efficiency.
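The tiered lookup path described above (local cache, then distributed cache, then S3) can be sketched like this. The dictionaries stand in for real cache services; the names are illustrative.

```python
# Sketch of the chunk-lookup path: local cache first, then a distributed
# cache, then the authoritative store (e.g. S3), populating the caches on
# the way back. Plain dicts stand in for the real services.

class ChunkStore:
    def __init__(self, local, distributed, origin):
        self.local = local              # per-worker cache
        self.distributed = distributed  # shared cache tier
        self.origin = origin            # authoritative store

    def get_chunk(self, key):
        if key in self.local:           # fastest path
            return self.local[key]
        if key in self.distributed:
            data = self.distributed[key]
        else:
            data = self.origin[key]     # slowest path
            self.distributed[key] = data  # backfill the shared tier
        self.local[key] = data          # backfill the local tier
        return data
```

Sharing common chunks across functions raises the hit rate of both cache tiers, which is exactly why the layered, deduplicable snapshot format matters.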

In scenarios where the file’s origin is unknown, such as read-only disks, we utilize convergent encryption to deduplicate common bits, reducing latency overhead and enhancing cache efficiency.
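The core trick of convergent encryption is that the key is derived from the plaintext itself, so identical chunks encrypt to identical ciphertext and can be deduplicated without sharing a secret. A minimal sketch, using a toy XOR keystream purely for illustration (a real system would use an authenticated cipher keyed with the convergent key):

```python
import hashlib

def convergent_key(chunk: bytes) -> bytes:
    # The key is a hash of the plaintext: identical chunks yield
    # identical keys, and hence identical ciphertexts, making the
    # ciphertext deduplicable across tenants.
    return hashlib.sha256(chunk).digest()

def encrypt(chunk: bytes, key: bytes) -> bytes:
    # Toy XOR keystream for illustration only; not a real cipher.
    keystream = bytearray()
    counter = 0
    while len(keystream) < len(chunk):
        keystream.extend(
            hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(c ^ k for c, k in zip(chunk, keystream))
```

The trade-off is that convergent encryption reveals when two chunks are identical, which is acceptable here precisely because the goal is to deduplicate common, non-secret bits such as read-only disk contents.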

In our production system, we’ve streamlined this process by replacing the indirection layer with a sparse file system, optimizing snapshot distribution and system efficiency.

Have We Resolved Cold Starts?

Now that we’ve implemented snapshot distribution and VM resumption, we expect everything to run smoothly. Yet some cold invokes are still slow. To understand why, let’s revisit page caching and memory mapping. Normally, when a process reads a page from a file, the operating system assumes sequential access and preloads several pages ahead (known as read-ahead) for efficiency. But access to a memory-mapped snapshot is effectively random, so read-ahead pulls in many pages that are never used, and the resuming VM stalls while those unneeded chunks are fetched. This inefficiency shows up clearly in our latency distribution graph.

To address this, we analyze the memory access patterns across VMs and create a page access log that accompanies each snapshot. With this log, we can anticipate and load only the necessary pages in the correct order during snapshot resumption, significantly speeding up the process. By implementing this optimization, we resolve issues with cold starts. You can experience this improvement firsthand by enabling Lambda SnapStart on your Java function and observing how VM snapshots operate more efficiently.
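The page-access-log optimization can be sketched as follows: record the order in which a first resume touches snapshot pages, then on later resumes prefetch pages in that same order instead of relying on sequential read-ahead. Names and the 4 KiB page size are illustrative.

```python
# Sketch of the page-access-log optimization: record the first-touch
# order of snapshot pages, ship the log with the snapshot, and replay
# it as a prefetch plan on later resumes. Names are illustrative.

PAGE_SIZE = 4096

class PageAccessRecorder:
    def __init__(self):
        self.order = []   # pages in first-touch order
        self.seen = set()

    def record(self, offset):
        page = offset // PAGE_SIZE
        if page not in self.seen:   # log each page once, at first touch
            self.seen.add(page)
            self.order.append(page)

def prefetch_plan(access_log, fetched):
    """Pages to load, in recorded order, skipping any already fetched."""
    return [p for p in access_log if p not in fetched]
```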
