Setting Up a Secure Python Sandbox for LLM Agents
Anton Shemyakov

As large language models become more integrated into computational systems, their role in enhancing application efficiency and accuracy grows. However, this expanded capability brings new risks when executing autonomously generated code.
This blog post explores how to establish a secure Python sandbox for LLM agents. We will cover the threats involved with LLM-generated code and introduce a sandbox solution using gVisor and Jupyter Notebook.
LLM Agents
Incorporating LLMs into software applications can be achieved in several ways, which lie on a spectrum of agency. At one end of this spectrum is simple LLM use, where the software makes API calls and parses responses. While straightforward, this approach is vulnerable to errors and hallucinations. Towards the high-agency end are more sophisticated agentic systems where LLMs have the autonomy to use tools to achieve tasks. These systems stand out for their ability to navigate scenarios lacking a predetermined workflow, which is often the case in real-world applications.
Some agentic systems build custom workflows by selecting one of a set of predefined functions, called tools, and invoking it with specific parameters. At the high end of the agency spectrum, however, LLM agents can write and execute their own code. This capability is particularly useful in dynamic environments or for complex tasks like creating custom data visualizations, where precise and customized solutions are necessary. By generating and executing code, these agents can adapt to the specific requirements of each problem, achieving a level of customization that standard functions cannot provide.
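To make the two ends of the spectrum concrete, here is a minimal, purely illustrative sketch: a fixed tool call on the low-agency end versus agent-generated code on the high-agency end. The tool registry, the tool choice, and the generated snippet are hypothetical stand-ins, not part of any specific framework.

```python
# Low agency: the LLM may only pick a predefined tool and fill in its parameters.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # stub for a real API call

tools = {"get_weather": get_weather}

# Assumed structured response from the model: a tool name plus arguments.
llm_tool_choice = {"tool": "get_weather", "args": {"city": "Berlin"}}
result = tools[llm_tool_choice["tool"]](**llm_tool_choice["args"])

# High agency: the LLM writes arbitrary code, which must then be executed somewhere.
llm_generated_code = "total = sum(i * i for i in range(10))\nprint(total)"
exec(llm_generated_code)  # running this safely is the subject of this post
```

The second path is strictly more powerful than the first, which is exactly why it needs sandboxing.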
Sandboxing Code
With increased agency in LLM systems comes increased risk. Executing potentially unsafe code generated by these agents can expose systems to security issues, including arbitrary code execution (os.system, subprocess, etc.), resource exhaustion (denial-of-service attacks via CPU, memory, or disk overload), file system access (unauthorized reads or writes to files), and many others. Implementing a secure method to execute this code is crucial.
Mitigating these risks can be achieved through the implementation of a secure Python sandbox. The essential goal of such a sandbox is to manage resources and create safe execution environments that encapsulate potentially harmful code, preventing it from affecting the broader system.
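To see why an in-process "sandbox" is not enough, consider this small demonstration: even when exec() is given empty globals, the interpreter re-injects builtins, so untrusted code can still import os and reach the file system. The snippet below is illustrative and harmless; it only reads the current working directory.

```python
import io
import contextlib

# Attempt to "sandbox" generated code by giving exec an empty namespace.
untrusted = "import os; print(os.getcwd())"

buf = io.StringIO()
with contextlib.redirect_stdout(buf):
    exec(untrusted, {})  # empty globals do NOT block imports: builtins are re-injected

# The untrusted code reached the OS layer and printed the working directory.
escaped_output = buf.getvalue()
```

This is why the solution below isolates execution at the process, container, and kernel levels rather than inside the Python interpreter.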
The Demo Solution
One potential solution for securely executing Python code remotely consists of a FastAPI server that runs a Jupyter Notebook kernel inside a gVisor container. Here is how the different components of the solution work together:
Jupyter Notebook allows users to run interactive code notebooks. Jupyter kernels support different environments, including Python, R, Julia, JavaScript, and others. Kernels are isolated and have limited permissions but do not offer further security features. In our solution, Jupyter Notebook plays the role of a code execution environment that works out of the box.
FastAPI is a modern web framework for building APIs with Python. FastAPI serves as the interface between the LLM agent and the Jupyter kernel, allowing the agent to send code for execution over the network and receive results. FastAPI helps us to decouple the agent and the execution environment, which is important for resource management and sandbox scaling.
gVisor is a user-space kernel that provides a secure environment for running untrusted code. It acts as a barrier between the code and the host operating system, preventing unauthorized access to system resources. gVisor intercepts system calls made by the code and enforces security policies, ensuring that only safe operations are allowed. This is a crucial layer of protection for the host system from potential threats posed by executing arbitrary code.
The following code runs the FastAPI sandbox server:
# ./main.py
import asyncio
from asyncio import TimeoutError, wait_for
from contextlib import asynccontextmanager
from typing import List

from fastapi import FastAPI, HTTPException
from jupyter_client.manager import AsyncKernelManager
from pydantic import BaseModel

app = FastAPI()

allowed_packages = ["numpy", "pandas", "matplotlib", "scikit-learn"]
installed_packages: List[str] = []


class CodeRequest(BaseModel):
    code: str


class InstallRequest(BaseModel):
    package: str


class ExecutionResult(BaseModel):
    output: str


@asynccontextmanager
async def kernel_client():
    """Start a fresh Jupyter kernel and tear it down when done."""
    km = AsyncKernelManager(kernel_name="python3")
    await km.start_kernel()
    kc = km.client()
    kc.start_channels()
    await kc.wait_for_ready()
    try:
        yield kc
    finally:
        kc.stop_channels()
        await km.shutdown_kernel()


async def execute_code(code: str) -> str:
    async with kernel_client() as kc:
        msg_id = kc.execute(code)
        output = ""
        try:
            while True:
                reply = await kc.get_iopub_msg()
                if reply["parent_header"].get("msg_id") != msg_id:
                    continue
                msg_type = reply["msg_type"]
                if msg_type == "stream":
                    # Accumulate stdout/stderr chunks instead of
                    # returning after the first one
                    output += reply["content"]["text"]
                elif msg_type == "error":
                    return f"Error executing code: {reply['content']['evalue']}"
                elif msg_type == "status" and reply["content"]["execution_state"] == "idle":
                    break
        except asyncio.CancelledError:
            raise
        return output


async def install_package(package: str) -> None:
    if package not in allowed_packages:
        raise HTTPException(
            status_code=400, detail=f"Package '{package}' is not whitelisted"
        )
    if package in installed_packages:
        return
    async with kernel_client() as kc:
        try:
            kc.execute(f"!pip install {package}")
            while True:
                reply = await kc.get_iopub_msg()
                if (
                    reply["msg_type"] == "status"
                    and reply["content"]["execution_state"] == "idle"
                ):
                    break
            installed_packages.append(package)
        except Exception as e:
            raise HTTPException(
                status_code=500, detail=f"Error installing package: {str(e)}"
            )


@app.post("/install")
async def install(request: InstallRequest):
    try:
        await wait_for(install_package(request.package), timeout=120)
    except TimeoutError:
        raise HTTPException(status_code=400, detail="Package installation timed out")
    return {"message": f"Package '{request.package}' installed successfully."}


@app.post("/execute", response_model=ExecutionResult)
async def execute(request: CodeRequest) -> ExecutionResult:
    try:
        output = await wait_for(execute_code(request.code), timeout=120)
    except TimeoutError:
        raise HTTPException(status_code=400, detail="Code execution timed out")
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
    return ExecutionResult(output=output)


if __name__ == "__main__":
    import uvicorn

    uvicorn.run(app, host="127.0.0.1", port=8000)
This minimalistic sandbox implementation exposes two endpoints: /execute for executing code and /install for installing whitelisted packages. Code execution is performed in a separate Jupyter kernel, which is managed by the AsyncKernelManager, and the console output text is returned to the client. The server is designed to handle timeouts and exceptions gracefully.
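For completeness, here is a hedged sketch of how an LLM agent (or any client) might call the sandbox over HTTP. It assumes the server above is reachable at 127.0.0.1:8000; the helper names are illustrative, not part of the server code.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # assumed address of the sandbox server


def build_payload(code: str) -> bytes:
    # Matches the CodeRequest model expected by the /execute endpoint.
    return json.dumps({"code": code}).encode("utf-8")


def execute_remotely(code: str) -> str:
    request = urllib.request.Request(
        f"{BASE_URL}/execute",
        data=build_payload(code),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["output"]


# With the server running, an agent would call e.g.:
# execute_remotely("print(1 + 1)")
```

Because the interface is plain JSON over HTTP, the same calls work from any language or agent framework.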
The following Dockerfile builds the container image for the sandbox server:
# Dockerfile
FROM jupyter/base-notebook
WORKDIR /app
COPY main.py /app/main.py
COPY requirements.txt /app/requirements.txt
RUN pip install -r requirements.txt
# Switch to jovyan non-root user defined in the base image
USER jovyan
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
Although this Dockerfile is very simple, it enables deployment of the sandbox solution in a containerized environment. The container runs as a non-root user, which is a good security practice.
At dida we use Google Kubernetes Engine to manage our Kubernetes clusters, which natively supports gVisor as a container runtime. To enable deployment of gVisor-protected workloads, we first need to create a node pool with GKE Sandbox enabled. Note that to turn this security feature on, the cluster must also have a second, standard node pool, because GKE-managed system workloads must run separately from untrusted sandboxed workloads.
Once the node pool is created, we can deploy the sandbox container image to the cluster with the following Kubernetes manifest:
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-sandbox
  namespace: demos
  labels:
    app: agent-sandbox
spec:
  replicas: 1
  selector:
    matchLabels:
      app: agent-sandbox
  template:
    metadata:
      labels:
        app: agent-sandbox
    spec:
      runtimeClassName: gvisor
      containers:
        - name: agent-sandbox
          image: "${IMAGE_REGISTRY}/${IMAGE_REPOSITORY}:${IMAGE_TAG}"
          ports:
            - name: http
              containerPort: 8000
              protocol: TCP
          resources:
            requests:
              memory: "250Mi"
              cpu: "250m"
            limits:
              memory: "500Mi"
              cpu: "500m"
Note that the runtimeClassName field is set to gvisor, which instructs Kubernetes to use gVisor as the container runtime for this deployment. To control sandbox resource allocation, we set resource requests and limits for CPU and memory. This ensures that the sandbox container has sufficient resources to operate while preventing it from consuming excessive resources that could affect other workloads in the cluster.
Capabilities of the Demo Solution
The demo solution is easy to deploy and manage, making it suitable for various use cases. The interface is accessible via a REST API, which is framework-agnostic and can be integrated with any LLM agent. The solution is designed to be extensible, allowing for the addition of new features and enhancements as needed. For example, one can add support for additional programming languages or integrate with other tools and services. In addition, the solution can be easily scaled to handle increased workloads by deploying multiple instances of the sandbox container in a Kubernetes cluster. Containerization minimizes performance overhead compared to traditional virtual machines, making it suitable for high-performance applications.
While only a proof of concept for a code sandbox, the demo showcases the following security features:
A standalone containerized sandbox provides isolation and minimizes dependencies between agents.
Installable packages are restricted to a whitelist, reducing risks associated with dependency threats.
The following security features are provided by using gVisor as the container runtime:
Isolation of the execution environment from the host system.
Sandboxing gVisor itself from the host kernel.
Running the container with the least amount of privileges.
Continuous development and maintenance of gVisor by security experts, ensuring up-to-date security features.
Kubernetes enables efficient CPU, memory, and storage resource management.
Limitations of the Demo Solution
The following limitations of the demo should be addressed before it can be used in production:
At the moment every request to the sandbox creates a new Jupyter kernel, which is not efficient. This can be improved by reusing existing kernels or implementing a more sophisticated kernel management strategy.
In addition to managing the lifecycle of Jupyter kernels, the solution should also handle session and state management. This includes authentication, authorization, and maintaining user sessions to ensure secure access to the sandbox environment.
It might be beneficial to LLM agents to generate responses that include non-textual elements, in particular images. The current solution does not support these types of responses, even though image output is supported by Jupyter.
Sandbox ingress and egress traffic should be filtered to prevent data exfiltration and unauthorized access to external resources.
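The first limitation above, a fresh kernel per request, could be addressed with a small pool of pre-started kernels that requests borrow and return. The sketch below uses a plain asyncio.Queue and a placeholder kernel factory; a real implementation would create AsyncKernelManager instances in the factory and handle kernel health and recycling.

```python
import asyncio


class KernelPool:
    """Minimal pool sketch: pre-start N kernels, lend them out per request."""

    def __init__(self, factory, size: int):
        self._factory = factory
        self._size = size
        self._pool: asyncio.Queue = asyncio.Queue()

    async def start(self) -> None:
        # Eagerly create all kernels so requests never pay startup latency.
        for _ in range(self._size):
            await self._pool.put(await self._factory())

    async def acquire(self):
        return await self._pool.get()  # waits if every kernel is busy

    async def release(self, kernel) -> None:
        await self._pool.put(kernel)


async def demo():
    # Placeholder factory; a real pool would start Jupyter kernels here.
    counter = 0

    async def make_kernel():
        nonlocal counter
        counter += 1
        return f"kernel-{counter}"

    pool = KernelPool(make_kernel, size=2)
    await pool.start()
    kernel = await pool.acquire()  # borrowed for the duration of one request
    await pool.release(kernel)
    return kernel


borrowed = asyncio.run(demo())
```

With such a pool, the /execute handler would acquire a kernel instead of calling kernel_client(), at the cost of having to reset kernel state between requests.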
Conclusion
The demo solution includes features like easy deployment, framework-agnostic integration, and scalability through Kubernetes. It effectively isolates execution environments using gVisor, ensuring robust security with minimal performance overhead. However, some limitations need addressing for production use, such as optimizing Jupyter kernel management, enabling authentication and authorization, and enforcing strong network security controls.
By leveraging code sandboxes teams can build advanced LLM solutions with high agency, allowing these applications to autonomously execute tasks while minimizing security risks. As the technology behind LLMs continues to advance, keeping pace with robust and flexible security measures will be essential for utilizing their full potential in innovative and impactful ways.