# Benchmarking and Evolving Agent Prompts for Single-Cell Data Analysis

**⚠️ Work in Progress:** This tooling is currently under development. Its primary goal is to facilitate rapid iteration, testing, evaluation, and evolution of LLM agent prompts for analyzing single-cell transcriptomics datasets using a secure code execution sandbox.

## Overview

This framework provides the necessary tools to:

1. **Discover and Download Datasets:** Browse and fetch datasets (specifically from the CZI CELLxGENE Census) along with their metadata.
2. **Secure Code Execution:** Run Python code generated by an AI agent within an isolated Docker container (sandbox). The sandbox runs a Jupyter kernel managed by a **FastAPI service**, providing a stable HTTP interface for code execution.
3. **Agent Interaction & Testing (`OneShotAgentTester.py`):** Orchestrate interactions between an AI agent (powered by OpenAI's API), a selected dataset, and the code execution sandbox (via the FastAPI service). Allows testing prompts with a limited number of code execution attempts.
4. **Results Conversion (`output_to_notebook.py`):** Convert the detailed JSON logs from test runs into Jupyter Notebooks (`.ipynb`) for easier review and analysis reproduction.
5. **AI-Powered Evaluation (`evaluator.py`):** Use an LLM (such as GPT-4o) to automatically evaluate the agent's performance based on the conversation logs, assigning a grade and providing comments.
6. **Automated Prompt Evolution (`prompt_evolver.py`):** Iteratively refine an initial agent prompt based on an objective, test results, and AI evaluation feedback, automatically discovering more effective prompts.

## Components

The framework consists of the following main components:

* `make_benchmarking_env.sh`: An interactive script to securely prompt for and save your OpenAI API key.
* `.env`: The file (created by the script) storing the `OPENAI_KEY`. This file should be added to your `.gitignore`.
* **`tools/czi_browser.py`:**
    * A CLI tool for listing CZI CELLxGENE Census versions and datasets.
    * Allows downloading specific datasets (`.h5ad`) and metadata (`.json`) to the `datasets/` directory.
* **`sandbox/`:** Contains the code execution environment.
    * `Dockerfile`: Defines the Docker image based on a Python base, adding the necessary Python/system dependencies, Jupyter components, FastAPI, Uvicorn, and the application code.
    * `kernel_api.py`: The FastAPI application running inside the container. It receives code execution requests via HTTP, interacts with a local Jupyter kernel using `jupyter_client`, captures results (stdout, stderr, errors, display data), and returns them as JSON.
    * `start_kernel.py`: A simple script used internally by `start.sh` to launch the Jupyter kernel process with specific arguments (e.g., listening IP, ports).
    * `start.sh`: The main startup script run by the container (managed by `tini`). It launches the Jupyter kernel in the background and then starts the Uvicorn server running the `kernel_api.py` FastAPI app.
    * `benchmarking_sandbox_management.py`: A Python script (with CLI and interactive modes) used primarily for building the sandbox image and manually starting/stopping the container (which runs the API service). Direct kernel interaction commands have been removed.
* **`datasets/`:** (Created by `czi_browser.py`)
    * Stores downloaded `.h5ad` data files and `.json` metadata files.
* **`outputs/`:** (Created automatically)
    * Default directory for storing JSON logs from `OneShotAgentTester.py` and `PromptEvolver.py`, evaluation results from `evaluator.py`, as well as any generated notebooks/images.
* **`OneShotAgentTester.py`:**
    * Orchestrates a single test run for one or more prompts against a dataset.
    * Starts the sandbox container (via `SandboxManager`).
    * Copies the dataset into the running container.
    * Checks that the internal API service is responsive.
    * Manages the interaction loop with the OpenAI API (using the specified agent model).
    * When the agent generates code, sends that code to the sandbox's FastAPI `/execute` endpoint using the `requests` library (see the example request after this list).
    * Formats the JSON response (stdout, stderr, errors, display data) from the API and feeds it back to the agent.
    * Saves the full conversation log for the test run(s) to a JSON file in the `outputs/` directory.
* **`output_to_notebook.py`:**
    * An interactive script that takes a results JSON file (from `OneShotAgentTester` or `PromptEvolver`) as input.
    * Converts the conversation log, including code cells and their outputs (stdout, stderr, errors, display data), into a Jupyter Notebook (`.ipynb`) file.
    * Saves the `.ipynb` file in the same directory as the input JSON.
* **`evaluator.py`:**
    * An interactive script that processes results JSON files from a specified input directory (defaults to `outputs/`).
    * For each test run in the JSON, it formats the conversation and sends it to an OpenAI model (the specified evaluator model) with instructions to evaluate the agent's performance (a 0-100 grade plus comments) against defined criteria (e.g., correctness, efficiency, clarity).
    * Saves the evaluations (grade and comments) to JSON files (either aggregated or individual) in a specified output location (defaults to the input directory).
* **`prompt_evolver.py`:**
    * An orchestrator script for automatically refining prompts.
    * Takes an initial prompt, an objective, a dataset, and the number of iterations.
    * In each iteration:
        * Runs the current prompt using the testing logic (`run_single_test_iteration`).
        * Evaluates the result using the evaluation logic (`call_openai_evaluator`).
        * Calls another OpenAI model (the specified evolver model) to generate an improved prompt based on the objective, the previous prompt, a conversation summary, and the evaluation feedback.
        * Uses the evolved prompt for the next iteration.
    * Saves a detailed log of the entire evolution process (prompts, test data, evaluations) along with the final evolved prompt.
* **`requirements.txt`:** (Top-level)
    * Lists Python packages required by the *host* scripts (`OneShotAgentTester.py`, `evaluator.py`, `prompt_evolver.py`, `czi_browser.py`, etc.). Key dependencies include `openai`, `python-dotenv`, `requests`, `docker`, `rich`, and `nbformat`.
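For illustration, the exchange between `OneShotAgentTester.py` and the sandbox service looks roughly like the sketch below. The port, the `code` request field, and the response field names are assumptions based on the description above rather than the exact schema; see `sandbox/kernel_api.py` for the real interface.

```python
import requests

# Assumed address of the sandbox's FastAPI service exposed by the running container
# (the actual host port mapping is set up by the sandbox management script).
SANDBOX_API = "http://localhost:8000"

# Code the agent wants to run inside the container's Jupyter kernel.
code = "import anndata as ad\nprint(ad.__version__)"

# POST the code to the /execute endpoint; the service runs it on the kernel and
# returns the captured outputs as JSON.
response = requests.post(f"{SANDBOX_API}/execute", json={"code": code}, timeout=300)
response.raise_for_status()
result = response.json()

# Illustrative response handling: the service reports stdout, stderr, errors,
# and display data captured from the kernel (field names here are assumptions).
print(result.get("stdout", ""))
if result.get("stderr"):
    print("stderr:", result["stderr"])
if result.get("error"):
    print("execution error:", result["error"])
```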
## Setup

1. **Prerequisites:**
    * Python (3.10+ recommended)
    * `pip` (Python package installer)
    * Docker Desktop or Docker Engine (must be running)
    * Git (for cloning the repository)
2. **Install Host Python Dependencies:**
    * Create and activate a Python virtual environment (recommended):
      ```bash
      python -m venv venv
      source venv/bin/activate  # Linux/macOS
      # venv\Scripts\activate   # Windows CMD
      ```
    * Install the required packages for the host scripts:
      ```bash
      pip install -r requirements.txt
      ```
3. **Set OpenAI API Key:**
    * Make the script executable: `chmod +x make_benchmarking_env.sh`
    * Run the script and enter your key when prompted: `./make_benchmarking_env.sh`
    * This creates the `.env` file. **Ensure `.env` is listed in your `.gitignore` file.**
4. **Prepare Sandbox Requirements:**
    * Edit `sandbox/requirements.txt` to include all the additional Python packages needed *inside* the container for agent code execution (e.g., `pandas`, `numpy`, `scipy`, `scikit-learn`, `anndata`, `matplotlib`, `seaborn`). Ensure these are compatible with the base Python version in the `Dockerfile`.
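As a minimal example, a `sandbox/requirements.txt` covering the packages mentioned above could look like the following (versions unpinned here; pin them if you need reproducible builds):

```
pandas
numpy
scipy
scikit-learn
anndata
matplotlib
seaborn
```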
## Usage

1. **Download a Dataset:**
    * Use the `tools/czi_browser.py` script (run `python tools/czi_browser.py` for interactive mode) to find and download a dataset to the `datasets/` directory.
2. **Test a Prompt (`OneShotAgentTester.py`):**
    * Run the script: `python OneShotAgentTester.py`
    * Follow the prompts to select the prompt source (paste, file, or folder), the dataset, and the maximum number of code execution attempts.
    * The script starts the sandbox, runs the test(s) by communicating with the internal API, and saves the results to a JSON file in `outputs/`.
3. **Convert Results to a Notebook (`output_to_notebook.py`):**
    * Run the script: `python output_to_notebook.py`
    * Enter the path to a results JSON file (e.g., `outputs/benchmark_results_....json`).
    * An `.ipynb` file will be generated in the same directory.
4. **Evaluate Results (`evaluator.py`):**
    * Run the script: `python evaluator.py`
    * Enter the path to the folder containing results JSON files (defaults to `outputs/`).
    * Enter the desired output location for the evaluation files.
    * The script calls OpenAI to evaluate each test run and saves the grades/comments.
5. **Evolve a Prompt (`prompt_evolver.py`):**
    * Run the script: `python prompt_evolver.py`
    * Enter the overall objective for the prompt.
    * Provide the initial prompt (paste or file path).
    * Select the dataset.
    * Enter the number of evolution iterations.
    * Specify the output directory for logs.
    * The script runs the test-evaluate-evolve loop (sketched below) and saves the full log and the final prompt.
6. **Manage Sandbox Manually (Optional):**
    * Use `sandbox/benchmarking_sandbox_management.py` for basic container control (building the image, starting and stopping the container); run `python benchmarking_sandbox_management.py` from the `sandbox/` directory for interactive mode.
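To make the evolution loop in step 5 concrete, here is a simplified sketch of what `prompt_evolver.py` does on each iteration. `run_single_test_iteration` and `call_openai_evaluator` are the functions named in the Components section; `call_openai_evolver` and the exact argument lists are hypothetical stand-ins for the evolver-model call, so treat this as pseudocode rather than the script's actual implementation.

```python
def evolve_prompt(initial_prompt, objective, dataset_path, max_attempts, num_iterations):
    """Sketch of the test -> evaluate -> evolve loop (argument lists are assumptions)."""
    prompt = initial_prompt
    evolution_log = []
    for _ in range(num_iterations):
        # 1. Test the current prompt against the dataset in the sandbox.
        test_log = run_single_test_iteration(prompt, dataset_path, max_attempts)
        # 2. Ask the evaluator model for a 0-100 grade and comments on the run.
        evaluation = call_openai_evaluator(test_log)
        # 3. Ask the evolver model for an improved prompt based on the objective,
        #    the previous prompt, a conversation summary, and the evaluation feedback.
        prompt = call_openai_evolver(objective, prompt, test_log, evaluation)
        evolution_log.append({"prompt": prompt, "test": test_log, "evaluation": evaluation})
    return prompt, evolution_log
```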