
Commit 7be1fe5

updated README
1 parent 4b0f98b commit 7be1fe5

1 file changed

Lines changed: 107 additions & 74 deletions

File tree

benchmarking/README.md

# Benchmarking and Evolving Agent Prompts for Single-Cell Data Analysis

**⚠️ Work in Progress:** This tooling is currently under development. Its primary goal is to facilitate rapid iteration, testing, evaluation, and evolution of LLM agent prompts for analyzing single-cell transcriptomics datasets using a secure code execution sandbox.

## Overview

This framework provides the necessary tools to:

1. **Discover and Download Datasets:** Browse and fetch datasets (specifically from the CZI CELLxGENE Census) along with their metadata.
2. **Secure Code Execution:** Run Python code generated by an AI agent within an isolated Docker container (sandbox). The sandbox runs a Jupyter kernel managed by a **FastAPI service**, providing a stable HTTP interface for code execution.
3. **Agent Interaction & Testing (`OneShotAgentTester.py`):** Orchestrate interactions between an AI agent (powered by OpenAI's API), a selected dataset, and the code execution sandbox (via the FastAPI service). Allows testing prompts with a limited number of code execution attempts.
4. **Results Conversion (`output_to_notebook.py`):** Convert the detailed JSON logs from test runs into Jupyter Notebooks (`.ipynb`) for easier review and analysis reproduction.
5. **AI-Powered Evaluation (`evaluator.py`):** Use an LLM (such as GPT-4o) to automatically evaluate the agent's performance based on the conversation logs, assigning a grade and providing comments.
6. **Automated Prompt Evolution (`prompt_evolver.py`):** Iteratively refine an initial agent prompt based on an objective, test results, and AI evaluation feedback to automatically discover more effective prompts.

## Components

The framework consists of the following main components:

* `make_benchmarking_env.sh`: An interactive script to securely prompt for and save your OpenAI API key.
* `.env`: The file (created by the script) storing the `OPENAI_KEY`. This file should be added to your `.gitignore`.
* **`tools/czi_browser.py`:**
  * A CLI tool (with an interactive mode) for listing CZI CELLxGENE Census versions and datasets.
  * Allows downloading specific datasets (`.h5ad`) and metadata (`.json`) to the `datasets/` directory.
* **`sandbox/`:** Contains the code execution environment.
  * `Dockerfile`: Defines the Docker image based on a Python base image, adding the necessary Python/system dependencies, Jupyter components, FastAPI, Uvicorn, and the application code.
  * `requirements.txt`: Lists Python packages installed *inside* the sandbox container (e.g., `anndata`, `scanpy`, `matplotlib`).
  * `kernel_api.py`: The FastAPI application running inside the container. It receives code execution requests via HTTP, interacts with a local Jupyter kernel using `jupyter_client`, captures results (stdout, stderr, errors, display data), and returns them as JSON.
  * `start_kernel.py`: A simple script used internally by `start.sh` to launch the Jupyter kernel process with specific arguments (e.g., listening IP, ports).
  * `start.sh`: The main startup script run by the container (managed by `tini`). It launches the Jupyter kernel in the background and then starts the Uvicorn server running the `kernel_api.py` FastAPI app.
  * `benchmarking_sandbox_management.py`: A Python script (with CLI and interactive modes) used primarily for building the sandbox image and manually starting/stopping the container (which runs the API service). Direct kernel interaction commands have been removed.
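For intuition, the core of an API like `kernel_api.py` is collecting the kernel's IOPub messages into one JSON-able reply. Below is a minimal sketch assuming the standard Jupyter messaging protocol message types (`stream`, `error`, `display_data`, `execute_result`); the field names of the returned dict are invented for illustration and are not this project's actual schema.

```python
# Hypothetical sketch: fold Jupyter IOPub messages into one JSON-able result.
# The keys of `result` are illustrative assumptions, not the project's schema.
def collect_outputs(iopub_messages):
    result = {"stdout": "", "stderr": "", "error": None, "display_data": []}
    for msg in iopub_messages:
        mtype, content = msg["msg_type"], msg["content"]
        if mtype == "stream":                       # print() output / warnings
            result[content["name"]] += content["text"]
        elif mtype == "error":                      # uncaught exception
            result["error"] = {"ename": content["ename"],
                               "evalue": content["evalue"],
                               "traceback": content.get("traceback", [])}
        elif mtype in ("display_data", "execute_result"):
            result["display_data"].append(content["data"])  # e.g. image/png
    return result

msgs = [
    {"msg_type": "stream", "content": {"name": "stdout", "text": "42\n"}},
    {"msg_type": "execute_result", "content": {"data": {"text/plain": "42"}}},
]
print(collect_outputs(msgs)["stdout"])  # → 42
```

A FastAPI endpoint would then only need to run the code via `jupyter_client`, gather the IOPub messages until the kernel reports idle, and return a dict like this as JSON.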
* **`datasets/`:** (Created by `czi_browser.py`)
  * Stores downloaded `.h5ad` data files and `.json` metadata files.
* **`outputs/`:** (Created automatically)
  * Default directory for storing JSON logs from `OneShotAgentTester.py` and `prompt_evolver.py`, evaluation results from `evaluator.py`, and any generated notebooks/images.
* **`OneShotAgentTester.py`:**
  * Orchestrates a single test run for one or more prompts against a dataset.
  * Starts the sandbox container (via `SandboxManager`).
  * Copies the dataset into the running container.
  * Checks that the internal API service is responsive.
  * Manages the interaction loop with the OpenAI API (using the specified agent model).
  * When the agent generates code, sends it to the sandbox's FastAPI `/execute` endpoint using the `requests` library.
  * Formats the JSON response (stdout, stderr, errors, display data) from the API and feeds it back to the agent.
  * Saves the full conversation log for the test run(s) to a JSON file in the `outputs/` directory.
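To make the API round-trip concrete, here is a small sketch of the feedback-formatting step. The response field names (`stdout`, `stderr`, `error`, `display_data`), the endpoint URL, and the port are assumptions for illustration, not the actual contract.

```python
# Hypothetical sketch; response field names and the port are assumptions.
def format_execution_result(result: dict) -> str:
    """Render the sandbox API's JSON reply as plain text for the agent."""
    parts = []
    if result.get("stdout"):
        parts.append("STDOUT:\n" + result["stdout"])
    if result.get("stderr"):
        parts.append("STDERR:\n" + result["stderr"])
    if result.get("error"):
        err = result["error"]
        parts.append(f"ERROR: {err.get('ename')}: {err.get('evalue')}")
    for item in result.get("display_data", []):
        parts.append("[display data: " + ", ".join(item) + "]")
    return "\n\n".join(parts) or "(no output)"

# The call itself would then look roughly like:
# resp = requests.post("http://localhost:8000/execute",
#                      json={"code": code}, timeout=300)
# feedback = format_execution_result(resp.json())
```

Keeping the formatting in a pure function like this makes the agent-facing output easy to unit-test without a running container.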
* **`output_to_notebook.py`:**
  * An interactive script that takes a results JSON file (from `OneShotAgentTester.py` or `prompt_evolver.py`) as input.
  * Converts the conversation log, including code cells and their outputs (stdout, stderr, errors, display data), into a Jupyter Notebook (`.ipynb`) file.
  * Saves the `.ipynb` file in the same directory as the input JSON.
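Since an `.ipynb` file is just JSON, the conversion idea can be pictured with the standard library alone. This sketch uses an invented, simplified log format (`role`/`content` for chat turns, `code`/`output` for executions); the real results JSON is richer. Cell fields follow the nbformat v4 schema.

```python
import json

# Hypothetical, simplified log entries; the real schema is richer.
def code_cell(source, stdout=""):
    outputs = ([{"output_type": "stream", "name": "stdout", "text": stdout}]
               if stdout else [])
    return {"cell_type": "code", "execution_count": None, "metadata": {},
            "source": source, "outputs": outputs}

def markdown_cell(text):
    return {"cell_type": "markdown", "metadata": {}, "source": text}

def log_to_notebook(entries):
    cells = []
    for e in entries:
        if "code" in e:
            cells.append(code_cell(e["code"], e.get("output", "")))
        else:
            cells.append(markdown_cell(f"**{e['role']}**: {e['content']}"))
    return {"nbformat": 4, "nbformat_minor": 5, "metadata": {}, "cells": cells}

nb = log_to_notebook([
    {"role": "user", "content": "Load the dataset."},
    {"role": "assistant", "code": "print('loaded')", "output": "loaded\n"},
])
notebook_json = json.dumps(nb, indent=1)  # write to *.ipynb and Jupyter opens it
```

In practice a library such as `nbformat` handles the schema details (cell ids, validation) for you.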
* **`evaluator.py`:**
  * An interactive script that processes results JSON files from a specified input directory (defaults to `outputs/`).
  * For each test run in a JSON file, formats the conversation and sends it to an OpenAI model (the specified evaluator model) with instructions to evaluate the agent's performance (a 0-100 grade plus comments) against defined criteria (e.g., correctness, efficiency, clarity).
  * Saves the evaluations (grade and comments) to JSON files (either aggregated or individual) in a specified output location (defaults to the input directory).
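One practical detail in this kind of evaluator is recovering the structured grade from free-form model output. A sketch, assuming the evaluator model is asked to reply with a JSON object like `{"grade": 85, "comments": "..."}` (an assumed reply format, not necessarily the project's actual one):

```python
import json
import re

# Hypothetical tolerant parser for an assumed {"grade", "comments"} reply.
def parse_evaluation(reply: str) -> dict:
    """Extract {'grade': int, 'comments': str}, tolerating prose or ``` fences."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in evaluator reply")
    data = json.loads(match.group(0))
    grade = int(data["grade"])
    if not 0 <= grade <= 100:
        raise ValueError(f"grade out of range: {grade}")
    return {"grade": grade, "comments": str(data.get("comments", ""))}

reply = 'Sure:\n```json\n{"grade": 85, "comments": "Solid analysis."}\n```'
print(parse_evaluation(reply)["grade"])  # → 85
```

Validating the range up front keeps malformed evaluator replies from silently polluting the saved results.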
* **`prompt_evolver.py`:**
  * An orchestrator script for automatically refining prompts.
  * Takes an initial prompt, an objective, a dataset, and a number of iterations.
  * In each iteration:
    * Runs the current prompt using the testing logic (`run_single_test_iteration`).
    * Evaluates the result using the evaluation logic (`call_openai_evaluator`).
    * Calls another OpenAI model (the specified evolver model) to generate an improved prompt based on the objective, the previous prompt, a conversation summary, and the evaluation feedback.
    * Uses the evolved prompt for the next iteration.
  * Saves a detailed log of the entire evolution process (prompts, test data, evaluations) and the final evolved prompt.
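The control flow of that loop can be sketched with the three stages injected as callables. The names mirror the functions mentioned above, but the signatures here are assumptions made for illustration.

```python
# Hypothetical control-flow sketch of the test → evaluate → evolve loop.
def evolve_prompt(initial_prompt, objective, iterations, run_test, evaluate, evolve):
    prompt, log = initial_prompt, []
    for i in range(iterations):
        conversation = run_test(prompt)       # cf. run_single_test_iteration
        evaluation = evaluate(conversation)   # cf. call_openai_evaluator
        log.append({"iteration": i, "prompt": prompt, "evaluation": evaluation})
        # The evolver model sees the objective, prior prompt, and feedback:
        prompt = evolve(objective, prompt, conversation, evaluation)
    return prompt, log

# Stub run showing the shape of the loop (no API calls):
final, log = evolve_prompt(
    "v0", "analyze the dataset", 3,
    run_test=lambda p: f"conversation for {p}",
    evaluate=lambda c: {"grade": 50, "comments": "ok"},
    evolve=lambda obj, p, c, e: p + "+",
)
print(final)  # → v0+++
```

Injecting the stages this way also makes the loop testable with stubs, without touching OpenAI or Docker.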
* **`requirements.txt`:** (Top-level)
  * Lists the Python packages required by the *host* scripts (`OneShotAgentTester.py`, `evaluator.py`, `prompt_evolver.py`, `czi_browser.py`, etc.). Key dependencies include `openai`, `python-dotenv`, `requests`, `docker`, `rich`, and `nbformat`.

## Setup

1. **Prerequisites:**
   * Python (3.10+ recommended)
   * `pip` (the Python package installer)
   * Docker Desktop or Docker Engine (must be running)
   * Git (for cloning the repository)
2. **Install Host Python Dependencies:**
   * Create and activate a Python virtual environment (recommended):

     ```bash
     python -m venv venv
     source venv/bin/activate  # Linux/macOS
     # venv\Scripts\activate   # Windows CMD
     ```

   * Install the required packages for the host scripts:

     ```bash
     pip install -r requirements.txt
     ```
3. **Set OpenAI API Key:**
   * Make the script executable: `chmod +x make_benchmarking_env.sh`
   * Run the script and enter your key when prompted: `./make_benchmarking_env.sh`
   * This creates the `.env` file. **Ensure `.env` is listed in your `.gitignore` file.**
4. **Prepare Sandbox Requirements:**
   * Edit `sandbox/requirements.txt` to include all the additional Python packages needed *inside* the container for agent code execution (e.g., `pandas`, `numpy`, `scipy`, `scikit-learn`, `anndata`, `matplotlib`, `seaborn`). Ensure these are compatible with the base Python version in the `Dockerfile`.
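For illustration only, such a `sandbox/requirements.txt` might simply list the packages mentioned above, unpinned (in practice, pin versions known to work with the image's Python):

```
anndata
scanpy
pandas
numpy
scipy
scikit-learn
matplotlib
seaborn
```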
## Usage

1. **Download a Dataset:**
   * Use the `tools/czi_browser.py` script (run `python tools/czi_browser.py` for interactive mode) to find and download a dataset to the `datasets/` directory.
2. **Test a Prompt (`OneShotAgentTester.py`):**
   * Run the script: `python OneShotAgentTester.py`
   * Follow the prompts to select the prompt source (paste, file, folder), the dataset, and the maximum number of code attempts.
   * The script starts the sandbox, runs the test(s) by communicating with the internal API, and saves the results to a JSON file in `outputs/`.
3. **Convert Results to Notebook (`output_to_notebook.py`):**
   * Run the script: `python output_to_notebook.py`
   * Enter the path to a results JSON file (e.g., `outputs/benchmark_results_....json`).
   * An `.ipynb` file will be generated in the same directory.
4. **Evaluate Results (`evaluator.py`):**
   * Run the script: `python evaluator.py`
   * Enter the path to the folder containing results JSON files (defaults to `outputs/`).
   * Enter the desired output location for the evaluation files.
   * The script calls OpenAI to evaluate each test run and saves the grades/comments.
5. **Evolve a Prompt (`prompt_evolver.py`):**
   * Run the script: `python prompt_evolver.py`
   * Enter the overall objective for the prompt.
   * Provide the initial prompt (paste or file path).
   * Select the dataset.
   * Enter the number of evolution iterations.
   * Specify the output directory for logs.
   * The script runs the test-evaluate-evolve loop and saves the full log and the final prompt.
6. **Manage Sandbox Manually (Optional):**
   * Use `sandbox/benchmarking_sandbox_management.py` for basic container control:
     * Build image: `python sandbox/benchmarking_sandbox_management.py build`
     * Start container (API): `python sandbox/benchmarking_sandbox_management.py start`
     * Check status: `python sandbox/benchmarking_sandbox_management.py status`
     * View logs: `python sandbox/benchmarking_sandbox_management.py logs [N]`
     * Stop container: `python sandbox/benchmarking_sandbox_management.py stop`
     * Run interactively: `python sandbox/benchmarking_sandbox_management.py`
## File Structure (Updated)

```
benchmarking/
├── sandbox/
│   ├── Dockerfile
│   ├── kernel_api.py                      # FastAPI application
│   ├── start_kernel.py                    # Script to launch kernel
│   ├── start.sh                           # Container startup script (kernel + API)
│   ├── requirements.txt                   # Requirements for INSIDE the container
│   └── benchmarking_sandbox_management.py # Simplified manager
│
├── datasets/                              # Created by czi_browser.py download
│   ├── <dataset_name>.h5ad
│   ├── <dataset_name>.json
│   └── ...
│
├── outputs/                               # Default location for results/logs/notebooks
│   ├── benchmark_results_*.json
│   ├── benchmark_results_*.ipynb
│   ├── *_eval.json
│   ├── evolution_log_*.json
│   ├── final_prompt_*.txt
│   ├── output_image_*.png
│   └── ...
│
├── tools/
│   └── czi_browser.py
│
├── make_benchmarking_env.sh               # Used to make the .env file
├── OneShotAgentTester.py                  # Runs agent tests via API
├── output_to_notebook.py                  # Converts results JSON to Notebook
├── evaluator.py                           # Evaluates test results using AI
├── prompt_evolver.py                      # Orchestrates prompt evolution loop
├── requirements.txt                       # Requirements for HOST scripts
├── README.md                              # This file
├── .env                                   # Stores API key (add to .gitignore)
└── .gitignore                             # Should include .env, venv/, __pycache__/, outputs/, datasets/
```
