
Commit 7be1fe5

updated README
1 parent 4b0f98b commit 7be1fe5

1 file changed

Lines changed: 107 additions & 74 deletions

File tree

benchmarking/README.md

# Benchmarking and Evolving Agent Prompts for Single-Cell Data Analysis

**⚠️ Work in Progress:** This tooling is currently under development. Its primary goal is to facilitate rapid iteration, testing, evaluation, and evolution of LLM agent prompts for analyzing single-cell transcriptomics datasets using a secure code execution sandbox.

## Overview

This framework provides the necessary tools to:

1. **Discover and Download Datasets:** Browse and fetch datasets (specifically from the CZI CELLxGENE Census) along with their metadata.
2. **Secure Code Execution:** Run Python code generated by an AI agent within an isolated Docker container (sandbox). The sandbox runs a Jupyter kernel managed by a **FastAPI service**, providing a stable HTTP interface for code execution.
3. **Agent Interaction & Testing (`OneShotAgentTester.py`):** Orchestrate interactions between an AI agent (powered by OpenAI's API), a selected dataset, and the code execution sandbox (via the FastAPI service). Allows testing prompts with a limited number of code execution attempts.
4. **Results Conversion (`output_to_notebook.py`):** Convert the detailed JSON logs from test runs into Jupyter Notebooks (`.ipynb`) for easier review and analysis reproduction.
5. **AI-Powered Evaluation (`evaluator.py`):** Use an LLM (such as GPT-4o) to automatically evaluate the agent's performance based on the conversation logs, assigning a grade and providing comments.
6. **Automated Prompt Evolution (`prompt_evolver.py`):** Iteratively refine an initial agent prompt based on an objective, test results, and AI evaluation feedback to automatically discover more effective prompts.

## Components

The framework consists of the following main components:

* `make_benchmarking_env.sh`: An interactive script to securely prompt for and save your OpenAI API key.
* `.env`: The file (created by the script) storing the `OPENAI_KEY`. This file should be added to your `.gitignore`.
* **`tools/czi_browser.py`:**
  * A CLI tool (with an interactive mode) for listing CZI CELLxGENE Census versions and datasets.
  * Allows downloading specific datasets (`.h5ad`) and metadata (`.json`) to the `datasets/` directory.
* **`sandbox/`:** Contains the code execution environment.
  * `Dockerfile`: Defines the Docker image based on a Python base image, adding the necessary Python/system dependencies, Jupyter components, FastAPI, Uvicorn, and the application code.
  * `requirements.txt`: Lists Python packages installed *inside* the sandbox container (e.g., `anndata`, `scanpy`, `matplotlib`).
  * `kernel_api.py`: The FastAPI application running inside the container. It receives code execution requests via HTTP, interacts with a local Jupyter kernel using `jupyter_client`, captures results (stdout, stderr, errors, display data), and returns them as JSON.
  * `start_kernel.py`: A simple script used internally by `start.sh` to launch the Jupyter kernel process with specific arguments (e.g., listening IP, ports).
  * `start.sh`: The main startup script run by the container (managed by `tini`). It launches the Jupyter kernel in the background and then starts the Uvicorn server running the `kernel_api.py` FastAPI app.
  * `benchmarking_sandbox_management.py`: A Python script (with CLI and interactive modes) used primarily for building the sandbox image and manually starting/stopping the container (which runs the API service). Direct kernel interaction commands have been removed.
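For intuition, the core of an API like `kernel_api.py` is collecting the kernel's IOPub messages into one JSON-able reply. Below is a minimal sketch assuming the standard Jupyter messaging protocol message types (`stream`, `error`, `display_data`, `execute_result`); the field names of the returned dict are invented for illustration and are not this project's actual schema.

```python
# Hypothetical sketch: fold Jupyter IOPub messages into one JSON-able result.
# The keys of `result` are illustrative assumptions, not the project's schema.
def collect_outputs(iopub_messages):
    result = {"stdout": "", "stderr": "", "error": None, "display_data": []}
    for msg in iopub_messages:
        mtype, content = msg["msg_type"], msg["content"]
        if mtype == "stream":                       # print() output / warnings
            result[content["name"]] += content["text"]
        elif mtype == "error":                      # uncaught exception
            result["error"] = {"ename": content["ename"],
                               "evalue": content["evalue"],
                               "traceback": content.get("traceback", [])}
        elif mtype in ("display_data", "execute_result"):
            result["display_data"].append(content["data"])  # e.g. image/png
    return result

msgs = [
    {"msg_type": "stream", "content": {"name": "stdout", "text": "42\n"}},
    {"msg_type": "execute_result", "content": {"data": {"text/plain": "42"}}},
]
print(collect_outputs(msgs)["stdout"])  # → 42
```

A FastAPI endpoint would then only need to run the code via `jupyter_client`, gather the IOPub messages until the kernel reports idle, and return a dict like this as JSON.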
* **`datasets/`:** (Created by `czi_browser.py`)
  * Stores downloaded `.h5ad` data files and `.json` metadata files.
* **`outputs/`:** (Created automatically)
  * Default directory for storing JSON logs from `OneShotAgentTester.py` and `prompt_evolver.py`, evaluation results from `evaluator.py`, and any generated notebooks/images.
* **`OneShotAgentTester.py`:**
  * Orchestrates a single test run for one or more prompts against a dataset.
  * Starts the sandbox container (via `SandboxManager`).
  * Copies the dataset into the running container.
  * Checks that the internal API service is responsive.
  * Manages the interaction loop with the OpenAI API (using the specified agent model).
  * When the agent generates code, sends it to the sandbox's FastAPI `/execute` endpoint using the `requests` library.
  * Formats the JSON response (stdout, stderr, errors, display data) from the API and feeds it back to the agent.
  * Saves the full conversation log for the test run(s) to a JSON file in the `outputs/` directory.
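To make the API round-trip concrete, here is a small sketch of the feedback-formatting step. The response field names (`stdout`, `stderr`, `error`, `display_data`), the endpoint URL, and the port are assumptions for illustration, not the actual contract.

```python
# Hypothetical sketch; response field names and the port are assumptions.
def format_execution_result(result: dict) -> str:
    """Render the sandbox API's JSON reply as plain text for the agent."""
    parts = []
    if result.get("stdout"):
        parts.append("STDOUT:\n" + result["stdout"])
    if result.get("stderr"):
        parts.append("STDERR:\n" + result["stderr"])
    if result.get("error"):
        err = result["error"]
        parts.append(f"ERROR: {err.get('ename')}: {err.get('evalue')}")
    for item in result.get("display_data", []):
        parts.append("[display data: " + ", ".join(item) + "]")
    return "\n\n".join(parts) or "(no output)"

# The call itself would then look roughly like:
# resp = requests.post("http://localhost:8000/execute",
#                      json={"code": code}, timeout=300)
# feedback = format_execution_result(resp.json())
```

Keeping the formatting in a pure function like this makes the agent-facing output easy to unit-test without a running container.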
* **`output_to_notebook.py`:**
  * An interactive script that takes a results JSON file (from `OneShotAgentTester.py` or `prompt_evolver.py`) as input.
  * Converts the conversation log, including code cells and their outputs (stdout, stderr, errors, display data), into a Jupyter Notebook (`.ipynb`) file.
  * Saves the `.ipynb` file in the same directory as the input JSON.
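Since an `.ipynb` file is just JSON, the conversion idea can be pictured with the standard library alone. This sketch uses an invented, simplified log format (`role`/`content` for chat turns, `code`/`output` for executions); the real results JSON is richer. Cell fields follow the nbformat v4 schema.

```python
import json

# Hypothetical, simplified log entries; the real schema is richer.
def code_cell(source, stdout=""):
    outputs = ([{"output_type": "stream", "name": "stdout", "text": stdout}]
               if stdout else [])
    return {"cell_type": "code", "execution_count": None, "metadata": {},
            "source": source, "outputs": outputs}

def markdown_cell(text):
    return {"cell_type": "markdown", "metadata": {}, "source": text}

def log_to_notebook(entries):
    cells = []
    for e in entries:
        if "code" in e:
            cells.append(code_cell(e["code"], e.get("output", "")))
        else:
            cells.append(markdown_cell(f"**{e['role']}**: {e['content']}"))
    return {"nbformat": 4, "nbformat_minor": 5, "metadata": {}, "cells": cells}

nb = log_to_notebook([
    {"role": "user", "content": "Load the dataset."},
    {"role": "assistant", "code": "print('loaded')", "output": "loaded\n"},
])
notebook_json = json.dumps(nb, indent=1)  # write to *.ipynb and Jupyter opens it
```

In practice a library such as `nbformat` handles the schema details (cell ids, validation) for you.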
* **`evaluator.py`:**
  * An interactive script that processes results JSON files from a specified input directory (defaults to `outputs/`).
  * For each test run in a JSON file, formats the conversation and sends it to an OpenAI model (the specified evaluator model) with instructions to evaluate the agent's performance (a 0-100 grade plus comments) against defined criteria (e.g., correctness, efficiency, clarity).
  * Saves the evaluations (grade and comments) to JSON files (either aggregated or individual) in a specified output location (defaults to the input directory).
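One practical detail in this kind of evaluator is recovering the structured grade from free-form model output. A sketch, assuming the evaluator model is asked to reply with a JSON object like `{"grade": 85, "comments": "..."}` (an assumed reply format, not necessarily the project's actual one):

```python
import json
import re

# Hypothetical tolerant parser for an assumed {"grade", "comments"} reply.
def parse_evaluation(reply: str) -> dict:
    """Extract {'grade': int, 'comments': str}, tolerating prose or ``` fences."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in evaluator reply")
    data = json.loads(match.group(0))
    grade = int(data["grade"])
    if not 0 <= grade <= 100:
        raise ValueError(f"grade out of range: {grade}")
    return {"grade": grade, "comments": str(data.get("comments", ""))}

reply = 'Sure:\n```json\n{"grade": 85, "comments": "Solid analysis."}\n```'
print(parse_evaluation(reply)["grade"])  # → 85
```

Validating the range up front keeps malformed evaluator replies from silently polluting the saved results.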
* **`prompt_evolver.py`:**
  * An orchestrator script for automatically refining prompts.
  * Takes an initial prompt, an objective, a dataset, and a number of iterations.
  * In each iteration:
    * Runs the current prompt using the testing logic (`run_single_test_iteration`).
    * Evaluates the result using the evaluation logic (`call_openai_evaluator`).
    * Calls another OpenAI model (the specified evolver model) to generate an improved prompt based on the objective, the previous prompt, a conversation summary, and the evaluation feedback.
    * Uses the evolved prompt for the next iteration.
  * Saves a detailed log of the entire evolution process (prompts, test data, evaluations) and the final evolved prompt.
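The control flow of that loop can be sketched with the three stages injected as callables. The names mirror the functions mentioned above, but the signatures here are assumptions made for illustration.

```python
# Hypothetical control-flow sketch of the test → evaluate → evolve loop.
def evolve_prompt(initial_prompt, objective, iterations, run_test, evaluate, evolve):
    prompt, log = initial_prompt, []
    for i in range(iterations):
        conversation = run_test(prompt)       # cf. run_single_test_iteration
        evaluation = evaluate(conversation)   # cf. call_openai_evaluator
        log.append({"iteration": i, "prompt": prompt, "evaluation": evaluation})
        # The evolver model sees the objective, prior prompt, and feedback:
        prompt = evolve(objective, prompt, conversation, evaluation)
    return prompt, log

# Stub run showing the shape of the loop (no API calls):
final, log = evolve_prompt(
    "v0", "analyze the dataset", 3,
    run_test=lambda p: f"conversation for {p}",
    evaluate=lambda c: {"grade": 50, "comments": "ok"},
    evolve=lambda obj, p, c, e: p + "+",
)
print(final)  # → v0+++
```

Injecting the stages this way also makes the loop testable with stubs, without touching OpenAI or Docker.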
* **`requirements.txt`:** (Top-level)
  * Lists the Python packages required by the *host* scripts (`OneShotAgentTester.py`, `evaluator.py`, `prompt_evolver.py`, `czi_browser.py`, etc.). Key dependencies include `openai`, `python-dotenv`, `requests`, `docker`, `rich`, and `nbformat`.

## Setup

1. **Prerequisites:**
   * Python (3.10+ recommended)
   * `pip` (the Python package installer)
   * Docker Desktop or Docker Engine (must be running)
   * Git (for cloning the repository)
2. **Install Host Python Dependencies:**
   * Create and activate a Python virtual environment (recommended):

     ```bash
     python -m venv venv
     source venv/bin/activate  # Linux/macOS
     # venv\Scripts\activate   # Windows CMD
     ```

   * Install the required packages for the host scripts:

     ```bash
     pip install -r requirements.txt
     ```
3. **Set OpenAI API Key:**
   * Make the script executable: `chmod +x make_benchmarking_env.sh`
   * Run the script and enter your key when prompted: `./make_benchmarking_env.sh`
   * This creates the `.env` file. **Ensure `.env` is listed in your `.gitignore` file.**
4. **Prepare Sandbox Requirements:**
   * Edit `sandbox/requirements.txt` to include all the additional Python packages needed *inside* the container for agent code execution (e.g., `pandas`, `numpy`, `scipy`, `scikit-learn`, `anndata`, `matplotlib`, `seaborn`). Ensure these are compatible with the base Python version in the `Dockerfile`.
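For illustration only, such a `sandbox/requirements.txt` might simply list the packages mentioned above, unpinned (in practice, pin versions known to work with the image's Python):

```
anndata
scanpy
pandas
numpy
scipy
scikit-learn
matplotlib
seaborn
```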
## Usage

1. **Download a Dataset:**
   * Use the `tools/czi_browser.py` script (run `python tools/czi_browser.py` for interactive mode) to find and download a dataset to the `datasets/` directory.
2. **Test a Prompt (`OneShotAgentTester.py`):**
   * Run the script: `python OneShotAgentTester.py`
   * Follow the prompts to select the prompt source (paste, file, folder), the dataset, and the maximum number of code attempts.
   * The script starts the sandbox, runs the test(s) by communicating with the internal API, and saves the results to a JSON file in `outputs/`.
3. **Convert Results to Notebook (`output_to_notebook.py`):**
   * Run the script: `python output_to_notebook.py`
   * Enter the path to a results JSON file (e.g., `outputs/benchmark_results_....json`).
   * An `.ipynb` file will be generated in the same directory.
4. **Evaluate Results (`evaluator.py`):**
   * Run the script: `python evaluator.py`
   * Enter the path to the folder containing results JSON files (defaults to `outputs/`).
   * Enter the desired output location for the evaluation files.
   * The script calls OpenAI to evaluate each test run and saves the grades/comments.
5. **Evolve a Prompt (`prompt_evolver.py`):**
   * Run the script: `python prompt_evolver.py`
   * Enter the overall objective for the prompt.
   * Provide the initial prompt (paste or file path).
   * Select the dataset.
   * Enter the number of evolution iterations.
   * Specify the output directory for logs.
   * The script runs the test-evaluate-evolve loop and saves the full log and the final prompt.
6. **Manage Sandbox Manually (Optional):**
   * Use `sandbox/benchmarking_sandbox_management.py` for basic container control:
     * Build image: `python sandbox/benchmarking_sandbox_management.py build`
     * Start container (API): `python sandbox/benchmarking_sandbox_management.py start`
     * Check status: `python sandbox/benchmarking_sandbox_management.py status`
     * View logs: `python sandbox/benchmarking_sandbox_management.py logs [N]`
     * Stop container: `python sandbox/benchmarking_sandbox_management.py stop`
     * Run interactively: `python sandbox/benchmarking_sandbox_management.py`
## File Structure (Updated)

```
benchmarking/
├── sandbox/
│   ├── Dockerfile
│   ├── kernel_api.py                      # FastAPI application
│   ├── start_kernel.py                    # Script to launch kernel
│   ├── start.sh                           # Container startup script (kernel + API)
│   ├── requirements.txt                   # Requirements for INSIDE the container
│   └── benchmarking_sandbox_management.py # Simplified manager
│
├── datasets/                              # Created by czi_browser.py download
│   ├── <dataset_name>.h5ad
│   ├── <dataset_name>.json
│   └── ...
│
├── outputs/                               # Default location for results/logs/notebooks
│   ├── benchmark_results_*.json
│   ├── benchmark_results_*.ipynb
│   ├── *_eval.json
│   ├── evolution_log_*.json
│   ├── final_prompt_*.txt
│   ├── output_image_*.png
│   └── ...
│
├── tools/
│   └── czi_browser.py
│
├── make_benchmarking_env.sh               # Used to make the .env file
├── OneShotAgentTester.py                  # Runs agent tests via API
├── output_to_notebook.py                  # Converts results JSON to Notebook
├── evaluator.py                           # Evaluates test results using AI
├── prompt_evolver.py                      # Orchestrates prompt evolution loop
├── requirements.txt                       # Requirements for HOST scripts
├── README.md                              # This file
├── .env                                   # Stores API key (add to .gitignore)
└── .gitignore                             # Should include .env, venv/, __pycache__/, outputs/, datasets/
```
