
Commit 3fe3cf1

Authored by dschwarz26, petermuehlbacher, and claude

Add numbers to the 10k researchers post, and fix the image rendering (#122)

* Add numbers to the 10k researchers post, and fix the image rendering

* Add runtime (3h 27m) to 10k agents notebook

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Peter Muehlbacher <muehlbacher.peter@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent ff28947 commit 3fe3cf1

2 files changed

Lines changed: 18 additions & 28 deletions


docs-site/src/styles/notebook.css

Lines changed: 8 additions & 0 deletions
```diff
@@ -190,6 +190,14 @@
   background-color: #f0f0f0;
 }
 
+/* Images (matplotlib plots, etc.) */
+.notebook-content .output_png img,
+.notebook-content .output_subarea img {
+  max-width: 100%;
+  height: auto;
+  display: block;
+}
+
 /* Hide empty prompts */
 .notebook-content .prompt:empty {
   display: none;
```

docs/case_studies/llm-web-research-agents-at-scale/notebook.ipynb

Lines changed: 10 additions & 28 deletions
```diff
@@ -4,24 +4,24 @@
    "cell_type": "markdown",
    "id": "intro",
    "metadata": {},
-   "source": [
-    "# Running LLM Web Research Agents at Scale\n",
-    "\n",
-    "The everyrow `agent_map()` function runs an LLM web research agent on every row of a dataframe. This notebook demonstrates how this scales to running 10,000 agents, each of which consists of many LLM calls.\n",
-    "\n",
-    "The total cost was ~$0.11/row, reflecting that there can be more than a few LLM calls involved in each row's agent."
-   ]
+   "source": "# Run 10,000 LLM Web Research Agents\n\nThe everyrow `agent_map()` function runs an LLM web research agent on every row of a dataframe. In this notebook, I demonstrate scaling this to running 10,000 web agents.\n\nFirst, some numbers. The total cost was ~$0.11/row, using 120k LLM calls, 1.56B input tokens, 20.1M output tokens, executing 338k web searches, and reading 11,726 pages. The whole run took only 3 hours 27 minutes.\n\n<table>\n  <tr>\n    <th>Model</th>\n    <th style=\"padding-left: 20px;\">Calls</th>\n    <th style=\"padding-left: 20px;\">Input Tokens</th>\n    <th style=\"padding-left: 20px;\">Output Tokens</th>\n    <th style=\"padding-left: 20px;\">Cost</th>\n  </tr>\n  <tr>\n    <td>gemini-3-flash-preview</td>\n    <td style=\"padding-left: 20px;\">98,190</td>\n    <td style=\"padding-left: 20px;\">847,115,551</td>\n    <td style=\"padding-left: 20px;\">17,237,847</td>\n    <td style=\"padding-left: 20px;\">$913.85</td>\n  </tr>\n  <tr>\n    <td>gemini-2.5-flash</td>\n    <td style=\"padding-left: 20px;\">11,574</td>\n    <td style=\"padding-left: 20px;\">700,327,085</td>\n    <td style=\"padding-left: 20px;\">2,715,535</td>\n    <td style=\"padding-left: 20px;\">$222.01</td>\n  </tr>\n  <tr>\n    <td>claude-sonnet-4-20250514</td>\n    <td style=\"padding-left: 20px;\">10,015</td>\n    <td style=\"padding-left: 20px;\">10,912,199</td>\n    <td style=\"padding-left: 20px;\">193,567</td>\n    <td style=\"padding-left: 20px;\">$35.64</td>\n  </tr>\n</table>\n<br>\n\nYou'll see that this is reasonably affordable only because the vast majority of the work is done by Gemini-3-Flash (running the agents) and Gemini-2.5-Flash (reading webpages). The SDK supports using higher-powered LLMs when it's really worth it.\n\nAlso, you'll see that to process 10,000 rows, each agent executed ~34 web searches but fully read only ~1.2 pages. It got the rest of its information from search result snippets, which can be surprisingly informative to an agent answering simple questions, often letting it save a lot of tokens by not fetching or reading any pages at all and still answer correctly. Gemini-3-Flash is quite good at this in general, scoring near the top of [Deep Research Bench](https://evals.futuresearch.ai/) while being by far the most cost-efficient model. (Though Opus 4.6, released in Feb 2026, also shows great token efficiency in doing web research, and can be cost-competitive even though it's ~9x the price per token!)\n\nA large share of the cost comes from writing output: this agent produced a few paragraphs of unstructured research notes in addition to the specifically requested fields (see the dataframe below). Costs could be reduced by trimming the outputs, but we generally find that output very useful for further processing, and it reduces the chance that the agent is unable to report important information given a restrictive schema."
   },
   {
    "cell_type": "markdown",
    "id": "1uoefxrbb",
    "metadata": {},
    "source": [
-    "## Example: Researching 10,000 Drug Products\n",
+    "## Researching 10,000 Drugs\n",
+    "\n",
+    "In many cases, running a web research agent for every row in a dataset is not efficient. We recommend, for example, [first using an intelligent filter](https://everyrow.io/docs/chaining-operations), or deduping your data, to cut down on the amount of web research needed.\n",
+    "\n",
+    "Sometimes, though, you do want good research on every entity in a large list, and there's no structured data source to pull from. In that case, you want a system that runs as cheaply and consistently as possible to research all of them, using multiple sources and giving agents the freedom to search around and read what seems relevant for each one.\n",
     "\n",
     "This example takes a dataset of 10,000 drug product entries (trade name, ingredient, applicant, strength, dosage form) and determines each product's current FDA regulatory status. Determining regulatory status requires researching each product individually against FDA databases, Orange Book listings, Federal Register notices, and other sources. Some products have straightforward histories while others have complex timelines involving tentative approvals, voluntary withdrawals, or transitions between marketed and not marketed status.\n",
     "\n",
-    "This run achieved a 99.97% success rate (9,997 of 10,000 rows returned results). For evals on agent accuracy, see [evals.futuresearch.ai](https://evals.futuresearch.ai/) or our [papers](https://futuresearch.ai/research)."
+    "This run achieved a 99.97% success rate (9,997 of 10,000 rows returned results).\n",
+    "\n",
+    "Below is how you can reproduce this."
    ]
   },
   {
```
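The new intro describes `agent_map()` without showing a call. For orientation, here is a minimal sketch of what the workflow might look like. The everyrow SDK and `agent_map()` are real per the notebook, but the import path, signature, parameter names, and schema below are illustrative assumptions, not the confirmed API.

```python
# Hypothetical sketch only: agent_map's real signature may differ.
# The import path, parameter names, and schema are assumptions.
import pandas as pd
from pydantic import BaseModel

from everyrow import agent_map  # assumed import path, unverified


class DrugStatus(BaseModel):
    regulatory_status: str  # e.g. "FDA approved (ANDA - generic)"
    research: str           # a few paragraphs of unstructured notes


# Columns per the notebook: trade_name, ingredient, applicant, strength, dosage_form
df = pd.read_csv("drug_products.csv")

# One web research agent per row: each agent issues its own searches,
# reads snippets/pages, and returns the schema fields plus notes.
results_df = agent_map(
    df,
    prompt="Determine this drug product's current FDA regulatory status.",
    output_schema=DrugStatus,
)
```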
```diff
@@ -253,16 +253,6 @@
     " )"
    ]
   },
-  {
-   "cell_type": "markdown",
-   "id": "cost-header",
-   "metadata": {},
-   "source": [
-    "### Cost\n",
-    "\n",
-    "Running 10,000 agents cost $1.17k, averaging around $0.11 per row."
-   ]
-  },
   {
    "cell_type": "markdown",
    "id": "inspect-header",
```
```diff
@@ -832,14 +822,6 @@
    "print(f\" ANDA (generic): {(results_df['regulatory_status'] == 'FDA approved (ANDA \\u2013 generic)').sum():,}\")\n",
    "approved[[\"trade_name\", \"ingredient\", \"applicant\", \"dosage_form\", \"research\"]].sample(2, random_state=42)"
   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "0fc1b316",
-   "metadata": {},
-   "outputs": [],
-   "source": []
   }
  ],
  "metadata": {
```
```diff
@@ -866,4 +848,4 @@
  },
  "nbformat": 4,
  "nbformat_minor": 5
-}
+}
```
