Add global agent tests and prompt tweak#450
Open
hanna-paasivirta wants to merge 3 commits into main from
@josephjclark this should require minimal review. The prompt change only slightly affects the global_chat service, and the tests are still a work in progress.
Short Description
Adds global_chat tests for scenarios besides one-shot generation and a routing prompt improvement. The new tests address a few of the scenarios in #437 but not all of them.
Implementation Details
The prompt tweak in this PR helps direct multi-step tasks to the planner, even if they only involve one type of agent. This lets the service gather information from across the full workflow, rather than from isolated steps or from the YAML structure without the code.
Tests
The tests cover a routing matrix across three dimensions:
A lot of the tests are adapted from Brandon's list of user scenarios, with easily verifiable information added in. They will need expanding and tweaking later.
Test details
Here's my prompt for generating the tests.
Given:
Prompt: “What does this do?”
Evaluate:
Given:
Prompt: “What does the Claude step do?”
Evaluate:
Same as above, but with a different page URL
Given:
Prompt: “What does this step do?”
Evaluate:
Given:
Prompt: “Modify ChatGPT step to ask for a haiku instead of a couplet.”
Evaluate:
Same as above, but with a different page URL
Given:
Prompt: “Modify to ask for a haiku instead of a poem.”
Evaluate:
Given:
Prompt: “I want all poems to be in French”
Evaluate:
Same as above, but with a different page URL
Given:
Prompt: “I want all poems in the cat poetry competition to be in French”
Evaluate:
Given:
Prompt: “Make this a poetry competition. Send it to both Claude and ChatGPT. Send that to another Claude step to be the judge. Then send me the results.”
Evaluate:
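As a rough illustration only, the Given / Prompt / Evaluate cases in this section could be held as plain data and checked for the easily verifiable information mentioned under Tests. This is a minimal sketch, not the test code in this PR: `ChatScenario`, `must_mention`, `page_url` and `missing_terms` are hypothetical names, and the expected terms are assumptions picked from the prompt wording.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ChatScenario:
    """One Given / Prompt / Evaluate case (hypothetical structure)."""
    prompt: str
    page_url: Optional[str] = None       # assumed: the page the user is on
    must_mention: Tuple[str, ...] = ()   # assumed: terms a good answer contains


# Prompts are copied from the list above; the expected terms are assumptions.
SCENARIOS = [
    ChatScenario(prompt="What does this do?"),
    ChatScenario(
        prompt="Modify ChatGPT step to ask for a haiku instead of a couplet.",
        must_mention=("haiku",),
    ),
    ChatScenario(
        prompt="I want all poems to be in French",
        must_mention=("French",),
    ),
]


def missing_terms(scenario: ChatScenario, response_text: str) -> List[str]:
    """Return the expected terms that do not appear in the response (case-insensitive)."""
    lowered = response_text.lower()
    return [term for term in scenario.must_mention if term.lower() not in lowered]
```

A check like `missing_terms(SCENARIOS[1], response_text) == []` is a cheap first filter; it does not replace evaluating whether the request was routed to the right agent.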
AI Usage
Please disclose how you've used AI in this work (it's cool, we just want to know!):
You can read more details in our Responsible AI Policy