Hi, thanks for releasing AssetOpsBench. While looking through the Hugging Face `scenarios` data, I noticed a few rows where the prompt date/time range and the `characteristic_form` date/time range appear to disagree.

Since `characteristic_form` is used as expected behavior by the LLM judge, these mismatches may affect evaluation: a response using the date from the user prompt could be judged against a different date in the expected behavior.

Source checked: `docs/evaluation.md`, `src/evaluation/scorers/llm_judge.py`

Examples:

| id | Prompt asks for | `characteristic_form` says | Possible fix |
|---|---|---|---|
| 10 | first week of June 2020 | last week | Use "first week of June 2020", or clarify the intended reference date for "last week". |
| 11 | last week of April '20 | past week | Use "last week of April 2020", or clarify the intended reference date for "past week". |
| 42 | September 19, 2020 at quarter to midnight | September 19, 2015 at 11:45pm | Change 2015 to 2020 if the prompt is correct. |
| 43 | 6/14/20 | June 14, 2016 | Change 2016 to 2020 if 6/14/20 means June 14, 2020. |
| 45 | Mar 13 '20 | January 13, 2023 | Change to March 13, 2020 if the prompt is correct. |
| 48 | September 19, 2020 at 7pm | September 19, 2015 at 7pm | Change 2015 to 2020 if the prompt is correct. |
| 410 | first week of June 2020 | first week of June 2020, but later says "time range - First week of May 2020" | Change the final May reference to June. I checked `src/couchdb/sample_data/work_order/event.csv`, and the "6 alert records" count appears to match the first week of June. |

One related wording issue I noticed:

| id | Current wording | Why it looks ambiguous |
|---|---|---|
| 430 | "two month's period from 2020-05-01T12:30:00 to 2022-06-30T19:30:00" | The explicit timestamps cover more than two years, not two months. If the timestamps are intended, "two month's period" could be replaced with "the date range". |

I am not assuming the prompt is always the source of truth here; the main issue is that the prompt and expected behavior currently point to different time ranges.
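
In case it helps with triage, here is a minimal sketch of how year mismatches like those in rows 42, 43, and 45 could be flagged automatically. It is not part of AssetOpsBench: the helper names `years_mentioned` and `year_mismatch` are hypothetical, the sample strings are copied from the table above rather than loaded from the dataset, and it assumes two-digit years such as `'20` or `6/14/20` mean 20xx. It only compares years, so it would not catch relative wording ("last week") or month-only differences such as rows 10, 11, and 410.

```python
import re

# Patterns for the year spellings seen in the rows above.
_FOUR_DIGIT = re.compile(r"\b(?:19|20)\d{2}\b")         # e.g. "2020"
_APOSTROPHE = re.compile(r"'(\d{2})\b")                 # e.g. "Mar 13 '20"
_SLASHED = re.compile(r"\b\d{1,2}/\d{1,2}/(\d{2})\b")   # e.g. "6/14/20"


def years_mentioned(text: str) -> set[int]:
    """Return every calendar year found in a free-text string."""
    years = {int(m.group(0)) for m in _FOUR_DIGIT.finditer(text)}
    for pattern in (_APOSTROPHE, _SLASHED):
        # Assumption: a two-digit year always refers to 20xx.
        years |= {2000 + int(m.group(1)) for m in pattern.finditer(text)}
    return years


def year_mismatch(prompt: str, expected: str) -> bool:
    """True when both texts name years and they share none of them."""
    prompt_years = years_mentioned(prompt)
    expected_years = years_mentioned(expected)
    return bool(prompt_years) and bool(expected_years) and prompt_years.isdisjoint(expected_years)


# Sample (prompt, characteristic_form) fragments taken from the table above.
rows = {
    10: ("first week of June 2020", "last week"),
    42: ("September 19, 2020 at quarter to midnight", "September 19, 2015 at 11:45pm"),
    43: ("6/14/20", "June 14, 2016"),
    45: ("Mar 13 '20", "January 13, 2023"),
    410: ("first week of June 2020", "time range - First week of May 2020"),
}

flagged = [row_id for row_id, (prompt, expected) in rows.items() if year_mismatch(prompt, expected)]
print(flagged)
```

A check like this is deliberately conservative: it stays silent when either side names no year, so it should be read as a lower bound on the number of affected rows rather than a complete audit.
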