Possible date/time-range mismatches in scenario expected behavior

Hi, thanks for releasing AssetOpsBench. While looking through the Hugging Face `scenarios` data, I noticed a few rows where the prompt date/time range and the `characteristic_form` date/time range appear to disagree.

Since `characteristic_form` is used as expected behavior by the LLM judge, these mismatches may affect evaluation: a response using the date from the user prompt could be judged against a different date in the expected behavior.

Source checked:

- HF dataset: https://huggingface.co/datasets/ibm-research/AssetOpsBench
- Raw file: https://huggingface.co/datasets/ibm-research/AssetOpsBench/raw/main/data/scenarios/all_utterance.jsonl
- Evaluation docs/code: `docs/evaluation.md`, `src/evaluation/scorers/llm_judge.py`

Examples:

| id | Prompt asks for | `characteristic_form` says | Possible fix |
| --- | --- | --- | --- |
| 10 | first week of June 2020 | last week | Use "first week of June 2020", or clarify the intended reference date for "last week". |
| 11 | last week of April '20 | past week | Use "last week of April 2020", or clarify the intended reference date for "past week". |
| 42 | September 19, 2020 at quarter to midnight | September 19, 2015 at 11:45pm | Change 2015 to 2020 if the prompt is correct. |
| 43 | 6/14/20 | June 14, 2016 | Change 2016 to 2020 if `6/14/20` means June 14, 2020. |
| 45 | Mar 13 '20 | January 13, 2023 | Change to March 13, 2020 if the prompt is correct. |
| 48 | September 19, 2020 at 7pm | September 19, 2015 at 7pm | Change 2015 to 2020 if the prompt is correct. |
| 410 | first week of June 2020 | first week of June 2020, but later says "time range - First week of May 2020" | Change the final May reference to June. I checked `src/couchdb/sample_data/work_order/event.csv`, and the "6 alert records" count appears to match the first week of June. |
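
For reference, here is a rough sketch of how year mismatches like those in rows 42, 43, 45 and 48 could be flagged automatically. It is not part of the repository. The regular expressions, the helper names, and the `text` key for the prompt field are assumptions for illustration, and the two-digit-year rule (`'20` and `6/14/20` read as 2020) is a simplification. It does not catch relative phrases such as "last week" (rows 10 and 11) or month-level mismatches (row 410).

```python
import re

# Patterns are illustrative assumptions, not taken from the repository:
# four-digit years, abbreviated years such as '20, and the year part of
# numeric dates such as 6/14/20.
_YEAR_PATTERNS = [
    re.compile(r"\b((?:19|20)\d{2})\b"),
    re.compile(r"'(\d{2})\b"),
    re.compile(r"\b\d{1,2}/\d{1,2}/(\d{2})\b"),
]


def extract_years(text: str) -> set[int]:
    """Collect every year mentioned in `text`, mapping two-digit years to 20YY."""
    years = set()
    for pattern in _YEAR_PATTERNS:
        for match in pattern.findall(text):
            year = int(match)
            years.add(year if year >= 100 else 2000 + year)
    return years


def years_disagree(prompt: str, expected: str) -> bool:
    """True when both strings mention years and share none of them."""
    prompt_years = extract_years(prompt)
    expected_years = extract_years(expected)
    return bool(prompt_years) and bool(expected_years) and prompt_years.isdisjoint(expected_years)


def find_year_mismatches(rows, prompt_key="text"):
    """Return ids of rows whose prompt and `characteristic_form` years disagree.

    `prompt_key` is a hypothetical default; adjust it to the dataset's actual
    prompt field name.
    """
    return [
        row["id"]
        for row in rows
        if years_disagree(row.get(prompt_key, ""), row.get("characteristic_form", ""))
    ]
```

Each row parsed from `all_utterance.jsonl` (for example with `json.loads` per line) could be passed to `find_year_mismatches`; flagged ids would still need a manual look, since either side could be the one that is wrong.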

One related wording issue I noticed:

| id | Current wording | Why it looks ambiguous |
| --- | --- | --- |
| 430 | "two month's period from 2020-05-01T12:30:00 to 2022-06-30T19:30:00" | The explicit timestamps cover more than two years, not two months. If the timestamps are intended, "two month's period" could be replaced with "the date range". |
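
As a quick arithmetic check of the span in row 430, a minimal sketch using only the standard library; the function name is made up for illustration:

```python
from datetime import datetime


def span_summary(start_iso: str, end_iso: str) -> tuple[int, int]:
    """Return (whole days, calendar-month difference) between two ISO timestamps."""
    start = datetime.fromisoformat(start_iso)
    end = datetime.fromisoformat(end_iso)
    days = (end - start).days
    # Calendar-month difference ignores the day of month and time of day.
    months = (end.year - start.year) * 12 + (end.month - start.month)
    return days, months


days, months = span_summary("2020-05-01T12:30:00", "2022-06-30T19:30:00")
print(days, months)  # 790 25
```

A two-month period would be roughly 60 days, so the timestamps and the wording cannot both be right.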

I am not assuming the prompt is always the source of truth here; the main issue is that the prompt and expected behavior currently point to different time ranges.


Possible date/time-range mismatches in scenario expected behavior #310
