feat(WIP): implement llm for metadata conversion #3
Conversation
kellytea
commented
Sep 30, 2025
- working on conversion between known formats
- using litellm
- still need to validate content of the llm response + refactor llm runner configuration
- working on conversion between known formats
- using litellm
| "Convert the SOURCE_DATA to match TARGET_MODEL_SCHEMA.\n" | ||
| "SOURCE_DATA:\n" + json.dumps(source_dict) + "\n\n" | ||
| "SOURCE_MODEL_SCHEMA:\n" + json.dumps(source_model.schema()) + "\n\n" | ||
| "TARGET_MODEL_SCHEMA:\n" + json.dumps(target_model.schema()) + "\n\n" |
I bet context/docs (definitions of terms, etc.) for the schemas might help with results
we could put a description for all the fields in the pydantic models so that it would end up in the jsonschema from schema(), but this may be a lot of work, even with an initial pass of having AI fill them out from the docs below
alternatively, just throw in the full text of the documentation
https://codemeta.github.io/terms/
https://datacite-metadata-schema.readthedocs.io/en/4.6/
https://github.com/citation-file-format/citation-file-format/blob/main/schema-guide.md
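A lighter way to try the first idea, without editing every pydantic model, might be to merge a field-to-description mapping into the JSON schema dict just before it goes into the prompt. This is only a sketch: `annotate_schema` and `DESCRIPTIONS` are hypothetical names that do not exist in the PR, the sample schema is made up, and the description wording is illustrative rather than quoted from the linked docs.

```python
import copy
import json

# Hypothetical, hand-written descriptions (illustrative wording, not quoted from the docs).
DESCRIPTIONS = {
    "name": "The name of the software.",
    "codeRepository": "Link to the source code repository.",
}


def annotate_schema(schema: dict, descriptions: dict) -> dict:
    """Return a copy of a JSON schema with descriptions added to known properties."""
    annotated = copy.deepcopy(schema)
    for field, prop in annotated.get("properties", {}).items():
        # Keep any description the schema already carries.
        if field in descriptions and "description" not in prop:
            prop["description"] = descriptions[field]
    return annotated


# Made-up stand-in for the dict returned by a model's schema() call.
sample_schema = {
    "title": "CodeMeta",
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "codeRepository": {"type": "string", "format": "uri"},
        "version": {"type": "string"},
    },
}

annotated = annotate_schema(sample_schema, DESCRIPTIONS)
prompt_part = "TARGET_MODEL_SCHEMA:\n" + json.dumps(annotated, indent=2)
```

Fields without an entry in the mapping (here `version`) pass through unchanged, so the mapping can be filled in gradually.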
- need to refactor api key handling
- implement splitting via strategy design pattern
- add testing
- overrides need to set env var
- less capable models like mistral-nemo are failing to map critical fields
- chunking doesn't seem to improve this, might try refining the chain-of-thought (CoT) prompt
- removed instructor because it was raising validation errors
- need to fix passing api key in llm call rather than setting env var
- works bi-directionally between codemeticulous and other metadata formats now
| source_instance = source_data | ||
|
|
||
| # Create summarized schema of source pydantic model according to data instance | ||
| source_schema_dict = get_schema_summary(source_model, llm_model, key, source_instance) |
this should just be cached as a static data file so the descriptions aren't opaquely+non-deterministically generated every time
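Reading a cached summary back is cheap and deterministic. A minimal sketch, assuming the three-column `field,type,description` header used by the CodeMeta cache file in this PR; `load_schema_summary` is a hypothetical helper and the rows are illustrative, inlined as a string so the example runs without the repo.

```python
import csv
import io

# Inline stand-in for a cached file such as schema_cache/codemeta.csv.
# The header matches the CodeMeta cache in this PR; the rows are illustrative.
CACHED_CSV = """field,type,description
name,Text,the name of the software
version,Text,the version of the software
"""


def load_schema_summary(csv_text: str) -> list[dict]:
    """Parse a cached schema summary into one dict per field."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in reader]


summary = load_schema_summary(CACHED_CSV)
```

Because the file is checked in, any change to a description shows up in review as a normal diff.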
| # ) | ||
| # return response | ||
| try: | ||
| os.environ["OPENROUTER_API_KEY"] = key #FIXME: avoid setting env var |
let's remove the key getting passed around and just assume the environment variable is set. A bit more secure, and it lets us support any API that litellm does with no extra work
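If the key is no longer passed around, a small up-front check keeps the failure message readable when the variable is missing. This is a sketch under the assumption, implied by the comment above, that LiteLLM picks up the provider key from the environment; `require_api_key` is a hypothetical helper, not part of the PR.

```python
import os


def require_api_key(env_var: str = "OPENROUTER_API_KEY") -> None:
    """Fail early with a clear message if the provider key is not in the environment."""
    if not os.environ.get(env_var):
        raise RuntimeError(
            f"{env_var} is not set; export it or add it to your .env file before running."
        )
```

The CLI command could call this once before building the prompt, so nothing else in the pipeline needs to know about the key.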
- created command to generate csv schemas all at once
- made generating field types more intuitive than literal
- added guardrail for llm validation fails by allowing retries w/ context of the errors
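The retry guardrail can be sketched roughly as below. This is not the PR's implementation: `convert_with_retries`, `fake_llm`, and `require_str_title` are hypothetical, the model call is a stub, and the validator is assumed to raise `ValueError` on bad output.

```python
import json
from typing import Callable


def convert_with_retries(
    call_llm: Callable[[str], str],
    validate: Callable[[dict], None],
    prompt: str,
    max_retries: int = 3,
) -> dict:
    """Call the LLM, validate its JSON output, and retry with the error appended."""
    current_prompt = prompt
    last_error = None
    for _ in range(max_retries):
        raw = call_llm(current_prompt)
        try:
            data = json.loads(raw)
            validate(data)  # assumed to raise ValueError on invalid output
            return data
        except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
            last_error = exc
            # Feed the failure back so the next attempt can correct it.
            current_prompt = (
                f"{prompt}\n\nYour previous answer was invalid:\n{exc}\n"
                "Return corrected JSON only."
            )
    raise RuntimeError(f"No valid output after {max_retries} attempts: {last_error}")


# Stub model: returns an invalid answer once, then a valid one.
replies = iter(['{"title": 1}', '{"title": "ok"}'])
prompts_seen = []


def fake_llm(prompt: str) -> str:
    prompts_seen.append(prompt)
    return next(replies)


def require_str_title(data: dict) -> None:
    if not isinstance(data.get("title"), str):
        raise ValueError("title must be a string")


result = convert_with_retries(fake_llm, require_str_title, "Convert this.")
```

Capping the attempts keeps a consistently failing model from looping forever and keeps cost predictable.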
|
first stage of evaluation here should be:
|
- test_llm compares between logical conversion & llm conversion
- need to add soft assertion for complex cases, display differences for manual review
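For the manual-review part, a path-labelled diff between the deterministic output and the LLM output may be enough to start with. A minimal sketch: `diff_dicts` is a hypothetical helper, the sample data is made up, and lists are compared as whole values rather than element by element.

```python
def diff_dicts(expected: dict, actual: dict, path: str = "") -> list[str]:
    """List human-readable differences between two nested dicts."""
    differences = []
    for key in sorted(set(expected) | set(actual)):
        here = f"{path}.{key}" if path else str(key)
        if key not in actual:
            differences.append(f"missing in LLM output: {here}")
        elif key not in expected:
            differences.append(f"extra in LLM output: {here}")
        elif isinstance(expected[key], dict) and isinstance(actual[key], dict):
            # Recurse so nested mismatches are reported with their full path.
            differences.extend(diff_dicts(expected[key], actual[key], here))
        elif expected[key] != actual[key]:
            differences.append(f"{here}: {expected[key]!r} != {actual[key]!r}")
    return differences


# Made-up outputs standing in for the logical and LLM conversions.
logical = {"titles": {"title": "My Tool"}, "version": "1.0"}
llm = {"titles": {"title": "My tool"}, "publisher": "X"}
report = diff_dicts(logical, llm)
```

Each entry in `report` could then be fed to a soft assertion so one mismatch does not hide the rest.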
- cached passed test case llm outputs to use back in few shot prompting in ai_convert
- fixed schema descriptions
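Turning the cached passes into few-shot examples could look something like this. It is a sketch only: `build_few_shot_block` is a hypothetical helper, the records are illustrative, and the keys follow the `passed_case` dict written by the test code in this PR.

```python
import json


def build_few_shot_block(passed_cases: list[dict], source_target: str, limit: int = 2) -> str:
    """Format cached passing cases for one conversion direction as prompt examples."""
    matching = [c for c in passed_cases if c["source:target"] == source_target]
    parts = []
    for i, case in enumerate(matching[:limit], start=1):
        parts.append(
            f"EXAMPLE {i} INPUT:\n{json.dumps(case['source_metadata'])}\n"
            f"EXAMPLE {i} OUTPUT:\n{json.dumps(case['llm_output'])}"
        )
    return "\n\n".join(parts)


# Illustrative entries shaped like the records written to passed_cases.json.
cases = [
    {
        "file": "a.json",
        "source:target": "codemeta:datacite",
        "source_metadata": {"name": "Tool A"},
        "llm_output": {"titles": [{"title": "Tool A"}]},
    },
    {
        "file": "b.json",
        "source:target": "codemeta:cff",
        "source_metadata": {"name": "Tool B"},
        "llm_output": {"title": "Tool B"},
    },
]

few_shot = build_few_shot_block(cases, "codemeta:datacite")
```

Filtering by direction and capping the count keeps the prompt from growing with every passing test run.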
Pull request overview
This PR introduces an experimental “AI-assisted” metadata conversion path (via LiteLLM) alongside schema caching/prompting utilities, plus a new LLM-focused test suite and supporting fixtures/logs. It also refactors where format “standards” are defined and makes a small adjustment to the deterministic CodeMeta→DataCite conversion.
Changes:
- Add `ai-convert` + `generate-schemas` CLI commands and the underlying LLM conversion runner (`codemeticulous/ai_convert.py`).
- Add prompt strategy + schema cache generation/lookup utilities (`prompt_strategies.py`, `generate_schemas.py`, `schema_cache/*.csv`).
- Add experimental LLM tests (`tests_llm/`) and new CodeMeta fixtures.
Reviewed changes
Copilot reviewed 23 out of 27 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| .gitignore | Ignore .env. |
| README.md | Document experimental AI mode + how to run LLM tests. |
| pyproject.toml | Add LiteLLM/instructor and pytest-check deps. |
| codemeticulous/ai_convert.py | New AI conversion pipeline (prompting + validation + retries). |
| codemeticulous/cli.py | New ai-convert and generate-schemas commands. |
| codemeticulous/convert.py | Refactor to import shared STANDARDS. |
| codemeticulous/datacite/convert.py | Adjust deterministic conversion output (formats/sizes commented out). |
| codemeticulous/generate_schemas.py | Generate/read cached schema summaries for prompting. |
| codemeticulous/prompt_strategies.py | New prompt strategy abstraction + default prompt. |
| codemeticulous/standards.py | Central registry of supported standards/models/converters/schemas. |
| schema_cache/cff.csv | Cached schema summary for CFF. |
| schema_cache/codemeta.csv | Cached schema summary for CodeMeta. |
| schema_cache/datacite.csv | Cached schema summary for DataCite. |
| tests/data/codemeta/valid/artificial-anasazi.json | New CodeMeta fixture. |
| tests/data/codemeta/valid/caltech.json | New CodeMeta fixture. |
| tests/data/codemeta/valid/invenordm.json | New CodeMeta fixture. |
| tests_llm/__init__.py | Package marker for LLM tests. |
| tests_llm/conftest.py | LLM test logging fixture + file discovery helper. |
| tests_llm/test_llm.py | Parametrized LLM conversion tests + diff logging. |
| tests_llm/logs/openrouter-anthropic-claude-haiku-4-5.json | Added LLM run logs. |
| tests_llm/logs/openrouter-anthropic-claude-sonnet-4-6.json | Added LLM run logs. |
| tests_llm/logs/openrouter-deepseek-deepseek-chat.json | Added LLM run logs. |
| tests_llm/logs/openrouter-google-gemini-2.5-pro.json | Added LLM run logs. |
| tests_llm/logs/openrouter-mistralai-mistral-large.json | Added LLM run logs. |
| tests_llm/logs/openrouter-openai-gpt-4o.json | Added LLM run logs. |
Comments suppressed due to low confidence (2)
codemeticulous/ai_convert.py:121
`check_schema()` is defined as `check_schema(model: str, flag: bool=False)`, but here it’s called with `source_instance` as the second argument. Because model instances are truthy, this changes behavior (e.g., for datacite/cff it will load the full JSON schema instead of the cached CSV summary) and makes the prompt input inconsistent. Either change `check_schema` to accept a model instance explicitly (and rename the parameter), or pass an actual boolean flag here.
# Create summarized schema of source pydantic model according to data instance
source_schema_dict = check_schema(source_format, source_instance)
source_schema = json.dumps(source_schema_dict, indent=2)
target_schema = check_schema(target_format)
target_schema = json.dumps(target_schema, indent=2)
codemeticulous/generate_schemas.py:85
`check_schema()`’s second parameter is typed/used as a boolean toggle, but `convert_ai()` passes a model instance, and the comment suggests pruning “according to data instance” which isn’t implemented. Consider changing the signature to `check_schema(model: str, instance: BaseModel | None = None, use_full_schema: bool = False)` (or similar) so callers can’t accidentally flip behavior by passing a truthy non-bool value.
def check_schema(model: str, flag: bool = False):
    schema_file = STANDARDS[model]["schema"]
    if flag and schema_file is not None: # iterate through fields if there's an instance that calls for the schema to be pruned
        with open(schema_file, 'r') as f:
            schema = json.load(f)
        return schema
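One variation on the suggested signature is to make the toggle keyword-only, so a positional instance can never land in the boolean slot. The sketch below is an assumption-laden stand-in, not the PR's code: `STANDARDS` is a stub for the registry in `codemeticulous/standards.py`, the cached-summary branch is a placeholder, and pruning by `instance` is left unimplemented.

```python
import json
from pathlib import Path
from typing import Optional

# Hypothetical stub for the registry in codemeticulous/standards.py.
STANDARDS = {"datacite": {"schema": None}}


def check_schema(model: str, instance: Optional[object] = None, *, use_full_schema: bool = False):
    """Return the full JSON schema only when explicitly requested by keyword."""
    if not isinstance(use_full_schema, bool):
        raise TypeError("use_full_schema must be a bool")
    schema_file = STANDARDS[model]["schema"]
    if use_full_schema and schema_file is not None:
        return json.loads(Path(schema_file).read_text())
    # Placeholder for the cached-summary path; pruning by `instance` is not implemented here.
    return {"model": model, "source": "cached summary"}
```

With this shape, `check_schema("datacite", source_instance)` keeps using the cached summary, and a third positional argument raises `TypeError` instead of silently flipping behavior.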
| if isinstance(source_data, dict): | ||
| source_instance = source_model(**source_data) | ||
| elif isinstance(source_data, source_model): | ||
| source_instance = source_data |
| except Exception as e: | ||
| logging.error("ERROR: failed to create list from llm response: %s", e) | ||
| raise |
| except Exception as e: | ||
| click.echo(f"Failed to load file: {input_file}. {str(e)}", err=True) | ||
| if verbose: | ||
| traceback.print_exc() |
| LOGS_DIR.mkdir(exist_ok=True) | ||
| model_name = llm_model.replace("/", "-") | ||
| json_path = LOGS_DIR / f"{model_name}.json" | ||
|
|
||
| existing = json.loads(json_path.read_text()) if json_path.exists() else [] | ||
| existing.extend(results) | ||
| json_path.write_text(json.dumps(existing, indent=2)) |
| # add full llm outputs that passed into a separate log dump | ||
| if len(violations) == 0: | ||
| path = Path(__file__).parent / "logs" / "passed_cases.json" | ||
| raw = path.read_text() if path.exists() else "" | ||
| existing = json.loads(raw) if raw.strip() else [] | ||
|
|
||
| passed_case = { | ||
| "file": file_path.name, | ||
| "source:target": f"{source_format}:{target_format}", | ||
| "source_metadata": source_data, | ||
| "llm_output": ai_dict | ||
| } | ||
|
|
||
| existing.append(passed_case) | ||
| path.write_text(json.dumps(existing, indent=2)) | ||
|
|
| # sizes=[data.fileSize] if data.fileSize else None, | ||
| # formats=codemeta_language_fileformat_to_datacite_format( | ||
| # data.programmingLanguage, data.fileFormat | ||
| # ), |
|
|
||
| *Note: ensure that ```.env``` contains the necessary API keys to call the LLM.* | ||
|
|
||
| Run experimental AI tests |
| url,Optional,URL,N/A,The URL of the landing page for the resource that the DOI resolves to. This is a mandatory field for making a DOI findable.,https://example.com/ | ||
| types,Mandatory,Resource Type,"resourceType,resourceTypeGeneral",Information about the type of the resource. Usually a recommended free-text description paired with a general resource type from a controlled list.,"'resourceType': 'Census Data', 'resourceTypeGeneral': 'Dataset'" | ||
| creators,Mandatory,List of Persons,"nameType,givenName,familyName,nameIdentifier,affiliation","The main researchers or authors involved in producing the resource, in priority order.", | ||
| nameType,Recommended,Enum,N/A,Type of name of the creator.,'Organizational' or 'Personal'. |
| @@ -0,0 +1,65 @@ | |||
| field,type,description | |||
| @context,Text,the JSON-LD context URI for CodeMeta v3 (always "https://w3id.org/codemeta/3.0") | |||
| @@ -19,6 +21,7 @@ dev = [ | |||
| "black>=24.8.0", | |||
| "datamodel-code-generator>=0.26.2", | |||
| "pytest>=8.3.3", | |||
| "pytest-check>=2.8.0", | |||
| ] | |||