feat(WIP): implement llm for metadata conversion #3

Open
kellytea wants to merge 15 commits into SciCodes:main from kellytea:automate-conversion

@kellytea

  • working on conversion between known formats
  • using litellm
  • still need to validate content of the llm response + refactor llm runner configuration

- working on conversion between known formats
- using litellm
Comment thread codemeticulous/ai_convert.py Outdated
"Convert the SOURCE_DATA to match TARGET_MODEL_SCHEMA.\n"
"SOURCE_DATA:\n" + json.dumps(source_dict) + "\n\n"
"SOURCE_MODEL_SCHEMA:\n" + json.dumps(source_model.schema()) + "\n\n"
"TARGET_MODEL_SCHEMA:\n" + json.dumps(target_model.schema()) + "\n\n"

Collaborator

I bet context/docs (definitions of terms, etc.) for the schemas might help with results

we could put a description for all the fields in the pydantic models so that it would end up in the jsonschema from schema(), but this may be a lot of work, even with an initial pass of having AI fill them out from the docs below

alternatively, just throw in the full text of the documentation

https://codemeta.github.io/terms/
https://datacite-metadata-schema.readthedocs.io/en/4.6/
https://github.com/citation-file-format/citation-file-format/blob/main/schema-guide.md
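
One way to try the first option without hand-editing every model is to merge a separately maintained description table into the JSON Schema dict just before it goes into the prompt. The following is a minimal sketch only: `FIELD_DOCS`, `add_descriptions`, and the example schema are hypothetical and do not exist in the PR, and the description strings are placeholders rather than text from the linked docs.

```python
import copy
import json

# Hypothetical description table; real entries would be written from the
# CodeMeta / DataCite / CFF documentation linked above.
FIELD_DOCS = {
    "name": "Placeholder: the name of the software.",
    "codeRepository": "Placeholder: link to the source code repository.",
}


def add_descriptions(schema: dict, docs: dict) -> dict:
    """Return a copy of a JSON Schema dict with missing field descriptions filled in."""
    enriched = copy.deepcopy(schema)
    for field, spec in enriched.get("properties", {}).items():
        # Never overwrite a description the model already provides.
        if field in docs and "description" not in spec:
            spec["description"] = docs[field]
    return enriched


# Stand-in for the dict that source_model.schema() would return.
example_schema = {
    "title": "CodeMeta",
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "codeRepository": {"type": "string", "format": "uri"},
        "version": {"type": "string", "description": "Release version."},
    },
}

enriched_schema = add_descriptions(example_schema, FIELD_DOCS)
prompt_part = "TARGET_MODEL_SCHEMA:\n" + json.dumps(enriched_schema, indent=2)
```

Keeping the descriptions in a separate table means they can be reviewed and corrected without regenerating the pydantic models.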

Comment thread codemeticulous/ai_convert.py Outdated
Comment thread codemeticulous/ai_convert.py Outdated
- need to refactor api key handling
- implement splitting via strategy design pattern
- add testing
- overrides need to set env var
- less capable models like mistral-nemo are failing to map critical fields
- chunking doesn't seem to improve this; might try refining the C-o-T prompt
- stopped using instructor because it was producing validation errors
- need to fix passing api key in llm call rather than setting env var
- works bi-directionally between codemeticulous and other metadata formats now
Comment thread codemeticulous/ai_convert.py Outdated
source_instance = source_data

# Create summarized schema of source pydantic model according to data instance
source_schema_dict = get_schema_summary(source_model, llm_model, key, source_instance)

Collaborator

this should just be cached as a static data file so the descriptions aren't opaquely+non-deterministically generated every time
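
A minimal sketch of that idea, assuming one committed CSV per format with a header row. The directory name `schema_cache` and the helper `load_schema_summary` are placeholders here, not the PR's code.

```python
import csv
from functools import lru_cache
from pathlib import Path

SCHEMA_CACHE_DIR = Path("schema_cache")  # assumed location of the committed CSV files


@lru_cache(maxsize=None)
def load_schema_summary(fmt: str, cache_dir: Path = SCHEMA_CACHE_DIR) -> tuple:
    """Read a pre-generated schema summary once per process; never calls an LLM."""
    path = cache_dir / f"{fmt}.csv"
    with path.open(newline="", encoding="utf-8") as f:
        # Return a tuple so callers cannot append to the cached value.
        return tuple(dict(row) for row in csv.DictReader(f))
```

Because the file is static, the same prompt input is produced on every run, and changes to the descriptions show up as ordinary diffs in review.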

Comment thread codemeticulous/ai_convert.py Outdated
# )
# return response
try:
    os.environ["OPENROUTER_API_KEY"] = key  # FIXME: avoid setting env var

Collaborator

let's stop passing the key around and just assume the environment variable is set. That's a bit more secure and lets us support any API that litellm does with no extra work
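
A minimal sketch of what that could look like on the caller side: the key is never read into a variable or passed as an argument, and the code only checks that it is present. `require_env` is a hypothetical helper, and which variable is required depends on the provider in use (`OPENROUTER_API_KEY` is the one the snippet above sets).

```python
import os


def require_env(var_name: str = "OPENROUTER_API_KEY") -> None:
    """Fail early with a clear message if the provider key is not in the environment."""
    if not os.environ.get(var_name):
        raise RuntimeError(
            f"{var_name} is not set; export it (or load it from .env) before running."
        )
```

Calling this once at CLI start-up gives a readable error instead of a provider authentication failure deep inside the conversion.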

- created command to generate csv schemas all at once
- made generating field types more intuitive than literal
- added guardrail for llm validation fails by allowing retries w/ context of the errors
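
The retry guardrail in the last note can be sketched roughly as below. This is an illustration under assumptions, not the PR's implementation: `call_llm` and `validate` are hypothetical callables supplied by the caller, and `validate` is assumed to raise `ValueError` on invalid output.

```python
import json
from typing import Callable


def convert_with_retries(
    call_llm: Callable[[str], str],
    validate: Callable[[dict], None],
    prompt: str,
    max_attempts: int = 3,
) -> dict:
    """Call the model, validate the reply, and feed errors back on failure."""
    current_prompt = prompt
    last_error = None
    for _ in range(max_attempts):
        reply = call_llm(current_prompt)
        try:
            data = json.loads(reply)
            validate(data)  # assumed to raise ValueError on invalid output
            return data
        except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
            last_error = exc
            # Rebuild from the original prompt so error context does not pile up.
            current_prompt = (
                f"{prompt}\n\nYour previous answer was rejected:\n{exc}\n"
                "Return corrected JSON only."
            )
    raise RuntimeError(f"no valid output after {max_attempts} attempts") from last_error
```

Bounding the number of attempts keeps a model that cannot satisfy the schema from looping indefinitely.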
@sgfost

sgfost commented Mar 10, 2026

Collaborator

first stage of evaluation here should be:

- test_llm compares between logical conversion & llm conversion
- need to add soft assertion for complex cases, display differences for manual review
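
For the "display differences for manual review" part, one possible shape is to collect every mismatch between the deterministic and LLM outputs instead of stopping at the first failed assertion. `diff_dicts` below is a hypothetical helper sketched for illustration; it only recurses into nested dicts and compares lists as whole values.

```python
def diff_dicts(expected: dict, actual: dict, prefix: str = "") -> list:
    """Return human-readable differences between two nested dicts."""
    differences = []
    for key in sorted(set(expected) | set(actual)):
        path = f"{prefix}.{key}" if prefix else str(key)
        if key not in actual:
            differences.append(f"{path}: missing from LLM output")
        elif key not in expected:
            differences.append(f"{path}: only in LLM output")
        elif isinstance(expected[key], dict) and isinstance(actual[key], dict):
            differences.extend(diff_dicts(expected[key], actual[key], path))
        elif expected[key] != actual[key]:
            differences.append(f"{path}: {expected[key]!r} != {actual[key]!r}")
    return differences
```

A test can then log the returned list and apply a soft check per entry, so one complex case reports all of its differences in a single run.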
@alee alee marked this pull request as ready for review May 12, 2026 20:56
- cached passed test case llm outputs to use back in few shot prompting in ai_convert
- fixed schema descriptions
Copilot AI review requested due to automatic review settings May 12, 2026 20:57

Copilot AI left a comment

Pull request overview

This PR introduces an experimental “AI-assisted” metadata conversion path (via LiteLLM) alongside schema caching/prompting utilities, plus a new LLM-focused test suite and supporting fixtures/logs. It also refactors where format “standards” are defined and makes a small adjustment to the deterministic CodeMeta→DataCite conversion.

Changes:

  • Add ai-convert + generate-schemas CLI commands and the underlying LLM conversion runner (codemeticulous/ai_convert.py).
  • Add prompt strategy + schema cache generation/lookup utilities (prompt_strategies.py, generate_schemas.py, schema_cache/*.csv).
  • Add experimental LLM tests (tests_llm/) and new CodeMeta fixtures.

Reviewed changes

Copilot reviewed 23 out of 27 changed files in this pull request and generated 12 comments.

| File | Description |
| --- | --- |
| .gitignore | Ignore .env. |
| README.md | Document experimental AI mode + how to run LLM tests. |
| pyproject.toml | Add LiteLLM/instructor and pytest-check deps. |
| codemeticulous/ai_convert.py | New AI conversion pipeline (prompting + validation + retries). |
| codemeticulous/cli.py | New ai-convert and generate-schemas commands. |
| codemeticulous/convert.py | Refactor to import shared STANDARDS. |
| codemeticulous/datacite/convert.py | Adjust deterministic conversion output (formats/sizes commented out). |
| codemeticulous/generate_schemas.py | Generate/read cached schema summaries for prompting. |
| codemeticulous/prompt_strategies.py | New prompt strategy abstraction + default prompt. |
| codemeticulous/standards.py | Central registry of supported standards/models/converters/schemas. |
| schema_cache/cff.csv | Cached schema summary for CFF. |
| schema_cache/codemeta.csv | Cached schema summary for CodeMeta. |
| schema_cache/datacite.csv | Cached schema summary for DataCite. |
| tests/data/codemeta/valid/artificial-anasazi.json | New CodeMeta fixture. |
| tests/data/codemeta/valid/caltech.json | New CodeMeta fixture. |
| tests/data/codemeta/valid/invenordm.json | New CodeMeta fixture. |
| tests_llm/__init__.py | Package marker for LLM tests. |
| tests_llm/conftest.py | LLM test logging fixture + file discovery helper. |
| tests_llm/test_llm.py | Parametrized LLM conversion tests + diff logging. |
| tests_llm/logs/openrouter-anthropic-claude-haiku-4-5.json | Added LLM run logs. |
| tests_llm/logs/openrouter-anthropic-claude-sonnet-4-6.json | Added LLM run logs. |
| tests_llm/logs/openrouter-deepseek-deepseek-chat.json | Added LLM run logs. |
| tests_llm/logs/openrouter-google-gemini-2.5-pro.json | Added LLM run logs. |
| tests_llm/logs/openrouter-mistralai-mistral-large.json | Added LLM run logs. |
| tests_llm/logs/openrouter-openai-gpt-4o.json | Added LLM run logs. |
Comments suppressed due to low confidence (2)

codemeticulous/ai_convert.py:121

  • check_schema() is defined as check_schema(model: str, flag: bool=False), but here it’s called with source_instance as the second argument. Because model instances are truthy, this changes behavior (e.g., for datacite/cff it will load the full JSON schema instead of the cached CSV summary) and makes the prompt input inconsistent. Either change check_schema to accept a model instance explicitly (and rename the parameter), or pass an actual boolean flag here.
    # Create summarized schema of source pydantic model according to data instance
    source_schema_dict = check_schema(source_format, source_instance)
    source_schema = json.dumps(source_schema_dict, indent=2)

    target_schema = check_schema(target_format)
    target_schema = json.dumps(target_schema, indent=2)

codemeticulous/generate_schemas.py:85

  • check_schema()’s second parameter is typed/used as a boolean toggle, but convert_ai() passes a model instance, and the comment suggests pruning “according to data instance” which isn’t implemented. Consider changing the signature to check_schema(model: str, instance: BaseModel | None = None, use_full_schema: bool = False) (or similar) so callers can’t accidentally flip behavior by passing a truthy non-bool value.
def check_schema(model: str, flag: bool = False): 
    schema_file = STANDARDS[model]["schema"]

    if flag and schema_file is not None: # iterate through fields if there's an instance that calls for the schema to be pruned
        with open(schema_file, 'r') as f:
            schema = json.load(f)
        
        return schema
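
A minimal sketch of the kind of signature the comment suggests, using a keyword-only boolean so that a truthy non-bool value can no longer flip the behavior by accident. The `STANDARDS` dict here is a hypothetical stand-in for the registry in `codemeticulous/standards.py`, and the parameter name `use_full_schema` follows the comment's suggestion rather than any existing code.

```python
import json
from typing import Optional

# Hypothetical stand-in for the registry in codemeticulous/standards.py.
STANDARDS = {"datacite": {"schema": None}}


def check_schema(model: str, *, use_full_schema: bool = False) -> Optional[dict]:
    """Load the full JSON schema only when explicitly asked to via keyword."""
    if not isinstance(use_full_schema, bool):
        raise TypeError("use_full_schema must be a bool")
    schema_file = STANDARDS[model]["schema"]
    if use_full_schema and schema_file is not None:
        with open(schema_file, "r", encoding="utf-8") as f:
            return json.load(f)
    return None  # the real function would fall back to the cached CSV summary here
```

With this shape, a call like `check_schema(source_format, source_instance)` raises `TypeError` immediately instead of silently loading the full schema.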
    

if isinstance(source_data, dict):
    source_instance = source_model(**source_data)
elif isinstance(source_data, source_model):
    source_instance = source_data
Comment on lines +72 to +74
except Exception as e:
    logging.error(f"ERROR: failed to create list from llm response: ", e)
    raise
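
In the quoted handler, `e` is passed as a positional argument but the message contains no `%s` placeholder, so `logging` reports a formatting error when the record is emitted instead of printing the message. A minimal sketch of the usual pattern follows; `parse_items` and the logger name are placeholders for illustration, not code from the PR.

```python
import logging

logger = logging.getLogger("ai_convert_example")  # hypothetical logger name


def parse_items(raw: str) -> list:
    """Split a comma-separated reply, logging and re-raising on failure."""
    try:
        return [item.strip() for item in raw.split(",")]
    except Exception:
        # %r is filled in lazily by logging; logger.exception also records the traceback.
        logger.exception("failed to create list from llm response: %r", raw)
        raise
```

Using `logger.exception` inside the `except` block keeps the traceback in the log while the bare `raise` still propagates the original error.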
Comment thread codemeticulous/cli.py
except Exception as e:
    click.echo(f"Failed to load file: {input_file}. {str(e)}", err=True)
    if verbose:
        traceback.print_exc()
Comment thread tests_llm/conftest.py
Comment on lines +16 to +22
LOGS_DIR.mkdir(exist_ok=True)
model_name = llm_model.replace("/", "-")
json_path = LOGS_DIR / f"{model_name}.json"

existing = json.loads(json_path.read_text()) if json_path.exists() else []
existing.extend(results)
json_path.write_text(json.dumps(existing, indent=2))
Comment thread tests_llm/test_llm.py Outdated
Comment on lines +187 to +202
# add full llm outputs that passed into a separate log dump
if len(violations) == 0:
    path = Path(__file__).parent / "logs" / "passed_cases.json"
    raw = path.read_text() if path.exists() else ""
    existing = json.loads(raw) if raw.strip() else []

    passed_case = {
        "file": file_path.name,
        "source:target": f"{source_format}:{target_format}",
        "source_metadata": source_data,
        "llm_output": ai_dict
    }

    existing.append(passed_case)
    path.write_text(json.dumps(existing, indent=2))
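
This hunk and the `conftest.py` hunk before it follow the same read, append, rewrite pattern on a JSON array file. As a sketch only, a shared helper could hold that logic in one place. `append_json_record` is a hypothetical name, and the empty-file handling mirrors the `raw.strip()` check quoted above.

```python
import json
from pathlib import Path


def append_json_record(path: Path, record: dict) -> list:
    """Append one record to a JSON array file, creating the file if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    raw = path.read_text(encoding="utf-8") if path.exists() else ""
    # Treat a missing or empty file as an empty array.
    existing = json.loads(raw) if raw.strip() else []
    existing.append(record)
    path.write_text(json.dumps(existing, indent=2), encoding="utf-8")
    return existing
```

Note that the whole file is rewritten on every call, so this is only suitable for small logs and single-process test runs.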

Comment thread codemeticulous/datacite/convert.py Outdated
Comment on lines +387 to +390
# sizes=[data.fileSize] if data.fileSize else None,
# formats=codemeta_language_fileformat_to_datacite_format(
# data.programmingLanguage, data.fileFormat
# ),
Comment thread README.md Outdated

*Note: ensure that ```.env``` contains the necessary API keys to call the LLM.*

Run experimental AI tests
Comment thread schema_cache/datacite.csv Outdated
url,Optional,URL,N/A,The URL of the landing page for the resource that the DOI resolves to. This is a mandatory field for making a DOI findable.,https://example.com/
types,Mandatory,Resource Type,"resourceType,resourceTypeGeneral",Information about the type of the resource. Usually a recommended free-text description paired with a general resource type from a controlled list.,"'resourceType': 'Census Data', 'resourceTypeGeneral': 'Dataset'"
creators,Mandatory,List of Persons,"nameType,givenName,familyName,nameIdentifier,affiliation","The main researchers or authors involved in producing the resource, in priority order.",
nameType,Recommended,Enum,N/A,Type of name of the creator.,'Organizational' or 'Personal'.
Comment thread schema_cache/codemeta.csv
@@ -0,0 +1,65 @@
field,type,description
@context,Text,the JSON-LD context URI for CodeMeta v3 (always "https://w3id.org/codemeta/3.0")
Comment thread pyproject.toml
Comment on lines 7 to 25
@@ -19,6 +21,7 @@ dev = [
"black>=24.8.0",
"datamodel-code-generator>=0.26.2",
"pytest>=8.3.3",
"pytest-check>=2.8.0",
]

3 participants