feat(WIP): implement llm for metadata conversion by kellytea · Pull Request #3 · SciCodes/codemeticulous

kellytea · 2025-09-30T21:47:08Z

- working on conversion between known formats
- using litellm
- still need to validate the content of the llm response and refactor the llm runner configuration

- working on conversion between known formats
- using litellm

sgfost · 2025-10-01T17:55:44Z

+        "Convert the SOURCE_DATA to match TARGET_MODEL_SCHEMA.\n"
+        "SOURCE_DATA:\n" + json.dumps(source_dict) + "\n\n"
+        "SOURCE_MODEL_SCHEMA:\n" + json.dumps(source_model.schema()) + "\n\n"
+        "TARGET_MODEL_SCHEMA:\n" + json.dumps(target_model.schema()) + "\n\n"


I bet context/docs (definitions of terms, etc.) for the schemas might help with results

we could put a description for all the fields in the pydantic models so that it would end up in the jsonschema from schema(), but this may be a lot of work, even with an initial pass of having AI fill them out from the docs below

alternatively, just throw in the full text of the documentation

https://codemeta.github.io/terms/
https://datacite-metadata-schema.readthedocs.io/en/4.6/
https://github.com/citation-file-format/citation-file-format/blob/main/schema-guide.md
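A rough sketch of a third option, sitting between the two above: keep a small glossary of field descriptions and merge it into the JSON schema dict after it is generated, so the descriptions reach the prompt without editing every pydantic field. The function name `attach_descriptions`, the trimmed schema, and the glossary wording are all made up for illustration and are not taken from the PR or from the linked docs.

```python
import copy
import json


def attach_descriptions(schema: dict, descriptions: dict) -> dict:
    """Return a copy of `schema` with descriptions added to known properties."""
    enriched = copy.deepcopy(schema)
    for name, prop in enriched.get("properties", {}).items():
        # Only fill in missing descriptions; never overwrite existing ones.
        if name in descriptions and "description" not in prop:
            prop["description"] = descriptions[name]
    return enriched


# Hypothetical, hand-trimmed schema and glossary used only for illustration.
schema = {
    "title": "CodeMeta",
    "type": "object",
    "properties": {
        "codeRepository": {"type": "string"},
        "programmingLanguage": {"type": "string"},
    },
}
glossary = {
    "codeRepository": "URL of the source code repository (illustrative wording)",
}

enriched = attach_descriptions(schema, glossary)
print(json.dumps(enriched, indent=2))
```

The original schema dict is left untouched, so the same glossary could be applied to whatever `schema()` returns without changing the models.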

- need to refactor api key handling
- implement splitting via strategy design pattern
- add testing

- overrides need to set env var

- less capable models like mistral-nemo are failing to map critical fields
- chunking doesn't seem to improve this, might try refining the C-o-T prompt

- removed instructor because it was having validation errors
- need to fix passing the api key in the llm call rather than setting an env var
- works bi-directionally between codemeticulous and other metadata formats now

sgfost · 2026-02-03T20:55:21Z

+        source_instance = source_data
+
+    # Create summarized schema of source pydantic model according to data instance
+    source_schema_dict = get_schema_summary(source_model, llm_model, key, source_instance)


this should just be cached as a static data file so the descriptions aren't opaquely+non-deterministically generated every time
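A minimal sketch of what reading such a static file could look like, assuming the cache is a CSV with a `field,type,description` header. The helper name `load_schema_summary` and the sample rows are hypothetical and are not the project's actual loader or cache content.

```python
import csv
import io


def load_schema_summary(csv_text: str) -> list[dict]:
    """Parse a cached schema summary (field,type,description) into row dicts."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in reader]


# Illustrative cache content; a real cache would be read from a checked-in file.
CACHED = (
    "field,type,description\n"
    "@context,Text,the JSON-LD context URI\n"
    "name,Text,the name of the software\n"
)

rows = load_schema_summary(CACHED)
for row in rows:
    print(f"{row['field']} ({row['type']}): {row['description']}")
```

Because the file is static, the same descriptions go into every prompt and any change to them shows up in review as an ordinary diff.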

sgfost · 2026-02-03T21:51:29Z

+    #     )
+    #     return response
+    try:
+        os.environ["OPENROUTER_API_KEY"] = key #FIXME: avoid setting env var


let's remove the key getting passed around and just assume the environment variable is set. A bit more secure, and it lets us support any api that litellm does with no extra work
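One way this could look, as a sketch only: check once at startup that the variable exists and fail with a readable message, without ever reading the key into application code. `ensure_env` is a hypothetical helper, not something in the PR.

```python
import os


def ensure_env(name: str) -> None:
    """Raise a readable error if the named environment variable is unset or empty."""
    if not os.environ.get(name, "").strip():
        raise RuntimeError(
            f"{name} is not set; export it (or add it to .env) before running the AI conversion"
        )
```

A CLI entry point could call `ensure_env("OPENROUTER_API_KEY")` before the first request and leave the key lookup itself to the client library.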

- created command to generate csv schemas all at once
- made generating field types more intuitive than literal
- added guardrail for llm validation fails by allowing retries w/ context of the errors
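The retry guardrail mentioned in this commit can be sketched roughly as follows. This is not the PR's implementation: `convert_with_retries`, `call_llm`, and `validate` are hypothetical names, and the model call is replaced by a fake so the example runs offline.

```python
from typing import Callable


def convert_with_retries(
    call_llm: Callable[[str], str],
    validate: Callable[[str], list],
    prompt: str,
    max_attempts: int = 3,
) -> str:
    """Call the model, feeding validation errors back into the prompt on failure."""
    errors: list = []
    for _ in range(max_attempts):
        current_prompt = prompt
        if errors:
            # Give the model the concrete errors from its previous attempt.
            current_prompt += "\n\nYour previous answer failed validation:\n"
            current_prompt += "\n".join(f"- {error}" for error in errors)
        response = call_llm(current_prompt)
        errors = validate(response)
        if not errors:
            return response
    raise ValueError(f"validation still failing after {max_attempts} attempts: {errors}")


# Fake model and validator so the sketch runs without network access.
def fake_llm(prompt: str) -> str:
    return '{"title": "demo"}' if "failed validation" in prompt else "{}"


def fake_validate(response: str) -> list:
    return [] if "title" in response else ["missing required field: title"]


result = convert_with_retries(fake_llm, fake_validate, "Convert SOURCE_DATA")
print(result)
```

Capping the attempts keeps a model that never produces valid output from looping forever, and the final error carries the last set of validation messages.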

sgfost · 2026-03-10T22:15:27Z

first stage of evaluation here should be:

1. take a set of codemeta objects (https://github.com/SciCodes/codemeticulous/tree/main/tests/data or https://github.com/SciCodes/software-metadata-extraction-benchmark)
2. convert them to other formats with the logical conversion (T_logical)
3. do the same conversions with the llm conversion (T_llm) and make sure T_llm ⊇ T_logical
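The containment check could be sketched like this, under one possible reading of T_llm ⊇ T_logical: every leaf value in the logical output must appear at the same path, with the same value, in the LLM output. The helper names are hypothetical, and the strict list-order comparison is an assumption that may be too rigid for some fields.

```python
_MISSING = object()


def flatten(value, prefix=""):
    """Yield (path, leaf) pairs for every scalar in a nested dict/list structure."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from flatten(child, f"{prefix}[{index}]")
    else:
        yield prefix, value


def missing_from_llm(t_logical: dict, t_llm: dict) -> dict:
    """Return the leaves of t_logical that t_llm lacks or maps to a different value."""
    llm_leaves = dict(flatten(t_llm))
    return {
        path: leaf
        for path, leaf in flatten(t_logical)
        if llm_leaves.get(path, _MISSING) != leaf
    }


# Made-up outputs for illustration only.
t_logical = {"titles": [{"title": "demo"}], "version": "1.0"}
t_llm_ok = {"titles": [{"title": "demo"}], "version": "1.0", "publisher": "Example Org"}
t_llm_bad = {"titles": [{"title": "demo"}]}

print(missing_from_llm(t_logical, t_llm_ok))
print(missing_from_llm(t_logical, t_llm_bad))
```

Returning the differing leaves, rather than a bare boolean, leaves room for the soft assertions and manual review of complex cases.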

- test_llm compares between logical conversion & llm conversion
- need to add soft assertion for complex cases, display differences for manual review

- cached passed test case llm outputs to use back in few shot prompting in ai_convert
- fixed schema descriptions

Copilot

Pull request overview

This PR introduces an experimental “AI-assisted” metadata conversion path (via LiteLLM) alongside schema caching/prompting utilities, plus a new LLM-focused test suite and supporting fixtures/logs. It also refactors where format “standards” are defined and makes a small adjustment to the deterministic CodeMeta→DataCite conversion.

Changes:

- Add ai-convert + generate-schemas CLI commands and the underlying LLM conversion runner (codemeticulous/ai_convert.py).
- Add prompt strategy + schema cache generation/lookup utilities (prompt_strategies.py, generate_schemas.py, schema_cache/*.csv).
- Add experimental LLM tests (tests_llm/) and new CodeMeta fixtures.

Reviewed changes

Copilot reviewed 23 out of 27 changed files in this pull request and generated 12 comments.

Show a summary per file

File	Description
.gitignore	Ignore `.env`.
README.md	Document experimental AI mode + how to run LLM tests.
pyproject.toml	Add LiteLLM/instructor and pytest-check deps.
codemeticulous/ai_convert.py	New AI conversion pipeline (prompting + validation + retries).
codemeticulous/cli.py	New `ai-convert` and `generate-schemas` commands.
codemeticulous/convert.py	Refactor to import shared `STANDARDS`.
codemeticulous/datacite/convert.py	Adjust deterministic conversion output (formats/sizes commented out).
codemeticulous/generate_schemas.py	Generate/read cached schema summaries for prompting.
codemeticulous/prompt_strategies.py	New prompt strategy abstraction + default prompt.
codemeticulous/standards.py	Central registry of supported standards/models/converters/schemas.
schema_cache/cff.csv	Cached schema summary for CFF.
schema_cache/codemeta.csv	Cached schema summary for CodeMeta.
schema_cache/datacite.csv	Cached schema summary for DataCite.
tests/data/codemeta/valid/artificial-anasazi.json	New CodeMeta fixture.
tests/data/codemeta/valid/caltech.json	New CodeMeta fixture.
tests/data/codemeta/valid/invenordm.json	New CodeMeta fixture.
tests_llm/__init__.py	Package marker for LLM tests.
tests_llm/conftest.py	LLM test logging fixture + file discovery helper.
tests_llm/test_llm.py	Parametrized LLM conversion tests + diff logging.
tests_llm/logs/openrouter-anthropic-claude-haiku-4-5.json	Added LLM run logs.
tests_llm/logs/openrouter-anthropic-claude-sonnet-4-6.json	Added LLM run logs.
tests_llm/logs/openrouter-deepseek-deepseek-chat.json	Added LLM run logs.
tests_llm/logs/openrouter-google-gemini-2.5-pro.json	Added LLM run logs.
tests_llm/logs/openrouter-mistralai-mistral-large.json	Added LLM run logs.
tests_llm/logs/openrouter-openai-gpt-4o.json	Added LLM run logs.

Comments suppressed due to low confidence (2)

codemeticulous/ai_convert.py:121

check_schema() is defined as check_schema(model: str, flag: bool=False), but here it’s called with source_instance as the second argument. Because model instances are truthy, this changes behavior (e.g., for datacite/cff it will load the full JSON schema instead of the cached CSV summary) and makes the prompt input inconsistent. Either change check_schema to accept a model instance explicitly (and rename the parameter), or pass an actual boolean flag here.

    # Create summarized schema of source pydantic model according to data instance
    source_schema_dict = check_schema(source_format, source_instance)
    source_schema = json.dumps(source_schema_dict, indent=2)

    target_schema = check_schema(target_format)
    target_schema = json.dumps(target_schema, indent=2)

codemeticulous/generate_schemas.py:85

check_schema()’s second parameter is typed/used as a boolean toggle, but convert_ai() passes a model instance, and the comment suggests pruning “according to data instance” which isn’t implemented. Consider changing the signature to check_schema(model: str, instance: BaseModel | None = None, use_full_schema: bool = False) (or similar) so callers can’t accidentally flip behavior by passing a truthy non-bool value.

def check_schema(model: str, flag: bool = False): 
    schema_file = STANDARDS[model]["schema"]

    if flag and schema_file is not None: # iterate through fields if there's an instance that calls for the schema to be pruned
        with open(schema_file, 'r') as f:
            schema = json.load(f)
        
        return schema
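A sketch of the signature change suggested above, assuming the toggle becomes keyword-only so a model instance can never be mistaken for it. The two dictionaries are hypothetical in-memory stand-ins for the `STANDARDS` registry and the cached CSV summaries; the real function reads files instead.

```python
from typing import Any, Optional

# Hypothetical stand-ins for the full JSON schemas and cached CSV summaries.
FULL_SCHEMAS = {"datacite": {"title": "DataCite", "properties": {}}}
CACHED_SUMMARIES = {"datacite": [{"field": "url", "type": "URL"}]}


def check_schema(
    model: str,
    instance: Optional[Any] = None,
    *,
    use_full_schema: bool = False,
):
    """Return the full schema only when explicitly requested by keyword."""
    if not isinstance(use_full_schema, bool):
        raise TypeError("use_full_schema must be a bool")
    # `instance` is accepted for future pruning; this sketch does not use it.
    if use_full_schema and model in FULL_SCHEMAS:
        return FULL_SCHEMAS[model]
    return CACHED_SUMMARIES[model]


# Passing an instance positionally no longer flips the behavior.
print(check_schema("datacite", object()))
print(check_schema("datacite", use_full_schema=True))
```

With the `*` separator, a call such as `check_schema("datacite", None, True)` raises a `TypeError` instead of silently switching schemas.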

+    if isinstance(source_data, dict):
+        source_instance = source_model(**source_data)
+    elif isinstance(source_data, source_model):
+        source_instance = source_data


+        except Exception as e:
+            logging.error(f"ERROR: failed to create list from llm response: ", e)  
+            raise             
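A note on the hunk above: the call passes `e` as a positional argument without a `%s` placeholder, so the logging module will most likely report a formatting error when the record is emitted rather than append the exception text. One possible fix is sketched below; `parse_llm_list` is a hypothetical wrapper, not the function in the PR.

```python
import json
import logging

logger = logging.getLogger(__name__)


def parse_llm_list(raw: str) -> list:
    """Parse an LLM response that is expected to be a JSON list."""
    try:
        parsed = json.loads(raw)
        if not isinstance(parsed, list):
            raise ValueError(f"expected a JSON list, got {type(parsed).__name__}")
        return parsed
    except ValueError:
        # json.JSONDecodeError subclasses ValueError, so both failures land here.
        # logger.exception logs at ERROR level and appends the traceback.
        logger.exception("failed to create list from llm response")
        raise
```

If the traceback is too noisy, `logger.error("failed to create list from llm response: %s", e)` is the lighter alternative.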


+    except Exception as e:
+        click.echo(f"Failed to load file: {input_file}. {str(e)}", err=True)
+        if verbose:
+            traceback.print_exc()


+    LOGS_DIR.mkdir(exist_ok=True)
+    model_name = llm_model.replace("/", "-")
+    json_path = LOGS_DIR / f"{model_name}.json"
+
+    existing = json.loads(json_path.read_text()) if json_path.exists() else []
+    existing.extend(results)
+    json_path.write_text(json.dumps(existing, indent=2))


+    # add full llm outputs that passed into a seperate log dump
+    if len(violations) == 0:
+        path = Path(__file__).parent / "logs" / "passed_cases.json"
+        raw = path.read_text() if path.exists() else ""
+        existing = json.loads(raw) if raw.strip() else []
+
+        passed_case = {
+            "file": file_path.name,
+            "source:target": f"{source_format}:{target_format}",
+            "source_metadata": source_data,
+            "llm_output": ai_dict
+        }
+
+        existing.append(passed_case)
+        path.write_text(json.dumps(existing, indent=2))
+


+                # sizes=[data.fileSize] if data.fileSize else None,
+                # formats=codemeta_language_fileformat_to_datacite_format(
+                #     data.programmingLanguage, data.fileFormat
+                # ),


+
+*Note: ensure that ```.env``` contains the necessary API keys to call the LLM.*
+
+Run experiemental AI tests


+url,Optional,URL,N/A,The URL of the landing page for the resource that the DOI resolves to. This is a mandatory field for making a DOI findable.,https://example.com/
+types,Mandatory,Resource Type,"resourceType,resourceTypeGeneral",Information about the type of the resource. Usually a recommended free-text description paired with a general resource type from a controlled list.,"'resourceType': 'Census Data', 'resourceTypeGeneral': 'Dataset'"
+creators,Mandatory,List of Persons,"nameType,givenName,familyName,nameIdentifier,affiliation","The main researchers or authors involved in producing the resource, in priority order.",
+nameType,Recommended,Enum,N/A,Type of nasme of the creator.,'Organizational' or 'Personal'.


@@ -0,0 +1,65 @@
+field,type,description
+@context,Text,the JSON-LD context URI for CodeMeta v3 (always "https://w3id.org/codemeta/3.0")


@@ -19,6 +21,7 @@ dev = [
    "black>=24.8.0",
    "datamodel-code-generator>=0.26.2",
    "pytest>=8.3.3",
+    "pytest-check>=2.8.0",
 ]


feat(WIP): implement llm for metadata conversion

a0e2726

- working on conversion between known formats
- using litellm

sgfost reviewed Oct 1, 2025

Comment thread codemeticulous/ai_convert.py Outdated

kellytea added 5 commits November 10, 2025 14:08

feat: added llm conversion

1585ec8

- need to refactor api key handling
- implement splitting via strategy design pattern
- add testing

fix: simplified llm calls via instructor

fd2a876

- overrides need to set env var

fix(WIP): improve "poor" llm model performance

27c6edf

- less capable models like mistral-nemo are failing to map critical fields
- chunking doesn't seem to improve this, might try refining the C-o-T prompt

feat: summarize pydantic schemas with generated descriptions via llm

3a42cc5

feat: pruned schemas into llm conversion calls

74bfca7

- removed instructor because it was having validation errors
- need to fix passing the api key in the llm call rather than setting an env var
- works bi-directionally between codemeticulous and other metadata formats now

sgfost reviewed Feb 3, 2026

kellytea added 4 commits February 16, 2026 18:31

fix(WIP): consolidate PR fixes

e7fa909

fix: refactored schema generation + ai conversion logic

e67cdf9

- created command to generate csv schemas all at once
- made generating field types more intuitive than literal
- added guardrail for llm validation fails by allowing retries w/ context of the errors

update README.md

588bba7

update README.md

16c2e1a

kellytea added 3 commits March 24, 2026 17:11

test(WIP): first layer evaluation for llm's conversion performance

728dbd8

- test_llm compares between logical conversion & llm conversion
- need to add soft assertion for complex cases, display differences for manual review

test(WIP): adding logging by llm model

aeb471b

test: refactored observability logs

fb4b493

alee marked this pull request as ready for review May 12, 2026 20:56

feat: added few shot examples

483e1c6

- cached passed test case llm outputs to use back in few shot prompting in ai_convert
- fixed schema descriptions

Copilot AI review requested due to automatic review settings May 12, 2026 20:57

Copilot started reviewing on behalf of kellytea May 12, 2026 20:58

Copilot AI reviewed May 12, 2026

fix: consolidated copilot suggestions

363e9a5

kellytea wants to merge 15 commits into SciCodes:main from kellytea:automate-conversion
