This repository was archived by the owner on Mar 2, 2026. It is now read-only.

version-cv

CodeQL Advanced Bandit

Version-cv is a research-driven deep learning repository focused on mathematical problem solving, image recognition, and mathematical reasoning in large language models (LLMs). It builds on the foundation of version-tab, which emphasizes mathematical symbolic reasoning, math-based vectorization, and tabular LLM development.

Extending this work into the visual domain, version-cv targets vision-based tasks and multimodal understanding. It integrates PyFlink for distributed data processing, Apache Atlas for metadata and lineage tracking, Apache Airflow for workflow orchestration, PyArrow for efficient in-memory columnar data interchange, and Mojo for high-performance AI/ML development. Together, these technologies are intended to support scalable, reproducible research across structured and unstructured data pipelines.

Due to time constraints during the project's development, many of these tools were not fully leveraged, but they are included as a contribution to the open-source community for continued research, experimentation, and advancement in this space.

Research Publications/References:

See the Research & References section below for the broader research context of this project.


📊 Key Datasets

Initial benchmarks were considered but not implemented because of time constraints on research and builds. They are provided in the data/ and docs/ directories.


📁 Project Structure

version-cv/
├── cloud
├── data
├── docs
├── models
├── notebooks
├── sandbox
├── .gitattributes
├── .gitignore
├── CITATION.cff
├── LICENSE
├── README.md
├── install_pixi.sh
├── pixi.lock
└── pixi.toml

⚑ Setup

This project is built with Pixi to manage environments and Python dependencies.

# Install Pixi if not already installed (-L follows redirects)
curl -fsSL https://pixi.sh/install.sh | bash

# Or run the installation script
./install_pixi.sh

# Only needed for a new project: creates pixi.toml.
# This repository already includes pixi.toml and pixi.lock, so skip this step after cloning.
pixi init

# Install dependencies
pixi install

# Enter Pixi environment
pixi shell

# Pixi environment information
pixi info

🚀 Quick Start

The model script and notebooks may run slowly on machines with limited compute. Open the Jupyter notebooks to see how the models were designed.

Option 1: Using Pixi shell

python models/basemodel.py
jupyter lab

Option 2: One-liner

pixi run python models/basemodel.py
pixi run jupyter lab

📊 Running & Viewing Results

  1. Place your images and data in the data/ directory (e.g., data/handwriting, data/formulas).

  2. Run training/inference:

python models/basemodel.py
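Before running the script, it can help to confirm that the data/ directory contains what you expect. The following is a minimal sketch, not part of the repository: the function name count_images and the list of image suffixes are assumptions made for illustration, and the layout it expects (subfolders such as data/handwriting and data/formulas) is taken from step 1 above.

```python
from collections import Counter
from pathlib import Path

# Assumption: these suffixes cover the image formats in use; adjust as needed.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".bmp", ".gif", ".tif", ".tiff"}


def count_images(data_dir="data"):
    """Count image files per top-level subfolder of data_dir (hypothetical helper)."""
    root = Path(data_dir)
    counts = Counter()
    for path in root.rglob("*"):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            rel = path.relative_to(root)
            # Files placed directly in data_dir are grouped under ".".
            group = rel.parts[0] if len(rel.parts) > 1 else "."
            counts[group] += 1
    return dict(counts)
```

Calling count_images("data") from the repository root returns a dictionary that maps each subfolder name to its image count. An empty dictionary suggests the images are not where the script will look for them.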

🧪 Notebooks

Jupyter Lab:

jupyter lab

Open basemodel.ipynb or mathwriting.ipynb from the notebooks/ folder for exploratory workflows.


📃 Research & References

  • Gervais et al., MathWriting: A Dataset for Handwritten Mathematical Expression Recognition, arXiv:2404.10690
  • Saxton et al., Analyzing Mathematical Reasoning Abilities of Neural Models, arXiv:1904.01557
  • OpenAI, Improving Mathematical Reasoning with Process Supervision (2023), blog post
  • Hendrycks et al., Measuring Mathematical Problem Solving With the MATH Dataset, arXiv:2103.03874

Additional implementation notes are in docs/ and data usage info is in data/.


🛡️ Security Note

version-cv integrates two Python security tools: Bandit for static code analysis and pip-audit for dependency vulnerability scanning. Security and reproducibility improvements are welcome via pull requests.

Bandit

Bandit is a static analysis tool that identifies common security issues in Python code.

To run manually:

bandit -r models/ notebooks/
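Bandit analyzes Python source files, so code that lives only inside .ipynb notebooks is likely to be skipped by the command above. One possible workaround is to export the code cells to .py files first and scan those. The sketch below is an assumption-laden illustration, not part of the repository: the helper names and the output directory are hypothetical, and it assumes the notebooks are standard nbformat 4 JSON.

```python
import json
from pathlib import Path


def notebook_to_source(notebook_path):
    """Return the code cells of one notebook as Python source text (hypothetical helper)."""
    nb = json.loads(Path(notebook_path).read_text(encoding="utf-8"))
    chunks = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # nbformat stores cell source as a string or as a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        # Comment out IPython magics and shell escapes, which are not valid Python.
        lines = [
            "# " + line if line.lstrip().startswith(("%", "!")) else line
            for line in source.splitlines()
        ]
        chunks.append("\n".join(lines))
    return "\n\n".join(chunks) + "\n"


def export_notebooks(notebook_dir, out_dir):
    """Write one .py file per notebook in notebook_dir and return the written paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for nb_path in sorted(Path(notebook_dir).glob("*.ipynb")):
        target = out / (nb_path.stem + ".py")
        target.write_text(notebook_to_source(nb_path), encoding="utf-8")
        written.append(target)
    return written
```

For example, export_notebooks("notebooks", "build/notebooks_py") followed by bandit -r build/notebooks_py would scan the exported code. The build/notebooks_py path is only a placeholder. Line numbers in the resulting report refer to the exported files, not to notebook cells.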

pip-audit

pip-audit scans the Python packages in your environment for known vulnerabilities.

To run:

pixi run pip-audit

Or, inside an activated Pixi shell:

pip-audit
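pip-audit can also emit a machine-readable report with its -f json option, which is convenient for filtering results in a script. The sketch below is a hedged illustration rather than part of the repository: summarize_audit is a hypothetical helper, and the report shape it reads (a "dependencies" list whose entries carry "name", "version", and "vulns") is an assumption based on the format documented by pip-audit, so check it against the output of your installed version.

```python
import json


def summarize_audit(report_text):
    """Return (package, version, vuln_id, fix_versions) tuples from a pip-audit JSON report."""
    report = json.loads(report_text)
    # Assumption: recent reports wrap the list in {"dependencies": [...]};
    # a bare list is also accepted in case an older version emits one.
    deps = report.get("dependencies", []) if isinstance(report, dict) else report
    findings = []
    for dep in deps:
        # Skipped dependencies may have no "vulns" key at all.
        for vuln in dep.get("vulns", []):
            findings.append(
                (
                    dep.get("name"),
                    dep.get("version"),
                    vuln.get("id"),
                    tuple(vuln.get("fix_versions", [])),
                )
            )
    return findings
```

A typical use would be to run pixi run pip-audit -f json > audit.json and then pass the file contents to summarize_audit. An empty list means no vulnerable packages were reported.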