Thank you for your interest in contributing to PullData! This document provides guidelines and instructions for contributing.
- Fork and clone the repository:
git clone https://github.com/YOUR_USERNAME/pulldata.git
cd pulldata
- Create a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install in development mode with dev dependencies:
pip install -e ".[dev]"
- Install pre-commit hooks:
pre-commit install

We use several tools to maintain code quality:
- black: Code formatting (line length: 100)
- ruff: Linting and import sorting
- mypy: Type checking
Run all checks:
# Format code
black pulldata/
# Lint
ruff check pulldata/
# Type check
mypy pulldata/
# Or run all at once
pre-commit run --all-files

We use pytest for testing. All new features should include tests.
# Run all tests
pytest
# Run with coverage
pytest --cov=pulldata --cov-report=html
# Run specific test file
pytest tests/test_parser.py
# Run tests matching a pattern
pytest -k "test_pdf"

- Place tests in the tests/ directory
- Name test files test_*.py
- Name test functions test_*
- Use fixtures for common setup
- Aim for >80% code coverage
Example:
import pytest
from pulldata.parsing import PDFParser

@pytest.fixture
def sample_pdf():
    # Path to a sample document shared by the tests below
    return "tests/fixtures/sample.pdf"

def test_pdf_parsing(sample_pdf):
    parser = PDFParser()
    result = parser.parse(sample_pdf)
    # The parser should return extracted text and at least one page
    assert result.text is not None
    assert len(result.pages) > 0

- main: Stable, production-ready code
- develop: Integration branch for features
- feature/*: New features
- fix/*: Bug fixes
- docs/*: Documentation updates
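The branch naming scheme above can be checked mechanically, for example in a local hook. The following is a minimal sketch, not part of PullData itself: the names `BRANCH_PATTERNS` and `is_valid_branch` are hypothetical, and the patterns are simply the branch names listed above.

```python
from fnmatch import fnmatchcase

# Branch patterns taken from the list above; the helper itself is hypothetical.
BRANCH_PATTERNS = ("main", "develop", "feature/*", "fix/*", "docs/*")


def is_valid_branch(name: str) -> bool:
    """Return True if `name` matches one of the documented branch patterns."""
    # fnmatchcase does case-sensitive shell-style matching; "*" matches any
    # run of characters, including an empty one.
    return any(fnmatchcase(name, pattern) for pattern in BRANCH_PATTERNS)


if __name__ == "__main__":
    for branch in ("feature/docx-tables", "develop", "hotfix/urgent"):
        print(branch, is_valid_branch(branch))
```

Because `*` also matches an empty string, a bare `feature/` would pass this check; a stricter rule would need an extra length test.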
Follow the conventional commits format:
<type>(<scope>): <subject>
<body>
<footer>
Types:
- feat: New feature
- fix: Bug fix
- docs: Documentation changes
- style: Code style changes (formatting, etc.)
- refactor: Code refactoring
- test: Adding or updating tests
- chore: Maintenance tasks
Examples:
feat(parser): add DOCX table extraction support
fix(storage): resolve race condition in FAISS index updates
docs(readme): update installation instructions
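To make the header format concrete, here is a small sketch of a check for the first line of a commit message. It is an illustration only: `COMMIT_TYPES`, `HEADER_RE`, and `is_valid_header` are hypothetical names, and treating the scope as optional is an assumption borrowed from the Conventional Commits convention rather than something this guide states.

```python
import re

# Commit types listed in this guide.
COMMIT_TYPES = ("feat", "fix", "docs", "style", "refactor", "test", "chore")

# Header shape: <type>(<scope>): <subject>
# Assumption: the "(<scope>)" part may be omitted.
HEADER_RE = re.compile(
    r"^(?P<type>" + "|".join(COMMIT_TYPES) + r")"
    r"(?:\((?P<scope>[^()\s]+)\))?"
    r": (?P<subject>\S.*)$"
)


def is_valid_header(message: str) -> bool:
    """Return True if the first line of `message` matches the header format."""
    lines = message.splitlines()
    return bool(lines) and HEADER_RE.match(lines[0]) is not None


if __name__ == "__main__":
    print(is_valid_header("feat(parser): add DOCX table extraction support"))
    print(is_valid_header("Added some stuff"))
```

Only the header is checked; the body and footer are free-form text and are left alone.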
- Create a feature branch from develop
- Add tests for new functionality
- Update documentation as needed
- Ensure all tests pass and code is formatted
- Push your branch and create a pull request
- Respond to review feedback
PR Title Format:
[Type] Brief description of changes
PR Description Template:
## Description
Brief description of what this PR does.
## Changes
- Change 1
- Change 2
## Testing
How to test these changes.
## Related Issues
Closes #123

All PRs require:
- At least one approval from a maintainer
- All CI checks passing
- No merge conflicts
- Up-to-date with target branch

Areas where contributions are welcome:
- Additional output formats (CSV, HTML, etc.)
- Performance optimizations
- Documentation improvements
- Bug fixes
- Test coverage improvements
- Streaming generation for large reports
- Multi-modal support (images, charts)
- Advanced table processing
- Custom embedding models
- Integration with cloud storage
- Web UI
- Tutorial notebooks
- Video guides
- API documentation
- Architecture deep-dives
- Performance tuning guides
When contributing, keep in mind:
- Target hardware: Tesla P4 (8GB VRAM)
- Optimize for speed and memory efficiency
- Use batch processing where possible
- Cache expensive operations
- Profile before optimizing
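The guidelines above (batch, cache, profile first) can be sketched with the standard library alone. Everything here is illustrative: `batched`, `expensive_lookup`, `process`, and `profile` are hypothetical names, and `expensive_lookup` is a stand-in for a genuinely costly call, not a PullData function.

```python
import cProfile
import io
import pstats
from functools import lru_cache
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


@lru_cache(maxsize=1024)
def expensive_lookup(key: str) -> int:
    """Hypothetical stand-in for a costly call; results are cached per key."""
    return sum(ord(ch) for ch in key)


def process(items: Sequence[str], batch_size: int = 32) -> List[int]:
    """Process items batch by batch, reusing cached results for repeated keys."""
    results: List[int] = []
    for batch in batched(items, batch_size):
        results.extend(expensive_lookup(item) for item in batch)
    return results


def profile(func, *args, **kwargs) -> str:
    """Run `func` under cProfile and return a report of the top five entries
    sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
    return stream.getvalue()


if __name__ == "__main__":
    print(profile(process, ["alpha", "beta", "alpha"] * 1000, batch_size=64))
```

Running the profile before changing anything shows where time is actually spent, so caching or batching is applied only where the report justifies it.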

To get help:
- Open an issue for questions
- Join discussions on GitHub Discussions
- Check existing documentation
By contributing, you agree that your contributions will be licensed under the MIT License.
Contributors will be recognized in:
- README.md contributors section
- Release notes
- Git commit history
Thank you for contributing to PullData!