Semi‑automatic structured data extraction from image collections
DataGatherer is a full‑stack application that helps turn sets of images into structured, queryable information. It provides tooling to upload and organize images, derive relationships, and generate a user–hashtag matrix showing how frequently each user has used each tag. The goal is to accelerate manual curation while preserving accuracy.
- Semi‑automatic data extraction workflow from uploaded images
- User ↔ hashtag frequency matrix generation
- OCR & suggestion services (extensible service layer)
- Django admin for direct entity management
- Index computation & similarity (Annoy index present in repo)
- Containerized deployment (Docker + docker compose)
- Reverse‑proxied via Nginx (scalable edge layer)
┌────────────┐ ┌──────────────┐ ┌──────────────┐
│ React │ → │ Django API │ → │ Services / │
│ Frontend │ │ (job/storage│ │ Compute/Idx │
└────────────┘ │ functions) │ └──────────────┘
│ │ │ │
▼ ▼ ▼ ▼
Nginx ← docker network / persistence / media volume
Core backend apps:
- functions_api – common functions & business logic
- job_api – job orchestration & processing logic
- storage_api – models/serialization for stored entities
- services – OCR, suggestion & index computation modules
| Layer | Technology |
|---|---|
| Frontend | React (JS) |
| Backend | Django |
| Services | Python modules (OCR / Annoy) |
| Proxy | Nginx |
| Packaging | Docker / docker compose |
| Data Store | Not specified (likely Django's default database, SQLite; configure externally if needed) |
| | user_1 | user_2 | ... | user_n |
|---|---|---|---|---|
| hashtag_1 | weight_11 | | | |
| hashtag_2 | | | | |
| hashtag_n | | | | weight_nn |
Definition: weight(user, hashtag) = number of posts by that user containing the hashtag.
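The definition above can be sketched in a few lines of Python. This is a minimal illustration, not DataGatherer's actual implementation: the function name `build_user_hashtag_matrix` and the input shape (a list of `(user, hashtags)` pairs, one per post) are assumptions made for the example.

```python
from collections import Counter


def build_user_hashtag_matrix(posts):
    """Count, for each (user, hashtag) pair, the posts by that user containing the hashtag.

    `posts` is an iterable of (user, hashtags) pairs, one per post. This input
    shape is an assumption for illustration, not the project's data model.
    """
    weights = Counter()
    for user, hashtags in posts:
        # Deduplicate within a post so a repeated tag counts once per post.
        for tag in set(hashtags):
            weights[(user, tag)] += 1

    users = sorted({user for user, _ in weights})
    tags = sorted({tag for _, tag in weights})
    # Rows are hashtags and columns are users, matching the table layout.
    matrix = [[weights[(user, tag)] for user in users] for tag in tags]
    return users, tags, matrix


posts = [
    ("alice", ["#cats", "#dogs"]),
    ("alice", ["#cats", "#cats"]),
    ("bob", ["#dogs"]),
]
users, tags, matrix = build_user_hashtag_matrix(posts)
```

Here `matrix` has one row per hashtag and one column per user. Pairs that never occur get a weight of 0, and users who never used a hashtag do not appear as columns in this sketch.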
This application is designed for local use only.
Please ensure:
- Port 8080 is not exposed publicly (restrict firewall / router).
- Default admin credentials are changed if ever network‑accessible.
- Uploaded media does not contain sensitive PII unless you harden access.
Important: No built‑in authentication hardening or role‑based access beyond Django’s defaults. You are responsible for securing deployments.
Ensure you have:
- Docker & Docker Compose (v2+)
- (Optional) Python 3.11+ if running backend outside containers
git clone https://github.com/ViDem02/DataGatherer.git
cd DataGatherer
docker compose build
docker compose up

Then open: http://127.0.0.1:8080
URL: http://127.0.0.1:8080/api/admin
Default credentials (auto‑provisioned on first run):
username: admin
password: admin
Change these immediately for any non‑isolated environment.
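One way to pick a replacement password is to generate it with Python's standard `secrets` module. The helper name `generate_admin_password` is hypothetical and is not part of the repository.

```python
import secrets


def generate_admin_password(num_bytes=24):
    """Return a random URL-safe password; 24 random bytes encode to 32 characters."""
    return secrets.token_urlsafe(num_bytes)


new_password = generate_admin_password()
print(new_password)
```

Assuming the project uses Django's standard auth app, the generated value can then be set with Django's built-in `python manage.py changepassword admin` command, run inside the backend container.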
- Backend: Unit tests cover critical data operations & matrix accuracy.
- Frontend: Currently not automatically tested (roadmap item).
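As an illustration of what a matrix-accuracy test might look like, the sketch below uses plain `unittest` so that it stays self-contained. The function `count_user_hashtags` is a hypothetical stand-in for the project's own matrix logic, and the real test suite may be organized differently.

```python
import unittest
from collections import Counter


def count_user_hashtags(posts):
    """Hypothetical stand-in: count posts per (user, hashtag) pair."""
    counts = Counter()
    for user, hashtags in posts:
        # A tag repeated inside one post still counts as a single post.
        for tag in set(hashtags):
            counts[(user, tag)] += 1
    return counts


class MatrixAccuracyTest(unittest.TestCase):
    def test_counts_posts_not_occurrences(self):
        posts = [("alice", ["#cats", "#cats"]), ("alice", ["#cats"])]
        self.assertEqual(count_user_hashtags(posts)[("alice", "#cats")], 2)

    def test_unused_pair_is_zero(self):
        posts = [("alice", ["#cats"])]
        self.assertEqual(count_user_hashtags(posts)[("bob", "#cats")], 0)
```

Django's test runner also discovers `unittest.TestCase` subclasses in files named `test*.py`, so a test written this way would run under the same `manage.py test` command.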
Run backend tests (inside the backend container or a Python venv):
python manage.py test

backend/
  data_gatherer/ # Django project root
  functions_api/ # Functions & shared logic
  job_api/ # Job orchestration
  storage_api/ # Persistence & serializers
  services/ # OCR, suggestion, indexing
frontend/ # React SPA
nginx/ # Reverse proxy config
docs/ # Diagrams & documentation
Add new extraction or analytics services by placing Python modules in services/ and wiring tasks via job_api logic. Frontend queries can be added under src/queries/ and consumed in state slices (e.g. keyboardSlice.js).
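As a rough sketch of what a new module in services/ could look like, the example below pulls hashtags out of OCR text. The file name `services/hashtag_extractor.py` and the function `extract_hashtags` are hypothetical. How the function would be wired into job_api depends on the project's own job logic and is not shown.

```python
import re

# Hypothetical module, e.g. services/hashtag_extractor.py. The file and
# function names are illustrative and do not come from the repository.
HASHTAG_PATTERN = re.compile(r"#(\w+)")


def extract_hashtags(ocr_text):
    """Return the unique hashtags in OCR output, lowercased, in first-seen order."""
    seen = []
    for match in HASHTAG_PATTERN.finditer(ocr_text):
        tag = match.group(1).lower()
        if tag not in seen:
            seen.append(tag)
    return seen


sample = "Great day! #Sunset #beach #sunset"
hashtags = extract_hashtags(sample)
```

Keeping the module a pure function of its input text makes it easy to unit test independently of the job pipeline.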
- Frontend automated testing (Jest + React Testing Library)
- Auth hardening (JWT / session rotation)
- Role‑based access & audit logging
- Pluggable storage backends (PostgreSQL, S3 for media)
- Advanced similarity & clustering visualizations
- Export matrix in multiple formats (CSV / Parquet)
Currently oriented toward personal / portfolio use. If you wish to contribute:
- Fork & create a feature branch
- Keep changes focused & documented
- Open a PR describing intent & test impact
License not specified. If publishing publicly, choose one (MIT / Apache‑2.0 / Proprietary) and update this section.
This repo demonstrates architectural and data processing approaches and is not production‑hardened. Use at your own risk.
Feel free to reach out or open issues for discussion, improvements, or clarifications.