12 changes: 12 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,17 @@
# Changelog

## [2025-10-16T21:44:46-04:00 (America/New_York)]
### Added
- Documented a synthetic dataset ingestion workflow in `docs/retrieval.md` (including sample loader code) so benchmarking
runs can hydrate graph drivers without recomputing embeddings.

### Changed
- Expanded operations, setup, and environment guides (`docs/operations.md`, `SETUP.md`, `ENVIRONMENT_NEEDS.md`,
`NEEDED_FOR_TESTING.md`) with batching/verification tips for loading generated JSONL/CSV corpora.
- Updated core documentation and planning artifacts (`README.md`, `PROJECT.md`, `PLAN.md`, `ROADMAP.md`, `SOT.md`,
`RECOMMENDATIONS.md`, `PLANNING_THOUGHTS.md`, `ISSUES.md`, `RESUME_NOTES.md`, `TODO.md`) to reference the ingestion workflow
and capture the follow-up automation task.

## [2025-10-16T20:39:06-04:00 (America/New_York)]
### Added
- Added live integration coverage for Memgraph, Neo4j, and Redis via `meshmind/tests/test_integration_live.py` and configured
4 changes: 3 additions & 1 deletion ENVIRONMENT_NEEDS.md
@@ -24,7 +24,9 @@
consolidation heuristics and pagination under load. The new
`scripts/generate_synthetic_dataset.py` utility produces JSONL/CSV corpora
(defaults: 10k memories, 20k triplets, 384-dim embeddings) that can be copied to
shared storage for on-demand benchmarking. Pair the shared datasets with the
ingestion workflow documented in `docs/retrieval.md` so operators can seed
environments quickly without recomputing embeddings.
- Maintain outbound package download access to PyPI and vendor repositories; this
session confirmed package installation works when the network is open, and future
sessions need the same capability to refresh locks or install new optional
2 changes: 1 addition & 1 deletion ISSUES.md
@@ -37,5 +37,5 @@
## Low Priority / Nice to Have
- [x] Offer alternative storage backends (in-memory driver, SQLite, etc.) for easier local development.
- [x] Provide an administrative dashboard or CLI commands for listing namespaces, counts, and maintenance statistics (CLI admin subcommands now expose predicates, telemetry, and graph checks).
- [ ] Publish onboarding guides and troubleshooting FAQs for contributors (synthetic dataset ingestion docs landed in `docs/retrieval.md`, but a broader newcomer guide is still pending).
- [ ] Explore plugin registration for embeddings and retrieval strategies to reduce manual wiring.
2 changes: 1 addition & 1 deletion NEEDED_FOR_TESTING.md
@@ -69,7 +69,7 @@
external services are unavailable.
- Use `meshmind/testing` fakes (`FakeMemgraphDriver`, `FakeRedisBroker`, `FakeEmbeddingEncoder`, `FakeLLMClient`) in tests or demos to eliminate external infrastructure requirements. Integration suites marked with `@pytest.mark.integration` exercise live Memgraph/Neo4j/Redis instances and expect the docker stack to be running.
- Invoke `meshmind admin predicates` and `meshmind admin maintenance --max-attempts <n> --base-delay <seconds> --run <task>` during local runs to inspect predicate registries, telemetry, and tune maintenance retries without external services.
- Use the benchmarking utilities in `scripts/` (`evaluate_importance.py`, `consolidation_benchmark.py`, `benchmark_pagination.py`) to validate heuristics and driver performance offline before connecting to live infrastructure. Generate large corpora with `scripts/generate_synthetic_dataset.py` when you need ≥10k memories for stress tests and follow the ingestion workflow in `docs/retrieval.md` to load them into the graph drivers used by your tests.
- Seed demo data as needed using the `examples/extract_preprocess_store_example.py` script after configuring environment
variables.
- Create a `.env` file storing the environment variables above for consistent local configuration.
3 changes: 2 additions & 1 deletion PLAN.md
@@ -20,7 +20,8 @@
2. **Maintenance Tasks** – Tasks emit telemetry, persist consolidation/compression results, and now retry conflicting writes with
configurable exponential backoff (`MAINTENANCE_MAX_ATTEMPTS`, `MAINTENANCE_BASE_DELAY_SECONDS`). Synthetic benchmark scripts,
the new `scripts/generate_synthetic_dataset.py`, and integration tests against live Memgraph/Neo4j validate behaviour on larger
workloads. Fresh documentation in `docs/retrieval.md` and `docs/operations.md` now describes how to ingest those synthetic datasets
into the target backend; next, replay production-like datasets to tune thresholds.
3. **Importance Scoring Improvements** – Heuristic scoring is live, records distribution metrics via telemetry, and ships with
`scripts/evaluate_importance.py` for synthetic/offline evaluation. Next: incorporate real feedback loops or LLM-assisted
ranking to tune weights over time.
2 changes: 1 addition & 1 deletion PLANNING_THOUGHTS.md
@@ -14,7 +14,7 @@
- **Pydantic Model Policy** – Follow the documented plan (target Pydantic 2.12+, refresh locks when 3.13 wheels land, record migration guidance) to avoid resurrecting compatibility shims.

## Upcoming Research
- Benchmark consolidation heuristics on synthetic datasets representing customer scale and capture telemetry snapshots (seed data via `scripts/generate_synthetic_dataset.py` and load it using the ingestion workflow documented in `docs/retrieval.md`).
- Compare graph query latency across in-memory, SQLite, Memgraph, and Neo4j drivers when using pagination and filtering.
- Evaluate rerank quality across LLM providers using a labelled evaluation set to determine optimal default models.
- Investigate options for secure secret storage (e.g., Vault, AWS Secrets Manager) to standardise API key management.
2 changes: 1 addition & 1 deletion PROJECT.md
@@ -78,7 +78,7 @@
- Docker Compose now provisions Memgraph, Neo4j, and Redis; integration-specific stacks (including the Celery worker) live under
`meshmind/tests/docker/`. `pytest -m integration` exercises live services once the stack is running. See `ENVIRONMENT_NEEDS.md`
and `SETUP.md` for enabling optional services locally.
- `scripts/generate_synthetic_dataset.py` produces large JSONL/CSV corpora (defaults: 10k memories, 20k triplets, 384-dim embeddings) to stress retrieval and consolidation flows prior to ingesting real datasets. The accompanying ingestion workflow documented in `docs/retrieval.md` shows how to hydrate graph drivers without recomputing embeddings.

## Roadmap Highlights
- Push graph-backed retrieval deeper into the drivers (vector similarity, structured filters) so the new server-side filtering/pagination evolves into full backend-native ranking.
3 changes: 3 additions & 0 deletions README.md
@@ -203,6 +203,9 @@ Tasks instantiate the driver lazily, emit structured logs/metrics, and persist c
- **Synthetic dataset generation** – `scripts/generate_synthetic_dataset.py` creates large JSONL/CSV corpora of
memories/triplets (defaults: 10k memories, 20k triplets, 384-dim embeddings) so you can stress retrieval, consolidation,
and integration flows before ingesting real data.
- **Synthetic dataset ingestion** – Follow the workflow documented in `docs/retrieval.md` to load the generated JSONL/CSV
payloads into MeshMind via the Python client. The operations guide walks through batching tips and post-ingestion
verification so benchmark runs start from a consistent baseline.
- **Importance scoring** – `scripts/evaluate_importance.py` runs the heuristic against JSON or synthetic datasets and reports
descriptive statistics for quick regression checks.
- **Consolidation throughput** – `scripts/consolidation_benchmark.py` generates synthetic workloads to measure batch merging
3 changes: 2 additions & 1 deletion RECOMMENDATIONS.md
@@ -30,7 +30,8 @@

## Documentation & Onboarding
- Keep `README.md`, `SOT.md`, `docs/`, and onboarding guides synchronized with each release; document rerank, retrieval, and
registry flows with diagrams when possible. The new synthetic dataset ingestion workflow in `docs/retrieval.md` should be
incorporated into future onboarding materials.
- Maintain the troubleshooting section for optional tooling (ruff, pyright, typeguard, toml-sort, yamllint) now referenced in
the Makefile and expand it as new developer utilities are introduced. Keep `SETUP.md` synchronized when dependencies change.
- Provide walkthroughs for configuring LLM reranking, including sample prompts and response expectations.
3 changes: 2 additions & 1 deletion RESUME_NOTES.md
@@ -13,6 +13,7 @@
- Added live integration coverage (`meshmind/tests/test_integration_live.py`) for Memgraph, Neo4j, and Redis, introduced a pytest marker configuration, and documented the workflow across README/SETUP/docs.
- Generated a fresh `uv.lock`, pinned `.python-version` to 3.12, and updated install docs to standardise on `uv sync --all-extras`.
- Created `scripts/generate_synthetic_dataset.py` for large JSONL/CSV corpora and referenced it across benchmarking docs.
- Documented the synthetic dataset ingestion workflow across `docs/retrieval.md`, `docs/operations.md`, README, and supporting planning guides so benchmarks can load corpora without recomputing embeddings.
- Updated documentation and planning collateral (README.md, SETUP.md, docs/development.md, docs/testing.md, docs/operations.md, PROJECT.md, PLAN.md, RECOMMENDATIONS.md, ROADMAP.md, ENVIRONMENT_NEEDS.md, NEEDED_FOR_TESTING.md, SOT.md, PLANNING_THOUGHTS.md, DUMMIES.md, TODO.md, RESUME_NOTES.md) to reflect the integration workflow, dataset generation, and the new Pydantic policy.

## Environment State
@@ -26,5 +27,5 @@
1. Address remaining `TODO.md` priority items (backend-native vector similarity, Celery worker integration, grpcurl end-to-end tests) now that graph services are accessible locally.
2. Automate the integration suite in CI and capture resource requirements for shared infrastructure.
3. Prepare grpcurl-based smoke tests for `meshmind serve-grpc` and plan protobuf client packaging once integration coverage extends beyond the Python stub.
4. Feed findings from large synthetic datasets into retry/backoff defaults and document recommended values in `ENVIRONMENT_NEEDS.md`, validating the new ingestion workflow as part of those runs.
5. Continue tracking shim retirements in `DUMMIES.md` and follow the cleanup plan in `CLEANUP.md` so remaining fakes can be removed when infrastructure allows.
2 changes: 1 addition & 1 deletion ROADMAP.md
@@ -7,7 +7,7 @@

## Near-Term (0–2 Weeks)
- Automate the new integration suite (`pytest -m integration`) in CI so Memgraph/Neo4j/Redis regressions fail fast.
- Finalize maintenance write policies by implementing retry/backoff semantics and measuring consolidation accuracy against representative datasets (now aided by `scripts/generate_synthetic_dataset.py` and the documented ingestion workflow in `docs/retrieval.md`).
- Publish ROADMAP and PLANNING_THOUGHTS artifacts, and seed the `research/` folder with competitive analysis to ground prioritization discussions.
- Expand automated smoke tests for REST `/memories/counts`, CLI `meshmind admin counts`, and provisioning scripts to ensure guardrails stay trustworthy.
- Capture outstanding shim retirement work (FastAPI tests now live; continue tracking FakeLLM/Fake drivers) in CLEANUP.md with precise acceptance criteria for each removal.
4 changes: 3 additions & 1 deletion SETUP.md
@@ -80,7 +80,9 @@ docker compose -f meshmind/tests/docker/memgraph.yml up -d
```

> Need synthetic load? Run `python scripts/generate_synthetic_dataset.py build/datasets/benchmark`
> to seed JSONL/CSV fixtures before loading them into Memgraph/Neo4j for stress tests. Follow the
> ingestion workflow in `docs/retrieval.md` when copying the fixtures into your graph backend so
> benchmarks reuse the same namespace/layout.

### 3.2 Cleaning up

2 changes: 1 addition & 1 deletion SOT.md
@@ -28,7 +28,7 @@ Supporting assets:
- `SETUP.md`: End-to-end provisioning instructions covering Python deps, environment variables, and Compose workflows.
- `run/install_setup.sh`, `run/maintenance_setup.sh`: Automation scripts for provisioning fresh environments and refreshing cached workspaces.
- `scripts/evaluate_importance.py`, `scripts/consolidation_benchmark.py`, `scripts/benchmark_pagination.py`: Evaluation and benchmarking tools for importance heuristics, consolidation throughput, and driver pagination performance.
- `scripts/generate_synthetic_dataset.py`: Produces large JSONL/CSV corpora (defaults: 10k memories, 20k triplets, 384-dim embeddings) for integration and benchmark scenarios. See `docs/retrieval.md` for the recommended ingestion workflow that stores the generated payloads without recomputing embeddings.
- `.github/workflows/ci.yml`: GitHub Actions workflow running linting/formatting checks and pytest.
- `pyproject.toml`: Project metadata and dependency list (pins Python `>=3.11,<3.13`; see compatibility notes in `ISSUES.md`).
- Documentation (`PROJECT.md`, `PLAN.md`, `SOT.md`, `README.md`, etc.) describing the system and roadmap.
3 changes: 2 additions & 1 deletion TODO.md
@@ -73,6 +73,7 @@
- [x] Add packaging tests to guarantee `meshmind/protos/memory_service.proto` ships with the distribution and exposes the expected service definition.
- [x] Document runtime and operational guidance for the gRPC server across README, SETUP, `docs/api.md`, and `docs/operations.md`.
- [x] Add Makefile and CI targets (`make protos`, `make protos-check`) plus scripts to regenerate/verify protobuf bindings, failing CI when drift occurs.
- [x] Document ingestion workflows for the synthetic dataset generator across `docs/retrieval.md` and operations guides so benchmarking instructions stay cohesive.
- [x] Replace the REST stub with the concrete FastAPI application and migrate smoke tests to `fastapi.testclient.TestClient`.
- [x] Remove Celery dummy fallbacks by requiring the real app/beat imports and keeping docker-compose stacks in sync.
- [x] Add a `serve-grpc` CLI subcommand and verify it delegates to the runtime helpers.
@@ -95,9 +96,9 @@
- [ ] Add integration tests that spin up `meshmind serve-grpc` and exercise ingestion/search via grpcurl to complement the unit-level coverage (blocked until network-accessible infrastructure is ready).
- [ ] Publish protobuf-generated client artifacts (Python wheel or language-neutral bundles) so external services can consume the API once infrastructure is available.
- [ ] Automate the live integration suite (`pytest -m integration`) in CI so Memgraph/Neo4j/Redis regressions fail fast.
- [ ] Document the retired REST/Celery shims in release notes and communicate migration steps to downstream integrators.
- [ ] Capture gRPC CLI usage examples (including docker-compose orchestration) in `docs/api.md` and `docs/operations.md` once integration smoke tests complete.
- [ ] Automate ingestion of synthetic dataset payloads (JSONL/CSV) via a CLI or script wrapper so benchmarking runs do not require custom snippets.

## Recommended Waiting for Approval Tasks

Expand Down
5 changes: 4 additions & 1 deletion docs/operations.md
@@ -72,7 +72,10 @@ This guide covers operational tasks for MeshMind deployments.

- `make benchmarks` runs the synthetic benchmarking scripts (`scripts/evaluate_importance.py`, `scripts/consolidation_benchmark.py`, `scripts/benchmark_pagination.py`) with fast defaults and stores JSON summaries in `build/benchmarks/`.
- Override script flags to stress specific backends (for example `--backend neo4j` or higher iteration counts) once live services are provisioned, and capture findings in `FINDINGS.md` / `ENVIRONMENT_NEEDS.md`.
- Use `scripts/generate_synthetic_dataset.py` to produce large JSONL/CSV corpora (defaults: 10k memories, 20k triplets, 384-dim embeddings) before loading data into Memgraph/Neo4j for stress testing. Pair the generator with the ingestion snippet from `docs/retrieval.md` to hydrate graph backends quickly without recomputing embeddings. When loading via the MeshMind client:
- Batch writes (for example in chunks of 500 memories/triplets) to keep request payload sizes manageable.
- Align namespaces across the JSONL/CSV payloads and retrieval queries so pagination filters remain effective.
- Call `meshmind.cli.admin counts --namespace <ns>` after ingestion to confirm memory distribution before executing benchmarks.
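The batching tip above can be sketched as a small helper. The `chunked` generator is self-contained; the commented `MeshMind.store_memories` usage is illustrative only, and the client names are assumptions rather than a confirmed API:

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], size: int = 500) -> Iterator[list[T]]:
    """Yield successive batches of at most `size` items."""
    iterator = iter(items)
    # list(islice(...)) returns [] when the iterator is exhausted, ending the loop.
    while batch := list(islice(iterator, size)):
        yield batch


# Hypothetical usage with the MeshMind client (names are assumptions):
# mm = MeshMind()
# for batch in chunked(memories, size=500):
#     mm.store_memories(batch)
```

Keeping the chunk size around 500 balances request payload size against round-trip overhead; tune it to your backend's limits.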

## Deployment Considerations

55 changes: 55 additions & 0 deletions docs/retrieval.md
@@ -50,6 +50,61 @@ batch processing patterns.
- `rerank_model` / `rerank_endpoint`: explicit overrides that take precedence over environment defaults when reranking.
- `fields`: optional mapping for textual searches (regex, exact, fuzzy) to target metadata keys.

## Synthetic Dataset Ingestion Workflow

Large-scale retrieval experiments rely on synthetic corpora so benchmarks stay reproducible. Use the following workflow to
seed data generated by `scripts/generate_synthetic_dataset.py` into your target backend:

1. Generate the corpus:

```bash
python scripts/generate_synthetic_dataset.py build/datasets/benchmark \
    --memories 10000 \
    --triplets 20000 \
    --namespace benchmark
```

This produces `memories.jsonl` (memory payloads) and `triplets.csv` (relationships) under `build/datasets/benchmark/`.

2. Load memories with a short Python helper. The snippet below deserialises the JSONL payload and stores the objects directly
through the MeshMind client:

```python
from __future__ import annotations

from pathlib import Path

from meshmind.client import MeshMind
from meshmind.core.types import Memory


def load_memories(path: Path, namespace: str, batch_size: int = 500) -> None:
    mm = MeshMind()
    batch: list[Memory] = []
    with path.open("r", encoding="utf-8") as handle:
        for line in handle:
            # Pydantic v2 API (the project targets Pydantic 2.12+).
            payload = Memory.model_validate_json(line)
            payload.namespace = namespace
            batch.append(payload)
            if len(batch) >= batch_size:
                mm.store_memories(list(batch))
                batch.clear()
    if batch:
        mm.store_memories(list(batch))


load_memories(Path("build/datasets/benchmark/memories.jsonl"), namespace="benchmark")
```

3. Persist relationships in a similar fashion using `MeshMind.store_triplets` and the generated CSV payload (for example, with
`csv.DictReader`).
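As a sketch of step 3, the helper below parses the generated CSV with `csv.DictReader` and injects a default `entity_label`, since the review note below flags that the generated CSV may omit that column. The default label value and the commented `store_triplets` call are assumptions, not confirmed behaviour:

```python
import csv
import json
from pathlib import Path


def read_triplets(path: Path, entity_label: str = "Entity") -> list[dict]:
    """Parse triplets.csv rows into dicts ready for Triplet construction.

    The generated CSV carries subject, predicate, object, namespace, and
    metadata columns; an entity_label is injected with a default because
    the generator may omit it. The default label here is an assumption.
    """
    rows: list[dict] = []
    with path.open("r", encoding="utf-8", newline="") as handle:
        for row in csv.DictReader(handle):
            row.setdefault("entity_label", entity_label)
            if row.get("metadata"):
                # metadata is serialised JSON inside the CSV cell.
                row["metadata"] = json.loads(row["metadata"])
            rows.append(row)
    return rows


# Hypothetical follow-up (client and model names are assumptions):
# mm.store_triplets([Triplet(**row) for row in read_triplets(csv_path)])
```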

Comment on lines +99 to +102
P1: Document required entity label when loading synthetic triplets

Step 3 describes ingesting `triplets.csv` with `MeshMind.store_triplets`, but the CSV produced by `scripts/generate_synthetic_dataset.py` only contains `subject,predicate,object,namespace,metadata`. `meshmind.core.types.Triplet` also requires an `entity_label` (line 40), so constructing `Triplet(**row)` from the CSV as implied raises a validation error. The guide needs to either add `entity_label` to the generated data or instruct readers which label to inject before calling `store_triplets`.

Owner (author) replied:
@codex take care of this, and check that `scripts/generate_synthetic_dataset.py` also generates the `.csv` rows so that each contains an `entity_label`; if it doesn't already, implement it. Make sure the guide and any other references are consistent.

Testing

- ⚠️ `python scripts/generate_synthetic_dataset.py build/tmp_dataset --memories 5 --triplets 5 --embedding-dim 8 --namespace demo` (fails: `ModuleNotFoundError: No module named 'numpy'`)

4. Run retrieval queries (`meshmind search`, REST/gRPC calls, or the MeshMind Python client) targeting the `benchmark`
namespace and optional `entity_labels` to exercise vector, hybrid, and metadata filters against the seeded dataset.

The same JSONL/CSV payloads can be adapted for bulk ingestion APIs exposed by the REST/gRPC services if you prefer remote
loading. Make sure to keep namespaces aligned so pagination and label filters remain effective across benchmarking runs.

## Extending Retrieval

1. Add a new module under `meshmind/retrieval` with a function that accepts `(query, memories, **kwargs)`.