Implementation of the KarmaLego time-interval pattern mining pipeline with end-to-end data ingestion, pattern discovery, and patient-level application.
Based on the paper:
Moskovitch, Robert, and Yuval Shahar. "Temporal Patterns Discovery from Multivariate Time Series via Temporal Abstraction and Time-Interval Mining."
(See original for theoretical grounding.)
This implementation is designed to be used as a temporal analysis and feature extraction tool in my thesis.
KarmaLego mines frequent time-interval relation patterns (TIRPs) by combining two key stages:
KarmaLego first scans each patient's timeline to identify pairwise Allen relations (e.g., before, overlaps, meets) between intervals:
Each cell in the matrix shows the temporal relation between intervals (e.g., A¹ o B¹ = A¹ overlaps B¹). These relations become the building blocks of complex temporal patterns.
Patterns are built incrementally by traversing a tree of symbol and relation extensions, starting from frequent 1-intervals (K=1) and growing to longer TIRPs (K=2,3,...). Only frequent patterns are expanded (Apriori pruning), and relation consistency is ensured using transitivity rules.
Practical flow in this implementation. The pipeline enumerates singletons and all frequent pairs (k=2) in the Karma stage. The Lego stage then skips extending singletons and starts from k≥2, extending patterns to length k+1. This avoids regenerating pairs (and their support checks) a second time. I also apply CSAC (see below), which anchors each extension on the actual parent embeddings inside each entity, ensuring only consistent child embeddings are considered.
This structure enables efficient discovery of high-order, temporally consistent patterns without exhaustively searching all combinations.
KarmaLego supports four levels of temporal relation granularity: 2, 3, 5, and 7 relations. Each coarser level merges relations from the 7-relation set (the non-inverse half of Allen's algebra) for performance. Relations are defined between interval pairs (A, B) using start and end times.
2-relation set:

| Relation | Explanation |
|---|---|
| 'p' | Precede: merges before/meet/overlap (A < B, A m B, A o B) |
| 'c' | Contain: merges contain/started-by/finished-by/equal (A c B, A s B, A f B, A = B) |
3-relation set:

| Relation | Explanation |
|---|---|
| '<' | A ends before B starts (A < B) |
| 'o' | A overlaps B: A starts before B, A ends during B, B ends after A (A o B) |
| 'c' | A contains B: A starts before B and ends after B (A c B) |
5-relation set:

| Relation | Explanation |
|---|---|
| '<' | A ends before B starts (A < B) |
| 'o' | A overlaps B: A starts before B, A ends during B, B ends after A (A o B) |
| 'c' | A contains B: A starts before B and ends after B (A c B) |
| 'f' | A is finished-by B: A starts before B, A and B end at the same time (A f B) |
| 's' | A is started-by B: A and B start at the same time, A ends after B (A s B) |
7-relation set:

| Relation | Explanation |
|---|---|
| '<' | A ends before B starts (A < B) |
| 'm' | A meets B: A ends exactly when B starts (A m B) |
| 'o' | A overlaps B: A starts before B, A ends during B, B ends after A (A o B) |
| 'c' | A contains B: A starts before B and ends after B (A c B) |
| 'f' | A is finished-by B: A starts before B, A and B end at the same time (A f B) |
| 's' | A is started-by B: A and B start at the same time, A ends after B (A s B) |
| '=' | A equals B: A and B have identical start and end times (A = B) |
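As a concrete illustration, the 7-relation classification can be sketched as a standalone function. This is a simplified sketch only (the function name `classify_relation` and the exact tolerance handling are illustrative; the implementation's actual tables live in `core/relation_table.py`):

```python
def classify_relation(a_start, a_end, b_start, b_end, eps=0):
    """Classify the 7-relation type of interval A towards interval B.

    Illustrative sketch; returns None when none of the table's definitions
    applies (e.g. B lies entirely before A).
    """
    def same(x, y):
        # equality within the epsilon tolerance
        return abs(x - y) <= eps

    if same(a_start, b_start) and same(a_end, b_end):
        return '='        # identical spans
    if same(a_start, b_start) and a_end > b_end:
        return 's'        # started-by: same start, A ends after B
    if same(a_end, b_end) and a_start < b_start:
        return 'f'        # finished-by: A starts before B, same end
    if same(a_end, b_start):
        return 'm'        # A meets B
    if a_end < b_start:
        return '<'        # A ends before B starts
    if a_start < b_start and a_end > b_end:
        return 'c'        # A contains B
    if a_start < b_start < a_end < b_end:
        return 'o'        # A overlaps B
    return None
```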
This repository provides:
- A clean, efficient implementation of KarmaLego (Karma + Lego) for discovering frequent Time Interval Relation Patterns (TIRPs).
- Support for pandas / Dask-backed ingestion of clinical-style interval data.
- Symbol encoding, pattern mining, and per-patient pattern application (apply modes: counts and cohort-normalized features).
- Utilities for managing temporal relations, pattern equality/deduplication, and tree-based extension.
The design goals are: clarity, performance, testability, and reproducibility.
This implementation incorporates several core performance techniques from the KarmaLego framework:
- Apriori pruning: Patterns are extended only if all their (k−1)-subpatterns are frequent, cutting unpromising branches early.
- Temporal relation transitivity (+ memoization): Allen relation composition reduces the relation search space at extension time; the `compose_relation()` function is memoized to eliminate repeated small-table lookups.
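The memoization is plain table-lookup caching. A minimal sketch, using only a few hand-checked Allen compositions rather than the project's full transition table:

```python
from functools import lru_cache

# Illustrative subset of an Allen composition table (the real tables live in
# core/relation_table.py). compose_relation(r1, r2) returns the relations
# possible between A and C given A r1 B and B r2 C.
_COMPOSITION = {
    ('<', '<'): ('<',),   # A before B, B before C  =>  A before C
    ('<', 'm'): ('<',),   # before then meets       =>  before
    ('m', '<'): ('<',),   # meets then before       =>  before
    ('m', 'm'): ('<',),   # meets then meets        =>  before
}

@lru_cache(maxsize=None)  # memoize: each (r1, r2) pair is resolved only once
def compose_relation(r1, r2):
    # Unknown pairs would fall back to "all relations possible" in real code;
    # the sketch returns None to stay small.
    return _COMPOSITION.get((r1, r2))
```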
- SAC (Subset of Active Candidates): Support checks for a child TIRP are restricted to the entities that supported its parent, avoiding scans of unrelated entities at deeper levels.
- CSAC (Consistent SAC) — two enforced constraints:
  - 4a. Embedding-level consistency: The exact parent embeddings (index tuples) per entity are tracked. Each child embedding at level k+1 must directly extend a valid parent embedding at level k — no independent full search is re-run for k>1 patterns. This is the "consistent" part: embeddings grow incrementally and only along verified paths.
  - 4b. Adjacency constraint (same-concept interposer check): For any pair of positions `(i, j)` in an embedding where the relation is an ordering type (e.g. `<`, `m`, or `p`, depending on the relation alphabet), no other occurrence of the same concept may exist between positions `i` and `j` in the entity. This is the core SAC definition from the paper. It is enforced at two levels:
    - Karma (k=2): Candidate `(i, j)` pairs are filtered before being stored in any embedding map.
    - Lego extension (k>2): After relation matching, all new ordering pairs introduced by the extension are checked for interposers before the extended embedding is accepted.

  The ordering relation set is relation-table-aware via `get_sac_relations()`: `{'<','m'}` for 7R, `{'<'}` for 5R and 3R, `{'p'}` for 2R.
  - Accuracy: identical to a full exhaustive search with the same constraint applied; CSAC is pruning + constraint enforcement, not an approximation.
  - Speed: large savings in dense timelines by skipping impossible extensions and non-adjacent embeddings early.
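The interposer check itself is cheap when per-symbol positions are kept sorted. A hypothetical sketch (the names `violates_sac` and `positions_of` are illustrative, not the project's actual API):

```python
import bisect

def violates_sac(i, j, sym_i, sym_j, positions_of):
    """SAC adjacency (interposer) check for an ordering pair (i, j).

    positions_of maps symbol -> sorted list of interval positions in the
    entity; the embedding is rejected if another occurrence of either
    pattern concept lies strictly between positions i and j.
    """
    for sym in (sym_i, sym_j):
        pos = positions_of.get(sym, [])
        # count occurrences of `sym` strictly inside the open range (i, j)
        lo = bisect.bisect_right(pos, i)
        hi = bisect.bisect_left(pos, j)
        if hi > lo:
            return True   # a same-concept interposer exists
    return False
```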
- Skip duplicate pair generation: Pairs (k=2) are produced once in Karma and not re-generated in Lego. This eliminates the ~2× duplication for pairs and can reduce Lego runtime dramatically.
- Level-2 index for O(1) last-pair lookup during Lego extension:
  After Karma finishes, a level-2 index is built from the frequent k=2 TIRP embeddings:
  `(sym_A, sym_B, rel) → { eid → { pos_A → [pos_B, ...] } }`
  During Lego extension of a k-length TIRP to k+1, the bottleneck step is finding all positions of the new symbol that have the correct relation to the last parent position (sym[-2] → sym[-1], same entity). Without the index this requires a bisect scan over all positions of that symbol, followed by a `pairwise_rels` look-up per candidate. With the index, the lookup is:
  `candidates = level2_index[(sym_A, sym_B, rel)][eid].get(parent_last_pos, ())`
  — a dict look-up that returns only pre-verified candidates in O(1), with no failed bisect steps.
  Interaction with CSAC: The k=2 embeddings stored in the index were already CSAC-filtered during Karma (the adjacency / interposer check was applied before each embedding was stored). Therefore, for any candidate retrieved from the index, the last pair's relation and CSAC constraint are already guaranteed. Only the relations from earlier parent positions to the new position still need to be verified. This saves one extra relation look-up and one CSAC adjacency scan per candidate, compounding across all entities at every Lego extension step.
  Fallback: `is_above_vertical_support` with `parent_embeddings_map` set always requires a `level2_index` (asserted at entry). If an entity has no index entry for a given `(sym_A, sym_B, rel)` key — meaning Karma found no CSAC-valid pair for it — the entity is skipped immediately via `continue` rather than scanning. Any candidate a bisect scan could produce would fail the same relation or CSAC check that excluded it from the index, so the skip is safe and avoids entering the parent-embedding loop entirely.
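The index shape and the O(1) lookup can be sketched as follows (the flat-tuple input format and the helper names are illustrative; the real builder consumes the k=2 TIRP embedding maps):

```python
from collections import defaultdict

def build_level2_index(pair_embeddings):
    """Build (sym_A, sym_B, rel) -> {eid -> {pos_A -> [pos_B, ...]}}.

    pair_embeddings: iterable of (sym_A, sym_B, rel, eid, pos_A, pos_B),
    assumed already CSAC-filtered during Karma.
    """
    index = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for sym_a, sym_b, rel, eid, pos_a, pos_b in pair_embeddings:
        index[(sym_a, sym_b, rel)][eid][pos_a].append(pos_b)
    return index

def candidates_for(index, sym_a, sym_b, rel, eid, parent_last_pos):
    """O(1) retrieval of pre-verified candidate positions during extension."""
    per_entity = index.get((sym_a, sym_b, rel), {})
    if eid not in per_entity:
        return ()   # Karma found no CSAC-valid pair here: skip the entity
    return tuple(per_entity[eid].get(parent_last_pos, ()))
```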
- Phased Karma to avoid upfront O(m²) pairwise materialisation:
  On large datasets (e.g. ≥1 000 intervals/entity with `max_distance=None`), computing pairwise temporal relations for all symbol pairs before any frequency filtering produces an O(m²) dict. At ~1 250 intervals/patient this reaches ~156 M entries (≈12–15 GB) — infeasible before a single TIRP has been evaluated. `run_karma` therefore splits into three internal phases:
  - Phase A — Singleton discovery: Iterates all distinct symbols, computes vertical support from `symbol_index`, and produces `frequent_symbols`. `pairwise_rels` starts empty; no pair relations are computed yet.
  - Phase B — Lazy pairwise precomputation: Filters each entity's `symbol_index` to frequent symbols only, then computes `pairwise_rels` by iterating only the sorted positions of surviving-symbol intervals. Both the inner and outer loops are restricted to frequent-symbol positions, so the result is proportional to `|frequent_symbols|² × n_entities` rather than `m²`. The `max_distance` early-break remains effective because positions are lexicographically sorted.
  - Phase C — k=2 TIRP construction: Consumes the now-populated `pairwise_rels` to build frequent length-2 TIRPs (with CSAC filtering) and attach them to the singleton tree.

  The three phases are encapsulated in a single `Karma.run_karma(entity_list, precomputed)` call; sub-phase timings are logged at DEBUG level. `discover_patterns` sees only one unified Karma timing.
  Why not single-pass? A single pass is feasible on small datasets (few symbols, small `max_distance`), but breaks down when the unfiltered pairwise dict exceeds available RAM. The phased split is strictly better for production-scale data; on small benchmarks the overhead is negligible.
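Phase B's restricted double loop can be sketched as follows (the data shapes and the `rel_of` callback are illustrative, not the project's actual signatures):

```python
def lazy_pairwise(intervals, frequent_symbols, rel_of, max_distance=None):
    """Compute pairwise relations over frequent-symbol positions only.

    intervals: the entity's lexicographically sorted (start, end, symbol)
    tuples; rel_of: a relation classifier callback.
    """
    pairwise_rels = {}
    # keep only positions whose symbol survived singleton filtering
    keep = [k for k, (s, e, sym) in enumerate(intervals) if sym in frequent_symbols]
    for a_idx, i in enumerate(keep):
        s1, e1, sym1 = intervals[i]
        for j in keep[a_idx + 1:]:
            s2, e2, sym2 = intervals[j]
            # starts are sorted, so once the gap exceeds max_distance no
            # later interval can be related either: break early
            if max_distance is not None and s2 - e1 > max_distance:
                break
            pairwise_rels[(i, j)] = rel_of(s1, e1, s2, e2)
    return pairwise_rels
```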
- Precomputed per-entity views (reused everywhere): Lexicographic sorting and symbol→positions maps are built once and reused in support checks and extension, avoiding repeat work.
- Integer time arithmetic: Timestamps are held as `int64`; relation checks use pure integer math. If source data were datetimes, they are converted to ns; if they were numeric, they remain in your unit.
- Depth-first (DFS) Lego traversal for lower peak memory:
  The Lego extension phase uses recursive DFS rather than a BFS queue. Under BFS, all k=n patterns with their `embeddings_map` dicts live in memory simultaneously before any k=n+1 work begins; on datasets with wide pattern trees this can become several times the working-set size of a single depth level.
  With DFS, a pattern's `embeddings_map` is freed immediately after its children have consumed it (each child calls `is_above_vertical_support`, which reads from `parent_embeddings_map`, then clears it). At any moment, only the embeddings along one active root-to-leaf path are retained — plus the k=2 embeddings of siblings not yet visited. Peak memory scales with depth × branching-factor-per-node rather than total nodes per level.
  A secondary benefit is reduced Python GC pressure: fewer large dicts stay alive simultaneously, so garbage collection cycles are less frequent and cheaper. On large runs this has a measurable wall-clock effect even before any memory limit is approached.
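The traversal discipline can be shown with a toy sketch (plain dicts standing in for the real TreeNode/TIRP classes; `extend_fn` and `visit_fn` are hypothetical callbacks):

```python
def dfs_extend(node, extend_fn, visit_fn, max_k):
    """DFS-with-release: a node's embeddings are dropped as soon as its
    children have consumed them, so only the active root-to-leaf path
    holds embeddings at any moment."""
    visit_fn(node)
    if node["k"] < max_k:
        for child in extend_fn(node):   # children read node["embeddings"] here
            dfs_extend(child, extend_fn, visit_fn, max_k)
    node["embeddings"] = None           # release before returning to the parent
```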
- Post-singleton symbol_index eviction (infrequent symbol removal):
  After Phase A of `run_karma` identifies `frequent_symbols`, Phase B immediately filters each entity's `symbol_index` and `symbol_items` to retain only the surviving symbols:

  ```python
  entry["symbol_index"] = {sym: pos for sym, pos in entry["symbol_index"].items()
                           if sym in frequent_symbols}
  entry["symbol_items"] = tuple(entry["symbol_index"].items())
  ```

  This is an O(n_entities × n_symbols) one-time pass performed inside `run_karma` before any pairwise or Lego work begins. After it, `symbol_items` — the iterable consumed by `all_extensions`'s inner loop — never contains an infrequent symbol, so the per-iteration frequency guard is eliminated entirely. The savings compound across every DFS node at every depth level.
  Note: `pairwise_rels` is not filtered here because it was never computed for infrequent symbols to begin with (Phase B computes it fresh, restricted to frequent-symbol positions only — see item 7).
These optimizations ensure that KarmaLego runs efficiently on large temporal datasets and scales well as pattern complexity increases.
Performance Notes:
- The core KarmaLego algorithm operates on in-memory Python lists (`entity_list`) and is not accelerated by Dask.
- The current Lego phase runs sequentially. Attempts to parallelize it (e.g., with Dask or multiprocessing) introduced overhead that slowed performance.
- Dask can still be useful during ingestion and preprocessing (e.g., using `dd.read_csv()` for large CSVs).
- Fine-grained parallelism is not recommended: per-node checks are fast and task-management overhead dominates. If the support check ever becomes significantly more expensive, patient-level parallelism within a TIRP may become worthwhile.
- Better scaling can be achieved by:
  - Splitting the dataset into concept clusters or patient cohorts and running in parallel across jobs.
  - Using `min_ver_supp` and `max_k` to control pattern explosion.
  - Persisting symbol maps to ensure consistent encoding across runs.
- No k=1→k=2 in Lego: pairs are already created in Karma; Lego starts from k≥2. This removes structural duplicates and their support checks.
- DFS memory hygiene: each node's `embeddings_map` is released immediately after all its extensions have been checked for support, before recursing into children. Only the embeddings along the current active path remain live at any point, keeping peak RAM proportional to pattern depth rather than pattern breadth.
KarmaLego/
├── core/
│ ├── __init__.py # package marker
│ ├── karmalego.py # algorithmic core: TreeNode, TIRP, KarmaLego/Karma/Lego pipeline
│ ├── io.py # ingestion / preprocessing / mapping / decoding helpers
│ ├── relation_table.py # temporal relation transition tables and definitions
│ └── utils.py # low-level helpers
├── data/
│ ├── synthetic_diabetes_temporal_data.csv # example input dataset (output from the Mediator)
│ ├── symbol_map.json # saved symbol encoding (concept:value -> int)
│ └── inverse_symbol_map.json # reverse mapping for human-readable decoding
├── unittests/
│ ├── test_treenode.py # TreeNode behavior
│ ├── test_tirp.py # TIRP equality, support, relation semantics
│ └── test_karmalego.py # core pipeline / small synthetic pattern discovery
├── main.py # example end-to-end driver / demo script
├── main.ipynb # example end-to-end driver / demo script (better for VMs)
├── pyproject.toml # editable installation manifest
├── pytest.ini # pytest configuration
├── requirements.txt # pinned dependencies (pandas, dask, tqdm, pytest, numpy, etc.)
├── README.md # human-readable version of this document
└── .gitignore # ignored files for git
Recommended Python version: 3.8+
Use a virtual environment:
python -m venv .venv
# Windows:
.\.venv\Scripts\activate
pip install -e .
pip install -r requirements.txt pytest

The `-e .` flag makes the local package importable as `core.karmalego` during development.
Input must be a table (CSV or DataFrame) with these columns:
- `PatientId`: identifier per entity (patient)
- `ConceptName`: event or concept (e.g., lab test name)
- `StartDateTime`: interval start (e.g., `"08/01/2023 00:00"` in `DD/MM/YYYY HH:MM`)
- `EndDateTime`: interval end (same format)
- `Value`: discrete value or category (e.g., `'High'`, `'Normal'`)
You have full flexibility to adapt the input and output shapes and formats in the io.py module, as long as you maintain this general structure.
Example row:
PatientId,ConceptName,StartDateTime,EndDateTime,Value
p1,HbA1c,08/01/2023 0:00,08/01/2023 0:15,High
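Schema validation and date parsing can be sketched with a minimal stdlib-only loader (the real helpers live in `io.py` and also support pandas / Dask inputs; the function name `load_rows` is illustrative):

```python
import csv
from datetime import datetime

REQUIRED = ("PatientId", "ConceptName", "StartDateTime", "EndDateTime", "Value")
DATE_FMT = "%d/%m/%Y %H:%M"   # DD/MM/YYYY HH:MM, as in the example row

def load_rows(csv_text):
    """Validate the required columns and parse interval boundaries."""
    reader = csv.DictReader(csv_text.splitlines())
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    rows = []
    for row in reader:
        row["StartDateTime"] = datetime.strptime(row["StartDateTime"], DATE_FMT)
        row["EndDateTime"] = datetime.strptime(row["EndDateTime"], DATE_FMT)
        rows.append(row)
    return rows
```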
The provided main.py demonstrates the full pipeline:
- Load the CSV (switch to Dask if scaling).
- Validate schema and required fields.
- Build or load symbol mappings (`ConceptName:Value` → integer codes).
- Preprocess: parse dates, apply mapping.
- Convert to entity_list (list of per-patient interval sequences).
- Discover patterns using KarmaLego.
- Decode patterns back to human-readable symbol strings.
- Apply patterns to each patient using the apply modes (see below): `tirp-count`, `tpf-dist`, `tpf-duration`.
- Persist outputs: `discovered_patterns.csv` and a single wide CSV with five feature columns per pattern.
Example invocation:
python main.py

This produces:
- `discovered_patterns.csv` — flat table of frequent TIRPs with support and decoded symbols.
- `patient_pattern_vectors.ALL.csv` — one row per (PatientId, Pattern) with 5 columns: tirp_count_unique_last, tirp_count_all, tpf_dist_unique_last, tpf_dist_all, tpf_duration
You can pivot patient_pattern_vectors.ALL.csv to a wide feature matrix for modeling.
- `epsilon`: temporal tolerance for equality/meet decisions (same unit as your preprocessed timestamps). If source columns were datetimes, they are converted to ns, so you may pass a numeric ns value or a `pd.Timedelta`.
- `max_distance`: maximum gap between intervals to still consider them related (e.g., 1 hour → `pd.Timedelta(hours=1)`), same unit rule as above.
- `min_ver_supp`: minimum vertical support threshold (fraction of patients that must exhibit a pattern for it to be retained).
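The unit rule can be illustrated with a tiny hypothetical helper (the real code accepts `pd.Timedelta`; this sketch uses the stdlib `timedelta` to stay dependency-free):

```python
from datetime import timedelta

def to_time_units(value):
    """Normalize a tolerance/gap parameter: timedeltas become integer
    nanoseconds; plain numbers stay in whatever unit the caller chose."""
    if isinstance(value, timedelta):
        return int(value.total_seconds() * 1_000_000_000)
    return int(value)
```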
- tirp-count
  Horizontal support per patient. Counting strategy:
  - unique_last (default): count one occurrence per distinct last-symbol index among valid embeddings (e.g., in `A…B…A…B…C`, `A<B<C` counts 1).
  - all: count every embedding.
- tpf-dist
  Min–max normalize the `tirp-count` values across the cohort per pattern into [0,1].
- tpf-duration
  For each patient and pattern, take the union of the pattern's embedding windows (each window is the full span from the start of the first interval in the embedding to the end of the last), so overlapping/touching windows are merged and not double-counted (gaps inside a window are included). The per-patient total is then min–max normalized across patients (per pattern) to [0,1].
  Example: if `A<B` occurs with windows of 10h and 5h that don't overlap, duration = 10 + 5 = 15; if the second starts 2h before the first ends, duration = 10 + 5 − 2 = 13.
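The window-union step is a standard merge over sorted intervals; a sketch matching the example above:

```python
def union_duration(windows):
    """Sum the lengths of (start, end) windows after merging
    overlapping or touching ones, so overlaps are not double-counted."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(windows):
        if cur_end is None or start > cur_end:   # disjoint: flush previous run
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:                                    # overlap/touch: extend the run
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total
```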
- Horizontal support (per entity): number of embeddings (index-tuples) of the TIRP found in a given entity.
  Discovery counts all valid embeddings (standard KarmaLego).
  Apply offers a strategy: unique_last (one per distinct last index; default) or all (every embedding).
  Example: In `A…B…A…B…C`, the pattern `A…B…C` has 3 embeddings; discovery counts 3: `(A₀,B₁,C₄)`, `(A₀,B₃,C₄)`, `(A₂,B₃,C₄)`, while `tirp-count` with `unique_last` counts 1.
- Vertical support (dataset level): fraction of entities that have at least one embedding of the TIRP.

If you prefer "non-overlapping" or "one-per-window" counts for downstream modeling, compute that in the apply phase without changing discovery (a toggle can be added there).
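Both counting strategies reduce to a one-liner over a TIRP's embedding tuples; a sketch, using the `A…B…C` example above (the function name is illustrative):

```python
def horizontal_support(embeddings, strategy="unique_last"):
    """Count a TIRP's embeddings within one entity.

    embeddings: list of position tuples, one per valid embedding.
    """
    if strategy == "all":
        return len(embeddings)
    # unique_last: one count per distinct last-symbol position
    return len({emb[-1] for emb in embeddings})
```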
from core.karmalego import KarmaLego
# Prepare entity list, examples in io.py module (may vary between datasources)
# Full running example in main.py
kl = KarmaLego(epsilon=pd.Timedelta(seconds=2),
max_distance=pd.Timedelta(hours=4),
min_ver_supp=0.03,
num_relations=7)
df_patterns = kl.discover_patterns(entity_list, min_length=1, max_length=None)  # returns DataFrame

from core.parallel_runner import run_parallel_jobs
# 1. Define your jobs
# Each job is a dict with 'name', 'data' (entity_list), and 'params'
jobs = [
{
'name': 'cohort_A',
'data': entity_list_A,
'params': {
'epsilon': pd.Timedelta(minutes=1),
'max_distance': pd.Timedelta(hours=1),
'min_ver_supp': 0.5,
'min_length': 2
}
},
{
'name': 'cohort_B',
'data': entity_list_B,
'params': {
'epsilon': pd.Timedelta(minutes=1),
'max_distance': pd.Timedelta(hours=1),
'min_ver_supp': 0.4,
'min_length': 2
}
}
]
# 2. Run in parallel (uses multiprocessing)
# Returns a single DataFrame with a 'job_name' column identifying the source job
df_all = run_parallel_jobs(jobs, num_workers=4)

# Build all 5 feature columns in one CSV
# Keys are (tuple(symbols), tuple(relations)) — stable, cheaper than repr, and correct
rep_to_str = {(tuple(t.symbols), tuple(t.relations)): s for t, s in zip(df_patterns["tirp_obj"], df_patterns["tirp_str"])}
pattern_keys = [(tuple(t.symbols), tuple(t.relations)) for t in df_patterns["tirp_obj"]]
vec_count_ul = kl.apply_patterns_to_entities(entity_list, df_patterns, patient_ids,
mode="tirp-count", count_strategy="unique_last")
vec_count_all = kl.apply_patterns_to_entities(entity_list, df_patterns, patient_ids,
mode="tirp-count", count_strategy="all")
vec_tpf_dist_ul = kl.apply_patterns_to_entities(entity_list, df_patterns, patient_ids,
mode="tpf-dist", count_strategy="unique_last")
vec_tpf_dist_all = kl.apply_patterns_to_entities(entity_list, df_patterns, patient_ids,
mode="tpf-dist", count_strategy="all")
vec_tpf_duration = kl.apply_patterns_to_entities(entity_list, df_patterns, patient_ids,
mode="tpf-duration", count_strategy="unique_last")
rows = []
for pid in patient_ids:
for rep in pattern_keys:
rows.append({
"PatientId": pid,
"Pattern": rep_to_str.get(rep, rep),
"tirp_count_unique_last": vec_count_ul.get(pid, {}).get(rep, 0.0),
"tirp_count_all": vec_count_all.get(pid, {}).get(rep, 0.0),
"tpf_dist_unique_last": vec_tpf_dist_ul.get(pid, {}).get(rep, 0.0),
"tpf_dist_all": vec_tpf_dist_all.get(pid, {}).get(rep, 0.0),
"tpf_duration": vec_tpf_duration.get(pid, {}).get(rep, 0.0),
})
import pandas as pd
pd.DataFrame(rows).to_csv("data/patient_pattern_vectors.ALL.csv", index=False)

This block can be parallelized at the patient level or at the function level, but since its usage can change between works, I see no point in adding a single parallelism method to the module. Feel free to extend.
Run the full test suite:
python -m pytest unittests -q -s

Run a single test file:

python -m pytest unittests/test_tirp.py -q -s

The -s flag shows pattern printouts and progress bars for debugging.
Contains:
- `symbols` (tuple of encoded ints)
- `relations` (tuple of temporal relation codes)
- `k` (pattern length)
- `vertical_support`
- `support_count`
- `entity_indices_supporting`
- `indices_of_last_symbol_in_entities`
- `tirp_obj` (internal object; drop before sharing)
- `symbols_readable` (if decoded)
Long format: one row per (PatientId, Pattern) with the following columns:
- `tirp_count_unique_last` — horizontal support per patient using unique_last counting.
- `tirp_count_all` — horizontal support per patient counting all embeddings.
- `tpf_dist_unique_last` — min–max of `tirp_count_unique_last` across patients, per pattern.
- `tpf_dist_all` — min–max of `tirp_count_all` across patients, per pattern.
- `tpf_duration` — union of embedding spans per patient (no overlap double-counting), then min–max across patients, per pattern.
Note: `tpf-*` values are normalized per pattern to [0,1] across the cohort.
Example for singletons: if `A` spans are `[1,2,1]` across patients, they normalize to `[0.0, 1.0, 0.0]`.
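The normalization itself is a per-pattern min–max; a sketch (the zero-for-constant-column convention below is an assumption of this sketch, not taken from the source):

```python
def minmax_per_pattern(values):
    """Min–max normalize one pattern's per-patient values into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]   # constant column: map everything to 0.0
    return [(v - lo) / (hi - lo) for v in values]
```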
- Replace `pandas.read_csv` with `dask.dataframe.read_csv` for large inputs; the ingestion helpers in `io.py` support Dask.
- Persist precomputed symbol maps to keep encoding stable across runs.
- Use categorical dtype for the symbol column after mapping to reduce memory pressure.
- Tune `min_ver_supp` to control pattern explosion vs sensitivity.
- If memory is tight on extremely dense data, consider limiting `max_k` or post-processing horizontal support to non-overlapping counts; CSAC itself preserves correctness but can retain many embeddings per entity for highly frequent TIRPs.
This implementation is entirely in-memory. The raw data (entity_list), the pattern tree, and the CSAC embedding maps (which store valid index-tuples for every active pattern) all reside in RAM.
Memory usage is driven by:
- Dataset Size: Number of records (intervals) and number of patients.
- Temporal Entity Count: Number of unique `ConceptName:Value` pairs (not just unique `ConceptName`). Each distinct value (e.g., `"HbA1c:High"` vs `"HbA1c:Normal"`) is a separate temporal entity.
- Number of Relations: Coarser relation sets (2, 3) produce fewer patterns; finer sets (5, 7) explode combinatorially.
- Pattern Density: How many frequent patterns exist and how many embeddings they have per patient.
- CSAC Overhead: Storing exact embedding tuples for active candidates is memory-intensive for dense data.
Worst-Case Memory Estimation Table
These estimates assume all possible patterns up to k=5 are frequent (worst case) and dense embeddings per patient. Real workloads with min_ver_supp > 0 will use significantly less memory.
| Temporal Entities | Relations | Patterns (≤k=5) | Patients | Est. RAM (Worst) | Recommendation |
|---|---|---|---|---|---|
| 20 | 2 | ~67k | 10k | 8 GB | Workstation OK. |
| 20 | 3 | ~130k | 10k | 12 GB | Workstation OK. |
| 20 | 5 | ~320k | 10k | 24 GB | High-RAM Workstation. |
| 20 | 7 | ~540k | 10k | 40 GB | Server / Split. |
| 50 | 2 | ~2.6M | 10k | 64 GB | Server / Split. |
| 50 | 3 | ~6.4M | 10k | 128 GB | Split Cohort. |
| 50 | 5 | ~22M | 10k | 256 GB+ | Must Split. |
| 50 | 7 | ~48M | 10k | 512 GB+ | Must Split. |
| 100 | 2 | ~81M | 10k | 512 GB+ | Must Split. |
| 100 | 3 | ~242M | 10k | 1+ TB | Must Split. |
| 100 | 5 | ~1.1B | 10k | 4+ TB | Must Split. |
| 100 | 7 | ~2.8B | 10k | 8+ TB | Must Split. |
Important: These are theoretical worst-case numbers assuming zero Apriori pruning. In practice, `min_ver_supp` will eliminate the vast majority of candidate patterns, reducing memory by 10–100×. Use this table to understand upper bounds, not typical usage.
Pattern Count Growth (Worst Case):
For N temporal entities and R relations, each tree extension multiplies the candidate count by roughly N × R, so patterns of length k scale as O(N^k × R^(k−1)).
For k=5, this grows as O(N⁵ × R⁴), which is why entity count and relation granularity are critical levers.
Mitigation Strategy: If your data exceeds these limits, do not run as a single job.
- Split the cohort: Divide patients into chunks where each chunk handles different temporal entities (a subset of concepts per chunk).
- Run in parallel: Use the `run_parallel_jobs` utility to process chunks independently.
- Merge results: Concatenate the resulting pattern DataFrames.
Note that splitting on subsets of patients alone is unlikely to be efficient: the in-memory tree grows to roughly the same size in each job, so only wall-clock time improves, not peak memory usage.
To upload this project to an external machine, create a self-contained zip that includes all source code and configuration but excludes unit-tests, cached bytecode, and data files (which should be uploaded separately or already present on the VM).
PowerShell (Windows) — run from the KarmaLego\ root:
Compress-Archive -Force -Path `
"core",`
"main.ipynb",`
"pyproject.toml",`
"requirements.txt",`
"README.md",`
"LICENSE" `
-DestinationPath "karmalego-deploy.zip"On the VM, install with:
unzip karmalego-deploy.zip
cd KarmaLego
pip install -e .

`pip install -e .` registers `core` as a package in the environment (using the dependencies declared in `pyproject.toml`) so `from core.karmalego import ...` works from any working directory, including from inside Jupyter.
The first notebook cell runs this automatically — you only need to run it once after upload.

