| 1 | +# RFC: TrackingDataContainer — A Metadata-Aware Container for Tracking Data |
| 2 | + |
| 3 | +## Summary |
| 4 | + |
| 5 | +This RFC proposes the creation of a `TrackingDataContainer`: a shared, standardized, metadata-aware container for sports tracking data. It is backed by Apache Arrow and acts as the central interface between tools like Kloppy, Floodlight, BallRadar, and others. |
| 6 | + |
| 7 | +The container combines high-performance, zero-copy data access with a rich metadata model, enabling efficient workflows, semantic column selection, and reproducibility. All packages in the ecosystem are expected to use this container as the primary I/O interface. |
| 8 | + |
| 9 | +--- |
| 10 | + |
| 11 | +## Motivation |
| 12 | + |
| 13 | +As tracking pipelines grow across football and other sports, the need for modularity, performance, and inter-package interoperability becomes critical. |
| 14 | + |
| 15 | +Current challenges: |
| 16 | + |
| 17 | +- Object-based tracking representations (e.g. Kloppy) don't scale well to large datasets |
| 18 | +- Packages use inconsistent schemas and lack a shared representation |
| 19 | +- Derived metrics and transforms aren’t reproducible or shareable |
| 20 | +- Skeleton data, training data, and multi-team use cases need a more flexible schema |
| 21 | + |
| 22 | +A container backed by Apache Arrow, with enforced metadata and a shared schema, solves these problems while keeping compatibility with modern analytics engines (Polars, DuckDB, etc.). |
| 23 | + |
| 24 | +--- |
| 25 | + |
| 26 | +## Goals |
| 27 | + |
| 28 | +- Define a canonical, metadata-rich container for tracking data |
| 29 | +- Standardize column naming for interoperability |
| 30 | +- Ensure all packages use the container interface |
| 31 | +- Enable fast, lazy access to Arrow or Polars objects |
| 32 | +- Support derived metrics, frame materialization, and multiple data layers (e.g. skeleton) |
| 33 | + |
| 34 | +### Non-goals |
| 35 | + |
| 36 | +- Defining a full storage or lakehouse solution |
| 37 | +- Handling raw input formats (delegated to Kloppy or custom parsers) |
| 38 | + |
| 39 | +--- |
| 40 | + |
| 41 | +## Design Overview |
| 42 | + |
| 43 | +The `TrackingDataContainer` wraps an Apache Arrow table and Kloppy-style metadata. It supports metadata-aware column access, derived metric registration, and engine-agnostic transformations (Polars, NumExpr, etc.). |
| 44 | + |
| 45 | +Packages access the container via a consistent interface, with optional access to Arrow, Polars LazyFrames, or NumPy arrays. |
| 46 | + |
| 47 | +```python |
| 48 | +container = kloppy.load_as_container("data.json", "metadata.json") |
| 49 | + |
| 50 | +# Add ball speed |
| 51 | +container.add_metric("ball_speed", expression="sqrt((ball.x - ball.x.shift(1))**2 + ...)", engine="polars") |
| 52 | + |
| 53 | +# Frame materialization |
| 54 | +frame = container.materialize_frame(timestamp=534.2) |
| 55 | + |
| 56 | +# Use LazyFrame |
| 57 | +df = container.lazy().select([...]).collect() |
| 58 | +``` |
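
The `ball_speed` expression in the example above is elided, and this RFC does not spell it out. As a rough illustration only, a frame-to-frame speed can be derived from consecutive ball positions and timestamps. The function name, the restriction to two dimensions, and the use of `None` for frames without a valid predecessor are assumptions of this sketch, not part of the proposed API.

```python
import math


def ball_speed(ball_x, ball_y, timestamp):
    """Frame-to-frame ball speed in position units per time unit.

    Hypothetical helper: the first frame has no predecessor, so its speed is None.
    """
    speeds = [None]
    for i in range(1, len(timestamp)):
        dt = timestamp[i] - timestamp[i - 1]
        if dt <= 0:
            # Duplicate or out-of-order timestamps: speed is undefined here.
            speeds.append(None)
            continue
        dx = ball_x[i] - ball_x[i - 1]
        dy = ball_y[i] - ball_y[i - 1]
        speeds.append(math.hypot(dx, dy) / dt)
    return speeds
```

In the container itself, the same calculation would presumably be expressed through whichever engine is passed to `add_metric`, rather than as a Python loop.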
| 59 | + |
| 60 | +--- |
| 61 | + |
| 62 | +## Detailed Design |
| 63 | + |
| 64 | +### Core Components |
| 65 | + |
| 66 | +```python |
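|  | +from __future__ import annotations  # defer annotation evaluation (ColumnSelector is defined elsewhere) |
|  | +import polars as pl |
|  | +import pyarrow as pa |
|  | + |
| 67 | +class TrackingDataContainer: |
| 68 | +    def arrow(self) -> pa.Table: ...  # underlying Arrow table |
| 69 | +    def lazy(self) -> pl.LazyFrame: ...  # lazy Polars view |
| 70 | +    selector: ColumnSelector  # attribute, e.g. selector.player("home", 7).x |
| 71 | +    def add_metric(self, name: str, expression: str | pl.Expr, engine: str): ... |
| 72 | +    def add_metrics(self, list_of_exprs: list[pl.Expr]): ... |
| 73 | +    def get_column(self, *args, **kwargs): ...  # arguments not yet specified |
| 74 | +    def materialize_frame(self, timestamp: float): ... |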
| 75 | +``` |
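
To make `materialize_frame` more concrete, here is a minimal sketch of what frame materialization could look like over flat columns. It assumes the data is available as a plain dict of column name to list, that `timestamp` is sorted ascending and non-empty, and that the nearest row should be returned. The function body, the nested output shape, and the decision to skip skeleton columns are all assumptions for illustration, not the proposed implementation.

```python
from bisect import bisect_left

PLAYER_PREFIX = "teams/players/"


def materialize_frame(columns, timestamp):
    """Return the row nearest to `timestamp` as a nested dict (hypothetical shape)."""
    times = columns["timestamp"]  # assumed sorted ascending and non-empty
    i = bisect_left(times, timestamp)
    # Step back when the previous row is at least as close as the next one,
    # or when the requested timestamp lies past the last row.
    if i == len(times) or (i > 0 and timestamp - times[i - 1] <= times[i] - timestamp):
        i -= 1

    frame = {"timestamp": times[i], "ball": {}, "players": {}}
    for name, values in columns.items():
        if name.startswith("ball_"):
            frame["ball"][name[len("ball_"):]] = values[i]
        elif name.startswith(PLAYER_PREFIX):
            rest = name[len(PLAYER_PREFIX):]
            if "/" in rest:
                continue  # skeleton columns are ignored in this sketch
            player_id, _, axis = rest.rpartition("_")
            frame["players"].setdefault(player_id, {})[axis] = values[i]
    return frame
```

A real implementation would likely read only the selected row from the Arrow table rather than iterate over Python lists, and would return whatever frame type the ecosystem agrees on.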
| 76 | + |
| 77 | +### Column Naming Convention |
| 78 | + |
| 79 | +We propose to align column naming with the [Common Data Format (CDF)](https://doi.org/10.48550/arXiv.2401.01882) standard as published by Anzer et al. (2025). This avoids reinventing the wheel and increases interoperability with current and future tooling that supports this standard. |
| 80 | + |
| 81 | +Example conventions based on CDF: |
| 82 | + |
| 83 | +| Concept | Column Name | |
| 84 | +|-----------|---------------------------------------------| |
| 85 | +| Ball | `ball_x`, `ball_y`, `ball_z` | |
| 86 | +| Player | `teams/players/<player_id>_x`, `_y` | |
| 87 | +| Skeleton | `teams/players/<player_id>/<limb>_x` | |
| 88 | +| Metrics | `package/metric_name` | |
| 89 | +| Time | `timestamp`, `frame_id`, `period` | |
| 90 | + |
| 91 | +These conventions follow four rules: |
| 92 | + |
| 93 | +- `snake_case` field names |
| 94 | +- British spelling (e.g. "colour") |
| 95 | +- Metric or object first, followed by descriptor |
| 96 | +- Units described in metadata, not in the column name |
| 97 | + |
| 98 | +By adhering to CDF, the `TrackingDataContainer` will be interoperable with other ecosystems (e.g. research, federation-level analytics) and allow users to mix data from different vendors more reliably. |
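
The naming table above can be captured in a few small helpers. This is a sketch that follows the example patterns in this RFC only; the helper names and the axis validation are hypothetical, and the CDF specification itself remains the authority on exact names.

```python
AXES = ("x", "y", "z")


def _check_axis(axis):
    if axis not in AXES:
        raise ValueError(f"axis must be one of {AXES}, got {axis!r}")


def ball_column(axis):
    """e.g. ball_x"""
    _check_axis(axis)
    return f"ball_{axis}"


def player_column(player_id, axis):
    """e.g. teams/players/<player_id>_x"""
    _check_axis(axis)
    return f"teams/players/{player_id}_{axis}"


def skeleton_column(player_id, limb, axis):
    """e.g. teams/players/<player_id>/<limb>_x"""
    _check_axis(axis)
    return f"teams/players/{player_id}/{limb}_{axis}"


def metric_column(package, metric_name):
    """e.g. package/metric_name"""
    return f"{package}/{metric_name}"
```

Centralizing name construction like this is one way a `ColumnSelector` could avoid hard-coded strings in downstream packages.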
| 99 | + |
| 100 | +### Metadata |
| 101 | + |
| 102 | +- Required for all sessions |
| 103 | +- Includes teams, players, field size, orientation, coordinate system |
| 104 | +- Tracks added metrics, including provenance |
| 105 | +- Defines optional logical "layers" (e.g. tracking, skeleton, predictions) |
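
As an illustration of the metadata model listed above, the following sketch uses standard-library dataclasses. Every class and field name here (`SessionMetadata`, `MetricRecord`, `field_size`, `register_metric`, and so on) is hypothetical; the actual model is expected to follow Kloppy-style metadata and is not defined by this example.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field


@dataclass
class MetricRecord:
    """Provenance for one derived metric (hypothetical fields)."""
    name: str
    package: str      # package that produced the metric
    expression: str   # how it was computed
    engine: str       # e.g. "polars"


@dataclass
class SessionMetadata:
    teams: dict[str, list[str]]          # team id -> player ids
    field_size: tuple[float, float]      # (length, width)
    orientation: str
    coordinate_system: str
    metrics: list[MetricRecord] = field(default_factory=list)
    layers: dict[str, list[str]] = field(default_factory=dict)  # layer -> column names

    def register_metric(self, record: MetricRecord) -> None:
        # Refuse silent overwrites so provenance stays unambiguous.
        if any(m.name == record.name for m in self.metrics):
            raise ValueError(f"metric {record.name!r} is already registered")
        self.metrics.append(record)

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)
```

Serializing to JSON keeps the metadata easy to store alongside the table, for example in the schema-level key-value metadata that Arrow supports.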
| 106 | + |
| 107 | +### Storage & Computation Model |
| 108 | + |
| 109 | +The container is designed to support **lazy evaluation** and **filter pushdown** when reading from Arrow-compatible storage such as Parquet files and Iceberg tables. |
| 110 | + |
| 111 | +- **Lazy Evaluation**: All transformations (e.g. metric calculations, filters) are applied lazily using Polars' `LazyFrame` or Arrow compute APIs. |
| 112 | +- **Filter Pushdown**: When reading from Parquet or Iceberg, the container supports pushing column and row filters down to the reader, so non-matching data can be skipped at read time instead of being loaded and then discarded. The benefit grows with dataset size. |
| 113 | +- **Column Projection**: Only the necessary columns are loaded or materialized, based on selectors or expressions. |
| 114 | +- **Iceberg Integration (Future Direction)**: Iceberg can serve as the underlying table format, offering versioning, schema evolution, and partition pruning. The container will provide adapters to read from and write to Iceberg tables. |
| 115 | + |
| 116 | +This design ensures the container can be scaled from local processing to lakehouse-style infrastructure with minimal friction. |
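
The interplay of lazy evaluation, filtering, and column projection described above can be shown with a deliberately small pure-Python toy. It is not Polars or Arrow code and performs no real pushdown into a file reader. The `LazyQuery` class and its methods are hypothetical and only demonstrate the idea that a plan is recorded first, and that only the columns the plan needs are touched when it is collected.

```python
class LazyQuery:
    """Toy lazy plan over a column store (dict of column name -> list)."""

    def __init__(self, store, predicates=(), projection=None):
        self._store = store
        self._predicates = tuple(predicates)
        self._projection = projection

    def filter(self, column, predicate):
        # Record the filter; nothing is evaluated yet.
        plan = self._predicates + ((column, predicate),)
        return LazyQuery(self._store, plan, self._projection)

    def select(self, columns):
        # Record the projection; nothing is evaluated yet.
        return LazyQuery(self._store, self._predicates, tuple(columns))

    def _output_columns(self):
        if self._projection is not None:
            return list(self._projection)
        return list(self._store)

    def columns_needed(self):
        """Columns that must be read: the projection plus any filter columns."""
        needed = self._output_columns()
        for column, _ in self._predicates:
            if column not in needed:
                needed.append(column)
        return needed

    def collect(self):
        """Evaluate the recorded plan in one pass."""
        n_rows = len(next(iter(self._store.values()), []))
        keep = [
            i for i in range(n_rows)
            if all(pred(self._store[col][i]) for col, pred in self._predicates)
        ]
        return {
            name: [self._store[name][i] for i in keep]
            for name in self._output_columns()
        }
```

With Polars, the corresponding pattern is a `pl.scan_parquet(...)` call followed by `filter`, `select`, and `collect` on the resulting `LazyFrame`, where the query optimizer decides which filters and projections can be pushed into the scan.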
| 117 | + |
| 118 | +--- |
| 119 | + |
| 120 | +## Use Cases / Examples |
| 121 | + |
| 122 | +- Load tracking data and add derived metrics |
| 123 | +- Run smoothing or pitch control via external packages |
| 124 | +- Convert skeleton data to tracking data (e.g. using center of mass) |
| 125 | +- Visualize or export a single frame for debugging |
| 126 | +- Export Polars dataframe with semantic selectors |
| 127 | + |
| 128 | +--- |
| 129 | + |
| 130 | +## Rationale & Alternatives |
| 131 | + |
| 132 | +### Why not just Arrow? |
| 133 | +- Arrow is fast and standard, but lacks semantics |
| 134 | +- Metadata enables safe, interpretable, and sport-agnostic usage |
| 135 | + |
| 136 | +### Why enforce the container? |
| 137 | +- To guarantee consistency across packages |
| 138 | +- To avoid schema drift and partial implementations |
| 139 | + |
| 140 | +### Why CDF and flat schema? |
| 141 | +- CDF provides a well-vetted, community-driven naming standard |
| 142 | +- Flat schema increases compatibility and simplifies filter pushdown and projection |
| 143 | +- Structs may be supported later as an internal representation |
| 144 | + |
| 145 | +--- |
| 146 | + |
| 147 | +## Compatibility |
| 148 | + |
| 149 | +- Packages must use `TrackingDataContainer` for I/O |
| 150 | +- Read-only tools may operate directly on Arrow tables, provided they follow the schema |
| 151 | +- Kloppy will provide `load_as_container(...)` as the default loader |
| 152 | + |
| 153 | +--- |
| 154 | + |
| 155 | +## Future Directions |
| 156 | + |
| 157 | +- Iceberg support with partitioning, time-travel, and pushdown |
| 158 | +- Event and prediction layers as Arrow sub-tables or column groups |
| 159 | +- Multi-sensor support (e.g. GPS + camera) |
| 160 | +- JSON-LD-style metadata for validation and discovery |
| 161 | + |
| 162 | +--- |
| 163 | + |
| 164 | +## Open Questions |
| 165 | + |
| 166 | +- Should metrics be namespaced (`package/metric`)? |
| 167 | +- Should struct-based layouts be supported now or later? |
| 168 | +- Do we support event data natively or as a separate container? |
| 169 | + |
| 170 | +--- |
| 171 | + |
| 172 | +## Conclusion |
| 173 | + |
| 174 | +The `TrackingDataContainer` provides a shared, fast, and extensible foundation for sports tracking pipelines. It aligns with modern data practices (Arrow), adopts the CDF standard for naming and schema, encourages inter-package reuse, and ensures semantic correctness through enforced metadata. |
| 175 | + |
| 176 | +We propose making this the default container for all tools in the ecosystem. |
| 177 | + |
| 178 | +**Feedback welcome!** |