# RFC: TrackingDataContainer — A Metadata-Aware Container for Tracking Data

## Summary

This RFC proposes the creation of a `TrackingDataContainer`: a shared, standardized, metadata-aware container for sports tracking data. It is backed by Apache Arrow and acts as the central interface between tools like Kloppy, Floodlight, BallRadar, and others.

The container combines high-performance, zero-copy data access with a rich metadata model, enabling efficient workflows, semantic column selection, and reproducibility. All packages in the ecosystem are expected to use this container as the primary I/O interface.

---
## Motivation

As tracking pipelines grow across football and other sports, the need for modularity, performance, and inter-package interoperability becomes critical.

Current challenges:

- Object-based tracking (e.g. Kloppy) doesn't scale well to large datasets
- Packages use inconsistent schemas and lack a shared representation
- Derived metrics and transforms aren't reproducible or shareable
- Skeleton data, training data, and multi-team use cases need flexibility

A container backed by Apache Arrow, with enforced metadata and a shared schema, solves these problems while keeping compatibility with modern analytics engines (Polars, DuckDB, etc.).

---
## Goals

- Define a canonical, metadata-rich container for tracking data
- Standardize column naming for interoperability
- Ensure all packages use the container interface
- Enable fast, lazy access to Arrow or Polars objects
- Support derived metrics, frame materialization, and multiple data layers (e.g. skeleton)

### Non-goals

- Defining a full storage or lakehouse solution
- Handling raw input formats (delegated to Kloppy or custom parsers)

---
## Design Overview

The `TrackingDataContainer` wraps an Apache Arrow table and Kloppy-style metadata. It supports metadata-aware column access, derived metric registration, and engine-agnostic transformations (Polars, NumExpr, etc.).

Packages access the container via a consistent interface, with optional access to Arrow, Polars LazyFrames, or NumPy arrays.

```python
container = kloppy.load_as_container("data.json", "metadata.json")

# Add ball speed
container.add_metric("ball_speed", expression="sqrt((ball.x - ball.x.shift(1))**2 + ...)", engine="polars")

# Frame materialization
frame = container.materialize_frame(timestamp=534.2)

# Use LazyFrame
df = container.lazy().select([...]).collect()
```
---
## Detailed Design

### Core Components

```python
class TrackingDataContainer:
    def arrow(self) -> pa.Table: ...
    def lazy(self) -> pl.LazyFrame: ...

    @property
    def selector(self) -> ColumnSelector: ...  # e.g. selector.player("home", 7).x

    def add_metric(self, name: str, expression: str | pl.Expr, engine: str) -> None: ...
    def add_metrics(self, exprs: list[pl.Expr]) -> None: ...
    def get_column(...): ...
    def materialize_frame(self, timestamp: float): ...
```
### Column Naming Convention

We propose to align column naming with the [Common Data Format (CDF)](https://doi.org/10.48550/arXiv.2401.01882) standard as published by Anzer et al. (2025). This avoids reinventing the wheel and increases interoperability with current and future tooling that supports this standard.

Example conventions based on CDF:

| Concept  | Column Name                          |
|----------|--------------------------------------|
| Ball     | `ball_x`, `ball_y`, `ball_z`         |
| Player   | `teams/players/<player_id>_x`, `_y`  |
| Skeleton | `teams/players/<player_id>/<limb>_x` |
| Metrics  | `package/metric_name`                |
| Time     | `timestamp`, `frame_id`, `period`    |

These conventions follow:

- snake_case field names
- British spelling (e.g. "colour")
- Metric or object first, followed by descriptor
- Units described in metadata, not in column names

By adhering to CDF, the `TrackingDataContainer` will be interoperable with other ecosystems (e.g. research, federation-level analytics) and allow users to mix data from different vendors more reliably.
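The mapping from concepts to column names in the table above can be captured in small helper functions. These helpers are purely illustrative; neither CDF nor the proposed API defines them:

```python
# Hypothetical name builders following the convention: object first,
# then descriptor, snake_case, no units in the name.
def player_col(player_id: str, field: str) -> str:
    return f"teams/players/{player_id}_{field}"

def skeleton_col(player_id: str, limb: str, field: str) -> str:
    return f"teams/players/{player_id}/{limb}_{field}"

def metric_col(package: str, metric: str) -> str:
    return f"{package}/{metric}"

print(player_col("p123", "x"))                 # teams/players/p123_x
print(skeleton_col("p123", "left_foot", "y"))  # teams/players/p123/left_foot_y
print(metric_col("floodlight", "pitch_control"))
```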
### Metadata

- Required for all sessions
- Includes teams, players, field size, orientation, coordinate system
- Tracks added metrics, including provenance
- Defines optional logical "layers" (e.g. tracking, skeleton, predictions)
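As one way the required fields could be laid out, a metadata payload might look as follows. Every field name here is an assumption for illustration; the actual schema remains to be specified:

```python
# Illustrative metadata payload; field names are assumptions, not a spec.
metadata = {
    "teams": [{"id": "home", "name": "Home FC"}, {"id": "away", "name": "Away FC"}],
    "players": [{"id": "p123", "team": "home", "shirt_number": 7}],
    "pitch": {"length": 105.0, "width": 68.0, "unit": "m"},  # units live here, not in column names
    "orientation": "home-away",
    "coordinate_system": "cdf",
    "derived_metrics": [  # provenance of metrics added via add_metric
        {"name": "ball_speed", "engine": "polars", "package": "floodlight"}
    ],
    "layers": ["tracking", "skeleton"],
}
print(sorted(metadata.keys()))
```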
### Storage & Computation Model

The container is designed to support **lazy evaluation** and **filter pushdown** using Apache Arrow-compatible formats like Parquet and Iceberg.

- **Lazy Evaluation**: All transformations (e.g. metric calculations, filters) are applied lazily using Polars' `LazyFrame` or Arrow compute APIs.
- **Filter Pushdown**: When reading from Parquet or Iceberg, the container supports pushing column and row filters down to the storage engine, dramatically improving performance for large datasets.
- **Column Projection**: Only the necessary columns are loaded or materialized, based on selectors or expressions.
- **Iceberg Integration (Future Direction)**: Iceberg can serve as the underlying storage format, offering versioning, schema evolution, and partition pruning. The container will provide adapters to read from and write to Iceberg tables.

This design ensures the container can be scaled from local processing to lakehouse-style infrastructure with minimal friction.
---
## Use Cases / Examples

- Load tracking data and add derived metrics
- Run smoothing or pitch control via external packages
- Convert skeleton data to tracking data (e.g. using center of mass)
- Visualize or export a single frame for debugging
- Export a Polars dataframe with semantic selectors

---
## Rationale & Alternatives

### Why not just Arrow?
- Arrow is fast and standard, but lacks semantics
- Metadata enables safe, interpretable, and sport-agnostic usage

### Why enforce the container?
- To guarantee consistency across packages
- To avoid schema drift and partial implementations

### Why CDF and a flat schema?
- CDF provides a well-vetted, community-driven naming standard
- A flat schema increases compatibility and simplifies filter pushdown and projection
- Structs may be supported later as an internal representation

---
## Compatibility

- Packages must use `TrackingDataContainer` for I/O
- Read-only tools may optionally operate on Arrow directly, if they follow the schema
- Kloppy will provide `load_as_container(...)` as the default loader

---
## Future Directions

- Iceberg support with partitioning, time-travel, and pushdown
- Event and prediction layers as Arrow sub-tables or column groups
- Multi-sensor support (e.g. GPS + camera)
- JSON-LD-style metadata for validation and discovery

---
## Open Questions

- Should metrics be namespaced (`package/metric`)?
- Should struct-based layouts be supported now or later?
- Do we support event data natively or as a separate container?

---
## Conclusion

The `TrackingDataContainer` provides a shared, fast, and extensible foundation for sports tracking pipelines. It aligns with modern data practices (Arrow), adopts the CDF standard for naming and schema, encourages inter-package reuse, and ensures semantic correctness through enforced metadata.

We propose making this the default container for all tools in the ecosystem.

**Feedback welcome!**
