|
29 | 29 | ;; rethink and the project's core primitives today using the |
30 | 30 | ;; Victoria electricity demand dataset. |
31 | 31 |
|
32 | | -;; ## Design Philosophy |
33 | | -;; |
34 | | -;; If you've used Pandas for time series work, you're familiar with the index-based |
35 | | -;; approach: set a DateTimeIndex, and operations like slicing and resampling work |
36 | | -;; implicitly on it. It's convenient, but it's also hidden state threaded through |
37 | | -;; your data. |
38 | | -;; |
39 | | -;; tablecloth.time takes a different path. Following tablecloth's design — and |
40 | | -;; Clojure's preference for explicit, composable operations — you always specify |
41 | | -;; which column you're working with. Each function takes the data and the columns |
42 | | -;; it operates on. The pipeline reads like what it does. |
43 | | -;; |
44 | | -;; This isn't a compromise. As Chris Nuernberger (author of tech.ml.dataset) noted, |
45 | | -;; with immutable datasets that get rebuilt on each transformation, a tree-based |
46 | | -;; index offers no performance advantage over binary search on sorted data. The |
47 | | -;; simplicity is the feature. |
| 32 | +;; ## Why No Index? |
| 33 | +;; |
| 34 | +;; The original tablecloth.time was built around an index — set a time column |
| 35 | +;; as your dataset's index, and operations like `slice` and `resample` would |
| 36 | +;; work implicitly on it, just like Pandas. This seemed necessary for two reasons: |
| 37 | +;; performance (tree-based indexes offer O(log n) lookups) and convenience |
| 38 | +;; (you don't have to keep specifying which column is the time column). |
| 39 | +;; |
| 40 | +;; But when tech.ml.dataset removed its indexing mechanism in v7, it forced a |
| 41 | +;; rethink. And the rethink revealed that neither rationale held up. |
| 42 | +;; |
| 43 | +;; **On performance:** Unlike Pandas DataFrames, Clojure's datasets are immutable.
| 44 | +;; They're rebuilt on each transformation. Under these conditions, maintaining a |
| 45 | +;; tree-based index is pure overhead — you'd rebuild it constantly. As Chris |
| 46 | +;; Nuernberger (author of tech.ml.dataset) put it: "Just sorting the dataset and |
| 47 | +;; using binary search will outperform most/all tree structures in this scenario." |
| 48 | +;; |
| 49 | +;; **On convenience:** The index adds implicit state threaded through your data. |
| 50 | +;; Tablecloth's API avoids this — you always say which columns you're operating on. |
| 51 | +;; The pipeline reads like what it does. This aligns with Clojure's broader preference |
| 52 | +;; for explicit, composable operations over hidden magic. |
| 53 | +;; |
| 54 | +;; The simplicity isn't a compromise. It's the feature. |
| 55 | +;; |
| 56 | +;; For the full discussion of this design shift, see |
| 57 | +;; [Composability Over Abstraction](https://humanscodes.com/tablecloth-time-relaunch) |
| 58 | +;; on humanscodes. |
48 | 59 |
|
49 | 60 | ;; ## Loading the Data |
50 | 61 | ;; |
|