Commit 25356a2

teunbrand and claude authored
Density layer (#110)
* build up scaffold
* bandwidth calculation
* add range utility
* working version for densities
* Add snapshot tests for density SQL generation

  Add three tests verifying density SQL generation with normalized whitespace comparison:
  - test_density_sql_no_groups: validates CROSS JOIN without groups
  - test_density_sql_with_two_groups: validates grouped density with join conditions
  - test_density_sql_computed_bandwidth: validates Silverman's rule computation

  Tests verify both SQL structure and executability.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Writing density follows area logic
* optimisations
* edit for clarity
* add test assumption that KDE for each group integrates to 1
* implement different kernels
* support 'weight' aesthetic
* write some docs
* try to build some violin infrastructure
* violin writer
* treat orthogonal `x` as yet another group
* add tests for violin geom

  Adds two tests verifying violin stat transform behavior: one with only x/y aesthetics and another with additional color grouping. Ensures correct stat column generation and grouping logic.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* move detail wrangling upstream
* add docs for violin
* cargo fmt
* Add linewidth aesthetic to density geom

  For consistency with violin and other area-based geoms that can have strokes, density now supports the linewidth aesthetic.
* fix density column names
* fix stupid mistake
* include linetype aesthetics
* Update density query to include (unnormalised) intensity
* register columns properly
* violin can remap offset aesthetic
* polish docs
* dainty little docfix
* support NULL-containing groups in density

  Use IS NOT DISTINCT FROM for NULL-safe group comparisons in density computation. Allows density plots to include groups where grouping columns contain NULL values.

  Changes:
  - build_data_cte: only filter NULLs in value column
  - compute_density: use IS NOT DISTINCT FROM for bandwidth joins
  - compute_density: use IS NOT DISTINCT FROM for grid matching

  Benchmarked at 123ms vs 124ms (no performance regression). Alternative ID mapping approach was 42x slower (5.3s).

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* use QUANTILE_CONT for density bandwidth computation

  Replace NTILE-based quartile computation with QUANTILE_CONT for IQR calculation in Silverman's rule of thumb.

  Changes:
  - Replace 3-CTE approach (quartiles, metrics, bandwidth) with 1 CTE
  - Use QUANTILE_CONT(x, 0.75) - QUANTILE_CONT(x, 0.25) for IQR
  - Extract silverman_rule() function for clarity
  - Remove unused partition variable
  - Enhance test to validate exact SQL structure (grouped + ungrouped)

  Benefits:
  - Single aggregate pass over data (more efficient)
  - Much more readable - formula directly visible
  - Standard SQL aggregate functions vs complex window functions

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* add back violin detail encoding logic
* fix(violin): preserve upstream transforms instead of overwriting

  ViolinRenderer was overwriting the entire transform array, which deleted the source filter added upstream. This broke multi-layer plots where violin data needs to be filtered from the unified dataset. Now follows BoxplotRenderer's pattern: preserve existing transforms and extend with violin-specific transforms.

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* violin layer shows circle symbol in legend
* cargo fmt

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 9c529d1 commit 25356a2

8 files changed

Lines changed: 1729 additions & 13 deletions

doc/syntax/index.qmd

Lines changed: 2 additions & 0 deletions
@@ -22,6 +22,8 @@ There are many different layers to choose from when visualising your data. Some
 - [`ribbon`](layer/ribbon.qmd) is used to display series extrema.
 - [`polygon`](layer/polygon.qmd) is used to display arbitrary shapes as polygons.
 - [`bar`](layer/bar.qmd) creates a bar chart, optionally calculating y from the number of records in each bar
+- [`density`](layer/density.qmd) creates univariate kernel density estimates, showing the distribution of a variable
+- [`violin`](layer/violin.qmd) displays a rotated kernel density estimate
 - [`histogram`](layer/histogram.qmd) bins the data along the x axis and produces a bar for each bin showing the number of records in it
 - [`boxplot`](layer/boxplot.qmd) displays continuous variables as 5-number summaries

doc/syntax/layer/density.qmd

Lines changed: 121 additions & 0 deletions
@@ -0,0 +1,121 @@
---
title: "Density"
---

> Layers are declared with the [`DRAW` clause](../clause/draw.qmd). Read the documentation for this clause for a thorough description of how to use it.

Visualise the distribution of a single continuous variable by computing a kernel density estimate. Its interpretation is similar to that of a histogram, but it smooths the observations rather than binning them.

## Aesthetics
The following aesthetics are recognised by the density layer.

### Required
* `x`: Position on the x-axis.

### Optional
* `stroke`: The colour of the contour lines.
* `fill`: The colour of the inner area.
* `colour`: Shorthand for setting `stroke` and `fill` simultaneously.
* `opacity`: The opacity of the colours.
* `linewidth`: The width of the contour lines.
* `linetype`: The dash pattern of the contour lines.

## Settings
* `stacking`: Determines how multiple groups are displayed. One of the following:
  * `'off'`: The groups' `y`-values are displayed as-is (default).
  * `'on'`: The `y`-values are stacked per `x` position, accumulating over groups.
  * `'fill'`: Like `'on'`, but displayed as a fraction of the total per `x` position.
* `bandwidth`: A numerical value setting the smoothing bandwidth to use. If absent (default), the bandwidth is computed using Silverman's rule of thumb.
* `adjust`: A numerical multiplier for the bandwidth, with 1 as the default.
* `kernel`: Determines the smoothing kernel shape. Can be one of the following:
  * `'gaussian'` (default)
  * `'epanechnikov'`
  * `'triangular'`
  * `'rectangular'` or `'uniform'`
  * `'biweight'` or `'quartic'`
  * `'cosine'`

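As a rough illustration of the kernel shapes listed above, the Python sketch below writes each one in its common textbook form on the standardised distance $u = (x - x_i)/h$. These formulas are an assumption made for illustration: ggsql's own definitions, and in particular how it scales each kernel relative to the bandwidth, may differ. The `kernel_area` helper is hypothetical and only checks that each shape integrates to 1.

```python
import math

# Textbook kernel definitions on the standardised distance u.
# Assumption: these are the common forms; ggsql may scale its kernels differently.
KERNELS = {
    "gaussian": lambda u: math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi),
    "epanechnikov": lambda u: 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0,
    "triangular": lambda u: 1.0 - abs(u) if abs(u) <= 1.0 else 0.0,
    "rectangular": lambda u: 0.5 if abs(u) <= 1.0 else 0.0,
    "biweight": lambda u: (15.0 / 16.0) * (1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0,
    "cosine": lambda u: (math.pi / 4.0) * math.cos(math.pi * u / 2.0) if abs(u) <= 1.0 else 0.0,
}
# Aliases named in the settings list.
KERNELS["uniform"] = KERNELS["rectangular"]
KERNELS["quartic"] = KERNELS["biweight"]


def kernel_area(kernel, lo=-8.0, hi=8.0, steps=16000):
    """Approximate the integral of a kernel with the midpoint rule."""
    width = (hi - lo) / steps
    return sum(kernel(lo + (i + 0.5) * width) for i in range(steps)) * width
```

Because every kernel in this sketch has unit area, a density curve built from it also integrates to 1 after normalisation.
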
## Data transformation
The density layer computes a one-dimensional grid spanning the range of the data. For each grid location $x$, the distances to the observations ($x - x_i$) are computed and serve as input for a kernel function. The contributions of all observations are then averaged at each grid location.

$$
\frac{1}{(\sum_{i=1}^{n}w_i)h}\sum_{i=1}^{n}w_iK \left(\frac{x - x_i}{h}\right)
$$

Where:

* $K$ is the kernel function
* $h$ is the bandwidth
* $w_i$ is the weight of observation $i$

By default $w_i = 1$, so the estimate simplifies to:

$$
\frac{1}{nh}\sum_{i=1}^{n}K \left(\frac{x - x_i}{h}\right)
$$

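The weighted estimate above can be sketched in a few lines of Python. This is a minimal illustration rather than the SQL that ggsql generates: the function names, the data and the padded grid are all hypothetical, and a Gaussian kernel is assumed.

```python
import math


def gaussian(u):
    """Standard normal kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(grid, observations, bandwidth, weights=None, kernel=gaussian):
    """Evaluate the weighted kernel density estimate at every grid location."""
    if weights is None:
        weights = [1.0] * len(observations)  # default: w_i = 1
    norm = sum(weights) * bandwidth          # (sum of w_i) * h
    return [
        sum(w * kernel((x - xi) / bandwidth) for xi, w in zip(observations, weights)) / norm
        for x in grid
    ]


# Hypothetical data, not taken from the penguins dataset.
obs = [1.0, 2.0, 2.5, 4.0]
h = 0.5
step = 0.01
# Assumption: the grid is padded beyond the data range so the tails are covered.
grid = [-3.0 + i * step for i in range(1101)]
density = kde(grid, obs, h)
area = sum(density) * step  # Riemann sum; should be close to 1
```

With unit weights the function reduces to the simplified formula, and giving one observation a weight of 2 has the same effect as including it twice.
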
### Properties

* `weight`: If mapped, it sets the relative contribution $w_i$ of an observation to the density estimate.

### Calculated statistics

* `density`: The estimated probability density per point on the grid. The total area under a single density curve is 1.
* `intensity`: Also termed 'probability intensity estimation', it is the precursor of the `density` variable. Specifically, it is the same as the density without normalisation, i.e. it omits the $\frac{1}{nh}$ part of the computation. You can use `REMAPPING intensity AS y` if you want to reflect differences in group sizes.

### Default remappings

* `density AS y`: By default the density layer will display the computed density along the y-axis.

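Taking the description of `intensity` above at face value (the kernel sum without the $\frac{1}{nh}$ factor), the following Python sketch shows why it reflects group size while `density` does not. The function name and data are hypothetical, a Gaussian kernel with unit weights is assumed, and the bandwidth is held fixed for both groups, which a computed bandwidth would not do.

```python
import math


def intensity_and_density(x, observations, bandwidth):
    """Return (intensity, density) at location x for unit weights and a Gaussian kernel."""
    intensity = sum(
        math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) / math.sqrt(2.0 * math.pi)
        for xi in observations
    )
    # Normalising by n * h turns the raw kernel sum into a density.
    density = intensity / (len(observations) * bandwidth)
    return intensity, density


# Two hypothetical groups with the same shape but different sizes.
small = [0.0, 1.0]
large = [0.0, 1.0] * 5  # same values, five times as many records

i_small, d_small = intensity_and_density(0.5, small, bandwidth=1.0)
i_large, d_large = intensity_and_density(0.5, large, bandwidth=1.0)
```

In this sketch both groups have the same density at `x = 0.5`, while the intensity of the larger group is five times that of the smaller one.
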
## Examples

A typical KDE computation with different groups:

```{ggsql}
VISUALISE bill_dep AS x, species AS colour FROM ggsql:penguins
DRAW density SETTING opacity => 0.8
```

Changing the relative bandwidth through the `adjust` setting.

```{ggsql}
VISUALISE bill_dep AS x, species AS colour FROM ggsql:penguins
DRAW density SETTING opacity => 0.8, adjust => 0.1
```

Stacking the different groups instead of overlaying them.

```{ggsql}
VISUALISE bill_dep AS x, species AS colour FROM ggsql:penguins
DRAW density SETTING stacking => 'on'
```

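The `'on'` and `'fill'` stacking modes described under Settings can be sketched as follows. This Python snippet only illustrates the arithmetic: the `stack` function, the stacking order and the example values are assumptions, not ggsql's implementation.

```python
def stack(curves, mode="on"):
    """Stack per-group y-values that share one x grid.

    curves: dict mapping group name to a list of y-values
            (insertion order is used as the stacking order).
    Returns a dict mapping group name to a list of (lower, upper) pairs.
    """
    names = list(curves)
    n_points = len(curves[names[0]])
    totals = [sum(curves[name][i] for name in names) for i in range(n_points)]
    result = {name: [] for name in names}
    for i in range(n_points):
        running = 0.0
        # 'fill' rescales each x position so the stack tops out at 1.
        scale = 1.0 / totals[i] if mode == "fill" and totals[i] > 0 else 1.0
        for name in names:
            lower = running
            running += curves[name][i] * scale
            result[name].append((lower, running))
    return result


# Hypothetical density values for two groups on a three-point grid.
curves = {"a": [0.1, 0.3, 0.2], "b": [0.3, 0.1, 0.2]}
stacked = stack(curves, mode="on")
filled = stack(curves, mode="fill")
```

In `'on'` mode the top of the last group equals the sum over groups at each `x` position; in `'fill'` mode it is always 1.
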
Using weighted estimates by mapping a column to the optional `weight` aesthetic. Note that the difference in output is subtle.

```{ggsql}
VISUALISE bill_dep AS x, species AS colour FROM ggsql:penguins
DRAW density
MAPPING body_mass AS weight
SETTING opacity => 0.8
```

If you want to compare a histogram and a density layer, you can use the `intensity` computed variable to match the histogram scale.

```{ggsql}
VISUALISE bill_len AS x FROM ggsql:penguins
DRAW histogram SETTING opacity => 0.5
DRAW density
REMAPPING intensity AS y
SETTING opacity => 0.5
```

Using the intensity rather than the density also portrays differences in group sizes better.
Note the relative height of the groups.

```{ggsql}
VISUALISE bill_dep AS x, species AS colour FROM ggsql:penguins
DRAW density
REMAPPING intensity AS y
SETTING opacity => 0.8
```


doc/syntax/layer/violin.qmd

Lines changed: 86 additions & 0 deletions
@@ -0,0 +1,86 @@
---
title: "Violin"
---

> Layers are declared with the [`DRAW` clause](../clause/draw.qmd). Read the documentation for this clause for a thorough description of how to use it.

Violin plots display the distribution of a single continuous variable for multiple groups.
The violins are mirrored kernel density estimates, similar to the [density](density.qmd) layer, but organised as distinct groups.

## Aesthetics
The following aesthetics are recognised by the violin layer.

### Required
* `x`: Position on the x-axis (categorical).
* `y`: Value on the y-axis for which to compute density.

### Optional
* `stroke`: The colour of the contour lines.
* `fill`: The colour of the inner area.
* `colour`: Shorthand for setting `stroke` and `fill` simultaneously.
* `opacity`: The opacity of the colours.
* `linewidth`: The width of the contour lines.
* `linetype`: The dash pattern of the contour lines.

## Settings
* `bandwidth`: A numerical value setting the smoothing bandwidth to use. If absent (default), the bandwidth is computed using Silverman's rule of thumb.
* `adjust`: A numerical multiplier for the bandwidth, with 1 as the default.
* `kernel`: Determines the smoothing kernel shape. Can be one of the following:
  * `'gaussian'` (default)
  * `'epanechnikov'`
  * `'triangular'`
  * `'rectangular'` or `'uniform'`
  * `'biweight'` or `'quartic'`
  * `'cosine'`

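For readers unfamiliar with Silverman's rule of thumb, the Python sketch below computes it in its commonly cited form, $h = 0.9 \min(\sigma, \mathrm{IQR}/1.34)\, n^{-1/5}$, and then applies the `adjust` multiplier. The helper names are hypothetical, and several details are assumptions rather than a description of ggsql: the use of the sample standard deviation, the linear interpolation of quantiles, and the fallback to the standard deviation when the IQR is zero.

```python
import math
import statistics


def quantile_cont(sorted_values, q):
    """Quantile with linear interpolation between neighbouring order statistics."""
    position = q * (len(sorted_values) - 1)
    lower = math.floor(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def silverman_bandwidth(values, adjust=1.0):
    """Rule-of-thumb bandwidth: 0.9 * min(sd, IQR / 1.34) * n^(-1/5), times `adjust`."""
    ordered = sorted(values)
    n = len(ordered)
    sd = statistics.stdev(ordered)  # assumption: sample standard deviation
    iqr = quantile_cont(ordered, 0.75) - quantile_cont(ordered, 0.25)
    # Assumption: fall back to the standard deviation when the IQR is zero.
    spread = min(sd, iqr / 1.34) if iqr > 0 else sd
    return adjust * 0.9 * spread * n ** (-1 / 5)


values = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical data
h = silverman_bandwidth(values)
h_narrow = silverman_bandwidth(values, adjust=0.1)
```

A smaller `adjust` shrinks the bandwidth proportionally, which produces a less smooth estimate.
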
## Data transformation
A violin layer uses the same computation as a density layer. See the [density data transformation](density.qmd#data-transformation) section for details.
The main difference between a violin layer and a density layer is how the result is displayed.

### Properties

* `weight`: If mapped, it sets the relative contribution of an observation to the density estimate.

### Calculated statistics

* `density`: The estimated probability density per point on the grid. The total area under a single density curve is 1.
* `intensity`: Also termed 'probability intensity estimation', it is the precursor of the `density` variable. Specifically, it is the same as the density without normalisation. You can use `REMAPPING intensity AS offset` if you want to reflect differences in group sizes.

### Default remappings

* `density AS offset`: By default the offsets around a centerline reflect the computed density.

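To make the idea of offsets around a centerline concrete, here is a small Python sketch that mirrors a set of offsets around a category position to form a closed outline. It is purely illustrative: the `violin_outline` function and its `half_width` scaling are hypothetical, and how ggsql actually scales and renders the offsets is not specified here.

```python
def violin_outline(centre, grid, offsets, half_width=0.45):
    """Build a closed outline mirrored around `centre`.

    grid: positions along the value (y) axis.
    offsets: non-negative offset per grid position (for example the density).
    half_width: hypothetical scaling so the widest point spans `half_width` on each side.
    """
    peak = max(offsets)
    scale = half_width / peak if peak > 0 else 0.0
    right = [(centre + o * scale, y) for y, o in zip(grid, offsets)]
    left = [(centre - o * scale, y) for y, o in zip(grid, offsets)]
    # Walk up the right side, then back down the left side to close the shape.
    return right + left[::-1]


# Hypothetical density values on a four-point grid, drawn at category position 1.
outline = violin_outline(1.0, [10.0, 11.0, 12.0, 13.0], [0.1, 0.4, 0.2, 0.0])
```

Remapping `intensity` to the offset instead of `density` would change only the values passed as `offsets`; the mirroring stays the same.
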
## Examples

A typical violin plot.

```{ggsql}
VISUALISE species AS x, bill_dep AS y FROM ggsql:penguins
DRAW violin
```

The `adjust` setting controls the smoothing.

```{ggsql}
VISUALISE species AS x, bill_dep AS y FROM ggsql:penguins
DRAW violin SETTING adjust => 0.1
```

To more clearly indicate differences in group sizes, you can use the `intensity` computed variable.
Note that we have fewer (n=68) Chinstrap penguins than Adelie (n=152) or Gentoo (n=124) penguins.

```{ggsql}
VISUALISE species AS x, bill_dep AS y FROM ggsql:penguins
DRAW violin REMAPPING intensity AS offset
```

You can combine groups to expand the categories.

<!-- When dodging is implemented we should use that example instead -->

```{ggsql}
SELECT *, species || ' ' || island AS groups FROM ggsql:penguins
VISUALISE groups AS x, bill_dep AS y, island AS fill
DRAW violin
```

