
Commit 3bbed9d

teunbrand and claude authored
Smooth layer (#223)
* initial kernel smooth
* add OLS method
* add TLS method
* test: add grouped and ungrouped tests for OLS and TLS regression methods

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* address review comments
* add grid trimming for smooth and violin layers to prevent extrapolation

  Smooth and violin layers now trim their rendering to the actual data range instead of extrapolating beyond it. Grid expansion uses 3×bandwidth (kernel-aware) instead of a fixed 10% extension.

  Key optimizations:
  - Consolidated min/max computation into bandwidth CTE (eliminates redundant query)
  - Unified grid construction logic in build_grid_cte() with per-group trimming
  - Removed compute_range_sql() and unused execute parameter throughout call chain
  - Extracted shared SQL generation logic to reduce duplication

  All density/smooth/violin tests pass (269 tests verified).

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* remove vestigial density filtering from violin writer

  The violin renderer was filtering out low-density points (< 0.001) to trim thin tails, but this is now handled properly upstream with grid trimming in the stat transform. The stat_violin function trims the grid to data range + 3×bandwidth, making this downstream filter redundant.

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* add docs
* resolve doc merge issues
* cargo fmt
* Change boolean `trim` to numeric `tails`
* add test
* cargo fmt
* amend based on review

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
1 parent a691d0b commit 3bbed9d

7 files changed

Lines changed: 948 additions & 228 deletions


doc/syntax/index.qmd

Lines changed: 1 addition & 0 deletions
@@ -33,6 +33,7 @@ There are many different layers to choose from when visualising your data. Some
3333
- [`histogram`](layer/type/histogram.qmd) bins the data along the x axis and produces a bar for each bin showing the number of records in it.
3434
- [`boxplot`](layer/type/boxplot.qmd) displays continuous variables as 5-number summaries.
3535
- [`errorbar`](layer/type/errorbar.qmd) a line segment with hinges at the endpoints.
36+
- [`smooth`](layer/type/smooth.qmd) a trendline that follows the data shape.
3637

3738
### Position adjustments
3839
- [`stack`](layer/position/stack.qmd) places objects with a shared baseline on top of each other.

doc/syntax/layer/type/smooth.qmd

Lines changed: 153 additions & 0 deletions
@@ -0,0 +1,153 @@
1+
---
2+
title: "Smooth"
3+
---
4+
5+
> Layers are declared with the [`DRAW` clause](../clause/draw.qmd). Read the documentation for this clause for a thorough description of how to use it.
6+
7+
Smooth layers are used to display a trendline among a series of observations.
8+
9+
## Aesthetics
10+
11+
### Required
12+
* Primary axis (e.g. `x`): Position along the primary axis.
13+
* Secondary axis (e.g. `y`): Position along the secondary axis.
14+
15+
### Optional
16+
* `colour`/`stroke`: The colour of the line
17+
* `opacity`: The opacity of the line
18+
* `linewidth`: The width of the line
19+
* `linetype`: The type of line, i.e. the dashing pattern
20+
21+
## Settings
22+
23+
* `method`: Choice of the method for generating the trendline. One of the following:
24+
* `'nw'` or `'nadaraya-watson'` estimates the trendline using the Nadaraya-Watson kernel regression method (default).
25+
* `'ols'` estimates a straight trendline using the ordinary least squares method.
26+
* `'tls'` estimates a straight trendline using the total least squares method.
27+
28+
The settings below only apply when `method => 'nw'` and are ignored when using other methods.
29+
* `bandwidth`: A numerical value setting the smoothing bandwidth to use. If absent (default), the bandwidth will be computed using Silverman's rule of thumb.
30+
* `adjust`: A numerical multiplier applied to the `bandwidth` setting, with 1 as default.
31+
* `kernel`: Determines the smoothing kernel shape. Can be one of the following:
32+
* `'gaussian'` (default)
33+
* `'epanechnikov'`
34+
* `'triangular'`
35+
* `'rectangular'` or `'uniform'`
36+
* `'biweight'` or `'quartic'`
37+
* `'cosine'`
38+
39+
## Data transformation
40+
41+
### Nadaraya-Watson kernel regression
42+
43+
The default `method => 'nw'` computes a locally weighted average of $y$.
44+
45+
$$
46+
y(x) = \frac{\sum_{i=1}^n W_i(x)\, y_i}{\sum_{i=1}^n W_i(x)}
47+
$$
48+
49+
Where:
50+
51+
* $W_i(x)$ is the kernel intensity $w_iK(\frac{x - x_i}{h})$ of observation $i$, where
52+
* $K$ is the kernel function
53+
* $h$ is the bandwidth
54+
* $w_i$ is the weight of observation $i$
55+
56+
Please note the similarity of $W_i(x)$ to the [kernel density estimation formula](density.qmd#data-transformation).
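
As an illustration only, the locally weighted average above can be sketched in plain Python. This is not how ggsql computes it (the commit generates SQL for the stat transform); the function names `gaussian_kernel` and `nadaraya_watson` are hypothetical, and the Gaussian kernel is assumed because it is the documented default.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K (assumed default)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def nadaraya_watson(x, xs, ys, h, weights=None):
    """Kernel-weighted average of ys evaluated at position x.

    xs, ys  : observed coordinates
    h       : bandwidth (must be positive)
    weights : optional per-observation weights w_i (default: all 1)
    """
    if weights is None:
        weights = [1.0] * len(xs)
    numerator = 0.0
    denominator = 0.0
    for xi, yi, wi in zip(xs, ys, weights):
        intensity = wi * gaussian_kernel((x - xi) / h)  # w_i * K((x - x_i) / h)
        numerator += intensity * yi
        denominator += intensity
    return numerator / denominator
```

Evaluating this function over a grid of `x` values yields the curve; a smaller `h` makes each estimate depend on fewer nearby observations, which is why reducing the bandwidth gives a more granular fit.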
57+
58+
### Ordinary least squares
59+
60+
The `method => 'ols'` setting uses ordinary least squares to compute the intercept $a$ and slope $b$ of a straight line.
61+
The method minimizes the sum of squared vertical (1-dimensional) distances between each point and the line.
62+
Considering only the vertical distances implicitly assumes measurement error in $y$, but not in $x$.
63+
64+
$$
65+
y = a + bx
66+
$$
67+
68+
Wherein:
69+
70+
$$
71+
a = E[Y] - bE[X]
72+
$$
73+
74+
and
75+
76+
$$
77+
b = \frac{\text{cov}(X, Y)}{\text{var}(X)} = \frac{E[XY] - E[X]E[Y]}{E[X^2]-(E[X])^2}
78+
$$
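
The moment formulas above translate directly into a few lines of Python. The sketch below is illustrative rather than the project's implementation; `ols_line` is a hypothetical helper name, and it assumes `xs` contains at least two distinct values so that the variance in the denominator is non-zero.

```python
def ols_line(xs, ys):
    """Return (a, b) for y = a + b*x using the moment form of OLS."""
    n = len(xs)
    e_x = sum(xs) / n
    e_y = sum(ys) / n
    e_xy = sum(x * y for x, y in zip(xs, ys)) / n
    e_xx = sum(x * x for x in xs) / n
    b = (e_xy - e_x * e_y) / (e_xx - e_x * e_x)  # cov(X, Y) / var(X)
    a = e_y - b * e_x
    return a, b
```

For example, `ols_line([0, 1, 2, 3], [1, 3, 5, 7])` recovers the intercept 1 and slope 2 of the exact line through those points.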
79+
80+
### Total least squares
81+
82+
The `method => 'tls'` setting uses total least squares to compute the intercept $a$ and slope $b$ of a straight line.
83+
The method minimizes the sum of squared perpendicular (2-dimensional) distances between each point and the line.
84+
Minimising the perpendicular distances (rather than just the vertical distances) makes sense if there is uncertainty or measurement error in not just $y$, but in $x$ as well.
85+
In such cases, it is a more accurate depiction of the relationship between $x$ and $y$, but it isn't the best predictor of $y$ given $x$.
86+
87+
$$
88+
y = a + bx
89+
$$
90+
91+
Wherein:
92+
93+
$$
94+
a = E[Y] - bE[X]
95+
$$
96+
97+
and
98+
99+
$$
100+
b = \frac{\text{var}(Y) - \text{var}(X) + \sqrt{(\text{var}(Y) - \text{var}(X))^2 + 4\text{cov}(X, Y)^2}}{2\text{cov}(X, Y)}
101+
$$
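
A minimal Python sketch of the slope formula above follows. It is an illustration under stated assumptions, not the project's code: `tls_line` is a hypothetical name, population (divide-by-n) moments are used, and the covariance is assumed to be non-zero because the formula is undefined otherwise.

```python
import math


def tls_line(xs, ys):
    """Return (a, b) for y = a + b*x using the total least squares slope."""
    n = len(xs)
    e_x = sum(xs) / n
    e_y = sum(ys) / n
    var_x = sum((x - e_x) ** 2 for x in xs) / n
    var_y = sum((y - e_y) ** 2 for y in ys) / n
    cov_xy = sum((x - e_x) * (y - e_y) for x, y in zip(xs, ys)) / n
    diff = var_y - var_x
    # Assumes cov_xy != 0; the slope is undefined for uncorrelated data.
    b = (diff + math.sqrt(diff * diff + 4.0 * cov_xy * cov_xy)) / (2.0 * cov_xy)
    a = e_y - b * e_x
    return a, b
```

Unlike ordinary least squares, this fit treats both variables symmetrically: swapping `xs` and `ys` produces the reciprocal slope, so both calls describe the same geometric line.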
102+
103+
### Properties
104+
105+
* `weight` is available when using `method => 'nw'`. When mapped, it sets the relative contribution $w_i$ of an observation to the average.
106+
107+
### Calculated statistics
108+
109+
* `intensity` corresponds to $y$ in the formulas described in the [data transformation](#data-transformation) section.
110+
111+
### Default remappings
112+
113+
* `intensity AS y`: By default the smooth layer will display the $y$ in the formulas along the y-axis.
114+
115+
## Examples
116+
117+
The default `method => 'nw'` might be too coarse for time series.
118+
119+
<!-- Ideally, we would just use the date here directly but we currently require numeric data -->
120+
121+
```{ggsql}
122+
SELECT *, EPOCH(Date) AS numdate FROM ggsql:airquality
123+
VISUALISE numdate AS x, Temp AS y
124+
DRAW point
125+
DRAW smooth
126+
```
127+
128+
You can make the fit more granular by reducing the bandwidth, for example using `adjust`.
129+
130+
```{ggsql}
131+
SELECT *, EPOCH(Date) AS numdate FROM ggsql:airquality
132+
VISUALISE numdate AS x, Temp AS y
133+
DRAW point
134+
DRAW smooth SETTING adjust => 0.2
135+
```
136+
137+
There is a subtle difference between the ordinary and total least squares methods.
138+
139+
```{ggsql}
140+
VISUALISE bill_len AS x, bill_dep AS y FROM ggsql:penguins
141+
DRAW point
142+
DRAW smooth MAPPING 'Ordinary' AS colour SETTING method => 'ols'
143+
DRAW smooth MAPPING 'Total' AS colour SETTING method => 'tls'
144+
```
145+
146+
Simpson's Paradox is a case where a trend of combined groups is reversed when groups are considered separately.
147+
148+
```{ggsql}
149+
VISUALISE bill_len AS x, bill_dep AS y, species AS stroke FROM ggsql:penguins
150+
DRAW point SETTING opacity => 0
151+
DRAW smooth SETTING method => 'ols'
152+
DRAW smooth MAPPING 'All' AS stroke SETTING method => 'ols'
153+
```

doc/syntax/layer/type/violin.qmd

Lines changed: 10 additions & 0 deletions
@@ -34,6 +34,9 @@ The following aesthetics are recognised by the violin layer.
3434
* `'biweight'` or `'quartic'`
3535
* `'cosine'`
3636
* `width`: Relative width of the violins. Defaults to `0.9`.
37+
* `tails`: Expansion rule for drawing the tails. One of the following:
38+
* A number setting a multiple of adjusted bandwidths to expand each group's range. Defaults to 3.
39+
* `null` to use the whole data range rather than group ranges.
3740

3841
## Data transformation
3942
A violin layer uses the same computation as a density layer. See the [density data transformation](density.qmd#data-transformation) section for details.
@@ -71,6 +74,13 @@ VISUALISE species AS x, bill_dep AS y FROM ggsql:penguins
7174
DRAW violin SETTING adjust => 0.1
7275
```
7376

77+
The `tails` setting controls how far the violins extend beyond the data range. You can set it to `0` to use each group's exact data range.
78+
79+
```{ggsql}
80+
VISUALISE species AS x, bill_dep AS y FROM ggsql:penguins
81+
DRAW violin SETTING tails => 0
82+
```
83+
7484
To more clearly indicate differences in group sizes, you can use the `intensity` computed variable.
7585
Note that we have fewer (n=68) Chinstrap penguins than Adelie (n=152) or Gentoo (n=124) penguins.
7686
