Summary
When an integer column is mapped to fill (or color) in a bar chart, ggsql's stat transform drops the column from the result, causing a validation error:
Validation error: Column 'fill' referenced in aesthetic 'fill' (layer 1 (global data)) does not exist.
Available columns: __ggsql_aes_pos1__, __ggsql_aes_pos2__, __ggsql_aes_pos2end__
Reproducible example
Rust (integration test style)
use ggsql::reader::{DuckDBReader, Reader};
use ggsql::writer::VegaLiteWriter;

// The test name is illustrative; only the body comes from the failing case.
#[test]
fn bar_with_integer_fill() {
    let reader = DuckDBReader::from_connection_string("duckdb://memory").unwrap();
    // Integer column (survived: 0/1) mapped to fill
    let spec = reader.execute(
        "SELECT *
         FROM (VALUES
           ('Male', 0), ('Male', 1), ('Female', 0), ('Female', 1),
           ('Male', 0), ('Male', 0), ('Female', 1), ('Female', 1)
         ) AS t(sex, survived)
         VISUALISE sex AS x, survived AS fill
         DRAW bar",
    );
    // This fails with: Column 'fill' referenced in aesthetic 'fill' ... does not exist
    assert!(spec.is_ok(), "Should handle integer fill: {:?}", spec.err());
}
Python
import ggsql
import polars as pl
reader = ggsql.DuckDBReader("duckdb://memory")
df = pl.DataFrame({
"sex": ["Male", "Male", "Female", "Female", "Male", "Male", "Female", "Female"],
"survived": [0, 1, 0, 1, 0, 0, 1, 1],
})
reader.register("titanic", df)
# Fails with validation error
spec = reader.execute("""
SELECT * FROM titanic
VISUALISE sex AS x, survived AS fill
DRAW bar
""")
Note: adding SCALE DISCRETE fill or SCALE fill RENAMING 0 => 'No', 1 => 'Yes' doesn't help because RENAMING doesn't set a scale_type, so the discreteness check still falls through to the schema-based inference.
Root cause
In src/execute/schema.rs:171-172, discreteness is determined purely by data type:
let is_discrete =
matches!(dtype, DataType::String | DataType::Boolean) || dtype.is_categorical();
Integers are never considered discrete. The downstream effect:
- add_discrete_columns_to_partition_by (src/execute/mod.rs:677) checks if a mapped column is discrete
- Integer survived → not discrete → not added to partition_by
- The bar stat transform (src/plot/layer/geom/bar.rs:87) builds GROUP BY from partition_by + x column
- Since fill isn't in group_by, survived is dropped from the aggregation SQL
- The resulting DataFrame only has pos1, pos2, pos2end
- Writer validation fails because fill references a column that no longer exists
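The chain above can be reproduced outside ggsql with a small sketch. It uses Python's built-in sqlite3 module rather than ggsql's DuckDB reader, and the build_bar_sql helper and the n count column are hypothetical stand-ins for whatever the bar stat transform actually generates. Only the GROUP BY behavior is the point: a column that is not in the grouping list cannot appear in the aggregated result.

```python
import sqlite3

# Same rows as the reproducible example above.
ROWS = [("Male", 0), ("Male", 1), ("Female", 0), ("Female", 1),
        ("Male", 0), ("Male", 0), ("Female", 1), ("Female", 1)]


def build_bar_sql(x, partition_by):
    # Hypothetical stand-in for the bar stat: GROUP BY = partition_by + x column.
    group_cols = list(partition_by) + [x]
    cols = ", ".join(group_cols)
    return f"SELECT {cols}, COUNT(*) AS n FROM t GROUP BY {cols}"


def run(sql):
    # Load the rows into an in-memory table and return (column names, rows).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (sex TEXT, survived INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", ROWS)
    cur = conn.execute(sql)
    names = [d[0] for d in cur.description]
    rows = cur.fetchall()
    conn.close()
    return names, rows


# Integer column judged "not discrete": partition_by stays empty.
names_without, _ = run(build_bar_sql("sex", []))
# Same query with the fill column included in partition_by.
names_with, rows_with = run(build_bar_sql("sex", ["survived"]))
```

In the first call the result has no survived column for the fill aesthetic to reference, which mirrors the validation error. In the second call the column survives and each sex/survived combination gets its own count.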
Note that SCALE fill RENAMING ... doesn't help because RENAMING doesn't set scale.scale_type, so add_discrete_columns_to_partition_by falls through to the schema check (lines 740-741), which still says "integer = not discrete."
Inconsistency with ggplot2
In ggplot2, this works because all mapped aesthetics contribute to grouping, regardless of column type:
library(ggplot2)
df <- data.frame(sex = c("Male", "Female", "Male", "Female"),
survived = c(0L, 1L, 0L, 1L))
# Works fine — survived (integer) is used for grouping in stat_count
ggplot(df, aes(x = sex, fill = survived)) + geom_bar()
ggplot2 treats the integer as continuous for color scale purposes (producing a gradient), but still uses it for grouping in the stat transform. The grouping and the scale type are independent concerns.
Possible approaches
A) Aesthetic-based grouping
Certain aesthetics (fill, color, shape, linetype, stroke) inherently imply grouping. Any column mapped to these should be added to partition_by regardless of data type.
Pros: Targeted fix, only changes behavior for aesthetics where grouping is clearly intended.
Cons: Doesn't cover edge cases like mapping a numeric column to opacity in a bar chart. Requires maintaining a list of "grouping aesthetics."
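As a rough illustration of approach A, here is a minimal sketch in Python rather than the project's Rust. Every name in it (GROUPING_AESTHETICS, POSITIONAL_AESTHETICS, partition_columns, the dtype strings) is hypothetical, and it assumes positional aesthetics are handled separately by the stat transform. It is a model of the proposal, not ggsql's implementation.

```python
# Hypothetical names throughout; this models approach A, not ggsql's real code.
GROUPING_AESTHETICS = {"fill", "color", "shape", "linetype", "stroke"}
POSITIONAL_AESTHETICS = {"x", "y"}  # assumed to be handled separately by the stat
DISCRETE_DTYPES = {"string", "boolean", "categorical"}


def partition_columns(mappings, schema):
    """mappings: aesthetic -> column name; schema: column name -> dtype name."""
    columns = []
    for aesthetic, column in mappings.items():
        if aesthetic in POSITIONAL_AESTHETICS:
            continue
        implies_grouping = aesthetic in GROUPING_AESTHETICS
        is_discrete = schema[column] in DISCRETE_DTYPES
        # Keep the column if the aesthetic implies grouping OR the dtype is discrete.
        if (implies_grouping or is_discrete) and column not in columns:
            columns.append(column)
    return columns


schema = {"sex": "string", "survived": "integer", "age": "integer"}
fill_case = partition_columns({"x": "sex", "fill": "survived"}, schema)
opacity_case = partition_columns({"x": "sex", "opacity": "age"}, schema)
```

Here fill_case keeps survived even though it is an integer, while opacity_case comes back empty, which is the edge case listed under Cons.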
B) All non-positional mapped columns survive stat transforms
Every non-positional, non-stat-consumed aesthetic column gets added to GROUP BY for stat transforms, regardless of data type or aesthetic name.
Pros: Simpler logic, matches ggplot2's behavior most closely (where group is the interaction of all mapped discrete variables, but stat transforms preserve all mappings). No need to maintain a special list.
Cons: Broader change that could affect behavior for intentionally continuous aesthetics, such as opacity mapped to a numeric column in a stat geom. In practice, though, including a continuous column in GROUP BY just means "don't aggregate it away," which is usually correct.
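Approach B can be modeled the same way. This is again a hedged Python sketch with hypothetical names: partition_columns, POSITIONAL_AESTHETICS, and the stat_consumed parameter are assumptions, and the weight aesthetic is used only as an example of something a stat might consume. No data type or aesthetic list is consulted at all.

```python
# Hypothetical names throughout; this models approach B, not ggsql's real code.
POSITIONAL_AESTHETICS = {"x", "y"}


def partition_columns(mappings, stat_consumed=frozenset()):
    """Return every mapped column that is neither positional nor consumed by the stat."""
    columns = []
    for aesthetic, column in mappings.items():
        if aesthetic in POSITIONAL_AESTHETICS or aesthetic in stat_consumed:
            continue
        if column not in columns:
            columns.append(column)
    return columns


fill_case = partition_columns({"x": "sex", "fill": "survived"})
opacity_case = partition_columns({"x": "sex", "opacity": "age"})
weighted_case = partition_columns(
    {"x": "sex", "fill": "survived", "weight": "fare"},
    stat_consumed={"weight"},
)
```

Unlike approach A, opacity_case now keeps age, so a numeric opacity column would also end up in GROUP BY. That is the broader behavior change noted under Cons.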
Additional consideration: RENAMING should imply discrete
Independently of the above, SCALE fill RENAMING ... should probably set or imply a discrete scale type. If you're providing explicit label mappings for specific values, discrete semantics are almost certainly intended.
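One way to picture this, as a sketch only: resolve an effective scale type before the schema fallback runs. The dictionary shape, the renaming key, and the effective_scale_type function are all hypothetical and do not reflect ggsql's actual scale struct.

```python
def effective_scale_type(scale):
    """scale: dict with optional 'scale_type' and 'renaming' entries (hypothetical shape)."""
    explicit = scale.get("scale_type")
    if explicit is not None:
        return explicit  # an explicit scale type always wins
    if scale.get("renaming"):
        return "discrete"  # explicit per-value labels imply discrete semantics
    return None  # nothing specified: fall through to schema-based inference


renamed = effective_scale_type({"renaming": {0: "No", 1: "Yes"}})
explicit = effective_scale_type({"scale_type": "continuous", "renaming": {0: "No"}})
unspecified = effective_scale_type({})
```

With a rule like this, SCALE fill RENAMING 0 => 'No', 1 => 'Yes' would mark the column as discrete before the data-type check is ever reached.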