Skip to content

Latest commit

 

History

History
60 lines (45 loc) · 2.17 KB

File metadata and controls

60 lines (45 loc) · 2.17 KB

Aggregation

An aggregate or aggregation is a function where the values of multiple rows are processed together to form a single summary value. For performing an aggregation, DataFusion provides the :py:func:`~datafusion.dataframe.DataFrame.aggregate`

.. ipython:: python

    from datafusion import SessionContext
    from datafusion import column, lit
    from datafusion import functions as f
    import random

    ctx = SessionContext()
    df = ctx.from_pydict(
        {
            "a": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
            "b": ["one", "one", "two", "three", "two", "two", "one", "three"],
            "c": [random.randint(0, 100) for _ in range(8)],
            "d": [random.random() for _ in range(8)],
        },
        name="foo_bar"
    )

    col_a = column("a")
    col_b = column("b")
    col_c = column("c")
    col_d = column("d")

    df.aggregate([], [f.approx_distinct(col_c), f.approx_median(col_d), f.approx_percentile_cont(col_d, lit(0.5))])

When the group_by list is empty the aggregation is done over the whole :class:`.DataFrame`. For grouping the group_by list must contain at least one column

.. ipython:: python

    df.aggregate([col_a], [f.sum(col_c), f.max(col_d), f.min(col_d)])

More than one column can be used for grouping

.. ipython:: python

    df.aggregate([col_a, col_b], [f.sum(col_c), f.max(col_d), f.min(col_d)])