Skip to content
llorenspujol edited this page Jan 24, 2020 · 11 revisions

WikiAPI ReferenceGroup-By

Group-By Aggregation

# dl.groupby([field, ...])

Create a group-by aggregation query. Accepts one or more fields by which to group values, either as an array of fields (groupby([field1, field2])) or via variable-length arguments (groupby(field1, field2)).

Each field argument can be one of the following:

  • A string indicating the field name to group by. This string will be used both to generate an accessor for extracting group-by values and as the output field name.
  • An accessor function that extracts values to group by. The output field name will be the first valid value of dl.name(field), field.name, or an automatically generated name as fallback.
  • A field definition object with the syntax {name: string, get: function}. If the get accessor function is omitted, it will be generated using the name property.

Returns a group-by aggregator object, which includes methods for specifying summary aggregations, adding or removing values from the aggregation, and returning the aggregated results. See below for all supported methods.

If no summary aggregations are specified (via groupby.summarize or groupby.count), the default summary is to collect an array of values for each group. This is equivalent to explicitly calling summarize({'*':'values'}).

var array = [{a:1, b:1}, {a:1, b:2}, {a:2, b:3}, {a:2, b:4}];
dl.groupby('a').execute(array);
// [
//  {a: 1, values: [{a:1, b:1}, {a:1, b:2}]},
//  {a: 2, values: [{a:2, b:3}, {a:2, b:4}]}
// ]

# groupby.stream([flag])

Indicate if a query should accept streaming removes. If flag is true, streaming removals via the remove function will be supported. Otherwise, only streaming insert calls will be supported and any invocation of remove will result in a thrown error. In order to properly initialize the necessary data structures for streaming removes, this method must be called prior to summarize or count.

If this method is invoked without any arguments, it will simply return the current setting of the stream flag. Otherwise, invoking this method will set the stream flag and also clear the current summary aggregate specification.

var add = [{a:1}, {a:1}, {a:1}, {a:2}];
var rem = [add[0], add[1]];
var cnt = dl.groupby('a')
  .count()
  .insert(add)
  .remove(rem) // Error! Stream is not set to true.
  .result();
var add = [{a:1}, {a:1}, {a:1}, {a:2}];
var rem = [add[0], add[1]];
var cnt = dl.groupby('a')
  .stream(true) // request support for streaming removes
  .count()
  .insert(add)
  .remove(rem)
  .result(); // [{a:1, count:1}, {a:2, count:2}]

# groupby.key([accessor])

Provides an accessor for a unique key value for every input object to an aggregation. If provided, this key value will be used to determine object equality for purposes of streaming removes. If a key value is not provided or is null, strict equality of objects is required for streaming removal. If this method is invoked without any arguments it will return the current key value, otherwise it returns the aggregator instance to support method chaining.

# groupby.summarize(aggregates)

Specify summary aggregates to compute for each group. This method supports two argument formats: a convenient short format and a more complete long format.

The short format uses a single object hash that maps from field names to one or more aggregation functions: {'foo': 'valid', 'bar': ['sum', 'median']}. The aggregation function can be either a single string or an array of strings, each a valid aggregation function name.

The long format uses an array of per-field aggregate specification objects. The previous short format example translates to the long format argument [{name: 'foo', ops: ['valid']}, {name: 'bar', ops: ['sum', 'median']}].

Aggregate specifications support the following properties:

  • name: The name of the field to aggregate (required). This name will be used to generate output field names. It will also be used to generate an accessor for input data if the get property is not specified.
  • get: An optional accessor (function or property string) to extract values from input data.
  • ops: An array of aggregate functions (required). See below for the supported aggregates.
  • as: An optional array of names to use for the output fields. By default, the aggregator will automatically create output field names of the form op_name (e.g., sum_bar, median_bar). The as array provides a set of custom names to use instead. The array should be the same length as the ops array. Standard automatic name generation is used for null entries.

The supported aggregation functions are:

  • 'values': Builds up an array of all input objects in the group.
  • 'count': Count the total number of elements in the group.
  • 'valid': Count values that are not null, undefined or NaN. See dl.count.valid.
  • 'missing': Count the number of null or undefined values. See dl.count.missing.
  • 'distinct': Count the number distinct values. See dl.count.distinct.
  • 'sum': Compute the sum of values in a group. See dl.sum.
  • 'mean': Compute the mean (average) of values in a group. See dl.mean.
  • 'average': Compute the mean (average) of values in a group. Identical to 'mean'.
  • 'variance': Compute the sample variance of values in a group. See dl.variance.
  • 'variancep': Compute the population variance of values in a group.
  • 'stdev': Compute the sample standard deviation of values in a group. See dl.stdev.
  • 'stdevp': Compute the population standard deviation of values in a group.
  • 'median': Compute the median of values in a group. See dl.median.
  • 'q1': Compute the lower quartile boundary of values in a group. See dl.quartile.
  • 'q3': Compute the upper quartile boundary of values in a group. See dl.quartile.
  • 'modeskew': Compute the mode skewness of values in a group. See dl.modeskew.
  • 'min': Compute the minimum value in a group. See dl.extent.
  • 'max': Compute the maximum value in a group. See dl.extent.
  • 'argmin': Find the input object that minimizes the value in a group.
  • 'argmax': Find the input object that maximizes the value in a group.

Many of the aggregation functions above are straightforward, but a few deserve additional discussion.

The 'values' and 'count' functions operate directly on the input objects and return the same value regardless of the provided field name. Similar to SQL's count(*), these can be specified with the special name '*', as in summarize({'*':'count'}).

The 'argmin' and 'argmax' functions are a bit unusual: instead of returning the minimum or maximum value of a field, they return the original input object that contains the minimum or maximum value. This can be useful for retrieving another field associated with the minimum or maximum value (e.g., for each region, in which year did I have the maximum revenue?). If multiple entries share the minimum or maximum value, the first observed input object will be returned.

// summarize with short format arguments
var data = [{foo: 1, bar: 1}, {foo: 1, bar: 2}, {foo: null, bar: 3}];
dl.groupby()
  .summarize({'foo': 'valid', 'bar': ['sum', 'median']})
  .execute(data); // {valid_foo: 2, sum_bar: 6, median_bar: 2}
// summarize with long format arguments
var data = [{foo: 1, bar: 1}, {foo: 1, bar: 2}, {foo: null, bar: 3}];
dl.groupby().summarize([
    {name: 'foo', ops: ['valid']},
    {name: 'bar', ops: ['sum', 'median'], as: ['s', 'm']}
  ])
  .execute(data); // {valid_foo: 2, s: 6, m: 2}

# groupby.count()

Specify a single count summary aggregate. This is a convenience shorthand for summarize({'*':'count'}).

var data = [{a:1}, {a:1}, {a:2}, {a:3}];
dl.groupby('a').count().execute(data);
// [{a:1, count:2}, {a:2, count:1}, {a:3, count:1}]

# groupby.execute(values)

Clear the current aggregation state, run the query on a new set of values, and return the result. This method is a convenience shorthand for clear().insert(values).result().

# groupby.insert(values)

Add values to the aggregation. This method supports streaming updates of additional input objects. Call result to then retrieve the current aggregation results.

# groupby.remove(values)

Remove values from the aggregation. In order to remove results from a query the streaming flag must be set using stream(true) (see groupby.stream). Otherwise, this method will throw an error when invoked.

# groupby.result()

Return the current group-by aggregation result, which is an array of objects (one per group) containing summary aggregates. Call this method after a series of insert and/or remove actions to retrieve the results.

# groupby.clear()

Clear all current results from the query. The group-by fields and summary aggregate definitions will be retained, but all computed groups and aggregate summary values will be cleared.

Clone this wiki locally