Update doc with changes to numeric type propagation (#79)

tableau · Feb 5, 2024 · d4b31e9 · d4b31e9
1 parent 612d4cc
commit d4b31e9
Show file tree

Hide file tree

Showing 3 changed files with 38 additions and 21 deletions.
diff --git a/website/docs/releases.md b/website/docs/releases.md
@@ -28,6 +28,11 @@ In case you are wondering why all our releases start with `0.0`, read [this FAQ
 
 ### Upcoming release
 
+* Overhauled the [SQL type propagation rules for the `NUMERIC` data type](/docs/sql/datatype/numeric). This can lead to differences in the precision and number of decimal digits of `NUMERIC` calculations in existing queries and their results. The changes are:
+    * Multiplication and division of `NUMERIC` with integer and numeric types now follows different rules to determine the result's precision and scale.
+    * The `AVG` aggregate function now adds 6 instead of 4 decimal digits.
+    * The statistical aggregate functions (like `STDDEV_SAMP`) now always return `NUMERIC(38,6)` for integer and numeric inputs. If a scale of 6 is too low for your purposes consider casting the input to `double precision`.
+    * Integer literals are now treated as if they were of type numeric with the minimal required precision when in arithmetic operations with `NUMERIC` values.
 * IANA released version 2023d of the Time Zone Database. Hyper’s time zone information is updated accordingly. Noteworthy changes:
   * Ittoqqortoormiit, Greenland changes time zones on 2024-03-31.
   * Vostok, Antarctica changed time zones on 2023-12-18.

diff --git a/website/docs/sql/aggregate.md b/website/docs/sql/aggregate.md
@@ -17,7 +17,7 @@ Function|Argument Type(s)|Return Type|Description|
 ---|---|---|---|
 `any_value(expression)`|any type|same as argument data type|an arbitrary, implementation-defined value from the set of input values. The result is non-deterministic.
 `approx_count_distinct(expression, e)`|any, `double precision`|`bigint`|Computes approximation of `count(distinct expression)`, with expected relative error `e`. Supported values of `e` are in range (0.002, 1]. The `e` argument is optional, if omitted, the value 0.023 is used (2.3% expected relative error to real distinct count).
-`avg(expression)`|any numerical type|`numeric` with a scale of 4 for any integer-type argument, `double precision` for a floating-point argument, otherwise the same as the argument data type|the average (arithmetic mean) of all input values
+`avg(expression)`|any numerical type|`numeric` with a scale of 6 for any integer-type argument and numeric arguments with a scale less than 6, `double precision` for a floating-point argument, otherwise the same as the argument data type|the average (arithmetic mean) of all input values
 `bit_and(expression)`|integral types|same as argument data type|the bitwise AND of all non-null input values, or null if none
 `bit_or(expression)`|integral types|same as argument data type|the bitwise OR of all non-null input values, or null if none
 `bool_and(expression)`|`bool`|`bool`|true if all input values are true, otherwise false
@@ -48,12 +48,16 @@ Function|Argument Type|Return Type|Description
 `corr(Y, X)`|`double precision`|`double precision`|correlation coefficient
 `covar_pop(Y, X)`|`double precision`|`double precision`|population covariance
 `covar_samp(Y, X)`|`double precision`|`double precision`|sample covariance
-`stddev(expression)`|any numerical type|`NUMERIC(38,4)` for any integer-type argument, `double precision` for a floating-point argument, `NUMERIC(38,s)` for any `NUMERIC(p,s)`|alias for `stddev_samp`
-`stddev_pop(expression)`|any numerical type|`NUMERIC(38,4)` for any integer-type argument, `double precision` for a floating-point argument, `NUMERIC(38,s)` for any `NUMERIC(p,s)`|population standard deviation of the input values
-`stddev_samp(expression)`|any numerical type|`NUMERIC(38,4)` for any integer-type argument, `double precision` for a floating-point argument, `NUMERIC(38,s)` for any `NUMERIC(p,s)`|sample standard deviation of the input values
-`variance`(`<expression>`)|any numerical type|`NUMERIC(38,4)` for any integer-type argument, `double precision` for a floating-point argument, `NUMERIC(38,s)` for any `NUMERIC(p,s)`|alias for `var_samp`
-`var_pop`(`<expression>`)|any numerical type|`NUMERIC(38,4)` for any integer-type argument, `double precision` for a floating-point argument, `NUMERIC(38,s)` for any `NUMERIC(p,s)`|population variance of the input values (square of the population standard deviation)
-`var_samp`(`<expression>`)|any numerical type|`NUMERIC(38,4)` for any integer-type argument, `double precision` for a floating-point argument, `NUMERIC(38,s)` for any `NUMERIC(p,s)`|sample variance of the input values (square of the sample standard deviation)
+`stddev(expression)`|any numerical type|`NUMERIC(38,6)` for any integer-type or numeric argument, `double precision` for a floating-point argument|alias for `stddev_samp`
+`stddev_pop(expression)`|any numerical type|`NUMERIC(38,6)` for any integer-type or numeric argument, `double precision` for a floating-point argument|population standard deviation of the input values
+`stddev_samp(expression)`|any numerical type|`NUMERIC(38,6)` for any integer-type or numeric argument, `double precision` for a floating-point argument|sample standard deviation of the input values
+`variance`(`<expression>`)|any numerical type|`NUMERIC(38,6)` for any integer-type or numeric argument, `double precision` for a floating-point argument|alias for `var_samp`
+`var_pop`(`<expression>`)|any numerical type|`NUMERIC(38,6)` for any integer-type or numeric argument, `double precision` for a floating-point argument|population variance of the input values (square of the population standard deviation)
+`var_samp`(`<expression>`)|any numerical type|`NUMERIC(38,6)` for any integer-type or numeric argument, `double precision` for a floating-point argument|sample variance of the input values (square of the sample standard deviation)
+
+:::tip
+Casting the input of an aggregate function can be used to force a different output type. E.g., `VAR_POP(CAST(A AS DOUBLE PRECISION))` can be used to get a `double precision` result independent of the type of the column `A`. 
+:::
 
 ## Ordered set aggregates {#ordered-set}
 

diff --git a/website/docs/sql/datatype/numeric.md b/website/docs/sql/datatype/numeric.md
@@ -45,32 +45,40 @@ can be configured. To declare a column of type `numeric` use the syntax
 Alternatively, `NUMERIC(precision)` selects a scale of 0.
 Specifying `NUMERIC` selects the maximum precision of 38 and a scale of 0.
 
-The type propagation rules for arithmetic operations with numerics never
-decrease the scale and set the precision such that the biggest possible
-result will fit into the result type. This may lead to undesired growth
-of both scale and precision, especially when chaining multiple
-arithmetic operations. Large scale might be undesirable because it takes
-away from the digits in front of the decimal point, potentially leading
-to overflow errors. Large precision might also be undesirable because
-`numeric` values with precision over 18 internally use 128-bit which may
-slow down processing. To avoid this, explicit casts to the desired scale
-and precision can be added throughout a query.
+The type propagation rules for arithmetic operations with numerics often
+lead to larger precision and scale in the result type. This may lead to
+undesired growth of both scale and precision, especially when chaining
+multiple arithmetic operations. Large scale might be undesirable because 
+it takes away from the digits in front of the decimal point, potentially 
+leading to overflow errors. Large precision might also be undesirable
+because `numeric` values with precision over 18 internally use 128-bit
+which may slow down processing. To avoid this, explicit casts to the
+desired scale and precision can be added throughout a query. The resulting
+precision is always capped at the maximum of 38.
 
 Arithmetic operations between a `NUMERIC(p1,s1)` and a `NUMERIC(p2,s2)`
 have the following results:
 
 |Operator|Result Type|
 |---|---|
-| + or - |NUMERIC(precision, scale) with:<br/>scale = max(s1,s2)<br/>precision = min(38, max((p1-s1),(p2-s2)) + 1 + scale)|
-|*|NUMERIC(precision, scale) with:<br/> scale = max(max(s1,s2), min(s1+s2, 38 - (p1-s1) - (p2-s2)))<br/> precision = min(p1+p2, 38)|
-|/|NUMERIC(precision, scale) with:<br/> scale = max(s1,s2)<br/>precision = min(38, ((p1-s1) + s2 + scale)) |
-|%|NUMERIC(precision, scale) with:<br/> scale = max(s1,s2)<br/>precision = min((p1-s1), (p2-s2)) + scale  |
+| + or - |NUMERIC(precision, scale) with:<br/>scale = max(s1,s2)<br/>precision = max((p1-s1),(p2-s2)) + 1 + scale|
+|*|NUMERIC(precision, scale) with:<br/> scale* = s1 + s2<br/> precision = p1+p2|
+|/|NUMERIC(precision, scale) with:<br/> scale* = max(6, s1 + p2 + 1)<br/>precision = p1 - s1 + s2 + max(6, s1 + p2 + 1)|
+|%|NUMERIC(precision, scale) with:<br/> scale = max(s1,s2)<br/>precision = min((p1-s1), (p2-s2)) + scale|
+
+*) An additional rule applies for multiplication and division: If the resulting precision
+from the above rules exceeds 38, the scale is reduced by the exceeding amount.
+During this step, the scale is never reduced below 6.
 
 When used in arithmetic operation together with `NUMERIC`,
 `DOUBLE PRECISION` operands will always give `DOUBLE PRECISION` results,
 `SMALLINT` behaves the same as `NUMERIC(5,0)`, `INTEGER` as
 `NUMERIC(10,0)` and `BIGINT` as `NUMERIC(19,0)`.
 
+When used in arithmetic operation together with number literals, the literal will be treated
+as the smallest fitting `NUMERIC` type. E.g., `100` will be treated as
+`NUMERIC(3,0)`, `10.0` as `NUMERIC(3,1)`. This rule only applies to literals, not to expressions containing literals. E.g., `(1+1)` will be treated as `NUMERIC(10,0)` not `NUMERIC(1,0)`.
+
 :::note
 In the SQL standard, as well as in PostgreSQL and many other database
 systems, the types `decimal` and `numeric` are equivalent and both