fix: Fix incorrect primitive type detection (#122)
Problem
=======
A `typeLength` (and potentially a `precision`) whose value is null causes
an incorrect primitive type detection result.

Solution
========
We should handle null values so that when the `typeLength` or
`precision` field is null, the primitive type is detected
as "INT64".
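
To illustrate why checking only for `undefined` is not enough, here is a minimal sketch of the DECIMAL branch selection before and after the change. The names `DecimalFieldSketch`, `pickBeforeFix`, and `pickAfterFix` are hypothetical and exist only for this illustration; the sketch also assumes, following the description above, that the remaining fall-through branch yields INT64.

```typescript
// Hypothetical, simplified model of the DECIMAL branch; not the library's code.
interface DecimalFieldSketch {
  typeLength?: number | null;
  precision?: number | null;
}

type DecimalPrimitiveSketch = 'FIXED_LEN_BYTE_ARRAY' | 'BYTE_ARRAY' | 'INT64';

// Before the fix: only `undefined` is ruled out, so a null typeLength
// still selects the fixed-length branch.
function pickBeforeFix(field?: DecimalFieldSketch): DecimalPrimitiveSketch {
  const typeLength = field?.typeLength;
  const precision = field?.precision;
  if (typeLength !== undefined) return 'FIXED_LEN_BYTE_ARRAY';
  // `null > 18` is false at runtime; the typeof guard only satisfies the type checker.
  if (typeof precision === 'number' && precision > 18) return 'BYTE_ARRAY';
  return 'INT64'; // assumed fall-through, per the description above
}

// After the fix: both `undefined` and `null` are ruled out.
function pickAfterFix(field?: DecimalFieldSketch): DecimalPrimitiveSketch {
  const typeLength = field?.typeLength;
  const precision = field?.precision;
  if (typeLength !== undefined && typeLength !== null) return 'FIXED_LEN_BYTE_ARRAY';
  if (precision !== undefined && precision !== null && precision > 18) return 'BYTE_ARRAY';
  return 'INT64'; // assumed fall-through, per the description above
}
```

With `{ typeLength: null }`, `pickBeforeFix` selects the fixed-length branch while `pickAfterFix` falls through to INT64. A field with a real `typeLength` or a `precision` above 18 behaves the same in both versions.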

Steps to Verify:
The bug reproduces when the parquet file contains a Dictionary_Page
with an INT64 field whose typeLength is null upon read. Unfortunately, I
don't have such a test file for now. My debugging was based on a piece
of data privately shared by our customer.

When the bug reproduces, the primitive type parsed from the schema
(Fixed_Length_Byte_Array) doesn't match the primitive type discovered from
the column data (Int64). Due to a discrepancy in how the library decodes
data pages, when the data is in a Dictionary_Page, the decoding logic
hits the check for `typeLength` and fails. For Data_Page and
Data_Page_V2, decoding ignores the schema and prefers the primitive
type inferred from the column data. For Dictionary_Page, however,
decoding uses the primitive type specified in the schema.
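
The mismatch can be modelled with a small sketch. Everything here is a hypothetical stand-in rather than the library's code: `DecodeOptsSketch`, `decodeValuesSketch`, `decodeDataPageSketch`, and `decodeDictionaryPageSketch` are illustrative names, and the `typeLength` check is only an assumed approximation of the one the real decoder performs. The `type` and `column.primitiveType` fields mirror the two options referenced in this commit message.

```typescript
// Hypothetical, simplified model of the two decoding paths; not the library's code.
type PagePrimitiveSketch = 'INT64' | 'FIXED_LEN_BYTE_ARRAY';

interface DecodeOptsSketch {
  type: PagePrimitiveSketch; // primitive type inferred from the column data
  column: {
    primitiveType: PagePrimitiveSketch; // primitive type parsed from the schema
    typeLength?: number | null;
  };
}

// Stand-in for the value decoder: fixed-length values need a usable typeLength.
function decodeValuesSketch(primitive: PagePrimitiveSketch, opts: DecodeOptsSketch): string {
  if (primitive === 'FIXED_LEN_BYTE_ARRAY' && !opts.column.typeLength) {
    throw new Error('missing typeLength for FIXED_LEN_BYTE_ARRAY');
  }
  return `decoded as ${primitive}`;
}

// Data pages: follow the type inferred from the column data.
function decodeDataPageSketch(opts: DecodeOptsSketch): string {
  return decodeValuesSketch(opts.type, opts);
}

// Dictionary pages: follow the type parsed from the schema.
function decodeDictionaryPageSketch(opts: DecodeOptsSketch): string {
  return decodeValuesSketch(opts.column.primitiveType, opts);
}
```

In this model, a column whose data is INT64 but whose schema entry was misdetected as FIXED_LEN_BYTE_ARRAY with a null `typeLength` decodes through the data page path and throws on the dictionary page path.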

decodeDataPageV2

https://github.com/LibertyDSNP/parquetjs/blob/91fc71f262c699fdb5be50df2e0b18da8acf8e19/lib/reader.ts#L1104

decodeDictionaryPage

https://github.com/LibertyDSNP/parquetjs/blob/91fc71f262c699fdb5be50df2e0b18da8acf8e19/lib/reader.ts#L947

Notice that one uses "opts.type" while the other uses
"opts.column.primitiveType".
JasonYeMSFT authored Mar 13, 2024
1 parent 91fc71f commit 3de7eea
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions lib/types.ts
@@ -20,14 +20,14 @@ interface INTERVAL {
 
 export function getParquetTypeDataObject(type: ParquetType, field?: ParquetField | Options | FieldDefinition): ParquetTypeDataObject {
   if (type === 'DECIMAL') {
-    if (field?.typeLength !== undefined) {
+    if (field?.typeLength !== undefined && field?.typeLength !== null) {
       return {
         primitiveType: 'FIXED_LEN_BYTE_ARRAY',
         originalType: 'DECIMAL',
         typeLength: field.typeLength,
         toPrimitive: toPrimitive_FIXED_LEN_BYTE_ARRAY_DECIMAL
       };
-    } else if (field?.precision !== undefined && field.precision > 18) {
+    } else if (field?.precision !== undefined && field?.precision !== null && field.precision > 18) {
       return {
         primitiveType: 'BYTE_ARRAY',
         originalType: 'DECIMAL',