Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
fix: Fix incorrect primitive type detection (#122)
Problem
=======
`typeLength`, and potentially `precision`, with value "null" causes an incorrect primitive type detection result.

Solution
========
Handle null values so that when the `typeLength` or `precision` field has the value "null", the primitive type is detected as "INT64".

Steps to Verify:
The bug reproduces when the parquet file contains a Dictionary_Page with an INT64 field whose `typeLength` is null upon read. Unfortunately, I don't have such a test file for now. My debugging was based on a piece of privately shared data from our customer.

When the bug reproduces, the primitive type parsed from the schema (Fixed_Length_Byte_Array) does not match the primitive type discovered from the column data (Int64). Because the library decodes the page types differently, when the data is in a Dictionary_Page the decoding logic hits the check for `typeLength` and fails.

For Data_Page and Data_Page_V2, decoding ignores the schema and privileges the primitive type inferred from the column data. For Dictionary_Page, however, decoding uses the primitive type specified in the schema.

decodeDataPageV2
https://github.com/LibertyDSNP/parquetjs/blob/91fc71f262c699fdb5be50df2e0b18da8acf8e19/lib/reader.ts#L1104

decodeDictionaryPage
https://github.com/LibertyDSNP/parquetjs/blob/91fc71f262c699fdb5be50df2e0b18da8acf8e19/lib/reader.ts#L947

Notice that one uses "opts.type" while the other uses "opts.column.primitiveType".
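To illustrate the kind of null handling described in the Solution, here is a minimal sketch. It is not the code from this commit: the `FieldOptions` shape and both function names are hypothetical, and the rule that a present `typeLength` maps to FIXED_LEN_BYTE_ARRAY is an assumption drawn from the symptom described above. It only covers `typeLength`; `precision` would presumably need the same treatment.

```typescript
// Hypothetical field shape for illustration; not the library's actual type.
interface FieldOptions {
  typeLength?: number | null;
}

type PrimitiveType = "INT64" | "FIXED_LEN_BYTE_ARRAY";

// Buggy variant: only `undefined` counts as "absent", so a null
// typeLength is treated as present and the wrong type is chosen.
function detectStrictUndefined(field: FieldOptions): PrimitiveType {
  return field.typeLength !== undefined ? "FIXED_LEN_BYTE_ARRAY" : "INT64";
}

// Null-safe variant: `!= null` is false for both null and undefined,
// so a null typeLength falls through to INT64.
function detectNullSafe(field: FieldOptions): PrimitiveType {
  return field.typeLength != null ? "FIXED_LEN_BYTE_ARRAY" : "INT64";
}
```

With a field whose `typeLength` is null, the first variant would report FIXED_LEN_BYTE_ARRAY while the second reports INT64, which is the mismatch between schema and column data that the Dictionary_Page path trips over.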