fix: Fix incorrect primitive type detection (#122)

Problem
=======
A `typeLength` (and potentially a `precision`) with the value `null` causes the primitive type to be detected incorrectly.

Solution
========
Handle null values so that when the `typeLength` or `precision` field is `null`, the primitive type is detected as `INT64`.

Steps to Verify:
The bug reproduces when the parquet file contains a Dictionary_Page with an INT64 field whose `typeLength` is null upon read. Unfortunately, I don't have such a test file for now. My debugging was based on a piece of privately shared data from our customer.

When the bug reproduces, the primitive type parsed from the schema (Fixed_Length_Byte_Array) does not match the primitive type discovered from the column data (Int64). Because of a discrepancy in how the library decodes pages, when the data is in a Dictionary_Page, the decoding logic hits the check for `typeLength` and fails.

For Data_Page and Data_Page_V2, decoding ignores the schema and uses the primitive type inferred from the column data. For Dictionary_Page, however, decoding uses the primitive type specified in the schema.

decodeDataPageV2
https://github.com/LibertyDSNP/parquetjs/blob/91fc71f262c699fdb5be50df2e0b18da8acf8e19/lib/reader.ts#L1104

decodeDictionaryPage
https://github.com/LibertyDSNP/parquetjs/blob/91fc71f262c699fdb5be50df2e0b18da8acf8e19/lib/reader.ts#L947

Notice that one uses "opts.type" while the other uses "opts.column.primitiveType".
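The null handling described in the Solution can be illustrated with a minimal, self-contained sketch. It is not the library's code: `DecimalField`, `isSetOld`, `isSetNew`, and `decimalPrimitive` are hypothetical names, and the final `INT64` fallback is assumed from the description in this message rather than taken from `lib/types.ts`.

```typescript
// Hypothetical, simplified stand-in for the DECIMAL branch of the type lookup.
type DecimalField = { typeLength?: number | null; precision?: number | null };
type Check = (v: number | null | undefined) => boolean;

// Pre-fix check: only `undefined` counts as missing, so `null` slips through.
const isSetOld: Check = (v) => v !== undefined;

// Post-fix check: both `undefined` and `null` count as missing.
const isSetNew: Check = (v) => v !== undefined && v !== null;

function decimalPrimitive(field: DecimalField | undefined, isSet: Check): string {
  // A usable typeLength selects the fixed-length representation.
  if (isSet(field?.typeLength)) return 'FIXED_LEN_BYTE_ARRAY';
  // A precision above 18 selects the variable-length representation.
  const precision = field?.precision;
  if (isSet(precision) && Number(precision) > 18) return 'BYTE_ARRAY';
  // Assumed fallback, per the commit message: INT64.
  return 'INT64';
}

// With the old check, a null typeLength is treated as present.
console.log(decimalPrimitive({ typeLength: null }, isSetOld));
// With the new check, the same field falls through to the fallback.
console.log(decimalPrimitive({ typeLength: null }, isSetNew));
```

The two checks are passed in as a parameter only so that the before and after behavior can be compared side by side; the actual change edits the conditions in place.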
LibertyDSNP · Mar 13, 2024 · 3de7eea
1 parent 91fc71f
Showing 1 changed file with 2 additions and 2 deletions.
diff --git a/lib/types.ts b/lib/types.ts
@@ -20,14 +20,14 @@ interface INTERVAL {
 
 export function getParquetTypeDataObject(type: ParquetType, field?: ParquetField | Options | FieldDefinition): ParquetTypeDataObject {
   if (type === 'DECIMAL') {
-    if (field?.typeLength !== undefined) {
+    if (field?.typeLength !== undefined && field?.typeLength !== null) {
       return {
         primitiveType: 'FIXED_LEN_BYTE_ARRAY',
         originalType: 'DECIMAL',
         typeLength: field.typeLength,
         toPrimitive: toPrimitive_FIXED_LEN_BYTE_ARRAY_DECIMAL
       };
-    } else if (field?.precision !== undefined && field.precision > 18) {
+    } else if (field?.precision !== undefined && field?.precision !== null && field.precision > 18) {
       return {
         primitiveType: 'BYTE_ARRAY',
         originalType: 'DECIMAL',