Skip to content

Commit

Permalink
Remove to string for dictionary pages (#138)
Browse files Browse the repository at this point in the history
# Problem
As reported in #121 and #131 there are issues with the parsing of
`FIXED_LEN_BYTE_ARRAY` values. This may be caused by the fact that they
are converted to strings as part of the decoding process for
DICTIONARY_PAGE types. This should close both issues.

Solution
========
By removing the `toString` conversion this removes the error, and byte
array values get returned as Buffers, as well as also returning numeric
types as numbers, and not as strings.

To try to check that the change would only impact things as expected, I
also added some regression testing for all of the reference files. For
this I've added a check for the first record returned by each reference
file. The process to get this data was to add the test, see the error
message and then fix the test.

With this regression in place, I then removed the `toString` call, and
saw how the tests failed. The errors were all in places where the schema
was returning either numeric types or things like timestamps or byte
arrays. Previously these would be getting stringified if the file was
using a dictionary page.

**Note, this does cause a breaking change to the way that these
dictionary page files are decoded, as the values being returned will no
longer be strings by default, and this may create issues for some
consumers as the runtime types will change.**

There may be an improvement to add something like snapshot testing for
the reference-test files so that the whole file can be compared without
having to manually write it all out. I know this is natively supported
in jest/vitest but I'm not sure how to do it in mocha.

## Change summary:

- Add testing to validate the first record in each reference file
- Remove the `toString` conversion when decoding dictionary pages

## Steps to Verify:

It may be worth double checking that some of the schemas for some of the
reference files line up with the values that I've added. I've split it
into two commits so that the first commit is adding in the regression,
and the second commit is removing the `toString`.

Additionally I've ran the test added in #121, the test needs to be
updated to either parse the age value from a buffer to the number, or to
compare the ages as Buffers. The `id` is also now returned as a bigint
as the column is an int64.

```
assert.deep,Equal(records, [{
    id: 10n,
    age: Buffer.from([0,0,0,0,0,0,0,0,0,0,0,9,199,101,36,0]),
    full_name: 'Jonathan Cohen'
  },
  {
    id: 11n,
    age: Buffer.from([0,0,0,0,0,0,0,0,0,0,0,0,178,208,94,0]),
  }])
```

The data recorded in #131 is returned as a buffer, but I've not fully
checked the contents of that buffer that is returned.
  • Loading branch information
harryalaw authored Sep 4, 2024
1 parent d369b2a commit 01e25e4
Show file tree
Hide file tree
Showing 4 changed files with 671 additions and 14 deletions.
2 changes: 1 addition & 1 deletion lib/reader.ts
Original file line number Diff line number Diff line change
Expand Up @@ -981,7 +981,7 @@ async function decodeDictionaryPage(cursor: Cursor, header: parquet_thrift.PageH
dictCursor,
header.dictionary_page_header!.num_values,
opts
).map((d: unknown[]) => d.toString());
);
}

async function decodeDataPage(cursor: Cursor, header: parquet_thrift.PageHeader, opts: Options) {
Expand Down
Loading

0 comments on commit 01e25e4

Please sign in to comment.