
DOC: Table.save() store hash for parquet files #446

Merged (5 commits into main, Jun 26, 2024)

Conversation

@hagenw (Member) commented Jun 25, 2024

Document that we store a hash value in the schema metadata of the PARQUET file when saving a table.

[screenshot: documentation change]

@ChristianGeng (Member) left a comment

I find this very relevant. Would maybe a little more detail come in handy?

  • when it is triggered (on publication). You write implicitly that it is used when "using parquet as the storage format". I think this could be a little bit more explicit.
  • what is the purpose of having a hash (to determine table identity)
  • why the md5sums can differ in parquet (mainly compression I believe)

I am approving this anyway.

@hagenw (Member, Author) commented Jun 25, 2024

Your discussion points are valid; maybe you should not directly approve, but let us first discuss them ;)

when it is triggered (on publication). You write implicitly that it is used when "using parquet as the storage format". I think this could be a little bit more explicit.

Publication is not part of audformat, but is handled by audb (see next comment). Inside audformat the calculation of the hash is only triggered when calling audformat.Table.save() or audformat.Database.save(), which then internally calls audformat.Table.save().

what is the purpose of having a hash (to determine table identity)

I added the following statement to the docs:

[screenshot: added docstring statement on the purpose of the hash]

why the md5sums can differ in parquet (mainly compression I believe)

So far I was not able to find documentation on that. There seem to be several factors that might influence the md5 sum:

  • Whether the binary data is always written to disk deterministically, or whether some optimization involving random processes is applied
  • The library used to write the PARQUET file
  • The involved compression
  • Metadata entries of the file that store the used library version and/or pandas version when converted from pandas

At least the last point could be avoided, but I found it nice that storing the hash inside the file also provides the advantage that the hash needs to be calculated only once. So I didn't bother trying to fix it. I'm also pretty sure that otherwise we would have the problem that pyarrow might decide at some point to update how they store PARQUET files, and then we would need to stick with an old version of pyarrow.
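The general phenomenon behind the list above can be demonstrated with the standard library alone, without parquet: two files holding identical information can still be byte-wise different, and hence have different md5 sums, once a codec writes its own metadata. gzip, for example, stores a timestamp in its header.

```python
import gzip
import hashlib

data = b"identical table content"

# Compress the same payload twice, with different header timestamps
# (gzip stores the mtime in a 4-byte header field)
a = gzip.compress(data, mtime=0)
b = gzip.compress(data, mtime=1)

# The decompressed content is identical ...
same_content = gzip.decompress(a) == gzip.decompress(b) == data

# ... but the md5 sums of the files differ
different_md5 = (
    hashlib.md5(a).hexdigest() != hashlib.md5(b).hexdigest()
)
print(same_content, different_md5)
```

This is why hashing the logical content once at save time, and storing that hash inside the file, is more robust than hashing the raw file bytes.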

Any suggestion on what I should add to the docstring, and how?

@ChristianGeng (Member) commented

Your discussion points are valid, maybe you should not directly approve, but first let us discuss them ;)

I still do not know how to give approval after I had not done so in the first place. I promise I will figure this out :-)

Any suggestion on what I should add to the docstring, and how?

My understanding is that you are asking about the reasons why parquet files can differ. Probably a certain vagueness would help? One could say "reasons include" to signal that the list is not necessarily comprehensive. And one could drop "likely", since it is hard to get such a comprehensive picture anyway, maybe like this:

"md5 sums of parquet files containing identical information often differ. Reasons include the library that wrote the parquet file and its version, the chosen compression codec, and metadata written by the library."

I have so far not added the first point ("is the binary data always stored deterministically"). I thought parquet would not do much more than a (deterministic) run-length encoding, and the non-determinism comes from the compression. But I am not an expert on that at all.

@hagenw (Member, Author) commented Jun 26, 2024

Your discussion points are valid, maybe you should not directly approve, but first let us discuss them ;)

I still do not know how to give approval after I had not done so in the first place. I promise I will figure this out :-)

That's indeed a hidden feature in GitHub. I recently figured out how to do it.

Let's assume you did your first review by just adding comments to certain parts of the code, and you didn't select "Approve".
Now the discussion is finished, and you would like to give your final approval.
You need to switch to the "Files changed" view; there you can click the "Review" button again and start another review, e.g. by just selecting "Approve" and submitting the review:

[screenshot: starting a second review from the "Files changed" view]

@hagenw (Member, Author) commented Jun 26, 2024

Any suggestion on what I should add to the docstring, and how?

My understanding is that you are asking about the reasons why parquet files can differ. Probably a certain vagueness would help? One could say "reasons include" to signal that the list is not necessarily comprehensive. And one could drop "likely", since it is hard to get such a comprehensive picture anyway, maybe like this:

"md5 sums of parquet files containing identical information often differ. Reasons include the library that wrote the parquet file and its version, the chosen compression codec, and metadata written by the library."

I have so far not added the first point ("is the binary data always stored deterministically"). I thought parquet would not do much more than a (deterministic) run-length encoding, and the non-determinism comes from the compression. But I am not an expert on that at all.

Thanks, I shortened your suggestion slightly and included it as:

[screenshot: final docstring wording]

@ChristianGeng (Member) commented

I think the description is nice now. As I had prematurely given approval, this is not necessary a second time. Or is it?

@hagenw (Member, Author) commented Jun 26, 2024

Thanks, it's fine, I will merge now.

@hagenw hagenw merged commit 384d99c into main Jun 26, 2024
10 checks passed
@hagenw hagenw deleted the doc-parquet-hash-metadata branch June 26, 2024 09:33