
DOC: Table.save() store hash for parquet files #446

Merged (5 commits into main, Jun 26, 2024)

Conversation

@hagenw (Member) commented Jun 25, 2024

Document that we store a hash value in the schema metadata of the PARQUET file when saving a table.

[screenshot: documentation change]

@ChristianGeng (Member) left a comment

I find this very relevant. Would maybe a little more detail come in handy?

  • when it is triggered (on publication). You write implicitly that it is used when "using parquet as the storage format". I think this could be a little bit more explicit.
  • what is the purpose of having a hash (to determine table identity)
  • why the md5sums can differ in parquet (mainly compression I believe)

I am approving this anyway.

@hagenw (Member, Author) commented Jun 25, 2024

Your discussion points are valid; maybe you should not directly approve, but let us first discuss them ;)

when it is triggered (on publication). You write implicitly that it is used when "using parquet as the storage format". I think this could be a little bit more explicit.

Publication is not part of audformat, but is handled by audb (see next comment). Inside audformat the calculation of the hash is only triggered when calling audformat.Table.save() or audformat.Database.save(), which then internally calls audformat.Table.save().

what is the purpose of having a hash (to determine table identity)

I added the following statement to the docs:

[screenshot: added docstring statement on the purpose of the hash]

why the md5sums can differ in parquet (mainly compression I believe)

So far I was not able to find documentation on that. There seem to be several factors that might influence the md5 sum:

  • Whether the binary data is always written to disk deterministically, or whether some optimization involving random processes is applied
  • The library used to write the PARQUET file
  • The involved compression
  • Metadata entries of the file that store the used library version and/or pandas version when converted from pandas

At least the last point could be avoided, but I found it nice that storing the hash inside the file also provides the advantage that the hash needs to be calculated only once. So I didn't bother trying to fix it. I'm also pretty sure that otherwise we would have the problem that pyarrow might decide at some point to update how they store PARQUET files, and then we would need to stick with an old version of pyarrow.
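The general phenomenon behind the list above can be demonstrated with the standard library alone, without parquet: two files holding identical information can still be byte-wise different, and hence have different md5 sums, once a codec writes its own metadata. gzip, for example, stores a timestamp in its header.

```python
import gzip
import hashlib

data = b"identical table content"

# Compress the same payload twice, with different header timestamps
# (gzip stores the mtime in a 4-byte header field)
a = gzip.compress(data, mtime=0)
b = gzip.compress(data, mtime=1)

# The decompressed content is identical ...
same_content = gzip.decompress(a) == gzip.decompress(b) == data

# ... but the md5 sums of the files differ
different_md5 = (
    hashlib.md5(a).hexdigest() != hashlib.md5(b).hexdigest()
)
print(same_content, different_md5)
```

This is why hashing the logical content once at save time, and storing that hash inside the file, is more robust than hashing the raw file bytes.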

Any suggestion on what I should add to the docstring, and how?

@ChristianGeng (Member) commented

Your discussion points are valid, maybe you should not directly approve, but first let us discuss them ;)

I still do not know how to give approval after I had not done so in the first place. I promise I will figure this out :-)

Any suggestion on what I should add to the docstring, and how?

My understanding is that you are asking about the reasons why parquet files can differ. Probably a certain vagueness would help? One could say "reasons include" to signal that the list is not necessarily comprehensive. And one could drop "likely", since it is hard to get such a comprehensive picture anyway, maybe like this:

"md5 sums of parquet files containing identical information often differ. Reasons include the library that wrote the parquet file and its version, the chosen compression codec, and metadata written by the library."

I have so far not added the first point ("is the binary data always stored deterministically"). I thought parquet would not do much more than a (deterministic) run-length encoding, and the non-determinism comes from the compression. But I am not an expert on that at all.

@hagenw (Member, Author) commented Jun 26, 2024

Your discussion points are valid, maybe you should not directly approve, but first let us discuss them ;)

I still do not know how to give approval after I had not done so in the first place. I promise I will figure this out :-)

That's indeed a hidden feature in GitHub. I recently figured out how to do it.

Let's assume you did your first review by just adding comments to certain parts of the code, and you didn't select "Approve".
Now the discussion is finished, and you would like to give your final approval.
You need to switch to the "Files changed" view; there you can click the "Review" button again and start another review, e.g. by just selecting "Approve" and submitting the review:

[screenshot: starting a second review from the "Files changed" view]

@hagenw (Member, Author) commented Jun 26, 2024

Any suggestion on what I should add to the docstring, and how?

My understanding is that you are asking about the reasons why parquet files can differ. Probably a certain vagueness would help? One could say "reasons include" to signal that the list is not necessarily comprehensive. And one could drop "likely", since it is hard to get such a comprehensive picture anyway, maybe like this:

"md5 sums of parquet files containing identical information often differ. Reasons include the library that wrote the parquet file and its version, the chosen compression codec, and metadata written by the library."

I have so far not added the first point ("is the binary data always stored deterministically"). I thought parquet would not do much more than a (deterministic) run-length encoding, and the non-determinism comes from the compression. But I am not an expert on that at all.

Thanks, I shortened your suggestion slightly and included it as:

[screenshot: final docstring wording]

@ChristianGeng (Member) commented

I think the description is nice now. As I had prematurely given approval, this is not necessary a second time. Or is it?

@hagenw (Member, Author) commented Jun 26, 2024

Thanks, it's fine, I will merge now.

@hagenw hagenw merged commit 384d99c into main Jun 26, 2024
10 checks passed
@hagenw hagenw deleted the doc-parquet-hash-metadata branch June 26, 2024 09:33