DOC: Table.save() store hash for parquet files (#446)
* DOC: Table.save() store hash for parquet files

* Provide code example

* Update example

* Mention that audb uses the hash

* Discuss reasons why md5 sum differs
hagenw authored Jun 26, 2024
1 parent c132807 commit 384d99c
Showing 1 changed file with 22 additions and 0 deletions.
22 changes: 22 additions & 0 deletions audformat/core/table.py
@@ -586,6 +586,28 @@ def save(
Existing files will be overwritten.
When using ``"parquet"`` as ``storage_format``
a hash,
based on the content of the table,
is stored under the key ``b"hash"``
in the metadata of the schema of the parquet file.
This provides a deterministic hash for the file,
as md5 sums of parquet files,
containing identical information,
often differ.
Reasons include factors like the library
that wrote the parquet file,
the chosen compression codec
and metadata written by the library.

The hash can be accessed with ``pyarrow`` by::

    pyarrow.parquet.read_schema(f"{path}.parquet").metadata[b"hash"].decode()

The hash is used by :mod:`audb`
when publishing a database
to track changes of database files.

Args:
    path: file path without extension
    storage_format: storage format of table.