Store tables as PARQUET files #419
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅
Review
minimum dependencies
The downgrade to minimum dependencies is a nice feature for verifying that the earliest supported pandas version still works.
Hashing
Hashing depends on the pandas utility in pandas/core/util/hashing.py, via audformat.utils.hash.

The pandas function has a few input arguments; most of its defaults are not worrying. I would, e.g., not assume that the UTF-8 encoding will change anytime in the near future. I also assume that keeping the index set to True makes sense, as the function only targets series, frames and audformat-type indices, which is stricter than only hashing values.
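For reference, a minimal sketch of how the pandas utility is called with the defaults discussed above; the example data is made up:

```python
import pandas as pd

series = pd.Series([0.1, 0.2, 0.3], index=["f1.wav", "f2.wav", "f3.wav"])

# Defaults under discussion: index=True, encoding="utf8",
# and hash_key, which falls back to _default_hash_key
row_hashes = pd.util.hash_pandas_object(series, index=True, encoding="utf8")
print(row_hashes)  # one uint64 value per row
```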
However, what I find a bit worrying is that the line defining _default_hash_key was changed only two months ago:

git blame pandas/core/util/hashing.py | grep "_default_hash_key ="

The hash key's value seems kind of generic, but I wonder whether it was only introduced recently, or whether it gets changed from time to time?
So far it has only been changed in a single commit:
git log -G "_default_hash_key =" --format=%H -- pandas/core/util/hashing.py
b1525c4a3788d161653b04a71a84e44847bedc1b
This is good. But will it be safely backwards compatible in the future? Or does it have to be future proof at all? I have searched the code of the pandas library. From what I have seen, the hashing function is never used to verify the identity of an object or artifact; it seems to be used only for algorithmic (speedup) purposes. So a future pandas release is unlikely to be hampered if the underlying hashing changes, but our use of it to identify objects possibly would be.
Status of _pyarrow_convert_dtypes
Lots of the new functionality is covered by the retyping in table._pyarrow_convert_dtypes. Currently it is only used within the table module. Also, it is only tested within the context of tests targeting misc tables and tables, so this is a kind of integration test approach.

I am not saying that it needs to be unit tested, and I don't know whether it would ever be used outside of the context of the table module. In case one would want to unit-test it at some later stage, or even only stylistically, one would probably check whether a @staticmethod decorator makes sense (see the sketch below). However, the method depends on _levels_and_dtypes, columns and db.schemes, which describe the table in terms of indices, columns and their respective types, so one would have to pass these around, which would make it cumbersome. In other words, the coupling of _pyarrow_convert_dtypes to the table description is quite high. Is there a way to decrease this? All of the data passed around contains information about indices/columns and how these relate to schemes and dtypes.
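A rough sketch of what such a @staticmethod variant would imply; the function body and all argument names besides _pyarrow_convert_dtypes are simplified assumptions about the audformat internals:

```python
import pandas as pd
import pyarrow as pa


# Hypothetical staticmethod variant: everything that currently comes from
# `self` and `self.db` would have to be passed in explicitly.
def _pyarrow_convert_dtypes(
    table: pa.Table,
    levels_and_dtypes: dict,  # index level -> audformat dtype
    columns: list,            # column IDs of the table
    schemes: dict,            # db.schemes, scheme ID -> dtype
) -> pd.DataFrame:
    df = table.to_pandas()
    # The real method would derive the target dtypes from
    # levels_and_dtypes, columns, and schemes here.
    return df


# Illustrative call: the three table-describing arguments have to travel
# alongside the data, which is what makes the refactoring cumbersome.
df = _pyarrow_convert_dtypes(
    pa.table({"speaker": ["s1", "s2"]}),
    levels_and_dtypes={"file": "str"},
    columns=["speaker"],
    schemes={"speaker": "str"},
)
```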
I would approve for now as I do not know whether any of these comments will have any consequence codewise.
This is indeed a good point. I created #436 to propose a better hash solution.
* Use numpy representation for hashing
* Enable tests and require pandas>=1.4.1
* Use numpy<2.0 in minimum test
* Skip doctests in minimum
* Require pandas>=2.1.0
* Require numpy<=2.0.0 in minimum test
* Remove print statements
* Fix numpy<2.0.0 for minimum test
* Remove max_rows argument
* Simplify code
Regarding the complexity of _pyarrow_convert_dtypes: so far, I have not found a better way to restructure the code, but I agree that we should keep this in mind.
Closes #382
Closes #376
Several improvements/speed-ups to the table handling by using pyarrow:

- store tables as PARQUET files (pyarrow, pandas >=2.1.0)
- faster reading and writing of CSV files (pyarrow, pandas >=2.1.0)

I decided to stay with "csv" as the default setting for the storage_format argument in audformat.Database.save() and audformat.Table.save(). This way, we can make a new release of audformat and test storing tables as PARQUET files with some databases, without yet exposing the feature to other users. In addition, audb needs to be updated to support publishing PARQUET tables.
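For illustration, a sketch of the resulting API under these defaults; the database path and the value "parquet" for storage_format are assumptions:

```python
import audformat

db = audformat.Database.load("./my-database")  # placeholder path

# Default behavior stays unchanged: tables are written as CSV files
db.save("./my-database")

# Storing tables as PARQUET files has to be requested explicitly
db.save("./my-database", storage_format="parquet")
```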
Loading benchmark
We can benchmark the behavior by loading a dataset from a folder that contains all tables, using audformat.Database.load(db_root, load_data=True). Results are given as the average over 10 runs.

Benchmark code
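A minimal sketch of such a loading benchmark; the database path and the timing loop are assumptions, not the original benchmark code:

```python
import time

import audformat

db_root = "./librispeech"  # placeholder path to a local copy of the dataset

durations = []
for _ in range(10):
    t0 = time.time()
    audformat.Database.load(db_root, load_data=True)
    durations.append(time.time() - t0)

print(f"Average loading time: {sum(durations) / len(durations):.2f} s")
```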
The benchmark highlights two important results:
Memory benchmark
I investigated memory consumption using heaptrack when loading the phonetic-transcription.train-clean-360 table from the librispeech dataset from our internal server. Stored as a CSV file the table has a size of 1.3 GB, stored as a PARQUET file 49 MB.

Benchmark code
csv-table-loading.py
The memory consumption is then profiled with:
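A sketch of what csv-table-loading.py and the profiling call could look like; the loading calls and the heaptrack invocation are assumptions, not the original benchmark code:

```python
# csv-table-loading.py (reconstructed sketch)
import audformat

db_root = "./librispeech"  # placeholder path to a local copy of the dataset

# Load the database header only, then the single table under investigation
db = audformat.Database.load(db_root, load_data=False)
table = db["phonetic-transcription.train-clean-360"]
table.load(f"{db_root}/db.phonetic-transcription.train-clean-360")

# Assumed profiling invocation:
#   heaptrack python csv-table-loading.py
```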
The execution time was measured without heaptrack:

Why and when is reading from a CSV file slow?
By far the slowest part when reading a CSV file with pyarrow is the conversion to pandas.Timedelta values for columns that specify a duration. E.g. when reading the CSV from the memory benchmark, the reading with pyarrow and the conversion to pandas.DataFrame take 3 s, whereas the conversion of the start and end columns to pandas.Timedelta takes roughly 40 s.

There is a dedicated dtype, pyarrow.duration, but it does not yet have reading support for CSV files. When trying so, you get an error.
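A minimal sketch reproducing the limitation; the exact error message is an assumption and may differ between pyarrow versions:

```python
import io

import pyarrow as pa
import pyarrow.csv

# Tiny in-memory CSV standing in for a table with duration columns
csv_file = io.BytesIO(b"file,start,end\na.wav,0,1000000000\n")

convert_options = pyarrow.csv.ConvertOptions(
    column_types={
        "start": pa.duration("ns"),
        "end": pa.duration("ns"),
    },
)

try:
    pyarrow.csv.read_csv(csv_file, convert_options=convert_options)
except pa.ArrowNotImplementedError as error:
    # Expected to be along the lines of:
    # "CSV conversion to duration[ns] is not supported"
    print(error)
```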
When storing a table as a PARQUET file we use duration[ns] for time values, and converting those back to pandas.Timedelta seems to be much faster: reading the PARQUET file needs 0.3 s, converting to the dataframe then takes the remaining 3.4 s.

The conversion can very likely be sped up by switching to pyarrow based dtypes in the dataframe, as we do for audb.Dependencies._df, but at the moment this is not fully supported in pandas, e.g. timedelta values are not implemented yet (pandas-dev/pandas#52284).

Hashing of PARQUET files
As we have already experienced at audeering/audb#372, PARQUET files are not stored in a reproducible fashion and might return different MD5 hash sums, even though they store the same dataframe.

To overcome this problem, I now calculate a hash based on the content of the dataframe and store the resulting value inside the metadata of the schema of the PARQUET file. This means that in audb we can access it by just loading the schema of a PARQUET file. The corresponding code to access the hash is:
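A minimal sketch of such a lookup; the file name and the metadata key "hash" are assumptions about the final implementation:

```python
import pyarrow.parquet as parquet

# Read only the schema of the PARQUET file, not the table data
schema = parquet.read_schema("db.my-table.parquet")  # placeholder file name

# Schema metadata is stored as bytes; the key "hash" is assumed here
table_hash = schema.metadata[b"hash"].decode()
print(table_hash)
```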
This approach is faster than calculating the MD5 sum with audeer.md5().

Execution time benchmarked as average over 100 repetitions:
Benchmark code
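A sketch of how such a timing comparison could be set up; the file name and the metadata key are assumptions, not the original benchmark code:

```python
import time

import audeer
import pyarrow.parquet as parquet

file = "db.my-table.parquet"  # placeholder PARQUET table file


def average_runtime(func, repetitions=100):
    t0 = time.time()
    for _ in range(repetitions):
        func()
    return (time.time() - t0) / repetitions


print("audeer.md5():", average_runtime(lambda: audeer.md5(file)))
print(
    "schema metadata:",
    average_runtime(lambda: parquet.read_schema(file).metadata[b"hash"]),
)
```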
Writing benchmark
Comparison of saving the phonetic-transcription.train-clean-360 table from librispeech in different formats. Note, saving as a PARQUET file includes calculating the hash of the underlying dataframe.
Benchmark code
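A sketch of such a writing benchmark; the paths, the chosen formats, and the loading calls are assumptions, not the original benchmark code:

```python
import time

import audformat

db_root = "./librispeech"  # placeholder path to a local copy of the dataset

db = audformat.Database.load(db_root, load_data=False)
table = db["phonetic-transcription.train-clean-360"]
table.load(f"{db_root}/db.phonetic-transcription.train-clean-360")

for storage_format in ["csv", "pickle", "parquet"]:
    t0 = time.time()
    table.save("./benchmark-table", storage_format=storage_format)
    print(f"{storage_format}: {time.time() - t0:.2f} s")
```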