Store tables as PARQUET files #419

Merged · 67 commits · Jun 19, 2024

Conversation

@hagenw (Member) commented Mar 20, 2024

Closes #382
Closes #376

Several improvements and speed-ups to the table handling by using pyarrow:

  • Support for storing tables as PARQUET files
  • Reading tables from CSV files with pyarrow
  • Require pandas >=2.1.0
  • Require pyarrow
  • Remove support for Python 3.8 (as we need pandas >=2.1.0)
  • Add a new CI job "ubuntu-latest, 3.9, minimum" that tests with all Python packages installed at their minimum versions

I decided to keep "csv" as the default setting for the storage_format argument in audformat.Database.save() and audformat.Table.save(). This way, we can make a new release of audformat and test storing tables as PARQUET files with some databases, without exposing the feature to all users yet. In addition, audb needs to be updated to support publishing PARQUET tables.
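As a minimal sketch of how the new option is used (the database generated with audformat.testing is only a stand-in for any real database):

import audeer
import audformat
import audformat.testing

# Any existing audformat.Database works the same way;
# a generated test database is used here for illustration only.
db = audformat.testing.create_db()

# Tables are still stored as CSV files by default ...
db.save(audeer.mkdir("./db"))

# ... but can be stored as PARQUET files by opting in explicitly.
db.save(audeer.mkdir("./db-parquet"), storage_format="parquet")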

Loading benchmark

We can benchmark the behavior by loading a dataset from a folder that contains all tables, using
audformat.Database.load(db_root, load_data=True). Results are given as the average over 10 runs.

| Dataset | # table rows | CSV (previous) | CSV (new) | PARQUET | PKL |
| --- | --- | --- | --- | --- | --- |
| mozillacommonvoice | 1,923,269 | 8.62 ± 0.3 s | 4.76 ± 0.6 s | 3.58 ± 0.7 s | 2.30 ± 0.8 s |
| voxceleb2 | 2,894,987 | 17.44 ± 0.3 s | 6.05 ± 0.2 s | 1.33 ± 0.1 s | 1.01 ± 0.1 s |
| librispeech | 36,422,346 | 1036.1 ± 30 s | 284.95 ± 17.3 s | 26.88 ± 0.7 s | 3.01 ± 1.0 s |
| imda-nsc-read-speech-balanced | 4,538,128 | 8.38 ± 0.2 s | 2.86 ± 0.4 s | 3.07 ± 0.3 s | 2.04 ± 0.3 s |
| emodb | 1,605 | 0.02 ± 0.00 s | 0.05 ± 0.01 s | 0.03 ± 0.01 s | 0.03 ± 0.07 s |
Benchmark code
import os
import time

import numpy as np

import audb
import audeer
import audformat


repetitions = 10

db_root = audeer.path("./db")
audeer.rmdir(db_root)
cache_root = audeer.path("./cache")
audeer.rmdir(cache_root)
audeer.mkdir(cache_root)

datasets = [
    ("mozillacommonvoice", "10.3.0"),
    ("voxceleb2", "2.6.0"),
    ("librispeech", "3.1.0"),
    ("imda-nsc-read-speech-balanced", "2.6.0"),
    ("emodb", "1.4.1"),
]
for name, version in datasets:
    audeer.rmdir(db_root)
    audb.load_to(
        db_root,
        name,
        version=version,
        only_metadata=True,
        cache_root=cache_root,
        num_workers=1,
    )

    # PKL files
    times = []
    for _ in range(repetitions):
        t0 = time.time()
        db = audformat.Database.load(db_root, load_data=True)
        t = time.time() - t0
        times.append(t)
    print(
        f"Execution time with PKL tables {name}: "
        f"{np.mean(times):.3f}±{np.std(times):.3f}s"
    )

    # CSV files
    pkl_files = audeer.list_file_names(db_root, filetype="pkl")
    for pkl_file in pkl_files:
        os.remove(pkl_file)
    times = []
    for _ in range(repetitions):
        t0 = time.time()
        db = audformat.Database.load(db_root, load_data=True)
        t = time.time() - t0
        times.append(t)
    print(
        f"Execution time with CSV tables {name}: "
        f"{np.mean(times):.3f}±{np.std(times):.3f}s"
    )

    # PARQUET files
    if audformat.__version__ > "1.1.4":
        # Replace CSV with PARQUET files
        # and repeat benchmark
        for table_id in list(db):
            path = audeer.path(db_root, f"db.{table_id}")
            db[table_id].save(path, storage_format="parquet")
            os.remove(f"{path}.csv")
        times = []
        for _ in range(repetitions):
            t0 = time.time()
            db = audformat.Database.load(db_root, load_data=True)
            t = time.time() - t0
            times.append(t)
        print(
            f"Execution time with PARQUET tables {name}: "
            f"{np.mean(times):.3f}±{np.std(times):.3f}s"
        )

The benchmark highlights two important results:

  • The new implementation speeds up loading of CSV files by a factor of ~4.
  • Reading from PARQUET is at least as fast as reading from CSV, and for some datasets much faster.

Memory benchmark

I investigated memory consumption with heaptrack when loading the phonetic-transcription.train-clean-360 table of the librispeech dataset from our internal server. Stored as a CSV file, the table has a size of 1.3 GB; stored as a PARQUET file, 49 MB.

| branch | peak heap memory | execution time |
| --- | --- | --- |
| CSV (main) | 1.36 GB | 197.9 s |
| CSV (pull request) | 1.33 GB | 43.9 s |
| PARQUET (pull request) | 1.11 GB | 3.7 s |
Benchmark code

csv-table-loading.py

import time

import audb
import audformat


# Measure execution time when loading a CSV table.
# Assumes the file `db.phonetic-transcription.train-clean-360.csv`
# or `db.phonetic-transcription.train-clean-360.parquet`
# exists in the same folder
# as this script.
# Note that the CSV file is only loaded
# when the PARQUET file is not present.

db = audb.info.header("librispeech")
table_id = "phonetic-transcription.train-clean-360"
t0 = time.time()
db[table_id].load(f"db.{table_id}")
t = time.time() - t0
print(f"Execution time: {t:.1f} s")

The memory consumption is then profiled with:

$ heaptrack python csv-table-loading.py

The execution time was measured without heaptrack:

$ python csv-table-loading.py

Why and when is reading from a CSV file slow?

By far the slowest part of reading a CSV file with pyarrow is the conversion to pandas.Timedelta values for columns that specify a duration.
E.g., when reading the CSV file from the memory benchmark, reading with pyarrow and converting to a pandas.DataFrame takes 3 s, whereas converting the start and end columns to pandas.Timedelta takes roughly 40 s.
There is a dedicated pyarrow.duration dtype, but it does not yet support reading from CSV files. When trying to do so, you get:

pyarrow.lib.ArrowNotImplementedError: CSV conversion to duration[ns] is not supported
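A minimal sketch that reproduces this limitation (the file name db.table.csv and the column name end are assumptions for this example):

import pyarrow as pa
import pyarrow.csv as csv

# Ask pyarrow to parse a hypothetical "end" column directly as duration[ns]
convert_options = csv.ConvertOptions(
    column_types={"end": pa.duration("ns")},
)
try:
    csv.read_csv("db.table.csv", convert_options=convert_options)
except pa.ArrowNotImplementedError as error:
    print(error)  # CSV conversion to duration[ns] is not supported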

When storing a table as a PARQUET file, we use duration[ns] for time values, and converting those back to pandas.Timedelta is much faster: reading the PARQUET file takes 0.3 s, and converting to the dataframe takes the remaining 3.4 s.

The conversion can very likely be sped up by switching to pyarrow-based dtypes in the dataframe, as we do for audb.Dependencies._df, but at the moment this is not fully supported in pandas, e.g. timedelta values are not implemented yet (pandas-dev/pandas#52284).
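For illustration, a rough sketch of what pyarrow-backed dtypes look like in pandas (not the audb implementation, just an example with a made-up column):

import pandas as pd
import pyarrow as pa

# String data can already be backed by pyarrow ...
files = pd.Series(["f1.wav", "f2.wav"], dtype=pd.ArrowDtype(pa.string()))
print(files.dtype)  # string[pyarrow]

# ... whereas full timedelta support for pyarrow-backed dtypes
# is still pending in pandas (pandas-dev/pandas#52284),
# which is why the conversion above still goes through pandas.Timedelta.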

Hashing of PARQUET files

As we have already experienced in audeering/audb#372, PARQUET files are not stored in a reproducible fashion and might return different MD5 hash sums, even though they store the same dataframe.
To overcome this problem, I now calculate a hash based on the content of the dataframe and store the resulting value inside the metadata of the schema of the PARQUET file. This means that in audb we can access it by just loading the schema of a PARQUET file. The corresponding code to access the hash is:

import pyarrow.parquet as parquet

hash = parquet.read_schema(path).metadata[b"hash"].decode()
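For completeness, a minimal sketch of how such a content-based hash could be written into the schema metadata in the first place (an illustration under assumptions, not the exact audformat implementation; in particular, the hash function used here is only a placeholder):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as parquet


def write_table_with_hash(df: pd.DataFrame, path: str):
    # Placeholder content hash, based on pandas' hashing utility
    content_hash = str(pd.util.hash_pandas_object(df).sum())
    table = pa.Table.from_pandas(df, preserve_index=True)
    # Store the hash in the schema metadata,
    # so it can later be read with parquet.read_schema()
    metadata = dict(table.schema.metadata or {})
    metadata[b"hash"] = content_hash.encode()
    parquet.write_table(table.replace_schema_metadata(metadata), path)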

Reading the hash from the schema metadata is faster than calculating the MD5 sum with audeer.md5().
Execution time, benchmarked as the average over 100 repetitions:

| method | execution time |
| --- | --- |
| read hash | 0.0001 ± 0.0000 s |
| calculate md5 | 0.0716 ± 0.0007 s |
Benchmark code
import time

import numpy as np
import pyarrow.parquet as parquet

import audeer

# Compare execution times
# for calculating the MD5 sum
# vs reading the hash from the PARQUET file.
# Assumes the file `db.phonetic-transcription.train-clean-360.parquet`
# exists in the same folder
# as this script.
table_id = "phonetic-transcription.train-clean-360"
repetitions = 100 

path = f"db.{table_id}.parquet"
times = []
for _ in range(repetitions):
    t0 = time.time()
    audeer.md5(path)
    t = time.time() - t0
    times.append(t)
print(f"Execution time MD5: {np.mean(times):.4f} ± {np.std(times):.4f} s")

times = []
for _ in range(repetitions):
    t0 = time.time()
    parquet.read_schema(path).metadata[b"hash"].decode()
    t = time.time() - t0
    times.append(t)
print(f"Execution time hash: {np.mean(times):.4f} ± {np.std(times):.4f} s")

Writing benchmark

Comparison of saving the phonetic-transcription.train-clean-360 table from librispeech in different formats.
Note that saving as a PARQUET file includes calculating the hash of the underlying dataframe.

| format | execution time |
| --- | --- |
| CSV | 22.2 s |
| PARQUET | 5.1 s |
| PKL | 1.0 s |
Benchmark code
import os
import time

import audb
import audformat


# Measure execution time when saving table.
# Assumes the file `db.phonetic-transcription.train-clean-360.csv`
# or `db.phonetic-transcription.train-clean-360.parquet`
# exists in the same folder
# as this script.
db = audb.info.header("librispeech")
table_id = "phonetic-transcription.train-clean-360"
db[table_id].load(f"db.{table_id}")

t0 = time.time()
db[table_id].save("table", storage_format="csv")
t = time.time() - t0
os.remove("table.csv")
print(f"Execution time CSV: {t:.1f} s")

t0 = time.time()
db[table_id].save("table", storage_format="parquet")
t = time.time() - t0
os.remove("table.parquet")
print(f"Execution time PARQUET: {t:.1f} s")

t0 = time.time()
db[table_id].save("table", storage_format="pkl")
t = time.time() - t0
os.remove("table.pkl")
print(f"Execution time PKL: {t:.1f} s")

@hagenw marked this pull request as draft on March 20, 2024 12:33

codecov bot commented Mar 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.0%. Comparing base (eddf224) to head (8e85168).

Current head 8e85168 differs from pull request most recent head 6a9e3d1

Please upload reports for the commit 6a9e3d1 to get more accurate results.

Additional details and impacted files
| Files | Coverage Δ |
| --- | --- |
| audformat/core/database.py | 100.0% <ø> (ø) |
| audformat/core/define.py | 100.0% <100.0%> (ø) |
| audformat/core/table.py | 100.0% <100.0%> (ø) |
| audformat/core/utils.py | 100.0% <100.0%> (ø) |

@hagenw force-pushed the pyarrow branch 2 times, most recently from 2c0f12f to 625d4c3 on May 30, 2024 14:36
@hagenw changed the title from "Use pyarrow to read CSV files" to "Store tables as PARQUET files" on Jun 11, 2024
@hagenw marked this pull request as ready for review on June 18, 2024 11:40
hagenw and others added 2 commits June 19, 2024 08:52
@ChristianGeng (Member) left a comment


Review


minimum dependencies

The downgrade to minimum dependencies is a nice feature for testing against the earliest supported pandas version.

Hashing

Hashing depends on the pandas utility in pandas/pandas/core/util/hashing.py - via audformat.utils.hash.

The pandas function has a few input arguments, and most of its defaults are not worrying. I would, for example, not assume that the utf encoding will change anytime in the near future. I also assume that keeping index set to True makes sense, as the function only targets series, frames, and audformat-type indices, which is stricter than hashing only the values.
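For reference, a minimal sketch of the pandas utility and the defaults discussed here (the example series is made up):

import pandas as pd

# hash_pandas_object(obj, index=True, encoding="utf8", hash_key=..., categorize=True)
# returns one uint64 hash value per row; audformat.utils.hash builds on this.
obj = pd.Series([1.0, 2.0, 3.0])
row_hashes = pd.util.hash_pandas_object(obj, index=True, encoding="utf8")
print(row_hashes)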

However, what I find a bit worrying is that the line defining _default_hash_key was changed only two months ago:

git blame pandas/core/util/hashing.py | grep "_default_hash_key ="

The hash key's value seems kind of generic, but I wonder whether it was only introduced recently, or whether it changes from time to time.

So far it has only been changed in a single commit:

git log -G "_default_hash_key =" --format=%H -- pandas/core/util/hashing.py
b1525c4a3788d161653b04a71a84e44847bedc1b

This is good. But will it remain backwards compatible in the future? Or does it have to be future proof at all? I have searched the code of the pandas library. From what I have seen, the hashing function is never used to verify the identity of an object or artifact; it seems to be used only for algorithmic (speed-up) purposes. So pandas itself is unlikely to be hampered if the underlying hashing changes in a future release, but our use of it for identifying objects possibly would be.
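One way to make this dependency explicit would be to pass the hash key ourselves instead of relying on the pandas-internal default, e.g. (a sketch of the idea, not part of this pull request):

import pandas as pd

# Pin the hash key explicitly; it must encode to exactly 16 bytes.
# The value below only illustrates the idea.
PINNED_HASH_KEY = "0123456789123456"
row_hashes = pd.util.hash_pandas_object(
    pd.Series([1, 2, 3]),
    hash_key=PINNED_HASH_KEY,
)
print(row_hashes)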

Status of _pyarrow_convert_dtypes

A lot of the new functionality is covered by the retyping in table._pyarrow_convert_dtypes. Currently it is only used within the table module. It is also only tested in the context of tests targeting tables and misc tables, so this is a kind of integration-test approach.

I am not saying that it needs to be unit tested, and I don't know whether it would ever be used outside of the table module. In case one wants to unit test it at a later stage, or even only for stylistic reasons, one would probably check whether a @staticmethod decorator makes sense. However, the method depends on _levels_and_dtypes, columns, and db.schemes, which describe the table in terms of indices, columns, and their respective types, so one would have to pass these around, which would make it cumbersome. In other words, the coupling of _pyarrow_convert_dtypes to the surrounding table state is quite high. Is there a way to decrease it? All of the data passed around contains information about indices/columns and how these relate to schemes and dtypes.

I would approve for now, as I do not know whether any of these comments will have consequences code-wise.

@hagenw (Member, Author) commented Jun 19, 2024

The hash key's value seems kind of generic, but I wonder whether it was only introduced recently, or whether it changes from time to time.

This is indeed a good point. I created #436 to propose a better hash solution.

* Use numpy representation for hashing

* Enable tests and require pandas>=1.4.1

* Use numpy<2.0 in minimum test

* Skip doctests in minimum

* Require pandas>=2.1.0

* Require numpy<=2.0.0 in minimum test

* Remove print statements

* Fix numpy<2.0.0 for minimum test

* Remove max_rows argument

* Simplify code
@hagenw (Member, Author) commented Jun 19, 2024

Regarding the complexity of audformat.Table._pyarrow_convert_dtypes(): you have correctly spotted that the conversion can only be done if the table is assigned to a database, and hence has information on the used schemes and columns.
Originally, I thought I might be able to simplify it by adding index_names, index_dtypes, column_names, and column_dtypes as arguments, but the problem is that it also needs to inspect self.db.schemes.

So far, I have not found a better way to restructure the code, but I agree that we should make audformat.Table._pyarrow_convert_dtypes() a class method and add unit tests for it. I created #437 to track this.
