Store tables as PARQUET files #419

Merged on Jun 19, 2024 (67 commits)
Commits
930d242
Ensure correct boolean dtype in misc table index
hagenw May 30, 2024
8d38ba9
Remove unneeded code
hagenw May 30, 2024
06f3a34
Use pyarrow to read CSV files
hagenw Mar 20, 2024
e5045d0
Start debugging
hagenw May 30, 2024
463c15f
Continue debugging
hagenw May 30, 2024
e0b831e
Fix tests
hagenw May 30, 2024
f48a00b
Remove unneeded code
hagenw May 31, 2024
b548774
Improve code
hagenw May 31, 2024
abb07d9
Fix test for older pandas versions
hagenw May 31, 2024
48c9da5
Exclude benchmark folder from tests
hagenw May 31, 2024
e556c90
Test other implementation
hagenw May 31, 2024
b07f1ac
Remove support for Python 3.8
hagenw May 31, 2024
b1e0b69
Store tables as PARQUET
hagenw Jun 11, 2024
68c764c
Cleanup code + Table.levels
hagenw Jun 11, 2024
fdc96bd
Use dict for CSV dtype mappings
hagenw Jun 11, 2024
e865813
Rename helper function
hagenw Jun 11, 2024
eee02d3
Simplify code
hagenw Jun 11, 2024
cb4a42f
Add helper function for CSV schema
hagenw Jun 11, 2024
c89bc33
Fix typo in docstring
hagenw Jun 12, 2024
e485d57
Remove levels attribute
hagenw Jun 12, 2024
2a359f1
Merge stash
hagenw Jun 12, 2024
01678d9
Remove levels from doctest output
hagenw Jun 12, 2024
92306d8
Convert method to property
hagenw Jun 12, 2024
2b727b9
Add comment
hagenw Jun 12, 2024
ec50279
Simplify code
hagenw Jun 11, 2024
f6820ea
Simplify code
hagenw Jun 11, 2024
fe50e53
Add test for md5sum of parquet file
hagenw Jun 12, 2024
f9d564e
Switch back to snappy compression
hagenw Jun 12, 2024
c53d8cc
Fix linter
hagenw Jun 12, 2024
0636a30
Store hash inside parquet file
hagenw Jun 12, 2024
77eb826
Fix code coverage
hagenw Jun 12, 2024
4a54cb0
Stay with CSV as default table format
hagenw Jun 12, 2024
13a7769
Test pyarrow==15.0.2
hagenw Jun 13, 2024
6b07a24
Test pyarrow==14.0.2
hagenw Jun 13, 2024
563a892
Test pyarrow==13.0
hagenw Jun 13, 2024
4b451ef
Test pyarrow==12.0
hagenw Jun 13, 2024
63188ae
Test pyarrow==11.0
hagenw Jun 13, 2024
e2eee7f
Test pyarrow==10.0
hagenw Jun 13, 2024
bf8dd59
Test pyarrow==10.0.1
hagenw Jun 13, 2024
83cac4f
Require pyarrow>=10.0.1
hagenw Jun 13, 2024
c78da84
Test pandas<2.1.0
hagenw Jun 13, 2024
263f970
Add explanations for requirements
hagenw Jun 13, 2024
d51d01d
Add test using minimum pip requirements
hagenw Jun 13, 2024
f889b75
Fix alphabetical order of requirements
hagenw Jun 13, 2024
96df9ac
Enhance test matrix definition
hagenw Jun 13, 2024
f37de7e
Debug failing test
hagenw Jun 13, 2024
17ea1d9
Test different hash method
hagenw Jun 13, 2024
495e095
Use different hashing approach
hagenw Jun 13, 2024
f374fe0
Require pandas>=2.2.0 and fix hashes
hagenw Jun 14, 2024
18e3ada
CI: re-enable all minimal requriements
hagenw Jun 14, 2024
bc0c68f
Hashing algorithm to respect row order
hagenw Jun 14, 2024
6c36e0a
Clean up tests
hagenw Jun 14, 2024
407aa91
Fix minimum install of audiofile
hagenw Jun 18, 2024
c9b5760
Fix docstring of Table.load()
hagenw Jun 18, 2024
589da4b
Fix docstring of Database.load()
hagenw Jun 18, 2024
b0ee769
Ensure correct order in time when storing tables
hagenw Jun 18, 2024
1e167c1
Simplify comment
hagenw Jun 18, 2024
8ad8d74
Add docstring to _load_pickle()
hagenw Jun 18, 2024
7b3a558
Fix _save_parquet() docstring
hagenw Jun 18, 2024
d414fe7
Improve comment in _dataframe_hash()
hagenw Jun 18, 2024
a90eaf4
Document arguments of test_table_update...
hagenw Jun 18, 2024
2749ef9
Relax test for table saving order
hagenw Jun 18, 2024
3f21e3c
Update audformat/core/table.py
hagenw Jun 19, 2024
2912f76
Revert "Update audformat/core/table.py"
hagenw Jun 19, 2024
c4c41ff
Use numpy representation for hashing (#436)
hagenw Jun 19, 2024
8e85168
Use test class
hagenw Jun 19, 2024
6a9e3d1
CI: remove pyarrow from branch to start test
hagenw Jun 19, 2024
15 changes: 13 additions & 2 deletions .github/workflows/test.yml
@@ -15,12 +15,13 @@ jobs:
     os: [ ubuntu-20.04, windows-latest, macOS-latest ]
     python-version: [ '3.10' ]
     include:
-      - os: ubuntu-latest
-        python-version: '3.8'
       - os: ubuntu-latest
         python-version: '3.9'
       - os: ubuntu-latest
         python-version: '3.11'
+      - os: ubuntu-latest
+        python-version: '3.9'
+        requirements: 'minimum'

 steps:
 - uses: actions/checkout@v4
@@ -50,6 +51,16 @@ jobs:
     pip install -r requirements.txt
     pip install -r tests/requirements.txt

+- name: Downgrade to minimum dependencies
+  run: |
+    pip install "audeer==2.0.0"
+    pip install "audiofile==0.4.0"
+    pip install "numpy<2.0.0"
+    pip install "pandas==2.1.0"
+    pip install "pyarrow==10.0.1"
+    pip install "pyyaml==5.4.1"
+  if: matrix.requirements == 'minimum'
+
 - name: Test with pytest
   run: |
     python -m pytest
6 changes: 3 additions & 3 deletions audformat/core/database.py
@@ -979,7 +979,7 @@ def save(
 r"""Save database to disk.

 Creates a header ``<root>/<name>.yaml``
-and for every table a file ``<root>/<name>.<table-id>.[csv,pkl]``.
+and for every table a file ``<root>/<name>.<table-id>.[csv,parquet,pkl]``.

 Existing files will be overwritten.
 If ``update_other_formats`` is provided,
@@ -1383,7 +1383,7 @@ def load(
 r"""Load database from disk.

 Expects a header ``<root>/<name>.yaml``
-and for every table a file ``<root>/<name>.<table-id>.[csv|pkl]``
+and for every table a file ``<root>/<name>.<table-id>.[csv|parquet|pkl]``
 Media files should be located under ``root``.

 Args:
@@ -1409,7 +1409,7 @@ def load(
 Raises:
     FileNotFoundError: if the database header file cannot be found
         under ``root``
-    RuntimeError: if a CSV table file is newer
+    RuntimeError: if a CSV or PARQUET table file is newer
         than the corresponding PKL file

 """
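The `RuntimeError` documented above guards against a stale PKL cache. The following is a minimal sketch of such a check, assuming it compares file modification times; `check_cache_is_fresh` and the file names are hypothetical and are not taken from audformat's implementation.

```python
import os
import tempfile


def check_cache_is_fresh(table_path: str, pickle_path: str) -> None:
    """Raise RuntimeError if ``table_path`` is newer than ``pickle_path``.

    Hypothetical helper for illustration only, not audformat's code.
    """
    if not os.path.exists(pickle_path):
        # No PKL cache to compare against
        return
    if os.path.getmtime(table_path) > os.path.getmtime(pickle_path):
        raise RuntimeError(
            f"'{table_path}' is newer than '{pickle_path}'; "
            "the PKL cache may be outdated."
        )


with tempfile.TemporaryDirectory() as root:
    parquet_file = os.path.join(root, "db.files.parquet")
    pkl_file = os.path.join(root, "db.files.pkl")
    for path in (parquet_file, pkl_file):
        open(path, "w").close()
    # Set explicit modification times: PKL written after PARQUET
    os.utime(parquet_file, (1000, 1000))
    os.utime(pkl_file, (2000, 2000))
    check_cache_is_fresh(parquet_file, pkl_file)  # passes silently
```

Explicit timestamps via `os.utime` keep the example deterministic; relying on real write order can be fragile on file systems with coarse time resolution.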
3 changes: 3 additions & 0 deletions audformat/core/define.py
@@ -337,6 +337,9 @@ class TableStorageFormat(DefineBase):
 CSV = "csv"
 """File extension for tables stored in CSV format."""

+PARQUET = "parquet"
+"""File extension for tables stored in PARQUET format."""
+
 PICKLE = "pkl"
 """File extension for tables stored in PKL format."""

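As a rough illustration of how these extensions relate to the ``<root>/<name>.<table-id>.<ext>`` naming scheme described in the `Database.save()` docstring, here is a small sketch. The `TABLE_EXTENSIONS` tuple and the `table_path` helper are assumptions made for this example and are not part of audformat.

```python
import os

# Extension values as listed in the diff; the tuple itself is hypothetical
TABLE_EXTENSIONS = ("csv", "parquet", "pkl")


def table_path(root: str, name: str, table_id: str, extension: str) -> str:
    """Build ``<root>/<name>.<table-id>.<extension>`` (illustrative helper)."""
    if extension not in TABLE_EXTENSIONS:
        raise ValueError(
            f"Unknown table extension '{extension}', "
            f"expected one of {TABLE_EXTENSIONS}"
        )
    return os.path.join(root, f"{name}.{table_id}.{extension}")


# One candidate file per storage format for a table with ID "files"
paths = [table_path("build", "db", "files", ext) for ext in TABLE_EXTENSIONS]
```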