Load csv tables with pandas if pyarrow fails #450

hagenw · 2024-07-11T11:40:48Z

Closes #449

Unfortunately, loading csv files with pyarrow.csv.read_csv(), as introduced in #419, is not as tolerant of malformed csv files as pandas.read_csv(). So far I have identified three cases in which loading a csv file might fail (two of them are listed in #449):

1. Loading a csv file can fail if the csv file contains more columns than are mentioned in the header of the database.
2. Loading a csv file can also fail if it is very long and contains a lot of special characters, like ", "", and ,. I did not add a test for it, because it turns out that the syntax of the csv file is correct, and loading works when the file is split into smaller ones.
3. Loading a csv file can fail if it contains offsets in its date values, e.g. +00:00, which is the case for some of our older datasets.

As pyarrow.csv.read_csv() cannot easily be extended to handle those cases, I now use a try-except statement that falls back to loading the file with pandas.read_csv(). This is very unfortunate, as it means that any new feature (e.g. streaming) needs to be implemented for both cases. But I don't know a better solution at the moment.

In principle, we could solve 1. and 3. by updating the databases, but then you would no longer be able to load old versions, which is not acceptable. I don't know how we could solve 2. otherwise.

codecov · 2024-07-11T13:46:39Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.0%. Comparing base (e3fd511) to head (6f53c6b).

Additional details and impacted files

Files	Coverage Δ
audformat/core/table.py	`100.0% <100.0%> (ø)`

tests/test_table.py

hagenw marked this pull request as draft July 11, 2024 11:40

hagenw added 3 commits July 11, 2024 16:12

Add failing test

7d4c39f

Fix test

305953b

Fix tests

62136cc

hagenw force-pushed the fix-csv-loading branch from ab92fb1 to 62136cc Compare July 11, 2024 14:12

hagenw marked this pull request as ready for review July 11, 2024 14:13

hagenw requested a review from ChristianGeng July 11, 2024 14:17

ChristianGeng reviewed Jul 12, 2024


tests/test_table.py (outdated, resolved)

hagenw added 2 commits July 12, 2024 14:48

Improve comments

e5a1d82

Specify test in class

6f53c6b

hagenw merged commit 60cd1ed into main Jul 12, 2024
10 checks passed

hagenw deleted the fix-csv-loading branch July 12, 2024 16:38
