Load csv tables with pandas if pyarrow fails #450
Merged
Closes #449
Unfortunately, loading csv files with `pyarrow.csv.read_csv()`, as introduced in #419, is not as tolerant of malformed csv files as `pandas.read_csv()`. So far I have identified three cases in which loading a csv file might fail (two of them are listed in #449):

1. Entries containing a `"` character, e.g. `,""` vs. `,,`.
2. A file that fails to load even though its syntax turns out to be correct; I did not add a test for it, as loading works when splitting the file into smaller ones.
3. `date` values that include a time-zone offset, e.g. `+00:00`, which is the case for some of our older datasets.

As
`pyarrow.csv.read_csv()` cannot easily be extended to handle those cases, I now use a `try`/`except` statement that falls back to loading the file with `pandas.read_csv()`. This is very unfortunate, as it means that a new feature (e.g. streaming) needs to be implemented for both code paths. But I don't know a better solution at the moment.

In principle, we could solve 1. and 3. by updating the databases, but then you would still no longer be able to load old versions, which is not acceptable. How we could solve 2. otherwise, I don't know.