Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better error descriptions if CSV is invalid #215

Closed
wants to merge 2 commits into from
Closed

Conversation

kspurgin
Copy link
Contributor

@kspurgin kspurgin commented Aug 2, 2024

Improve error message for invalid character encoding in CSV
0e1b7e4

  • As long as CSV is getting invalid_encoding error(s), only the new
    message will be shown. This is because encoding errors sometimes cause
    structural errors that cascade through the file, causing hundreds of
    error messages about structure in which the message about encoding
    gets lost. Often if the encoding is fixed, the structural errors
    disappear. This makes it obvious/clear that fixing the invalid
    character encoding is the required.
  • Shows the first occurrence of each invalid character in context, so
    users can more easily find/fix the issues

Show errors for Csvlint-valid files CSV library can't parse
6b3586f

Resolves #213
Currently, you are allowed to proceed to run the pre-processing job
with such a file, but the job fails when processing hits a row that
causes a CSV::MalformedCSVError. Unfortunately, this often occurs at
the very end of the file.

This commit adds a separate csv_parse_validator_for(batch) method
that is only called if Csvlint reports the file is valid.

CSV-Parsing the whole file before creating the batch may seem like
overkill, but (a) I can't come up with a better way to catch all
possible issues; and (b) it seems better than using system
resources/user time to pre-process a whole batch that is only going to
fail at the end.

Thinking forward: If we are eventually moving to using the database to
store information about each row and its state/status, then this step
can be reworked to serve a dual purpose: (1) prepare each row to be
added to database if we are able to prepare all rows successfully;
and (2) if not able to prepare all rows successfully, show error
messages and destroy batch.

@kspurgin kspurgin marked this pull request as draft August 2, 2024 16:53
- As long as CSV is getting invalid_encoding error(s), only the new
message will be shown. This is because encoding errors sometimes cause
structural errors that cascade through the file, causing hundreds of
error messages about structure in which the message about encoding
gets lost. Often if the encoding is fixed, the structural errors
disappear. This makes it obvious/clear that fixing the invalid
character encoding is the required.
- Shows the first occurrence of each invalid character in context, so
users can more easily find/fix the issues

Improve error message for invalid character encoding in CSV

- As long as CSV is getting invalid_encoding error(s), only the new
message will be shown. This is because encoding errors sometimes cause
structural errors that cascade through the file, causing hundreds of
error messages about structure in which the message about encoding
gets lost. Often if the encoding is fixed, the structural errors
disappear. This makes it obvious/clear that fixing the invalid
character encoding is the required.
- Shows the first occurrence of each invalid character in context, so
users can more easily find/fix the issues
@kspurgin kspurgin force-pushed the DRYD-1488 branch 2 times, most recently from 7ad66fe to 4ad4944 Compare August 2, 2024 21:54
Currently, you are allowed to proceed to run the pre-processing job
with such a file, but the job fails when processing hits a row that
causes a CSV::MalformedCSVError. Unfortunately, this often occurs at
the very end of the file.

This commit adds a separate `csv_parse_validator_for(batch)` method
that is only called if Csvlint reports the file is valid.

CSV-Parsing the whole file before creating the batch may seem like
overkill, but (a) I can't come up with a better way to catch all
possible issues; and (b) it seems better than using system
resources/user time to pre-process a whole batch that is only going to
fail at the end.

Thinking forward: If we are eventually moving to using the database to
store information about each row and its state/status, then this step
can be reworked to serve a dual purpose: (1) prepare each row to be
added to database if we are able to prepare all rows successfully;
and (2) if not able to prepare all rows successfully, show error
messages and destroy batch.

Show errors for Csvlint-valid files CSV library can't parse

Currently, you are allowed to proceed to run the pre-processing job
with such a file, but the job fails when processing hits a row that
causes a CSV::MalformedCSVError. Unfortunately, this often occurs at
the very end of the file.

This commit adds a separate `csv_parse_validator_for(batch)` method
that is only called if Csvlint reports the file is valid.

CSV-Parsing the whole file before creating the batch may seem like
overkill, but (a) I can't come up with a better way to catch all
possible issues; and (b) it seems better than using system
resources/user time to pre-process a whole batch that is only going to
fail at the end.

Thinking forward: If we are eventually moving to using the database to
store information about each row and its state/status, then this step
can be reworked to serve a dual purpose: (1) prepare each row to be
added to database if we are able to prepare all rows successfully;
and (2) if not able to prepare all rows successfully, show error
messages and destroy batch.
@kspurgin kspurgin marked this pull request as ready for review August 2, 2024 21:58
@kspurgin kspurgin requested review from mark-cooper and removed request for mark-cooper August 2, 2024 21:58
@kspurgin kspurgin marked this pull request as draft August 2, 2024 22:20
@kspurgin kspurgin marked this pull request as ready for review August 9, 2024 01:36
@kspurgin kspurgin closed this Aug 9, 2024
@kspurgin
Copy link
Contributor Author

kspurgin commented Aug 9, 2024

Splitting this what this was originally going to cover into two PRs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Preprocessing step fails at 99% when extra blank line on end of CSV
1 participant