Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Line specific opts #2

Merged
merged 4 commits into from
Jul 30, 2024
Merged

Line specific opts #2

merged 4 commits into from
Jul 30, 2024

Conversation

kspurgin
Copy link
Collaborator

@kspurgin kspurgin commented Sep 30, 2023

My preferred approach to solving Data-Liberation-Front#198

This issue is caused by mixed EOL indicators used in the same file.

The usual pattern we see1, is CRLF (\r\n) at the end of actual CSV rows, but just LF (\n) used inside quoted text field values.

To validate a CSV, csvlint.rb currently does the following:

  • Validator.parse_line goes through the file line-by-line, looking for overall invalid encoding. Notably, this keeps track of quotes in order to join multiple lines where the line breaks occur inside quotes. The issues in question are usually not found in this process. The line it passes to validate_line is usually the whole, correctly joined line.
  • Validator.validate_line calls Validator.parse_contents on the line
  • Validator.parse_contents calls LineCSV.parse_line (LineCSV is a subclass of the standard CSV library) with the same @csv_options derived from the dialect passed to Csvlint::Validator. If no "lineTerminator" (mapped to csv_options[:row_sep]) value is not explicitly given in the dialect specification, it gets set by Csvlint to :auto.
  • While calling CSV.parse on the whole file with row_sep: :auto generally works due to the parser making use of the context of the full file, calling parse_line with row_sep: :auto blows up because the parser assumes the first EOL it hits is the real one or something.

So, to fix this:

  • Before we call LineCSV.parse_line, we check the csv_options[:row_sep] value.
  • If it is NOT :auto, we pass it through as csv_options_for_line.
  • Otherwise we check the actual end of the line we are about to pass through. If the full line ends with \r\n, \r, or \n, we merge that string into the @csv_options instance variable as the new, explicit :row_sep value. (So, on subsequent lines, we pass the no-longer-auto-row_sep @csv_options through -- better for performance, AND we'll likely hear about it if any subsequent lines don't end with the :row_sep value.)

Footnotes

  1. Because it's how CollectionSpace usually exports data containing line breaks entered by Windows users, who export data for data round-tripping via the CSV Importer, which uses csvlint.rb for CSV validation, and it currently can't handle this. RFC 4180 prescribes CRLF EOLs, but entering data inside a field in the CollectionSpace UI appears to enter just LF, and that's what comes out in the quoted text.

@kspurgin
Copy link
Collaborator Author

kspurgin commented Oct 2, 2023

All tests pass, but standardrb lint is failing due to issues in code I did not change or add in this PR, so not sure how to deal with that.

@kspurgin kspurgin merged commit 4cb0fb5 into main Jul 30, 2024
32 checks passed
@kspurgin kspurgin deleted the line-specific-opts branch July 30, 2024 22:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant