My preferred approach to solving Data-Liberation-Front#198
This issue is caused by mixed EOL indicators used in the same file.
The usual pattern we see [1] is CRLF (`\r\n`) at the end of actual CSV rows, but just LF (`\n`) used inside quoted text field values.
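To make the pattern concrete, here is a minimal sketch that uses only Ruby's standard CSV library, not csvlint.rb itself. The sample data is made up for illustration. It suggests why the same row can parse as part of a whole file yet fail when parsed alone with automatic row-separator detection.

```ruby
require "csv"

# Made-up sample: rows end in CRLF, but the quoted "note" value contains a bare LF.
header = "id,note\r\n"
row    = "1,\"first line\nsecond line\"\r\n"

# Whole-file parse: the first line ending in the data is the header's CRLF,
# so auto-detection settles on "\r\n".
rows = CSV.parse(header + row, row_sep: :auto)

# Single-line parse: the first line ending in the data is the LF inside the
# quotes, so auto-detection settles on "\n" and the trailing "\r" is unexpected.
line_result =
  begin
    CSV.parse_line(row, row_sep: :auto)
    :parsed
  rescue CSV::MalformedCSVError => e
    puts "parse_line failed: #{e.message}"
    :malformed
  end

# With an explicit row separator, the same single line parses.
fields = CSV.parse_line(row, row_sep: "\r\n")
```

The exact error message depends on the version of the CSV library, so the sketch only records that a `CSV::MalformedCSVError` was raised.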
To validate a CSV, csvlint.rb currently does the following:

1. `Validator.parse_line` goes through the file line-by-line, looking for overall invalid encoding. Notably, this keeps track of quotes in order to join multiple lines where the line breaks occur inside quotes. The issues in question are usually not found in this process. The line it passes to `validate_line` is usually the whole, correctly joined line.
2. `Validator.validate_line` calls `Validator.parse_contents` on the line.
3. `Validator.parse_contents` calls `LineCSV.parse_line` (LineCSV is a subclass of the standard CSV library) with the same `@csv_options` derived from the dialect passed to `Csvlint::Validator`. If a "lineTerminator" (mapped to `csv_options[:row_sep]`) value is not explicitly given in the dialect specification, Csvlint sets it to `:auto`.

`CSV.parse` on the whole file with `row_sep: :auto` generally works because the parser can make use of the context of the full file. Calling `parse_line` with `row_sep: :auto` blows up, apparently because the parser assumes the first EOL it hits is the real one.

So, to fix this:
1. In `LineCSV.parse_line`, we check the `csv_options[:row_sep]` value.
2. If it is not `:auto`, we pass it through as `csv_options_for_line`.
3. If it is `:auto` and the line ends with `\r\n`, `\r`, or `\n`, we merge that string into the `@csv_options` instance variable as the new, explicit `:row_sep` value. (So, on subsequent lines, we pass the no-longer-auto-row_sep `@csv_options` through -- better for performance, AND we'll likely hear about it if any subsequent lines don't end with the `:row_sep` value.)

Footnotes
[1] Because it's how CollectionSpace usually exports data containing line breaks entered by Windows users, who export data for data round-tripping via the CSV Importer, which uses csvlint.rb for CSV validation, and it currently can't handle this. RFC 4180 prescribes CRLF EOLs, but entering data inside a field in the CollectionSpace UI appears to enter just LF, and that's what comes out in the quoted text.
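The fix steps described above can be sketched roughly as follows. This is an illustration only, not the actual change to `LineCSV`: the `RowSepResolver` class and its method names are hypothetical stand-ins, and it assumes the line handed to it still carries its trailing EOL.

```ruby
require "csv"

# Hypothetical stand-in for the fix described above; not the csvlint.rb code.
class RowSepResolver
  # "\r\n" is listed first so a CRLF ending is not mistaken for a bare LF.
  EOLS = ["\r\n", "\r", "\n"].freeze

  attr_reader :csv_options

  def initialize(csv_options)
    @csv_options = csv_options
  end

  def parse_line(line)
    if @csv_options[:row_sep] == :auto
      # Take the row separator from the end of the line itself, then keep it
      # for all subsequent lines instead of auto-detecting again.
      eol = EOLS.find { |candidate| line.end_with?(candidate) }
      @csv_options = @csv_options.merge(row_sep: eol) if eol
    end
    CSV.parse_line(line, **@csv_options)
  end
end

resolver = RowSepResolver.new({ row_sep: :auto })
first  = resolver.parse_line("1,\"first line\nsecond line\"\r\n")
second = resolver.parse_line("2,plain\r\n")
```

After the first call, `resolver.csv_options[:row_sep]` holds the explicit `"\r\n"`, so later lines that end differently would no longer be silently accepted by auto-detection.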