Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Quoted CRs in CRLF file incorrectly result in validation errors #198

Open
andylolz opened this issue Apr 18, 2017 · 7 comments
Open

Quoted CRs in CRLF file incorrectly result in validation errors #198

andylolz opened this issue Apr 18, 2017 · 7 comments

Comments

@andylolz
Copy link

andylolz commented Apr 18, 2017

So e.g.

"Foo","Bsr","Baz"^M
"Biff","Baff","Boff
"^M
"Qux","Teaspoon","Doge"^M

…results in:

$ csvlint test.csv
.!.
test.csv is INVALID
1. unclosed_quote. Row: 2. "Biff","Baff","Boff
"

According to the RFC 4180, though, I think this should be fine.

@andylolz
Copy link
Author

andylolz commented Apr 27, 2017

By default, the type of line endings used are auto-detected:
https://github.com/theodi/csvlint.rb/blob/873617f5/lib/csvlint/validate.rb#L302

…but the CSV is parsed row-by-row:
https://github.com/theodi/csvlint.rb/blob/873617f5/lib/csvlint/validate.rb#L182

So in the case described above, auto-detect gets it wrong:

irb(main):001:0> require 'csv'
=> true
irb(main):002:0> # load the data above
irb(main):003:0* contents = ["Foo,Bsr,Baz\r\n", "Biff,Baff,\"Boff\n", "\"\r\n", "Qux,Teaspoon,Doge\r\n"]
=> ["Foo,Bsr,Baz\r\n", "Biff,Baff,\"Boff\n", "\"\r\n", "Qux,Teaspoon,Doge\r\n"]
irb(main):004:0> 
irb(main):005:0* # try to parse the second row, auto-detecting line endings
irb(main):006:0* CSV.parse_line(contents[1] + contents[2], {:row_sep => :auto})
CSV::MalformedCSVError: Unclosed quoted field on line 1.
  from /csv.rb:1872:in `block in shift'
  from …/csv.rb:1779:in `loop'
  from …/csv.rb:1779:in `shift'
  from /csv.rb:1310:in `parse_line'
  from (irb):6
  from …/irb:11:in `<main>'

…But if you manually specify the line endings, everything works correctly:

irb(main):007:0* # try to parse the second row, manually specifying '\r\n' line endings
irb(main):008:0* CSV.parse_line(contents[1] + contents[2], {:row_sep => '\r\n'})
=> ["Biff", "Baff", "Boff\n"]

@andylolz
Copy link
Author

andylolz commented Apr 27, 2017

I guess the solution would be to set @csv_options[:row_sep] to @line_breaks or something like that.

@andylolz andylolz changed the title Quoted CRs in CRLF file results in linting errors Quoted CRs in CRLF file incorrectly result in validation errors Apr 27, 2017
@Floppy
Copy link
Member

Floppy commented Apr 27, 2017

Ah yes, @JeniT and I were discussing this the other day, and we wondered if it was because of mixed line breaks as you've found. I think your solution will probably work; I need to check what the line-break behaviour inside quotes should be as well.

@andylolz
Copy link
Author

I realised I should be explicit that parsing the whole thing with auto-detect line endings switched on works fine:

irb(main):009:0* CSV.parse(contents.join, {:row_sep => :auto})
=> [["Foo", "Bsr", "Baz"], ["Biff", "Baff", "Boff\n"], ["Qux", "Teaspoon", "Doge"]]

So this is only an issue when parsing row-by-row.

@kspurgin
Copy link

I have run into this same issue or a very similar one. Behavior of Csvlint and stdlib CSV parse documented here. (Test structure described here)

@kspurgin
Copy link

The contributing guidelines say: "Before you start coding, please reach out to us either on our gitter channel or by tagging a repository administrator on the issue ticket you are interested in contributing towards to indicate your interest in helping."

I am not finding a list of the repository admins, and apparently I need push access to a repo to get that info from GitHub API, so I'm making assumptions based on who has merged all recent PRs and tagging @Floppy here to indicate that I'd like to attempt a fix for this issue.

Rationale/use case/background:

We use csvlint in a batch data import application to validate user-generated CSVs before allowing import of data in the files. Using csvlint to prevent the import of CSVs with issues like ragged rows, invalid encoding, and stray/unclosed quotes is important to prevent clients from inadvertently loading malformed data.

This issue is currently blocking some of our clients from importing valid CSVs with mixed EOL characters inside quoted values (an artifact of original data source, and what OS they used to prepare the data, apparently). If you export data from our application as CSV on a Windows machine for roundtripping via our import tool, you get mixed EOL characters and can't re-ingest the data because csvlint flags the CSV as invalid. Our users are primarily collection managers in GLAM institutions, not data wrangling/tech savvy people; they are going to tune out/give up if anyone starts explaining CR, LF, vs CRLF to them.

Mixed EOL characters are handled fine by the standard Ruby CSV library, and are not prohibited by RFC 4180 according to my reading, so csvlint flagging these as invalid needs to be fixed.

@kspurgin
Copy link

kspurgin commented Sep 30, 2023

Hey, I solved this in two different ways, which can be examined here: https://github.com/kspurgin/csvlint.rb/pulls

Everything in the features directory is identical for both PRs. Only lib/csvlint/validate.rb differs.

I prefer PR2 as it feels more expressive of the actual problem, and less hacky than the other.

From what I can tell, the root of the problem is: once we are at Validate.parse_contents, we have already gotten the item of interest down to a single line of the CSV. Csvlint::Validator.parse_line already did a great job of that (handling quoted EOL chars fine), and now we can tell what characters are at the end of this line. But we then call LineCSV.parse_line on it, passing the row_sep: :auto option in many cases. LineCSV subclasses CSV, whose Parser does a terrible job determining the proper row separator from a single line if there are EOL characters inside quotes.

PR2 says "Ok, whether the correct EOL char (according to stated dialect) was used, or whether there were mixed EOL chars is all handled elsewhere. If row_sep is set to something specific, I will continue to pass @csv_options to LineCSV.parse_line, because it might specify something real weird. If it's set to :auto, however, we expect it to be LF, CR, or CRLF, and we can easily determine which of those terminates this line, so we set that as the line-specific row_sep passed to LineCSV with the line, so CSV::Parser doesn't have to guess.

If you let me know which of these is acceptable (if either is), I'll do a real PR with better commit messages.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants