emp::DataFile output should comply with CSV standard RFC 4180 #489

Open
mmore500 opened this issue Nov 26, 2023 · 0 comments

mmore500 commented Nov 26, 2023

Is your feature request related to a problem? Please describe.

Serialization through emp::DataFile and deserialization through emp::File target the CSV format, but by default they support only a subset of it.

For example, this file

"a","b","c,d"
"""g""",x,y

should be read as

a b c,d
"g" x y

according to RFC 4180. However, it would currently read as

"a" "b" "c d"
"""g""" x y

Note that in the RFC 4180-compliant version, the outermost quotes in """g""" enclose the field and each doubled quote inside is an escaped literal quote, making the actual value "g".
In the current reading, every quote is interpreted literally, so the field reads as """g""".
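
To make the expected reading concrete, here is a minimal sketch of RFC 4180 field splitting for a single record. `ParseRFC4180Record` is a hypothetical free function, not part of Empirical, and it assumes the record sits on one line (quoted fields that span line breaks are not handled).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not an Empirical API): split one CSV record into
// fields following the RFC 4180 quoting rules.
std::vector<std::string> ParseRFC4180Record(const std::string & line) {
  std::vector<std::string> fields;
  std::string field;
  bool in_quotes = false;
  for (std::size_t i = 0; i < line.size(); ++i) {
    const char c = line[i];
    if (in_quotes) {
      if (c == '"') {
        if (i + 1 < line.size() && line[i + 1] == '"') {
          field += '"';       // "" inside a quoted field is one literal quote
          ++i;                // skip the second quote of the pair
        } else {
          in_quotes = false;  // closing quote of the field
        }
      } else {
        field += c;           // commas are literal inside quotes
      }
    } else if (c == '"' && field.empty()) {
      in_quotes = true;       // opening quote at the start of a field
    } else if (c == ',') {
      fields.push_back(field);
      field.clear();
    } else {
      field += c;
    }
  }
  fields.push_back(field);    // last field has no trailing comma
  return fields;
}
```

Under this sketch, the first example line splits into the three fields a, b, and c,d, and the first field of the second line is "g" with its quotes kept as part of the value.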

Describe the solution you'd like

Probably, for performance reasons, the emp::DataFile and emp::File default behavior should not change.
However, RFC4180 modes or classes should be available.

In debug mode, emp::DataFile/emp::File should probably warn of RFC4180 noncompliance where pertinent.
An easy way to do this would be to compare the results against an RFC4180-enabled interpretation and warn when the naive interpretation differs.
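
A rough sketch of that debug-mode comparison follows. Both `NaiveSplit` and `CheckRFC4180Agreement` are hypothetical names, not Empirical APIs. The RFC 4180 interpretation is passed in as an already-parsed field list, and the `NDEBUG` guard stands in for whatever debug switch the library would actually use.

```cpp
#include <iostream>
#include <string>
#include <vector>

// Naive interpretation (assumed here): split on every comma and keep
// quote characters as literal text.
std::vector<std::string> NaiveSplit(const std::string & line) {
  std::vector<std::string> fields(1);
  for (const char c : line) {
    if (c == ',') fields.emplace_back();  // start a new, empty field
    else fields.back() += c;
  }
  return fields;
}

// Hypothetical check: returns true when the naive split matches the
// supplied RFC 4180 interpretation; in debug builds, warns on mismatch.
bool CheckRFC4180Agreement(const std::string & line,
                           const std::vector<std::string> & rfc_fields,
                           std::ostream & warn = std::cerr) {
  const bool agrees = (NaiveSplit(line) == rfc_fields);
#ifndef NDEBUG
  if (!agrees) {
    warn << "Warning: naive parse of [" << line
         << "] differs from the RFC 4180 interpretation\n";
  }
#endif
  (void) warn;  // silence unused-parameter warnings in release builds
  return agrees;
}
```

Because the warning is compiled out when `NDEBUG` is defined, release builds pay only for the comparison, and a real implementation would presumably skip that as well.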

Describe alternatives you've considered

Users could currently get part of the way by setting the beginning, separator, and end delimiters to ", "," and ", respectively for serialization.
This delimiter kludge wouldn't work as a deserialization solution because it would fail on plain csv files like

a,b,c
1,2,3 

For serialization, this delimiter kludge would add unnecessary quotes to much of the CSV output and would still fail to escape " characters inside output strings as "".
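
For contrast with the delimiter kludge, a compliant writer only needs to quote fields that contain a comma, a double quote, or a line break, and to double any embedded quotes. The sketch below illustrates this with two hypothetical helpers, `EscapeRFC4180Field` and `JoinRFC4180Record`. Neither exists in Empirical, and how such logic would hook into emp::DataFile is left open.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: quote a field only when RFC 4180 requires it,
// doubling any embedded double quotes.
std::string EscapeRFC4180Field(const std::string & field) {
  if (field.find_first_of(",\"\r\n") == std::string::npos) {
    return field;             // plain field, no quoting needed
  }
  std::string out = "\"";
  for (const char c : field) {
    if (c == '"') out += '"'; // escape " as ""
    out += c;
  }
  out += '"';
  return out;
}

// Hypothetical helper: join escaped fields into one comma-separated record.
std::string JoinRFC4180Record(const std::vector<std::string> & fields) {
  std::string line;
  for (std::size_t i = 0; i < fields.size(); ++i) {
    if (i > 0) line += ',';
    line += EscapeRFC4180Field(fields[i]);
  }
  return line;
}
```

With this approach, plain files such as the a,b,c example above are written without any quotes, while a field like c,d is emitted as "c,d".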

Additional context
RFC 4180 is available at https://www.rfc-editor.org/rfc/rfc4180.
The pertinent content is:

5. Each field may or may not be enclosed in double quotes (however some programs, such as Microsoft Excel, do not use double quotes at all). If fields are not enclosed with double quotes, then double quotes may not appear inside the fields. For example:
"aaa","bbb","ccc" CRLF
zzz,yyy,xxx
6. Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes. For example:
"aaa","b CRLF
bb","ccc" CRLF
zzz,yyy,xxx
7. If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote. For example:
"aaa","b""bb","ccc"