
CSV parsing


We start with the spec: http://tools.ietf.org/html/rfc4180#page-3

And we specify these restrictions/alterations/clarifications:

  • We allow any encoding. (Instead of relying on ASCII character codes as the spec does, we use Unicode characters.) The user specifies the encoding.
  • The first row must be a header.
  • There must be at least one row of data.
  • There must be a text header and an id header (case-insensitive). Both are string fields, and the values beneath may be of arbitrary length. (We may truncate one or both when inserting them into the database, but we don't truncate them when we're parsing.)
  • We treat \u0085, \r, \n, or \r\n as newlines.
  • We only allow one separator: ,
  • We only allow one quotation mark: "
  • We do not support a MySQL-like "escape" character. MySQL is weird.
  • We allow \t.
  • We nix \0 and Unicode special characters (byte-order marker, unpaired surrogates).
  • We transform all other control characters into . (Because really, who wants a parse error?) A sketch of these character rules follows this list.
  • We allow lines that don't have enough values (including blank lines), or lines that have too many values.
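
The character rules above boil down to a single per-code-point filter. Here's a minimal sketch in Java; the class and method names are ours, purely for illustration:

```java
public final class CharRules {
    /** Map one code point to what the parser keeps, or -1 to drop it entirely. */
    static int sanitize(int cp) {
        if (cp == 0x0000 || cp == 0xFEFF) return -1;  // nix \0 and the byte-order marker
        if (cp >= 0xD800 && cp <= 0xDFFF) return -1;  // a lone (unpaired) surrogate shows up in this range
        if (cp == '\t' || cp == '\r' || cp == '\n' || cp == 0x0085) return cp;  // \t and the newline characters pass through
        if (Character.isISOControl(cp)) return '.';   // every other control character becomes .
        return cp;                                    // everything else is left alone
    }
}
```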

Here's what happens on the server when the file has mistakes:

  • The overriding rule: we must not fail, and we can't ask the user for input
  • First comes decoding bytes to Java Strings. On failure, we must replace the invalid characters with the Unicode replacement character, so we don't completely balk when the user specifies the wrong character set. Configure a CharsetDecoder to do this: see http://docs.oracle.com/javase/6/docs/api/java/nio/charset/CharsetDecoder.html#onMalformedInput(java.nio.charset.CodingErrorAction) -- you can pass that CharsetDecoder to an InputStreamReader. (A sketch follows this list.)
  • We assume the file is valid. That means if there's an unescaped quote on one line, we keep reading until the end-quote, then until the next comma, etc. (OpenCSV does this automatically. We don't necessarily need to treat OpenCSV as an authoritative spec, but we do need to make sure our parser doesn't throw errors.)
  • If a row is missing either an ID or Text, we skip it. (We ignore all other values, whether or not they have accompanying headers.)
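
A minimal sketch of that lenient decoding step, assuming the charset is already known (the class name is ours):

```java
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public final class LenientReader {
    /** A reader that never throws on bad bytes: they become U+FFFD instead. */
    public static Reader open(InputStream in, Charset charset) {
        CharsetDecoder decoder = charset.newDecoder()
            .onMalformedInput(CodingErrorAction.REPLACE)       // malformed byte sequences...
            .onUnmappableCharacter(CodingErrorAction.REPLACE)  // ...and unmappable characters...
            .replaceWith("\uFFFD");                            // ...become the replacement character
        return new BufferedReader(new InputStreamReader(in, decoder));
    }
}
```

The resulting Reader can be handed straight to OpenCSV, so the "must not fail" rule survives the decoding step.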

Here's what happens on the client when the file has mistakes:

  • The overriding rule: we must fail, and we must ask the user for input
  • We read the first slice of the file (1MB, say) and guess at the encoding. If multiple encodings seem valid, we offer the user a choice, defaulting to the most likely. (What's the most likely? UTF-8? Windows-1252? ISO-8859-1?) If only one encoding works, we report it to the user. We don't let the user choose an encoding that doesn't match the first slice of the document. (A sketch of the validity check follows this list.)
  • If no encodings are valid, we report that to the user.
  • If there is no header with "text" and "id" columns (case-insensitive), we report that to the user.
  • If there is no data, we report that to the user.
  • If the file is INVALID, we show the user the point in the file that is invalid (line/column), and we tell the user to use a spreadsheet program to import and export the file.
  • We show the user a preview of the file, with , and " as separator and quote. We do not let the user change the separator or quote; we tell the user to use a spreadsheet program to do that.
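
The validity check behind the encoding guess could look like the following. The client runs in the browser, but the logic is the same; this sketch is in Java for consistency with the server-side example, and the candidate ordering is an assumption, not a settled decision:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public final class EncodingGuesser {
    // The allowed encodings, in an assumed most-likely-first order.
    static final List<Charset> CANDIDATES = Arrays.asList(
        Charset.forName("UTF-8"),
        Charset.forName("UTF-16"),
        Charset.forName("windows-1252"));

    /** Return every candidate that decodes the first slice of the file without error. */
    static List<Charset> validEncodings(byte[] firstSlice) {
        List<Charset> valid = new ArrayList<Charset>();
        for (Charset charset : CANDIDATES) {
            CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
            try {
                // Throws CharacterCodingException on any invalid byte sequence.
                // Caveat: a slice cut in the middle of a multi-byte character will
                // spuriously fail; a fuller version would tolerate a truncated tail.
                decoder.decode(ByteBuffer.wrap(firstSlice));
                valid.add(charset);
            } catch (CharacterCodingException e) {
                // This encoding can't represent the slice; leave it out.
            }
        }
        return valid;
    }
}
```

An empty result is the "no encodings are valid" case above; a single-element result is the "only one encoding works" case.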

We allow these encodings (all must be supported on client and server):

  • UTF-8: Unicode; a superset of ASCII
  • UTF-16: Unicode
  • Windows-1252: (practically) a superset of ISO-8859-1, a.k.a. "latin-1", with smart quotes