Skip to content

Commit

Permalink
Show errors for Csvlint-valid files CSV library can't parse
Browse files Browse the repository at this point in the history
Currently, you are allowed to proceed to run the pre-processing job
with such a file, but the job fails when processing hits a row that
causes a CSV::MalformedCSVError. Unfortunately, this often occurs at
the very end of the file.

This commit adds a separate `csv_parse_validator_for(batch)` method
that is only called if Csvlint reports the file is valid.

CSV-Parsing the whole file before creating the batch may seem like
overkill, but (a) I can't come up with a better way to catch all
possible issues; and (b) it seems better than using system
resources/user time to pre-process a whole batch that is only going to
fail at the end.

Thinking forward: If we are eventually moving to using the database to
store information about each row and its state/status, then this step
can be reworked to serve a dual purpose: (1) prepare each row to be
added to database if we are able to prepare all rows successfully;
and (2) if not able to prepare all rows successfully, show error
messages and destroy batch.

Show errors for Csvlint-valid files CSV library can't parse

Currently, you are allowed to proceed to run the pre-processing job
with such a file, but the job fails when processing hits a row that
causes a CSV::MalformedCSVError. Unfortunately, this often occurs at
the very end of the file.

This commit adds a separate `csv_parse_validator_for(batch)` method
that is only called if Csvlint reports the file is valid.

CSV-Parsing the whole file before creating the batch may seem like
overkill, but (a) I can't come up with a better way to catch all
possible issues; and (b) it seems better than using system
resources/user time to pre-process a whole batch that is only going to
fail at the end.

Thinking forward: If we are eventually moving to using the database to
store information about each row and its state/status, then this step
can be reworked to serve a dual purpose: (1) prepare each row to be
added to database if we are able to prepare all rows successfully;
and (2) if not able to prepare all rows successfully, show error
messages and destroy batch.
  • Loading branch information
kspurgin committed Aug 9, 2024
1 parent ef8f578 commit 8ccbf7f
Show file tree
Hide file tree
Showing 6 changed files with 183 additions and 35 deletions.
55 changes: 43 additions & 12 deletions app/controllers/batches_controller.rb
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,7 @@ def create
redirect_to new_batch_step_preprocess_path(@batch)
end
else
@batch.destroy
format.html { return redirect_to new_batch_path }
end
else
Expand Down Expand Up @@ -61,22 +62,29 @@ def format_csv_validation_error(error)
"#{loc}: #{error.category.capitalize} error: #{error.type}"
end

# These tests need to run in order.
# csv_length_ok? depends on csvlint_ok? passing and setting batch.num_rows
# Eventually I want to add csv_encoding_ok? prior to running csvlint_ok? to
# better handle invalid encoding issues.
def spreadsheet_ok?
continue = false
Batch.csv_validator_for(@batch) do |validator|
if validator.valid? && within_csv_row_limit?(validator.row_count)
csvlint_ok? &&
csv_length_ok? &&
csv_parse_ok?
end

def csvlint_ok?
Batch.csvlint_validator_for(@batch) do |validator|
if validator.valid?
@batch.update(num_rows: validator.row_count)
continue = true
elsif !within_csv_row_limit?(validator.row_count)
@batch.destroy # scrap it, they'll have to start over
flash[:csv_too_long] = true
true
else
@batch.destroy # scrap it, they'll have to start over
flash[:csv_lint] = validator.errors
.map{ |err| format_csv_validation_error(err) }
.join('|||')
flash[:csvlint] = validator.errors
.map { |err| format_csv_validation_error(err) }
.join('|||')
false
end
end
end

def csv_length_ok?
return true if @batch.num_rows <= Rails.configuration.csv_max_rows
Expand All @@ -85,7 +93,30 @@ def csv_length_ok?
false
end

continue
def csv_parse_ok?
Batch.csv_parse_validator_for(@batch) do |parsed|
return true if parsed == :success

# Catches blank rows between data rows or at the end of the CSV
# that have mixed EOL characters, which Csvlint does not flag as
# invalid, but that raise a malformed CSV error in the CSV library
#
# This also catches extraneous CRLF EOL at the end of final data
# line, when EOL char used elsewhere in file is CR
if parsed.start_with?('New line must be')
flash[:blank_rows] = true
# Catches extraneous EOL chars at end of final row that do not match
# the EOL char used in the rest of the file, which Csvlint calls
# calls valid, but that raise a malformed CSV error in the CSV library
elsif parsed.start_with?('Unquoted fields do not allow new line')
flash[:last_row_eol] = true
# Catches any other malformed CSV errors raised by the CSV library, for
# files Csvlint called valid
else
flash[:malformed_csv] = parsed
end
end
false
end

def set_batch
Expand Down
18 changes: 17 additions & 1 deletion app/models/batch.rb
Original file line number Diff line number Diff line change
Expand Up @@ -99,7 +99,7 @@ def self.content_types
CONTENT_TYPES
end

def self.csv_validator_for(batch)
def self.csvlint_validator_for(batch)
# raise unless batch.spreadsheet.attached? # TODO

batch.spreadsheet.open do |spreadsheet|
Expand All @@ -111,12 +111,28 @@ def self.csv_validator_for(batch)
end
end

def self.csv_parse_validator_for(batch)
# raise unless batch.spreadsheet.attached? # TODO

batch.spreadsheet.open do |spreadsheet|
yield batch.parse_spreadsheet(spreadsheet.path)
end
end

def self.expired(&block)
all.in_batches do |b|
b.find_all(&:expired?).each(&block)
end
end

def parse_spreadsheet(path)
File.open(path, encoding: 'bom|utf-8') { |file| CSV.table(file) }
rescue CSV::MalformedCSVError => e
e.message
else
:success
end

private

def ensure_batch_config_is_json
Expand Down
108 changes: 88 additions & 20 deletions app/views/batches/_form.html.erb
Original file line number Diff line number Diff line change
@@ -1,24 +1,29 @@
<%= form_with(model: batch, local: true, id: dom_id(batch), class: 'mt-3', data: { 'reflex-root': '#mappers' }) do |form| %>
<% if flash[:csv_lint] %>
<article class="flash message is-danger is-small mt-3">
<div class="message-header">
<p>Uploaded CSV is invalid</p>
</div>
<div class="message-body">
<p class="content">
Consult <a href="https://github.com/Data-Liberation-Front/csvlint.rb/#errors">csvlint's errors documentation</a> for more info on what the different errors mean.</p>
<ul>
<% flash[:csv_lint].split('|||').each do |err| %>
<li><%= err %></li>
<% end %>
</ul>
<p>&nbsp;</p>
<p>If you are getting "unknown_error", some things to check for might include:</p>
<ul>
<li>Line breaks within a CSV cell may be breaking the CSV structure</li>
</ul>
</div>
</article>
<% if flash[:csvlint] %>
<article class="flash message is-danger is-small mt-3">
<div class="message-header">
<p>Uploaded CSV is invalid</p>
</div>
<div class="message-body">
<p class="content">
Consult <a href="https://github.com/Data-Liberation-Front/csvlint.rb/#errors">csvlint's errors documentation</a> for more info on what the different errors mean.</p>
<ul>
<% flash[:csvlint].split('|||').each do |err| %>
<li><%= err %></li>
<% end %>
</ul>
<p>&nbsp;</p>
<p>If you are getting "unknown_error", some things to check for might include:</p>
<ul>
<li>Line breaks within a CSV cell may be breaking the CSV structure</li>
<li>Remove any blank lines in the CSV, including at the end of the file.
</ul>
<p>&nbsp;</p>
<p>The <a href="https://collectionspace.atlassian.net/wiki/spaces/COL/pages/3093004295/User+Manual+CSV+Importer+Dealing+with+invalid+CSVs#Finding-and-removing-blank-lines%2Frows">
CSV Importer User Manual</a> contains some tips and tricks for dealing
with blank rows.</p>
</div>
</article>
<% end %>
<% if flash[:csv_too_long] %>
Expand All @@ -32,6 +37,69 @@
</article>
<% end %>
<% if flash[:blank_rows] %>
<article class="flash message is-danger is-small mt-3">
<div class="message-header">
<p>CSV contains invalid blank rows</p>
</div>
<div class="message-body">
<p class="content">
Please delete any blank/empty rows from your CSV file. This includes empty
rows in between actual data rows, as well as extra empty rows at the end
of your file.
</p>
<p class="content">
The <a href="https://collectionspace.atlassian.net/wiki/spaces/COL/pages/3093004295/User+Manual+CSV+Importer+Dealing+with+invalid+CSVs#Finding-and-removing-blank-lines%2Frows">
CSV Importer User Manual</a> contains some tips and tricks for dealing
with blank rows.
</p>
</div>
</article>
<% end %>
<% if flash[:last_row_eol] %>
<article class="flash message is-danger is-small mt-3">
<div class="message-header">
<p>CSV has invalid end-of-line character on last row</p>
</div>
<div class="message-body">
<p class="content">
The last row of your CSV ends with a line break that does not
match the line breaks used in the rest of the CSV file.
The easiest way to fix this is:
</p>
<ul>
<li>Open your CSV using a plain-text editor like Notepad or Notepad++</li>
<li>Go to the very end of the file</li>
<li>If your cursor can be put at the beginning of a blank line at the
bottom of the file, backspace until the farthest your cursor can go is
the end of the last line of data</li>
</ul>
<p class="content">
If you use Notepad++, the Edit &gt; EOL Conversion feature should also fix
this issue.
</p>
</div>
</article>
<% end %>
<% if flash[:malformed_csv] %>
<article class="flash message is-danger is-small mt-3">
<div class="message-header">
<p>Malformed CSV</p>
</div>
<div class="message-body">
<p class="content">
This CSV has structural issues that cause the following error
when the CSV Importer attempts to parse it: <%= flash[:malformed_csv] %>.
The <a href="https://collectionspace.atlassian.net/wiki/spaces/COL/pages/3093004295/User+Manual+CSV+Importer+Dealing+with+invalid+CSVs">
CSV Importer User Manual</a> has some tips and tricks for dealing with
invalid CSVs which may help.
</p>
</div>
</article>
<% end %>

<div class="columns">
<div class="column">
<div class="columns is-multiline">
Expand Down
30 changes: 28 additions & 2 deletions test/controllers/batches_controller_test.rb
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,29 @@ class BatchesControllerTest < ActionDispatch::IntegrationTest
@invalid_params = @valid_params.dup
end

test 'cannot create batch with CSV found invalid by Csvlint' do
@invalid_params[:spreadsheet] =
fixture_file_upload('files/csvlint_invalid.csv', 'text/csv')

assert_no_difference('Batch.count') do
post batches_url, params: { batch: @invalid_params }
end
assert(flash.keys.include?('csvlint'), 'Expected flash[:csvlint]')
assert_redirected_to new_batch_path
end

test 'cannot create batch with CSV passed by Csvlint, but which'\
'blows up Ruby CSV library' do
@invalid_params[:spreadsheet] =
fixture_file_upload('files/crlf_eol_final_cr_eol.csv', 'text/csv')

assert_no_difference('Batch.count') do
post batches_url, params: { batch: @invalid_params }
end
assert(flash.keys.include?('last_row_eol'), 'Expected flash[:last_row_eol]')
assert_redirected_to new_batch_path
end

test 'a user can view batches' do
assert_can_view(batches_path)
end
Expand Down Expand Up @@ -70,17 +93,20 @@ class BatchesControllerTest < ActionDispatch::IntegrationTest
end

test 'should not create a batch with malformed csv' do
@invalid_params[:spreadsheet] = fixture_file_upload('files/malformed.csv', 'text.csv')
@invalid_params[:spreadsheet] =
fixture_file_upload('files/malformed.csv', 'text.csv')
assert_no_difference('Batch.count') do
post batches_url, params: { batch: @invalid_params }
end
end

test 'should not create a batch with greater than max limit csv' do
@invalid_params[:spreadsheet] = fixture_file_upload('files/too_large.csv', 'text.csv')
@invalid_params[:spreadsheet] =
fixture_file_upload('files/too_large.csv', 'text.csv')
assert_no_difference('Batch.count') do
post batches_url, params: { batch: @invalid_params }
end
assert(flash.keys.include?('csv_too_long'), 'Expected flash[:csv_too_long]')
end

test 'should not create a batch with malformed json' do
Expand Down
3 changes: 3 additions & 0 deletions test/fixtures/files/crlf_eol_final_cr_eol.csv
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
obj,note
val,val
val,val
4 changes: 4 additions & 0 deletions test/fixtures/files/csvlint_invalid.csv
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
obj,note
val,val
val,val

0 comments on commit 8ccbf7f

Please sign in to comment.