NIFI-12674 Modified ValidateCSV to make the schema optional if a head… #8362

Open
wants to merge 3 commits into
base: main
Freedom9339
Contributor

…er is provided. Added validate on attribute option.

Summary

NIFI-12674

Made the schema optional for the ValidateCSV processor if a header is provided. In this case, only the structure of the CSV will be validated, using the header to determine how many fields each line should have.
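
As a rough illustration of the header-only check described above, here is a minimal sketch in plain Java. It is not the processor's code: the class and method names are hypothetical, and the field counting is deliberately naive (it ignores quoting and escaping, which a real CSV parser would handle).

```java
public class HeaderStructureCheck {

    // Hypothetical helper: counts fields by counting delimiters.
    // Assumption: no quoted fields containing the delimiter.
    public static int countFields(String line, char delimiter) {
        int count = 1;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == delimiter) {
                count++;
            }
        }
        return count;
    }

    // Returns true when every line after the header has as many fields as the header.
    public static boolean isStructurallyValid(String csv, char delimiter) {
        if (csv.isEmpty()) {
            return false; // no header line to compare against
        }
        String[] lines = csv.split("\\R"); // split on any line break
        int expected = countFields(lines[0], delimiter);
        for (int i = 1; i < lines.length; i++) {
            if (countFields(lines[i], delimiter) != expected) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isStructurallyValid("id,name\n1,Ann\n2,Bob\n", ','));
        System.out.println(isStructurallyValid("id,name\n1,Ann,extra\n", ','));
    }
}
```

The first call has two fields on every line and passes; the second has a three-field row under a two-field header and fails.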

Additionally, a new validation strategy was implemented: Validate on Attribute. This works similarly to ValidateXML, in which the value of a given FlowFile attribute is treated as the contents of a CSV file. Validation is performed on that attribute and not on the content of the FlowFile.

I would also like to note that I made the stream a variable that can be assigned to either the FlowFile content or the value of the attribute. In doing this, I removed the need for an inner method, and thus all of the variables that previously needed to be Atomic References could now be regular variables.
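
The idea of a single stream variable fed from one of two sources can be sketched without any NiFi classes. In this sketch the attribute map, the `sourceAttribute` parameter, and the byte array standing in for FlowFile content are all assumptions made for illustration; in the processor the content would come from the session rather than from an in-memory array.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;

public class CsvSourceSelection {

    // Picks the CSV source: the named attribute when one is configured,
    // otherwise the content bytes (a stand-in for FlowFile content).
    public static InputStream selectStream(Map<String, String> attributes,
                                           String sourceAttribute,
                                           byte[] content) {
        if (sourceAttribute != null && !sourceAttribute.isEmpty()) {
            String value = attributes.get(sourceAttribute);
            // Assumption for this sketch: a missing attribute is treated as empty input.
            String csv = (value == null) ? "" : value;
            return new ByteArrayInputStream(csv.getBytes(StandardCharsets.UTF_8));
        }
        return new ByteArrayInputStream(content);
    }

    // Small helper so callers can inspect the selected stream as text.
    public static String readAsString(InputStream in) {
        try (InputStream stream = in) {
            return new String(stream.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] content = "x,y\n3,4".getBytes(StandardCharsets.UTF_8);
        Map<String, String> attributes = Map.of("csv.data", "a,b\n1,2");
        System.out.println(readAsString(selectStream(attributes, "csv.data", content)));
        System.out.println(readAsString(selectStream(attributes, null, content)));
    }
}
```

Because both branches produce an `InputStream`, the validation logic that follows can be written once against the stream, which is why no inner callback (and no atomic holder for its results) is needed.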

Tracking

Please complete the following tracking steps prior to pull request creation.

Issue Tracking

Pull Request Tracking

  • Pull Request title starts with Apache NiFi Jira issue number, such as NIFI-00000
  • Pull Request commit message starts with Apache NiFi Jira issue number, such as NIFI-00000

Pull Request Formatting

  • Pull Request based on current revision of the main branch
  • Pull Request refers to a feature branch with one commit containing changes

Verification

Please indicate the verification steps performed prior to pull request creation.

Build

  • Build completed using mvn clean install -P contrib-check
    • JDK 21

Licensing

  • New dependencies are compatible with the Apache License 2.0 according to the License Policy
  • New dependencies are documented in applicable LICENSE and NOTICE files

Documentation

  • Documentation formatting appears as expected in rendered files

@pvillard31
Contributor

Any reason for not using ValidateRecord processor if the only requirement is to confirm that the data is valid CSV without any specific constraint?

@pvillard31 pvillard31 changed the title nifi-12674 Modified ValidateCSV to make the schema optional if a head… NIFI-12674 Modified ValidateCSV to make the schema optional if a head… Feb 6, 2024
@Freedom9339
Contributor Author

@pvillard31 The ValidateRecord processor splits the input FlowFile into two FlowFiles, one for valid records and another for invalid records. With the change to ValidateCSV, the whole file is routed to either valid or invalid.
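
The difference between the two routing behaviours described in this comment can be sketched as follows. This is only an illustration in plain Java: the class name, method names, and the use of strings for records and relationship names are assumptions, not either processor's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class RoutingComparison {

    // Per-record behaviour: records are partitioned into a valid list and an
    // invalid list (index 0 = valid, index 1 = invalid).
    public static List<List<String>> splitByValidity(List<String> records,
                                                     Predicate<String> isValid) {
        List<String> valid = new ArrayList<>();
        List<String> invalid = new ArrayList<>();
        for (String record : records) {
            if (isValid.test(record)) {
                valid.add(record);
            } else {
                invalid.add(record);
            }
        }
        return List.of(valid, invalid);
    }

    // Whole-file behaviour: one bad record sends the entire input to "invalid".
    public static String routeWholeFile(List<String> records, Predicate<String> isValid) {
        return records.stream().allMatch(isValid) ? "valid" : "invalid";
    }

    public static void main(String[] args) {
        List<String> records = List.of("1,Ann", "2", "3,Cy");
        Predicate<String> hasTwoFields = r -> r.split(",").length == 2;
        System.out.println(splitByValidity(records, hasTwoFields));
        System.out.println(routeWholeFile(records, hasTwoFields));
    }
}
```

With the sample input, the split keeps the two well-formed records apart from the malformed one, while the whole-file decision is "invalid" because a single record fails.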

@dan-s1
Contributor

dan-s1 commented Feb 22, 2024

@exceptionfactory Can you please restart the failed job? It does not seem related to the changes. Thanks!

@Freedom9339
Contributor Author

Any updates on reviewing this change?

@mattyb149
Contributor

There are merge conflicts that need to be resolved

@Freedom9339
Contributor Author

@mattyb149 I've rebased against main. Thank you.


InputStream stream;
if (context.getProperty(CSV_SOURCE_ATTRIBUTE).isSet()) {
    stream = new ByteArrayInputStream(flowFile.getAttribute(context.getProperty(CSV_SOURCE_ATTRIBUTE).getValue()).getBytes(StandardCharsets.UTF_8));
Contributor

@jrsteinebrey jrsteinebrey Jul 2, 2024

@Freedom9339 Thanks for making that change. It turned out well except you need to call .evaluateAttributeExpressions() without passing a flowfile after the .getProperty(CSV_SOURCE_ATTRIBUTE) call. After that, the change looks complete.

Contributor Author

Done.

Contributor

@Freedom9339 Thanks for your contribution.

@@ -120,11 +120,20 @@ public class ValidateCsv extends AbstractProcessor {
             .description("The schema to be used for validation. Is expected a comma-delimited string representing the cell "
                     + "processors to apply. The following cell processors are allowed in the schema definition: "
                     + allowedOperators.toString() + ". Note: cell processors cannot be nested except with Optional.")
-            .required(true)
+            .required(false)
Contributor

@mattyb149 mattyb149 Jul 22, 2024

This should have a dependsOn(HEADER, "false"), unless it is possible to provide an explicit schema while there is also a header line that should be ignored. If that's the case, maybe update the documentation for Header to reflect the current behavior for the different combinations of settings.

Contributor

I think the "unless" part is true here. But now there are merge conflicts; sorry, I lost track of this.
