Add in upstream data validations, clean up site links and nested folders
pflooky committed Nov 28, 2023
1 parent 1f3c1d0 commit 1074e5e
Showing 57 changed files with 7,789 additions and 1,307 deletions.
8 changes: 4 additions & 4 deletions docs/setup/advanced/advanced.md → docs/setup/advanced.md
Original file line number Diff line number Diff line change
@@ -6,14 +6,14 @@ There are many options available for you to use when you have a scenario when da

1. Create expression [datafaker](https://www.datafaker.net/documentation/expressions/)
1. Can be used to create names, addresses, or anything that can be found
-under [**here**](../../sample/datafaker/expressions.txt)
+under [**here**](../sample/datafaker/expressions.txt)
2. Create regex

## Foreign keys across data sets

-![Multiple data source foreign key example](../../diagrams/foreign_keys.drawio.png "Multiple data source foreign keys")
+![Multiple data source foreign key example](../diagrams/foreign_keys.drawio.png "Multiple data source foreign keys")

-Details for how you can configure foreign keys can be found [**here**](../foreign-key/foreign-key.md).
+Details for how you can configure foreign keys can be found [**here**](foreign-key.md).

## Edge cases

@@ -54,7 +54,7 @@ This can be controlled at a column level by including the following flag in the

If you want to know all the possible edge cases for each data
-type, [can check the documentation here](../generator/generator.md).
+type, [can check the documentation here](generator/data-generator.md).
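
The edge-case idea above can be sketched in plain Java. This is an illustration of the concept only, not Data Caterer's implementation: the probability, the "normal" range, and the edge-case list are all assumptions for the sake of the example.

```java
import java.util.List;
import java.util.Random;

public class EdgeCaseSketch {
    // Illustrative edge cases for an integer column; a real generator would
    // derive the list from the column's data type.
    static final List<Integer> EDGE_CASES =
            List.of(Integer.MIN_VALUE, -1, 0, 1, Integer.MAX_VALUE);

    // With the given probability, emit an edge case instead of a value from
    // the column's normal range (assumed here to be 0..999).
    static int next(Random rnd, double edgeCaseProbability) {
        if (rnd.nextDouble() < edgeCaseProbability) {
            return EDGE_CASES.get(rnd.nextInt(EDGE_CASES.size()));
        }
        return rnd.nextInt(1000);
    }
}
```

Sampling with a non-zero probability mixes boundary values into otherwise ordinary data, which is exactly what makes edge-case generation useful for flushing out overflow and null-handling bugs downstream.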

## Scenario testing

2 changes: 1 addition & 1 deletion docs/setup/configuration.md
@@ -18,7 +18,7 @@ Flags are used to control which processes are executed when you run Data Caterer
| `enableFailOnError` | true | N | Whilst saving generated data, if there is an error, it will stop any further data from being generated |
| `enableSaveReports` | true | N | Enable/disable HTML reports summarising data generated, metadata of data generated (if `enableSinkMetadata` is enabled) and validation results (if `enableValidation` is enabled). Sample [**here**](generator/report.md) |
| `enableSinkMetadata` | true | N | Run data profiling for the generated data. Shown in HTML reports if `enableSaveSinkMetadata` is enabled |
-| `enableValidation` | false | N | Run validations as described in plan. Results can be viewed from logs or from HTML report if `enableSaveSinkMetadata` is enabled. Sample [**here**](validation/validation.md) |
+| `enableValidation` | false | N | Run validations as described in plan. Results can be viewed from logs or from HTML report if `enableSaveSinkMetadata` is enabled. Sample [**here**](validation.md) |
| `enableGeneratePlanAndTasks` | false | Y | Enable/disable plan and task auto generation based off data source connections |
| `enableRecordTracking` | false | Y | Enable/disable which data records have been generated for any data source |
| `enableDeleteGeneratedRecords` | false | Y | Delete all generated records based off record tracking (if `enableRecordTracking` has been set to true) |
16 changes: 8 additions & 8 deletions docs/setup/connection/connection.md → docs/setup/connection.md
@@ -6,13 +6,13 @@ These configurations can be done via API or from configuration. Examples of both

## Supported Data Connections

-| Data Source Type | Data Source                            | Paid                   |
-|------------------|----------------------------------------|------------------------|
-| Database         | Postgres, MySQL, Cassandra             | N (Postgres), Y (rest) |
-| File             | CSV, JSON, ORC, Parquet                | N                      |
-| Messaging        | Kafka, Solace                          | Y                      |
-| HTTP             | REST API                               | Y                      |
-| Metadata         | Marquez, OpenMetadata, OpenAPI/Swagger | Y                      |
+| Data Source Type | Data Source                            | Sponsor |
+|------------------|----------------------------------------|---------|
+| Database         | Postgres, MySQL, Cassandra             | N       |
+| File             | CSV, JSON, ORC, Parquet                | N       |
+| Messaging        | Kafka, Solace                          | Y       |
+| HTTP             | REST API                               | Y       |
+| Metadata         | Marquez, OpenMetadata, OpenAPI/Swagger | Y       |

### API

@@ -46,7 +46,7 @@ All connection details follow the same pattern.

## Data sources

-To find examples of a task for each type of data source, please check out [this page](../guide/index.md).
+To find examples of a task for each type of data source, please check out [this page](guide/index.md).

### File

File renamed without changes.
@@ -1,6 +1,6 @@
# Foreign Keys

-![Multiple data source foreign key example](../../diagrams/foreign_keys.drawio.png "Multiple data source foreign keys")
+![Multiple data source foreign key example](../diagrams/foreign_keys.drawio.png "Multiple data source foreign keys")

Foreign keys can be defined to represent the relationships between datasets where values are required to match for
particular columns.
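
As a plain-Java sketch of that idea (not the Data Caterer API; names are illustrative), generating matching values comes down to drawing every child-column value from the parent's key set:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class ForeignKeySketch {
    // Draw each foreign-key value in the child dataset from the parent's
    // primary keys, so a join between the two generated datasets always matches.
    static List<String> childForeignKeys(List<String> parentKeys, int count, Random rnd) {
        List<String> values = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            values.add(parentKeys.get(rnd.nextInt(parentKeys.size())));
        }
        return values;
    }
}
```

Because every child value is sampled from the parent keys, referential integrity holds by construction rather than by post-hoc filtering.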
File renamed without changes.
4 changes: 2 additions & 2 deletions docs/setup/guide/data-source/solace.md
@@ -92,7 +92,7 @@ Within our class, we can start by defining the connection properties to connect

-Additional connection options can be found [**here**](../../connection/connection.md#jms).
+Additional connection options can be found [**here**](../../connection.md#jms).

=== "Scala"

@@ -104,7 +104,7 @@ Within our class, we can start by defining the connection properties to connect

-Additional connection options can be found [**here**](../../connection/connection.md#jms).
+Additional connection options can be found [**here**](../../connection.md#jms).

#### Schema

4 changes: 2 additions & 2 deletions docs/setup/guide/scenario/data-validation.md
@@ -1,5 +1,5 @@
---
-description: "Validate data via basic checks and group by aggregates across columns and the whole dataset."
+description: "Validate data via basic checks and group by aggregates, across columns or the whole dataset."
image: "https://data.catering/diagrams/logo/data_catering_logo.svg"
---

@@ -199,7 +199,7 @@ Line 2: `validation.groupBy("account_id").max("balance").lessThan(900)`
##### Considerations

- Adjust the `errorThreshold` or validation to your specification scenario. The full list
-of [types of validations can be found here](../../../setup/validation/validation.md).
+of [types of validations can be found here](../../validation.md).
- For the full list of types of group by validations that can be
used, [check this page](../../../setup/validation/group-by-validation.md).
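
To make the semantics of Line 2 concrete, here is a self-contained sketch of the same check in plain Java (not the Data Caterer API): group by `account_id`, aggregate `max(balance)`, then require the aggregate to be below the threshold for every group.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupByCheck {
    record Txn(String accountId, double balance) {}

    // Mirrors validation.groupBy("account_id").max("balance").lessThan(900):
    // true only when every account's maximum balance is below the threshold.
    static boolean maxBalancePerAccountLessThan(List<Txn> txns, double threshold) {
        Map<String, Double> maxByAccount = txns.stream()
                .collect(Collectors.toMap(Txn::accountId, Txn::balance, Math::max));
        return maxByAccount.values().stream().allMatch(max -> max < threshold);
    }
}
```

As the considerations above note, `errorThreshold` can relax an all-or-nothing check like this one so that a small number of failing groups is tolerated.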

4 changes: 2 additions & 2 deletions docs/setup/guide/scenario/first-data-generation.md
@@ -618,7 +618,7 @@ Let's try to configure data validations for the data that gets pushed into Postg
#### Postgres setup

First, we define our connection properties for Postgres. You can check out the full options available
-[**here**](../../connection/connection.md).
+[**here**](../../connection.md).

=== "Java"

@@ -646,7 +646,7 @@ We can connect and access the data inside the table `account.transactions`. Now

#### Validations

-For full information about validation options and configurations, check [**here**](../../validation/validation.md).
+For full information about validation options and configurations, check [**here**](../../validation.md).
Below, we have an example that should give you a good understanding of what validations are possible.

=== "Java"
12 changes: 6 additions & 6 deletions docs/setup/index.md
@@ -23,12 +23,12 @@ If you want a guided tour of using the Java or Scala API, you can follow one of
</div>

[Configurations]: configuration.md
-[Connections]: connection/connection.md
-[Generators]: generator/generator.md
-[Validations]: validation/validation.md
-[Foreign Keys]: foreign-key/foreign-key.md
-[Deployment]: deployment/deployment.md
-[Advanced]: advanced/advanced.md
+[Connections]: connection.md
+[Generators]: generator/data-generator.md
+[Validations]: validation.md
+[Foreign Keys]: foreign-key.md
+[Deployment]: deployment.md
+[Advanced]: advanced.md

## High Level Run Configurations

15 changes: 5 additions & 10 deletions docs/setup/validation/validation.md → docs/setup/validation.md
@@ -7,19 +7,14 @@ summarising the success or failure of the validations is produced and can be exa

- __[Basic]__ - Basic column level validations
- __[Group by/Aggregate]__ - Run aggregates over grouped data, then validate
-- __[Relationship (Coming soon)]__ - Ensure record values exist in other datasets based on relationships
+- __[Upstream data source]__ - Ensure record values exist in datasets based on other data sources or data generated
- __[Data Profile (Coming soon)]__ - Score how close the data profile of generated data is against the target data profile

</div>

-[Basic]: basic-validation.md
-[Group by/Aggregate]: group-by-validation.md

-Currently, SQL expression validations are supported (can see [**here**](https://spark.apache.org/docs/latest/api/sql/)
-for reference what other expressions are valid), but will later be extended out to supported other validations such as
-aggregates (group by account_number, sum of amounts should be greater than 100), ordering (transaction dates should be
-in descending order), relationships (at least one transaction per account_number) or data profiling (how close produced
-data profile is to expected data profile).
+[Basic]: validation/basic-validation.md
+[Group by/Aggregate]: validation/group-by-validation.md
+[Upstream data source]: validation/upstream-data-source-validation.md
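
The new upstream data source validation checks that values in one dataset exist in another. Conceptually (plain Java, not the Data Caterer API; the column and method names are illustrative):

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

public class UpstreamCheck {
    // Collect the generated values that do not appear in the upstream data
    // source; the validation passes only when the result is empty.
    static Set<String> missingFromUpstream(Collection<String> generated,
                                           Collection<String> upstream) {
        Set<String> missing = new HashSet<>(generated);
        missing.removeAll(new HashSet<>(upstream));
        return missing;
    }
}
```

Reporting the missing values themselves, rather than a bare pass/fail, is what makes this style of check useful in a validation report.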

## Define Validations

@@ -249,4 +244,4 @@

## Report

-Once run, it will produce a report like [this](../../sample/report/html/validations.html).
+Once run, it will produce a report like [this](../sample/report/html/validations.html).