diff --git a/.cache/plugin/optimize/diagrams/data_validation_report.png b/.cache/plugin/optimize/diagrams/data_validation_report.png
new file mode 100644
index 00000000..ee175055
Binary files /dev/null and b/.cache/plugin/optimize/diagrams/data_validation_report.png differ
diff --git a/.cache/plugin/optimize/manifest.json b/.cache/plugin/optimize/manifest.json
index 15ece90f..fd022fa0 100644
--- a/.cache/plugin/optimize/manifest.json
+++ b/.cache/plugin/optimize/manifest.json
@@ -16,5 +16,6 @@
   "diagrams/3.png": "c869b89258b4a0a6c2b66abb3ab416db925a56bf",
   "diagrams/openmetadata_dashboard.png": "81ff2755f6a7d7fe12680d2ba77427aa4b0a85ef",
   "diagrams/solace_dashboard.png": "aed876e82473b0a4c10dc714bfabdcc4d252ee19",
-  "diagrams/solace_messages_queued.png": "1ee16a1829114c52ef05234d46f0c487669fd6a3"
+  "diagrams/solace_messages_queued.png": "1ee16a1829114c52ef05234d46f0c487669fd6a3",
+  "diagrams/data_validation_report.png": "6b78f44b31983b53eea1bd3d7ae55cbebe7415b9"
 }
\ No newline at end of file
diff --git a/docs/diagrams/data_validation_report.png b/docs/diagrams/data_validation_report.png
new file mode 100644
index 00000000..bafab1f7
Binary files /dev/null and b/docs/diagrams/data_validation_report.png differ
diff --git a/docs/setup/guide/index.md b/docs/setup/guide/index.md
index 3cd072fd..27908e3d 100644
--- a/docs/setup/guide/index.md
+++ b/docs/setup/guide/index.md
@@ -17,7 +17,7 @@ the trial can be found [**here**](../../get-started/docker.md#paid-version-trial
 - __[First Data Generation]__ - If you are new, this is the place to start
 - __[Multiple Records Per Column Value]__ - How you can generate multiple records per set of columns
 - __[Foreign Keys Across Data Sources]__ - Generate matching values across generated data sets
-- __[Data Validations]__ - (Soon to document) Run data validations after generating data
+- __[Data Validations]__ - Run data validations after generating data
 - __[Auto Generate From Data Connection]__ - Automatically generating data from just defining data sources
 - __[Delete Generated Data]__ - Delete the generated data whilst leaving other data
 - __[Generate Batch and Event Data]__ - Generate matching batch and event data
@@ -27,7 +27,7 @@ the trial can be found [**here**](../../get-started/docker.md#paid-version-trial
   [First Data Generation]: scenario/first-data-generation.md
   [Multiple Records Per Column Value]: scenario/records-per-column.md
   [Foreign Keys Across Data Sources]: scenario/batch-and-event.md
-  [Data Validations]: scenario/first-data-generation.md
+  [Data Validations]: scenario/data-validation.md
   [Auto Generate From Data Connection]: scenario/auto-generate-connection.md
   [Delete Generated Data]: scenario/delete-generated-data.md
   [Generate Batch and Event Data]: scenario/batch-and-event.md
diff --git a/docs/setup/guide/scenario/data-validation.md b/docs/setup/guide/scenario/data-validation.md
new file mode 100644
index 00000000..351b41f8
--- /dev/null
+++ b/docs/setup/guide/scenario/data-validation.md
@@ -0,0 +1,290 @@
+---
+description: "Validate data via basic checks and group by aggregates across columns and the whole dataset."
+image: "https://data.catering/diagrams/logo/data_catering_logo.svg"
+---
+
+# Data Validations
+
+Creating a data validator for a JSON file.
+
+![Example data validation report](../../../diagrams/data_validation_report.png)
+
+## Requirements
+
+- 5 minutes
+- Git
+- Gradle
+- Docker
+
+## Get Started
+
+First, clone the data-caterer-example repo, which already has the required base project setup.
+
+```shell
+git clone git@github.com:pflooky/data-caterer-example.git
+```
+
+### Data Setup
+
+To help demonstrate data validations, we will first generate some data for our validations to run against. Run the
+command below to generate JSON files under the `docker/sample/json` folder.
+
+```shell
+./run.sh JsonPlan
+```
+
+### Plan Setup
+
+Create a new Java or Scala class.
+
+- Java: `src/main/java/com/github/pflooky/plan/MyValidationJavaPlan.java`
+- Scala: `src/main/scala/com/github/pflooky/plan/MyValidationPlan.scala`
+
+Make sure your class extends `PlanRun`.
+
+=== "Java"
+
+    ```java
+    import com.github.pflooky.datacaterer.java.api.PlanRun;
+    ...
+    
+    public class MyValidationJavaPlan extends PlanRun {
+        {
+            var jsonTask = json("my_json", "/opt/app/data/json");
+
+            var config = configuration()
+                    .generatedReportsFolderPath("/opt/app/data/report")
+                    .enableValidation(true)
+                    .enableGenerateData(false);
+    
+            execute(config, jsonTask);
+        }
+    }
+    ```
+
+=== "Scala"
+
+    ```scala
+    import com.github.pflooky.datacaterer.api.PlanRun
+    ...
+    
+    class MyValidationPlan extends PlanRun {
+      val jsonTask = json("my_json", "/opt/app/data/json")
+
+      val config = configuration
+        .generatedReportsFolderPath("/opt/app/data/report")
+        .enableValidation(true)
+        .enableGenerateData(false)
+      
+      execute(config, jsonTask)
+    }
+    ```
+
+As shown above, we create a JSON task that points to the folder `/opt/app/data/json`, where the JSON data was generated.
+We also set `enableValidation` to `true` and `enableGenerateData` to `false` to tell Data Caterer that we only want to
+validate data.
+
+### Validations
+
+For reference, the schema we will be validating against looks like the below.
+
+```scala
+.schema(
+  field.name("account_id"),
+  field.name("year").`type`(IntegerType),
+  field.name("balance").`type`(DoubleType),
+  field.name("date").`type`(DateType),
+  field.name("status"),
+  field.name("update_history").`type`(ArrayType)
+    .schema(
+      field.name("updated_time").`type`(TimestampType),
+      field.name("status").oneOf("open", "closed", "pending", "suspended"),
+    ),
+  field.name("customer_details")
+    .schema(
+      field.name("name").expression("#{Name.name}"),
+      field.name("age").`type`(IntegerType),
+      field.name("city").expression("#{Address.city}")
+    )
+)
+```
+
+#### Basic Validation
+
+Let's say our goal is to validate the `customer_details.name` field to ensure it conforms to the regex
+pattern `[A-Z][a-z]+ [A-Z][a-z]+`. Given the diversity in naming conventions across cultures and countries, variations
+such as middle names, suffixes, prefixes, or language-specific differences are tolerated to a certain extent. The
+validation considers an acceptable error threshold before marking it as failed.
+
+##### Validation Criteria
+
+- Field to Validate: `customer_details.name`
+- Regex Pattern: `[A-Z][a-z]+ [A-Z][a-z]+`
+- Error Tolerance: If more than 10% of records do not match the regex, the validation fails.
+
+##### Considerations
+
+- Customisation
+    - Adjust the regex pattern and error threshold based on your specific data schema and validation requirements.
+    - For the full list of types of basic validations that can be
+      used, [check this page](../../../setup/validation/basic-validation.md).
+- Understanding Tolerance
+    - Be mindful of the error threshold, as it directly influences what percentage of deviations from the pattern is
+      acceptable.
+
+=== "Java"
+
+    ```java
+    validation().col("customer_details.name")
+        .matches("[A-Z][a-z]+ [A-Z][a-z]+")
+        .errorThreshold(0.1)                                      //<=10% failure rate is acceptable
+        .description("Names generally follow the same pattern"),  //description adds context in the report and for other developers
+    ```
+
+=== "Scala"
+
+    ```scala
+    validation.col("customer_details.name")
+      .matches("[A-Z][a-z]+ [A-Z][a-z]+")
+      .errorThreshold(0.1)                                      //<=10% failure rate is acceptable
+      .description("Names generally follow the same pattern"),  //description adds context in the report and for other developers
+    ```
+
+#### Custom Validation
+
+There will be situations where you have a complex data setup and require your own custom logic for data validation.
+You can achieve this by setting your own SQL expression that returns a boolean value. An example is seen below, where
+we want to check that each entry in the array `update_history` has `updated_time` greater than a certain timestamp.
+
+=== "Java"
+
+    ```java
+    validation().expr("FORALL(update_history, x -> x.updated_time > TIMESTAMP('2022-01-01 00:00:00'))"),
+    ```
+
+=== "Scala"
+
+    ```scala
+    validation.expr("FORALL(update_history, x -> x.updated_time > TIMESTAMP('2022-01-01 00:00:00'))"),
+    ```
+
+If you want to know what other SQL functions are available for you to
+use, [check this page](https://spark.apache.org/docs/latest/api/sql/).
+
+#### Group By Validation
+
+There are scenarios where you want to validate against grouped values or the whole dataset via aggregations. An example
+would be validating that the sum of each customer's transactions is greater than 0.
+
+##### Validation Criteria
+
+Line 1: `validation.groupBy().count().isEqual(100)`
+
+- Method Chaining
+    - `groupBy()`: Groups by the whole dataset, as no grouping columns are given.
+    - `count()`: Counts the number of records in the dataset.
+    - `isEqual(100)`: Checks if the count is equal to 100.
+- Validation Rule
+    - This line ensures that the count of the total dataset is exactly 100.
+
+Line 2: `validation.groupBy("account_id").max("balance").lessThan(900)`
+
+- Method Chaining
+    - `groupBy("account_id")`: Groups the data based on the `account_id` field.
+    - `max("balance")`: Calculates the maximum value of the `balance` field within each group.
+    - `lessThan(900)`: Checks if the maximum balance in each group is less than 900.
+- Validation Rule
+    - This line ensures that, for each group identified by `account_id`, the maximum balance is less than 900.
+
+##### Considerations
+
+- Adjust the `errorThreshold` or validation to your specific scenario. The full list
+  of [types of validations can be found here](../../../setup/validation/validation.md).
+- For the full list of types of group by validations that can be
+  used, [check this page](../../../setup/validation/group-by-validation.md).
+
+=== "Java"
+
+    ```java
+    validation().groupBy().count().isEqual(100),
+    validation().groupBy("account_id").max("balance").lessThan(900)
+    ```
+
+=== "Scala"
+
+    ```scala
+    validation.groupBy().count().isEqual(100),
+    validation.groupBy("account_id").max("balance").lessThan(900)
+    ```
+
+#### Sample Validation
+
+To try to cover the majority of validation cases, the example below has been created.
+
+=== "Java"
+
+    ```java
+    var jsonTask = json("my_json", "/opt/app/data/json")
+            .validations(
+                    validation().col("customer_details.name").matches("[A-Z][a-z]+ [A-Z][a-z]+").errorThreshold(0.1).description("Names generally follow the same pattern"),
+                    validation().col("date").isNotNull().errorThreshold(10),
+                    validation().col("balance").greaterThan(500),
+                    validation().expr("YEAR(date) == year"),
+                    validation().col("status").in("open", "closed", "pending").errorThreshold(0.2).description("Could be new status introduced"),
+                    validation().col("customer_details.age").greaterThan(18),
+                    validation().expr("FORALL(update_history, x -> x.updated_time > TIMESTAMP('2022-01-01 00:00:00'))"),
+                    validation().col("update_history").greaterThanSize(2),
+                    validation().unique("account_id"),
+                    validation().groupBy().count().isEqual(1000),
+                    validation().groupBy("account_id").max("balance").lessThan(900)
+            );
+
+    var config = configuration()
+            .generatedReportsFolderPath("/opt/app/data/report")
+            .enableValidation(true)
+            .enableGenerateData(false);
+
+    execute(config, jsonTask);
+    ```
+
+=== "Scala"
+
+    ```scala
+    val jsonTask = json("my_json", "/opt/app/data/json")
+      .validations(
+        validation.col("customer_details.name").matches("[A-Z][a-z]+ [A-Z][a-z]+").errorThreshold(0.1).description("Names generally follow the same pattern"),
+        validation.col("date").isNotNull.errorThreshold(10),
+        validation.col("balance").greaterThan(500),
+        validation.expr("YEAR(date) == year"),
+        validation.col("status").in("open", "closed", "pending").errorThreshold(0.2).description("Could be new status introduced"),
+        validation.col("customer_details.age").greaterThan(18),
+        validation.expr("FORALL(update_history, x -> x.updated_time > TIMESTAMP('2022-01-01 00:00:00'))"),
+        validation.col("update_history").greaterThanSize(2),
+        validation.unique("account_id"),
+        validation.groupBy().count().isEqual(1000),
+        validation.groupBy("account_id").max("balance").lessThan(900)
+      )
+
+    val config = configuration
+      .generatedReportsFolderPath("/opt/app/data/report")
+      .enableValidation(true)
+      .enableGenerateData(false)
+    
+    execute(config, jsonTask)
+    ```
+
+### Run
+
+Let's try running it.
+
+```shell
+./run.sh
+#input class MyValidationJavaPlan or MyValidationPlan
+#after completing, check report at docker/sample/report/index.html
+```
+
+It should look something like this.
+
+<video src="https://user-images.githubusercontent.com/26299147/283040918-5de0c992-cddf-4ab1-a501-273ceef0cb30.mov" data-canonical-src="https://user-images.githubusercontent.com/26299147/283040918-5de0c992-cddf-4ab1-a501-273ceef0cb30.mov" controls="controls" muted="muted" style="max-height:640px; min-height: 200px"></video>
+
+Check the full example at `ValidationPlanRun` inside the examples repo.
diff --git a/docs/setup/guide/scenario/records-per-column.md b/docs/setup/guide/scenario/records-per-column.md
index 71306968..492271bd 100644
--- a/docs/setup/guide/scenario/records-per-column.md
+++ b/docs/setup/guide/scenario/records-per-column.md
@@ -41,7 +41,7 @@ Make sure your class extends `PlanRun`.
                             field().name("amount").type(DoubleType.instance()).min(1).max(100),
                             field().name("time").type(TimestampType.instance()).min(java.sql.Date.valueOf("2022-01-01")),
                             field().name("date").type(DateType.instance()).sql("DATE(time)")
-                    )
+                    );
     
             var config = configuration()
                     .generatedReportsFolderPath("/opt/app/data/report")
@@ -60,7 +60,7 @@ Make sure your class extends `PlanRun`.
     
     class MyMultipleRecordsPerColPlan extends PlanRun {
 
-      val transactionTask: ConnectionTaskBuilder[FileBuilder] = csv("customer_transactions", "/opt/app/data/customer/transaction", Map("header" -> "true"))
+      val transactionTask = csv("customer_transactions", "/opt/app/data/customer/transaction", Map("header" -> "true"))
         .schema(
           field.name("account_id").regex("ACC[0-9]{8}"), 
           field.name("full_name").expression("#{Name.name}"), 
diff --git a/docs/setup/validation/group-by-validation.md b/docs/setup/validation/group-by-validation.md
index 46d114f0..50c2bccd 100644
--- a/docs/setup/validation/group-by-validation.md
+++ b/docs/setup/validation/group-by-validation.md
@@ -1,8 +1,40 @@
 # Group By Validation
 
-If you want to run aggregations based on a particular set of columns, you can do so via group by validations. An example
-would be checking that the sum of `amount` is less than 1000 per `account_id, year`. The validations applied can
-be one of the validations from above.
+If you want to run aggregations based on a particular set of columns or just the whole dataset, you can do so via group
+by validations. An example would be checking that the sum of `amount` is less than 1000 per `account_id, year`. The
+validations applied can be one of the validations from the [basic validation set found here](basic-validation.md).
+
+## Record count
+
+Check the number of records across the whole dataset.
+
+=== "Java"
+
+    ```java
+    validation().groupBy().count().lessThan(1000)
+    ```
+
+=== "Scala"
+
+    ```scala
+    validation.groupBy().count().lessThan(1000)
+    ```
+
+## Record count per group
+
+Check the number of records for each group.
+
+=== "Java"
+
+    ```java
+    validation().groupBy("account_id", "year").count().lessThan(10)
+    ```
+
+=== "Scala"
+
+    ```scala
+    validation.groupBy("account_id", "year").count().lessThan(10)
+    ```
 
 ## Sum
 
@@ -83,3 +115,19 @@ Check the average for each group adheres to validation.
     ```scala
     validation.groupBy("account_id", "year").avg("amount").between(40, 60)
     ```
+
+## Standard deviation
+
+Check the standard deviation for each group adheres to validation.
+
+=== "Java"
+
+    ```java
+    validation().groupBy("account_id", "year").stddev("amount").between(0.5, 0.6)
+    ```
+
+=== "Scala"
+
+    ```scala
+    validation.groupBy("account_id", "year").stddev("amount").between(0.5, 0.6)
+    ```
diff --git a/docs/use-case/roadmap.md b/docs/use-case/roadmap.md
index c8db6fc2..52f9c9f2 100644
--- a/docs/use-case/roadmap.md
+++ b/docs/use-case/roadmap.md
@@ -61,3 +61,4 @@
 - Ordering within data sources that support order for insertion
 - Clean up data in consumer data sinks
 - :white_check_mark: Trial app to try out all features
+- HTTP response data validation
diff --git a/mkdocs.yml b/mkdocs.yml
index 20e84c65..282186e1 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -64,6 +64,8 @@ nav:
           - Scenario:
               - First Data Generation: 'setup/guide/scenario/first-data-generation.md'
               - Multiple Records Per Column Value: 'setup/guide/scenario/records-per-column.md'
+              - Foreign Keys Across Data Sources: 'setup/guide/scenario/batch-and-event.md'
+              - Data Validations: 'setup/guide/scenario/data-validation.md'
               - Auto Generate From Data Connection: 'setup/guide/scenario/auto-generate-connection.md'
               - Delete Generated Data: 'setup/guide/scenario/delete-generated-data.md'
               - Generate Batch and Event Data: 'setup/guide/scenario/batch-and-event.md'
diff --git a/site/404.html b/site/404.html
index 89918cdd..f721e1e4 100644
--- a/site/404.html
+++ b/site/404.html
@@ -659,6 +659,58 @@
   
   
   
+    <li class="md-nav__item">
+      <a href="/setup/guide/scenario/batch-and-event/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Foreign Keys Across Data Sources
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="/setup/guide/scenario/data-validation/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Data Validations
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
     <li class="md-nav__item">
       <a href="/setup/guide/scenario/auto-generate-connection/" class="md-nav__link">
         
diff --git a/site/about/index.html b/site/about/index.html
index 53d65d13..76bc07f7 100644
--- a/site/about/index.html
+++ b/site/about/index.html
@@ -672,6 +672,58 @@
   
   
   
+    <li class="md-nav__item">
+      <a href="../setup/guide/scenario/batch-and-event/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Foreign Keys Across Data Sources
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../setup/guide/scenario/data-validation/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Data Validations
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
     <li class="md-nav__item">
       <a href="../setup/guide/scenario/auto-generate-connection/" class="md-nav__link">
         
diff --git a/site/diagrams/data_validation_report.png b/site/diagrams/data_validation_report.png
new file mode 100644
index 00000000..ee175055
Binary files /dev/null and b/site/diagrams/data_validation_report.png differ
diff --git a/site/get-started/docker/index.html b/site/get-started/docker/index.html
index 14988a2d..61c71509 100644
--- a/site/get-started/docker/index.html
+++ b/site/get-started/docker/index.html
@@ -768,6 +768,58 @@
   
   
   
+    <li class="md-nav__item">
+      <a href="../../setup/guide/scenario/batch-and-event/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Foreign Keys Across Data Sources
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../setup/guide/scenario/data-validation/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Data Validations
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
     <li class="md-nav__item">
       <a href="../../setup/guide/scenario/auto-generate-connection/" class="md-nav__link">
         
diff --git a/site/index.html b/site/index.html
index 4cbabf6c..01fff376 100644
--- a/site/index.html
+++ b/site/index.html
@@ -747,6 +747,58 @@
   
   
   
+    <li class="md-nav__item">
+      <a href="setup/guide/scenario/batch-and-event/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Foreign Keys Across Data Sources
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="setup/guide/scenario/data-validation/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Data Validations
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
     <li class="md-nav__item">
       <a href="setup/guide/scenario/auto-generate-connection/" class="md-nav__link">
         
diff --git a/site/legal/privacy-policy/index.html b/site/legal/privacy-policy/index.html
index bd959781..54ddf43f 100644
--- a/site/legal/privacy-policy/index.html
+++ b/site/legal/privacy-policy/index.html
@@ -672,6 +672,58 @@
   
   
   
+    <li class="md-nav__item">
+      <a href="../../setup/guide/scenario/batch-and-event/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Foreign Keys Across Data Sources
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../setup/guide/scenario/data-validation/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Data Validations
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
     <li class="md-nav__item">
       <a href="../../setup/guide/scenario/auto-generate-connection/" class="md-nav__link">
         
diff --git a/site/legal/terms-of-service/index.html b/site/legal/terms-of-service/index.html
index 5c0f3d17..34d11a41 100644
--- a/site/legal/terms-of-service/index.html
+++ b/site/legal/terms-of-service/index.html
@@ -672,6 +672,58 @@
   
   
   
+    <li class="md-nav__item">
+      <a href="../../setup/guide/scenario/batch-and-event/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Foreign Keys Across Data Sources
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../setup/guide/scenario/data-validation/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Data Validations
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
     <li class="md-nav__item">
       <a href="../../setup/guide/scenario/auto-generate-connection/" class="md-nav__link">
         
diff --git a/site/pricing/index.html b/site/pricing/index.html
index 6aefc112..c12356d3 100644
--- a/site/pricing/index.html
+++ b/site/pricing/index.html
@@ -670,6 +670,58 @@
   
   
   
+    <li class="md-nav__item">
+      <a href="../setup/guide/scenario/batch-and-event/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Foreign Keys Across Data Sources
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../setup/guide/scenario/data-validation/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Data Validations
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
     <li class="md-nav__item">
       <a href="../setup/guide/scenario/auto-generate-connection/" class="md-nav__link">
         
diff --git a/site/search/search_index.json b/site/search/search_index.json
index da104358..8768e54e 100644
--- a/site/search/search_index.json
+++ b/site/search/search_index.json
@@ -1 +1 @@
-{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"],"fields":{"title":{"boost":1000.0},"text":{"boost":1.0},"tags":{"boost":1000000.0}}},"docs":[{"location":"","title":"Home","text":"Data Caterer is a metadata-driven data generation and  testing tool that aids in creating production-like data across both batch and event data systems. Run data validations  to ensure your systems have ingested it as expected, then clean up the data afterwards. Simplify your data testing Take away the pain and complexity of your data landscape and let Data Caterer handle it <p> Try now </p> Data testing is difficult and fragmented <ul> <li>Data being sent via messages, HTTP requests or files and getting stored in databases, file systems, etc.</li> <li>Maintaining and updating tests with the latest schemas and business definitions</li> <li>Different testing tools for services, jobs or data sources</li> <li>Complex relationships between datasets and fields</li> <li>Different scenarios, permutations, combinations and edge cases to cover</li> </ul> Current solutions only cover half the story <ul> <li>Specific testing frameworks that support one or limited number of data sources or transport protocols</li> <li>Under utilizing metadata from data catalogs or metadata discovery services</li> <li>Testing teams having difficulties understanding when failures occur</li> <li>Integration tests relying on external teams/services</li> <li>Manually generating data, or worse, copying/masking production data into lower environments</li> <li>Observability pushes towards being reactive rather than proactive</li> </ul> <p> Try now </p> What you need is a reliable tool that can handle changes to your data landscape <p> </p> <p>With Data Caterer, you get:</p> <ul> <li>Ability to connect to any type of data source: files, SQL or no-SQL databases, messaging systems, HTTP</li> <li>Discover metadata from your existing infrastructure and services</li> <li>Gain confidence that bugs do 
not propagate to production</li> <li>Be proactive in ensuring changes do not affect other data producers or consumers</li> <li>Configurability to run the way you want</li> </ul> <p> Try now </p>"},{"location":"#tech-summary","title":"Tech Summary","text":"<p>Use the Java, Scala API, or YAML files to help with setup or customisation that are all run via a Docker image. Want to  get into details? Checkout the setup pages here to get code examples and guides that will take you  through scenarios and data sources.</p> <p>Main features include:</p> <ul> <li> Metadata discovery</li> <li> Batch and  event data generation</li> <li> Maintain referential integrity across any dataset</li> <li> Create custom data generation scenarios</li> <li> Clean up generated data</li> <li> Validate data</li> <li> Suggest data validations</li> </ul> <p></p> <p>Check other run configurations here.</p>"},{"location":"#what-is-it","title":"What is it","text":"<ul> <li> <p> Data generation and testing tool</p> <p>Generate production like data to be consumed and validated.</p> </li> <li> <p> Designed for any data source</p> <p>We aim to support pushing data to any data source, in any format.</p> </li> <li> <p> Low/no code solution</p> <p>Can use the tool via either Scala, Java or YAML. Connect to data or metadata sources to generate data and validate.</p> </li> <li> <p> Developer productivity tool</p> <p>If you are a new developer or seasoned veteran, cut down on your feedback loop when developing with data.</p> </li> </ul>"},{"location":"#what-it-is-not","title":"What it is not","text":"<ul> <li> <p> Metadata storage/platform</p> <p>You could store and use metadata within the data generation/validation tasks but is not the recommended approach. 
Rather, this metadata should be gathered from existing services who handle metadata on behalf of Data Caterer.</p> </li> <li> <p> Data contract</p> <p>The focus of Data Caterer is on the data generation and testing, which can include details about how the data looks like and how it behaves. But it does not encompass all the additional metadata that comes with a data contract such as SLAs, security, etc.</p> </li> <li> <p> Metrics from load testing</p> <p>Although millions of records can be generated, there are limited capabilities in terms of metric capturing.</p> </li> </ul> <p> Try now </p> Data Catering vs Other tools vs In-house <p> Data Catering Other tools In-house Data flow Batch and events generation with validation Batch generation only or validation only Depends on architecture and design Time to results 1 day 1+ month to integrate, deploy and onboard 1+ month to build and deploy Solution Connect with your existing data ecosystem, automatic generation and validation Manual UI data entry or via SDK Depends on engineer(s) building it <p></p>"},{"location":"about/","title":"About","text":"<p>Hi, my name is Peter. I am a independent Software Developer, mainly focussing on data related services. My experience can be found on my LinkedIn.</p> <p>I have created Data Caterer to help serve individuals and companies with data generation and data testing. It is a complex area that has many edge cases or intricacies that are hard to summarise or turn into something actionable and repeatable. Through the use of metadata, Data Caterer can help simplify your data testing, simulating production environment data, aid in data debugging, or whatever your data use case may be.</p> <p>Given that it is going to save you and your team time and money, please help in considering financial support. 
This will help the product grow into a sustainable and feature-full service.</p>"},{"location":"about/#contact","title":"Contact","text":"<p>Please contact Peter Flook via Slack or via email <code>peter.flook@data.catering</code> if you have any questions or queries.</p>"},{"location":"about/#terms-of-service","title":"Terms of service","text":"<p>Terms of service can be found here.</p>"},{"location":"about/#privacy-policy","title":"Privacy policy","text":"<p>Privacy policy can be found here.</p>"},{"location":"pricing/","title":"Pricing","text":"<p>To have access to the paid features of Data Caterer, you can subscribe according to your situation. You will not be charged by usage. As you continue to subscribe, you will have access to the latest version of Data Caterer as new bug fixes and features get published.</p>"},{"location":"pricing/#paid-features","title":"Paid Features","text":"<ul> <li> Metadata discovery</li> <li> All data sources (see here for all data sources)</li> <li> Batch and  Event generation</li> <li> Auto generation from data connections or metadata sources</li> <li> Suggest data validations</li> <li> Clean up generated data</li> <li> Run as many times as you want, not charged by usage</li> </ul>"},{"location":"pricing/#pricing-table","title":"Pricing Table","text":""},{"location":"pricing/#manage-subscription","title":"Manage Subscription","text":"<p>Manage via this link</p>"},{"location":"pricing/#contact","title":"Contact","text":"<p>Please contact Peter Flook via Slack or via email <code>peter.flook@data.catering</code> if you have any questions or queries.</p>"},{"location":"get-started/docker/","title":"Run Data Caterer","text":""},{"location":"get-started/docker/#quick-start","title":"Quick start","text":"<p>Ensure you have <code>docker</code> installed and running.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\ncd data-caterer-example &amp;&amp; ./run.sh\n#check results under docker/sample/report/index.html 
folder\n</code></pre>"},{"location":"get-started/docker/#report","title":"Report","text":"<p>Check the report generated under <code>docker/data/custom/report/index.html</code>.</p> <p>A sample report can also be seen here</p>"},{"location":"get-started/docker/#paid-version-trial","title":"Paid Version Trial","text":"<p>A 30-day trial of the paid version can be accessed via these steps:</p> <ol> <li>Join the Data Catering Slack group here</li> <li>Get an API_KEY by using the slash command <code>/token</code> in the Slack group (will only be visible to you)</li> <li> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\ncd data-caterer-example &amp;&amp; export DATA_CATERING_API_KEY=&lt;insert api key&gt;\n./run.sh\n</code></pre> </li> </ol> <p>If you want to check how long your trial has left, you can check back in the Slack group or type <code>/token</code> again.</p>"},{"location":"get-started/docker/#guided-tour","title":"Guided tour","text":"<p>Check out the starter guide here that will take you through step by step. You can also check the other guides here to see the other possibilities of what Data Caterer can achieve for you.</p>"},{"location":"legal/privacy-policy/","title":"Privacy Policy","text":"<p>Last updated September 25, 2023</p>"},{"location":"legal/privacy-policy/#data-caterer-policy-on-privacy-of-customer-personal-information","title":"Data Caterer Policy on Privacy of Customer Personal Information","text":"<p>Peter John Flook is committed to protecting the privacy and security of your personal information obtained by reason of your use of Data Caterer. 
This policy explains the types of customer personal information we collect, how it is used, and the steps we take to ensure your personal information is handled appropriately.</p>"},{"location":"legal/privacy-policy/#who-is-peter-john-flook","title":"Who is Peter John Flook?","text":"<p>For purposes of this Privacy Policy, \u201cPeter John Flook\u201d means Peter John Flook, the company developing and providing Data Caterer and related websites and services.</p>"},{"location":"legal/privacy-policy/#what-is-personal-information","title":"What is personal information?","text":"<p>Personal information is information that refers to an individual specifically and is recorded in any form. Personal information includes such things as age, income, date of birth, ethnic origin and credit records. Information about individuals contained in the following documents is not considered personal information:</p> <ul> <li>public telephone directories, where the subscriber can refuse to be listed</li> <li>professional and business directories available to the public</li> <li>public registries and court records</li> <li>other publicly available printed and electronic publications</li> </ul>"},{"location":"legal/privacy-policy/#we-are-accountable-to-you","title":"We are accountable to you","text":"<p>Peter John Flook is responsible for all personal information under its control. Our team is accountable for compliance with these privacy and security principles.</p>"},{"location":"legal/privacy-policy/#we-let-you-know-why-we-collect-and-use-your-personal-information-and-get-your-consent","title":"We let you know why we collect and use your personal information and get your consent","text":"<p>Peter John Flook identifies the purpose for which your personal information is collected and will be used or disclosed. If that purpose is not listed below we will do this before or at the time the information is actually being collected. 
You will be deemed to consent to our use of your personal information for the purpose of:</p> <ul> <li>communicating with you generally</li> <li>processing your purchases</li> <li>processing and keeping track of transactions and reporting back to you</li> <li>protecting against fraud or error</li> <li>providing product and services requested by you</li> <li>recommending products and services that Peter John Flook believes will be of interest and provide value to you</li> <li>fulfilling any other purpose that would be reasonably apparent to the average person at the time we collect it from   you</li> </ul> <p>Otherwise, Peter John Flook will obtain your express consent (by verbal, written or electronic agreement) to collect, use or disclose your personal information. You can change your consent preferences at any time by contacting Peter John Flook (please refer to the \u201cHow to contact us\u201d section below).</p>"},{"location":"legal/privacy-policy/#we-limit-collection-of-your-personal-information","title":"We limit collection of your personal information","text":"<p>Peter John Flook collects only the information required to provide products and services to you. Peter John Flook will collect personal information only by clear, fair and lawful means.</p> <p>We receive and store any information you enter on our website or give us in any other way. You can choose not to provide certain information, but then you might not be able to take advantage of many of our features.</p> <p>Peter John Flook does not receive or store personal content saved to your local device while using Data Caterer.</p> <p>We also receive and store certain types of information whenever you interact with us.</p>"},{"location":"legal/privacy-policy/#information-provided-to-stripe","title":"Information provided to Stripe","text":"<p>All purchases that are made through this site are processed securely and externally by Stripe. 
Unless you expressly consent otherwise, we do not see or have access to any personal information that you may provide to Stripe, other than information that is required in order to process your order and deliver your purchased items to you (eg, your name, email address and billing/postal address).</p>"},{"location":"legal/privacy-policy/#we-limit-disclosure-and-retention-of-your-personal-information","title":"We limit disclosure and retention of your personal information","text":"<p>Peter John Flook does not disclose personal information to any organization or person for any reason except the following:</p> <p>We employ other companies and individuals to perform functions on our behalf. Examples include fulfilling orders, delivering packages, sending postal mail and e-mail, removing repetitive information from customer lists, analyzing data, providing marketing assistance, processing credit card payments, and providing customer service. They have access to personal information needed to perform their functions, but may not use it for other purposes. We may use service providers located outside of Australia, and, if applicable, your personal information may be processed and stored in other countries and therefore may be subject to disclosure under the laws of those countries. As we continue to develop our business, we might sell or buy stores, subsidiaries, or business units. In such transactions, customer information generally is one of the transferred business assets but remains subject to the promises made in any pre-existing Privacy Notice (unless, of course, the customer consents otherwise). Also, in the unlikely event that Peter John Flook or substantially all of its assets are acquired, customer information of course will be one of the transferred assets. You are deemed to consent to disclosure of your personal information for those purposes. 
If your personal information is shared with third parties, those third parties are bound by appropriate agreements with Peter John Flook to secure and protect the confidentiality of your personal information.</p> <p>Peter John Flook retains your personal information only as long as it is required for our business relationship or as required by federal and provincial laws.</p>"},{"location":"legal/privacy-policy/#we-keep-your-personal-information-up-to-date-and-accurate","title":"We keep your personal information up to date and accurate","text":"<p>Peter John Flook keeps your personal information up to date, accurate and relevant for its intended use.</p> <p>You may request access to the personal information we have on record in order to review and amend the information, as appropriate. In circumstances where your personal information has been provided by a third party, we will refer you to that party (e.g. credit bureaus). To access your personal information, refer to the \u201cHow to contact us\u201d section below.</p>"},{"location":"legal/privacy-policy/#the-security-of-your-personal-information-is-a-priority-for-peter-john-flook","title":"The security of your personal information is a priority for Peter John Flook","text":"<p>We take steps to safeguard your personal information, regardless of the format in which it is held, including:</p> <p>physical security measures such as restricted access facilities and locked filing cabinets electronic security measures for computerized personal information such as password protection, database encryption and personal identification numbers. We work to protect the security of your information during transmission by using \u201cTransport Layer Security\u201d (TLS) protocol. 
organizational processes such as limiting access to your personal information to a selected group of individuals contractual obligations with third parties who need access to your personal information requiring them to protect and secure your personal information It\u2019s important for you to protect against unauthorized access to your password and your computer. Be sure to sign off when you\u2019ve finished using any shared computer.</p>"},{"location":"legal/privacy-policy/#what-about-third-party-advertisers-and-links-to-other-websites","title":"What About Third-Party Advertisers and Links to Other Websites?","text":"<p>Our site may include third-party advertising and links to other websites. We do not provide any personally identifiable customer information to these advertisers or third-party websites.</p> <p>These third-party websites and advertisers, or Internet advertising companies working on their behalf, sometimes use technology to send (or \u201cserve\u201d) the advertisements that appear on our website directly to your browser. They automatically receive your IP address when this happens. They may also use cookies, JavaScript, web beacons (also known as action tags or single-pixel gifs), and other technologies to measure the effectiveness of their ads and to personalize advertising content. We do not have access to or control over cookies or other features that they may use, and the information practices of these advertisers and third-party websites are not covered by this Privacy Notice. Please contact them directly for more information about their privacy practices. In addition, the Network Advertising Initiative offers useful information about Internet advertising companies (also called \u201cad networks\u201d or \u201cnetwork advertisers\u201d), including information about how to opt-out of their information collection. 
You can access the Network Advertising Initiative at http://www.networkadvertising.org.</p>"},{"location":"legal/privacy-policy/#redirection-to-stripe","title":"Redirection to Stripe","text":"<p>In particular, when you submit an order to us, you may be automatically redirected to Stripe in order to complete the required payment. The payment page that is provided by Stripe is not part of this site. As noted above, we are not privy to any of the bank account, credit card or other personal information that you may provide to Stripe, other than information that is required in order to process your order and deliver your purchased items to you (eg, your name, email address and billing/postal address). We recommend that you refer to Stripe\u2019s privacy statement if you would like more information about how Stripe collects and handles your personal information.</p>"},{"location":"legal/privacy-policy/#we-are-open-about-our-privacy-and-security-policy","title":"We are open about our privacy and security policy","text":"<p>We are committed to providing you with understandable and easily available information about our policy and practices related to management of your personal information. This policy and any related information is available at all times on our website, https://data.catering/about/ under Privacy or on request. To contact us, refer to the \u201cHow to contact us\u201d section below.</p>"},{"location":"legal/privacy-policy/#we-provide-access-to-your-personal-information-stored-by-peter-john-flook","title":"We provide access to your personal information stored by Peter John Flook","text":"<p>You can request access to your personal information stored by Peter John Flook. To contact us, refer to the \u201cHow to contact us\u201d section below. 
Upon receiving such a request, Peter John Flook will:</p> <p>inform you about what type of personal information we have on record or in our control, how it is used and to whom it may have been disclosed provide you with access to your information so you can review and verify the accuracy and completeness and request changes to the information make any necessary updates to your personal information We respond to your questions, concerns and complaints about privacy Peter John Flook responds in a timely manner to your questions, concerns and complaints about the privacy of your personal information and our privacy policies and procedures.</p>"},{"location":"legal/privacy-policy/#how-to-contact-us","title":"How to contact us","text":"<ul> <li>by email at <code>peter.flook@data.catering</code></li> </ul> <p>Our business changes constantly, and this privacy notice will change also. We may e-mail periodic reminders of our notices and conditions, unless you have instructed us not to, but you should check our website frequently to see recent changes. We are, however, committed to protecting your information and will never materially change our policies and practices to make them less protective of customer information collected in the past without the consent of affected customers.</p>"},{"location":"legal/terms-of-service/","title":"Terms and Conditions","text":"<p>Last updated: September 25, 2023</p> <p>Please read these terms and conditions carefully before using Our Service.</p>"},{"location":"legal/terms-of-service/#interpretation-and-definitions","title":"Interpretation and Definitions","text":""},{"location":"legal/terms-of-service/#interpretation","title":"Interpretation","text":"<p>The words of which the initial letter is capitalized have meanings defined under the following conditions. 
The following definitions shall have the same meaning regardless of whether they appear in singular or in plural.</p>"},{"location":"legal/terms-of-service/#definitions","title":"Definitions","text":"<p>For the purposes of these Terms and Conditions:</p> <ul> <li>Application means the software program provided by the Company downloaded by You on any electronic device, named   Data Caterer</li> <li>Application Store means the digital distribution service operated and developed by Docker Inc. (\u201cDocker\u201d) in which   the Application has been downloaded.</li> <li>Affiliate means an entity that controls, is controlled by or is under common control with a party, where \"control\"   means ownership of 50% or more of the shares, equity interest or other securities entitled to vote for election of   directors or other managing authority.</li> <li>Country refers to: New South Wales, Australia</li> <li>Company (referred to as either \"the Company\", \"We\", \"Us\" or \"Our\" in this Agreement) refers to Peter John Flook (   ABN: 65153160916), 30 Anne William Drive, West Pennant Hills, 2125, NSW, Australia.</li> <li>Device means any device that can access the Service such as a computer, a cellphone or a digital tablet.</li> <li>Service refers to the Application.</li> <li>Terms and Conditions (also referred as \"Terms\") mean these Terms and Conditions that form the entire agreement   between You and the Company regarding the use of the Service.</li> <li>Third-party Social Media Service means any services or content (including data, information, products or services)   provided by a third party that may be displayed, included or made available by the Service.</li> <li>You means the individual accessing or using the Service, or the company, or other legal entity on behalf of which   such individual is accessing or using the Service, as applicable.</li> </ul>"},{"location":"legal/terms-of-service/#acknowledgment","title":"Acknowledgment","text":"<p>These are the Terms and 
Conditions governing the use of this Service and the agreement that operates between You and the Company. These Terms and Conditions set out the rights and obligations of all users regarding the use of the Service.</p> <p>Your access to and use of the Service is conditioned on Your acceptance of and compliance with these Terms and Conditions. These Terms and Conditions apply to all visitors, users and others who access or use the Service.</p> <p>By accessing or using the Service You agree to be bound by these Terms and Conditions. If You disagree with any part of these Terms and Conditions then You may not access the Service.</p> <p>You represent that you are over the age of 18. The Company does not permit those under 18 to use the Service.</p> <p>Your access to and use of the Service is also conditioned on Your acceptance of and compliance with the Privacy Policy of the Company. Our Privacy Policy describes Our policies and procedures on the collection, use and disclosure of Your personal information when You use the Application or the Website and tells You about Your privacy rights and how the law protects You. Please read Our Privacy Policy carefully before using Our Service.</p>"},{"location":"legal/terms-of-service/#links-to-other-websites","title":"Links to Other Websites","text":"<p>Our Service may contain links to third-party websites or services that are not owned or controlled by the Company.</p> <p>The Company has no control over, and assumes no responsibility for, the content, privacy policies, or practices of any third party websites or services. 
You further acknowledge and agree that the Company shall not be responsible or liable, directly or indirectly, for any damage or loss caused or alleged to be caused by or in connection with the use of or reliance on any such content, goods or services available on or through any such websites or services.</p> <p>We strongly advise You to read the terms and conditions and privacy policies of any third-party websites or services that You visit.</p>"},{"location":"legal/terms-of-service/#termination","title":"Termination","text":"<p>We may terminate or suspend Your access immediately, without prior notice or liability, for any reason whatsoever, including without limitation if You breach these Terms and Conditions.</p> <p>Upon termination, Your right to use the Service will cease immediately.</p>"},{"location":"legal/terms-of-service/#limitation-of-liability","title":"Limitation of Liability","text":"<p>Notwithstanding any damages that You might incur, the entire liability of the Company and any of its suppliers under any provision of these Terms and Your exclusive remedy for all the foregoing shall be limited to the amount actually paid by You through the Service or 100 USD if You haven't purchased anything through the Service.</p> <p>To the maximum extent permitted by applicable law, in no event shall the Company or its suppliers be liable for any special, incidental, indirect, or consequential damages whatsoever (including, but not limited to, damages for loss of profits, loss of data or other information, for business interruption, for personal injury, loss of privacy arising out of or in any way related to the use of or inability to use the Service, third-party software and/or third-party hardware used with the Service, or otherwise in connection with any provision of these Terms), even if the Company or any supplier has been advised of the possibility of such damages and even if the remedy fails of its essential purpose.</p> <p>Some states do not allow the 
exclusion of implied warranties or limitation of liability for incidental or consequential damages, which means that some of the above limitations may not apply. In these states, each party's liability will be limited to the greatest extent permitted by law.</p>"},{"location":"legal/terms-of-service/#as-is-and-as-available-disclaimer","title":"\"AS IS\" and \"AS AVAILABLE\" Disclaimer","text":"<p>The Service is provided to You \"AS IS\" and \"AS AVAILABLE\" and with all faults and defects without warranty of any kind. To the maximum extent permitted under applicable law, the Company, on its own behalf and on behalf of its Affiliates and its and their respective licensors and service providers, expressly disclaims all warranties, whether express, implied, statutory or otherwise, with respect to the Service, including all implied warranties of merchantability, fitness for a particular purpose, title and non-infringement, and warranties that may arise out of course of dealing, course of performance, usage or trade practice. 
Without limitation to the foregoing, the Company provides no warranty or undertaking, and makes no representation of any kind that the Service will meet Your requirements, achieve any intended results, be compatible or work with any other software, applications, systems or services, operate without interruption, meet any performance or reliability standards or be error free or that any errors or defects can or will be corrected.</p> <p>Without limiting the foregoing, neither the Company nor any of the company's provider makes any representation or warranty of any kind, express or implied: (i) as to the operation or availability of the Service, or the information, content, and materials or products included thereon; (ii) that the Service will be uninterrupted or error-free; (iii) as to the accuracy, reliability, or currency of any information or content provided through the Service; or (iv) that the Service, its servers, the content, or e-mails sent from or on behalf of the Company are free of viruses, scripts, trojan horses, worms, malware, time-bombs or other harmful components.</p> <p>Some jurisdictions do not allow the exclusion of certain types of warranties or limitations on applicable statutory rights of a consumer, so some or all of the above exclusions and limitations may not apply to You. But in such a case the exclusions and limitations set forth in this section shall be applied to the greatest extent enforceable under applicable law.</p>"},{"location":"legal/terms-of-service/#governing-law","title":"Governing Law","text":"<p>The laws of the Country, excluding its conflicts of law rules, shall govern this Terms and Your use of the Service. 
Your use of the Application may also be subject to other local, state, national, or international laws.</p>"},{"location":"legal/terms-of-service/#disputes-resolution","title":"Disputes Resolution","text":"<p>If You have any concern or dispute about the Service, You agree to first try to resolve the dispute informally by contacting the Company.</p>"},{"location":"legal/terms-of-service/#for-european-union-eu-users","title":"For European Union (EU) Users","text":"<p>If You are a European Union consumer, you will benefit from any mandatory provisions of the law of the country in which you are resident.</p>"},{"location":"legal/terms-of-service/#united-states-legal-compliance","title":"United States Legal Compliance","text":"<p>You represent and warrant that (i) You are not located in a country that is subject to the United States government embargo, or that has been designated by the United States government as a \"terrorist supporting\" country, and (ii) You are not listed on any United States government list of prohibited or restricted parties.</p>"},{"location":"legal/terms-of-service/#severability-and-waiver","title":"Severability and Waiver","text":""},{"location":"legal/terms-of-service/#severability","title":"Severability","text":"<p>If any provision of these Terms is held to be unenforceable or invalid, such provision will be changed and interpreted to accomplish the objectives of such provision to the greatest extent possible under applicable law and the remaining provisions will continue in full force and effect.</p>"},{"location":"legal/terms-of-service/#waiver","title":"Waiver","text":"<p>Except as provided herein, the failure to exercise a right or to require performance of an obligation under these Terms shall not affect a party's ability to exercise such right or require such performance at any time thereafter nor shall the waiver of a breach constitute a waiver of any subsequent 
breach.</p>"},{"location":"legal/terms-of-service/#translation-interpretation","title":"Translation Interpretation","text":"<p>These Terms and Conditions may have been translated if We have made them available to You on our Service. You agree that the original English text shall prevail in the case of a dispute.</p>"},{"location":"legal/terms-of-service/#changes-to-these-terms-and-conditions","title":"Changes to These Terms and Conditions","text":"<p>We reserve the right, at Our sole discretion, to modify or replace these Terms at any time. If a revision is material We will make reasonable efforts to provide at least 30 days' notice prior to any new terms taking effect. What constitutes a material change will be determined at Our sole discretion.</p> <p>By continuing to access or use Our Service after those revisions become effective, You agree to be bound by the revised terms. If You do not agree to the new terms, in whole or in part, please stop using the website and the Service.</p>"},{"location":"legal/terms-of-service/#contact-us","title":"Contact Us","text":"<p>If you have any questions about these Terms and Conditions, You can contact us:</p> <ul> <li>By email: peter.flook@data.catering</li> </ul>"},{"location":"setup/","title":"Setup","text":"<p>All the configurations and customisation related to Data Caterer can be found under here.</p>"},{"location":"setup/#guide","title":"Guide","text":"<p>If you want a guided tour of using the Java or Scala API, you can follow one of the guides found here.</p>"},{"location":"setup/#specific-configuration","title":"Specific Configuration","text":"<ul> <li> Configurations - Configurations relating to feature flags, folder pathways, metadata   analysis</li> <li> Connections - Explore the data source connections available</li> <li> Generators - Choose and configure the type of generator you want used for   fields</li> <li> Validations - How to validate data to ensure your system is performing as expected</li> <li> Foreign 
Keys - Define links between data elements across data sources</li> <li> Deployment - Deploy Data Caterer as a job to your chosen environment</li> <li> Advanced - Advanced usage of Data Caterer</li> </ul>"},{"location":"setup/#high-level-run-configurations","title":"High Level Run Configurations","text":""},{"location":"setup/configuration/","title":"Configuration","text":"<p>A number of configurations can be made and customised within Data Caterer to help control what gets run and/or where any metadata gets saved.</p> <p>These configurations are defined from within your Java or Scala class via <code>configuration</code> or for YAML file setup, <code>application.conf</code> file as seen  here.</p>"},{"location":"setup/configuration/#flags","title":"Flags","text":"<p>Flags are used to control which processes are executed when you run Data Caterer.</p> Config Default Paid Description <code>enableGenerateData</code> true N Enable/disable data generation <code>enableCount</code> true N Count the number of records generated. Can be disabled to improve performance <code>enableFailOnError</code> true N Whilst saving generated data, if there is an error, it will stop any further data from being generated <code>enableSaveReports</code> true N Enable/disable HTML reports summarising data generated, metadata of data generated (if <code>enableSinkMetadata</code> is enabled) and validation results (if <code>enableValidation</code> is enabled). Sample here <code>enableSinkMetadata</code> true N Run data profiling for the generated data. Shown in HTML reports if <code>enableSaveSinkMetadata</code> is enabled <code>enableValidation</code> false N Run validations as described in plan. Results can be viewed from logs or from HTML report if <code>enableSaveSinkMetadata</code> is enabled. 
Sample here <code>enableGeneratePlanAndTasks</code> false Y Enable/disable plan and task auto generation based on data source connections <code>enableRecordTracking</code> false Y Enable/disable tracking of which data records have been generated for any data source <code>enableDeleteGeneratedRecords</code> false Y Delete all generated records based on record tracking (if <code>enableRecordTracking</code> has been set to true) <code>enableGenerateValidations</code> false Y If enabled, it will generate validations based on the data sources defined. JavaScalaapplication.conf <pre><code>configuration()\n.enableGenerateData(true)\n.enableCount(true)\n.enableFailOnError(true)\n.enableSaveReports(true)\n.enableSinkMetadata(true)\n.enableValidation(false)\n.enableGeneratePlanAndTasks(false)\n.enableRecordTracking(false)\n.enableDeleteGeneratedRecords(false)\n.enableGenerateValidations(false);\n</code></pre> <pre><code>configuration\n.enableGenerateData(true)\n.enableCount(true)\n.enableFailOnError(true)\n.enableSaveReports(true)\n.enableSinkMetadata(true)\n.enableValidation(false)\n.enableGeneratePlanAndTasks(false)\n.enableRecordTracking(false)\n.enableDeleteGeneratedRecords(false)\n.enableGenerateValidations(false)\n</code></pre> <pre><code>flags {\n  enableCount = false\n  enableCount = ${?ENABLE_COUNT}\n  enableGenerateData = true\n  enableGenerateData = ${?ENABLE_GENERATE_DATA}\n  enableFailOnError = true\n  enableFailOnError = ${?ENABLE_FAIL_ON_ERROR}\n  enableGeneratePlanAndTasks = false\n  enableGeneratePlanAndTasks = ${?ENABLE_GENERATE_PLAN_AND_TASKS}\n  enableRecordTracking = false\n  enableRecordTracking = ${?ENABLE_RECORD_TRACKING}\n  enableDeleteGeneratedRecords = false\n  enableDeleteGeneratedRecords = ${?ENABLE_DELETE_GENERATED_RECORDS}\n  enableGenerateValidations = false\n  enableGenerateValidations = ${?ENABLE_GENERATE_VALIDATIONS}\n}\n</code></pre>"},{"location":"setup/configuration/#folders","title":"Folders","text":"<p>Depending on which flags are enabled, there 
are folders that get used to save metadata, store HTML reports or track the records generated.</p> <p>These folder pathways can be defined as a cloud storage pathway (i.e. <code>s3a://my-bucket/task</code>).</p> Config Default Paid Description <code>planFilePath</code> /opt/app/plan/customer-create-plan.yaml N Plan file path to use when generating and/or validating data <code>taskFolderPath</code> /opt/app/task N Task folder path that contains all the task files (can have nested directories) <code>validationFolderPath</code> /opt/app/validation N Validation folder path that contains all the validation files (can have nested directories) <code>generatedReportsFolderPath</code> /opt/app/report N Where HTML reports get generated that contain information about data generated along with any validations performed <code>generatedPlanAndTaskFolderPath</code> /tmp Y Folder path where generated plan and task files will be saved <code>recordTrackingFolderPath</code> /opt/app/record-tracking Y Where record tracking parquet files get saved JavaScalaapplication.conf <pre><code>configuration()\n.planFilePath(\"/opt/app/custom/plan/postgres-plan.yaml\")\n.taskFolderPath(\"/opt/app/custom/task\")\n.validationFolderPath(\"/opt/app/custom/validation\")\n.generatedReportsFolderPath(\"/opt/app/custom/report\")\n.generatedPlanAndTaskFolderPath(\"/opt/app/custom/generated\")\n.recordTrackingFolderPath(\"/opt/app/custom/record-tracking\");\n</code></pre> <pre><code>configuration\n.planFilePath(\"/opt/app/custom/plan/postgres-plan.yaml\")\n.taskFolderPath(\"/opt/app/custom/task\")\n.validationFolderPath(\"/opt/app/custom/validation\")\n.generatedReportsFolderPath(\"/opt/app/custom/report\")\n.generatedPlanAndTaskFolderPath(\"/opt/app/custom/generated\")\n.recordTrackingFolderPath(\"/opt/app/custom/record-tracking\")\n</code></pre> <pre><code>folders {\n  planFilePath = \"/opt/app/custom/plan/postgres-plan.yaml\"\n  planFilePath = ${?PLAN_FILE_PATH}\n  taskFolderPath = 
\"/opt/app/custom/task\"\n  taskFolderPath = ${?TASK_FOLDER_PATH}\n  validationFolderPath = \"/opt/app/custom/validation\"\n  validationFolderPath = ${?VALIDATION_FOLDER_PATH}\n  generatedReportsFolderPath = \"/opt/app/custom/report\"\n  generatedReportsFolderPath = ${?GENERATED_REPORTS_FOLDER_PATH}\n  generatedPlanAndTaskFolderPath = \"/opt/app/custom/generated\"\n  generatedPlanAndTaskFolderPath = ${?GENERATED_PLAN_AND_TASK_FOLDER_PATH}\n  recordTrackingFolderPath = \"/opt/app/custom/record-tracking\"\n  recordTrackingFolderPath = ${?RECORD_TRACKING_FOLDER_PATH}\n}\n</code></pre>"},{"location":"setup/configuration/#metadata","title":"Metadata","text":"<p>When metadata gets generated, there are some configurations that can be altered to help with performance or accuracy related issues. Metadata gets generated from two processes: 1) if <code>enableGeneratePlanAndTasks</code> or 2) if <code>enableSinkMetadata</code> are enabled.</p> <p>During the generation of plan and tasks, data profiling is used to create the metadata for each of the fields defined in the data source. You may face issues if the number of records in the data source is large as data profiling is an expensive task. Similarly, it can be expensive when analysing the generated data if the number of records generated is large.</p> Config Default Paid Description <code>numRecordsFromDataSource</code> 10000 Y Number of records read in from the data source that could be used for data profiling <code>numRecordsForAnalysis</code> 10000 Y Number of records used for data profiling from the records gathered in <code>numRecordsFromDataSource</code> <code>oneOfMinCount</code> 1000 Y Minimum number of records required before considering if a field can be of type <code>oneOf</code> <code>oneOfDistinctCountVsCountThreshold</code> 0.2 Y Threshold ratio to determine if a field is of type <code>oneOf</code> (i.e. a field called <code>status</code> that only contains <code>open</code> or <code>closed</code>. 
Distinct count = 2, total count = 10, ratio = 2 / 10 = 0.2, therefore marked as <code>oneOf</code>) <code>numGeneratedSamples</code> 10 N Number of sample records from generated data to take. Shown in HTML report JavaScalaapplication.conf <pre><code>configuration()\n.numRecordsFromDataSourceForDataProfiling(10000)\n.numRecordsForAnalysisForDataProfiling(10000)\n.oneOfMinCount(1000)\n.oneOfDistinctCountVsCountThreshold(0.2)\n.numGeneratedSamples(10);\n</code></pre> <pre><code>configuration\n.numRecordsFromDataSourceForDataProfiling(10000)\n.numRecordsForAnalysisForDataProfiling(10000)\n.oneOfMinCount(1000)\n.oneOfDistinctCountVsCountThreshold(0.2)\n.numGeneratedSamples(10)\n</code></pre> <pre><code>metadata {\n  numRecordsFromDataSource = 10000\n  numRecordsForAnalysis = 10000\n  oneOfMinCount = 1000\n  oneOfDistinctCountVsCountThreshold = 0.2\n  numGeneratedSamples = 10\n}\n</code></pre>"},{"location":"setup/configuration/#generation","title":"Generation","text":"<p>When generating data, you may have some limitations such as limited CPU or memory, a large number of data sources, or data sources prone to failure under load. To help alleviate these issues or speed up performance, you can control the number of records that get generated in each batch.</p> Config Default Paid Description <code>numRecordsPerBatch</code> 100000 N Number of records across all data sources to generate per batch <code>numRecordsPerStep</code> N Overrides the count defined in each step with this value if defined (i.e. 
if set to 1000, for each step, 1000 records will be generated) JavaScalaapplication.conf <pre><code>configuration()\n.numRecordsPerBatch(100000)\n.numRecordsPerStep(1000);\n</code></pre> <pre><code>configuration\n.numRecordsPerBatch(100000)\n.numRecordsPerStep(1000)\n</code></pre> <pre><code>generation {\n  numRecordsPerBatch = 100000\n  numRecordsPerStep = 1000\n}\n</code></pre>"},{"location":"setup/configuration/#runtime","title":"Runtime","text":"<p>Given Data Caterer uses Spark as the base framework for data processing, you can configure the job to your specifications via configuration as seen here.</p> JavaScalaapplication.conf <pre><code>configuration()\n.master(\"local[*]\")\n.runtimeConfig(Map.of(\"spark.driver.cores\", \"5\"))\n.addRuntimeConfig(\"spark.driver.memory\", \"10g\");\n</code></pre> <pre><code>configuration\n.master(\"local[*]\")\n.runtimeConfig(Map(\"spark.driver.cores\" -&gt; \"5\"))\n.addRuntimeConfig(\"spark.driver.memory\" -&gt; \"10g\")\n</code></pre> <pre><code>runtime {\n  master = \"local[*]\"\n  master = ${?DATA_CATERER_MASTER}\n  config {\n    \"spark.driver.cores\" = \"5\"\n    \"spark.driver.memory\" = \"10g\"\n  }\n}\n</code></pre>"},{"location":"setup/advanced/advanced/","title":"Advanced use cases","text":""},{"location":"setup/advanced/advanced/#special-data-formats","title":"Special data formats","text":"<p>There are many options available for scenarios where data has to be in a certain format.</p> <ol> <li>Create expression datafaker<ol> <li>Can be used to create names, addresses, or anything that can be found under here</li> </ol> </li> <li>Create regex</li> </ol>"},{"location":"setup/advanced/advanced/#foreign-keys-across-data-sets","title":"Foreign keys across data sets","text":"<p>Details for how you can configure foreign keys can be found here.</p>"},{"location":"setup/advanced/advanced/#edge-cases","title":"Edge cases","text":"<p>For each given data type, there are edge cases which can 
cause issues when your application processes the data. This can be controlled at a column level by including the following flags in the generator options:</p> JavaScalaYAML <pre><code>field()\n.name(\"amount\")\n.type(DoubleType.instance())\n.enableEdgeCases(true)\n.edgeCaseProbability(0.1)\n</code></pre> <pre><code>field\n.name(\"amount\")\n.`type`(DoubleType)\n.enableEdgeCases(true)\n.edgeCaseProbability(0.1)\n</code></pre> <pre><code>fields:\n- name: \"amount\"\ntype: \"double\"\ngenerator:\ntype: \"random\"\noptions:\nenableEdgeCases: \"true\"\nedgeCaseProb: 0.1\n</code></pre> <p>If you want to know all the possible edge cases for each data type, you can check the documentation here.</p>"},{"location":"setup/advanced/advanced/#scenario-testing","title":"Scenario testing","text":"<p>You can create specific scenarios by adjusting the metadata found in the plan and tasks to your liking. For example, suppose you had two data sources, a Postgres database and a parquet file, and you wanted to save account data into Postgres and transactions related to those accounts into a parquet file. You can alter the <code>status</code> column in the account data to only generate <code>open</code> accounts and define a foreign key between Postgres and parquet to ensure the same <code>account_id</code> is being used. Then in the parquet task, define 1 to 10 transactions per <code>account_id</code> to be generated.</p> <p>Postgres account generation example task Parquet transaction generation example task Plan</p>"},{"location":"setup/advanced/advanced/#cloud-storage","title":"Cloud storage","text":""},{"location":"setup/advanced/advanced/#data-source","title":"Data source","text":"<p>If you want to save the file types CSV, JSON, Parquet or ORC into cloud storage, you can do so by adding extra configurations. 
Below is an example for S3.</p> JavaScalaYAML <pre><code>var csvTask = csv(\"my_csv\", \"s3a://my-bucket/csv/accounts\")\n.schema(\nfield().name(\"account_id\"),\n...\n);\n\nvar s3Configuration = configuration()\n.runtimeConfig(Map.of(\n\"spark.hadoop.fs.s3a.directory.marker.retention\", \"keep\",\n\"spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled\", \"true\",\n\"spark.hadoop.fs.defaultFS\", \"s3a://my-bucket\",\n//can change to other credential providers as shown here\n//https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Changing_Authentication_Providers\n\"spark.hadoop.fs.s3a.aws.credentials.provider\", \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\",\n\"spark.hadoop.fs.s3a.access.key\", \"access_key\",\n\"spark.hadoop.fs.s3a.secret.key\", \"secret_key\"\n));\n\nexecute(s3Configuration, csvTask);\n</code></pre> <pre><code>val csvTask = csv(\"my_csv\", \"s3a://my-bucket/csv/accounts\")\n.schema(\nfield.name(\"account_id\"),\n...\n)\n\nval s3Configuration = configuration\n.runtimeConfig(Map(\n\"spark.hadoop.fs.s3a.directory.marker.retention\" -&gt; \"keep\",\n\"spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled\" -&gt; \"true\",\n\"spark.hadoop.fs.defaultFS\" -&gt; \"s3a://my-bucket\",\n//can change to other credential providers as shown here\n//https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Changing_Authentication_Providers\n\"spark.hadoop.fs.s3a.aws.credentials.provider\" -&gt; \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\",\n\"spark.hadoop.fs.s3a.access.key\" -&gt; \"access_key\",\n\"spark.hadoop.fs.s3a.secret.key\" -&gt; \"secret_key\"\n))\n\nexecute(s3Configuration, csvTask)\n</code></pre> <pre><code>folders {\ngeneratedPlanAndTaskFolderPath = \"s3a://my-bucket/data-caterer/generated\"\nplanFilePath = \"s3a://my-bucket/data-caterer/generated/plan/customer-create-plan.yaml\"\ntaskFolderPath = \"s3a://my-bucket/data-caterer/generated/task\"\n}\n\nruntime {\nconfig 
{\n...\n#S3\n\"spark.hadoop.fs.s3a.directory.marker.retention\" = \"keep\"\n\"spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled\" = \"true\"\n\"spark.hadoop.fs.defaultFS\" = \"s3a://my-bucket\"\n#can change to other credential providers as shown here\n#https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Changing_Authentication_Providers\n\"spark.hadoop.fs.s3a.aws.credentials.provider\" = \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\"\n\"spark.hadoop.fs.s3a.access.key\" = \"access_key\"\n\"spark.hadoop.fs.s3a.secret.key\" = \"secret_key\"\n}\n}\n</code></pre>"},{"location":"setup/advanced/advanced/#storing-plantasks","title":"Storing plan/task(s)","text":"<p>You can generate and store the plan/task files inside either AWS S3, Azure Blob Storage or Google GCS. This can be controlled via configuration set in the <code>application.conf</code> file where you can set something like the below:</p> JavaScalaYAML <pre><code>configuration()\n.generatedPlanAndTaskFolderPath(\"s3a://my-bucket/data-caterer/generated\")\n.planFilePath(\"s3a://my-bucket/data-caterer/generated/plan/customer-create-plan.yaml\")\n.taskFolderPath(\"s3a://my-bucket/data-caterer/generated/task\")\n.runtimeConfig(Map.of(\n\"spark.hadoop.fs.s3a.directory.marker.retention\", \"keep\",\n\"spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled\", \"true\",\n\"spark.hadoop.fs.defaultFS\", \"s3a://my-bucket\",\n//can change to other credential providers as shown here\n//https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Changing_Authentication_Providers\n\"spark.hadoop.fs.s3a.aws.credentials.provider\", \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\",\n\"spark.hadoop.fs.s3a.access.key\", \"access_key\",\n\"spark.hadoop.fs.s3a.secret.key\", \"secret_key\"\n));\n</code></pre> 
<pre><code>configuration\n.generatedPlanAndTaskFolderPath(\"s3a://my-bucket/data-caterer/generated\")\n.planFilePath(\"s3a://my-bucket/data-caterer/generated/plan/customer-create-plan.yaml\")\n.taskFolderPath(\"s3a://my-bucket/data-caterer/generated/task\")\n.runtimeConfig(Map(\n\"spark.hadoop.fs.s3a.directory.marker.retention\" -&gt; \"keep\",\n\"spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled\" -&gt; \"true\",\n\"spark.hadoop.fs.defaultFS\" -&gt; \"s3a://my-bucket\",\n//can change to other credential providers as shown here\n//https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Changing_Authentication_Providers\n\"spark.hadoop.fs.s3a.aws.credentials.provider\" -&gt; \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\",\n\"spark.hadoop.fs.s3a.access.key\" -&gt; \"access_key\",\n\"spark.hadoop.fs.s3a.secret.key\" -&gt; \"secret_key\"\n))\n</code></pre> <pre><code>folders {\ngeneratedPlanAndTaskFolderPath = \"s3a://my-bucket/data-caterer/generated\"\nplanFilePath = \"s3a://my-bucket/data-caterer/generated/plan/customer-create-plan.yaml\"\ntaskFolderPath = \"s3a://my-bucket/data-caterer/generated/task\"\n}\n\nruntime {\nconfig {\n...\n#S3\n\"spark.hadoop.fs.s3a.directory.marker.retention\" = \"keep\"\n\"spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled\" = \"true\"\n\"spark.hadoop.fs.defaultFS\" = \"s3a://my-bucket\"\n#can change to other credential providers as shown here\n#https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Changing_Authentication_Providers\n\"spark.hadoop.fs.s3a.aws.credentials.provider\" = \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\"\n\"spark.hadoop.fs.s3a.access.key\" = \"access_key\"\n\"spark.hadoop.fs.s3a.secret.key\" = \"secret_key\"\n}\n}\n</code></pre>"},{"location":"setup/connection/connection/","title":"Data Source Connections","text":"<p>Details of all the connection configuration supported can be found in the below subsections for each type of connection.</p> 
<p>These configurations can be done via API or from configuration. Examples of both are shown for each data source below.</p>"},{"location":"setup/connection/connection/#supported-data-connections","title":"Supported Data Connections","text":"Data Source Type Data Source Paid Database Postgres, MySQL, Cassandra N (Postgres), Y (rest) File CSV, JSON, ORC, Parquet N Messaging Kafka, Solace Y HTTP REST API Y Metadata Marquez, OpenMetadata, OpenAPI/Swagger Y"},{"location":"setup/connection/connection/#api","title":"API","text":"<p>All connection details require a name. Depending on the data source, you can define additional options which may be used by the driver or connector for connecting to the data source.</p>"},{"location":"setup/connection/connection/#configuration-file","title":"Configuration file","text":"<p>All connection details follow the same pattern.</p> <pre><code>&lt;connection format&gt; {\n    &lt;connection name&gt; {\n        &lt;key&gt; = &lt;value&gt;\n    }\n}\n</code></pre> <p>Overriding configuration</p> <p>When defining a configuration value that can be defined by a system property or environment variable at runtime, you can define that via the following:</p> <pre><code>url = \"localhost\"\nurl = ${?POSTGRES_URL}\n</code></pre> <p>The above defines that if there is a system property or environment variable named <code>POSTGRES_URL</code>, then that value will be used for the <code>url</code>, otherwise, it will default to <code>localhost</code>.</p>"},{"location":"setup/connection/connection/#data-sources","title":"Data sources","text":"<p>To find examples of a task for each type of data source, please check out this page.</p>"},{"location":"setup/connection/connection/#file","title":"File","text":"<p>Linked here is a list of generic options that can be included as part of your file data source configuration if required. 
Links to specific file type configurations can be found below.</p>"},{"location":"setup/connection/connection/#csv","title":"CSV","text":"JavaScalaapplication.conf <pre><code>csv(\"customer_transactions\", \"/data/customer/transaction\")\n</code></pre> <pre><code>csv(\"customer_transactions\", \"/data/customer/transaction\")\n</code></pre> <pre><code>csv {\n  customer_transactions {\n    path = \"/data/customer/transaction\"\n    path = ${?CSV_PATH}\n  }\n}\n</code></pre> <p>Other available configuration for CSV can be found here</p>"},{"location":"setup/connection/connection/#json","title":"JSON","text":"JavaScalaapplication.conf <pre><code>json(\"customer_transactions\", \"/data/customer/transaction\")\n</code></pre> <pre><code>json(\"customer_transactions\", \"/data/customer/transaction\")\n</code></pre> <pre><code>json {\n  customer_transactions {\n    path = \"/data/customer/transaction\"\n    path = ${?JSON_PATH}\n  }\n}\n</code></pre> <p>Other available configuration for JSON can be found here</p>"},{"location":"setup/connection/connection/#orc","title":"ORC","text":"JavaScalaapplication.conf <pre><code>orc(\"customer_transactions\", \"/data/customer/transaction\")\n</code></pre> <pre><code>orc(\"customer_transactions\", \"/data/customer/transaction\")\n</code></pre> <pre><code>orc {\n  customer_transactions {\n    path = \"/data/customer/transaction\"\n    path = ${?ORC_PATH}\n  }\n}\n</code></pre> <p>Other available configuration for ORC can be found here</p>"},{"location":"setup/connection/connection/#parquet","title":"Parquet","text":"JavaScalaapplication.conf <pre><code>parquet(\"customer_transactions\", \"/data/customer/transaction\")\n</code></pre> <pre><code>parquet(\"customer_transactions\", \"/data/customer/transaction\")\n</code></pre> <pre><code>parquet {\n  customer_transactions {\n    path = \"/data/customer/transaction\"\n    path = ${?PARQUET_PATH}\n  }\n}\n</code></pre> <p>Other available configuration for Parquet can be found 
here</p>"},{"location":"setup/connection/connection/#delta-not-supported-yet","title":"Delta (not supported yet)","text":"JavaScalaapplication.conf <pre><code>delta(\"customer_transactions\", \"/data/customer/transaction\")\n</code></pre> <pre><code>delta(\"customer_transactions\", \"/data/customer/transaction\")\n</code></pre> <pre><code>delta {\n  customer_transactions {\n    path = \"/data/customer/transaction\"\n    path = ${?DELTA_PATH}\n  }\n}\n</code></pre>"},{"location":"setup/connection/connection/#rmdbs","title":"RMDBS","text":"<p>Follows the same configuration used by Spark as found here. Sample can be found below</p> JavaScalaapplication.conf <pre><code>postgres(\n\"customer_postgres\",                            #name\n\"jdbc:postgresql://localhost:5432/customer\",    #url\n\"postgres\",                                     #username\n\"postgres\"                                      #password\n)\n</code></pre> <pre><code>postgres(\n\"customer_postgres\",                            #name\n\"jdbc:postgresql://localhost:5432/customer\",    #url\n\"postgres\",                                     #username\n\"postgres\"                                      #password\n)\n</code></pre> <pre><code>jdbc {\n    customer_postgres {\n        url = \"jdbc:postgresql://localhost:5432/customer\"\n        url = ${?POSTGRES_URL}\n        user = \"postgres\"\n        user = ${?POSTGRES_USERNAME}\n        password = \"postgres\"\n        password = ${?POSTGRES_PASSWORD}\n        driver = \"org.postgresql.Driver\"\n    }\n}\n</code></pre> <p>Ensure that the user has write permission, so it is able to save the table to the target tables.</p> SQL Permission Statements <pre><code>GRANT INSERT ON &lt;schema&gt;.&lt;table&gt; TO &lt;user&gt;;\n</code></pre>"},{"location":"setup/connection/connection/#postgres","title":"Postgres","text":"<p>Can see example API or Config definition for Postgres connection 
above.</p>"},{"location":"setup/connection/connection/#permissions","title":"Permissions","text":"<p>Following permissions are required when generating plan and tasks:</p> SQL Permission Statements <pre><code>GRANT SELECT ON information_schema.tables TO &lt; user &gt;;\nGRANT SELECT ON information_schema.columns TO &lt; user &gt;;\nGRANT SELECT ON information_schema.key_column_usage TO &lt; user &gt;;\nGRANT SELECT ON information_schema.table_constraints TO &lt; user &gt;;\nGRANT SELECT ON information_schema.constraint_column_usage TO &lt; user &gt;;\n</code></pre>"},{"location":"setup/connection/connection/#mysql","title":"MySQL","text":"JavaScalaapplication.conf <pre><code>mysql(\n\"customer_mysql\",                       #name\n\"jdbc:mysql://localhost:3306/customer\", #url\n\"root\",                                 #username\n\"root\"                                  #password\n)\n</code></pre> <pre><code>mysql(\n\"customer_mysql\",                       #name\n\"jdbc:mysql://localhost:3306/customer\", #url\n\"root\",                                 #username\n\"root\"                                  #password\n)\n</code></pre> <pre><code>jdbc {\n    customer_mysql {\n        url = \"jdbc:mysql://localhost:3306/customer\"\n        user = \"root\"\n        password = \"root\"\n        driver = \"com.mysql.cj.jdbc.Driver\"\n    }\n}\n</code></pre>"},{"location":"setup/connection/connection/#permissions_1","title":"Permissions","text":"<p>Following permissions are required when generating plan and tasks:</p> SQL Permission Statements <pre><code>GRANT SELECT ON information_schema.columns TO &lt; user &gt;;\nGRANT SELECT ON information_schema.statistics TO &lt; user &gt;;\nGRANT SELECT ON information_schema.key_column_usage TO &lt; user &gt;;\n</code></pre>"},{"location":"setup/connection/connection/#cassandra","title":"Cassandra","text":"<p>Follows same configuration as defined by the Spark Cassandra Connector as found here</p> JavaScalaapplication.conf 
<pre><code>cassandra(\n\"customer_cassandra\",   #name\n\"localhost:9042\",       #url\n\"cassandra\",            #username\n\"cassandra\",            #password\nMap.of()                #optional additional connection options\n)\n</code></pre> <pre><code>cassandra(\n\"customer_cassandra\",   #name\n\"localhost:9042\",       #url\n\"cassandra\",            #username\n\"cassandra\",            #password\nMap()                #optional additional connection options\n)\n</code></pre> <pre><code>org.apache.spark.sql.cassandra {\n    customer_cassandra {\n        spark.cassandra.connection.host = \"localhost\"\n        spark.cassandra.connection.host = ${?CASSANDRA_HOST}\n        spark.cassandra.connection.port = \"9042\"\n        spark.cassandra.connection.port = ${?CASSANDRA_PORT}\n        spark.cassandra.auth.username = \"cassandra\"\n        spark.cassandra.auth.username = ${?CASSANDRA_USERNAME}\n        spark.cassandra.auth.password = \"cassandra\"\n        spark.cassandra.auth.password = ${?CASSANDRA_PASSWORD}\n    }\n}\n</code></pre>"},{"location":"setup/connection/connection/#permissions_2","title":"Permissions","text":"<p>Ensure that the user has write permission, so it is able to save the table to the target tables.</p> CQL Permission Statements <pre><code>GRANT INSERT ON &lt;schema&gt;.&lt;table&gt; TO &lt;user&gt;;\n</code></pre> <p>Following permissions are required when enabling <code>configuration.enableGeneratePlanAndTasks(true)</code> as it will gather metadata information about tables and columns from the below tables.</p> CQL Permission Statements <pre><code>GRANT SELECT ON system_schema.tables TO &lt;user&gt;;\nGRANT SELECT ON system_schema.columns TO &lt;user&gt;;\n</code></pre>"},{"location":"setup/connection/connection/#kafka","title":"Kafka","text":"<p>Define your Kafka bootstrap server to connect and send generated data to corresponding topics. Topic gets set at a step level. 
Further details can be found here</p> JavaScalaapplication.conf <pre><code>kafka(\n\"customer_kafka\",   #name\n\"localhost:9092\"    #url\n)\n</code></pre> <pre><code>kafka(\n\"customer_kafka\",   #name\n\"localhost:9092\"    #url\n)\n</code></pre> <pre><code>kafka {\n    customer_kafka {\n        kafka.bootstrap.servers = \"localhost:9092\"\n        kafka.bootstrap.servers = ${?KAFKA_BOOTSTRAP_SERVERS}\n    }\n}\n</code></pre> <p>When defining your schema for pushing data to Kafka, it follows a specific top level schema. An example can be found here . You can define the key, value, headers, partition or topic by following the linked schema.</p>"},{"location":"setup/connection/connection/#jms","title":"JMS","text":"<p>Uses JNDI lookup to send messages to JMS queue. Ensure that the messaging system you are using has your queue/topic registered via JNDI otherwise a connection cannot be created.</p> JavaScalaapplication.conf <pre><code>solace(\n\"customer_solace\",                                      #name\n\"smf://localhost:55554\",                                #url\n\"admin\",                                                #username\n\"admin\",                                                #password\n\"default\",                                              #vpn name\n\"/jms/cf/default\",                                      #connection factory\n\"com.solacesystems.jndi.SolJNDIInitialContextFactory\"   #initial context factory\n)\n</code></pre> <pre><code>solace(\n\"customer_solace\",                                      #name\n\"smf://localhost:55554\",                                #url\n\"admin\",                                                #username\n\"admin\",                                                #password\n\"default\",                                              #vpn name\n\"/jms/cf/default\",                                      #connection factory\n\"com.solacesystems.jndi.SolJNDIInitialContextFactory\"   #initial context 
factory\n)\n</code></pre> <pre><code>jms {\n    customer_solace {\n        initialContextFactory = \"com.solacesystems.jndi.SolJNDIInitialContextFactory\"\n        connectionFactory = \"/jms/cf/default\"\n        url = \"smf://localhost:55555\"\n        url = ${?SOLACE_URL}\n        user = \"admin\"\n        user = ${?SOLACE_USER}\n        password = \"admin\"\n        password = ${?SOLACE_PASSWORD}\n        vpnName = \"default\"\n        vpnName = ${?SOLACE_VPN}\n    }\n}\n</code></pre>"},{"location":"setup/connection/connection/#http","title":"HTTP","text":"<p>Define any username and/or password needed for the HTTP requests. The url is defined in the tasks to allow for generated data to be populated in the url.</p> JavaScalaapplication.conf <pre><code>http(\n\"customer_api\", #name\n\"admin\",        #username\n\"admin\"         #password\n)\n</code></pre> <pre><code>http(\n\"customer_api\", #name\n\"admin\",        #username\n\"admin\"         #password\n)\n</code></pre> <pre><code>http {\n    customer_api {\n        user = \"admin\"\n        user = ${?HTTP_USER}\n        password = \"admin\"\n        password = ${?HTTP_PASSWORD}\n    }\n}\n</code></pre>"},{"location":"setup/deployment/deployment/","title":"Deployment","text":"<p>Two main ways to deploy and run Data Caterer:</p> <ul> <li>Docker</li> <li>Helm</li> </ul>"},{"location":"setup/deployment/deployment/#docker","title":"Docker","text":"<p>To package up your class along with the Data Caterer base image, you can follow the Dockerfile that is created for you here.</p> <p>Then you can run the following:</p> <pre><code>./gradlew clean build\ndocker build -t &lt;my_image_name&gt;:&lt;my_image_tag&gt; .\n</code></pre>"},{"location":"setup/deployment/deployment/#helm","title":"Helm","text":"<p>Link to sample helm on GitHub here</p> <p>Update the configuration to your own data connections and configuration or own image created from above.</p> <pre><code>git clone 
git@github.com:pflooky/data-caterer-example.git\nhelm install data-caterer ./data-caterer-example/helm/data-caterer\n</code></pre>"},{"location":"setup/foreign-key/foreign-key/","title":"Foreign Keys","text":"<p>Foreign keys can be defined to represent the relationships between datasets where values are required to match for particular columns.</p>"},{"location":"setup/foreign-key/foreign-key/#single-column","title":"Single column","text":"<p>Define a column in one data source to match against another column. Below example shows a <code>postgres</code> data source with two tables, <code>accounts</code> and <code>transactions</code> that have a foreign key for <code>account_id</code>.</p> JavaScalaYAML <pre><code>var postgresAcc = postgres(\"my_postgres\", \"jdbc:...\")\n.table(\"public.accounts\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"name\"),\n...\n);\nvar postgresTxn = postgres(postgresAcc)\n.table(\"public.transactions\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"full_name\"),\n...\n);\n\nplan().addForeignKeyRelationship(\npostgresAcc, \"account_id\",\nList.of(Map.entry(postgresTxn, \"account_id\"))\n);\n</code></pre> <pre><code>val postgresAcc = postgres(\"my_postgres\", \"jdbc:...\")\n.table(\"public.accounts\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"name\"),\n...\n)\nval postgresTxn = postgres(postgresAcc)\n.table(\"public.transactions\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"full_name\"),\n...\n)\n\nplan.addForeignKeyRelationship(\npostgresAcc, \"account_id\",\nList(postgresTxn -&gt; \"account_id\")\n)\n</code></pre> <pre><code>---\nname: \"postgres_data\"\nsteps:\n- name: \"accounts\"\ntype: \"postgres\"\noptions:\ndbtable: \"account.accounts\"\nschema:\nfields:\n- name: \"account_id\"\n- name: \"name\"\n- name: \"transactions\"\ntype: \"postgres\"\noptions:\ndbtable: \"account.transactions\"\nschema:\nfields:\n- name: \"account_id\"\n- name: \"full_name\"\n---\nname: 
\"customer_create_plan\"\ndescription: \"Create customers in JDBC\"\ntasks:\n- name: \"postgres_data\"\ndataSourceName: \"my_postgres\"\n\nsinkOptions:\nforeignKeys:\n\"postgres.accounts.account_id\":\n- \"postgres.transactions.account_id\"\n</code></pre>"},{"location":"setup/foreign-key/foreign-key/#multiple-columns","title":"Multiple columns","text":"<p>You may have a scenario where multiple columns need to be aligned. From the same example, we want <code>account_id</code> and <code>name</code> from <code>accounts</code> to match with <code>account_id</code> and <code>full_name</code> to match in <code>transactions</code> respectively.</p> JavaScalaYAML <pre><code>var postgresAcc = postgres(\"my_postgres\", \"jdbc:...\")\n.table(\"public.accounts\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"name\"),\n...\n);\nvar postgresTxn = postgres(postgresAcc)\n.table(\"public.transactions\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"full_name\"),\n...\n);\n\nplan().addForeignKeyRelationship(\npostgresAcc, List.of(\"account_id\", \"name\"),\nList.of(Map.entry(postgresTxn, List.of(\"account_id\", \"full_name\")))\n);\n</code></pre> <pre><code>val postgresAcc = postgres(\"my_postgres\", \"jdbc:...\")\n.table(\"public.accounts\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"name\"),\n...\n)\nval postgresTxn = postgres(postgresAcc)\n.table(\"public.transactions\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"full_name\"),\n...\n)\n\nplan.addForeignKeyRelationship(\npostgresAcc, List(\"account_id\", \"name\"),\nList(postgresTxn -&gt; List(\"account_id\", \"full_name\"))\n)\n</code></pre> <pre><code>---\nname: \"postgres_data\"\nsteps:\n- name: \"accounts\"\ntype: \"postgres\"\noptions:\ndbtable: \"account.accounts\"\nschema:\nfields:\n- name: \"account_id\"\n- name: \"name\"\n- name: \"transactions\"\ntype: \"postgres\"\noptions:\ndbtable: \"account.transactions\"\nschema:\nfields:\n- name: \"account_id\"\n- name: 
\"full_name\"\n---\nname: \"customer_create_plan\"\ndescription: \"Create customers in JDBC\"\ntasks:\n- name: \"postgres_data\"\ndataSourceName: \"my_postgres\"\n\nsinkOptions:\nforeignKeys:\n\"my_postgres.accounts.account_id,name\":\n- \"my_postgres.transactions.account_id,full_name\"\n</code></pre>"},{"location":"setup/foreign-key/foreign-key/#nested-column","title":"Nested column","text":"<p>Your schema structure can have nested fields which can also be referenced as foreign keys. But to do so, you need to create a proxy field that gets omitted from the final saved data.</p> <p>In the example below, the nested <code>customer_details.name</code> field inside the <code>json</code> task needs to match with <code>name</code> from <code>postgres</code>. A new field in the <code>json</code> called <code>_txn_name</code> is used as a temporary column to facilitate the foreign key definition.</p> JavaScalaYAML <pre><code>var postgresAcc = postgres(\"my_postgres\", \"jdbc:...\")\n.table(\"public.accounts\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"name\"),\n...\n);\nvar jsonTask = json(\"my_json\", \"/tmp/json\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"customer_details\")\n.schema(\nfield().name(\"name\").sql(\"_txn_name\"), //nested field will get value from '_txn_name'\n...\n),\nfield().name(\"_txn_name\").omit(true)       //value will not be included in output\n);\n\nplan().addForeignKeyRelationship(\npostgresAcc, List.of(\"account_id\", \"name\"),\nList.of(Map.entry(jsonTask, List.of(\"account_id\", \"_txn_name\")))\n);\n</code></pre> <pre><code>val postgresAcc = postgres(\"my_postgres\", \"jdbc:...\")\n.table(\"public.accounts\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"name\"),\n...\n)\nvar jsonTask = json(\"my_json\", \"/tmp/json\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"customer_details\")\n.schema(\nfield.name(\"name\").sql(\"_txn_name\"), //nested field will get value from '_txn_name'\n...\n), 
field.name(\"_txn_name\").omit(true)       #value will not be included in output\n)\n\nplan.addForeignKeyRelationship(\npostgresAcc, List(\"account_id\", \"name\"),\nList(jsonTask -&gt; List(\"account_id\", \"_txn_name\"))\n)\n</code></pre> <pre><code>---\n#postgres task yaml\nname: \"postgres_data\"\nsteps:\n- name: \"accounts\"\ntype: \"postgres\"\noptions:\ndbtable: \"account.accounts\"\nschema:\nfields:\n- name: \"account_id\"\n- name: \"name\"\n---\n#json task yaml\nname: \"json_data\"\nsteps:\n- name: \"transactions\"\ntype: \"json\"\noptions:\ndbtable: \"account.transactions\"\nschema:\nfields:\n- name: \"account_id\"\n- name: \"_txn_name\"\ngenerator:\noptions:\nomit: true\n- name: \"customer_details\"\nschema:\nfields:\n- name: \"name\"\ngenerator:\ntype: \"sql\"\noptions:\nsql: \"_txn_name\"\n\n---\n#plan yaml\nname: \"customer_create_plan\"\ndescription: \"Create customers in JDBC\"\ntasks:\n- name: \"postgres_data\"\ndataSourceName: \"my_postgres\"\n- name: \"json_data\"\ndataSourceName: \"my_json\"\n\nsinkOptions:\nforeignKeys:\n\"my_postgres.accounts.account_id,name\":\n- \"my_json.transactions.account_id,_txn_name\"\n</code></pre>"},{"location":"setup/generator/count/","title":"Record Count","text":"<p>There are options related to controlling the number of records generated that can help in generating the scenarios or data required.</p>"},{"location":"setup/generator/count/#record-count_1","title":"Record Count","text":"<p>Record count is the simplest option: you define the total number of records you require for that particular step. 
For example, the step below will generate 1000 records for the CSV file.</p> JavaScalaYAML <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(1000);\n</code></pre> <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(1000)\n</code></pre> <pre><code>name: \"csv_file\"\nsteps:\n- name: \"transactions\"\ntype: \"csv\"\noptions:\npath: \"app/src/test/resources/sample/csv/transactions\"\ncount:\nrecords: 1000\n</code></pre>"},{"location":"setup/generator/count/#generated-count","title":"Generated Count","text":"<p>Like most things in Data Caterer, the count can be generated based on some metadata. For example, to generate between 1000 and 2000 records, you could use the configuration below:</p> JavaScalaYAML <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(generator().min(1000).max(2000));\n</code></pre> <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(generator.min(1000).max(2000))\n</code></pre> <pre><code>name: \"csv_file\"\nsteps:\n- name: \"transactions\"\ntype: \"csv\"\noptions:\npath: \"app/src/test/resources/sample/csv/transactions\"\ncount:\ngenerator:\ntype: \"random\"\noptions:\nmin: 1000\nmax: 2000\n</code></pre>"},{"location":"setup/generator/count/#per-column-count","title":"Per Column Count","text":"<p>Defining a per column count allows you to generate records \"per set of columns\". This means that for a given set of columns, it will generate a particular number of records per combination of values for those columns.</p> <p>One example is generating transactions relating to a customer, where a customer may be defined by the columns <code>account_id, name</code>. A number of transactions would be generated per <code>account_id, name</code>.  
</p> <p>You can also use a combination of the above two methods to generate the number of records per column.</p>"},{"location":"setup/generator/count/#records","title":"Records","text":"<p>When defining a base number of records within the <code>perColumn</code> configuration, it translates to creating <code>(count.records * count.recordsPerColumn)</code> records. This is a fixed number of records that will be generated each time, with no variation between runs.</p> <p>In the example below, we have <code>count.records = 1000</code> and <code>count.recordsPerColumn = 2</code>. Which means that <code>1000 * 2 = 2000</code> records will be generated in total.</p> JavaScalaYAML <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(\ncount()\n.records(1000)\n.recordsPerColumn(2, \"account_id\", \"name\")\n);\n</code></pre> <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(\ncount\n.records(1000)\n.recordsPerColumn(2, \"account_id\", \"name\")\n)\n</code></pre> <pre><code>name: \"csv_file\"\nsteps:\n- name: \"transactions\"\ntype: \"csv\"\noptions:\npath: \"app/src/test/resources/sample/csv/transactions\"\ncount:\nrecords: 1000\nperColumn:\nrecords: 2\ncolumnNames:\n- \"account_id\"\n- \"name\"\n</code></pre>"},{"location":"setup/generator/count/#generated","title":"Generated","text":"<p>You can also define a generator for the count per column. 
This can be used in scenarios where you want a variable number of records per set of columns.</p> <p>In the example below, it will generate between <code>(count.records * count.perColumnGenerator.generator.min) = (1000 * 1) = 1000</code> and <code>(count.records * count.perColumnGenerator.generator.max) = (1000 * 2) = 2000</code> records.</p> JavaScalaYAML <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(\ncount()\n.records(1000)\n.recordsPerColumnGenerator(generator().min(1).max(2), \"account_id\", \"name\")\n);\n</code></pre> <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(\ncount\n.records(1000)\n.recordsPerColumnGenerator(generator.min(1).max(2), \"account_id\", \"name\")\n)\n</code></pre> <pre><code>name: \"csv_file\"\nsteps:\n- name: \"transactions\"\ntype: \"csv\"\noptions:\npath: \"app/src/test/resources/sample/csv/transactions\"\ncount:\nrecords: 1000\nperColumn:\ncolumnNames:\n- \"account_id\"\n- \"name\"\ngenerator:\ntype: \"random\"\noptions:\nmin: 1\nmax: 2\n</code></pre>"},{"location":"setup/generator/generator/","title":"Data Generators","text":""},{"location":"setup/generator/generator/#data-types","title":"Data Types","text":"<p>Below is a list of all supported data types for generating data:</p> Data Type Spark Data Type Options Description string StringType <code>minLen, maxLen, expression, enableNull</code> integer IntegerType <code>min, max, stddev, mean</code> long LongType <code>min, max, stddev, mean</code> short ShortType <code>min, max, stddev, mean</code> decimal(precision, scale) DecimalType(precision, scale) <code>min, max, stddev, mean</code> double DoubleType <code>min, max, stddev, mean</code> float FloatType <code>min, max, stddev, mean</code> date DateType <code>min, max, enableNull</code> timestamp TimestampType <code>min, max, enableNull</code> boolean BooleanType binary BinaryType <code>minLen, maxLen, enableNull</code> byte ByteType array 
ArrayType <code>arrayMinLen, arrayMaxLen, arrayType</code> _ StructType Implicitly supported when a schema is defined for a field"},{"location":"setup/generator/generator/#options","title":"Options","text":""},{"location":"setup/generator/generator/#all-data-types","title":"All data types","text":"<p>Some options are available to use for all types of data generators. Below is the list along with examples and descriptions:</p> Option Default Example Description <code>enableEdgeCase</code> false <code>enableEdgeCase: \"true\"</code> Enable/disable generated data to contain edge cases based on the data type. For example, integer data type has edge cases of (Int.MaxValue, Int.MinValue and 0) <code>edgeCaseProbability</code> 0.0 <code>edgeCaseProb: \"0.1\"</code> Probability of generating a random edge case value if <code>enableEdgeCase</code> is true <code>isUnique</code> false <code>isUnique: \"true\"</code> Enable/disable generated data to be unique for that column. Errors will be thrown when it is unable to generate unique data <code>seed</code> <code>seed: \"1\"</code> Defines the random seed for generating data for that particular column. It will override any seed defined at a global level <code>sql</code> <code>sql: \"CASE WHEN amount &lt; 10 THEN true ELSE false END\"</code> Define any SQL statement for generating that column's value. Computation occurs after all non-SQL fields are generated. This means any columns used in the SQL cannot be based on other SQL generated columns. 
Data type of generated value from SQL needs to match data type defined for the field"},{"location":"setup/generator/generator/#string","title":"String","text":"Option Default Example Description <code>minLen</code> 1 <code>minLen: \"2\"</code> Ensures that all generated strings have at least length <code>minLen</code> <code>maxLen</code> 10 <code>maxLen: \"15\"</code> Ensures that all generated strings have at most length <code>maxLen</code> <code>expression</code> <code>expression: \"#{Name.name}\"</code><code>expression:\"#{Address.city}/#{Demographic.maritalStatus}\"</code> Will generate a string based on the faker expression provided. All possible faker expressions can be found here Expression has to be in format <code>#{&lt;faker expression name&gt;}</code> <code>enableNull</code> false <code>enableNull: \"true\"</code> Enable/disable null values being generated <code>nullProbability</code> 0.0 <code>nullProb: \"0.1\"</code> Probability to generate null values if <code>enableNull</code> is true <p>Edge cases: (\"\", \"\\n\", \"\\r\", \"\\t\", \" \", \"\\u0000\", \"\\ufff\", \"\u0130yi g\u00fcnler\", \"\u0421\u043f\u0430\u0441\u0438\u0431\u043e\", \"\u039a\u03b1\u03bb\u03b7\u03bc\u03ad\u03c1\u03b1\", \"\u0635\u0628\u0627\u062d \u0627\u0644\u062e\u064a\u0631\", \" F\u00f6rl\u00e5t\", \"\u4f60\u597d\u5417\", \"Nh\u00e0 v\u1ec7 sinh \u1edf \u0111\u00e2u\", \"\u3053\u3093\u306b\u3061\u306f\", \"\u0928\u092e\u0938\u094d\u0924\u0947\", \"\u0532\u0561\u0580\u0565\u0582\", \"\u0417\u0434\u0440\u0430\u0432\u0435\u0439\u0442\u0435\")</p>"},{"location":"setup/generator/generator/#sample","title":"Sample","text":"JavaScalaYAML <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield()\n.name(\"name\")\n.type(StringType.instance())\n.expression(\"#{Name.name}\")\n.enableNull(true)\n.nullProbability(0.1)\n.minLength(4)\n.maxLength(20)\n);\n</code></pre> <pre><code>csv(\"transactions\", 
\"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield\n.name(\"name\")\n.`type`(StringType)\n.expression(\"#{Name.name}\")\n.enableNull(true)\n.nullProbability(0.1)\n.minLength(4)\n.maxLength(20)\n)\n</code></pre> <pre><code>name: \"csv_file\"\nsteps:\n- name: \"transactions\"\ntype: \"csv\"\noptions:\npath: \"app/src/test/resources/sample/csv/transactions\"\nschema:\nfields:\n- name: \"name\"\ntype: \"string\"\ngenerator:\noptions:\nexpression: \"#{Name.name}\"\nenableNull: true\nnullProb: 0.1\nminLength: 4\nmaxLength: 20\n</code></pre>"},{"location":"setup/generator/generator/#numeric","title":"Numeric","text":"<p>For all the numeric data types, there are 4 options to choose from: min, max and maxValue. Generally speaking, you only need to define one of min or minValue, similarly with max or maxValue. The reason why there are 2 options for each is because of when metadata is automatically gathered, we gather the statistics of the observed min and max values. Also, it will attempt to gather any restriction on the min or max value as defined by the data source (i.e. 
max value as per database type).</p>"},{"location":"setup/generator/generator/#integerlongshort","title":"Integer/Long/Short","text":"Option Default Example Description <code>min</code> 0 <code>min: \"2\"</code> Ensures that all generated values are greater than or equal to <code>min</code> <code>max</code> 1000 <code>max: \"25\"</code> Ensures that all generated values are less than or equal to <code>max</code> <code>stddev</code> 1.0 <code>stddev: \"2.0\"</code> Standard deviation for normal distributed data <code>mean</code> <code>max - min</code> <code>mean: \"5.0\"</code> Mean for normal distributed data <p>Edge cases Integer: (2147483647, -2147483648, 0) Edge cases Long: (9223372036854775807, -9223372036854775808, 0) Edge cases Short: (32767, -32768, 0)</p>"},{"location":"setup/generator/generator/#sample_1","title":"Sample","text":"JavaScalaYAML <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"year\").type(IntegerType.instance()).min(2020).max(2023),\nfield().name(\"customer_id\").type(LongType.instance()),\nfield().name(\"customer_group\").type(ShortType.instance())\n);\n</code></pre> <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"year\").`type`(IntegerType).min(2020).max(2023),\nfield.name(\"customer_id\").`type`(LongType),\nfield.name(\"customer_group\").`type`(ShortType)\n)\n</code></pre> <pre><code>name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"year\"\ntype: \"integer\"\ngenerator:\noptions:\nmin: 2020\nmax: 2023\n- name: \"customer_id\"\ntype: \"long\"\n- name: \"customer_group\"\ntype: \"short\"\n</code></pre>"},{"location":"setup/generator/generator/#decimal","title":"Decimal","text":"Option Default Example Description <code>min</code> 0 <code>min: \"2\"</code> Ensures that all generated values are greater than or equal to <code>min</code> <code>max</code> 1000 <code>max: \"25\"</code> 
Ensures that all generated values are less than or equal to <code>max</code> <code>stddev</code> 1.0 <code>stddev: \"2.0\"</code> Standard deviation for normal distributed data <code>mean</code> <code>max - min</code> <code>mean: \"5.0\"</code> Mean for normal distributed data <code>numericPrecision</code> 10 <code>precision: \"25\"</code> The maximum number of digits <code>numericScale</code> 0 <code>scale: \"25\"</code> The number of digits on the right side of the decimal point (has to be less than or equal to precision) <p>Edge cases Decimal: (9223372036854775807, -9223372036854775808, 0)</p>"},{"location":"setup/generator/generator/#sample_2","title":"Sample","text":"JavaScalaYAML <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"balance\").type(DecimalType.instance()).numericPrecision(10).numericScale(5)\n);\n</code></pre> <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"balance\").`type`(DecimalType).numericPrecision(10).numericScale(5)\n)\n</code></pre> <pre><code>name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"balance\"\ntype: \"decimal\"\ngenerator:\noptions:\nprecision: 10\nscale: 5\n</code></pre>"},{"location":"setup/generator/generator/#doublefloat","title":"Double/Float","text":"Option Default Example Description <code>min</code> 0.0 <code>min: \"2.1\"</code> Ensures that all generated values are greater than or equal to <code>min</code> <code>max</code> 1000.0 <code>max: \"25.9\"</code> Ensures that all generated values are less than or equal to <code>max</code> <code>stddev</code> 1.0 <code>stddev: \"2.0\"</code> Standard deviation for normal distributed data <code>mean</code> <code>max - min</code> <code>mean: \"5.0\"</code> Mean for normal distributed data <p>Edge cases Double: (+infinity, 1.7976931348623157e+308, 4.9e-324, 0.0, -0.0, -1.7976931348623157e+308, -infinity, NaN) Edge cases 
Float: (+infinity, 3.4028235e+38, 1.4e-45, 0.0, -0.0, -3.4028235e+38, -infinity, NaN)</p>"},{"location":"setup/generator/generator/#sample_3","title":"Sample","text":"JavaScalaYAML <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"amount\").type(DoubleType.instance())\n);\n</code></pre> <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"amount\").`type`(DoubleType)\n)\n</code></pre> <pre><code>name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"amount\"\ntype: \"double\"\n</code></pre>"},{"location":"setup/generator/generator/#date","title":"Date","text":"Option Default Example Description <code>min</code> now() - 365 days <code>min: \"2023-01-31\"</code> Ensures that all generated values are greater than or equal to <code>min</code> <code>max</code> now() <code>max: \"2023-12-31\"</code> Ensures that all generated values are less than or equal to <code>max</code> <code>enableNull</code> false <code>enableNull: \"true\"</code> Enable/disable null values being generated <code>nullProbability</code> 0.0 <code>nullProb: \"0.1\"</code> Probability to generate null values if <code>enableNull</code> is true <p>Edge cases: (0001-01-01, 1582-10-15, 1970-01-01, 9999-12-31) (reference)</p>"},{"location":"setup/generator/generator/#sample_4","title":"Sample","text":"JavaScalaYAML <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"created_date\").type(DateType.instance()).min(java.sql.Date.valueOf(\"2020-01-01\"))\n);\n</code></pre> <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"created_date\").`type`(DateType).min(java.sql.Date.valueOf(\"2020-01-01\"))\n)\n</code></pre> <pre><code>name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"created_date\"\ntype: 
\"date\"\ngenerator:\noptions:\nmin: \"2020-01-01\"\n</code></pre>"},{"location":"setup/generator/generator/#timestamp","title":"Timestamp","text":"Option Default Example Description <code>min</code> now() - 365 days <code>min: \"2023-01-31 23:10:10\"</code> Ensures that all generated values are greater than or equal to <code>min</code> <code>max</code> now() <code>max: \"2023-12-31 23:10:10\"</code> Ensures that all generated values are less than or equal to <code>max</code> <code>enableNull</code> false <code>enableNull: \"true\"</code> Enable/disable null values being generated <code>nullProbability</code> 0.0 <code>nullProb: \"0.1\"</code> Probability to generate null values if <code>enableNull</code> is true <p>Edge cases: (0001-01-01 00:00:00, 1582-10-15 23:59:59, 1970-01-01 00:00:00, 9999-12-31 23:59:59)</p>"},{"location":"setup/generator/generator/#sample_5","title":"Sample","text":"JavaScalaYAML <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"created_time\").type(TimestampType.instance()).min(java.sql.Timestamp.valueOf(\"2020-01-01 00:00:00\"))\n);\n</code></pre> <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"created_time\").`type`(TimestampType).min(java.sql.Timestamp.valueOf(\"2020-01-01 00:00:00\"))\n)\n</code></pre> <pre><code>name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"created_time\"\ntype: \"timestamp\"\ngenerator:\noptions:\nmin: \"2020-01-01 00:00:00\"\n</code></pre>"},{"location":"setup/generator/generator/#binary","title":"Binary","text":"Option Default Example Description <code>minLen</code> 1 <code>minLen: \"2\"</code> Ensures that all generated array of bytes have at least length <code>minLen</code> <code>maxLen</code> 20 <code>maxLen: \"15\"</code> Ensures that all generated array of bytes have at most length <code>maxLen</code> <code>enableNull</code> false <code>enableNull: 
\"true\"</code> Enable/disable null values being generated <code>nullProbability</code> 0.0 <code>nullProb: \"0.1\"</code> Probability to generate null values if <code>enableNull</code> is true <p>Edge cases: (\"\", \"\\n\", \"\\r\", \"\\t\", \" \", \"\\u0000\", \"\\ufff\", -128, 127)</p>"},{"location":"setup/generator/generator/#sample_6","title":"Sample","text":"JavaScalaYAML <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"payload\").type(BinaryType.instance())\n);\n</code></pre> <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"payload\").`type`(BinaryType)\n)\n</code></pre> <pre><code>name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"payload\"\ntype: \"binary\"\n</code></pre>"},{"location":"setup/generator/generator/#array","title":"Array","text":"Option Default Example Description <code>arrayMinLen</code> 0 <code>arrayMinLen: \"2\"</code> Ensures that all generated arrays have at least length <code>arrayMinLen</code> <code>arrayMaxLen</code> 5 <code>arrayMaxLen: \"15\"</code> Ensures that all generated arrays have at most length <code>arrayMaxLen</code> <code>arrayType</code> <code>arrayType: \"double\"</code> Inner data type of the array. Optional when using Java/Scala API. 
Allows for nested data types to be defined like struct <code>enableNull</code> false <code>enableNull: \"true\"</code> Enable/disable null values being generated <code>nullProbability</code> 0.0 <code>nullProb: \"0.1\"</code> Probability to generate null values if <code>enableNull</code> is true"},{"location":"setup/generator/generator/#sample_7","title":"Sample","text":"JavaScalaYAML <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"last_5_amounts\").type(ArrayType.instance()).arrayType(\"double\")\n);\n</code></pre> <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"last_5_amounts\").`type`(ArrayType).arrayType(\"double\")\n)\n</code></pre> <pre><code>name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"last_5_amounts\"\ntype: \"array&lt;double&gt;\"\n</code></pre>"},{"location":"setup/generator/report/","title":"Report","text":"<p>Data Caterer can be configured to produce a report of the data generated to help users understand what was run, how much data was generated, where it was generated, validation results and any associated metadata.</p>"},{"location":"setup/generator/report/#sample","title":"Sample","text":"<p>Once run, it will produce a report like this.</p>"},{"location":"setup/guide/","title":"Guides","text":"<p>Below is a list of guides you can follow to create your data generation for your use case.</p> <p>For any of the paid tier guides, you can use the trial version of the app to try it out. 
Details on how to get the trial can be found here.</p>"},{"location":"setup/guide/#scenarios","title":"Scenarios","text":"<ul> <li>First Data Generation - If you are new, this is the place to start</li> <li>Multiple Records Per Column Value - How you can generate multiple records per set of columns</li> <li>Foreign Keys Across Data Sources - Generate matching values across generated data sets</li> <li>Data Validations - (Soon to document) Run data validations after generating data</li> <li>Auto Generate From Data Connection - Automatically generate data from just defining data sources</li> <li>Delete Generated Data - Delete the generated data whilst leaving other data</li> <li>Generate Batch and Event Data - Generate matching batch and event data</li> </ul>"},{"location":"setup/guide/#data-sources","title":"Data Sources","text":"<ul> <li>Files (CSV, JSON, ORC, Parquet) - Generate data for popular file formats</li> <li>Postgres - JDBC Postgres tables</li> <li>Cassandra - Cassandra tables</li> <li>Kafka - Kafka topics</li> <li>Solace - Solace messages</li> <li>Marquez - Generate data based on metadata in Marquez</li> <li>OpenMetadata - Generate data based on metadata in OpenMetadata</li> <li>HTTP - HTTP requests</li> <li>Files (Fixed width) - (Soon to document) A variant of CSV but with no separator</li> <li>MySql - (Soon to document) JDBC MySql tables</li> </ul>"},{"location":"setup/guide/#yaml-files","title":"YAML Files","text":""},{"location":"setup/guide/#base-concept","title":"Base Concept","text":"<p>The execution of the data generator is based on the concept of plans and tasks. A plan represents the set of tasks that need to be executed, along with other information that spans across tasks, such as foreign keys between data sources. A task represents the component(s) of a data source and its associated metadata so that it understands what the data should look like and how many steps (sub data sources) there are (e.g. tables in a database, topics in Kafka). 
Tasks can define one or more steps.</p>"},{"location":"setup/guide/#plan","title":"Plan","text":""},{"location":"setup/guide/#foreign-keys","title":"Foreign Keys","text":"<p>Define foreign keys across data sources in your plan to ensure generated data can match Link to associated task 1 Link to associated task 2</p>"},{"location":"setup/guide/#task","title":"Task","text":"Data Source Type Data Source Sample Task Notes Database Postgres Sample Database MySQL Sample Database Cassandra Sample File CSV Sample File JSON Sample Contains nested schemas and use of SQL for generated values File Parquet Sample Partition by year column Kafka Kafka Sample Specific base schema to be used, define headers, key, value, etc. JMS Solace Sample JSON formatted message HTTP PUT Sample JSON formatted PUT body"},{"location":"setup/guide/#configuration","title":"Configuration","text":"<p>Basic configuration</p>"},{"location":"setup/guide/#docker-compose","title":"Docker-compose","text":"<p>To see how it runs against different data sources, you can run using <code>docker-compose</code> and set <code>DATA_SOURCE</code> like below</p> <pre><code>./gradlew build\ncd docker\nDATA_SOURCE=postgres docker-compose up -d datacaterer\n</code></pre> <p>Can set it to one of the following:</p> <ul> <li>postgres</li> <li>mysql</li> <li>cassandra</li> <li>solace</li> <li>kafka</li> <li>http</li> </ul>"},{"location":"setup/guide/data-source/cassandra/","title":"Cassandra","text":"<p>Info</p> <p>Writing data to Cassandra is a paid feature. Try the free trial here.</p> <p>Creating a data generator for Cassandra. 
You will build a Docker image that will be able to populate data in Cassandra for the tables you configure.</p>"},{"location":"setup/guide/data-source/cassandra/#requirements","title":"Requirements","text":"<ul> <li>20 minutes</li> <li>Git</li> <li>Gradle</li> <li>Docker</li> <li>Cassandra</li> </ul>"},{"location":"setup/guide/data-source/cassandra/#get-started","title":"Get Started","text":"<p>First, we will clone the data-caterer-example repo which will already have the base project setup required.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\n</code></pre> <p>If you already have a Cassandra instance running, you can skip to this step.</p>"},{"location":"setup/guide/data-source/cassandra/#cassandra-setup","title":"Cassandra Setup","text":"<p>Next, let's make sure you have an instance of Cassandra up and running in your local environment. This will make it easy for us to iterate and check our changes.</p> <pre><code>cd docker\ndocker-compose up -d cassandra\n</code></pre>"},{"location":"setup/guide/data-source/cassandra/#permissions","title":"Permissions","text":"<p>Let's make a new user that has the required permissions needed to push data into the Cassandra tables we want.</p> CQL Permission Statements <pre><code>GRANT INSERT ON &lt;schema&gt;.&lt;table&gt; TO data_caterer_user;\n</code></pre> <p>Following permissions are required when enabling <code>configuration.enableGeneratePlanAndTasks(true)</code> as it will gather metadata information about tables and columns from the below tables.</p> CQL Permission Statements <pre><code>GRANT SELECT ON system_schema.tables TO data_caterer_user;\nGRANT SELECT ON system_schema.columns TO data_caterer_user;\n</code></pre>"},{"location":"setup/guide/data-source/cassandra/#plan-setup","title":"Plan Setup","text":"<p>Create a new Java or Scala class.</p> <ul> <li>Java: <code>src/main/java/com/github/pflooky/plan/MyAdvancedCassandraJavaPlan.java</code></li> <li>Scala: 
<code>src/main/scala/com/github/pflooky/plan/MyAdvancedCassandraPlan.scala</code></li> </ul> <p>Make sure your class extends <code>PlanRun</code>.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.java.api.PlanRun;\n\npublic class MyAdvancedCassandraJavaPlan extends PlanRun {\n}\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.PlanRun\n\nclass MyAdvancedCassandraPlan extends PlanRun {\n}\n</code></pre> <p>This class defines where we need to define all of our configurations for generating data. There are helper variables and methods defined to make it simple and easy to use.</p>"},{"location":"setup/guide/data-source/cassandra/#connection-configuration","title":"Connection Configuration","text":"<p>Within our class, we can start by defining the connection properties to connect to Cassandra.</p> JavaScala <pre><code>var accountTask = cassandra(\n\"customer_cassandra\",   //name\n\"localhost:9042\",       //url\n\"cassandra\",            //username\n\"cassandra\",            //password\nMap.of()                //optional additional connection options\n)\n</code></pre> <p>Additional options such as SSL configuration, etc can be found here.</p> <pre><code>val accountTask = cassandra(\n\"customer_cassandra\",   //name\n\"localhost:9042\",       //url\n\"cassandra\",            //username\n\"cassandra\",            //password\nMap()                   //optional additional connection options\n)\n</code></pre> <p>Additional options such as SSL configuration, etc can be found here.</p>"},{"location":"setup/guide/data-source/cassandra/#schema","title":"Schema","text":"<p>Let's create a task for inserting data into the <code>account.accounts</code> and <code>account.account_status_history</code> tables as defined under<code>docker/data/cql/customer.cql</code>. This table should already be setup for you if you followed this step. 
We can check if the table is setup already via the following command:</p> <pre><code>docker exec host.docker.internal cqlsh -e 'describe account.accounts; describe account.account_status_history;'\n</code></pre> <p>Here we should see some output that looks like the below. This tells us what schema we need to follow when generating data. We need to define that alongside any metadata that is useful to add constraints on what are possible values the generated data should contain.</p> <pre><code>CREATE TABLE account.accounts (\naccount_id text PRIMARY KEY,\n    amount double,\n    created_by text,\n    name text,\n    open_time timestamp,\n    status text\n)...\n\nCREATE TABLE account.account_status_history (\naccount_id text,\n    eod_date date,\n    status text,\n    updated_by text,\n    updated_time timestamp,\n    PRIMARY KEY (account_id, eod_date)\n)...\n</code></pre> <p>Trimming the connection details to work with the docker-compose Cassandra, we have a base Cassandra connection to define the table and schema required. Let's define each field along with their corresponding data type. You will notice that the <code>text</code> fields do not have a data type defined. 
This is because the default data type is <code>StringType</code> which corresponds to <code>text</code> in Cassandra.</p> JavaScala <pre><code>{\nvar accountTask = cassandra(\"customer_cassandra\", \"host.docker.internal:9042\")\n.table(\"account\", \"accounts\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"amount\").type(DoubleType.instance()),\nfield().name(\"created_by\"),\nfield().name(\"name\"),\nfield().name(\"open_time\").type(TimestampType.instance()),\nfield().name(\"status\")\n);\n}\n</code></pre> <pre><code>val accountTask = cassandra(\"customer_cassandra\", \"host.docker.internal:9042\")\n.table(\"account\", \"accounts\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"amount\").`type`(DoubleType),\nfield.name(\"created_by\"),\nfield.name(\"name\"),\nfield.name(\"open_time\").`type`(TimestampType),\nfield.name(\"status\")\n)\n</code></pre>"},{"location":"setup/guide/data-source/cassandra/#field-metadata","title":"Field Metadata","text":"<p>We could stop here and generate random data for the accounts table. But wouldn't it be more useful if we produced data that is closer to the structure of the data that would come in production? We can do this by defining various metadata that act as guidelines for the data generator when generating data.</p>"},{"location":"setup/guide/data-source/cassandra/#account_id","title":"account_id","text":"<p><code>account_id</code> follows a particular pattern where it starts with <code>ACC</code> and has 8 digits after it. This can be defined via a regex like below. 
Alongside this, we also mark it as the primary key to ensure that unique values are generated.</p> JavaScala <pre><code>field().name(\"account_id\").regex(\"ACC[0-9]{8}\").primaryKey(true),\n</code></pre> <pre><code>field.name(\"account_id\").regex(\"ACC[0-9]{8}\").primaryKey(true),\n</code></pre>"},{"location":"setup/guide/data-source/cassandra/#amount","title":"amount","text":"<p><code>amount</code> shouldn't be too large, so we can define a min and max so that the generated numbers fall between <code>1</code> and <code>1000</code>.</p> JavaScala <pre><code>field().name(\"amount\").type(DoubleType.instance()).min(1).max(1000),\n</code></pre> <pre><code>field.name(\"amount\").`type`(DoubleType).min(1).max(1000),\n</code></pre>"},{"location":"setup/guide/data-source/cassandra/#name","title":"name","text":"<p><code>name</code> is a string that also follows a certain pattern, so we could also define a regex, but here we will leverage the DataFaker library and create an <code>expression</code> to generate real-looking names. All possible faker expressions can be found here.</p> JavaScala <pre><code>field().name(\"name\").expression(\"#{Name.name}\"),\n</code></pre> <pre><code>field.name(\"name\").expression(\"#{Name.name}\"),\n</code></pre>"},{"location":"setup/guide/data-source/cassandra/#open_time","title":"open_time","text":"<p><code>open_time</code> is a timestamp that we want to have a value greater than a specific date. 
We can define a min date by using <code>java.sql.Date</code> like below.</p> JavaScala <pre><code>field().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\n</code></pre> <pre><code>field.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\n</code></pre>"},{"location":"setup/guide/data-source/cassandra/#status","title":"status","text":"<p><code>status</code> is a field that can only take one of four values: <code>open, closed, suspended or pending</code>.</p> JavaScala <pre><code>field().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n</code></pre> <pre><code>field.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n</code></pre>"},{"location":"setup/guide/data-source/cassandra/#created_by","title":"created_by","text":"<p><code>created_by</code> is a field that is based on the <code>status</code> field, following the logic: <code>if status is open or closed, then it is created_by eod else created_by event</code>. 
This can be achieved by defining a SQL expression like below.</p> JavaScala <pre><code>field().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\n</code></pre> <pre><code>field.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\n</code></pre> <p>Putting all the fields together, our class should now look like this.</p> JavaScala <pre><code>var accountTask = cassandra(\"customer_cassandra\", \"host.docker.internal:9042\")\n.table(\"account\", \"accounts\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\").primaryKey(true),\nfield().name(\"amount\").type(DoubleType.instance()).min(1).max(1000),\nfield().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n);\n</code></pre> <pre><code>val accountTask = cassandra(\"customer_cassandra\", \"host.docker.internal:9042\")\n.table(\"account\", \"accounts\")\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\").primaryKey(true),\nfield.name(\"amount\").`type`(DoubleType).min(1).max(1000),\nfield.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n)\n</code></pre>"},{"location":"setup/guide/data-source/cassandra/#additional-configurations","title":"Additional Configurations","text":"<p>At the end of data generation, a report gets generated that summarises the actions it performed. We can control the output folder of that report via configurations. 
We will also enable the unique check to ensure any unique fields will have unique values generated.</p> JavaScala <pre><code>var config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true);\n</code></pre> <pre><code>val config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true)\n</code></pre>"},{"location":"setup/guide/data-source/cassandra/#execute","title":"Execute","text":"<p>To tell Data Caterer that we want to run with the configurations along with the <code>accountTask</code>, we have to call <code>execute</code>. Our full plan run will look like this.</p> JavaScala <pre><code>public class MyAdvancedCassandraJavaPlan extends PlanRun {\n{\nvar accountTask = cassandra(\"customer_cassandra\", \"host.docker.internal:9042\")\n.table(\"account\", \"accounts\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\").primaryKey(true),\nfield().name(\"amount\").type(DoubleType.instance()).min(1).max(1000),\nfield().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n);\n\nvar config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true);\n\nexecute(config, accountTask);\n}\n}\n</code></pre> <pre><code>class MyAdvancedCassandraPlan extends PlanRun {\nval accountTask = cassandra(\"customer_cassandra\", \"host.docker.internal:9042\")\n.table(\"account\", \"accounts\")\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\").primaryKey(true),\nfield.name(\"amount\").`type`(DoubleType).min(1).max(1000),\nfield.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' 
END\"),\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n)\n\nval config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true)\n\nexecute(config, accountTask)\n}\n</code></pre>"},{"location":"setup/guide/data-source/cassandra/#run","title":"Run","text":"<p>Now we can run via the script <code>./run.sh</code> that is in the top level directory of the <code>data-caterer-example</code> to run the class we just created.</p> <pre><code>./run.sh\n#input class MyAdvancedCassandraJavaPlan or MyAdvancedCassandraPlan\n#after completing\ndocker exec docker-cassandraserver-1 cqlsh -e 'select count(1) from account.accounts;select * from account.accounts limit 10;'\n</code></pre> <p>Your output should look like this.</p> <pre><code> count\n-------\n  1000\n\n(1 rows)\n\nWarnings :\nAggregation query used without partition key\n\n\n account_id  | amount    | created_by         | name                   | open_time                       | status\n-------------+-----------+--------------------+------------------------+---------------------------------+-----------\n ACC13554145 | 917.00418 | zb CVvbBTTzitjo5fK |          Jan Sanford I | 2023-06-21 21:50:10.463000+0000 | suspended\n ACC19154140 |  46.99177 |             VH88H9 |       Clyde Bailey PhD | 2023-07-18 11:33:03.675000+0000 |      open\n ACC50587836 |  774.9872 |         GENANwPm t |           Sang Monahan | 2023-03-21 00:16:53.308000+0000 |    closed\n ACC67619387 | 452.86706 |       5msTpcBLStTH |         Jewell Gerlach | 2022-10-18 19:13:07.606000+0000 | suspended\n ACC69889784 |  14.69298 |           WDmOh7NT |          Dale Schulist | 2022-10-25 12:10:52.239000+0000 | suspended\n ACC41977254 |  51.26492 |          J8jAKzvj2 |           Norma Nienow | 2023-08-19 18:54:39.195000+0000 | suspended\n 
ACC40932912 | 349.68067 |   SLcJgKZdLp5ALMyg | Vincenzo Considine III | 2023-05-16 00:22:45.991000+0000 |    closed\n ACC20642011 | 658.40713 |          clyZRD4fI |  Lannie McLaughlin DDS | 2023-05-11 23:14:30.249000+0000 |      open\n ACC74962085 | 970.98218 |       ZLETTSnj4NpD |          Ima Jerde DVM | 2023-05-07 10:01:56.218000+0000 |   pending\n ACC72848439 | 481.64267 |                 cc |        Kyla Deckow DDS | 2023-08-16 13:28:23.362000+0000 | suspended\n\n(10 rows)\n</code></pre> <p>Also check the HTML report, found at <code>docker/sample/report/index.html</code>, that gets generated to get an overview of what was executed.</p> <p></p>"},{"location":"setup/guide/data-source/http/","title":"HTTP Source","text":"<p>Info</p> <p>Generating data based on OpenAPI/Swagger document and pushing to HTTP endpoint is a paid feature. Try the free trial here.</p> <p>Creating a data generator based on an OpenAPI/Swagger document.</p> <p></p>"},{"location":"setup/guide/data-source/http/#requirements","title":"Requirements","text":"<ul> <li>10 minutes</li> <li>Git</li> <li>Gradle</li> <li>Docker</li> </ul>"},{"location":"setup/guide/data-source/http/#get-started","title":"Get Started","text":"<p>First, we will clone the data-caterer-example repo which will already have the base project setup required.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\n</code></pre>"},{"location":"setup/guide/data-source/http/#http-setup","title":"HTTP Setup","text":"<p>We will be using the http-bin docker image to help simulate a service with HTTP endpoints.</p> <p>Start it via:</p> <pre><code>cd docker\ndocker-compose up -d http\ndocker ps\n</code></pre>"},{"location":"setup/guide/data-source/http/#plan-setup","title":"Plan Setup","text":"<p>Create a new Java or Scala class.</p> <ul> <li>Java: <code>src/main/java/com/github/pflooky/plan/MyAdvancedHttpJavaPlanRun.java</code></li> <li>Scala: 
<code>src/main/scala/com/github/pflooky/plan/MyAdvancedHttpPlanRun.scala</code></li> </ul> <p>Make sure your class extends <code>PlanRun</code>.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.java.api.PlanRun;\n...\n\npublic class MyAdvancedHttpJavaPlanRun extends PlanRun {\n{\nvar conf = configuration().enableGeneratePlanAndTasks(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n}\n}\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.PlanRun\n...\n\nclass MyAdvancedHttpPlanRun extends PlanRun {\nval conf = configuration.enableGeneratePlanAndTasks(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n}\n</code></pre> <p>We enable plan and task generation so that we can read metadata from external sources, and we save the reports under a folder we can easily access.</p>"},{"location":"setup/guide/data-source/http/#schema","title":"Schema","text":"<p>We can point the schema of a data source to an OpenAPI/Swagger document or URL. For this example, we will use the OpenAPI document found under <code>docker/mount/http/petstore.json</code> in the data-caterer-example repo. This is a simplified version of the original OpenAPI spec that can be found here.</p> <p>We have kept the following endpoints to test out:</p> <ul> <li>GET /pets - get all pets</li> <li>POST /pets - create a new pet</li> <li>GET /pets/{id} - get a pet by id</li> <li>DELETE /pets/{id} - delete a pet by id</li> </ul> JavaScala <pre><code>var httpTask = http(\"my_http\")\n.schema(metadataSource().openApi(\"/opt/app/mount/http/petstore.json\"))\n.count(count().records(2));\n</code></pre> <pre><code>val httpTask = http(\"my_http\")\n.schema(metadataSource.openApi(\"/opt/app/mount/http/petstore.json\"))\n.count(count.records(2))\n</code></pre> <p>The above defines that the schema will come from an OpenAPI document found at the path defined. 
It will then generate 2 requests per request method and endpoint combination.</p>"},{"location":"setup/guide/data-source/http/#run","title":"Run","text":"<p>Let's try running it and see what happens.</p> <pre><code>cd ..\n./run.sh\n#input class MyAdvancedHttpJavaPlanRun or MyAdvancedHttpPlanRun\n#after completing\ndocker logs -f docker-http-1\n</code></pre> <p>It should look something like this.</p> <pre><code>172.21.0.1 [06/Nov/2023:01:06:53 +0000] GET /anything/pets?tags%3DeXQxFUHVja+EYm%26limit%3D33895 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:06:53 +0000] GET /anything/pets?tags%3DSXaFvAqwYGF%26tags%3DjdNRFONA%26limit%3D40975 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:06:56 +0000] POST /anything/pets HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:06:56 +0000] POST /anything/pets HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:07:00 +0000] GET /anything/pets/kbH8D7rDuq HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:07:00 +0000] GET /anything/pets/REsa0tnu7dvekGDvxR HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:07:03 +0000] DELETE /anything/pets/EqrOr1dHFfKUjWb HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:07:03 +0000] DELETE /anything/pets/7WG7JHPaNxP HTTP/1.1 200 Host: host.docker.internal}\n</code></pre> <p>Looks like we have some data now. But we can do better and add some enhancements to it.</p>"},{"location":"setup/guide/data-source/http/#foreign-keys","title":"Foreign keys","text":"<p>The four different requests that get sent could have the same <code>id</code> passed across to each of them if we define a foreign key relationship. This makes the flow closer to a real-life scenario, as pets get created and queried by a particular <code>id</code> value. We note that the <code>id</code> value is first used when a pet is created in the body of the POST request. 
Then it gets used as a path parameter in the DELETE and GET requests.</p> <p>To link them all together, we must follow a particular pattern when referring to request body, query parameter or path parameter columns.</p> <table> <thead> <tr> <th>HTTP Type</th> <th>Column Prefix</th> <th>Example</th> </tr> </thead> <tbody> <tr> <td>Request Body</td> <td><code>bodyContent</code></td> <td><code>bodyContent.id</code></td> </tr> <tr> <td>Path Parameter</td> <td><code>pathParam</code></td> <td><code>pathParamid</code></td> </tr> <tr> <td>Query Parameter</td> <td><code>queryParam</code></td> <td><code>queryParamid</code></td> </tr> <tr> <td>Header</td> <td><code>header</code></td> <td><code>headerContent_Type</code></td> </tr> </tbody> </table> <p>Also note that when creating a foreign field definition for an HTTP data source, to refer to a specific endpoint and method, we have to follow the pattern of <code>{http method}{http path}</code>. For example, <code>POST/pets</code>. Let's apply this knowledge to link all the <code>id</code> values together.</p> JavaScala <pre><code>var myPlan = plan().addForeignKeyRelationship(\nforeignField(\"my_http\", \"POST/pets\", \"bodyContent.id\"),     //source of foreign key value\nforeignField(\"my_http\", \"DELETE/pets/{id}\", \"pathParamid\"),\nforeignField(\"my_http\", \"GET/pets/{id}\", \"pathParamid\")\n);\n\nexecute(myPlan, conf, httpTask);\n</code></pre> <pre><code>val myPlan = plan.addForeignKeyRelationship(\nforeignField(\"my_http\", \"POST/pets\", \"bodyContent.id\"),     //source of foreign key value\nforeignField(\"my_http\", \"DELETE/pets/{id}\", \"pathParamid\"),\nforeignField(\"my_http\", \"GET/pets/{id}\", \"pathParamid\")\n)\n\nexecute(myPlan, conf, httpTask)\n</code></pre> <p>Let's test it out by running it again.</p> <pre><code>./run.sh\n#input class MyAdvancedHttpJavaPlanRun or MyAdvancedHttpPlanRun\ndocker logs -f docker-http-1\n</code></pre> <pre><code>172.21.0.1 [06/Nov/2023:01:33:59 +0000] GET /anything/pets?limit%3D45971 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:00 +0000] GET /anything/pets?limit%3D62015 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:04 +0000] POST /anything/pets HTTP/1.1 200 
Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:05 +0000] POST /anything/pets HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:09 +0000] DELETE /anything/pets/5e HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:09 +0000] DELETE /anything/pets/IHPm2 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:14 +0000] GET /anything/pets/IHPm2 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:14 +0000] GET /anything/pets/5e HTTP/1.1 200 Host: host.docker.internal}\n</code></pre> <p>Now we have the same <code>id</code> values being produced across the POST, DELETE and GET requests! What if we knew that the <code>id</code> values should follow a particular pattern?</p>"},{"location":"setup/guide/data-source/http/#custom-metadata","title":"Custom metadata","text":"<p>Given that we have defined a foreign key where the root of the foreign key values is the POST request, we can update the metadata of the <code>id</code> column for the POST request and it will propagate to the other endpoints as well. 
Given the <code>id</code> column is a nested column as noted in the foreign key, we can alter its metadata via the following:</p> JavaScala <pre><code>var httpTask = http(\"my_http\")\n.schema(metadataSource().openApi(\"/opt/app/mount/http/petstore.json\"))\n.schema(field().name(\"bodyContent\").schema(field().name(\"id\").regex(\"ID[0-9]{8}\")))\n.count(count().records(2));\n</code></pre> <pre><code>val httpTask = http(\"my_http\")\n.schema(metadataSource.openApi(\"/opt/app/mount/http/petstore.json\"))\n.schema(field.name(\"bodyContent\").schema(field.name(\"id\").regex(\"ID[0-9]{8}\")))\n.count(count.records(2))\n</code></pre> <p>We first get the column <code>bodyContent</code>, then get its nested schema, get the column <code>id</code>, and add metadata stating that <code>id</code> should follow the pattern <code>ID[0-9]{8}</code>.</p> <p>Let's try running it again, and hopefully we should see some proper ID values.</p> <pre><code>./run.sh\n#input class MyAdvancedHttpJavaPlanRun or MyAdvancedHttpPlanRun\ndocker logs -f docker-http-1\n</code></pre> <pre><code>172.21.0.1 [06/Nov/2023:01:45:45 +0000] GET /anything/pets?tags%3D10fWnNoDz%26limit%3D66804 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:46 +0000] GET /anything/pets?tags%3DhyO6mI8LZUUpS HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:50 +0000] POST /anything/pets HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:51 +0000] POST /anything/pets HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:52 +0000] DELETE /anything/pets/ID55185420 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:52 +0000] DELETE /anything/pets/ID20618951 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:57 +0000] GET /anything/pets/ID55185420 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:57 +0000] GET /anything/pets/ID20618951 HTTP/1.1 200 Host: host.docker.internal}\n</code></pre> 
<p>Great! Now we have replicated a production-like flow of HTTP requests.</p>"},{"location":"setup/guide/data-source/http/#ordering","title":"Ordering","text":"<p>If you wanted to change the ordering of the requests, you can alter the order from within the OpenAPI/Swagger document. This is particularly useful when you want to simulate the same flow that users would take when utilising your application (i.e. create account, query account, update account).</p>"},{"location":"setup/guide/data-source/http/#rows-per-second","title":"Rows per second","text":"<p>By default, Data Caterer will push requests per method and endpoint at a rate of around 5 requests per second. If you want to alter this value, you can do so via the below configuration. The lowest supported requests per second is 1.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.api.model.Constants;\n\n...\nvar httpTask = http(\"my_http\", Map.of(Constants.ROWS_PER_SECOND(), \"1\"))\n...\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.model.Constants.ROWS_PER_SECOND\n\n...\nval httpTask = http(\"my_http\", options = Map(ROWS_PER_SECOND -&gt; \"1\"))\n...\n</code></pre> <p>Check out the full example under <code>AdvancedHttpPlanRun</code> in the example repo.</p>"},{"location":"setup/guide/data-source/kafka/","title":"Kafka","text":"<p>Info</p> <p>Writing data to Kafka is a paid feature. Try the free trial here.</p> <p>Creating a data generator for Kafka. 
You will build a Docker image that will be able to populate data in Kafka for the topics you configure.</p>"},{"location":"setup/guide/data-source/kafka/#requirements","title":"Requirements","text":"<ul> <li>20 minutes</li> <li>Git</li> <li>Gradle</li> <li>Docker</li> <li>Kafka</li> </ul>"},{"location":"setup/guide/data-source/kafka/#get-started","title":"Get Started","text":"<p>First, we will clone the data-caterer-example repo which will already have the base project setup required.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\n</code></pre> <p>If you already have a Kafka instance running, you can skip to this step.</p>"},{"location":"setup/guide/data-source/kafka/#kafka-setup","title":"Kafka Setup","text":"<p>Next, let's make sure you have an instance of Kafka up and running in your local environment. This will make it easy for us to iterate and check our changes.</p> <pre><code>cd docker\ndocker-compose up -d kafka\n</code></pre>"},{"location":"setup/guide/data-source/kafka/#plan-setup","title":"Plan Setup","text":"<p>Create a new Java or Scala class.</p> <ul> <li>Java: <code>src/main/java/com/github/pflooky/plan/MyAdvancedKafkaJavaPlan.java</code></li> <li>Scala: <code>src/main/scala/com/github/pflooky/plan/MyAdvancedKafkaPlan.scala</code></li> </ul> <p>Make sure your class extends <code>PlanRun</code>.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.java.api.PlanRun;\n\npublic class MyAdvancedKafkaJavaPlan extends PlanRun {\n}\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.PlanRun\n\nclass MyAdvancedKafkaPlan extends PlanRun {\n}\n</code></pre> <p>This class is where we define all of our configurations for generating data. 
There are helper variables and methods defined to make it simple and easy to use.</p>"},{"location":"setup/guide/data-source/kafka/#connection-configuration","title":"Connection Configuration","text":"<p>Within our class, we can start by defining the connection properties to connect to Kafka.</p> JavaScala <pre><code>var accountTask = kafka(\n\"my_kafka\",       //name\n\"localhost:9092\", //url\nMap.of()          //optional additional connection options\n);\n</code></pre> <p>Additional options can be found here.</p> <pre><code>val accountTask = kafka(\n\"my_kafka\",       //name\n\"localhost:9092\", //url\nMap()             //optional additional connection options\n)\n</code></pre> <p>Additional options can be found here.</p>"},{"location":"setup/guide/data-source/kafka/#schema","title":"Schema","text":"<p>Let's create a task for inserting data into the <code>account-topic</code> that is already defined under <code>docker/data/kafka/setup_kafka.sh</code>. This topic should already be set up for you if you followed this step. We can check if the topic is set up already via the following command:</p> <pre><code>docker exec docker-kafkaserver-1 kafka-topics --bootstrap-server localhost:9092 --list\n</code></pre> <p>Trimming the connection details to work with the docker-compose Kafka, we have a base Kafka connection to define the topic we will publish to. Let's define each field along with its corresponding data type. You will notice that the <code>text</code> fields do not have a data type defined. 
This is because the default data type is <code>StringType</code>.</p> JavaScala <pre><code>{\nvar kafkaTask = kafka(\"my_kafka\", \"kafkaserver:29092\")\n.topic(\"account-topic\")\n.schema(\nfield().name(\"key\").sql(\"content.account_id\"),\nfield().name(\"value\").sql(\"TO_JSON(content)\"),\n//field().name(\"partition\").type(IntegerType.instance()),  can define partition here\nfield().name(\"headers\")\n.type(ArrayType.instance())\n.sql(\n\"ARRAY(\" +\n\"NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\" +\n\"NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\" +\n\")\"\n),\nfield().name(\"content\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield().name(\"year\").type(IntegerType.instance()),\nfield().name(\"amount\").type(DoubleType.instance()),\nfield().name(\"details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"first_txn_date\").type(DateType.instance()).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield().name(\"updated_by\")\n.schema(\nfield().name(\"user\"),\nfield().name(\"time\").type(TimestampType.instance())\n)\n),\nfield().name(\"transactions\").type(ArrayType.instance())\n.schema(\nfield().name(\"txn_date\").type(DateType.instance()).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield().name(\"amount\").type(DoubleType.instance())\n)\n),\nfield().name(\"tmp_year\").sql(\"content.year\").omit(true),\nfield().name(\"tmp_name\").sql(\"content.details.name\").omit(true)\n);\n}\n</code></pre> <pre><code>val kafkaTask = kafka(\"my_kafka\", \"kafkaserver:29092\")\n.topic(\"account-topic\")\n.schema(\nfield.name(\"key\").sql(\"content.account_id\"),\nfield.name(\"value\").sql(\"TO_JSON(content)\"),\n//field.name(\"partition\").type(IntegerType),  can define partition here\nfield.name(\"headers\")\n.`type`(ArrayType)\n.sql(\n\"\"\"ARRAY(\n          |  NAMED_STRUCT('key', 'account-id', 
'value', TO_BINARY(content.account_id, 'utf-8')),\n          |  NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\n          |)\"\"\".stripMargin\n),\nfield.name(\"content\")\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield.name(\"year\").`type`(IntegerType).min(2021).max(2023),\nfield.name(\"amount\").`type`(DoubleType),\nfield.name(\"details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"first_txn_date\").`type`(DateType).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield.name(\"updated_by\")\n.schema(\nfield.name(\"user\"),\nfield.name(\"time\").`type`(TimestampType),\n),\n),\nfield.name(\"transactions\").`type`(ArrayType)\n.schema(\nfield.name(\"txn_date\").`type`(DateType).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield.name(\"amount\").`type`(DoubleType),\n)\n),\nfield.name(\"tmp_year\").sql(\"content.year\").omit(true),\nfield.name(\"tmp_name\").sql(\"content.details.name\").omit(true)\n)\n</code></pre>"},{"location":"setup/guide/data-source/kafka/#fields","title":"Fields","text":"<p>The schema defined for Kafka has a format that needs to be followed as noted above. Specifically, the required fields are:</p> <ul> <li>value</li> </ul> <p>The other fields are optional:</p> <ul> <li>key</li> <li>partition</li> <li>headers</li> </ul>"},{"location":"setup/guide/data-source/kafka/#headers","title":"headers","text":"<p><code>headers</code> follows a particular pattern where it is of type <code>array&lt;struct&lt;key: string,value: binary&gt;&gt;</code>. To be able to generate data for this data type, we need to use an SQL expression like the one below. You will notice that the <code>value</code> part refers to <code>content.account_id</code>, where <code>content</code> is another field defined at the top level of the schema. 
This allows you to reference other values that have already been generated.</p> JavaScala <pre><code>field().name(\"headers\")\n.type(ArrayType.instance())\n.sql(\n\"ARRAY(\" +\n\"NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\" +\n\"NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\" +\n\")\"\n)\n</code></pre> <pre><code>field.name(\"headers\")\n.`type`(ArrayType)\n.sql(\n\"\"\"ARRAY(\n      |  NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\n      |  NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\n      |)\"\"\".stripMargin\n)\n</code></pre>"},{"location":"setup/guide/data-source/kafka/#transactions","title":"transactions","text":"<p><code>transactions</code> is an array that contains an inner structure of <code>txn_date</code> and <code>amount</code>. The size of the array generated can be controlled via <code>arrayMinLength</code> and <code>arrayMaxLength</code>.</p> JavaScala <pre><code>field().name(\"transactions\").type(ArrayType.instance())\n.schema(\nfield().name(\"txn_date\").type(DateType.instance()).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield().name(\"amount\").type(DoubleType.instance())\n)\n</code></pre> <pre><code>field.name(\"transactions\").`type`(ArrayType)\n.schema(\nfield.name(\"txn_date\").`type`(DateType).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield.name(\"amount\").`type`(DoubleType),\n)\n</code></pre>"},{"location":"setup/guide/data-source/kafka/#details","title":"details","text":"<p><code>details</code> is another example of a nested schema structure where it also has a nested structure itself in <code>updated_by</code>. 
One thing to note here is the <code>first_txn_date</code> field has a reference to the <code>content.transactions</code> array where it will  sort the array by <code>txn_date</code> and get the first element.</p> JavaScala <pre><code>field().name(\"details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"first_txn_date\").type(DateType.instance()).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield().name(\"updated_by\")\n.schema(\nfield().name(\"user\"),\nfield().name(\"time\").type(TimestampType.instance())\n)\n)\n</code></pre> <pre><code>field.name(\"details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"first_txn_date\").`type`(DateType).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield.name(\"updated_by\")\n.schema(\nfield.name(\"user\"),\nfield.name(\"time\").`type`(TimestampType),\n),\n)\n</code></pre>"},{"location":"setup/guide/data-source/kafka/#additional-configurations","title":"Additional Configurations","text":"<p>At the end of data generation, a report gets generated that summarises the actions it performed. 
We can control the output folder of that report via configurations.</p> JavaScala <pre><code>var config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n</code></pre> <pre><code>val config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n</code></pre>"},{"location":"setup/guide/data-source/kafka/#execute","title":"Execute","text":"<p>To tell Data Caterer that we want to run with the configurations along with the <code>kafkaTask</code>, we have to call <code>execute</code>.</p>"},{"location":"setup/guide/data-source/kafka/#run","title":"Run","text":"<p>Now we can run via the script <code>./run.sh</code> that is in the top level directory of the <code>data-caterer-example</code> to run the class we just created.</p> <pre><code>./run.sh\n#input class MyAdvancedKafkaJavaPlan or MyAdvancedKafkaPlan\n#after completing\ndocker exec docker-kafkaserver-1 kafka-console-consumer --bootstrap-server localhost:9092 --topic account-topic --from-beginning\n</code></pre> <p>Your output should look like this.</p> <pre><code>{\"account_id\":\"ACC56292178\",\"year\":2022,\"amount\":18338.627721151555,\"details\":{\"name\":\"Isaias Reilly\",\"first_txn_date\":\"2021-01-22\",\"updated_by\":{\"user\":\"FgYXbKDWdhHVc3\",\"time\":\"2022-12-30T13:49:07.309Z\"}},\"transactions\":[{\"txn_date\":\"2021-01-22\",\"amount\":30556.52125487579},{\"txn_date\":\"2021-10-29\",\"amount\":39372.302259554635},{\"txn_date\":\"2021-10-29\",\"amount\":61887.31389495968}]}\n{\"account_id\":\"ACC37729457\",\"year\":2022,\"amount\":96885.31758764731,\"details\":{\"name\":\"Randell 
Witting\",\"first_txn_date\":\"2021-06-30\",\"updated_by\":{\"user\":\"HCKYEBHN8AJ3TB\",\"time\":\"2022-12-02T02:05:01.144Z\"}},\"transactions\":[{\"txn_date\":\"2021-06-30\",\"amount\":98042.09647765031},{\"txn_date\":\"2021-10-06\",\"amount\":41191.43564742036},{\"txn_date\":\"2021-11-16\",\"amount\":78852.08184809204},{\"txn_date\":\"2021-10-09\",\"amount\":13747.157653571106}]}\n{\"account_id\":\"ACC23127317\",\"year\":2023,\"amount\":81164.49304198896,\"details\":{\"name\":\"Jed Wisozk\",\"updated_by\":{\"user\":\"9MBFZZ\",\"time\":\"2023-07-12T05:56:52.397Z\"}},\"transactions\":[]}\n</code></pre> <p>Also check the HTML report, found at <code>docker/sample/report/index.html</code>, that gets generated to get an overview of what was executed.</p> <p></p>"},{"location":"setup/guide/data-source/marquez-metadata-source/","title":"Metadata Source","text":"<p>Info</p> <p>Generating data based on an external metadata source is a paid feature. Try the free trial here.</p> <p>Creating a data generator for Postgres tables and CSV file based on metadata stored in Marquez ( follows OpenLineage API).</p>"},{"location":"setup/guide/data-source/marquez-metadata-source/#requirements","title":"Requirements","text":"<ul> <li>10 minutes</li> <li>Git</li> <li>Gradle</li> <li>Docker</li> </ul>"},{"location":"setup/guide/data-source/marquez-metadata-source/#get-started","title":"Get Started","text":"<p>First, we will clone the data-caterer-example repo which will already have the base project setup required.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\n</code></pre>"},{"location":"setup/guide/data-source/marquez-metadata-source/#marquez-setup","title":"Marquez Setup","text":"<p>You can follow the README found here to help with setting up Marquez in your local environment. 
This comes with an instance of Postgres which we will also be using as a data store for generated data.</p> <p>The command that was run for this example to help with setup of dummy data was <code>./docker/up.sh -a 5001 -m 5002 --seed</code>.</p> <p>Check that the following url shows some data like below once you click on <code>food_delivery</code> from the <code>ns</code> drop down in the top right corner.</p> <p></p>"},{"location":"setup/guide/data-source/marquez-metadata-source/#postgres-setup","title":"Postgres Setup","text":"<p>Since we will also be using the Marquez Postgres instance as a data source, we will set up a separate database to store the generated data in via:</p> <pre><code>docker exec marquez-db psql -Upostgres -c 'CREATE DATABASE food_delivery'\n</code></pre>"},{"location":"setup/guide/data-source/marquez-metadata-source/#plan-setup","title":"Plan Setup","text":"<p>Create a new Java or Scala class.</p> <ul> <li>Java: <code>src/main/java/com/github/pflooky/plan/MyAdvancedMetadataSourceJavaPlanRun.java</code></li> <li>Scala: <code>src/main/scala/com/github/pflooky/plan/MyAdvancedMetadataSourcePlanRun.scala</code></li> </ul> <p>Make sure your class extends <code>PlanRun</code>.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.java.api.PlanRun;\n...\n\npublic class MyAdvancedMetadataSourceJavaPlanRun extends PlanRun {\n{\nvar conf = configuration().enableGeneratePlanAndTasks(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n}\n}\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.PlanRun\n...\n\nclass MyAdvancedMetadataSourcePlanRun extends PlanRun {\nval conf = configuration.enableGeneratePlanAndTasks(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n}\n</code></pre> <p>We will enable generate plan and tasks so that we can read from external sources for metadata and save the reports under a folder we can easily 
access.</p>"},{"location":"setup/guide/data-source/marquez-metadata-source/#schema","title":"Schema","text":"<p>We can point the schema of a data source to our Marquez instance. For the Postgres data source, we will point to a <code>namespace</code>, which in Marquez or OpenLineage, represents a set of datasets. For the CSV data source, we will point to a specific <code>namespace</code> and <code>dataset</code>.</p>"},{"location":"setup/guide/data-source/marquez-metadata-source/#single-schema","title":"Single Schema","text":"JavaScala <pre><code>var csvTask = csv(\"my_csv\", \"/tmp/data/csv\", Map.of(\"saveMode\", \"overwrite\", \"header\", \"true\"))\n.schema(metadataSource().marquez(\"http://localhost:5001\", \"food_delivery\", \"public.delivery_7_days\"))\n.count(count().records(10));\n</code></pre> <pre><code>val csvTask = csv(\"my_csv\", \"/tmp/data/csv\", Map(\"saveMode\" -&gt; \"overwrite\", \"header\" -&gt; \"true\"))\n.schema(metadataSource.marquez(\"http://localhost:5001\", \"food_delivery\", \"public.delivery_7_days\"))\n.count(count.records(10))\n</code></pre> <p>The above defines that the schema will come from Marquez, which is a type of metadata source that contains information about schemas. 
Specifically, it points to the <code>food_delivery</code> namespace and <code>public.delivery_7_days</code> dataset to retrieve the schema information from.</p>"},{"location":"setup/guide/data-source/marquez-metadata-source/#multiple-schemas","title":"Multiple Schemas","text":"JavaScala <pre><code>var postgresTask = postgres(\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/food_delivery\", \"postgres\", \"password\", Map.of())\n.schema(metadataSource().marquez(\"http://host.docker.internal:5001\", \"food_delivery\"))\n.count(count().records(10));\n</code></pre> <pre><code>val postgresTask = postgres(\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/food_delivery\", \"postgres\", \"password\")\n.schema(metadataSource.marquez(\"http://host.docker.internal:5001\", \"food_delivery\"))\n.count(count.records(10))\n</code></pre> <p>We have now pointed this Postgres instance to produce multiple schemas that are defined under the <code>food_delivery</code> namespace. Also note that we are using database <code>food_delivery</code> in Postgres to push our generated data to, and we have set the number of records per sub data source (in this case, per table) to be 10.</p>"},{"location":"setup/guide/data-source/marquez-metadata-source/#run","title":"Run","text":"<p>Let's try running it and see what happens.</p> <pre><code>cd ..\n./run.sh\n#input class MyAdvancedMetadataSourceJavaPlanRun or MyAdvancedMetadataSourcePlanRun\n#after completing\ndocker exec marquez-db psql -Upostgres -d food_delivery -c 'SELECT * FROM public.delivery_7_days'\n</code></pre> <p>It should look something like this.</p> <pre><code> order_id |     order_placed_on     |   order_dispatched_on   |   order_delivered_on    |         customer_email         |                     customer_address                     | menu_id | restaurant_id |                        restaurant_address\n   | menu_item_id | category_id | discount_id | city_id | 
driver_id\n----------+-------------------------+-------------------------+-------------------------+--------------------------------+----------------------------------------------------------+---------+---------------+---------------------------------------------------------------\n---+--------------+-------------+-------------+---------+-----------\n    38736 | 2023-02-05 06:05:23.755 | 2023-09-08 04:29:10.878 | 2023-09-03 23:58:34.285 | april.skiles@hotmail.com       | 5018 Lang Dam, Gaylordfurt, MO 35172                     |   59841 |         30971 | Suite 439 51366 Bartoletti Plains, West Lashawndamouth, CA 242\n42 |        55697 |       36370 |       21574 |   88022 |     16569\n4376 | 2022-12-19 14:39:53.442 | 2023-08-30 07:40:06.948 | 2023-03-15 20:38:26.11  | adelina.balistreri@hotmail.com | Apt. 340 9146 Novella Motorway, East Troyhaven, UT 34773 |   66195 |         42765 | Suite 670 8956 Rob Fork, Rennershire, CA 04524\n|        26516 |       81335 |       87615 |   27433 |     45649\n11083 | 2022-10-30 12:46:38.692 | 2023-06-02 13:05:52.493 | 2022-11-27 18:38:07.873 | johnny.gleason@gmail.com       | Apt. 385 99701 Lemke Place, New Irvin, RI 73305          |   66427 |         44438 | 1309 Danny Cape, Weimanntown, AL 15865\n|        41686 |       36508 |       34498 |   24191 |     92405\n58759 | 2023-07-26 14:32:30.883 | 2022-12-25 11:04:08.561 | 2023-04-21 17:43:05.86  | isabelle.ohara@hotmail.com     | 2225 Evie Lane, South Ardella, SD 90805                  |   27106 |         25287 | Suite 678 3731 Dovie Park, Port Luigi, ID 08250\n|        94205 |       66207 |       81051 |   52553 |     27483\n</code></pre> <p>You can also try query some other tables. 
Let's also check what is in the CSV file.</p> <pre><code>$ head docker/sample/csv/part-0000*\nmenu_item_id,category_id,discount_id,city_id,driver_id,order_id,order_placed_on,order_dispatched_on,order_delivered_on,customer_email,customer_address,menu_id,restaurant_id,restaurant_address\n72248,37098,80135,45888,5036,11090,2023-09-20T05:33:08.036+08:00,2023-05-16T23:10:57.119+08:00,2023-05-01T22:02:23.272+08:00,demetrice.rohan@hotmail.com,\"406 Harmony Rue, Wisozkburgh, MD 12282\",33762,9042,\"Apt. 751 0796 Ellan Flats, Lake Chetville, WI 81957\"\n41644,40029,48565,83373,89919,58359,2023-04-18T06:28:26.194+08:00,2022-10-15T18:17:48.998+08:00,2023-02-06T17:02:04.104+08:00,joannie.okuneva@yahoo.com,\"Suite 889 022 Susan Lane, Zemlakport, OR 56996\",27467,6216,\"Suite 016 286 Derick Grove, Dooleytown, NY 14664\"\n49299,53699,79675,40821,61764,72234,2023-07-16T21:33:48.739+08:00,2023-02-14T21:23:10.265+08:00,2023-09-18T02:08:51.433+08:00,ina.heller@yahoo.com,\"Suite 600 86844 Heller Island, New Celestinestad, DE 42622\",48002,12462,\"5418 Okuneva Mountain, East Blairchester, MN 04060\"\n83197,86141,11085,29944,81164,65382,2023-01-20T06:08:25.981+08:00,2023-01-11T13:24:32.968+08:00,2023-09-09T02:30:16.890+08:00,lakisha.bashirian@yahoo.com,\"Suite 938 534 Theodore Lock, Port Caitlynland, LA 67308\",69109,47727,\"4464 Stewart Tunnel, Marguritemouth, AR 56791\"\n</code></pre> <p>Looks like we have some data now. But we can do better and add some enhancements to it.</p> <p>What if we wanted the same records in Postgres <code>public.delivery_7_days</code> to also show up in the CSV file? That's where we can use a foreign key definition.</p>"},{"location":"setup/guide/data-source/marquez-metadata-source/#foreign-key","title":"Foreign Key","text":"<p>We can take a look at the report (under <code>docker/sample/report/index.html</code>) to see what we need to do to create the  foreign key. 
From the overview, you should see under <code>Tasks</code> there is a <code>my_postgres</code> task which has <code>food_delivery_public.delivery_7_days</code> as a step. Click on the link for <code>food_delivery_public.delivery_7_days</code> and it will take us to a page where we can find out about the columns used in this table. Click on the <code>Fields</code> button on the far right to see.</p> <p>We can copy all or a subset of fields that we want matched across the CSV file and Postgres. For this example, we will take all the fields.</p> JavaScala <pre><code>var myPlan = plan().addForeignKeyRelationship(\npostgresTask, List.of(\"key\", \"tmp_year\", \"tmp_name\", \"value\"),\nList.of(Map.entry(csvTask, List.of(\"account_number\", \"year\", \"name\", \"payload\")))\n);\n\nvar conf = ...\n\nexecute(myPlan, conf, postgresTask, csvTask);\n</code></pre> <pre><code>val foreignCols = List(\"order_id\", \"order_placed_on\", \"order_dispatched_on\", \"order_delivered_on\", \"customer_email\",\n\"customer_address\", \"menu_id\", \"restaurant_id\", \"restaurant_address\", \"menu_item_id\", \"category_id\", \"discount_id\",\n\"city_id\", \"driver_id\")\n\nval myPlan = plan.addForeignKeyRelationships(\ncsvTask, foreignCols,\nList(foreignField(postgresTask, \"food_delivery_public.delivery_7_days\", foreignCols))\n)\n\nval conf = ...\n\nexecute(myPlan, conf, postgresTask, csvTask)\n</code></pre> <p>Notice how we have defined the <code>csvTask</code> and <code>foreignCols</code> as the main foreign key but for <code>postgresTask</code>, we had to define it as a <code>foreignField</code>. This is because <code>postgresTask</code> has multiple tables within it, and we only want to define our foreign key with respect to the <code>public.delivery_7_days</code> table. We use the step name (can be seen from the report) to specify the table to target. 
</p> <p>To test this out, we will truncate the <code>public.delivery_7_days</code> table in Postgres first, and then try run again.</p> <pre><code>docker exec marquez-db psql -Upostgres -d food_delivery -c 'TRUNCATE public.delivery_7_days'\n./run.sh\n#input class MyAdvancedMetadataSourceJavaPlanRun or MyAdvancedMetadataSourcePlanRun\ndocker exec marquez-db psql -Upostgres -d food_delivery -c 'SELECT * FROM public.delivery_7_days'\n</code></pre> <pre><code> order_id |     order_placed_on     |   order_dispatched_on   |   order_delivered_on    |        customer_email        |\ncustomer_address                     | menu_id | restaurant_id |                   restaurant_address                   | menu\n_item_id | category_id | discount_id | city_id | driver_id\n----------+-------------------------+-------------------------+-------------------------+------------------------------+-------------\n--------------------------------------------+---------+---------------+--------------------------------------------------------+-----\n---------+-------------+-------------+---------+-----------\n    53333 | 2022-10-15 08:40:23.394 | 2023-01-23 09:42:48.397 | 2023-08-12 08:50:52.397 | normand.aufderhar@gmail.com  | Apt. 036 449\n27 Wilderman Forge, Marvinchester, CT 15952 |   40412 |         70130 | Suite 146 98176 Schaden Village, Grahammouth, SD 12354 |\n90141 |       44210 |       83966 |   78614 |     77449\n</code></pre> <p>Let's grab the first email from the Postgres table and check whether the same record exists in the CSV file.</p> <pre><code>$ cat docker/sample/csv/part-0000* | grep normand.aufderhar\n90141,44210,83966,78614,77449,53333,2022-10-15T08:40:23.394+08:00,2023-01-23T09:42:48.397+08:00,2023-08-12T08:50:52.397+08:00,normand.aufderhar@gmail.com,\"Apt. 036 44927 Wilderman Forge, Marvinchester, CT 15952\",40412,70130,\"Suite 146 98176 Schaden Village, Grahammouth, SD 12354\"\n</code></pre> <p>Great! 
Now we have the ability to get schema information from an external source, add our own foreign keys and generate  data.</p> <p>Check out the full example under <code>AdvancedMetadataSourcePlanRun</code> in the example repo.</p>"},{"location":"setup/guide/data-source/open-metadata-source/","title":"OpenMetadata Source","text":"<p>Info</p> <p>Generating data based on an external metadata source is a paid feature. Try the free trial here.</p> <p>Creating a data generator for a JSON file based on metadata stored in OpenMetadata.</p>"},{"location":"setup/guide/data-source/open-metadata-source/#requirements","title":"Requirements","text":"<ul> <li>10 minutes</li> <li>Git</li> <li>Gradle</li> <li>Docker</li> </ul>"},{"location":"setup/guide/data-source/open-metadata-source/#get-started","title":"Get Started","text":"<p>First, we will clone the data-caterer-example repo which will already have the base project setup required.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\n</code></pre>"},{"location":"setup/guide/data-source/open-metadata-source/#openmetadata-setup","title":"OpenMetadata Setup","text":"<p>You can follow the local docker setup found here to help with setting up OpenMetadata in your local environment.</p> <p>If that page becomes outdated or the link doesn't work, below are the commands I used to run it:</p> <pre><code>mkdir openmetadata-docker &amp;&amp; cd openmetadata-docker\ncurl -sL https://github.com/open-metadata/OpenMetadata/releases/download/1.2.0-release/docker-compose.yml &gt; docker-compose.yml\ndocker compose -f docker-compose.yml up --detach\n</code></pre> <p>Check that the following url works and login with <code>admin:admin</code>. 
Then you should see some data  like below:</p> <p></p>"},{"location":"setup/guide/data-source/open-metadata-source/#plan-setup","title":"Plan Setup","text":"<p>Create a new Java or Scala class.</p> <ul> <li>Java: <code>src/main/java/com/github/pflooky/plan/MyAdvancedOpenMetadataSourceJavaPlanRun.java</code></li> <li>Scala: <code>src/main/scala/com/github/pflooky/plan/MyAdvancedOpenMetadataSourcePlanRun.scala</code></li> </ul> <p>Make sure your class extends <code>PlanRun</code>.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.java.api.PlanRun;\n...\n\npublic class MyAdvancedOpenMetadataSourceJavaPlanRun extends PlanRun {\n{\nvar conf = configuration().enableGeneratePlanAndTasks(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n}\n}\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.PlanRun\n...\n\nclass MyAdvancedOpenMetadataSourcePlanRun extends PlanRun {\nval conf = configuration.enableGeneratePlanAndTasks(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n}\n</code></pre> <p>We will enable generate plan and tasks so that we can read from external sources for metadata and save the reports under a folder we can easily access.</p>"},{"location":"setup/guide/data-source/open-metadata-source/#schema","title":"Schema","text":"<p>We can point the schema of a data source to our OpenMetadata instance. 
We will use a JSON data source so that we can show how nested data types are handled and how we could customise it.</p>"},{"location":"setup/guide/data-source/open-metadata-source/#single-schema","title":"Single Schema","text":"JavaScala <pre><code>import com.github.pflooky.datacaterer.api.model.Constants;\n...\n\nvar jsonTask = json(\"my_json\", \"/opt/app/data/json\", Map.of(\"saveMode\", \"overwrite\"))\n.schema(metadataSource().openMetadataJava(\n\"http://localhost:8585/api\",                                                              //url\nConstants.OPEN_METADATA_AUTH_TYPE_OPEN_METADATA(),                                        //auth type\nMap.of(                                                                                   //additional options (including auth options)\nConstants.OPEN_METADATA_JWT_TOKEN(), \"abc123\",                                        //get from settings/bots/ingestion-bot\nConstants.OPEN_METADATA_TABLE_FQN(), \"sample_data.ecommerce_db.shopify.raw_customer\"  //table fully qualified name\n)\n))\n.count(count().records(10));\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.model.Constants.{OPEN_METADATA_AUTH_TYPE_OPEN_METADATA, OPEN_METADATA_JWT_TOKEN, OPEN_METADATA_TABLE_FQN, SAVE_MODE}\n...\n\nval jsonTask = json(\"my_json\", \"/opt/app/data/json\", Map(\"saveMode\" -&gt; \"overwrite\"))\n.schema(metadataSource.openMetadata(\n\"http://localhost:8585/api\",                                                  //url\nOPEN_METADATA_AUTH_TYPE_OPEN_METADATA,                                        //auth type\nMap(                                                                          //additional options (including auth options)\nOPEN_METADATA_JWT_TOKEN -&gt; \"abc123\",                                        //get from settings/bots/ingestion-bot\nOPEN_METADATA_TABLE_FQN -&gt; \"sample_data.ecommerce_db.shopify.raw_customer\"  //table fully qualified name\n)\n))\n.count(count.records(10))\n</code></pre> <p>The above 
defines that the schema will come from OpenMetadata, which is a type of metadata source that contains information about schemas. Specifically, it points to the <code>sample_data.ecommerce_db.shopify.raw_customer</code> table. You can check out the schema here to see what it looks like.</p>"},{"location":"setup/guide/data-source/open-metadata-source/#run","title":"Run","text":"<p>Let's try run and see what happens.</p> <pre><code>cd ..\n./run.sh\n#input class MyAdvancedOpenMetadataSourceJavaPlanRun or MyAdvancedOpenMetadataSourcePlanRun\n#after completing\ncat docker/sample/json/part-00000-*\n</code></pre> <p>It should look something like this.</p> <pre><code>{\n\"comments\": \"Mh6jqpD5e4M\",\n\"creditcard\": \"6771839575926717\",\n\"membership\": \"Za3wCQUl9E  EJj712\",\n\"orders\": [\n{\n\"product_id\": \"Aa6NG0hxfHVq\",\n\"price\": 16139,\n\"onsale\": false,\n\"tax\": 58134,\n\"weight\": 40734,\n\"others\": 45813,\n\"vendor\": \"Kh\"\n},\n{\n\"product_id\": \"zbHBY \",\n\"price\": 17903,\n\"onsale\": false,\n\"tax\": 39526,\n\"weight\": 9346,\n\"others\": 52035,\n\"vendor\": \"jbkbnXAa\"\n},\n{\n\"product_id\": \"5qs3gakppd7Nw5\",\n\"price\": 48731,\n\"onsale\": true,\n\"tax\": 81105,\n\"weight\": 2004,\n\"others\": 20465,\n\"vendor\": \"nozCDMSXRPH Ev\"\n},\n{\n\"product_id\": \"CA6h17ANRwvb\",\n\"price\": 62102,\n\"onsale\": true,\n\"tax\": 96601,\n\"weight\": 78849,\n\"others\": 79453,\n\"vendor\": \" ihVXEJz7E2EFS\"\n}\n],\n\"platform\": \"GLt9\",\n\"preference\": {\n\"key\": \"nmPmsPjg C\",\n\"value\": true\n},\n\"shipping_address\": [\n{\n\"name\": \"Loren Bechtelar\",\n\"street_address\": \"Suite 526 293 Rohan Road, Wunschshire, NE 25532\",\n\"city\": \"South Norrisland\",\n\"postcode\": \"56863\"\n}\n],\n\"shipping_date\": \"2022-11-03\",\n\"transaction_date\": \"2023-02-01\",\n\"customer\": {\n\"username\": \"lance.murphy\",\n\"name\": \"Zane Brakus DVM\",\n\"sex\": \"7HcAaPiO\",\n\"address\": \"594 Loida Haven, Gilland, MA 26071\",\n\"mail\": 
\"Un3fhbvK2rEbenIYdnq\",\n\"birthdate\": \"2023-01-31\"\n}\n}\n</code></pre> <p>Looks like we have some data now. But we can do better and add some enhancements to it.</p>"},{"location":"setup/guide/data-source/open-metadata-source/#custom-metadata","title":"Custom metadata","text":"<p>We can see from the data generated, that it isn't quite what we want. The metadata is not sufficient for us to produce production-like data yet. Let's try to add some enhancements to it.</p> <p>Let's make the <code>platform</code> field a choice field that can only be a set of certain values and the nested field <code>customer.sex</code> is also from a predefined set of values.</p> JavaScala <pre><code>var jsonTask = json(\"my_json\", \"/opt/app/data/json\", Map.of(\"saveMode\", \"overwrite\"))\n.schema(\nmetadata...\n))\n.schema(\nfield().name(\"platform\").oneOf(\"website\", \"mobile\"),\nfield().name(\"customer\").schema(field().name(\"sex\").oneOf(\"M\", \"F\", \"O\"))\n)\n.count(count().records(10));\n</code></pre> <pre><code>val jsonTask = json(\"my_json\", \"/opt/app/data/json\", Map(\"saveMode\" -&gt; \"overwrite\"))\n.schema(\nmetadata...\n))\n.schema(\nfield.name(\"platform\").oneOf(\"website\", \"mobile\"),\nfield.name(\"customer\").schema(field.name(\"sex\").oneOf(\"M\", \"F\", \"O\"))\n)\n.count(count.records(10))\n</code></pre> <p>Let's test it out by running it again</p> <pre><code>./run.sh\n#input class MyAdvancedMetadataSourceJavaPlanRun or MyAdvancedMetadataSourcePlanRun\ncat docker/sample/json/part-00000-*\n</code></pre> <pre><code>{\n\"comments\": \"vqbPUm\",\n\"creditcard\": \"6304867705548636\",\n\"membership\": \"GZ1xOnpZSUOKN\",\n\"orders\": [\n{\n\"product_id\": \"rgOokDAv\",\n\"price\": 77367,\n\"onsale\": false,\n\"tax\": 61742,\n\"weight\": 87855,\n\"others\": 26857,\n\"vendor\": \"04XHR64ImMr9T\"\n}\n],\n\"platform\": \"mobile\",\n\"preference\": {\n\"key\": \"IB5vNdWka\",\n\"value\": true\n},\n\"shipping_address\": [\n{\n\"name\": \"Isiah 
Bins\",\n\"street_address\": \"36512 Ross Spurs, Hillhaven, IA 18760\",\n\"city\": \"Averymouth\",\n\"postcode\": \"75818\"\n},\n{\n\"name\": \"Scott Prohaska\",\n\"street_address\": \"26573 Haley Ports, Dariusland, MS 90642\",\n\"city\": \"Ashantimouth\",\n\"postcode\": \"31792\"\n},\n{\n\"name\": \"Rudolf Stamm\",\n\"street_address\": \"Suite 878 0516 Danica Path, New Christiaport, ID 10525\",\n\"city\": \"Doreathaport\",\n\"postcode\": \"62497\"\n}\n],\n\"shipping_date\": \"2023-08-24\",\n\"transaction_date\": \"2023-02-01\",\n\"customer\": {\n\"username\": \"jolie.cremin\",\n\"name\": \"Fay Klein\",\n\"sex\": \"O\",\n\"address\": \"Apt. 174 5084 Volkman Creek, Hillborough, PA 61959\",\n\"mail\": \"BiTmzb7\",\n\"birthdate\": \"2023-04-07\"\n}\n}\n</code></pre> <p>Great! Now we have the ability to get schema information from an external source, add our own metadata and generate  data.</p>"},{"location":"setup/guide/data-source/open-metadata-source/#data-validation","title":"Data validation","text":"<p>Another aspect of OpenMetadata that can be leveraged is the definition of data quality rules. These rules can be  incorporated into your Data Caterer job as well by enabling data validations via <code>enableGenerateValidations</code> in  <code>configuration</code>.</p> JavaScala <pre><code>var conf = configuration().enableGeneratePlanAndTasks(true)\n.enableGenerateValidations(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n\nexecute(conf, jsonTask);\n</code></pre> <pre><code>val conf = configuration.enableGeneratePlanAndTasks(true)\n.enableGenerateValidations(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n\nexecute(conf, jsonTask)\n</code></pre> <p>Check out the full example under <code>AdvancedOpenMetadataSourcePlanRun</code> in the example repo.</p>"},{"location":"setup/guide/data-source/solace/","title":"Solace","text":"<p>Info</p> <p>Writing data to Solace is a paid feature. 
Try the free trial here.</p> <p>Creating a data generator for Solace. You will build a Docker image that will be able to populate data in Solace for the queues/topics you configure.</p> <p></p>"},{"location":"setup/guide/data-source/solace/#requirements","title":"Requirements","text":"<ul> <li>20 minutes</li> <li>Git</li> <li>Gradle</li> <li>Docker</li> <li>Solace</li> </ul>"},{"location":"setup/guide/data-source/solace/#get-started","title":"Get Started","text":"<p>First, we will clone the data-caterer-example repo which will already have the base project setup required.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\n</code></pre> <p>If you already have a Solace instance running, you can skip to this step.</p>"},{"location":"setup/guide/data-source/solace/#solace-setup","title":"Solace Setup","text":"<p>Next, let's make sure you have an instance of Solace up and running in your local environment. This will make it easy for us to iterate and check our changes.</p> <pre><code>cd docker\ndocker-compose up -d solace\n</code></pre> <p>Open up localhost:8080, log in with <code>admin:admin</code> and check that the <code>default</code> VPN is there, like below. Notice there are 2 queues/topics created. 
If you do not see 2 created, try to run the script found under <code>docker/data/solace/setup_solace.sh</code> and change the <code>host</code> to <code>localhost</code>.</p> <p></p>"},{"location":"setup/guide/data-source/solace/#plan-setup","title":"Plan Setup","text":"<p>Create a new Java or Scala class.</p> <ul> <li>Java: <code>src/main/java/com/github/pflooky/plan/MyAdvancedSolaceJavaPlan.java</code></li> <li>Scala: <code>src/main/scala/com/github/pflooky/plan/MyAdvancedSolacePlan.scala</code></li> </ul> <p>Make sure your class extends <code>PlanRun</code>.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.java.api.PlanRun;\n\npublic class MyAdvancedSolaceJavaPlan extends PlanRun {\n}\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.PlanRun\n\nclass MyAdvancedSolacePlan extends PlanRun {\n}\n</code></pre> <p>This class is where we define all of our configurations for generating data. There are helper variables and methods defined to make it simple and easy to use.</p>"},{"location":"setup/guide/data-source/solace/#connection-configuration","title":"Connection Configuration","text":"<p>Within our class, we can start by defining the connection properties to connect to Solace.</p> JavaScala <pre><code>var accountTask = solace(\n\"my_solace\",                        //name\n\"smf://host.docker.internal:55554\", //url\nMap.of()                            //optional additional connection options\n);\n</code></pre> <p>Additional connection options can be found here.</p> <pre><code>val accountTask = solace(\n\"my_solace\",                        //name\n\"smf://host.docker.internal:55554\", //url\nMap()                               //optional additional connection options\n)\n</code></pre> <p>Additional connection options can be found here.</p>"},{"location":"setup/guide/data-source/solace/#schema","title":"Schema","text":"<p>Let's create a task for inserting data into the <code>rest_test_queue</code> or 
<code>rest_test_topic</code> that is already created for us from this step.</p> <p>Trimming the connection details to work with the docker-compose Solace, we have a base Solace connection to define the JNDI destination we will publish to. Let's define each field along with their corresponding data type. You will notice that the <code>text</code> fields do not have a data type defined. This is because the default data type is <code>StringType</code>.</p> JavaScala <pre><code>{\nvar solaceTask = solace(\"my_solace\", \"smf://host.docker.internal:55554\")\n.destination(\"/JNDI/Q/rest_test_queue\")\n.schema(\nfield().name(\"value\").sql(\"TO_JSON(content)\"),\n//field().name(\"partition\").type(IntegerType.instance()),   //can define message JMS priority here\nfield().name(\"headers\")                                     //set message properties via headers field\n.type(HeaderType.getType())\n.sql(\n\"ARRAY(\" +\n\"NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\" +\n\"NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\" +\n\")\"\n),\nfield().name(\"content\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield().name(\"year\").type(IntegerType.instance()).min(2021).max(2023),\nfield().name(\"amount\").type(DoubleType.instance()),\nfield().name(\"details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"first_txn_date\").type(DateType.instance()).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield().name(\"updated_by\")\n.schema(\nfield().name(\"user\"),\nfield().name(\"time\").type(TimestampType.instance())\n)\n),\nfield().name(\"transactions\").type(ArrayType.instance())\n.schema(\nfield().name(\"txn_date\").type(DateType.instance()).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield().name(\"amount\").type(DoubleType.instance())\n)\n)\n)\n.count(count().records(10));\n}\n</code></pre> <pre><code>val 
solaceTask = solace(\"my_solace\", \"smf://host.docker.internal:55554\")\n.destination(\"/JNDI/Q/rest_test_queue\")\n.schema(\nfield.name(\"value\").sql(\"TO_JSON(content)\"),\n//field.name(\"partition\").`type`(IntegerType),  //can define message JMS priority here\nfield.name(\"headers\")                           //set message properties via headers field\n.`type`(HeaderType.getType)\n.sql(\n\"\"\"ARRAY(\n          |  NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\n          |  NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\n          |)\"\"\".stripMargin\n),\nfield.name(\"content\")\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield.name(\"year\").`type`(IntegerType).min(2021).max(2023),\nfield.name(\"amount\").`type`(DoubleType),\nfield.name(\"details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"first_txn_date\").`type`(DateType).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield.name(\"updated_by\")\n.schema(\nfield.name(\"user\"),\nfield.name(\"time\").`type`(TimestampType),\n),\n),\nfield.name(\"transactions\").`type`(ArrayType)\n.schema(\nfield.name(\"txn_date\").`type`(DateType).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield.name(\"amount\").`type`(DoubleType),\n)\n),\n).count(count.records(10))\n</code></pre>"},{"location":"setup/guide/data-source/solace/#fields","title":"Fields","text":"<p>The schema defined for Solace has a format that needs to be followed as noted above. 
Specifically, the required fields are:</p> <ul> <li>value</li> </ul> <p>The other fields are optional:</p> <ul> <li>partition - refers to JMS priority of the message</li> <li>headers - refers to JMS message properties</li> </ul>"},{"location":"setup/guide/data-source/solace/#headers","title":"headers","text":"<p><code>headers</code> follows a particular pattern: it is of type <code>HeaderType.getType</code> which, behind the scenes, translates to <code>array&lt;struct&lt;key: string,value: binary&gt;&gt;</code>. To be able to generate data for this data type, we need to use an SQL expression like the one below. You will notice that the <code>value</code> part refers to <code>content.account_id</code>, where <code>content</code> is another field defined at the top level of the schema. This allows you to reference other values that have already been generated.</p> JavaScala <pre><code>field().name(\"headers\")\n.type(HeaderType.getType())\n.sql(\n\"ARRAY(\" +\n\"NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\" +\n\"NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\" +\n\")\"\n)\n</code></pre> <pre><code>field.name(\"headers\")\n.`type`(HeaderType.getType)\n.sql(\n\"\"\"ARRAY(\n      |  NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\n      |  NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\n      |)\"\"\".stripMargin\n)\n</code></pre>"},{"location":"setup/guide/data-source/solace/#transactions","title":"transactions","text":"<p><code>transactions</code> is an array that contains an inner structure of <code>txn_date</code> and <code>amount</code>. 
The size of the array generated can be controlled via <code>arrayMinLength</code> and <code>arrayMaxLength</code>.</p> JavaScala <pre><code>field().name(\"transactions\").type(ArrayType.instance())\n.schema(\nfield().name(\"txn_date\").type(DateType.instance()).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield().name(\"amount\").type(DoubleType.instance())\n)\n</code></pre> <pre><code>field.name(\"transactions\").`type`(ArrayType)\n.schema(\nfield.name(\"txn_date\").`type`(DateType).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield.name(\"amount\").`type`(DoubleType),\n)\n</code></pre>"},{"location":"setup/guide/data-source/solace/#details","title":"details","text":"<p><code>details</code> is another example of a nested schema structure where it also has a nested structure itself in <code>updated_by</code>. One thing to note here is the <code>first_txn_date</code> field has a reference to the <code>content.transactions</code> array where it will sort the array by <code>txn_date</code> and get the first element.</p> JavaScala <pre><code>field().name(\"details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"first_txn_date\").type(DateType.instance()).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield().name(\"updated_by\")\n.schema(\nfield().name(\"user\"),\nfield().name(\"time\").type(TimestampType.instance())\n)\n)\n</code></pre> <pre><code>field.name(\"details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"first_txn_date\").`type`(DateType).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield.name(\"updated_by\")\n.schema(\nfield.name(\"user\"),\nfield.name(\"time\").`type`(TimestampType),\n),\n)\n</code></pre>"},{"location":"setup/guide/data-source/solace/#additional-configurations","title":"Additional Configurations","text":"<p>At the end of data generation, a report gets generated that summarises the actions it performed. 
We can control the output folder of that report via configurations.</p> JavaScala <pre><code>var config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n</code></pre> <pre><code>val config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n</code></pre>"},{"location":"setup/guide/data-source/solace/#execute","title":"Execute","text":"<p>To tell Data Caterer that we want to run with the configurations along with the <code>solaceTask</code>, we have to call <code>execute</code>.</p>"},{"location":"setup/guide/data-source/solace/#run","title":"Run","text":"<p>Now we can run the class we just created via the <code>./run.sh</code> script in the top level directory of <code>data-caterer-example</code>.</p> <pre><code>./run.sh\n#input class AdvancedSolaceJavaPlanRun or AdvancedSolacePlanRun\n#after completing, check http://localhost:8080 from browser\n</code></pre> <p>Your output should look like this.</p> <p></p> <p>Unfortunately, there is no easy way to see the message content. You can check the message content from your application or service that consumes these messages.</p> <p>Also check the HTML report, found at <code>docker/sample/report/index.html</code>, that gets generated to get an overview of what was executed. Or view the sample report found here.</p>"},{"location":"setup/guide/scenario/auto-generate-connection/","title":"Auto Generate From Data Connection","text":"<p>Info</p> <p>Auto data generation from data connection is a paid feature. 
Try the free trial here.</p> <p>Creating a data generator based on only a data connection to Postgres.</p>"},{"location":"setup/guide/scenario/auto-generate-connection/#requirements","title":"Requirements","text":"<ul> <li>5 minutes</li> <li>Git</li> <li>Gradle</li> <li>Docker</li> </ul>"},{"location":"setup/guide/scenario/auto-generate-connection/#get-started","title":"Get Started","text":"<p>First, we will clone the data-caterer-example repo which will already have the base project setup required.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\n</code></pre>"},{"location":"setup/guide/scenario/auto-generate-connection/#plan-setup","title":"Plan Setup","text":"<p>Create a new Java or Scala class.</p> <ul> <li>Java: <code>src/main/java/com/github/pflooky/plan/MyAdvancedAutomatedJavaPlanRun.java</code></li> <li>Scala: <code>src/main/scala/com/github/pflooky/plan/MyAdvancedAutomatedPlanRun.scala</code></li> </ul> <p>Make sure your class extends <code>PlanRun</code>.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.java.api.PlanRun;\n...\n\npublic class MyAdvancedAutomatedJavaPlanRun extends PlanRun {\n{\nvar autoRun = configuration()\n.postgres(\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/customer\")  (1)\n.enableGeneratePlanAndTasks(true)                                                 (2)\n.generatedPlanAndTaskFolderPath(\"/opt/app/data/generated\")                        (3)\n.enableUniqueCheck(true)                                                          (4)\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n\nexecute(autoRun);\n}\n}\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.PlanRun\n...\n\nclass MyAdvancedAutomatedPlanRun extends PlanRun {\n\nval autoRun = configuration\n.postgres(\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/customer\")  (1)\n.enableGeneratePlanAndTasks(true)                                                 
(2)\n.generatedPlanAndTaskFolderPath(\"/opt/app/data/generated\")                        (3)\n.enableUniqueCheck(true)                                                          (4)\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n\nexecute(configuration = autoRun)\n}\n</code></pre> <p>In the above code, we note the following:</p> <ol> <li>Data source configuration to a Postgres data source called <code>my_postgres</code></li> <li>We have enabled the flag <code>enableGeneratePlanAndTasks</code> which tells Data Caterer to go to <code>my_postgres</code> and generate    data for all the tables found under the database <code>customer</code> (which is defined in the connection string).</li> <li>The config <code>generatedPlanAndTaskFolderPath</code> defines where the metadata that is gathered from <code>my_postgres</code> should be    saved at so that we could re-use it later.</li> <li><code>enableUniqueCheck</code> is set to true to ensure that generated data is unique based on primary key or foreign key    definitions.</li> </ol> <p>Note</p> <p>Unique check will only ensure generated data is unique. Any existing data in your data source is not taken into  account, so generated data may fail to insert depending on the data source restrictions</p>"},{"location":"setup/guide/scenario/auto-generate-connection/#postgres-setup","title":"Postgres Setup","text":"<p>If you don't have your own Postgres up and running, you can set up and run an instance configured in the <code>docker</code> folder via.</p> <pre><code>cd docker\ndocker-compose up -d postgres\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c '\\dt+ account.*'\n</code></pre> <p>This will create the tables found under <code>docker/data/sql/postgres/customer.sql</code>. You can change this file to contain your own tables. 
We can see there are 4 tables created for us, <code>accounts, balances, transactions and mapping</code>.</p>"},{"location":"setup/guide/scenario/auto-generate-connection/#run","title":"Run","text":"<p>Let's try run.</p> <pre><code>cd ..\n./run.sh\n#input class MyAdvancedAutomatedJavaPlanRun or MyAdvancedAutomatedPlanRun\n#after completing\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c 'select * from account.accounts limit 1;'\n</code></pre> <p>It should look something like this.</p> <pre><code>   id   | account_number  | account_status | created_by | created_by_fixed_length | customer_id_int | customer_id_smallint | customer_id_bigint |   customer_id_decimal    | customer_id_real | customer_id_double | open_date  |     open_timestamp      | last_opened_time |                                                           payload_bytes\n--------+-----------------+----------------+------------+-------------------------+-----------------+----------------------+--------------------+--------------------------+------------------+--------------------+------------+-------------------------+------------------+------------------------------------------------------------------------------------------------------------------------------------\n 100414 | 5uROOVOUyQUbubN | h3H            | SfA0eZJcTm | CuRw                    |              13 |                   42 |               6041 | 76987.745612542900000000 |         91866.78 |  66400.37433202339 | 2023-03-05 | 2023-08-14 11:33:11.343 | 23:58:01.736     | \\x604d315d4547616e6a233050415373317274736f5e682d516132524f3d23233c37463463322f342d34376d597e665d6b3d395b4238284028622b7d6d2b4f5042\n(1 row)\n</code></pre> <p>The data that gets inserted will follow the foreign keys that are defined within Postgres and also ensure the insertion order is correct.</p> <p>Also check the HTML report that gets generated under <code>docker/sample/report/index.html</code>. 
You can see a summary of what was generated along with other metadata.</p> <p>You can now play around with other tables or data sources and auto generate data for them.</p>"},{"location":"setup/guide/scenario/auto-generate-connection/#additional-topics","title":"Additional Topics","text":""},{"location":"setup/guide/scenario/auto-generate-connection/#learn-from-existing-data","title":"Learn From Existing Data","text":"<p>If you have any existing data within your data source, Data Caterer will gather metadata about the existing data to help guide it when generating new data. There are configurations that can help tune the metadata analysis found here.</p>"},{"location":"setup/guide/scenario/auto-generate-connection/#filter-out-schematables","title":"Filter Out Schema/Tables","text":"<p>As part of your connection definition, you can define any schemas and/or tables you don't want to generate data for. In the example below, no data is generated for any tables under the <code>history</code> and <code>audit</code> schemas. 
Also, any table named <code>balances</code> or <code>transactions</code> in any schema will not have data generated.</p> JavaScala <pre><code>var autoRun = configuration()\n.postgres(\n\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/customer\",\nMap.of(\n\"filterOutSchema\", \"history, audit\",\n\"filterOutTable\", \"balances, transactions\")\n);\n</code></pre> <pre><code>val autoRun = configuration\n.postgres(\n\"my_postgres\",\n\"jdbc:postgresql://host.docker.internal:5432/customer\",\nMap(\n\"filterOutSchema\" -&gt; \"history, audit\",\n\"filterOutTable\" -&gt; \"balances, transactions\")\n)\n</code></pre>"},{"location":"setup/guide/scenario/auto-generate-connection/#define-record-count","title":"Define record count","text":"<p>You can control the record count per sub data source via <code>numRecordsPerStep</code>.</p> JavaScala <pre><code>var autoRun = configuration()\n...\n.numRecordsPerStep(100);\n\nexecute(autoRun);\n</code></pre> <pre><code>val autoRun = configuration\n...\n.numRecordsPerStep(100)\n\nexecute(configuration = autoRun)\n</code></pre>"},{"location":"setup/guide/scenario/batch-and-event/","title":"Generate Batch and Event Data","text":"<p>Info</p> <p>Generating event data is a paid feature. 
Try the free trial here.</p> <p>Creating a data generator for Kafka topic with matching records in a CSV file.</p>"},{"location":"setup/guide/scenario/batch-and-event/#requirements","title":"Requirements","text":"<ul> <li>5 minutes</li> <li>Git</li> <li>Gradle</li> <li>Docker</li> </ul>"},{"location":"setup/guide/scenario/batch-and-event/#get-started","title":"Get Started","text":"<p>First, we will clone the data-caterer-example repo which will already have the base project setup required.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\n</code></pre>"},{"location":"setup/guide/scenario/batch-and-event/#kafka-setup","title":"Kafka Setup","text":"<p>If you don't have your own Kafka up and running, you can set up and run an instance configured in the <code>docker</code> folder via.</p> <pre><code>cd docker\ndocker-compose up -d kafka\ndocker exec docker-kafkaserver-1 kafka-topics --bootstrap-server localhost:9092 --list\n</code></pre> <p>Let's create a task for inserting data into the <code>account-topic</code> that is already defined under<code>docker/data/kafka/setup_kafka.sh</code>.</p>"},{"location":"setup/guide/scenario/batch-and-event/#plan-setup","title":"Plan Setup","text":"<p>Create a new Java or Scala class.</p> <ul> <li>Java: <code>src/main/java/com/github/pflooky/plan/MyAdvancedBatchEventJavaPlanRun.java</code></li> <li>Scala: <code>src/main/scala/com/github/pflooky/plan/MyAdvancedBatchEventPlanRun.scala</code></li> </ul> <p>Make sure your class extends <code>PlanRun</code>.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.java.api.PlanRun;\n...\n\npublic class MyAdvancedBatchEventJavaPlanRun extends PlanRun {\n{\nvar kafkaTask = new AdvancedKafkaJavaPlanRun().getKafkaTask();\n}\n}\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.PlanRun\n...\n\nclass MyAdvancedBatchEventPlanRun extends PlanRun {\nval kafkaTask = new AdvancedKafkaPlanRun().kafkaTask\n}\n</code></pre> <p>We will borrow the Kafka 
task that is already defined under the class <code>AdvancedKafkaPlanRun</code> or <code>AdvancedKafkaJavaPlanRun</code>. You can go through the Kafka guide here for more details.</p>"},{"location":"setup/guide/scenario/batch-and-event/#schema","title":"Schema","text":"<p>Let us set up the corresponding schema for the CSV file where we want to match the values that are generated for the Kafka messages.</p> JavaScala <pre><code>var kafkaTask = new AdvancedKafkaJavaPlanRun().getKafkaTask();\n\nvar csvTask = csv(\"my_csv\", \"/opt/app/data/csv/account\")\n.schema(\nfield().name(\"account_number\"),\nfield().name(\"year\"),\nfield().name(\"name\"),\nfield().name(\"payload\")\n);\n</code></pre> <pre><code>val kafkaTask = new AdvancedKafkaPlanRun().kafkaTask\n\nval csvTask = csv(\"my_csv\", \"/opt/app/data/csv/account\")\n.schema(\nfield.name(\"account_number\"),\nfield.name(\"year\"),\nfield.name(\"name\"),\nfield.name(\"payload\")\n)\n</code></pre> <p>This is a simple schema where we want to use the values and metadata that are already defined in the <code>kafkaTask</code> to determine what the data will look like for the CSV file. Even if we defined some metadata here, it would be overridden when we define our foreign key relationships.</p>"},{"location":"setup/guide/scenario/batch-and-event/#foreign-keys","title":"Foreign Keys","text":"<p>Comparing the above CSV schema against the Kafka schema, we note the following:</p> <ul> <li><code>account_number</code> in CSV needs to match with the <code>account_id</code> in Kafka<ul> <li>We see that <code>account_id</code> is referred to in the <code>key</code> column as <code>field.name(\"key\").sql(\"content.account_id\")</code></li> </ul> </li> <li><code>year</code> needs to match with <code>content.year</code> in Kafka, which is a nested field<ul> <li>We can only do foreign key relationships with top level fields, not nested fields. 
So we define a new column   called <code>tmp_year</code> which will not appear in the final output for the Kafka messages but is used as an intermediate   step <code>field.name(\"tmp_year\").sql(\"content.year\").omit(true)</code></li> </ul> </li> <li><code>name</code> needs to match with <code>content.details.name</code> in Kafka, also a nested field<ul> <li>Using the same logic as above, we define a temporary column called <code>tmp_name</code> which will take the value of the   nested field but will be omitted <code>field.name(\"tmp_name\").sql(\"content.details.name\").omit(true)</code></li> </ul> </li> <li><code>payload</code> represents the whole JSON message sent to Kafka, which matches to <code>value</code> column</li> </ul> <p>Our foreign keys are therefore defined like below. Order is important when defining the list of columns. The index needs to match with the corresponding column in the other data source.</p> JavaScala <pre><code>var myPlan = plan().addForeignKeyRelationship(\nkafkaTask, List.of(\"key\", \"tmp_year\", \"tmp_name\", \"value\"),\nList.of(Map.entry(csvTask, List.of(\"account_number\", \"year\", \"name\", \"payload\")))\n);\n\nvar conf = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n\nexecute(myPlan, conf, kafkaTask, csvTask);\n</code></pre> <pre><code>val myPlan = plan.addForeignKeyRelationship(\nkafkaTask, List(\"key\", \"tmp_year\", \"tmp_name\", \"value\"),\nList(csvTask -&gt; List(\"account_number\", \"year\", \"name\", \"payload\"))\n)\n\nval conf = configuration.generatedReportsFolderPath(\"/opt/app/data/report\")\n\nexecute(myPlan, conf, kafkaTask, csvTask)\n</code></pre>"},{"location":"setup/guide/scenario/batch-and-event/#run","title":"Run","text":"<p>Let's try run.</p> <pre><code>cd ..\n./run.sh\n#input class MyAdvancedBatchEventJavaPlanRun or MyAdvancedBatchEventPlanRun\n#after completing\ndocker exec docker-kafkaserver-1 kafka-console-consumer --bootstrap-server localhost:9092 --topic account-topic 
--from-beginning\n</code></pre> <p>It should look something like this.</p> <pre><code>{\"account_id\":\"ACC03093143\",\"year\":2023,\"amount\":87990.37196728592,\"details\":{\"name\":\"Nadine Heidenreich Jr.\",\"first_txn_date\":\"2021-11-09\",\"updated_by\":{\"user\":\"YfEyJCe8ohrl0j IfyT\",\"time\":\"2022-09-26T20:47:53.404Z\"}},\"transactions\":[{\"txn_date\":\"2021-11-09\",\"amount\":97073.7914706189}]}\n{\"account_id\":\"ACC08764544\",\"year\":2021,\"amount\":28675.58758765888,\"details\":{\"name\":\"Delila Beer\",\"first_txn_date\":\"2021-05-19\",\"updated_by\":{\"user\":\"IzB5ksXu\",\"time\":\"2023-01-26T20:47:26.389Z\"}},\"transactions\":[{\"txn_date\":\"2021-10-01\",\"amount\":80995.23818711648},{\"txn_date\":\"2021-05-19\",\"amount\":92572.40049217848},{\"txn_date\":\"2021-12-11\",\"amount\":99398.79832225188}]}\n{\"account_id\":\"ACC62505420\",\"year\":2023,\"amount\":96125.3125884202,\"details\":{\"name\":\"Shawn Goodwin\",\"updated_by\":{\"user\":\"F3dqIvYp2pFtena4\",\"time\":\"2023-02-11T04:38:29.832Z\"}},\"transactions\":[]}\n</code></pre> <p>Let's also check if there is a corresponding record in the CSV file.</p> <pre><code>$ cat docker/sample/csv/account/part-0000* | grep ACC03093143\nACC03093143,2023,Nadine Heidenreich Jr.,\"{\\\"account_id\\\":\\\"ACC03093143\\\",\\\"year\\\":2023,\\\"amount\\\":87990.37196728592,\\\"details\\\":{\\\"name\\\":\\\"Nadine Heidenreich Jr.\\\",\\\"first_txn_date\\\":\\\"2021-11-09\\\",\\\"updated_by\\\":{\\\"user\\\":\\\"YfEyJCe8ohrl0j IfyT\\\",\\\"time\\\":\\\"2022-09-26T20:47:53.404Z\\\"}},\\\"transactions\\\":[{\\\"txn_date\\\":\\\"2021-11-09\\\",\\\"amount\\\":97073.7914706189}]}\"\n</code></pre> <p>Great! 
The account, year, name and payload look to all match up.</p>"},{"location":"setup/guide/scenario/batch-and-event/#additional-topics","title":"Additional Topics","text":""},{"location":"setup/guide/scenario/batch-and-event/#order-of-execution","title":"Order of execution","text":"<p>You may notice that the events are generated first, then the CSV file. This is because as part of the <code>execute</code> function, we passed in the <code>kafkaTask</code> first, before the <code>csvTask</code>. You can change the order of execution by passing in <code>csvTask</code> before <code>kafkaTask</code> into the <code>execute</code> function.</p>"},{"location":"setup/guide/scenario/delete-generated-data/","title":"Delete Generated Data","text":"<p>Info</p> <p>Delete generated data is a paid feature. Try the free trial here.</p> <p>Creating a data generator for Postgres and delete the generated data after using it.</p>"},{"location":"setup/guide/scenario/delete-generated-data/#requirements","title":"Requirements","text":"<ul> <li>5 minutes</li> <li>Git</li> <li>Gradle</li> <li>Docker</li> </ul>"},{"location":"setup/guide/scenario/delete-generated-data/#get-started","title":"Get Started","text":"<p>First, we will clone the data-caterer-example repo which will already have the base project setup required.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\n</code></pre>"},{"location":"setup/guide/scenario/delete-generated-data/#plan-setup","title":"Plan Setup","text":"<p>Create a new Java or Scala class.</p> <ul> <li>Java: <code>src/main/java/com/github/pflooky/plan/MyAdvancedDeleteJavaPlanRun.java</code></li> <li>Scala: <code>src/main/scala/com/github/pflooky/plan/MyAdvancedDeletePlanRun.scala</code></li> </ul> <p>Make sure your class extends <code>PlanRun</code>.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.java.api.PlanRun;\n...\n\npublic class MyAdvancedDeleteJavaPlanRun extends PlanRun {\n{\nvar autoRun = 
configuration()\n.postgres(\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/customer\")  (1)\n.enableGeneratePlanAndTasks(true)                                                 (2)\n.enableRecordTracking(true)                                                       (3)\n.enableDeleteGeneratedRecords(false)                                              (4)\n.enableUniqueCheck(true)\n.generatedPlanAndTaskFolderPath(\"/opt/app/data/generated\")                        (5)\n.recordTrackingFolderPath(\"/opt/app/data/recordTracking\")                         (6)\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n\nexecute(autoRun);\n}\n}\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.PlanRun\n...\n\nclass MyAdvancedDeletePlanRun extends PlanRun {\n\nval autoRun = configuration\n.postgres(\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/customer\")  (1)\n.enableGeneratePlanAndTasks(true)                                                 (2)\n.enableRecordTracking(true)                                                       (3)\n.enableDeleteGeneratedRecords(false)                                              (4)\n.enableUniqueCheck(true)\n.generatedPlanAndTaskFolderPath(\"/opt/app/data/generated\")                        (5)\n.recordTrackingFolderPath(\"/opt/app/data/recordTracking\")                         (6)\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n\nexecute(configuration = autoRun)\n}\n</code></pre> <p>In the above code we note the following:</p> <ol> <li>We have defined a Postgres connection called <code>my_postgres</code></li> <li><code>enableGeneratePlanAndTasks</code> is enabled to auto generate data for all tables under <code>customer</code> database</li> <li><code>enableRecordTracking</code> is enabled to ensure that all generated records are tracked. This will get used when we want    to delete data afterwards</li> <li><code>enableDeleteGeneratedRecords</code> is disabled for now. 
We want to see the generated data first and delete sometime after</li> <li><code>generatedPlanAndTaskFolderPath</code> is the folder path where we saved the metadata we have gathered from <code>my_postgres</code></li> <li><code>recordTrackingFolderPath</code> is the folder path where record tracking is maintained. We need to persist this data to    ensure it is still available when we want to delete data</li> </ol>"},{"location":"setup/guide/scenario/delete-generated-data/#postgres-setup","title":"Postgres Setup","text":"<p>If you don't have your own Postgres up and running, you can set up and run an instance configured in the <code>docker</code> folder via.</p> <pre><code>cd docker\ndocker-compose up -d postgres\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c '\\dt+ account.*'\n</code></pre> <p>This will create the tables found under <code>docker/data/sql/postgres/customer.sql</code>. You can change this file to contain your own tables. We can see there are 4 tables created for us, <code>accounts, balances, transactions and mapping</code>.</p>"},{"location":"setup/guide/scenario/delete-generated-data/#run","title":"Run","text":"<p>Let's try run.</p> <pre><code>cd ..\n./run.sh\n#input class MyAdvancedDeleteJavaPlanRun or MyAdvancedDeletePlanRun\n#after completing\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c 'select * from account.accounts limit 1'\n</code></pre> <p>It should look something like this.</p> <pre><code>   id   | account_number  | account_status | created_by | created_by_fixed_length | customer_id_int | customer_id_smallint | customer_id_bigint |   customer_id_decimal    | customer_id_real | customer_id_double | open_date  |     open_timestamp      | last_opened_time |                                                           
payload_bytes\n--------+-----------------+----------------+------------+-------------------------+-----------------+----------------------+--------------------+--------------------------+------------------+--------------------+------------+-------------------------+------------------+------------------------------------------------------------------------------------------------------------------------------------\n 100414 | 5uROOVOUyQUbubN | h3H            | SfA0eZJcTm | CuRw                    |              13 |                   42 |               6041 | 76987.745612542900000000 |         91866.78 |  66400.37433202339 | 2023-03-05 | 2023-08-14 11:33:11.343 | 23:58:01.736     | \\x604d315d4547616e6a233050415373317274736f5e682d516132524f3d23233c37463463322f342d34376d597e665d6b3d395b4238284028622b7d6d2b4f5042\n(1 row)\n</code></pre> <p>The data that gets inserted will follow the foreign keys that are defined within Postgres and also ensure the insertion order is correct.</p> <p>Check the number of records via:</p> <pre><code>docker exec docker-postgresserver-1 psql -Upostgres -d customer -c 'select count(1) from account.accounts'\n#open report under docker/sample/report/index.html\n</code></pre>"},{"location":"setup/guide/scenario/delete-generated-data/#delete","title":"Delete","text":"<p>We are now at a stage where we want to delete the data that was generated. All we need to do is flip two flags.</p> <pre><code>.enableDeleteGeneratedRecords(true)\n.enableGenerateData(false)  //we need to explicitly disable generating data\n</code></pre> <p>Enable delete generated records and disable generating data. 
</p> <p>Before we run again, let us insert a record manually to see if that data will survive after running the job to delete the generated data.</p> <pre><code>docker exec docker-postgresserver-1 psql -Upostgres -d customer -c \"insert into account.accounts (account_number) values ('my_account_number')\"\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c \"select count(1) from account.accounts\"\n</code></pre> <p>We now should have 1001 records in our <code>account.accounts</code> table. Let's delete the generated data now.</p> <pre><code>./run.sh\n#input class MyAdvancedDeleteJavaPlanRun or MyAdvancedDeletePlanRun\n#after completing\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c 'select * from account.accounts limit 1'\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c 'select count(1) from account.accounts'\n</code></pre> <p>You should see that only 1 record is left, the one that we manually inserted. Great, now we can generate data reliably  and also be able to clean it up.</p>"},{"location":"setup/guide/scenario/delete-generated-data/#additional-topics","title":"Additional Topics","text":""},{"location":"setup/guide/scenario/delete-generated-data/#one-class-for-generating-another-for-deleting","title":"One class for generating, another for deleting?","text":"<p>Yes, this is possible. 
There are two requirements:</p> <ul> <li>the connection names used need to be the same across both classes</li> <li><code>recordTrackingFolderPath</code> needs to be set to the same value</li> </ul>"},{"location":"setup/guide/scenario/delete-generated-data/#define-record-count","title":"Define record count","text":"<p>You can control the record count per sub data source via <code>numRecordsPerStep</code>.</p> JavaScala <pre><code>var autoRun = configuration()\n...\n.numRecordsPerStep(100);\n\nexecute(autoRun);\n</code></pre> <pre><code>val autoRun = configuration\n...\n.numRecordsPerStep(100)\n\nexecute(configuration = autoRun)\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/","title":"First Data Generation","text":"<p>Creating a data generator for a CSV file.</p> <p></p>"},{"location":"setup/guide/scenario/first-data-generation/#requirements","title":"Requirements","text":"<ul> <li>20 minutes</li> <li>Git</li> <li>Gradle</li> <li>Docker</li> </ul>"},{"location":"setup/guide/scenario/first-data-generation/#get-started","title":"Get Started","text":"<p>First, we will clone the data-caterer-example repo which will already have the base project setup required.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#plan-setup","title":"Plan Setup","text":"<p>Create a new Java or Scala class.</p> <ul> <li>Java: <code>src/main/java/com/github/pflooky/plan/MyCsvJavaPlan.java</code></li> <li>Scala: <code>src/main/scala/com/github/pflooky/plan/MyCsvPlan.scala</code></li> </ul> <p>Make sure your class extends <code>PlanRun</code>.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.java.api.PlanRun;\n\npublic class MyCsvJavaPlan extends PlanRun {\n}\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.PlanRun\n\nclass MyCsvPlan extends PlanRun {\n}\n</code></pre> <p>This class is where we define all of our configurations for generating data. 
There are helper variables and methods defined to make it simple and easy to use.</p>"},{"location":"setup/guide/scenario/first-data-generation/#connection-configuration","title":"Connection Configuration","text":"<p>When dealing with CSV files, we need to define a path for our generated CSV files to be saved at, along with any other high level configurations.</p> JavaScala <pre><code>csv(\n\"customer_accounts\",              //name\n\"/opt/app/data/customer/account\", //path\nMap.of(\"header\", \"true\")          //optional additional options\n)\n</code></pre> <p>Other additional options for CSV can be found here</p> <pre><code>csv(\n\"customer_accounts\",              //name\n\"/opt/app/data/customer/account\", //path\nMap(\"header\" -&gt; \"true\")           //optional additional options\n)\n</code></pre> <p>Other additional options for CSV can be found here</p>"},{"location":"setup/guide/scenario/first-data-generation/#schema","title":"Schema","text":"<p>Our CSV file that we generate should adhere to a defined schema where we can also define data types.</p> <p>Let's define each field along with their corresponding data type. You will notice that the <code>string</code> fields do not have a data type defined. 
This is because the default data type is <code>StringType</code>.</p> JavaScala <pre><code>var accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"balance\").type(DoubleType.instance()),\nfield().name(\"created_by\"),\nfield().name(\"name\"),\nfield().name(\"open_time\").type(TimestampType.instance()),\nfield().name(\"status\")\n);\n</code></pre> <pre><code>val accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map(\"header\" -&gt; \"true\"))\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"balance\").`type`(DoubleType),\nfield.name(\"created_by\"),\nfield.name(\"name\"),\nfield.name(\"open_time\").`type`(TimestampType),\nfield.name(\"status\")\n)\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#field-metadata","title":"Field Metadata","text":"<p>We could stop here and generate random data for the accounts table. But wouldn't it be more useful if we produced data that is closer to the structure of the data that would come in production? We can do this by defining various metadata attributes that give the data generator guidelines to follow when generating data.</p>"},{"location":"setup/guide/scenario/first-data-generation/#account_id","title":"account_id","text":"<ul> <li><code>account_id</code> follows a particular pattern where it starts with <code>ACC</code> and has 8 digits after it. This can be defined via a regex like below. 
Alongside, we also mark the values as unique to ensure that unique values are generated.</li> </ul> JavaScala <pre><code>field().name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\n</code></pre> <pre><code>field.name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#balance","title":"balance","text":"<ul> <li>For <code>balance</code>, let's keep the numbers from getting too large by defining a min and max so that the generated numbers are between <code>1</code> and <code>1000</code>.</li> </ul> JavaScala <pre><code>field().name(\"balance\").type(DoubleType.instance()).min(1).max(1000),\n</code></pre> <pre><code>field.name(\"balance\").`type`(DoubleType).min(1).max(1000),\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#name","title":"name","text":"<ul> <li><code>name</code> is a string that also follows a certain pattern, so we could also define a regex, but here we will choose to leverage the DataFaker library and create an <code>expression</code> to generate real-looking names. All possible faker expressions can be found here</li> </ul> JavaScala <pre><code>field().name(\"name\").expression(\"#{Name.name}\"),\n</code></pre> <pre><code>field.name(\"name\").expression(\"#{Name.name}\"),\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#open_time","title":"open_time","text":"<ul> <li><code>open_time</code> is a timestamp that we want to have a value greater than a specific date. 
We can define a min date by using <code>java.sql.Date</code> like below.</li> </ul> JavaScala <pre><code>field().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\n</code></pre> <pre><code>field.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#status","title":"status","text":"<ul> <li><code>status</code> is a field that can only take one of four values, <code>open, closed, suspended or pending</code>.</li> </ul> JavaScala <pre><code>field().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n</code></pre> <pre><code>field.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#created_by","title":"created_by","text":"<ul> <li><code>created_by</code> is a field that is based on the <code>status</code> field where it follows the logic: <code>if status is open or closed, then   it is created_by eod else created_by event</code>. 
This can be achieved by defining a SQL expression like below.</li> </ul> JavaScala <pre><code>field().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\n</code></pre> <pre><code>field.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\n</code></pre> <p>Putting all the fields together, our class should now look like this.</p> JavaScala <pre><code>var accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\nfield().name(\"balance\").type(DoubleType.instance()).min(1).max(1000),\nfield().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n);\n</code></pre> <pre><code>val accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map(\"header\" -&gt; \"true\"))\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\nfield.name(\"balance\").`type`(DoubleType).min(1).max(1000),\nfield.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n)\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#record-count","title":"Record Count","text":"<p>We only want to generate 100 records, so that we can see what the output looks like. This is controlled at the <code>accountTask</code> level like below. 
If you want to generate more records, set it to the value you want.</p> JavaScala <pre><code>var accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map.of(\"header\", \"true\"))\n.schema(\n...\n)\n.count(count().records(100));\n</code></pre> <pre><code>val accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map(\"header\" -&gt; \"true\"))\n.schema(\n...\n)\n.count(count.records(100))\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#additional-configurations","title":"Additional Configurations","text":"<p>At the end of data generation, a report gets generated that summarises the actions it performed. We can control the output folder of that report via configurations. We will also enable the unique check to ensure any unique fields will have unique values generated.</p> JavaScala <pre><code>var config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true);\n</code></pre> <pre><code>val config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true)\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#execute","title":"Execute","text":"<p>To tell Data Caterer that we want to run with the configurations along with the <code>accountTask</code>, we have to call <code>execute</code> . 
So our full plan run will look like this.</p> JavaScala <pre><code>public class MyCsvJavaPlan extends PlanRun {\n{\nvar accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\nfield().name(\"balance\").type(DoubleType.instance()).min(1).max(1000),\nfield().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n);\n\nvar config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true);\n\nexecute(config, accountTask);\n}\n}\n</code></pre> <pre><code>class MyCsvPlan extends PlanRun {\n\nval accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map(\"header\" -&gt; \"true\"))\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\nfield.name(\"balance\").`type`(DoubleType).min(1).max(1000),\nfield.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n)\nval config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true)\n\nexecute(config, accountTask)\n}\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#run","title":"Run","text":"<p>Now we can run via the script <code>./run.sh</code> that is in the top level directory of the <code>data-caterer-example</code> to run the class we just created.</p> <pre><code>./run.sh\n#input class MyCsvJavaPlan or 
MyCsvPlan\n#after completing\nhead docker/sample/customer/account/part-00000*\n</code></pre> <p>Your output should look like this.</p> <pre><code>account_id,balance,created_by,name,open_time,status\nACC06192462,853.9843359645766,eod,Hoyt Kertzmann MD,2023-07-22T11:17:01.713Z,closed\nACC15350419,632.5969895326234,eod,Dr. Claude White,2022-12-13T21:57:56.840Z,open\nACC25134369,592.0958847218986,eod,Fabian Rolfson,2023-04-26T04:54:41.068Z,open\nACC48021786,656.6413439322964,eod,Dewayne Stroman,2023-05-17T06:31:27.603Z,open\nACC26705211,447.2850352884595,event,Garrett Funk,2023-07-14T03:50:22.746Z,pending\nACC03150585,750.4568929015996,event,Natisha Reichel,2023-04-11T11:13:10.080Z,suspended\nACC29834210,686.4257811608622,event,Gisele Ondricka,2022-11-15T22:09:41.172Z,suspended\nACC39373863,583.5110618128994,event,Thaddeus Ortiz,2022-09-30T06:33:57.193Z,suspended\nACC39405798,989.2623959059525,eod,Shelby Reinger,2022-10-23T17:29:17.564Z,open\n</code></pre> <p>Also check the HTML report, found at <code>docker/sample/report/index.html</code>, that gets generated to get an overview of what was executed.</p> <p></p>"},{"location":"setup/guide/scenario/first-data-generation/#join-with-another-csv","title":"Join With Another CSV","text":"<p>Now that we have generated some accounts, let's also try to generate a set of transactions for those accounts in CSV format as well. 
The transactions could be in any other format, but to keep this simple, we will continue using CSV.</p> <p>We can define our schema the same way along with any additional metadata.</p> JavaScala <pre><code>var transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"full_name\"),\nfield().name(\"amount\").type(DoubleType.instance()).min(1).max(100),\nfield().name(\"time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"date\").type(DateType.instance()).sql(\"DATE(time)\")\n);\n</code></pre> <pre><code>val transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -&gt; \"true\"))\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"full_name\"),\nfield.name(\"amount\").`type`(DoubleType).min(1).max(100),\nfield.name(\"time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"date\").`type`(DateType).sql(\"DATE(time)\")\n)\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#records-per-column","title":"Records Per Column","text":"<p>Usually, for a given <code>account_id, full_name</code>, there should be multiple records, as we want to simulate a customer having multiple transactions. 
We can achieve this by defining the number of records to generate in the <code>count</code> function.</p> JavaScala <pre><code>var transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\n...\n)\n.count(count().recordsPerColumn(5, \"account_id\", \"full_name\"));\n</code></pre> <pre><code>val transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -&gt; \"true\"))\n.schema(\n...\n)\n.count(count.recordsPerColumn(5, \"account_id\", \"full_name\"))\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#random-records-per-column","title":"Random Records Per Column","text":"<p>Above, you will notice that we are generating 5 records per <code>account_id, full_name</code>. This is okay but still not quite reflective of the real world. Sometimes, people have accounts with no transactions in them, or they could have many. We can accommodate this by defining a random number of records per column.</p> JavaScala <pre><code>var transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\n...\n)\n.count(count().recordsPerColumnGenerator(generator().min(0).max(5), \"account_id\", \"full_name\"));\n</code></pre> <pre><code>val transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -&gt; \"true\"))\n.schema(\n...\n)\n.count(count.recordsPerColumnGenerator(generator.min(0).max(5), \"account_id\", \"full_name\"))\n</code></pre> <p>Here we set the minimum number of records per column to be 0 and the maximum to 5.</p>"},{"location":"setup/guide/scenario/first-data-generation/#foreign-key","title":"Foreign Key","text":"<p>In this scenario, we want the <code>account_id</code> in <code>account</code> to match the same column values in <code>transaction</code>. 
We also want to match <code>name</code> in <code>account</code> to <code>full_name</code> in <code>transaction</code>. This can be done via plan configuration like below.</p> JavaScala <pre><code>var myPlan = plan().addForeignKeyRelationship(\naccountTask, List.of(\"account_id\", \"name\"), //the task and columns we want linked\nList.of(Map.entry(transactionTask, List.of(\"account_id\", \"full_name\"))) //list of other tasks and their respective column names we want matched\n);\n</code></pre> <pre><code>val myPlan = plan.addForeignKeyRelationship(\naccountTask, List(\"account_id\", \"name\"),  //the task and columns we want linked\nList(transactionTask -&gt; List(\"account_id\", \"full_name\"))  //list of other tasks and their respective column names we want matched\n)\n</code></pre> <p>Now, stitching it all together for the <code>execute</code> function, our final plan should look like this.</p> JavaScala <pre><code>public class MyCsvJavaPlan extends PlanRun {\n{\nvar accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\nfield().name(\"balance\").type(DoubleType.instance()).min(1).max(1000),\nfield().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n)\n.count(count().records(100));\n\nvar transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", 
\"true\"))\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"name\"),\nfield().name(\"amount\").type(DoubleType.instance()).min(1).max(100),\nfield().name(\"time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"date\").type(DateType.instance()).sql(\"DATE(time)\")\n)\n.count(count().recordsPerColumnGenerator(generator().min(0).max(5), \"account_id\", \"full_name\"));\n\nvar config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true);\n\nvar myPlan = plan().addForeignKeyRelationship(\naccountTask, List.of(\"account_id\", \"name\"),\nList.of(Map.entry(transactionTask, List.of(\"account_id\", \"full_name\")))\n);\n\nexecute(myPlan, config, accountTask, transactionTask);\n}\n}\n</code></pre> <pre><code>class MyCsvPlan extends PlanRun {\n\nval accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map(\"header\" -&gt; \"true\"))\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\nfield.name(\"balance\").`type`(DoubleType).min(1).max(1000),\nfield.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n)\n.count(count.records(100))\n\nval transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -&gt; \"true\"))\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"name\"),\nfield.name(\"amount\").`type`(DoubleType).min(1).max(100),\nfield.name(\"time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"date\").`type`(DateType).sql(\"DATE(time)\")\n)\n.count(count.recordsPerColumnGenerator(generator.min(0).max(5), \"account_id\", \"full_name\"))\n\nval config = 
configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true)\n\nval myPlan = plan.addForeignKeyRelationship(\naccountTask, List(\"account_id\", \"name\"),\nList(transactionTask -&gt; List(\"account_id\", \"full_name\"))\n)\n\nexecute(myPlan, config, accountTask, transactionTask)\n}\n</code></pre> <p>Let's try running it again.</p> <pre><code>#clean up old data\nrm -rf docker/sample/customer/account\n./run.sh\n#input class MyCsvJavaPlan or MyCsvPlan\n#after completing, let's pick an account and check the transactions for that account\naccount=$(tail -1 docker/sample/customer/account/part-00000* | awk -F \",\" '{print $1 \",\" $4}')\necho \"$account\"\ncat docker/sample/customer/transaction/part-00000* | grep \"$account\"\n</code></pre> <p>It should look something like this.</p> <pre><code>ACC29117767,Willodean Sauer\nACC29117767,Willodean Sauer,84.99145871948083,2023-05-14T09:55:51.439Z,2023-05-14\nACC29117767,Willodean Sauer,58.89345733567232,2022-11-22T07:38:20.143Z,2022-11-22\n</code></pre> <p>Congratulations! You have now made a data generator that has simulated a real world data scenario. You can also compare your plan against the <code>DocumentationJavaPlanRun.java</code> or <code>DocumentationPlanRun.scala</code> files to check that it is the same.</p> <p>We can now look to consume this CSV data from a job or service. Usually, once we have consumed the data, we would also want to check and validate that our consumer has correctly ingested the data.</p>"},{"location":"setup/guide/scenario/first-data-generation/#validate","title":"Validate","text":"<p>In this scenario, our consumer will read in the CSV file, do some transformations, and then save the data to Postgres. Let's try to configure data validations for the data that gets pushed into Postgres.</p>"},{"location":"setup/guide/scenario/first-data-generation/#postgres-setup","title":"Postgres setup","text":"<p>First, we define our connection properties for Postgres. 
You can check out the full options available here.</p> JavaScala <pre><code>var postgresValidateTask = postgres(\n\"my_postgres\",                                          //connection name\n\"jdbc:postgresql://host.docker.internal:5432/customer\", //url\n\"postgres\",                                             //username\n\"password\"                                              //password\n).table(\"account\", \"transactions\");\n</code></pre> <pre><code>val postgresValidateTask = postgres(\n\"my_postgres\",                                          //connection name\n\"jdbc:postgresql://host.docker.internal:5432/customer\", //url\n\"postgres\",                                             //username\n\"password\"                                              //password\n).table(\"account\", \"transactions\")\n</code></pre> <p>We can connect and access the data inside the table <code>account.transactions</code>. Now to define our data validations.</p>"},{"location":"setup/guide/scenario/first-data-generation/#validations","title":"Validations","text":"<p>For full information about validation options and configurations, check here. 
Below, we have an example that should give you a good understanding of what validations are possible.</p> JavaScala <pre><code>var postgresValidateTask = postgres(...)\n.table(\"account\", \"transactions\")\n.validations(\nvalidation().col(\"account_id\").isNotNull(),\nvalidation().col(\"name\").matches(\"[A-Z][a-z]+ [A-Z][a-z]+\").errorThreshold(0.2).description(\"Some names have different formats\"),\nvalidation().col(\"balance\").greaterThanOrEqual(0).errorThreshold(10).description(\"Account can have negative balance if overdraft\"),\nvalidation().expr(\"CASE WHEN status == 'closed' THEN isNotNull(close_date) ELSE isNull(close_date) END\"),\nvalidation().unique(\"account_id\", \"name\"),\nvalidation().groupBy(\"account_id\", \"name\").max(\"login_retry\").lessThan(10)\n);\n</code></pre> <pre><code>val postgresValidateTask = postgres(...)\n.table(\"account\", \"transactions\")\n.validations(\nvalidation.col(\"account_id\").isNotNull,\nvalidation.col(\"name\").matches(\"[A-Z][a-z]+ [A-Z][a-z]+\").errorThreshold(0.2).description(\"Some names have different formats\"),\nvalidation.col(\"balance\").greaterThanOrEqual(0).errorThreshold(10).description(\"Account can have negative balance if overdraft\"),\nvalidation.expr(\"CASE WHEN status == 'closed' THEN isNotNull(close_date) ELSE isNull(close_date) END\"),\nvalidation.unique(\"account_id\", \"name\"),\nvalidation.groupBy(\"account_id\", \"name\").max(\"login_retry\").lessThan(10)\n)\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#name_1","title":"name","text":"<p>For all values in the <code>name</code> column, we check if they match the regex <code>[A-Z][a-z]+ [A-Z][a-z]+</code>. As we know in the real world, names do not always follow the same pattern, so we allow for an <code>errorThreshold</code> before marking the validation as failed. 
Here, we define the <code>errorThreshold</code> to be <code>0.2</code>, which means that if the error percentage is greater than 20%, the validation fails. We also add a helpful description so other developers/users can understand the context of the validation.</p>"},{"location":"setup/guide/scenario/first-data-generation/#balance_1","title":"balance","text":"<p>We check that all <code>balance</code> values are greater than or equal to 0. This time, we have a slightly different <code>errorThreshold</code> as it is set to <code>10</code>, which means that if the number of errors is greater than 10, the validation fails.</p>"},{"location":"setup/guide/scenario/first-data-generation/#expr","title":"expr","text":"<p>Sometimes, we may need to include the values of multiple columns to validate a certain condition. This is where we can use <code>expr</code> to define a SQL expression that returns a boolean. In this scenario, we check that if the <code>status</code> column has the value <code>closed</code>, then <code>close_date</code> is not null; otherwise, <code>close_date</code> is null.</p>"},{"location":"setup/guide/scenario/first-data-generation/#unique","title":"unique","text":"<p>We check whether the combination of <code>account_id</code> and <code>name</code> is unique within the dataset. You can define one or more columns for <code>unique</code> validations.</p>"},{"location":"setup/guide/scenario/first-data-generation/#groupby","title":"groupBy","text":"<p>There may be some business rule that states the number of <code>login_retry</code> should be less than 10 for each account. We can check this via a group by validation where we group by the <code>account_id, name</code>, take the maximum value for <code>login_retry</code> per <code>account_id,name</code> combination, then check if it is less than 10.</p> <p>You can now look to play around with other configurations or data sources to meet your needs. 
Also, make sure to explore the docs further as they can guide you on what can be configured.</p>"},{"location":"setup/guide/scenario/records-per-column/","title":"Multiple Records Per Column","text":"<p>Creating a data generator for a CSV file where there are multiple records per set of column values.</p>"},{"location":"setup/guide/scenario/records-per-column/#requirements","title":"Requirements","text":"<ul> <li>5 minutes</li> <li>Git</li> <li>Gradle</li> <li>Docker</li> </ul>"},{"location":"setup/guide/scenario/records-per-column/#get-started","title":"Get Started","text":"<p>First, we will clone the data-caterer-example repo, which already has the required base project setup.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\n</code></pre>"},{"location":"setup/guide/scenario/records-per-column/#plan-setup","title":"Plan Setup","text":"<p>Create a new Java or Scala class.</p> <ul> <li>Java: <code>src/main/java/com/github/pflooky/plan/MyMultipleRecordsPerColJavaPlan.java</code></li> <li>Scala: <code>src/main/scala/com/github/pflooky/plan/MyMultipleRecordsPerColPlan.scala</code></li> </ul> <p>Make sure your class extends <code>PlanRun</code>.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.java.api.PlanRun;\n...\n\npublic class MyMultipleRecordsPerColJavaPlan extends PlanRun {\n{\nvar transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"full_name\"),\nfield().name(\"amount\").type(DoubleType.instance()).min(1).max(100),\nfield().name(\"time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"date\").type(DateType.instance()).sql(\"DATE(time)\")\n);\n\nvar config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true);\n\nexecute(config, transactionTask);\n}\n}\n</code></pre> <pre><code>import 
com.github.pflooky.datacaterer.api.PlanRun\n...\n\nclass MyMultipleRecordsPerColPlan extends PlanRun {\n\nval transactionTask: ConnectionTaskBuilder[FileBuilder] = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -&gt; \"true\"))\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\"), field.name(\"full_name\").expression(\"#{Name.name}\"), field.name(\"amount\").`type`(DoubleType.instance).min(1).max(100),\nfield.name(\"time\").`type`(TimestampType.instance).min(java.sql.Date.valueOf(\"2022-01-01\")), field.name(\"date\").`type`(DateType.instance).sql(\"DATE(time)\")\n)\n\nval config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n\nexecute(config, transactionTask)\n}\n</code></pre>"},{"location":"setup/guide/scenario/records-per-column/#record-count","title":"Record Count","text":"<p>By default, tasks will generate 1000 records. You can alter this value via the <code>count</code> configuration which can be applied to individual tasks. For example, in Scala, <code>csv(...).count(count.records(100))</code> to generate only 100 records.</p>"},{"location":"setup/guide/scenario/records-per-column/#records-per-column","title":"Records Per Column","text":"<p>In this scenario, for a given <code>account_id, full_name</code>, there should be multiple records for it as we want to simulate a customer having multiple transactions. 
We can achieve this by defining the number of records to generate in the <code>count</code> function.</p> JavaScala <pre><code>var transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\n...\n)\n.count(count().recordsPerColumn(5, \"account_id\", \"full_name\"));\n</code></pre> <pre><code>val transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -&gt; \"true\"))\n.schema(\n...\n)\n.count(count.recordsPerColumn(5, \"account_id\", \"full_name\"))\n</code></pre> <p>This will generate <code>1000 * 5 = 5000</code> records: the default number of records (1000) applies, and for each <code>account_id, full_name</code> from those initial 1000 records, 5 records will be generated.</p>"},{"location":"setup/guide/scenario/records-per-column/#random-records-per-column","title":"Random Records Per Column","text":"<p>Generating 5 records per column is okay but still not quite reflective of the real world. Sometimes, people have accounts with no transactions in them, or they could have many. We can accommodate this by defining a random number of records per column.</p> JavaScala <pre><code>var transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\n...\n)\n.count(count().recordsPerColumnGenerator(generator().min(0).max(5), \"account_id\", \"full_name\"));\n</code></pre> <pre><code>val transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -&gt; \"true\"))\n.schema(\n...\n)\n.count(count.recordsPerColumnGenerator(generator.min(0).max(5), \"account_id\", \"full_name\"))\n</code></pre> <p>Here we set the minimum number of records per column to be 0 and the maximum to 5. This will follow a uniform distribution so the average number of records per account is 2.5. 
We could also define other metadata, just like we did with fields, when defining the generator. For example, we could set <code>standardDeviation</code> and <code>mean</code> for the number of records generated per column to follow a normal distribution.</p>"},{"location":"setup/guide/scenario/records-per-column/#run","title":"Run","text":"<p>Let's try running it.</p> <pre><code>#clean up old data\nrm -rf docker/sample/customer/account\n./run.sh\n#input class MyMultipleRecordsPerColJavaPlan or MyMultipleRecordsPerColPlan\n#after completing\nhead docker/sample/customer/transaction/part-00000*\n</code></pre> <p>It should look something like this.</p> <pre><code>ACC29117767,Willodean Sauer\nACC29117767,Willodean Sauer,84.99145871948083,2023-05-14T09:55:51.439Z,2023-05-14\nACC29117767,Willodean Sauer,58.89345733567232,2022-11-22T07:38:20.143Z,2022-11-22\n</code></pre> <p>You can now look to play around with other count configurations found here.</p>"},{"location":"setup/validation/basic-validation/","title":"Basic Validations","text":"<p>Run validations on a column to ensure the values adhere to your requirements. You can also define complex validation logic via a SQL expression if needed (see here).</p>"},{"location":"setup/validation/basic-validation/#equal","title":"Equal","text":"<p>Ensure all data in the column is equal to a certain value. The value can be of any data type.</p> JavaScalaYAML <pre><code>validation().col(\"year\").isEqual(2021)\n</code></pre> <pre><code>validation.col(\"year\").isEqual(2021)\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"year == 2021\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#not-equal","title":"Not Equal","text":"<p>Ensure all data in the column is not equal to a certain value. 
Value can be of any data type.</p> JavaScalaYAML <pre><code>validation().col(\"year\").isNotEqual(2021)\n</code></pre> <pre><code>validation.col(\"year\").isNotEqual(2021)\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"year != 2021\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#null","title":"Null","text":"<p>Ensure all data in column is null.</p> JavaScalaYAML <pre><code>validation().col(\"year\").isNull()\n</code></pre> <pre><code>validation.col(\"year\").isNull\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"ISNULL(year)\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#not-null","title":"Not Null","text":"<p>Ensure all data in column is not null.</p> JavaScalaYAML <pre><code>validation().col(\"year\").isNotNull()\n</code></pre> <pre><code>validation.col(\"year\").isNotNull\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"ISNOTNULL(year)\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#contains","title":"Contains","text":"<p>Ensure all data in column contains a certain string. Column has to have type string.</p> JavaScalaYAML <pre><code>validation().col(\"name\").contains(\"peter\")\n</code></pre> <pre><code>validation.col(\"name\").contains(\"peter\")\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"CONTAINS(name, 'peter')\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#not-contains","title":"Not Contains","text":"<p>Ensure all data in column does not contain a certain string. 
Column has to have type string.</p> JavaScalaYAML <pre><code>validation().col(\"name\").notContains(\"peter\")\n</code></pre> <pre><code>validation.col(\"name\").notContains(\"peter\")\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"!CONTAINS(name, 'peter')\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#unqiue","title":"Unique","text":"<p>Ensure all data in column is unique.</p> JavaScalaYAML <pre><code>validation().unique(\"account_id\", \"name\")\n</code></pre> <pre><code>validation.unique(\"account_id\", \"name\")\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- unique: [\"account_id\", \"name\"]\n</code></pre>"},{"location":"setup/validation/basic-validation/#less-than","title":"Less Than","text":"<p>Ensure all data in column is less than certain value.</p> JavaScalaYAML <pre><code>validation().col(\"amount\").lessThan(100)\n</code></pre> <pre><code>validation.col(\"amount\").lessThan(100)\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"amount &lt; 100\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#less-than-or-equal","title":"Less Than Or Equal","text":"<p>Ensure all data in column is less than or equal to certain value.</p> JavaScalaYAML <pre><code>validation().col(\"amount\").lessThanOrEqual(100)\n</code></pre> <pre><code>validation.col(\"amount\").lessThanOrEqual(100)\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"amount &lt;= 100\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#greater-than","title":"Greater Than","text":"<p>Ensure all data in column is greater than certain value.</p> JavaScalaYAML <pre><code>validation().col(\"amount\").greaterThan(100)\n</code></pre> <pre><code>validation.col(\"amount\").greaterThan(100)\n</code></pre> <pre><code>---\nname: 
\"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"amount &gt; 100\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#greater-than-or-equal","title":"Greater Than Or Equal","text":"<p>Ensure all data in column is greater than or equal to certain value.</p> JavaScalaYAML <pre><code>validation().col(\"amount\").greaterThanOrEqual(100)\n</code></pre> <pre><code>validation.col(\"amount\").greaterThanOrEqual(100)\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"amount &gt;= 100\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#between","title":"Between","text":"<p>Ensure all data in column is between two values.</p> JavaScalaYAML <pre><code>validation().col(\"amount\").between(100, 200)\n</code></pre> <pre><code>validation.col(\"amount\").between(100, 200)\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"amount BETWEEN 100 AND 200\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#not-between","title":"Not Between","text":"<p>Ensure all data in column is not between two values.</p> JavaScalaYAML <pre><code>validation().col(\"amount\").notBetween(100, 200)\n</code></pre> <pre><code>validation.col(\"amount\").notBetween(100, 200)\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"amount NOT BETWEEN 100 AND 200\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#in","title":"In","text":"<p>Ensure all data in column is in set of defined values.</p> JavaScalaYAML <pre><code>validation().col(\"status\").in(\"open\", \"closed\")\n</code></pre> <pre><code>validation.col(\"status\").in(\"open\", \"closed\")\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"status IN ('open', 'closed')\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#matches","title":"Matches","text":"<p>Ensure all data in 
column matches certain regex expression.</p> JavaScalaYAML <pre><code>validation().col(\"account_id\").matches(\"ACC[0-9]{8}\")\n</code></pre> <pre><code>validation.col(\"account_id\").matches(\"ACC[0-9]{8}\")\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"REGEXP(account_id, 'ACC[0-9]{8}')\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#not-matches","title":"Not Matches","text":"<p>Ensure all data in column does not match certain regex expression.</p> JavaScalaYAML <pre><code>validation().col(\"account_id\").notMatches(\"^acc.*\")\n</code></pre> <pre><code>validation.col(\"account_id\").notMatches(\"^acc.*\")\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"!REGEXP(account_id, '^acc.*')\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#starts-with","title":"Starts With","text":"<p>Ensure all data in column starts with certain string. Column has to have type string.</p> JavaScalaYAML <pre><code>validation().col(\"account_id\").startsWith(\"ACC\")\n</code></pre> <pre><code>validation.col(\"account_id\").startsWith(\"ACC\")\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"STARTSWITH(account_id, 'ACC')\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#not-starts-with","title":"Not Starts With","text":"<p>Ensure all data in column does not start with certain string. Column has to have type string.</p> JavaScalaYAML <pre><code>validation().col(\"account_id\").notStartsWith(\"ACC\")\n</code></pre> <pre><code>validation.col(\"account_id\").notStartsWith(\"ACC\")\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"!STARTSWITH(account_id, 'ACC')\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#ends-with","title":"Ends With","text":"<p>Ensure all data in column ends with certain string. 
Column has to have type string.</p> JavaScalaYAML <pre><code>validation().col(\"account_id\").endsWith(\"ACC\")\n</code></pre> <pre><code>validation.col(\"account_id\").endsWith(\"ACC\")\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"ENDSWITH(account_id, 'ACC')\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#not-ends-with","title":"Not Ends With","text":"<p>Ensure all data in column does not end with certain string. Column has to have type string.</p> JavaScalaYAML <pre><code>validation().col(\"account_id\").notEndsWith(\"ACC\")\n</code></pre> <pre><code>validation.col(\"account_id\").notEndsWith(\"ACC\")\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"!ENDSWITH(account_id, 'ACC')\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#size","title":"Size","text":"<p>Ensure all data in column has certain size. Column has to have type array or map.</p> JavaScalaYAML <pre><code>validation().col(\"transactions\").size(5)\n</code></pre> <pre><code>validation.col(\"transactions\").size(5)\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"SIZE(transactions) == 5\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#not-size","title":"Not Size","text":"<p>Ensure all data in column does not have certain size. Column has to have type array or map.</p> JavaScalaYAML <pre><code>validation().col(\"transactions\").notSize(5)\n</code></pre> <pre><code>validation.col(\"transactions\").notSize(5)\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"SIZE(transactions) != 5\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#less-than-size","title":"Less Than Size","text":"<p>Ensure all data in column has size less than certain value. 
Column has to have type array or map.</p> JavaScalaYAML <pre><code>validation().col(\"transactions\").lessThanSize(5)\n</code></pre> <pre><code>validation.col(\"transactions\").lessThanSize(5)\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"SIZE(transactions) &lt; 5\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#less-than-or-equal-size","title":"Less Than Or Equal Size","text":"<p>Ensure all data in column has size less than or equal to certain value. Column has to have type array or map.</p> JavaScalaYAML <pre><code>validation().col(\"transactions\").lessThanOrEqualSize(5)\n</code></pre> <pre><code>validation.col(\"transactions\").lessThanOrEqualSize(5)\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"SIZE(transactions) &lt;= 5\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#greater-than-size","title":"Greater Than Size","text":"<p>Ensure all data in column has size greater than certain value. Column has to have type array or map.</p> JavaScalaYAML <pre><code>validation().col(\"transactions\").greaterThanSize(5)\n</code></pre> <pre><code>validation.col(\"transactions\").greaterThanSize(5)\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"SIZE(transactions) &gt; 5\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#greater-than-or-equal-size","title":"Greater Than Or Equal Size","text":"<p>Ensure all data in column has size greater than or equal to certain value. 
Column has to have type array or map.</p> JavaScalaYAML <pre><code>validation().col(\"transactions\").greaterThanOrEqualSize(5)\n</code></pre> <pre><code>validation.col(\"transactions\").greaterThanOrEqualSize(5)\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"SIZE(transactions) &gt;= 5\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#luhn-check","title":"Luhn Check","text":"<p>Ensure all data in column passes luhn check. Luhn check is used to validate credit card numbers and certain identification numbers (see here for more details).</p> JavaScalaYAML <pre><code>validation().col(\"credit_card\").luhnCheck()\n</code></pre> <pre><code>validation.col(\"credit_card\").luhnCheck\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"LUHN_CHECK(credit_card)\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#has-type","title":"Has Type","text":"<p>Ensure all data in column has certain data type.</p> JavaScalaYAML <pre><code>validation().col(\"id\").hasType(\"string\")\n</code></pre> <pre><code>validation.col(\"id\").hasType(\"string\")\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"TYPEOF(id) == 'string'\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#expression","title":"Expression","text":"<p>Ensure all data in column adheres to SQL expression defined that returns back a boolean. 
You can define complex logic in here that could combine multiple columns.</p> <p>For example, <code>CASE WHEN status == 'open' THEN balance &gt; 0 ELSE balance == 0 END</code> would check all rows with <code>status</code> open to have <code>balance</code> greater than 0, otherwise, check the <code>balance</code> is 0.</p> JavaScalaYAML <pre><code>var csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validations(\nvalidation().expr(\"amount &lt; 100\"),\nvalidation().expr(\"year == 2021\").errorThreshold(0.1),  //equivalent to if error percentage is &gt; 10%, then fail\nvalidation().expr(\"REGEXP_LIKE(name, 'Peter .*')\").errorThreshold(200)  //equivalent to if number of errors is &gt; 200, then fail\n);\n\nvar conf = configuration().enableValidation(true);\n</code></pre> <pre><code>val csvTxns = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -&gt; \"true\"))\n.validations(\nvalidation.expr(\"amount &lt; 100\"),\nvalidation.expr(\"year == 2021\").errorThreshold(0.1),  //equivalent to if error percentage is &gt; 10%, then fail\nvalidation.expr(\"REGEXP_LIKE(name, 'Peter .*')\").errorThreshold(200)  //equivalent to if number of errors is &gt; 200, then fail\n)\n\nval conf = configuration.enableValidation(true)\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nvalidations:\n- expr: \"amount &lt; 100\"\n- expr: \"year == 2021\"\nerrorThreshold: 0.1   #equivalent to if error percentage is &gt; 10%, then fail\n- expr: \"REGEXP_LIKE(name, 'Peter .*')\"\nerrorThreshold: 200   #equivalent to if number of errors is &gt; 200, then fail\ndescription: \"Should be lots of Peters\"\n\n#enableValidation inside application.conf\n</code></pre>"},{"location":"setup/validation/group-by-validation/","title":"Group By Validation","text":"<p>If you want to run aggregations based on a particular set of columns, you can do so via group by validations. 
An example would be checking that the sum of <code>amount</code> is less than 1000 per <code>account_id, year</code>. The validations applied can be one of the validations from above.</p>"},{"location":"setup/validation/group-by-validation/#sum","title":"Sum","text":"<p>Check that the sum of a column's values for each group adheres to the validation.</p> JavaScala <pre><code>validation().groupBy(\"account_id\", \"year\").sum(\"amount\").lessThan(1000)\n</code></pre> <pre><code>validation.groupBy(\"account_id\", \"year\").sum(\"amount\").lessThan(1000)\n</code></pre>"},{"location":"setup/validation/group-by-validation/#count","title":"Count","text":"<p>Check that the count for each group adheres to the validation.</p> JavaScala <pre><code>validation().groupBy(\"account_id\", \"year\").count(\"amount\").lessThan(10)\n</code></pre> <pre><code>validation.groupBy(\"account_id\", \"year\").count(\"amount\").lessThan(10)\n</code></pre>"},{"location":"setup/validation/group-by-validation/#min","title":"Min","text":"<p>Check that the min for each group adheres to the validation.</p> JavaScala <pre><code>validation().groupBy(\"account_id\", \"year\").min(\"amount\").greaterThan(0)\n</code></pre> <pre><code>validation.groupBy(\"account_id\", \"year\").min(\"amount\").greaterThan(0)\n</code></pre>"},{"location":"setup/validation/group-by-validation/#max","title":"Max","text":"<p>Check that the max for each group adheres to the validation.</p> JavaScala <pre><code>validation().groupBy(\"account_id\", \"year\").max(\"amount\").lessThanOrEqual(100)\n</code></pre> <pre><code>validation.groupBy(\"account_id\", \"year\").max(\"amount\").lessThanOrEqual(100)\n</code></pre>"},{"location":"setup/validation/group-by-validation/#average","title":"Average","text":"<p>Check that the average for each group adheres to the validation.</p> JavaScala <pre><code>validation().groupBy(\"account_id\", \"year\").avg(\"amount\").between(40, 60)\n</code></pre> <pre><code>validation.groupBy(\"account_id\", \"year\").avg(\"amount\").between(40, 
60)\n</code></pre>"},{"location":"setup/validation/validation/","title":"Validations","text":"<p>Validations can be used to run data checks after you have run the data generator or even as a standalone task. A report summarising the success or failure of the validations is produced and can be examined for further investigation.</p> <ul> <li>Basic - Basic column level validations</li> <li>Group by/Aggregate - Run aggregates over grouped data, then validate</li> <li>[Relationship (Coming soon)] - Ensure record values exist in other datasets based on relationships</li> <li>[Data Profile (Coming soon)] - Score how close the data profile of generated data is against the target data profile</li> </ul> <p>Currently, SQL expression validations are supported (see here for a reference of which expressions are valid), but this will later be extended to support other validations such as aggregates (group by account_number, sum of amounts should be greater than 100), ordering (transaction dates should be in descending order), relationships (at least one transaction per account_number) or data profiling (how close produced data profile is to expected data profile).</p>"},{"location":"setup/validation/validation/#define-validations","title":"Define Validations","text":"<p>A full validation example can be found below. 
For more details, check out each of the subsections defined further below.</p> JavaScalaYAML <pre><code>var csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validations(\nvalidation().col(\"amount\").lessThan(100),\nvalidation().col(\"year\").isEqual(2021).errorThreshold(0.1),  //equivalent to if error percentage is &gt; 10%, then fail\nvalidation().col(\"name\").matches(\"Peter .*\").errorThreshold(200)  //equivalent to if number of errors is &gt; 200, then fail\n)\n.validationWait(waitCondition().pause(1));\n\nvar conf = configuration().enableValidation(true);\n</code></pre> <pre><code>val csvTxns = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -&gt; \"true\"))\n.validations(\nvalidation.col(\"amount\").lessThan(100),\nvalidation.col(\"year\").isEqual(2021).errorThreshold(0.1),  //equivalent to if error percentage is &gt; 10%, then fail\nvalidation.col(\"name\").matches(\"Peter .*\").errorThreshold(200)  //equivalent to if number of errors is &gt; 200, then fail\n)  .validationWait(waitCondition.pause(1))\n\nval conf = configuration.enableValidation(true)\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nvalidations:\n- expr: \"amount &lt; 100\"\n- expr: \"year == 2021\"\nerrorThreshold: 0.1   #equivalent to if error percentage is &gt; 10%, then fail\n- expr: \"REGEXP_LIKE(name, 'Peter .*')\"\nerrorThreshold: 200   #equivalent to if number of errors is &gt; 200, then fail\ndescription: \"Should be lots of Peters\"\nwaitCondition:\npauseInSeconds: 1\n</code></pre>"},{"location":"setup/validation/validation/#wait-condition","title":"Wait Condition","text":"<p>Once data has been generated, you may want to wait for a certain condition to be met before starting the data validations. 
This can be via:</p> <ul> <li>Pause for seconds</li> <li>When file is available</li> <li>Data exists</li> <li>Webhook</li> </ul>"},{"location":"setup/validation/validation/#pause","title":"Pause","text":"JavaScalaYAML <pre><code>var csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validationWait(waitCondition().pause(1));\n</code></pre> <pre><code>val csvTxns = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -&gt; \"true\"))\n.validationWait(waitCondition.pause(1))\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nwaitCondition:\npauseInSeconds: 1\n</code></pre>"},{"location":"setup/validation/validation/#data-exists","title":"Data exists","text":"JavaScalaYAML <pre><code>var csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validationWaitDataExists(\"updated_date &gt; DATE('2023-01-01')\");\n</code></pre> <pre><code>val csvTxns = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -&gt; \"true\"))\n.validationWaitDataExists(\"updated_date &gt; DATE('2023-01-01')\")\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nwaitCondition:\ndataSourceName: \"transactions\"\noptions:\npath: \"/tmp/csv\"\nexpr: \"updated_date &gt; DATE('2023-01-01')\"\n</code></pre>"},{"location":"setup/validation/validation/#webhook","title":"Webhook","text":"JavaScalaYAML <pre><code>var csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validationWait(waitCondition().webhook(\"http://localhost:8080/finished\")); //by default, GET request successful when 200 status code\n\n//or\n\nvar csvTxnsWithStatusCodes = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validationWait(waitCondition().webhook(\"http://localhost:8080/finished\", \"GET\", 200, 202));  //successful if 200 or 202 status code\n\n//or\n\nvar csvTxnsWithExistingHttpConnection = 
csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validationWait(waitCondition().webhook(\"my_http\", \"http://localhost:8080/finished\"));  //use connection configuration from existing 'my_http' connection definition\n</code></pre> <pre><code>val csvTxns = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -&gt; \"true\"))\n.validationWait(waitCondition.webhook(\"http://localhost:8080/finished\"))  //by default, GET request successful when 200 status code\n\n//or\n\nval csvTxnsWithStatusCodes = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -&gt; \"true\"))\n.validationWait(waitCondition.webhook(\"http://localhost:8080/finished\", \"GET\", 200, 202)) //successful if 200 or 202 status code\n\n//or\n\nval csvTxnsWithExistingHttpConnection = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -&gt; \"true\"))\n.validationWait(waitCondition.webhook(\"my_http\", \"http://localhost:8080/finished\")) //use connection configuration from existing 'my_http' connection definition\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nwaitCondition:\nurl: \"http://localhost:8080/finished\" #by default, GET request successful when 200 status code\n\n#or\n\n---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nwaitCondition:\nurl: \"http://localhost:8080/finished\"\nmethod: \"GET\"\nstatusCodes: [200, 202] #successful if 200 or 202 status code\n\n#or\n\n---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nwaitCondition:\ndataSourceName: \"my_http\" #use connection configuration from existing 'my_http' connection definition\nurl: \"http://localhost:8080/finished\"\n</code></pre>"},{"location":"setup/validation/validation/#file-available","title":"File available","text":"JavaScalaYAML <pre><code>var csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", 
\"true\"))\n.validationWait(waitCondition().file(\"/tmp/json\"));\n</code></pre> <pre><code>val csvTxns = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -&gt; \"true\"))\n.validationWait(waitCondition.file(\"/tmp/json\"))\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nwaitCondition:\npath: \"/tmp/json\"\n</code></pre>"},{"location":"setup/validation/validation/#report","title":"Report","text":"<p>Once run, it will produce a report like this.</p>"},{"location":"use-case/business-value/","title":"Business Value","text":"<p>Below is a list of the business-related benefits of using Data Caterer which may be applicable for your use case.</p> Problem Data Caterer Solution Resources Effects Reliable test data creation - Profile existing data- Create scenarios- Generate data Software Engineers, QA, Testers Cost reduction in labor, more time spent on development, more bugs caught before production Faster development cycles - Generate data in local, test, UAT, pre-prod- Run different scenarios Software Engineers, QA, Testers More defects caught in lower environments, features pushed to production faster, common framework used across all environments Data compliance - Profiling existing data- Generate based on metadata- No complex masking- No production data used in lower environments Audit and compliance No chance for production data breaches Storage costs - Delete generated data- Test specific scenarios Infrastructure Lower data storage costs, less time spent on data management and clean up Schema evolution - Create metadata from data sources- Generate data based off fresh metadata Software Engineers, QA, Testers Less time spent altering tests due to schema changes, ease of use between environments and application versions"},{"location":"use-case/comparison/","title":"Comparison to similar tools","text":"<p>I have tried to include all the companies found in the list here from Mostly AI blog post and 
used information that is publicly available.</p> <p>The companies/products not shown below either have:</p> <ul> <li>a website with insufficient information about the technology side of data generation/validation</li> <li>no/little documentation</li> <li>don't have a free, no sign-up version of their app to use</li> </ul>"},{"location":"use-case/comparison/#data-generation","title":"Data Generation","text":"Tool Description Cost Pros Cons Clearbox AI Python based data generation tool via ML Unclear  Python SDK UI interface Detect private data Report generation  Batch data only No data clean up Limited/no documentation Curiosity Software Platform solution for test data management Unclear  Extensive documentation Generate data based off test cases UI interface Web/API/UI/mobile testing  No quick start No SDK Many components that may not be required No event generation support DataCebo Synthetic Data Vault Python based data generation tool via ML Unclear  Python SDK Report generation Data quality checks Business logic constraints  No data connection support No data clean up No foreign key support Datafaker Realistic data generation library Free  SDK for many languages Simple, easy to use Extensible Open source Generate realistic values  No data connection support No data clean up No validation No foreign key support DBLDatagen Python based data generation tool Free  Python SDK Open source Good documentation Customisable scenarios Customisable column generation Generate from existing data/schemas Plugin third-party libraries  Limited support if issues Code required No data clean up No data validation Gatling HTTP API load testing tool Free (Open Source)Gatling Enterprise, usage based, starts from \u20ac89 per month, 1 user, 6.25 hours of testing  Kotlin, Java &amp; Scala SDK Widely used Open source Clear documentation Extensive testing/validation support Customisable scenarios Report generation  Only supports HTTP, JMS and JDBC No data clean up Data feeders not based 
off metadata Gretel Python based data generation tool via ML Usage based, starts from $295 per month, $2.20 per credit, assumed USD  CLI &amp; Python SDK UI interface Training and re-use of models Detect private data Customisable scenarios  Batch data only No relationships between data sources Only simple foreign key relations defined No data clean up Charge by usage Howso Python based data generation tool via ML Unclear  Python SDK Playground to try Open source library Customisable scenarios  No support for data sources No data validation No data clean up Mostly AI Python based data generation tool via ML Usage based, Enterprise 1 user, 100 columns, 100K rows $3,100 per month, assumed USD  Report generation Non-technical users can use UI Customisable scenarios  Charge by usage Batch data only No data clean up Confusing use of 'smart select' for multiple foreign keys Limited custom column generation logic Multiple deployment components No SDK Octopize Python based data generation tool via ML Unclear  Python &amp; R SDK Report generation API for metadata Customisable scenarios  Input data source is only CSV Multiple manual steps before starting Quickstart is not a quickstart Documentation lacks code examples Synthesized Python based data generation tool via ML Unclear  CLI &amp; Python SDK API for metadata IDE setup Data quality checks  Not sure what is SDK &amp; TDK Charge by usage No report of what was generated No relationships between data sources Tonic Platform solution for generating data Unclear  UI interface Good documentation Detect private data Support for encrypted columns Report generation Alerting  Batch data only Multiple deployment components No relationships between data sources No data validation No data clean up No SDK (only API) Difficult to embed complex business logic YData Python based data generation tool via ML. 
Platform solution as well Unclear  Python SDK Open source Detect private data Compare datasets Report generation  No data connection support Batch data only No data clean up Separate data generation and data validation No foreign key support"},{"location":"use-case/comparison/#use-of-ml-models","title":"Use of ML models","text":"<p>You may notice that the majority of data generators use machine learning (ML) models to learn from your existing datasets to generate new data. Below are some pros and cons to the approach.</p> <p>Pros</p> <ul> <li>Simple setup</li> <li>Ability to reproduce complex logic</li> <li>Flexible to accept all types of data</li> </ul> <p>Cons</p> <ul> <li>Long time for model learning</li> <li>Black box of logic</li> <li>Maintain, store and update of ML models</li> <li>Restriction on input data lengths</li> <li>May not maintain referential integrity</li> <li>Require deeper understanding of ML models for fine-tuning</li> <li>Accuracy may be worse than non-ML models</li> </ul>"},{"location":"use-case/roadmap/","title":"Roadmap","text":"<ul> <li>Support for other data sources<ul> <li>GCP and Azure related data services ( cloud storage)</li> <li>Deltalake</li> <li>RabbitMQ</li> <li>ActiveMQ</li> <li>MongoDB</li> <li>Airflow</li> <li>DBT</li> </ul> </li> <li>Further support for metadata discovery<ul> <li> HTTP (OpenAPI spec)</li> <li>JMS</li> <li>Read from samples</li> </ul> </li> <li> API for developers and testers<ul> <li> Scala</li> <li> Java</li> </ul> </li> <li>UI for metadata and data generation</li> <li> Report for data generated and validation rules</li> <li>Metadata stored in database</li> <li>Integration with existing metadata services (i.e. 
Amundsen, Datahub, Schema Registry, DBT)<ul> <li>Populate metadata back to metadata services</li> <li> OpenLineage metadata (Marquez)</li> <li> OpenMetadata</li> </ul> </li> <li>Integration with existing data validations<ul> <li>Great Expectation</li> <li>DBT constraints</li> <li>SodaCL</li> <li>MonteCarlo</li> </ul> </li> <li> Suggest data validations</li> <li>Data dictionary<ul> <li>Business definitions of fields that can be referenced for metadata across all data sources</li> </ul> </li> <li> Verification rules after data generation</li> <li> Validation waiting conditions<ul> <li> Webhook</li> <li> File exists</li> <li> Data exists via SQL expression</li> <li> Pause</li> </ul> </li> <li>Extend validation types<ul> <li> Aggregates (sum of amount per account is &gt; 500)</li> <li>Ordering (transactions are ordered by date)</li> <li>Relationship (at least one account entry in history table per account in accounts table)</li> <li>Data profile (how close the generated data profile is compared to the expected data profile)</li> </ul> </li> <li>Extend count<ul> <li>Cover all possible cases (i.e. 
record for each combination of oneOf values, positive/negative values etc.)</li> <li>Similar to edge cases</li> </ul> </li> <li>Alerting<ul> <li>Slack</li> <li>Email</li> </ul> </li> <li>Overriding tasks<ul> <li>Can customise tasks without copying whole schema definitions</li> <li>Easier to create scenarios</li> </ul> </li> <li>Gradle plugin</li> <li>Metadata improvements<ul> <li>PII detection (can integrate with Presidio)</li> <li>Relationship detection across data sources</li> <li>SQL generation</li> <li>Ordering information</li> </ul> </li> <li>Code generation</li> <li>Schema generation from Scala/Java class</li> <li>Ordering within data sources that support order for insertion</li> <li>Clean up data in consumer data sinks</li> <li> Trial app to try out all features</li> </ul>"},{"location":"use-case/use-case/","title":"Use cases","text":""},{"location":"use-case/use-case/#replicate-production-in-lower-environment","title":"Replicate production in lower environment","text":"<p>Having a stable and reliable test environment is a challenge for a number of companies, especially where teams are asynchronously deploying and testing changes at faster rates. Data Caterer can help alleviate these issues by doing the following:</p> <ol> <li>Generates data with the latest schema changes and production like field values</li> <li>Run as a job on a daily/regular basis to replicate production traffic or data flows</li> <li>Validate data to ensure your system runs as expected</li> <li>Clean up data to avoid build up of generated data</li> </ol> <p></p>"},{"location":"use-case/use-case/#local-development","title":"Local development","text":"<p>Similar to the above, being able to replicate production like data in your local environment can be key to developing more reliable code as you can test directly against data in your local computer. 
This has a number of benefits including:</p> <ol> <li>Fewer assumptions or ambiguities when the developer codes</li> <li>Direct feedback loop in local computer rather than waiting for test environment for more reliable test data</li> <li>No domain expertise required to understand the data</li> <li>Easy for new developers to be onboarded and developing/testing code for jobs/services</li> </ol>"},{"location":"use-case/use-case/#systemintegration-testing","title":"System/integration testing","text":"<p>When working with third-party, external or internal data providers, it can be difficult to have all setup ready to produce reliable data that abides by relationship contracts between each of the systems. You have to rely on these data providers in order for you to run your tests which may not align to their priorities. With Data Caterer, you can generate the same data that they would produce, along with maintaining referential integrity across the data providers, so that you can run your tests without relying on their systems being up and reliable in their corresponding lower environments.</p>"},{"location":"use-case/use-case/#scenario-testing","title":"Scenario testing","text":"<p>If you want to set up particular data scenarios, you can customise the generated data to fit your scenario. Once the data gets generated and is consumed, you can also run validations to ensure your system has consumed the data correctly. These scenarios can be put together from existing tasks or data sources can be enabled/disabled based on your requirement. 
Built into Data Caterer and controlled via feature flags, is the ability to test edge cases based on the data type of the fields used for data generation (<code>enableEdgeCases</code> flag within <code>&lt;field&gt;.generator.options</code>, see more here).</p>"},{"location":"use-case/use-case/#data-debugging","title":"Data debugging","text":"<p>When data related issues occur in production, it may be difficult to replicate in a lower or local environment. It could be related to specific fields not containing expected results, size of data is too large or missing corresponding referenced data. This becomes key to resolving the issue as you can directly code against the exact data scenario and have confidence that your code changes will fix the problem. Data Caterer can be used to generate the appropriate data in whichever environment you want to test your changes against.</p>"},{"location":"use-case/use-case/#data-profiling","title":"Data profiling","text":"<p>When using Data Caterer with the feature flag <code>enableGeneratePlanAndTasks</code> enabled (see here), metadata relating all the fields defined in the data sources you have configured will be generated via data profiling. You can run this as a standalone job (can disable <code>enableGenerateData</code>)  so that you can focus on the profile of the data you are utilising. This can be run against your production data sources  to ensure the metadata can be used to accurately generate data in other environments. This is a key feature of Data  Caterer as no direct production connections need to be maintained to generate data in other environments (which can  lead to serious concerns about data security as seen here).</p>"},{"location":"use-case/use-case/#schema-gathering","title":"Schema gathering","text":"<p>When using Data Caterer with the feature flag <code>enableGeneratePlanAndTasks</code> enabled (see here), all schemas of the data sources defined will be tracked in a common format (as tasks). 
This data, along with the data profiling metadata, could then feed back into your schema registries to help keep them up to date with your system.</p>"}]}
\ No newline at end of file
+{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"],"fields":{"title":{"boost":1000.0},"text":{"boost":1.0},"tags":{"boost":1000000.0}}},"docs":[{"location":"","title":"Home","text":"Data Caterer is a metadata-driven data generation and  testing tool that aids in creating production-like data across both batch and event data systems. Run data validations  to ensure your systems have ingested it as expected, then clean up the data afterwards. Simplify your data testing Take away the pain and complexity of your data landscape and let Data Caterer handle it <p> Try now </p> Data testing is difficult and fragmented <ul> <li>Data being sent via messages, HTTP requests or files and getting stored in databases, file systems, etc.</li> <li>Maintaining and updating tests with the latest schemas and business definitions</li> <li>Different testing tools for services, jobs or data sources</li> <li>Complex relationships between datasets and fields</li> <li>Different scenarios, permutations, combinations and edge cases to cover</li> </ul> Current solutions only cover half the story <ul> <li>Specific testing frameworks that support one or limited number of data sources or transport protocols</li> <li>Under utilizing metadata from data catalogs or metadata discovery services</li> <li>Testing teams having difficulties understanding when failures occur</li> <li>Integration tests relying on external teams/services</li> <li>Manually generating data, or worse, copying/masking production data into lower environments</li> <li>Observability pushes towards being reactive rather than proactive</li> </ul> <p> Try now </p> What you need is a reliable tool that can handle changes to your data landscape <p> </p> <p>With Data Caterer, you get:</p> <ul> <li>Ability to connect to any type of data source: files, SQL or no-SQL databases, messaging systems, HTTP</li> <li>Discover metadata from your existing infrastructure and services</li> <li>Gain confidence that bugs do 
not propagate to production</li> <li>Be proactive in ensuring changes do not affect other data producers or consumers</li> <li>Configurability to run the way you want</li> </ul> <p> Try now </p>"},{"location":"#tech-summary","title":"Tech Summary","text":"<p>Use the Java, Scala API, or YAML files to help with setup or customisation that are all run via a Docker image. Want to  get into details? Check out the setup pages here to get code examples and guides that will take you  through scenarios and data sources.</p> <p>Main features include:</p> <ul> <li> Metadata discovery</li> <li> Batch and  event data generation</li> <li> Maintain referential integrity across any dataset</li> <li> Create custom data generation scenarios</li> <li> Clean up generated data</li> <li> Validate data</li> <li> Suggest data validations</li> </ul> <p></p> <p>Check other run configurations here.</p>"},{"location":"#what-is-it","title":"What is it","text":"<ul> <li> <p> Data generation and testing tool</p> <p>Generate production-like data to be consumed and validated.</p> </li> <li> <p> Designed for any data source</p> <p>We aim to support pushing data to any data source, in any format.</p> </li> <li> <p> Low/no code solution</p> <p>Can use the tool via either Scala, Java or YAML. Connect to data or metadata sources to generate data and validate.</p> </li> <li> <p> Developer productivity tool</p> <p>Whether you are a new developer or a seasoned veteran, cut down on your feedback loop when developing with data.</p> </li> </ul>"},{"location":"#what-it-is-not","title":"What it is not","text":"<ul> <li> <p> Metadata storage/platform</p> <p>You could store and use metadata within the data generation/validation tasks, but this is not the recommended approach. 
Rather, this metadata should be gathered from existing services that handle metadata on behalf of Data Caterer.</p> </li> <li> <p> Data contract</p> <p>The focus of Data Caterer is on the data generation and testing, which can include details about what the data looks like and how it behaves. But it does not encompass all the additional metadata that comes with a data contract such as SLAs, security, etc.</p> </li> <li> <p> Metrics from load testing</p> <p>Although millions of records can be generated, there are limited capabilities in terms of metric capturing.</p> </li> </ul> <p> Try now </p> Data Catering vs Other tools vs In-house <p> Data Catering Other tools In-house Data flow Batch and events generation with validation Batch generation only or validation only Depends on architecture and design Time to results 1 day 1+ month to integrate, deploy and onboard 1+ month to build and deploy Solution Connect with your existing data ecosystem, automatic generation and validation Manual UI data entry or via SDK Depends on engineer(s) building it <p></p>"},{"location":"about/","title":"About","text":"<p>Hi, my name is Peter. I am an independent Software Developer, mainly focussing on data related services. My experience can be found on my LinkedIn.</p> <p>I have created Data Caterer to help serve individuals and companies with data generation and data testing. It is a complex area that has many edge cases or intricacies that are hard to summarise or turn into something actionable and repeatable. Through the use of metadata, Data Caterer can help simplify your data testing, simulate production environment data, aid in data debugging, or whatever your data use case may be.</p> <p>Given that it is going to save you and your team time and money, please consider providing financial support. 
This will help the product grow into a sustainable and feature-full service.</p>"},{"location":"about/#contact","title":"Contact","text":"<p>Please contact Peter Flook via Slack or via email <code>peter.flook@data.catering</code> if you have any questions or queries.</p>"},{"location":"about/#terms-of-service","title":"Terms of service","text":"<p>Terms of service can be found here.</p>"},{"location":"about/#privacy-policy","title":"Privacy policy","text":"<p>Privacy policy can be found here.</p>"},{"location":"pricing/","title":"Pricing","text":"<p>To have access to the paid features of Data Caterer, you can subscribe according to your situation. You will not be charged by usage. As you continue to subscribe, you will have access to the latest version of Data Caterer as new bug fixes and features get published.</p>"},{"location":"pricing/#paid-features","title":"Paid Features","text":"<ul> <li> Metadata discovery</li> <li> All data sources (see here for all data sources)</li> <li> Batch and  Event generation</li> <li> Auto generation from data connections or metadata sources</li> <li> Suggest data validations</li> <li> Clean up generated data</li> <li> Run as many times as you want, not charged by usage</li> </ul>"},{"location":"pricing/#pricing-table","title":"Pricing Table","text":""},{"location":"pricing/#manage-subscription","title":"Manage Subscription","text":"<p>Manage via this link</p>"},{"location":"pricing/#contact","title":"Contact","text":"<p>Please contact Peter Flook via Slack or via email <code>peter.flook@data.catering</code> if you have any questions or queries.</p>"},{"location":"get-started/docker/","title":"Run Data Caterer","text":""},{"location":"get-started/docker/#quick-start","title":"Quick start","text":"<p>Ensure you have <code>docker</code> installed and running.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\ncd data-caterer-example &amp;&amp; ./run.sh\n#check results under docker/sample/report/index.html 
folder\n</code></pre>"},{"location":"get-started/docker/#report","title":"Report","text":"<p>Check the report generated under <code>docker/data/custom/report/index.html</code>.</p> <p>Sample report can also be seen here</p>"},{"location":"get-started/docker/#paid-version-trial","title":"Paid Version Trial","text":"<p>A 30-day trial of the paid version can be accessed via these steps:</p> <ol> <li>Join the Data Catering Slack group here</li> <li>Get an API_KEY by using slash command <code>/token</code> in the Slack group (will only be visible to you)</li> <li> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\ncd data-caterer-example &amp;&amp; export DATA_CATERING_API_KEY=&lt;insert api key&gt;\n./run.sh\n</code></pre> </li> </ol> <p>If you want to check how long your trial has left, you can check back in the Slack group or type <code>/token</code> again.</p>"},{"location":"get-started/docker/#guided-tour","title":"Guided tour","text":"<p>Check out the starter guide here that will take you through step by step. You can also check the other guides here to see the other possibilities of what Data Caterer can achieve for you.</p>"},{"location":"legal/privacy-policy/","title":"Privacy Policy","text":"<p>Last updated September 25, 2023</p>"},{"location":"legal/privacy-policy/#data-caterer-policy-on-privacy-of-customer-personal-information","title":"Data Caterer Policy on Privacy of Customer Personal Information","text":"<p>Peter John Flook is committed to protecting the privacy and security of your personal information obtained by reason of your use of Data Caterer. 
This policy explains the types of customer personal information we collect, how it is used, and the steps we take to ensure your personal information is handled appropriately.</p>"},{"location":"legal/privacy-policy/#who-is-peter-john-flook","title":"Who is Peter John Flook?","text":"<p>For purposes of this Privacy Policy, \u201cPeter John Flook\u201d means Peter John Flook, the company developing and providing Data Caterer and related websites and services.</p>"},{"location":"legal/privacy-policy/#what-is-personal-information","title":"What is personal information?","text":"<p>Personal information is information that refers to an individual specifically and is recorded in any form. Personal information includes such things as age, income, date of birth, ethnic origin and credit records. Information about individuals contained in the following documents is not considered personal information:</p> <ul> <li>public telephone directories, where the subscriber can refuse to be listed</li> <li>professional and business directories available to the public</li> <li>public registries and court records</li> <li>other publicly available printed and electronic publications</li> </ul>"},{"location":"legal/privacy-policy/#we-are-accountable-to-you","title":"We are accountable to you","text":"<p>Peter John Flook is responsible for all personal information under its control. Our team is accountable for compliance with these privacy and security principles.</p>"},{"location":"legal/privacy-policy/#we-let-you-know-why-we-collect-and-use-your-personal-information-and-get-your-consent","title":"We let you know why we collect and use your personal information and get your consent","text":"<p>Peter John Flook identifies the purpose for which your personal information is collected and will be used or disclosed. If that purpose is not listed below we will do this before or at the time the information is actually being collected. 
You will be deemed to consent to our use of your personal information for the purpose of:</p> <ul> <li>communicating with you generally</li> <li>processing your purchases</li> <li>processing and keeping track of transactions and reporting back to you</li> <li>protecting against fraud or error</li> <li>providing product and services requested by you</li> <li>recommending products and services that Peter John Flook believes will be of interest and provide value to you</li> <li>fulfilling any other purpose that would be reasonably apparent to the average person at the time we collect it from   you</li> </ul> <p>Otherwise, Peter John Flook will obtain your express consent (by verbal, written or electronic agreement) to collect, use or disclose your personal information. You can change your consent preferences at any time by contacting Peter John Flook (please refer to the \u201cHow to contact us\u201d section below).</p>"},{"location":"legal/privacy-policy/#we-limit-collection-of-your-personal-information","title":"We limit collection of your personal information","text":"<p>Peter John Flook collects only the information required to provide products and services to you. Peter John Flook will collect personal information only by clear, fair and lawful means.</p> <p>We receive and store any information you enter on our website or give us in any other way. You can choose not to provide certain information, but then you might not be able to take advantage of many of our features.</p> <p>Peter John Flook does not receive or store personal content saved to your local device while using Data Caterer.</p> <p>We also receive and store certain types of information whenever you interact with us.</p>"},{"location":"legal/privacy-policy/#information-provided-to-stripe","title":"Information provided to Stripe","text":"<p>All purchases that are made through this site are processed securely and externally by Stripe. 
Unless you expressly consent otherwise, we do not see or have access to any personal information that you may provide to Stripe, other than information that is required in order to process your order and deliver your purchased items to you (eg, your name, email address and billing/postal address).</p>"},{"location":"legal/privacy-policy/#we-limit-disclosure-and-retention-of-your-personal-information","title":"We limit disclosure and retention of your personal information","text":"<p>Peter John Flook does not disclose personal information to any organization or person for any reason except the following:</p> <p>We employ other companies and individuals to perform functions on our behalf. Examples include fulfilling orders, delivering packages, sending postal mail and e-mail, removing repetitive information from customer lists, analyzing data, providing marketing assistance, processing credit card payments, and providing customer service. They have access to personal information needed to perform their functions, but may not use it for other purposes. We may use service providers located outside of Australia, and, if applicable, your personal information may be processed and stored in other countries and therefore may be subject to disclosure under the laws of those countries. As we continue to develop our business, we might sell or buy stores, subsidiaries, or business units. In such transactions, customer information generally is one of the transferred business assets but remains subject to the promises made in any pre-existing Privacy Notice (unless, of course, the customer consents otherwise). Also, in the unlikely event that Peter John Flook or substantially all of its assets are acquired, customer information of course will be one of the transferred assets. You are deemed to consent to disclosure of your personal information for those purposes. 
If your personal information is shared with third parties, those third parties are bound by appropriate agreements with Peter John Flook to secure and protect the confidentiality of your personal information.</p> <p>Peter John Flook retains your personal information only as long as it is required for our business relationship or as required by federal and provincial laws.</p>"},{"location":"legal/privacy-policy/#we-keep-your-personal-information-up-to-date-and-accurate","title":"We keep your personal information up to date and accurate","text":"<p>Peter John Flook keeps your personal information up to date, accurate and relevant for its intended use.</p> <p>You may request access to the personal information we have on record in order to review and amend the information, as appropriate. In circumstances where your personal information has been provided by a third party, we will refer you to that party (e.g. credit bureaus). To access your personal information, refer to the \u201cHow to contact us\u201d section below.</p>"},{"location":"legal/privacy-policy/#the-security-of-your-personal-information-is-a-priority-for-peter-john-flook","title":"The security of your personal information is a priority for Peter John Flook","text":"<p>We take steps to safeguard your personal information, regardless of the format in which it is held, including:</p> <p>physical security measures such as restricted access facilities and locked filing cabinets electronic security measures for computerized personal information such as password protection, database encryption and personal identification numbers. We work to protect the security of your information during transmission by using \u201cTransport Layer Security\u201d (TLS) protocol. 
organizational processes such as limiting access to your personal information to a selected group of individuals contractual obligations with third parties who need access to your personal information requiring them to protect and secure your personal information It\u2019s important for you to protect against unauthorized access to your password and your computer. Be sure to sign off when you\u2019ve finished using any shared computer.</p>"},{"location":"legal/privacy-policy/#what-about-third-party-advertisers-and-links-to-other-websites","title":"What About Third-Party Advertisers and Links to Other Websites?","text":"<p>Our site may include third-party advertising and links to other websites. We do not provide any personally identifiable customer information to these advertisers or third-party websites.</p> <p>These third-party websites and advertisers, or Internet advertising companies working on their behalf, sometimes use technology to send (or \u201cserve\u201d) the advertisements that appear on our website directly to your browser. They automatically receive your IP address when this happens. They may also use cookies, JavaScript, web beacons (also known as action tags or single-pixel gifs), and other technologies to measure the effectiveness of their ads and to personalize advertising content. We do not have access to or control over cookies or other features that they may use, and the information practices of these advertisers and third-party websites are not covered by this Privacy Notice. Please contact them directly for more information about their privacy practices. In addition, the Network Advertising Initiative offers useful information about Internet advertising companies (also called \u201cad networks\u201d or \u201cnetwork advertisers\u201d), including information about how to opt-out of their information collection. 
You can access the Network Advertising Initiative at http://www.networkadvertising.org.</p>"},{"location":"legal/privacy-policy/#redirection-to-stripe","title":"Redirection to Stripe","text":"<p>In particular, when you submit an order to us, you may be automatically redirected to Stripe in order to complete the required payment. The payment page that is provided by Stripe is not part of this site. As noted above, we are not privy to any of the bank account, credit card or other personal information that you may provide to Stripe, other than information that is required in order to process your order and deliver your purchased items to you (eg, your name, email address and billing/postal address). We recommend that you refer to Stripe\u2019s privacy statement if you would like more information about how Stripe collects and handles your personal information.</p>"},{"location":"legal/privacy-policy/#we-are-open-about-our-privacy-and-security-policy","title":"We are open about our privacy and security policy","text":"<p>We are committed to providing you with understandable and easily available information about our policy and practices related to management of your personal information. This policy and any related information is available at all times on our website, https://data.catering/about/ under Privacy or on request. To contact us, refer to the \u201cHow to contact us\u201d section below.</p>"},{"location":"legal/privacy-policy/#we-provide-access-to-your-personal-information-stored-by-peter-john-flook","title":"We provide access to your personal information stored by Peter John Flook","text":"<p>You can request access to your personal information stored by Peter John Flook. To contact us, refer to the \u201cHow to contact us\u201d section below. 
Upon receiving such a request, Peter John Flook will:</p> <p>inform you about what type of personal information we have on record or in our control, how it is used and to whom it may have been disclosed provide you with access to your information so you can review and verify the accuracy and completeness and request changes to the information make any necessary updates to your personal information We respond to your questions, concerns and complaints about privacy Peter John Flook responds in a timely manner to your questions, concerns and complaints about the privacy of your personal information and our privacy policies and procedures.</p>"},{"location":"legal/privacy-policy/#how-to-contact-us","title":"How to contact us","text":"<ul> <li>by email at <code>peter.flook@data.catering</code></li> </ul> <p>Our business changes constantly, and this privacy notice will change also. We may e-mail periodic reminders of our notices and conditions, unless you have instructed us not to, but you should check our website frequently to see recent changes. We are, however, committed to protecting your information and will never materially change our policies and practices to make them less protective of customer information collected in the past without the consent of affected customers.</p>"},{"location":"legal/terms-of-service/","title":"Terms and Conditions","text":"<p>Last updated: September 25, 2023</p> <p>Please read these terms and conditions carefully before using Our Service.</p>"},{"location":"legal/terms-of-service/#interpretation-and-definitions","title":"Interpretation and Definitions","text":""},{"location":"legal/terms-of-service/#interpretation","title":"Interpretation","text":"<p>The words of which the initial letter is capitalized have meanings defined under the following conditions. 
The following definitions shall have the same meaning regardless of whether they appear in singular or in plural.</p>"},{"location":"legal/terms-of-service/#definitions","title":"Definitions","text":"<p>For the purposes of these Terms and Conditions:</p> <ul> <li>Application means the software program provided by the Company downloaded by You on any electronic device, named   Data Caterer</li> <li>Application Store means the digital distribution service operated and developed by Docker Inc. (\u201cDocker\u201d) in which   the Application has been downloaded.</li> <li>Affiliate means an entity that controls, is controlled by or is under common control with a party, where \"control\"   means ownership of 50% or more of the shares, equity interest or other securities entitled to vote for election of   directors or other managing authority.</li> <li>Country refers to: New South Wales, Australia</li> <li>Company (referred to as either \"the Company\", \"We\", \"Us\" or \"Our\" in this Agreement) refers to Peter John Flook (   ABN: 65153160916), 30 Anne William Drive, West Pennant Hills, 2125, NSW, Australia.</li> <li>Device means any device that can access the Service such as a computer, a cellphone or a digital tablet.</li> <li>Service refers to the Application.</li> <li>Terms and Conditions (also referred as \"Terms\") mean these Terms and Conditions that form the entire agreement   between You and the Company regarding the use of the Service.</li> <li>Third-party Social Media Service means any services or content (including data, information, products or services)   provided by a third party that may be displayed, included or made available by the Service.</li> <li>You means the individual accessing or using the Service, or the company, or other legal entity on behalf of which   such individual is accessing or using the Service, as applicable.</li> </ul>"},{"location":"legal/terms-of-service/#acknowledgment","title":"Acknowledgment","text":"<p>These are the Terms and 
Conditions governing the use of this Service and the agreement that operates between You and the Company. These Terms and Conditions set out the rights and obligations of all users regarding the use of the Service.</p> <p>Your access to and use of the Service is conditioned on Your acceptance of and compliance with these Terms and Conditions. These Terms and Conditions apply to all visitors, users and others who access or use the Service.</p> <p>By accessing or using the Service You agree to be bound by these Terms and Conditions. If You disagree with any part of these Terms and Conditions then You may not access the Service.</p> <p>You represent that you are over the age of 18. The Company does not permit those under 18 to use the Service.</p> <p>Your access to and use of the Service is also conditioned on Your acceptance of and compliance with the Privacy Policy of the Company. Our Privacy Policy describes Our policies and procedures on the collection, use and disclosure of Your personal information when You use the Application or the Website and tells You about Your privacy rights and how the law protects You. Please read Our Privacy Policy carefully before using Our Service.</p>"},{"location":"legal/terms-of-service/#links-to-other-websites","title":"Links to Other Websites","text":"<p>Our Service may contain links to third-party websites or services that are not owned or controlled by the Company.</p> <p>The Company has no control over, and assumes no responsibility for, the content, privacy policies, or practices of any third party websites or services. 
You further acknowledge and agree that the Company shall not be responsible or liable, directly or indirectly, for any damage or loss caused or alleged to be caused by or in connection with the use of or reliance on any such content, goods or services available on or through any such websites or services.</p> <p>We strongly advise You to read the terms and conditions and privacy policies of any third-party websites or services that You visit.</p>"},{"location":"legal/terms-of-service/#termination","title":"Termination","text":"<p>We may terminate or suspend Your access immediately, without prior notice or liability, for any reason whatsoever, including without limitation if You breach these Terms and Conditions.</p> <p>Upon termination, Your right to use the Service will cease immediately.</p>"},{"location":"legal/terms-of-service/#limitation-of-liability","title":"Limitation of Liability","text":"<p>Notwithstanding any damages that You might incur, the entire liability of the Company and any of its suppliers under any provision of these Terms and Your exclusive remedy for all the foregoing shall be limited to the amount actually paid by You through the Service or 100 USD if You haven't purchased anything through the Service.</p> <p>To the maximum extent permitted by applicable law, in no event shall the Company or its suppliers be liable for any special, incidental, indirect, or consequential damages whatsoever (including, but not limited to, damages for loss of profits, loss of data or other information, for business interruption, for personal injury, loss of privacy arising out of or in any way related to the use of or inability to use the Service, third-party software and/or third-party hardware used with the Service, or otherwise in connection with any provision of these Terms), even if the Company or any supplier has been advised of the possibility of such damages and even if the remedy fails of its essential purpose.</p> <p>Some states do not allow the 
exclusion of implied warranties or limitation of liability for incidental or consequential damages, which means that some of the above limitations may not apply. In these states, each party's liability will be limited to the greatest extent permitted by law.</p>"},{"location":"legal/terms-of-service/#as-is-and-as-available-disclaimer","title":"\"AS IS\" and \"AS AVAILABLE\" Disclaimer","text":"<p>The Service is provided to You \"AS IS\" and \"AS AVAILABLE\" and with all faults and defects without warranty of any kind. To the maximum extent permitted under applicable law, the Company, on its own behalf and on behalf of its Affiliates and its and their respective licensors and service providers, expressly disclaims all warranties, whether express, implied, statutory or otherwise, with respect to the Service, including all implied warranties of merchantability, fitness for a particular purpose, title and non-infringement, and warranties that may arise out of course of dealing, course of performance, usage or trade practice. 
Without limitation to the foregoing, the Company provides no warranty or undertaking, and makes no representation of any kind that the Service will meet Your requirements, achieve any intended results, be compatible or work with any other software, applications, systems or services, operate without interruption, meet any performance or reliability standards or be error free or that any errors or defects can or will be corrected.</p> <p>Without limiting the foregoing, neither the Company nor any of the company's provider makes any representation or warranty of any kind, express or implied: (i) as to the operation or availability of the Service, or the information, content, and materials or products included thereon; (ii) that the Service will be uninterrupted or error-free; (iii) as to the accuracy, reliability, or currency of any information or content provided through the Service; or (iv) that the Service, its servers, the content, or e-mails sent from or on behalf of the Company are free of viruses, scripts, trojan horses, worms, malware, time-bombs or other harmful components.</p> <p>Some jurisdictions do not allow the exclusion of certain types of warranties or limitations on applicable statutory rights of a consumer, so some or all of the above exclusions and limitations may not apply to You. But in such a case the exclusions and limitations set forth in this section shall be applied to the greatest extent enforceable under applicable law.</p>"},{"location":"legal/terms-of-service/#governing-law","title":"Governing Law","text":"<p>The laws of the Country, excluding its conflicts of law rules, shall govern this Terms and Your use of the Service. 
Your use of the Application may also be subject to other local, state, national, or international laws.</p>"},{"location":"legal/terms-of-service/#disputes-resolution","title":"Disputes Resolution","text":"<p>If You have any concern or dispute about the Service, You agree to first try to resolve the dispute informally by contacting the Company.</p>"},{"location":"legal/terms-of-service/#for-european-union-eu-users","title":"For European Union (EU) Users","text":"<p>If You are a European Union consumer, you will benefit from any mandatory provisions of the law of the country in which you are resident in.</p>"},{"location":"legal/terms-of-service/#united-states-legal-compliance","title":"United States Legal Compliance","text":"<p>You represent and warrant that (i) You are not located in a country that is subject to the United States government embargo, or that has been designated by the United States government as a \"terrorist supporting\" country, and (ii) You are not listed on any United States government list of prohibited or restricted parties.</p>"},{"location":"legal/terms-of-service/#severability-and-waiver","title":"Severability and Waiver","text":""},{"location":"legal/terms-of-service/#severability","title":"Severability","text":"<p>If any provision of these Terms is held to be unenforceable or invalid, such provision will be changed and interpreted to accomplish the objectives of such provision to the greatest extent possible under applicable law and the remaining provisions will continue in full force and effect.</p>"},{"location":"legal/terms-of-service/#waiver","title":"Waiver","text":"<p>Except as provided herein, the failure to exercise a right or to require performance of an obligation under these Terms shall not affect a party's ability to exercise such right or require such performance at any time thereafter nor shall the waiver of a breach constitute a waiver of any subsequent 
breach.</p>"},{"location":"legal/terms-of-service/#translation-interpretation","title":"Translation Interpretation","text":"<p>These Terms and Conditions may have been translated if We have made them available to You on our Service. You agree that the original English text shall prevail in the case of a dispute.</p>"},{"location":"legal/terms-of-service/#changes-to-these-terms-and-conditions","title":"Changes to These Terms and Conditions","text":"<p>We reserve the right, at Our sole discretion, to modify or replace these Terms at any time. If a revision is material We will make reasonable efforts to provide at least 30 days' notice prior to any new terms taking effect. What constitutes a material change will be determined at Our sole discretion.</p> <p>By continuing to access or use Our Service after those revisions become effective, You agree to be bound by the revised terms. If You do not agree to the new terms, in whole or in part, please stop using the website and the Service.</p>"},{"location":"legal/terms-of-service/#contact-us","title":"Contact Us","text":"<p>If you have any questions about these Terms and Conditions, You can contact us:</p> <ul> <li>By email: peter.flook@data.catering</li> </ul>"},{"location":"setup/","title":"Setup","text":"<p>All the configurations and customisation related to Data Caterer can be found under here.</p>"},{"location":"setup/#guide","title":"Guide","text":"<p>If you want a guided tour of using the Java or Scala API, you can follow one of the guides found here.</p>"},{"location":"setup/#specific-configuration","title":"Specific Configuration","text":"<ul> <li> Configurations - Configurations relating to feature flags, folder pathways, metadata   analysis</li> <li> Connections - Explore the data source connections available</li> <li> Generators - Choose and configure the type of generator you want used for   fields</li> <li> Validations - How to validate data to ensure your system is performing as expected</li> <li> Foreign 
Keys - Define links between data elements across data sources</li> <li> Deployment - Deploy Data Caterer as a job to your chosen environment</li> <li> Advanced - Advanced usage of Data Caterer</li> </ul>"},{"location":"setup/#high-level-run-configurations","title":"High Level Run Configurations","text":""},{"location":"setup/configuration/","title":"Configuration","text":"<p>A number of configurations can be made and customised within Data Caterer to help control what gets run and/or where any metadata gets saved.</p> <p>These configurations are defined from within your Java or Scala class via <code>configuration</code> or for YAML file setup, <code>application.conf</code> file as seen  here.</p>"},{"location":"setup/configuration/#flags","title":"Flags","text":"<p>Flags are used to control which processes are executed when you run Data Caterer.</p> Config Default Paid Description <code>enableGenerateData</code> true N Enable/disable data generation <code>enableCount</code> true N Count the number of records generated. Can be disabled to improve performance <code>enableFailOnError</code> true N Whilst saving generated data, if there is an error, it will stop any further data from being generated <code>enableSaveReports</code> true N Enable/disable HTML reports summarising data generated, metadata of data generated (if <code>enableSinkMetadata</code> is enabled) and validation results (if <code>enableValidation</code> is enabled). Sample here <code>enableSinkMetadata</code> true N Run data profiling for the generated data. Shown in HTML reports if <code>enableSaveSinkMetadata</code> is enabled <code>enableValidation</code> false N Run validations as described in plan. Results can be viewed from logs or from HTML report if <code>enableSaveSinkMetadata</code> is enabled. 
Sample here <code>enableGeneratePlanAndTasks</code> false Y Enable/disable plan and task auto generation based off data source connections <code>enableRecordTracking</code> false Y Enable/disable which data records have been generated for any data source <code>enableDeleteGeneratedRecords</code> false Y Delete all generated records based off record tracking (if <code>enableRecordTracking</code> has been set to true) <code>enableGenerateValidations</code> false Y If enabled, it will generate validations based on the data sources defined. JavaScalaapplication.conf <pre><code>configuration()\n.enableGenerateData(true)\n.enableCount(true)\n.enableFailOnError(true)\n.enableSaveReports(true)\n.enableSinkMetadata(true)\n.enableValidation(false)\n.enableGeneratePlanAndTasks(false)\n.enableRecordTracking(false)\n.enableDeleteGeneratedRecords(false)\n.enableGenerateValidations(false);\n</code></pre> <pre><code>configuration\n.enableGenerateData(true)\n.enableCount(true)\n.enableFailOnError(true)\n.enableSaveReports(true)\n.enableSinkMetadata(true)\n.enableValidation(false)\n.enableGeneratePlanAndTasks(false)\n.enableRecordTracking(false)\n.enableDeleteGeneratedRecords(false)\n.enableGenerateValidations(false)\n</code></pre> <pre><code>flags {\n  enableCount = false\n  enableCount = ${?ENABLE_COUNT}\n  enableGenerateData = true\n  enableGenerateData = ${?ENABLE_GENERATE_DATA}\n  enableFailOnError = true\n  enableFailOnError = ${?ENABLE_FAIL_ON_ERROR}\n  enableGeneratePlanAndTasks = false\n  enableGeneratePlanAndTasks = ${?ENABLE_GENERATE_PLAN_AND_TASKS}\n  enableRecordTracking = false\n  enableRecordTracking = ${?ENABLE_RECORD_TRACKING}\n  enableDeleteGeneratedRecords = false\n  enableDeleteGeneratedRecords = ${?ENABLE_DELETE_GENERATED_RECORDS}\n  enableGenerateValidations = false\n  enableGenerateValidations = ${?ENABLE_GENERATE_VALIDATIONS}\n}\n</code></pre>"},{"location":"setup/configuration/#folders","title":"Folders","text":"<p>Depending on which flags are enabled, there 
are folders that get used to save metadata, store HTML reports or track the records generated.</p> <p>These folder pathways can be defined as a cloud storage pathway (i.e. <code>s3a://my-bucket/task</code>).</p> Config Default Paid Description <code>planFilePath</code> /opt/app/plan/customer-create-plan.yaml N Plan file path to use when generating and/or validating data <code>taskFolderPath</code> /opt/app/task N Task folder path that contains all the task files (can have nested directories) <code>validationFolderPath</code> /opt/app/validation N Validation folder path that contains all the validation files (can have nested directories) <code>generatedReportsFolderPath</code> /opt/app/report N Where HTML reports get generated that contain information about data generated along with any validations performed <code>generatedPlanAndTaskFolderPath</code> /tmp Y Folder path where generated plan and task files will be saved <code>recordTrackingFolderPath</code> /opt/app/record-tracking Y Where record tracking parquet files get saved JavaScalaapplication.conf <pre><code>configuration()\n.planFilePath(\"/opt/app/custom/plan/postgres-plan.yaml\")\n.taskFolderPath(\"/opt/app/custom/task\")\n.validationFolderPath(\"/opt/app/custom/validation\")\n.generatedReportsFolderPath(\"/opt/app/custom/report\")\n.generatedPlanAndTaskFolderPath(\"/opt/app/custom/generated\")\n.recordTrackingFolderPath(\"/opt/app/custom/record-tracking\");\n</code></pre> <pre><code>configuration\n.planFilePath(\"/opt/app/custom/plan/postgres-plan.yaml\")\n.taskFolderPath(\"/opt/app/custom/task\")\n.validationFolderPath(\"/opt/app/custom/validation\")\n.generatedReportsFolderPath(\"/opt/app/custom/report\")\n.generatedPlanAndTaskFolderPath(\"/opt/app/custom/generated\")\n.recordTrackingFolderPath(\"/opt/app/custom/record-tracking\")\n</code></pre> <pre><code>folders {\n  planFilePath = \"/opt/app/custom/plan/postgres-plan.yaml\"\n  planFilePath = ${?PLAN_FILE_PATH}\n  taskFolderPath = 
\"/opt/app/custom/task\"\n  taskFolderPath = ${?TASK_FOLDER_PATH}\n  validationFolderPath = \"/opt/app/custom/validation\"\n  validationFolderPath = ${?VALIDATION_FOLDER_PATH}\n  generatedReportsFolderPath = \"/opt/app/custom/report\"\n  generatedReportsFolderPath = ${?GENERATED_REPORTS_FOLDER_PATH}\n  generatedPlanAndTaskFolderPath = \"/opt/app/custom/generated\"\n  generatedPlanAndTaskFolderPath = ${?GENERATED_PLAN_AND_TASK_FOLDER_PATH}\n  recordTrackingFolderPath = \"/opt/app/custom/record-tracking\"\n  recordTrackingFolderPath = ${?RECORD_TRACKING_FOLDER_PATH}\n}\n</code></pre>"},{"location":"setup/configuration/#metadata","title":"Metadata","text":"<p>When metadata gets generated, there are some configurations that can be altered to help with performance or accuracy related issues. Metadata gets generated from two processes: 1) if <code>enableGeneratePlanAndTasks</code> or 2) if <code>enableSinkMetadata</code> are enabled.</p> <p>During the generation of plan and tasks, data profiling is used to create the metadata for each of the fields defined in the data source. You may face issues if the number of records in the data source is large as data profiling is an expensive task. Similarly, it can be expensive when analysing the generated data if the number of records generated is large.</p> Config Default Paid Description <code>numRecordsFromDataSource</code> 10000 Y Number of records read in from the data source that could be used for data profiling <code>numRecordsForAnalysis</code> 10000 Y Number of records used for data profiling from the records gathered in <code>numRecordsFromDataSource</code> <code>oneOfMinCount</code> 1000 Y Minimum number of records required before considering if a field can be of type <code>oneOf</code> <code>oneOfDistinctCountVsCountThreshold</code> 0.2 Y Threshold ratio to determine if a field is of type <code>oneOf</code> (i.e. a field called <code>status</code> that only contains <code>open</code> or <code>closed</code>. 
Distinct count = 2, total count = 10, ratio = 2 / 10 = 0.2 therefore marked as <code>oneOf</code>) <code>numGeneratedSamples</code> 10 N Number of sample records from generated data to take. Shown in HTML report JavaScalaapplication.conf <pre><code>configuration()\n.numRecordsFromDataSourceForDataProfiling(10000)\n.numRecordsForAnalysisForDataProfiling(10000)\n.oneOfMinCount(1000)\n.oneOfDistinctCountVsCountThreshold(0.2)\n.numGeneratedSamples(10);\n</code></pre> <pre><code>configuration\n.numRecordsFromDataSourceForDataProfiling(10000)\n.numRecordsForAnalysisForDataProfiling(10000)\n.oneOfMinCount(1000)\n.oneOfDistinctCountVsCountThreshold(0.2)\n.numGeneratedSamples(10)\n</code></pre> <pre><code>metadata {\n  numRecordsFromDataSource = 10000\n  numRecordsForAnalysis = 10000\n  oneOfMinCount = 1000\n  oneOfDistinctCountVsCountThreshold = 0.2\n  numGeneratedSamples = 10\n}\n</code></pre>"},{"location":"setup/configuration/#generation","title":"Generation","text":"<p>When generating data, you may have some limitations such as limited CPU or memory, a large number of data sources, or data sources prone to failure under load. To help alleviate these issues or speed up performance, you can control the number of records that get generated in each batch.</p> Config Default Paid Description <code>numRecordsPerBatch</code> 100000 N Number of records across all data sources to generate per batch <code>numRecordsPerStep</code> N Overrides the count defined in each step with this value if defined (i.e. 
if set to 1000, for each step, 1000 records will be generated) JavaScalaapplication.conf <pre><code>configuration()\n.numRecordsPerBatch(100000)\n.numRecordsPerStep(1000);\n</code></pre> <pre><code>configuration\n.numRecordsPerBatch(100000)\n.numRecordsPerStep(1000)\n</code></pre> <pre><code>generation {\n  numRecordsPerBatch = 100000\n  numRecordsPerStep = 1000\n}\n</code></pre>"},{"location":"setup/configuration/#runtime","title":"Runtime","text":"<p>Given Data Caterer uses Spark as the base framework for data processing, you can configure the job to your specifications via configuration as seen here.</p> JavaScalaapplication.conf <pre><code>configuration()\n.master(\"local[*]\")\n.runtimeConfig(Map.of(\"spark.driver.cores\", \"5\"))\n.addRuntimeConfig(\"spark.driver.memory\", \"10g\");\n</code></pre> <pre><code>configuration\n.master(\"local[*]\")\n.runtimeConfig(Map(\"spark.driver.cores\" -&gt; \"5\"))\n.addRuntimeConfig(\"spark.driver.memory\" -&gt; \"10g\")\n</code></pre> <pre><code>runtime {\n  master = \"local[*]\"\n  master = ${?DATA_CATERER_MASTER}\n  config {\n    \"spark.driver.cores\" = \"5\"\n    \"spark.driver.memory\" = \"10g\"\n  }\n}\n</code></pre>"},{"location":"setup/advanced/advanced/","title":"Advanced use cases","text":""},{"location":"setup/advanced/advanced/#special-data-formats","title":"Special data formats","text":"<p>There are many options available for scenarios where data has to be in a certain format.</p> <ol> <li>Create expression datafaker<ol> <li>Can be used to create names, addresses, or anything that can be found under here</li> </ol> </li> <li>Create regex</li> </ol>"},{"location":"setup/advanced/advanced/#foreign-keys-across-data-sets","title":"Foreign keys across data sets","text":"<p>Details for how you can configure foreign keys can be found here.</p>"},{"location":"setup/advanced/advanced/#edge-cases","title":"Edge cases","text":"<p>For each given data type, there are edge cases which can 
cause issues when your application processes the data. This can be controlled at a column level by including the following flags in the generator options:</p> JavaScalaYAML <pre><code>field()\n.name(\"amount\")\n.type(DoubleType.instance())\n.enableEdgeCases(true)\n.edgeCaseProbability(0.1)\n</code></pre> <pre><code>field\n.name(\"amount\")\n.`type`(DoubleType)\n.enableEdgeCases(true)\n.edgeCaseProbability(0.1)\n</code></pre> <pre><code>fields:\n  - name: \"amount\"\n    type: \"double\"\n    generator:\n      type: \"random\"\n      options:\n        enableEdgeCases: \"true\"\n        edgeCaseProb: 0.1\n</code></pre> <p>If you want to know all the possible edge cases for each data type, you can check the documentation here.</p>"},{"location":"setup/advanced/advanced/#scenario-testing","title":"Scenario testing","text":"<p>You can create specific scenarios by adjusting the metadata found in the plan and tasks to your liking. For example, suppose you have two data sources, a Postgres database and a parquet file, and you want to save account data into Postgres and transactions related to those accounts into a parquet file. You can alter the <code>status</code> column in the account data to only generate <code>open</code> accounts and define a foreign key between Postgres and parquet to ensure the same <code>account_id</code> is being used. Then in the parquet task, define 1 to 10 transactions per <code>account_id</code> to be generated.</p> <p>Postgres account generation example task Parquet transaction generation example task Plan</p>"},{"location":"setup/advanced/advanced/#cloud-storage","title":"Cloud storage","text":""},{"location":"setup/advanced/advanced/#data-source","title":"Data source","text":"<p>If you want to save the file types CSV, JSON, Parquet or ORC into cloud storage, you can do so by adding extra configurations. 
Below is an example for S3.</p> JavaScalaYAML <pre><code>var csvTask = csv(\"my_csv\", \"s3a://my-bucket/csv/accounts\")\n.schema(\nfield().name(\"account_id\"),\n...\n);\n\nvar s3Configuration = configuration()\n.runtimeConfig(Map.of(\n\"spark.hadoop.fs.s3a.directory.marker.retention\", \"keep\",\n\"spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled\", \"true\",\n\"spark.hadoop.fs.defaultFS\", \"s3a://my-bucket\",\n//can change to other credential providers as shown here\n//https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Changing_Authentication_Providers\n\"spark.hadoop.fs.s3a.aws.credentials.provider\", \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\",\n\"spark.hadoop.fs.s3a.access.key\", \"access_key\",\n\"spark.hadoop.fs.s3a.secret.key\", \"secret_key\"\n));\n\nexecute(s3Configuration, csvTask);\n</code></pre> <pre><code>val csvTask = csv(\"my_csv\", \"s3a://my-bucket/csv/accounts\")\n.schema(\nfield.name(\"account_id\"),\n...\n)\n\nval s3Configuration = configuration\n.runtimeConfig(Map(\n\"spark.hadoop.fs.s3a.directory.marker.retention\" -&gt; \"keep\",\n\"spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled\" -&gt; \"true\",\n\"spark.hadoop.fs.defaultFS\" -&gt; \"s3a://my-bucket\",\n//can change to other credential providers as shown here\n//https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Changing_Authentication_Providers\n\"spark.hadoop.fs.s3a.aws.credentials.provider\" -&gt; \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\",\n\"spark.hadoop.fs.s3a.access.key\" -&gt; \"access_key\",\n\"spark.hadoop.fs.s3a.secret.key\" -&gt; \"secret_key\"\n))\n\nexecute(s3Configuration, csvTask)\n</code></pre> <pre><code>folders {\ngeneratedPlanAndTaskFolderPath = \"s3a://my-bucket/data-caterer/generated\"\nplanFilePath = \"s3a://my-bucket/data-caterer/generated/plan/customer-create-plan.yaml\"\ntaskFolderPath = \"s3a://my-bucket/data-caterer/generated/task\"\n}\n\nruntime {\nconfig 
{\n...\n#S3\n\"spark.hadoop.fs.s3a.directory.marker.retention\" = \"keep\"\n\"spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled\" = \"true\"\n\"spark.hadoop.fs.defaultFS\" = \"s3a://my-bucket\"\n#can change to other credential providers as shown here\n#https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Changing_Authentication_Providers\n\"spark.hadoop.fs.s3a.aws.credentials.provider\" = \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\"\n\"spark.hadoop.fs.s3a.access.key\" = \"access_key\"\n\"spark.hadoop.fs.s3a.secret.key\" = \"secret_key\"\n}\n}\n</code></pre>"},{"location":"setup/advanced/advanced/#storing-plantasks","title":"Storing plan/task(s)","text":"<p>You can generate and store the plan/task files inside either AWS S3, Azure Blob Storage or Google GCS. This can be controlled via configuration set in the <code>application.conf</code> file where you can set something like the below:</p> JavaScalaYAML <pre><code>configuration()\n.generatedReportsFolderPath(\"s3a://my-bucket/data-caterer/generated\")\n.planFilePath(\"s3a://my-bucket/data-caterer/generated/plan/customer-create-plan.yaml\")\n.taskFolderPath(\"s3a://my-bucket/data-caterer/generated/task\")\n.runtimeConfig(Map.of(\n\"spark.hadoop.fs.s3a.directory.marker.retention\", \"keep\",\n\"spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled\", \"true\",\n\"spark.hadoop.fs.defaultFS\", \"s3a://my-bucket\",\n//can change to other credential providers as shown here\n//https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Changing_Authentication_Providers\n\"spark.hadoop.fs.s3a.aws.credentials.provider\", \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\",\n\"spark.hadoop.fs.s3a.access.key\", \"access_key\",\n\"spark.hadoop.fs.s3a.secret.key\", \"secret_key\"\n));\n</code></pre> 
<pre><code>configuration\n.generatedReportsFolderPath(\"s3a://my-bucket/data-caterer/generated\")\n.planFilePath(\"s3a://my-bucket/data-caterer/generated/plan/customer-create-plan.yaml\")\n.taskFolderPath(\"s3a://my-bucket/data-caterer/generated/task\")\n.runtimeConfig(Map(\n\"spark.hadoop.fs.s3a.directory.marker.retention\" -&gt; \"keep\",\n\"spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled\" -&gt; \"true\",\n\"spark.hadoop.fs.defaultFS\" -&gt; \"s3a://my-bucket\",\n//can change to other credential providers as shown here\n//https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Changing_Authentication_Providers\n\"spark.hadoop.fs.s3a.aws.credentials.provider\" -&gt; \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\",\n\"spark.hadoop.fs.s3a.access.key\" -&gt; \"access_key\",\n\"spark.hadoop.fs.s3a.secret.key\" -&gt; \"secret_key\"\n))\n</code></pre> <pre><code>folders {\ngeneratedPlanAndTaskFolderPath = \"s3a://my-bucket/data-caterer/generated\"\nplanFilePath = \"s3a://my-bucket/data-caterer/generated/plan/customer-create-plan.yaml\"\ntaskFolderPath = \"s3a://my-bucket/data-caterer/generated/task\"\n}\n\nruntime {\nconfig {\n...\n#S3\n\"spark.hadoop.fs.s3a.directory.marker.retention\" = \"keep\"\n\"spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled\" = \"true\"\n\"spark.hadoop.fs.defaultFS\" = \"s3a://my-bucket\"\n#can change to other credential providers as shown here\n#https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Changing_Authentication_Providers\n\"spark.hadoop.fs.s3a.aws.credentials.provider\" = \"org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider\"\n\"spark.hadoop.fs.s3a.access.key\" = \"access_key\"\n\"spark.hadoop.fs.s3a.secret.key\" = \"secret_key\"\n}\n}\n</code></pre>"},{"location":"setup/connection/connection/","title":"Data Source Connections","text":"<p>Details of all the connection configuration supported can be found in the below subsections for each type of connection.</p> 
<p>These configurations can be done via API or from configuration. Examples of both are shown for each data source below.</p>"},{"location":"setup/connection/connection/#supported-data-connections","title":"Supported Data Connections","text":"Data Source Type Data Source Paid Database Postgres, MySQL, Cassandra N (Postgres), Y (rest) File CSV, JSON, ORC, Parquet N Messaging Kafka, Solace Y HTTP REST API Y Metadata Marquez, OpenMetadata, OpenAPI/Swagger Y"},{"location":"setup/connection/connection/#api","title":"API","text":"<p>All connection details require a name. Depending on the data source, you can define additional options which may be used by the driver or connector for connecting to the data source.</p>"},{"location":"setup/connection/connection/#configuration-file","title":"Configuration file","text":"<p>All connection details follow the same pattern.</p> <pre><code>&lt;connection format&gt; {\n    &lt;connection name&gt; {\n        &lt;key&gt; = &lt;value&gt;\n    }\n}\n</code></pre> <p>Overriding configuration</p> <p>When defining a configuration value that can be defined by a system property or environment variable at runtime, you can define that via the following:</p> <pre><code>url = \"localhost\"\nurl = ${?POSTGRES_URL}\n</code></pre> <p>The above defines that if there is a system property or environment variable named <code>POSTGRES_URL</code>, then that value will be used for the <code>url</code>, otherwise, it will default to <code>localhost</code>.</p>"},{"location":"setup/connection/connection/#data-sources","title":"Data sources","text":"<p>To find examples of a task for each type of data source, please check out this page.</p>"},{"location":"setup/connection/connection/#file","title":"File","text":"<p>Linked here is a list of generic options that can be included as part of your file data source configuration if required. 
Links to specific file type configurations can be found below.</p>"},{"location":"setup/connection/connection/#csv","title":"CSV","text":"JavaScalaapplication.conf <pre><code>csv(\"customer_transactions\", \"/data/customer/transaction\")\n</code></pre> <pre><code>csv(\"customer_transactions\", \"/data/customer/transaction\")\n</code></pre> <pre><code>csv {\n  customer_transactions {\n    path = \"/data/customer/transaction\"\n    path = ${?CSV_PATH}\n  }\n}\n</code></pre> <p>Other available configuration for CSV can be found here</p>"},{"location":"setup/connection/connection/#json","title":"JSON","text":"JavaScalaapplication.conf <pre><code>json(\"customer_transactions\", \"/data/customer/transaction\")\n</code></pre> <pre><code>json(\"customer_transactions\", \"/data/customer/transaction\")\n</code></pre> <pre><code>json {\n  customer_transactions {\n    path = \"/data/customer/transaction\"\n    path = ${?JSON_PATH}\n  }\n}\n</code></pre> <p>Other available configuration for JSON can be found here</p>"},{"location":"setup/connection/connection/#orc","title":"ORC","text":"JavaScalaapplication.conf <pre><code>orc(\"customer_transactions\", \"/data/customer/transaction\")\n</code></pre> <pre><code>orc(\"customer_transactions\", \"/data/customer/transaction\")\n</code></pre> <pre><code>orc {\n  customer_transactions {\n    path = \"/data/customer/transaction\"\n    path = ${?ORC_PATH}\n  }\n}\n</code></pre> <p>Other available configuration for ORC can be found here</p>"},{"location":"setup/connection/connection/#parquet","title":"Parquet","text":"JavaScalaapplication.conf <pre><code>parquet(\"customer_transactions\", \"/data/customer/transaction\")\n</code></pre> <pre><code>parquet(\"customer_transactions\", \"/data/customer/transaction\")\n</code></pre> <pre><code>parquet {\n  customer_transactions {\n    path = \"/data/customer/transaction\"\n    path = ${?PARQUET_PATH}\n  }\n}\n</code></pre> <p>Other available configuration for Parquet can be found 
here</p>"},{"location":"setup/connection/connection/#delta-not-supported-yet","title":"Delta (not supported yet)","text":"JavaScalaapplication.conf <pre><code>delta(\"customer_transactions\", \"/data/customer/transaction\")\n</code></pre> <pre><code>delta(\"customer_transactions\", \"/data/customer/transaction\")\n</code></pre> <pre><code>delta {\n  customer_transactions {\n    path = \"/data/customer/transaction\"\n    path = ${?DELTA_PATH}\n  }\n}\n</code></pre>"},{"location":"setup/connection/connection/#rmdbs","title":"RDBMS","text":"<p>Follows the same configuration used by Spark as found here. A sample can be found below.</p> JavaScalaapplication.conf <pre><code>postgres(\n\"customer_postgres\",                            //name\n\"jdbc:postgresql://localhost:5432/customer\",    //url\n\"postgres\",                                     //username\n\"postgres\"                                      //password\n)\n</code></pre> <pre><code>postgres(\n\"customer_postgres\",                            //name\n\"jdbc:postgresql://localhost:5432/customer\",    //url\n\"postgres\",                                     //username\n\"postgres\"                                      //password\n)\n</code></pre> <pre><code>jdbc {\n    customer_postgres {\n        url = \"jdbc:postgresql://localhost:5432/customer\"\n        url = ${?POSTGRES_URL}\n        user = \"postgres\"\n        user = ${?POSTGRES_USERNAME}\n        password = \"postgres\"\n        password = ${?POSTGRES_PASSWORD}\n        driver = \"org.postgresql.Driver\"\n    }\n}\n</code></pre> <p>Ensure that the user has write permission, so that it can save generated data to the target tables.</p> SQL Permission Statements <pre><code>GRANT INSERT ON &lt;schema&gt;.&lt;table&gt; TO &lt;user&gt;;\n</code></pre>"},{"location":"setup/connection/connection/#postgres","title":"Postgres","text":"<p>See the example API or config definition for the Postgres connection 
above.</p>"},{"location":"setup/connection/connection/#permissions","title":"Permissions","text":"<p>The following permissions are required when generating plan and tasks:</p> SQL Permission Statements <pre><code>GRANT SELECT ON information_schema.tables TO &lt;user&gt;;\nGRANT SELECT ON information_schema.columns TO &lt;user&gt;;\nGRANT SELECT ON information_schema.key_column_usage TO &lt;user&gt;;\nGRANT SELECT ON information_schema.table_constraints TO &lt;user&gt;;\nGRANT SELECT ON information_schema.constraint_column_usage TO &lt;user&gt;;\n</code></pre>"},{"location":"setup/connection/connection/#mysql","title":"MySQL","text":"JavaScalaapplication.conf <pre><code>mysql(\n\"customer_mysql\",                       //name\n\"jdbc:mysql://localhost:3306/customer\", //url\n\"root\",                                 //username\n\"root\"                                  //password\n)\n</code></pre> <pre><code>mysql(\n\"customer_mysql\",                       //name\n\"jdbc:mysql://localhost:3306/customer\", //url\n\"root\",                                 //username\n\"root\"                                  //password\n)\n</code></pre> <pre><code>jdbc {\n    customer_mysql {\n        url = \"jdbc:mysql://localhost:3306/customer\"\n        user = \"root\"\n        password = \"root\"\n        driver = \"com.mysql.cj.jdbc.Driver\"\n    }\n}\n</code></pre>"},{"location":"setup/connection/connection/#permissions_1","title":"Permissions","text":"<p>The following permissions are required when generating plan and tasks:</p> SQL Permission Statements <pre><code>GRANT SELECT ON information_schema.columns TO &lt;user&gt;;\nGRANT SELECT ON information_schema.statistics TO &lt;user&gt;;\nGRANT SELECT ON information_schema.key_column_usage TO &lt;user&gt;;\n</code></pre>"},{"location":"setup/connection/connection/#cassandra","title":"Cassandra","text":"<p>Follows the same configuration as defined by the Spark Cassandra Connector as found here.</p> JavaScalaapplication.conf 
<pre><code>cassandra(\n\"customer_cassandra\",   #name\n\"localhost:9042\",       #url\n\"cassandra\",            #username\n\"cassandra\",            #password\nMap.of()                #optional additional connection options\n)\n</code></pre> <pre><code>cassandra(\n\"customer_cassandra\",   #name\n\"localhost:9042\",       #url\n\"cassandra\",            #username\n\"cassandra\",            #password\nMap()                #optional additional connection options\n)\n</code></pre> <pre><code>org.apache.spark.sql.cassandra {\n    customer_cassandra {\n        spark.cassandra.connection.host = \"localhost\"\n        spark.cassandra.connection.host = ${?CASSANDRA_HOST}\n        spark.cassandra.connection.port = \"9042\"\n        spark.cassandra.connection.port = ${?CASSANDRA_PORT}\n        spark.cassandra.auth.username = \"cassandra\"\n        spark.cassandra.auth.username = ${?CASSANDRA_USERNAME}\n        spark.cassandra.auth.password = \"cassandra\"\n        spark.cassandra.auth.password = ${?CASSANDRA_PASSWORD}\n    }\n}\n</code></pre>"},{"location":"setup/connection/connection/#permissions_2","title":"Permissions","text":"<p>Ensure that the user has write permission, so it is able to save the table to the target tables.</p> CQL Permission Statements <pre><code>GRANT INSERT ON &lt;schema&gt;.&lt;table&gt; TO &lt;user&gt;;\n</code></pre> <p>Following permissions are required when enabling <code>configuration.enableGeneratePlanAndTasks(true)</code> as it will gather metadata information about tables and columns from the below tables.</p> CQL Permission Statements <pre><code>GRANT SELECT ON system_schema.tables TO &lt;user&gt;;\nGRANT SELECT ON system_schema.columns TO &lt;user&gt;;\n</code></pre>"},{"location":"setup/connection/connection/#kafka","title":"Kafka","text":"<p>Define your Kafka bootstrap server to connect and send generated data to corresponding topics. Topic gets set at a step level. 
Further details can be found here</p> JavaScalaapplication.conf <pre><code>kafka(\n\"customer_kafka\",   #name\n\"localhost:9092\"    #url\n)\n</code></pre> <pre><code>kafka(\n\"customer_kafka\",   #name\n\"localhost:9092\"    #url\n)\n</code></pre> <pre><code>kafka {\n    customer_kafka {\n        kafka.bootstrap.servers = \"localhost:9092\"\n        kafka.bootstrap.servers = ${?KAFKA_BOOTSTRAP_SERVERS}\n    }\n}\n</code></pre> <p>When defining your schema for pushing data to Kafka, it follows a specific top level schema. An example can be found here . You can define the key, value, headers, partition or topic by following the linked schema.</p>"},{"location":"setup/connection/connection/#jms","title":"JMS","text":"<p>Uses JNDI lookup to send messages to JMS queue. Ensure that the messaging system you are using has your queue/topic registered via JNDI otherwise a connection cannot be created.</p> JavaScalaapplication.conf <pre><code>solace(\n\"customer_solace\",                                      #name\n\"smf://localhost:55554\",                                #url\n\"admin\",                                                #username\n\"admin\",                                                #password\n\"default\",                                              #vpn name\n\"/jms/cf/default\",                                      #connection factory\n\"com.solacesystems.jndi.SolJNDIInitialContextFactory\"   #initial context factory\n)\n</code></pre> <pre><code>solace(\n\"customer_solace\",                                      #name\n\"smf://localhost:55554\",                                #url\n\"admin\",                                                #username\n\"admin\",                                                #password\n\"default\",                                              #vpn name\n\"/jms/cf/default\",                                      #connection factory\n\"com.solacesystems.jndi.SolJNDIInitialContextFactory\"   #initial context 
factory\n)\n</code></pre> <pre><code>jms {\n    customer_solace {\n        initialContextFactory = \"com.solacesystems.jndi.SolJNDIInitialContextFactory\"\n        connectionFactory = \"/jms/cf/default\"\n        url = \"smf://localhost:55555\"\n        url = ${?SOLACE_URL}\n        user = \"admin\"\n        user = ${?SOLACE_USER}\n        password = \"admin\"\n        password = ${?SOLACE_PASSWORD}\n        vpnName = \"default\"\n        vpnName = ${?SOLACE_VPN}\n    }\n}\n</code></pre>"},{"location":"setup/connection/connection/#http","title":"HTTP","text":"<p>Define any username and/or password needed for the HTTP requests. The url is defined in the tasks to allow for generated data to be populated in the url.</p> JavaScalaapplication.conf <pre><code>http(\n\"customer_api\", #name\n\"admin\",        #username\n\"admin\"         #password\n)\n</code></pre> <pre><code>http(\n\"customer_api\", #name\n\"admin\",        #username\n\"admin\"         #password\n)\n</code></pre> <pre><code>http {\n    customer_api {\n        user = \"admin\"\n        user = ${?HTTP_USER}\n        password = \"admin\"\n        password = ${?HTTP_PASSWORD}\n    }\n}\n</code></pre>"},{"location":"setup/deployment/deployment/","title":"Deployment","text":"<p>Two main ways to deploy and run Data Caterer:</p> <ul> <li>Docker</li> <li>Helm</li> </ul>"},{"location":"setup/deployment/deployment/#docker","title":"Docker","text":"<p>To package up your class along with the Data Caterer base image, you can follow the Dockerfile that is created for you here.</p> <p>Then you can run the following:</p> <pre><code>./gradlew clean build\ndocker build -t &lt;my_image_name&gt;:&lt;my_image_tag&gt; .\n</code></pre>"},{"location":"setup/deployment/deployment/#helm","title":"Helm","text":"<p>Link to sample helm on GitHub here</p> <p>Update the configuration to your own data connections and configuration or own image created from above.</p> <pre><code>git clone 
git@github.com:pflooky/data-caterer-example.git\nhelm install data-caterer ./data-caterer-example/helm/data-caterer\n</code></pre>"},{"location":"setup/foreign-key/foreign-key/","title":"Foreign Keys","text":"<p>Foreign keys can be defined to represent the relationships between datasets where values are required to match for particular columns.</p>"},{"location":"setup/foreign-key/foreign-key/#single-column","title":"Single column","text":"<p>Define a column in one data source to match against another column. Below example shows a <code>postgres</code> data source with two tables, <code>accounts</code> and <code>transactions</code> that have a foreign key for <code>account_id</code>.</p> JavaScalaYAML <pre><code>var postgresAcc = postgres(\"my_postgres\", \"jdbc:...\")\n.table(\"public.accounts\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"name\"),\n...\n);\nvar postgresTxn = postgres(postgresAcc)\n.table(\"public.transactions\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"full_name\"),\n...\n);\n\nplan().addForeignKeyRelationship(\npostgresAcc, \"account_id\",\nList.of(Map.entry(postgresTxn, \"account_id\"))\n);\n</code></pre> <pre><code>val postgresAcc = postgres(\"my_postgres\", \"jdbc:...\")\n.table(\"public.accounts\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"name\"),\n...\n)\nval postgresTxn = postgres(postgresAcc)\n.table(\"public.transactions\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"full_name\"),\n...\n)\n\nplan.addForeignKeyRelationship(\npostgresAcc, \"account_id\",\nList(postgresTxn -&gt; \"account_id\")\n)\n</code></pre> <pre><code>---\nname: \"postgres_data\"\nsteps:\n- name: \"accounts\"\ntype: \"postgres\"\noptions:\ndbtable: \"account.accounts\"\nschema:\nfields:\n- name: \"account_id\"\n- name: \"name\"\n- name: \"transactions\"\ntype: \"postgres\"\noptions:\ndbtable: \"account.transactions\"\nschema:\nfields:\n- name: \"account_id\"\n- name: \"full_name\"\n---\nname: 
\"customer_create_plan\"\ndescription: \"Create customers in JDBC\"\ntasks:\n- name: \"postgres_data\"\ndataSourceName: \"my_postgres\"\n\nsinkOptions:\nforeignKeys:\n\"postgres.accounts.account_id\":\n- \"postgres.transactions.account_id\"\n</code></pre>"},{"location":"setup/foreign-key/foreign-key/#multiple-columns","title":"Multiple columns","text":"<p>You may have a scenario where multiple columns need to be aligned. From the same example, we want <code>account_id</code> and <code>name</code> from <code>accounts</code> to match with <code>account_id</code> and <code>full_name</code> to match in <code>transactions</code> respectively.</p> JavaScalaYAML <pre><code>var postgresAcc = postgres(\"my_postgres\", \"jdbc:...\")\n.table(\"public.accounts\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"name\"),\n...\n);\nvar postgresTxn = postgres(postgresAcc)\n.table(\"public.transactions\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"full_name\"),\n...\n);\n\nplan().addForeignKeyRelationship(\npostgresAcc, List.of(\"account_id\", \"name\"),\nList.of(Map.entry(postgresTxn, List.of(\"account_id\", \"full_name\")))\n);\n</code></pre> <pre><code>val postgresAcc = postgres(\"my_postgres\", \"jdbc:...\")\n.table(\"public.accounts\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"name\"),\n...\n)\nval postgresTxn = postgres(postgresAcc)\n.table(\"public.transactions\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"full_name\"),\n...\n)\n\nplan.addForeignKeyRelationship(\npostgresAcc, List(\"account_id\", \"name\"),\nList(postgresTxn -&gt; List(\"account_id\", \"full_name\"))\n)\n</code></pre> <pre><code>---\nname: \"postgres_data\"\nsteps:\n- name: \"accounts\"\ntype: \"postgres\"\noptions:\ndbtable: \"account.accounts\"\nschema:\nfields:\n- name: \"account_id\"\n- name: \"name\"\n- name: \"transactions\"\ntype: \"postgres\"\noptions:\ndbtable: \"account.transactions\"\nschema:\nfields:\n- name: \"account_id\"\n- name: 
\"full_name\"\n---\nname: \"customer_create_plan\"\ndescription: \"Create customers in JDBC\"\ntasks:\n- name: \"postgres_data\"\ndataSourceName: \"my_postgres\"\n\nsinkOptions:\nforeignKeys:\n\"my_postgres.accounts.account_id,name\":\n- \"my_postgres.transactions.account_id,full_name\"\n</code></pre>"},{"location":"setup/foreign-key/foreign-key/#nested-column","title":"Nested column","text":"<p>Your schema structure can have nested fields which can also be referenced as foreign keys. But to do so, you need to create a proxy field that gets omitted from the final saved data.</p> <p>In the example below, the nested <code>customer_details.name</code> field inside the <code>json</code> task needs to match with <code>name</code> from <code>postgres</code>. A new field in the <code>json</code> called <code>_txn_name</code> is used as a temporary column to facilitate the foreign key definition.</p> JavaScalaYAML <pre><code>var postgresAcc = postgres(\"my_postgres\", \"jdbc:...\")\n.table(\"public.accounts\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"name\"),\n...\n);\nvar jsonTask = json(\"my_json\", \"/tmp/json\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"customer_details\")\n.schema(\nfield().name(\"name\").sql(\"_txn_name\"), #nested field will get value from '_txn_name'\n...\n),\nfield().name(\"_txn_name\").omit(true)       #value will not be included in output\n);\n\nplan().addForeignKeyRelationship(\npostgresAcc, List.of(\"account_id\", \"name\"),\nList.of(Map.entry(jsonTask, List.of(\"account_id\", \"_txn_name\")))\n);\n</code></pre> <pre><code>val postgresAcc = postgres(\"my_postgres\", \"jdbc:...\")\n.table(\"public.accounts\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"name\"),\n...\n)\nvar jsonTask = json(\"my_json\", \"/tmp/json\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"customer_details\")\n.schema(\nfield.name(\"name\").sql(\"_txn_name\"), #nested field will get value from '_txn_name'\n...\n), 
field.name(\"_txn_name\").omit(true)       #value will not be included in output\n)\n\nplan.addForeignKeyRelationship(\npostgresAcc, List(\"account_id\", \"name\"),\nList(jsonTask -&gt; List(\"account_id\", \"_txn_name\"))\n)\n</code></pre> <pre><code>---\n#postgres task yaml\nname: \"postgres_data\"\nsteps:\n- name: \"accounts\"\ntype: \"postgres\"\noptions:\ndbtable: \"account.accounts\"\nschema:\nfields:\n- name: \"account_id\"\n- name: \"name\"\n---\n#json task yaml\nname: \"json_data\"\nsteps:\n- name: \"transactions\"\ntype: \"json\"\noptions:\ndbtable: \"account.transactions\"\nschema:\nfields:\n- name: \"account_id\"\n- name: \"_txn_name\"\ngenerator:\noptions:\nomit: true\n- name: \"customer_details\"\nschema:\nfields:\n- name: \"name\"\ngenerator:\ntype: \"sql\"\noptions:\nsql: \"_txn_name\"\n\n---\n#plan yaml\nname: \"customer_create_plan\"\ndescription: \"Create customers in JDBC\"\ntasks:\n- name: \"postgres_data\"\ndataSourceName: \"my_postgres\"\n- name: \"json_data\"\ndataSourceName: \"my_json\"\n\nsinkOptions:\nforeignKeys:\n\"my_postgres.accounts.account_id,name\":\n- \"my_json.transactions.account_id,_txn_name\"\n</code></pre>"},{"location":"setup/generator/count/","title":"Record Count","text":"<p>There are options related to controlling the number of records generated that can help in generating the scenarios or data required.</p>"},{"location":"setup/generator/count/#record-count_1","title":"Record Count","text":"<p>Record count is the simplest option: you define the total number of records you require for that particular step. 
For example, the step below will generate 1000 records for the CSV file.</p> JavaScalaYAML <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(1000);\n</code></pre> <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(1000)\n</code></pre> <pre><code>name: \"csv_file\"\nsteps:\n- name: \"transactions\"\ntype: \"csv\"\noptions:\npath: \"app/src/test/resources/sample/csv/transactions\"\ncount:\nrecords: 1000\n</code></pre>"},{"location":"setup/generator/count/#generated-count","title":"Generated Count","text":"<p>Like most things in Data Caterer, the count can be generated based on some metadata. For example, to generate between 1000 and 2000 records, you could use the configuration below:</p> JavaScalaYAML <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(generator().min(1000).max(2000));\n</code></pre> <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(generator.min(1000).max(2000))\n</code></pre> <pre><code>name: \"csv_file\"\nsteps:\n- name: \"transactions\"\ntype: \"csv\"\noptions:\npath: \"app/src/test/resources/sample/csv/transactions\"\ncount:\ngenerator:\ntype: \"random\"\noptions:\nmin: 1000\nmax: 2000\n</code></pre>"},{"location":"setup/generator/count/#per-column-count","title":"Per Column Count","text":"<p>Defining a per column count allows you to generate records \"per set of columns\". This means that for a given set of columns, it will generate a particular number of records per combination of values for those columns.  </p> <p>For example, when generating transactions relating to a customer, a customer may be defined by columns <code>account_id, name</code>. A number of transactions would be generated per <code>account_id, name</code>.  
</p> <p>You can also use a combination of the above two methods to generate the number of records per column.</p>"},{"location":"setup/generator/count/#records","title":"Records","text":"<p>When defining a base number of records within the <code>perColumn</code> configuration, it translates to creating <code>(count.records * count.recordsPerColumn)</code> records. This is a fixed number of records that will be generated each time, with no variation between runs.</p> <p>In the example below, we have <code>count.records = 1000</code> and <code>count.recordsPerColumn = 2</code>. Which means that <code>1000 * 2 = 2000</code> records will be generated in total.</p> JavaScalaYAML <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(\ncount()\n.records(1000)\n.recordsPerColumn(2, \"account_id\", \"name\")\n);\n</code></pre> <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(\ncount\n.records(1000)\n.recordsPerColumn(2, \"account_id\", \"name\")\n)\n</code></pre> <pre><code>name: \"csv_file\"\nsteps:\n- name: \"transactions\"\ntype: \"csv\"\noptions:\npath: \"app/src/test/resources/sample/csv/transactions\"\ncount:\nrecords: 1000\nperColumn:\nrecords: 2\ncolumnNames:\n- \"account_id\"\n- \"name\"\n</code></pre>"},{"location":"setup/generator/count/#generated","title":"Generated","text":"<p>You can also define a generator for the count per column. 
This can be used in scenarios where you want a variable number of records per set of columns.</p> <p>In the example below, it will generate between <code>(count.records * count.perColumnGenerator.generator.min) = (1000 * 1) = 1000</code> and <code>(count.records * count.perColumnGenerator.generator.max) = (1000 * 2) = 2000</code> records.</p> JavaScalaYAML <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(\ncount()\n.records(1000)\n.recordsPerColumnGenerator(generator().min(1).max(2), \"account_id\", \"name\")\n);\n</code></pre> <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.count(\ncount\n.records(1000)\n.recordsPerColumnGenerator(generator.min(1).max(2), \"account_id\", \"name\")\n)\n</code></pre> <pre><code>name: \"csv_file\"\nsteps:\n- name: \"transactions\"\ntype: \"csv\"\noptions:\npath: \"app/src/test/resources/sample/csv/transactions\"\ncount:\nrecords: 1000\nperColumn:\ncolumnNames:\n- \"account_id\"\n- \"name\"\ngenerator:\ntype: \"random\"\noptions:\nmin: 1\nmax: 2\n</code></pre>"},{"location":"setup/generator/generator/","title":"Data Generators","text":""},{"location":"setup/generator/generator/#data-types","title":"Data Types","text":"<p>Below is a list of all supported data types for generating data:</p> Data Type Spark Data Type Options Description string StringType <code>minLen, maxLen, expression, enableNull</code> integer IntegerType <code>min, max, stddev, mean</code> long LongType <code>min, max, stddev, mean</code> short ShortType <code>min, max, stddev, mean</code> decimal(precision, scale) DecimalType(precision, scale) <code>min, max, stddev, mean</code> double DoubleType <code>min, max, stddev, mean</code> float FloatType <code>min, max, stddev, mean</code> date DateType <code>min, max, enableNull</code> timestamp TimestampType <code>min, max, enableNull</code> boolean BooleanType binary BinaryType <code>minLen, maxLen, enableNull</code> byte ByteType array 
ArrayType <code>arrayMinLen, arrayMaxLen, arrayType</code> _ StructType Implicitly supported when a schema is defined for a field"},{"location":"setup/generator/generator/#options","title":"Options","text":""},{"location":"setup/generator/generator/#all-data-types","title":"All data types","text":"<p>Some options are available to use for all types of data generators. Below is the list along with example and descriptions:</p> Option Default Example Description <code>enableEdgeCase</code> false <code>enableEdgeCase: \"true\"</code> Enable/disable generated data to contain edge cases based on the data type. For example, integer data type has edge cases of (Int.MaxValue, Int.MinValue and 0) <code>edgeCaseProbability</code> 0.0 <code>edgeCaseProb: \"0.1\"</code> Probability of generating a random edge case value if <code>enableEdgeCase</code> is true <code>isUnique</code> false <code>isUnique: \"true\"</code> Enable/disable generated data to be unique for that column. Errors will be thrown when it is unable to generate unique data <code>seed</code> <code>seed: \"1\"</code> Defines the random seed for generating data for that particular column. It will override any seed defined at a global level <code>sql</code> <code>sql: \"CASE WHEN amount &lt; 10 THEN true ELSE false END\"</code> Define any SQL statement for generating that columns value. Computation occurs after all non-SQL fields are generated. This means any columns used in the SQL cannot be based on other SQL generated columns. 
Data type of generated value from SQL needs to match data type defined for the field"},{"location":"setup/generator/generator/#string","title":"String","text":"Option Default Example Description <code>minLen</code> 1 <code>minLen: \"2\"</code> Ensures that all generated strings have at least length <code>minLen</code> <code>maxLen</code> 10 <code>maxLen: \"15\"</code> Ensures that all generated strings have at most length <code>maxLen</code> <code>expression</code> <code>expression: \"#{Name.name}\"</code><code>expression:\"#{Address.city}/#{Demographic.maritalStatus}\"</code> Will generate a string based on the faker expression provided. All possible faker expressions can be found here Expression has to be in format <code>#{&lt;faker expression name&gt;}</code> <code>enableNull</code> false <code>enableNull: \"true\"</code> Enable/disable null values being generated <code>nullProbability</code> 0.0 <code>nullProb: \"0.1\"</code> Probability to generate null values if <code>enableNull</code> is true <p>Edge cases: (\"\", \"\\n\", \"\\r\", \"\\t\", \" \", \"\\u0000\", \"\\ufff\", \"\u0130yi g\u00fcnler\", \"\u0421\u043f\u0430\u0441\u0438\u0431\u043e\", \"\u039a\u03b1\u03bb\u03b7\u03bc\u03ad\u03c1\u03b1\", \"\u0635\u0628\u0627\u062d \u0627\u0644\u062e\u064a\u0631\", \" F\u00f6rl\u00e5t\", \"\u4f60\u597d\u5417\", \"Nh\u00e0 v\u1ec7 sinh \u1edf \u0111\u00e2u\", \"\u3053\u3093\u306b\u3061\u306f\", \"\u0928\u092e\u0938\u094d\u0924\u0947\", \"\u0532\u0561\u0580\u0565\u0582\", \"\u0417\u0434\u0440\u0430\u0432\u0435\u0439\u0442\u0435\")</p>"},{"location":"setup/generator/generator/#sample","title":"Sample","text":"JavaScalaYAML <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield()\n.name(\"name\")\n.type(StringType.instance())\n.expression(\"#{Name.name}\")\n.enableNull(true)\n.nullProbability(0.1)\n.minLength(4)\n.maxLength(20)\n);\n</code></pre> <pre><code>csv(\"transactions\", 
\"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield\n.name(\"name\")\n.`type`(StringType)\n.expression(\"#{Name.name}\")\n.enableNull(true)\n.nullProbability(0.1)\n.minLength(4)\n.maxLength(20)\n)\n</code></pre> <pre><code>name: \"csv_file\"\nsteps:\n- name: \"transactions\"\ntype: \"csv\"\noptions:\npath: \"app/src/test/resources/sample/csv/transactions\"\nschema:\nfields:\n- name: \"name\"\ntype: \"string\"\ngenerator:\noptions:\nexpression: \"#{Name.name}\"\nenableNull: true\nnullProb: 0.1\nminLength: 4\nmaxLength: 20\n</code></pre>"},{"location":"setup/generator/generator/#numeric","title":"Numeric","text":"<p>For all the numeric data types, there are 4 options to choose from: min, max and maxValue. Generally speaking, you only need to define one of min or minValue, similarly with max or maxValue. The reason why there are 2 options for each is because of when metadata is automatically gathered, we gather the statistics of the observed min and max values. Also, it will attempt to gather any restriction on the min or max value as defined by the data source (i.e. 
max value as per database type).</p>"},{"location":"setup/generator/generator/#integerlongshort","title":"Integer/Long/Short","text":"Option Default Example Description <code>min</code> 0 <code>min: \"2\"</code> Ensures that all generated values are greater than or equal to <code>min</code> <code>max</code> 1000 <code>max: \"25\"</code> Ensures that all generated values are less than or equal to <code>max</code> <code>stddev</code> 1.0 <code>stddev: \"2.0\"</code> Standard deviation for normal distributed data <code>mean</code> <code>max - min</code> <code>mean: \"5.0\"</code> Mean for normal distributed data <p>Edge cases Integer: (2147483647, -2147483648, 0) Edge cases Long: (9223372036854775807, -9223372036854775808, 0) Edge cases Short: (32767, -32768, 0)</p>"},{"location":"setup/generator/generator/#sample_1","title":"Sample","text":"JavaScalaYAML <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"year\").type(IntegerType.instance()).min(2020).max(2023),\nfield().name(\"customer_id\").type(LongType.instance()),\nfield().name(\"customer_group\").type(ShortType.instance())\n);\n</code></pre> <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"year\").`type`(IntegerType).min(2020).max(2023),\nfield.name(\"customer_id\").`type`(LongType),\nfield.name(\"customer_group\").`type`(ShortType)\n)\n</code></pre> <pre><code>name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"year\"\ntype: \"integer\"\ngenerator:\noptions:\nmin: 2020\nmax: 2023\n- name: \"customer_id\"\ntype: \"long\"\n- name: \"customer_group\"\ntype: \"short\"\n</code></pre>"},{"location":"setup/generator/generator/#decimal","title":"Decimal","text":"Option Default Example Description <code>min</code> 0 <code>min: \"2\"</code> Ensures that all generated values are greater than or equal to <code>min</code> <code>max</code> 1000 <code>max: \"25\"</code> 
Ensures that all generated values are less than or equal to <code>max</code> <code>stddev</code> 1.0 <code>stddev: \"2.0\"</code> Standard deviation for normal distributed data <code>mean</code> <code>max - min</code> <code>mean: \"5.0\"</code> Mean for normal distributed data <code>numericPrecision</code> 10 <code>precision: \"25\"</code> The maximum number of digits <code>numericScale</code> 0 <code>scale: \"25\"</code> The number of digits on the right side of the decimal point (has to be less than or equal to precision) <p>Edge cases Decimal: (9223372036854775807, -9223372036854775808, 0)</p>"},{"location":"setup/generator/generator/#sample_2","title":"Sample","text":"JavaScalaYAML <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"balance\").type(DecimalType.instance()).numericPrecision(10).numericScale(5)\n);\n</code></pre> <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"balance\").`type`(DecimalType).numericPrecision(10).numericScale(5)\n)\n</code></pre> <pre><code>name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"balance\"\ntype: \"decimal\"\ngenerator:\noptions:\nprecision: 10\nscale: 5\n</code></pre>"},{"location":"setup/generator/generator/#doublefloat","title":"Double/Float","text":"Option Default Example Description <code>min</code> 0.0 <code>min: \"2.1\"</code> Ensures that all generated values are greater than or equal to <code>min</code> <code>max</code> 1000.0 <code>max: \"25.9\"</code> Ensures that all generated values are less than or equal to <code>max</code> <code>stddev</code> 1.0 <code>stddev: \"2.0\"</code> Standard deviation for normal distributed data <code>mean</code> <code>max - min</code> <code>mean: \"5.0\"</code> Mean for normal distributed data <p>Edge cases Double: (+infinity, 1.7976931348623157e+308, 4.9e-324, 0.0, -0.0, -1.7976931348623157e+308, -infinity, NaN) Edge cases 
Float: (+infinity, 3.4028235e+38, 1.4e-45, 0.0, -0.0, -3.4028235e+38, -infinity, NaN)</p>"},{"location":"setup/generator/generator/#sample_3","title":"Sample","text":"JavaScalaYAML <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"amount\").type(DoubleType.instance())\n);\n</code></pre> <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"amount\").`type`(DoubleType)\n)\n</code></pre> <pre><code>name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"amount\"\ntype: \"double\"\n</code></pre>"},{"location":"setup/generator/generator/#date","title":"Date","text":"Option Default Example Description <code>min</code> now() - 365 days <code>min: \"2023-01-31\"</code> Ensures that all generated values are greater than or equal to <code>min</code> <code>max</code> now() <code>max: \"2023-12-31\"</code> Ensures that all generated values are less than or equal to <code>max</code> <code>enableNull</code> false <code>enableNull: \"true\"</code> Enable/disable null values being generated <code>nullProbability</code> 0.0 <code>nullProb: \"0.1\"</code> Probability to generate null values if <code>enableNull</code> is true <p>Edge cases: (0001-01-01, 1582-10-15, 1970-01-01, 9999-12-31) (reference)</p>"},{"location":"setup/generator/generator/#sample_4","title":"Sample","text":"JavaScalaYAML <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"created_date\").type(DateType.instance()).min(java.sql.Date.valueOf(\"2020-01-01\"))\n);\n</code></pre> <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"created_date\").`type`(DateType).min(java.sql.Date.valueOf(\"2020-01-01\"))\n)\n</code></pre> <pre><code>name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"created_date\"\ntype: 
\"date\"\ngenerator:\noptions:\nmin: \"2020-01-01\"\n</code></pre>"},{"location":"setup/generator/generator/#timestamp","title":"Timestamp","text":"Option Default Example Description <code>min</code> now() - 365 days <code>min: \"2023-01-31 23:10:10\"</code> Ensures that all generated values are greater than or equal to <code>min</code> <code>max</code> now() <code>max: \"2023-12-31 23:10:10\"</code> Ensures that all generated values are less than or equal to <code>max</code> <code>enableNull</code> false <code>enableNull: \"true\"</code> Enable/disable null values being generated <code>nullProbability</code> 0.0 <code>nullProb: \"0.1\"</code> Probability to generate null values if <code>enableNull</code> is true <p>Edge cases: (0001-01-01 00:00:00, 1582-10-15 23:59:59, 1970-01-01 00:00:00, 9999-12-31 23:59:59)</p>"},{"location":"setup/generator/generator/#sample_5","title":"Sample","text":"JavaScalaYAML <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"created_time\").type(TimestampType.instance()).min(java.sql.Timestamp.valueOf(\"2020-01-01 00:00:00\"))\n);\n</code></pre> <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"created_time\").`type`(TimestampType).min(java.sql.Timestamp.valueOf(\"2020-01-01 00:00:00\"))\n)\n</code></pre> <pre><code>name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"created_time\"\ntype: \"timestamp\"\ngenerator:\noptions:\nmin: \"2020-01-01 00:00:00\"\n</code></pre>"},{"location":"setup/generator/generator/#binary","title":"Binary","text":"Option Default Example Description <code>minLen</code> 1 <code>minLen: \"2\"</code> Ensures that all generated array of bytes have at least length <code>minLen</code> <code>maxLen</code> 20 <code>maxLen: \"15\"</code> Ensures that all generated array of bytes have at most length <code>maxLen</code> <code>enableNull</code> false <code>enableNull: 
\"true\"</code> Enable/disable null values being generated <code>nullProbability</code> 0.0 <code>nullProb: \"0.1\"</code> Probability to generate null values if <code>enableNull</code> is true <p>Edge cases: (\"\", \"\\n\", \"\\r\", \"\\t\", \" \", \"\\u0000\", \"\\ufff\", -128, 127)</p>"},{"location":"setup/generator/generator/#sample_6","title":"Sample","text":"JavaScalaYAML <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"payload\").type(BinaryType.instance())\n);\n</code></pre> <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"payload\").`type`(BinaryType)\n)\n</code></pre> <pre><code>name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"payload\"\ntype: \"binary\"\n</code></pre>"},{"location":"setup/generator/generator/#array","title":"Array","text":"Option Default Example Description <code>arrayMinLen</code> 0 <code>arrayMinLen: \"2\"</code> Ensures that all generated arrays have at least length <code>arrayMinLen</code> <code>arrayMaxLen</code> 5 <code>arrayMaxLen: \"15\"</code> Ensures that all generated arrays have at most length <code>arrayMaxLen</code> <code>arrayType</code> <code>arrayType: \"double\"</code> Inner data type of the array. Optional when using Java/Scala API. 
Allows for nested data types to be defined like struct <code>enableNull</code> false <code>enableNull: \"true\"</code> Enable/disable null values being generated <code>nullProbability</code> 0.0 <code>nullProb: \"0.1\"</code> Probability to generate null values if <code>enableNull</code> is true"},{"location":"setup/generator/generator/#sample_7","title":"Sample","text":"JavaScalaYAML <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield().name(\"last_5_amounts\").type(ArrayType.instance()).arrayType(\"double\")\n);\n</code></pre> <pre><code>csv(\"transactions\", \"app/src/test/resources/sample/csv/transactions\")\n.schema(\nfield.name(\"last_5_amounts\").`type`(ArrayType).arrayType(\"double\")\n)\n</code></pre> <pre><code>name: \"csv_file\"\nsteps:\n- name: \"transactions\"\n...\nschema:\nfields:\n- name: \"last_5_amounts\"\ntype: \"array&lt;double&gt;\"\n</code></pre>"},{"location":"setup/generator/report/","title":"Report","text":"<p>Data Caterer can be configured to produce a report of the data generated to help users understand what was run, how much data was generated, where it was generated, validation results and any associated metadata. </p>"},{"location":"setup/generator/report/#sample","title":"Sample","text":"<p>Once run, it will produce a report like this.</p>"},{"location":"setup/guide/","title":"Guides","text":"<p>Below is a list of guides you can follow to create your data generation for your use case.</p> <p>For any of the paid tier guides, you can use the trial version of the app to try it out. 
Details on how to get the trial can be found here.</p>"},{"location":"setup/guide/#scenarios","title":"Scenarios","text":"<ul> <li>First Data Generation - If you are new, this is the place to start</li> <li>Multiple Records Per Column Value - How you can generate multiple records per set of columns</li> <li>Foreign Keys Across Data Sources - Generate matching values across generated data sets</li> <li>Data Validations - Run data validations after generating data</li> <li>Auto Generate From Data Connection - Automatically generating data from just defining data sources</li> <li>Delete Generated Data - Delete the generated data whilst leaving other data</li> <li>Generate Batch and Event Data - Generate matching batch and event data</li> </ul>"},{"location":"setup/guide/#data-sources","title":"Data Sources","text":"<ul> <li>Files (CSV, JSON, ORC, Parquet) - Generate data for popular file formats</li> <li>Postgres - JDBC Postgres tables</li> <li>Cassandra - Cassandra tables</li> <li>Kafka - Kafka topics</li> <li>Solace - Solace messages</li> <li>Marquez - Generate data based on metadata in Marquez</li> <li>OpenMetadata - Generate data based on metadata in OpenMetadata</li> <li>HTTP - HTTP requests</li> <li>Files (Fixed width) - (Soon to document) A variant of CSV but with no separator</li> <li>MySql - (Soon to document) JDBC MySql tables</li> </ul>"},{"location":"setup/guide/#yaml-files","title":"YAML Files","text":""},{"location":"setup/guide/#base-concept","title":"Base Concept","text":"<p>The execution of the data generator is based on the concept of plans and tasks. A plan represents the set of tasks that need to be executed, along with other information that spans across tasks, such as foreign keys between data sources. A task represents the component(s) of a data source and its associated metadata so that it understands what the data should look like and how many steps (sub data sources) there are (e.g. tables in a database, topics in Kafka). 
Tasks can define one or more steps.</p>"},{"location":"setup/guide/#plan","title":"Plan","text":""},{"location":"setup/guide/#foreign-keys","title":"Foreign Keys","text":"<p>Define foreign keys across data sources in your plan to ensure generated data can match Link to associated task 1 Link to associated task 2</p>"},{"location":"setup/guide/#task","title":"Task","text":"Data Source Type Data Source Sample Task Notes Database Postgres Sample Database MySQL Sample Database Cassandra Sample File CSV Sample File JSON Sample Contains nested schemas and use of SQL for generated values File Parquet Sample Partition by year column Kafka Kafka Sample Specific base schema to be used, define headers, key, value, etc. JMS Solace Sample JSON formatted message HTTP PUT Sample JSON formatted PUT body"},{"location":"setup/guide/#configuration","title":"Configuration","text":"<p>Basic configuration</p>"},{"location":"setup/guide/#docker-compose","title":"Docker-compose","text":"<p>To see how it runs against different data sources, you can run using <code>docker-compose</code> and set <code>DATA_SOURCE</code> like below</p> <pre><code>./gradlew build\ncd docker\nDATA_SOURCE=postgres docker-compose up -d datacaterer\n</code></pre> <p>Can set it to one of the following:</p> <ul> <li>postgres</li> <li>mysql</li> <li>cassandra</li> <li>solace</li> <li>kafka</li> <li>http</li> </ul>"},{"location":"setup/guide/data-source/cassandra/","title":"Cassandra","text":"<p>Info</p> <p>Writing data to Cassandra is a paid feature. Try the free trial here.</p> <p>Creating a data generator for Cassandra. 
You will build a Docker image that will be able to populate data in Cassandra for the tables you configure.</p>"},{"location":"setup/guide/data-source/cassandra/#requirements","title":"Requirements","text":"<ul> <li>20 minutes</li> <li>Git</li> <li>Gradle</li> <li>Docker</li> <li>Cassandra</li> </ul>"},{"location":"setup/guide/data-source/cassandra/#get-started","title":"Get Started","text":"<p>First, we will clone the data-caterer-example repo which will already have the base project setup required.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\n</code></pre> <p>If you already have a Cassandra instance running, you can skip to this step.</p>"},{"location":"setup/guide/data-source/cassandra/#cassandra-setup","title":"Cassandra Setup","text":"<p>Next, let's make sure you have an instance of Cassandra up and running in your local environment. This will make it easy for us to iterate and check our changes.</p> <pre><code>cd docker\ndocker-compose up -d cassandra\n</code></pre>"},{"location":"setup/guide/data-source/cassandra/#permissions","title":"Permissions","text":"<p>Let's make a new user that has the required permissions needed to push data into the Cassandra tables we want.</p> CQL Permission Statements <pre><code>GRANT INSERT ON &lt;schema&gt;.&lt;table&gt; TO data_caterer_user;\n</code></pre> <p>Following permissions are required when enabling <code>configuration.enableGeneratePlanAndTasks(true)</code> as it will gather metadata information about tables and columns from the below tables.</p> CQL Permission Statements <pre><code>GRANT SELECT ON system_schema.tables TO data_caterer_user;\nGRANT SELECT ON system_schema.columns TO data_caterer_user;\n</code></pre>"},{"location":"setup/guide/data-source/cassandra/#plan-setup","title":"Plan Setup","text":"<p>Create a new Java or Scala class.</p> <ul> <li>Java: <code>src/main/java/com/github/pflooky/plan/MyAdvancedCassandraJavaPlan.java</code></li> <li>Scala: 
<code>src/main/scala/com/github/pflooky/plan/MyAdvancedCassandraPlan.scala</code></li> </ul> <p>Make sure your class extends <code>PlanRun</code>.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.java.api.PlanRun;\n\npublic class MyAdvancedCassandraJavaPlan extends PlanRun {\n}\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.PlanRun\n\nclass MyAdvancedCassandraPlan extends PlanRun {\n}\n</code></pre> <p>This class defines where we need to define all of our configurations for generating data. There are helper variables and methods defined to make it simple and easy to use.</p>"},{"location":"setup/guide/data-source/cassandra/#connection-configuration","title":"Connection Configuration","text":"<p>Within our class, we can start by defining the connection properties to connect to Cassandra.</p> JavaScala <pre><code>var accountTask = cassandra(\n\"customer_cassandra\",   //name\n\"localhost:9042\",       //url\n\"cassandra\",            //username\n\"cassandra\",            //password\nMap.of()                //optional additional connection options\n)\n</code></pre> <p>Additional options such as SSL configuration, etc can be found here.</p> <pre><code>val accountTask = cassandra(\n\"customer_cassandra\",   //name\n\"localhost:9042\",       //url\n\"cassandra\",            //username\n\"cassandra\",            //password\nMap()                   //optional additional connection options\n)\n</code></pre> <p>Additional options such as SSL configuration, etc can be found here.</p>"},{"location":"setup/guide/data-source/cassandra/#schema","title":"Schema","text":"<p>Let's create a task for inserting data into the <code>account.accounts</code> and <code>account.account_status_history</code> tables as defined under<code>docker/data/cql/customer.cql</code>. This table should already be setup for you if you followed this step. 
We can check if the table is setup already via the following command:</p> <pre><code>docker exec host.docker.internal cqlsh -e 'describe account.accounts; describe account.account_status_history;'\n</code></pre> <p>Here we should see some output that looks like the below. This tells us what schema we need to follow when generating data. We need to define that alongside any metadata that is useful to add constraints on what are possible values the generated data should contain.</p> <pre><code>CREATE TABLE account.accounts (\naccount_id text PRIMARY KEY,\n    amount double,\n    created_by text,\n    name text,\n    open_time timestamp,\n    status text\n)...\n\nCREATE TABLE account.account_status_history (\naccount_id text,\n    eod_date date,\n    status text,\n    updated_by text,\n    updated_time timestamp,\n    PRIMARY KEY (account_id, eod_date)\n)...\n</code></pre> <p>Trimming the connection details to work with the docker-compose Cassandra, we have a base Cassandra connection to define the table and schema required. Let's define each field along with their corresponding data type. You will notice that the <code>text</code> fields do not have a data type defined. 
This is because the default data type is <code>StringType</code>, which corresponds to <code>text</code> in Cassandra.</p> JavaScala <pre><code>{\nvar accountTask = cassandra(\"customer_cassandra\", \"host.docker.internal:9042\")\n.table(\"account\", \"accounts\")\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"amount\").type(DoubleType.instance()),\nfield().name(\"created_by\"),\nfield().name(\"name\"),\nfield().name(\"open_time\").type(TimestampType.instance()),\nfield().name(\"status\")\n);\n}\n</code></pre> <pre><code>val accountTask = cassandra(\"customer_cassandra\", \"host.docker.internal:9042\")\n.table(\"account\", \"accounts\")\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"amount\").`type`(DoubleType),\nfield.name(\"created_by\"),\nfield.name(\"name\"),\nfield.name(\"open_time\").`type`(TimestampType),\nfield.name(\"status\")\n)\n</code></pre>"},{"location":"setup/guide/data-source/cassandra/#field-metadata","title":"Field Metadata","text":"<p>We could stop here and generate random data for the accounts table. But wouldn't it be more useful if we produced data that is closer to the structure of the data that would come in production? We can do this by defining various metadata that add guidelines that the data generator will understand when generating data.</p>"},{"location":"setup/guide/data-source/cassandra/#account_id","title":"account_id","text":"<p><code>account_id</code> follows a particular pattern where it starts with <code>ACC</code> and has 8 digits after it. This can be defined via a regex like below. 
Alongside, we also mention that it is the primary key to ensure that unique values are generated.</p> JavaScala <pre><code>field().name(\"account_id\").regex(\"ACC[0-9]{8}\").primaryKey(true),\n</code></pre> <pre><code>field.name(\"account_id\").regex(\"ACC[0-9]{8}\").primaryKey(true),\n</code></pre>"},{"location":"setup/guide/data-source/cassandra/#amount","title":"amount","text":"<p>The numbers for <code>amount</code> shouldn't be too large, so we can define a min and max for the generated numbers to be between <code>1</code> and <code>1000</code>.</p> JavaScala <pre><code>field().name(\"amount\").type(DoubleType.instance()).min(1).max(1000),\n</code></pre> <pre><code>field.name(\"amount\").`type`(DoubleType).min(1).max(1000),\n</code></pre>"},{"location":"setup/guide/data-source/cassandra/#name","title":"name","text":"<p><code>name</code> is a string that also follows a certain pattern, so we could also define a regex, but here we will choose to leverage the DataFaker library and create an <code>expression</code> to generate realistic-looking names. All possible faker expressions can be found here.</p> JavaScala <pre><code>field().name(\"name\").expression(\"#{Name.name}\"),\n</code></pre> <pre><code>field.name(\"name\").expression(\"#{Name.name}\"),\n</code></pre>"},{"location":"setup/guide/data-source/cassandra/#open_time","title":"open_time","text":"<p><code>open_time</code> is a timestamp that we want to have a value greater than a specific date. 
We can define a min date by using <code>java.sql.Date</code> like below.</p> JavaScala <pre><code>field().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\n</code></pre> <pre><code>field.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\n</code></pre>"},{"location":"setup/guide/data-source/cassandra/#status","title":"status","text":"<p><code>status</code> is a field that can only take one of four values: <code>open, closed, suspended or pending</code>.</p> JavaScala <pre><code>field().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n</code></pre> <pre><code>field.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n</code></pre>"},{"location":"setup/guide/data-source/cassandra/#created_by","title":"created_by","text":"<p><code>created_by</code> is a field that is based on the <code>status</code> field where it follows the logic: <code>if status is open or closed, then it is created_by eod else created_by event</code>. 
This can be achieved by defining a SQL expression like below.</p> JavaScala <pre><code>field().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\n</code></pre> <pre><code>field.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\n</code></pre> <p>Putting all the fields together, our class should now look like this.</p> JavaScala <pre><code>var accountTask = cassandra(\"customer_cassandra\", \"host.docker.internal:9042\")\n.table(\"account\", \"accounts\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\").primaryKey(true),\nfield().name(\"amount\").type(DoubleType.instance()).min(1).max(1000),\nfield().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n);\n</code></pre> <pre><code>val accountTask = cassandra(\"customer_cassandra\", \"host.docker.internal:9042\")\n.table(\"account\", \"accounts\")\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\").primaryKey(true),\nfield.name(\"amount\").`type`(DoubleType).min(1).max(1000),\nfield.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n)\n</code></pre>"},{"location":"setup/guide/data-source/cassandra/#additional-configurations","title":"Additional Configurations","text":"<p>At the end of data generation, a report gets generated that summarises the actions it performed. We can control the output folder of that report via configurations. 
We will also enable the unique check to ensure any unique fields will have unique values generated.</p> JavaScala <pre><code>var config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true);\n</code></pre> <pre><code>val config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true)\n</code></pre>"},{"location":"setup/guide/data-source/cassandra/#execute","title":"Execute","text":"<p>To tell Data Caterer that we want to run with the configurations along with the <code>accountTask</code>, we have to call <code>execute</code>. So our full plan run will look like this.</p> JavaScala <pre><code>public class MyAdvancedCassandraJavaPlan extends PlanRun {\n{\nvar accountTask = cassandra(\"customer_cassandra\", \"host.docker.internal:9042\")\n.table(\"account\", \"accounts\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\").primaryKey(true),\nfield().name(\"amount\").type(DoubleType.instance()).min(1).max(1000),\nfield().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n);\n\nvar config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true);\n\nexecute(config, accountTask);\n}\n}\n</code></pre> <pre><code>class MyAdvancedCassandraPlan extends PlanRun {\nval accountTask = cassandra(\"customer_cassandra\", \"host.docker.internal:9042\")\n.table(\"account\", \"accounts\")\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\").primaryKey(true),\nfield.name(\"amount\").`type`(DoubleType).min(1).max(1000),\nfield.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' 
END\"),\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n)\n\nval config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true)\n\nexecute(config, accountTask)\n}\n</code></pre>"},{"location":"setup/guide/data-source/cassandra/#run","title":"Run","text":"<p>Now we can run via the script <code>./run.sh</code> that is in the top level directory of the <code>data-caterer-example</code> to run the class we just created.</p> <pre><code>./run.sh\n#input class MyAdvancedCassandraJavaPlan or MyAdvancedCassandraPlan\n#after completing\ndocker exec docker-cassandraserver-1 cqlsh -e 'select count(1) from account.accounts;select * from account.accounts limit 10;'\n</code></pre> <p>Your output should look like this.</p> <pre><code> count\n-------\n  1000\n\n(1 rows)\n\nWarnings :\nAggregation query used without partition key\n\n\n account_id  | amount    | created_by         | name                   | open_time                       | status\n-------------+-----------+--------------------+------------------------+---------------------------------+-----------\n ACC13554145 | 917.00418 | zb CVvbBTTzitjo5fK |          Jan Sanford I | 2023-06-21 21:50:10.463000+0000 | suspended\n ACC19154140 |  46.99177 |             VH88H9 |       Clyde Bailey PhD | 2023-07-18 11:33:03.675000+0000 |      open\n ACC50587836 |  774.9872 |         GENANwPm t |           Sang Monahan | 2023-03-21 00:16:53.308000+0000 |    closed\n ACC67619387 | 452.86706 |       5msTpcBLStTH |         Jewell Gerlach | 2022-10-18 19:13:07.606000+0000 | suspended\n ACC69889784 |  14.69298 |           WDmOh7NT |          Dale Schulist | 2022-10-25 12:10:52.239000+0000 | suspended\n ACC41977254 |  51.26492 |          J8jAKzvj2 |           Norma Nienow | 2023-08-19 18:54:39.195000+0000 | suspended\n 
ACC40932912 | 349.68067 |   SLcJgKZdLp5ALMyg | Vincenzo Considine III | 2023-05-16 00:22:45.991000+0000 |    closed\n ACC20642011 | 658.40713 |          clyZRD4fI |  Lannie McLaughlin DDS | 2023-05-11 23:14:30.249000+0000 |      open\n ACC74962085 | 970.98218 |       ZLETTSnj4NpD |          Ima Jerde DVM | 2023-05-07 10:01:56.218000+0000 |   pending\n ACC72848439 | 481.64267 |                 cc |        Kyla Deckow DDS | 2023-08-16 13:28:23.362000+0000 | suspended\n\n(10 rows)\n</code></pre> <p>Also check the HTML report, found at <code>docker/sample/report/index.html</code>, that gets generated to get an overview of what was executed.</p> <p></p>"},{"location":"setup/guide/data-source/http/","title":"HTTP Source","text":"<p>Info</p> <p>Generating data based on OpenAPI/Swagger document and pushing to HTTP endpoint is a paid feature. Try the free trial here.</p> <p>Creating a data generator based on an OpenAPI/Swagger document.</p> <p></p>"},{"location":"setup/guide/data-source/http/#requirements","title":"Requirements","text":"<ul> <li>10 minutes</li> <li>Git</li> <li>Gradle</li> <li>Docker</li> </ul>"},{"location":"setup/guide/data-source/http/#get-started","title":"Get Started","text":"<p>First, we will clone the data-caterer-example repo which will already have the base project setup required.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\n</code></pre>"},{"location":"setup/guide/data-source/http/#http-setup","title":"HTTP Setup","text":"<p>We will be using the http-bin docker image to help simulate a service with HTTP endpoints.</p> <p>Start it via:</p> <pre><code>cd docker\ndocker-compose up -d http\ndocker ps\n</code></pre>"},{"location":"setup/guide/data-source/http/#plan-setup","title":"Plan Setup","text":"<p>Create a new Java or Scala class.</p> <ul> <li>Java: <code>src/main/java/com/github/pflooky/plan/MyAdvancedHttpJavaPlanRun.java</code></li> <li>Scala: 
<code>src/main/scala/com/github/pflooky/plan/MyAdvancedHttpPlanRun.scala</code></li> </ul> <p>Make sure your class extends <code>PlanRun</code>.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.java.api.PlanRun;\n...\n\npublic class MyAdvancedHttpJavaPlanRun extends PlanRun {\n{\nvar conf = configuration().enableGeneratePlanAndTasks(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n}\n}\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.PlanRun\n...\n\nclass MyAdvancedHttpPlanRun extends PlanRun {\nval conf = configuration.enableGeneratePlanAndTasks(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n}\n</code></pre> <p>We will enable generate plan and tasks so that we can read from external sources for metadata and save the reports under a folder we can easily access.</p>"},{"location":"setup/guide/data-source/http/#schema","title":"Schema","text":"<p>We can point the schema of a data source to a OpenAPI/Swagger document or URL. For this example, we will use the OpenAPI document found under <code>docker/mount/http/petstore.json</code> in the data-caterer-example repo. This is a simplified version of the original OpenAPI spec that can be found here.</p> <p>We have kept the following endpoints to test out:</p> <ul> <li>GET /pets - get all pets</li> <li>POST /pets - create a new pet</li> <li>GET /pets/{id} - get a pet by id</li> <li>DELETE /pets/{id} - delete a pet by id</li> </ul> JavaScala <pre><code>var httpTask = http(\"my_http\")\n.schema(metadataSource().openApi(\"/opt/app/mount/http/petstore.json\"))\n.count(count().records(2));\n</code></pre> <pre><code>val httpTask = http(\"my_http\")\n.schema(metadataSource.openApi(\"/opt/app/mount/http/petstore.json\"))\n.count(count.records(2))\n</code></pre> <p>The above defines that the schema will come from an OpenAPI document found on the pathway defined. 
It will then generate 2 requests per request method and endpoint combination.</p>"},{"location":"setup/guide/data-source/http/#run","title":"Run","text":"<p>Let's try run and see what happens.</p> <pre><code>cd ..\n./run.sh\n#input class MyAdvancedHttpJavaPlanRun or MyAdvancedHttpPlanRun\n#after completing\ndocker logs -f docker-http-1\n</code></pre> <p>It should look something like this.</p> <pre><code>172.21.0.1 [06/Nov/2023:01:06:53 +0000] GET /anything/pets?tags%3DeXQxFUHVja+EYm%26limit%3D33895 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:06:53 +0000] GET /anything/pets?tags%3DSXaFvAqwYGF%26tags%3DjdNRFONA%26limit%3D40975 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:06:56 +0000] POST /anything/pets HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:06:56 +0000] POST /anything/pets HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:07:00 +0000] GET /anything/pets/kbH8D7rDuq HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:07:00 +0000] GET /anything/pets/REsa0tnu7dvekGDvxR HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:07:03 +0000] DELETE /anything/pets/EqrOr1dHFfKUjWb HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:07:03 +0000] DELETE /anything/pets/7WG7JHPaNxP HTTP/1.1 200 Host: host.docker.internal}\n</code></pre> <p>Looks like we have some data now. But we can do better and add some enhancements to it.</p>"},{"location":"setup/guide/data-source/http/#foreign-keys","title":"Foreign keys","text":"<p>The four different requests that get sent could have the same <code>id</code> passed across to each of them if we define a foreign key relationship. This will make it more realistic to a real life scenario as pets get created and queried by a particular <code>id</code> value. We note that the <code>id</code> value is first used when a pet is created in the body of the POST request. 
Then it gets used as a path parameter in the DELETE and GET requests.</p> <p>To link them all together, we must follow a particular pattern when referring to request body, query parameter or path parameter columns.</p> HTTP Type Column Prefix Example Request Body <code>bodyContent</code> <code>bodyContent.id</code> Path Parameter <code>pathParam</code> <code>pathParamid</code> Query Parameter <code>queryParam</code> <code>queryParamid</code> Header <code>header</code> <code>headerContent_Type</code> <p>Also note, that when creating a foreign field definition for a HTTP data source, to refer to a specific endpoint and method, we have to follow the pattern of <code>{http method}{http path}</code>. For example, <code>POST/pets</code>. Let's apply this knowledge to link all the <code>id</code> values together.</p> JavaScala <pre><code>var myPlan = plan().addForeignKeyRelationship(\nforeignField(\"my_http\", \"POST/pets\", \"bodyContent.id\"),     //source of foreign key value\nforeignField(\"my_http\", \"DELETE/pets/{id}\", \"pathParamid\"),\nforeignField(\"my_http\", \"GET/pets/{id}\", \"pathParamid\")\n);\n\nexecute(myPlan, conf, httpTask);\n</code></pre> <pre><code>val myPlan = plan.addForeignKeyRelationship(\nforeignField(\"my_http\", \"POST/pets\", \"bodyContent.id\"),     //source of foreign key value\nforeignField(\"my_http\", \"DELETE/pets/{id}\", \"pathParamid\"),\nforeignField(\"my_http\", \"GET/pets/{id}\", \"pathParamid\")\n)\n\nexecute(myPlan, conf, httpTask)\n</code></pre> <p>Let's test it out by running it again</p> <pre><code>./run.sh\n#input class MyAdvancedHttpJavaPlanRun or MyAdvancedHttpPlanRun\ndocker logs -f docker-http-1\n</code></pre> <pre><code>172.21.0.1 [06/Nov/2023:01:33:59 +0000] GET /anything/pets?limit%3D45971 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:00 +0000] GET /anything/pets?limit%3D62015 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:04 +0000] POST /anything/pets HTTP/1.1 200 
Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:05 +0000] POST /anything/pets HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:09 +0000] DELETE /anything/pets/5e HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:09 +0000] DELETE /anything/pets/IHPm2 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:14 +0000] GET /anything/pets/IHPm2 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:34:14 +0000] GET /anything/pets/5e HTTP/1.1 200 Host: host.docker.internal}\n</code></pre> <p>Now we have the same <code>id</code> values being produced across the POST, DELETE and GET requests! What if we knew that the <code>id</code> values should follow a particular pattern?</p>"},{"location":"setup/guide/data-source/http/#custom-metadata","title":"Custom metadata","text":"<p>So given that we have defined a foreign key where the root of the foreign key values is from the POST request, we can update the metadata of the <code>id</code> column for the POST request and it will proliferate to the other endpoints as well. 
Given the <code>id</code> column is a nested column as noted in the foreign key, we can alter its metadata via the following:</p> JavaScala <pre><code>var httpTask = http(\"my_http\")\n.schema(metadataSource().openApi(\"/opt/app/mount/http/petstore.json\"))\n.schema(field().name(\"bodyContent\").schema(field().name(\"id\").regex(\"ID[0-9]{8}\")))\n.count(count().records(2));\n</code></pre> <pre><code>val httpTask = http(\"my_http\")\n.schema(metadataSource.openApi(\"/opt/app/mount/http/petstore.json\"))\n.schema(field.name(\"bodyContent\").schema(field.name(\"id\").regex(\"ID[0-9]{8}\")))\n.count(count.records(2))\n</code></pre> <p>We first get the column <code>bodyContent</code>, then get the nested schema and get the column <code>id</code> and add metadata stating that <code>id</code> should follow the pattern <code>ID[0-9]{8}</code>.</p> <p>Let's run it again, and hopefully we should see some proper ID values.</p> <pre><code>./run.sh\n#input class MyAdvancedHttpJavaPlanRun or MyAdvancedHttpPlanRun\ndocker logs -f docker-http-1\n</code></pre> <pre><code>172.21.0.1 [06/Nov/2023:01:45:45 +0000] GET /anything/pets?tags%3D10fWnNoDz%26limit%3D66804 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:46 +0000] GET /anything/pets?tags%3DhyO6mI8LZUUpS HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:50 +0000] POST /anything/pets HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:51 +0000] POST /anything/pets HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:52 +0000] DELETE /anything/pets/ID55185420 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:52 +0000] DELETE /anything/pets/ID20618951 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:57 +0000] GET /anything/pets/ID55185420 HTTP/1.1 200 Host: host.docker.internal}\n172.21.0.1 [06/Nov/2023:01:45:57 +0000] GET /anything/pets/ID20618951 HTTP/1.1 200 Host: host.docker.internal}\n</code></pre> 
<p>Great! Now we have replicated a production-like flow of HTTP requests.</p>"},{"location":"setup/guide/data-source/http/#ordering","title":"Ordering","text":"<p>If you wanted to change the ordering of the requests, you can alter the order from within the OpenAPI/Swagger document. This is particularly useful when you want to simulate the same flow that users would take when utilising your application (i.e. create account, query account, update account).</p>"},{"location":"setup/guide/data-source/http/#rows-per-second","title":"Rows per second","text":"<p>By default, Data Caterer will push requests per method and endpoint at a rate of around 5 requests per second. If you want to alter this value, you can do so via the below configuration. The lowest supported requests per second is 1.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.api.model.Constants;\n\n...\nvar httpTask = http(\"my_http\", Map.of(Constants.ROWS_PER_SECOND(), \"1\"))\n...\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.model.Constants.ROWS_PER_SECOND\n\n...\nval httpTask = http(\"my_http\", options = Map(ROWS_PER_SECOND -&gt; \"1\"))\n...\n</code></pre> <p>Check out the full example under <code>AdvancedHttpPlanRun</code> in the example repo.</p>"},{"location":"setup/guide/data-source/kafka/","title":"Kafka","text":"<p>Info</p> <p>Writing data to Kafka is a paid feature. Try the free trial here.</p> <p>Creating a data generator for Kafka. 
You will build a Docker image that will be able to populate data in Kafka for the topics you configure.</p>"},{"location":"setup/guide/data-source/kafka/#requirements","title":"Requirements","text":"<ul> <li>20 minutes</li> <li>Git</li> <li>Gradle</li> <li>Docker</li> <li>Kafka</li> </ul>"},{"location":"setup/guide/data-source/kafka/#get-started","title":"Get Started","text":"<p>First, we will clone the data-caterer-example repo, which already has the required base project setup.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\n</code></pre> <p>If you already have a Kafka instance running, you can skip to this step.</p>"},{"location":"setup/guide/data-source/kafka/#kafka-setup","title":"Kafka Setup","text":"<p>Next, let's make sure you have an instance of Kafka up and running in your local environment. This will make it easy for us to iterate and check our changes.</p> <pre><code>cd docker\ndocker-compose up -d kafka\n</code></pre>"},{"location":"setup/guide/data-source/kafka/#plan-setup","title":"Plan Setup","text":"<p>Create a new Java or Scala class.</p> <ul> <li>Java: <code>src/main/java/com/github/pflooky/plan/MyAdvancedKafkaJavaPlan.java</code></li> <li>Scala: <code>src/main/scala/com/github/pflooky/plan/MyAdvancedKafkaPlan.scala</code></li> </ul> <p>Make sure your class extends <code>PlanRun</code>.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.java.api.PlanRun;\n\npublic class MyAdvancedKafkaJavaPlan extends PlanRun {\n}\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.PlanRun\n\nclass MyAdvancedKafkaPlan extends PlanRun {\n}\n</code></pre> <p>This class is where we define all of our configurations for generating data. 
There are helper variables and methods defined to make it simple and easy to use.</p>"},{"location":"setup/guide/data-source/kafka/#connection-configuration","title":"Connection Configuration","text":"<p>Within our class, we can start by defining the connection properties to connect to Kafka.</p> JavaScala <pre><code>var accountTask = kafka(\n\"my_kafka\",       //name\n\"localhost:9092\", //url\nMap.of()          //optional additional connection options\n);\n</code></pre> <p>Additional options can be found here.</p> <pre><code>val accountTask = kafka(\n\"my_kafka\",       //name\n\"localhost:9092\", //url\nMap()             //optional additional connection options\n)\n</code></pre> <p>Additional options can be found here.</p>"},{"location":"setup/guide/data-source/kafka/#schema","title":"Schema","text":"<p>Let's create a task for inserting data into the <code>account-topic</code> that is already defined under <code>docker/data/kafka/setup_kafka.sh</code>. This topic should already be set up for you if you followed this step. We can check if the topic is set up already via the following command:</p> <pre><code>docker exec docker-kafkaserver-1 kafka-topics --bootstrap-server localhost:9092 --list\n</code></pre> <p>Trimming the connection details to work with the docker-compose Kafka, we have a base Kafka connection to define the topic we will publish to. Let's define each field along with its corresponding data type. You will notice that the <code>text</code> fields do not have a data type defined. 
This is because the default data type is <code>StringType</code>.</p> JavaScala <pre><code>{\nvar kafkaTask = kafka(\"my_kafka\", \"kafkaserver:29092\")\n.topic(\"account-topic\")\n.schema(\nfield().name(\"key\").sql(\"content.account_id\"),\nfield().name(\"value\").sql(\"TO_JSON(content)\"),\n//field().name(\"partition\").type(IntegerType.instance()),  can define partition here\nfield().name(\"headers\")\n.type(ArrayType.instance())\n.sql(\n\"ARRAY(\" +\n\"NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\" +\n\"NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\" +\n\")\"\n),\nfield().name(\"content\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield().name(\"year\").type(IntegerType.instance()),\nfield().name(\"amount\").type(DoubleType.instance()),\nfield().name(\"details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"first_txn_date\").type(DateType.instance()).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield().name(\"updated_by\")\n.schema(\nfield().name(\"user\"),\nfield().name(\"time\").type(TimestampType.instance())\n)\n),\nfield().name(\"transactions\").type(ArrayType.instance())\n.schema(\nfield().name(\"txn_date\").type(DateType.instance()).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield().name(\"amount\").type(DoubleType.instance())\n)\n),\nfield().name(\"tmp_year\").sql(\"content.year\").omit(true),\nfield().name(\"tmp_name\").sql(\"content.details.name\").omit(true)\n)\n}\n</code></pre> <pre><code>val kafkaTask = kafka(\"my_kafka\", \"kafkaserver:29092\")\n.topic(\"account-topic\")\n.schema(\nfield.name(\"key\").sql(\"content.account_id\"),\nfield.name(\"value\").sql(\"TO_JSON(content)\"),\n//field.name(\"partition\").type(IntegerType),  can define partition here\nfield.name(\"headers\")\n.`type`(ArrayType)\n.sql(\n\"\"\"ARRAY(\n          |  NAMED_STRUCT('key', 'account-id', 
'value', TO_BINARY(content.account_id, 'utf-8')),\n          |  NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\n          |)\"\"\".stripMargin\n),\nfield.name(\"content\")\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield.name(\"year\").`type`(IntegerType).min(2021).max(2023),\nfield.name(\"amount\").`type`(DoubleType),\nfield.name(\"details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"first_txn_date\").`type`(DateType).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield.name(\"updated_by\")\n.schema(\nfield.name(\"user\"),\nfield.name(\"time\").`type`(TimestampType),\n),\n),\nfield.name(\"transactions\").`type`(ArrayType)\n.schema(\nfield.name(\"txn_date\").`type`(DateType).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield.name(\"amount\").`type`(DoubleType),\n)\n),\nfield.name(\"tmp_year\").sql(\"content.year\").omit(true),\nfield.name(\"tmp_name\").sql(\"content.details.name\").omit(true)\n)\n</code></pre>"},{"location":"setup/guide/data-source/kafka/#fields","title":"Fields","text":"<p>The schema defined for Kafka has a format that needs to be followed, as noted above. The only required field is <code>value</code>.</p> <p>The other fields are optional: <code>key</code>, <code>partition</code> and <code>headers</code>.</p>"},{"location":"setup/guide/data-source/kafka/#headers","title":"headers","text":"<p><code>headers</code> follows a particular pattern where it is of type <code>array&lt;struct&lt;key: string,value: binary&gt;&gt;</code>. To be able to generate data for this data type, we need to use an SQL expression like the one below. You will notice that the <code>value</code> part refers to <code>content.account_id</code>, where <code>content</code> is another field defined at the top level of the schema. 
This allows you to reference other values that have already been generated.</p> JavaScala <pre><code>field().name(\"headers\")\n.type(ArrayType.instance())\n.sql(\n\"ARRAY(\" +\n\"NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\" +\n\"NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\" +\n\")\"\n)\n</code></pre> <pre><code>field.name(\"headers\")\n.`type`(ArrayType)\n.sql(\n\"\"\"ARRAY(\n      |  NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\n      |  NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\n      |)\"\"\".stripMargin\n)\n</code></pre>"},{"location":"setup/guide/data-source/kafka/#transactions","title":"transactions","text":"<p><code>transactions</code> is an array that contains an inner structure of <code>txn_date</code> and <code>amount</code>. The size of the array generated can be controlled via <code>arrayMinLength</code> and <code>arrayMaxLength</code>.</p> JavaScala <pre><code>field().name(\"transactions\").type(ArrayType.instance())\n.schema(\nfield().name(\"txn_date\").type(DateType.instance()).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield().name(\"amount\").type(DoubleType.instance())\n)\n</code></pre> <pre><code>field.name(\"transactions\").`type`(ArrayType)\n.schema(\nfield.name(\"txn_date\").`type`(DateType).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield.name(\"amount\").`type`(DoubleType),\n)\n</code></pre>"},{"location":"setup/guide/data-source/kafka/#details","title":"details","text":"<p><code>details</code> is another example of a nested schema structure where it also has a nested structure itself in <code>updated_by</code>. 
One thing to note here is the <code>first_txn_date</code> field has a reference to the <code>content.transactions</code> array where it will  sort the array by <code>txn_date</code> and get the first element.</p> JavaScala <pre><code>field().name(\"details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"first_txn_date\").type(DateType.instance()).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield().name(\"updated_by\")\n.schema(\nfield().name(\"user\"),\nfield().name(\"time\").type(TimestampType.instance())\n)\n)\n</code></pre> <pre><code>field.name(\"details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"first_txn_date\").`type`(DateType).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield.name(\"updated_by\")\n.schema(\nfield.name(\"user\"),\nfield.name(\"time\").`type`(TimestampType),\n),\n)\n</code></pre>"},{"location":"setup/guide/data-source/kafka/#additional-configurations","title":"Additional Configurations","text":"<p>At the end of data generation, a report gets generated that summarises the actions it performed. 
We can control the output folder of that report via configurations.</p> JavaScala <pre><code>var config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n</code></pre> <pre><code>val config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n</code></pre>"},{"location":"setup/guide/data-source/kafka/#execute","title":"Execute","text":"<p>To tell Data Caterer that we want to run with the configurations along with the <code>kafkaTask</code>, we have to call <code>execute</code>.</p>"},{"location":"setup/guide/data-source/kafka/#run","title":"Run","text":"<p>Now we can run the class we just created via the script <code>./run.sh</code>, which is in the top level directory of <code>data-caterer-example</code>.</p> <pre><code>./run.sh\n#input class AdvancedKafkaJavaPlanRun or AdvancedKafkaPlanRun\n#after completing\ndocker exec docker-kafkaserver-1 kafka-console-consumer --bootstrap-server localhost:9092 --topic account-topic --from-beginning\n</code></pre> <p>Your output should look like this.</p> <pre><code>{\"account_id\":\"ACC56292178\",\"year\":2022,\"amount\":18338.627721151555,\"details\":{\"name\":\"Isaias Reilly\",\"first_txn_date\":\"2021-01-22\",\"updated_by\":{\"user\":\"FgYXbKDWdhHVc3\",\"time\":\"2022-12-30T13:49:07.309Z\"}},\"transactions\":[{\"txn_date\":\"2021-01-22\",\"amount\":30556.52125487579},{\"txn_date\":\"2021-10-29\",\"amount\":39372.302259554635},{\"txn_date\":\"2021-10-29\",\"amount\":61887.31389495968}]}\n{\"account_id\":\"ACC37729457\",\"year\":2022,\"amount\":96885.31758764731,\"details\":{\"name\":\"Randell 
Witting\",\"first_txn_date\":\"2021-06-30\",\"updated_by\":{\"user\":\"HCKYEBHN8AJ3TB\",\"time\":\"2022-12-02T02:05:01.144Z\"}},\"transactions\":[{\"txn_date\":\"2021-06-30\",\"amount\":98042.09647765031},{\"txn_date\":\"2021-10-06\",\"amount\":41191.43564742036},{\"txn_date\":\"2021-11-16\",\"amount\":78852.08184809204},{\"txn_date\":\"2021-10-09\",\"amount\":13747.157653571106}]}\n{\"account_id\":\"ACC23127317\",\"year\":2023,\"amount\":81164.49304198896,\"details\":{\"name\":\"Jed Wisozk\",\"updated_by\":{\"user\":\"9MBFZZ\",\"time\":\"2023-07-12T05:56:52.397Z\"}},\"transactions\":[]}\n</code></pre> <p>Also check the generated HTML report, found at <code>docker/sample/report/index.html</code>, to get an overview of what was executed.</p>"},{"location":"setup/guide/data-source/marquez-metadata-source/","title":"Metadata Source","text":"<p>Info</p> <p>Generating data based on an external metadata source is a paid feature. Try the free trial here.</p> <p>Creating a data generator for Postgres tables and a CSV file based on metadata stored in Marquez (which follows the OpenLineage API).</p>"},{"location":"setup/guide/data-source/marquez-metadata-source/#requirements","title":"Requirements","text":"<ul> <li>10 minutes</li> <li>Git</li> <li>Gradle</li> <li>Docker</li> </ul>"},{"location":"setup/guide/data-source/marquez-metadata-source/#get-started","title":"Get Started","text":"<p>First, we will clone the data-caterer-example repo, which already has the required base project setup.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\n</code></pre>"},{"location":"setup/guide/data-source/marquez-metadata-source/#marquez-setup","title":"Marquez Setup","text":"<p>You can follow the README found here to help with setting up Marquez in your local environment. 
This comes with an instance of Postgres which we will also be using as a data store for generated data.</p> <p>The command that was run for this example to help with setup of dummy data was <code>./docker/up.sh -a 5001 -m 5002 --seed</code>.</p> <p>Check that the following url shows some data like below once you click on <code>food_delivery</code> from the <code>ns</code> drop down in the top right corner.</p> <p></p>"},{"location":"setup/guide/data-source/marquez-metadata-source/#postgres-setup","title":"Postgres Setup","text":"<p>Since we will also be using the Marquez Postgres instance as a data source, we will set up a separate database to store the generated data in via:</p> <pre><code>docker exec marquez-db psql -Upostgres -c 'CREATE DATABASE food_delivery'\n</code></pre>"},{"location":"setup/guide/data-source/marquez-metadata-source/#plan-setup","title":"Plan Setup","text":"<p>Create a new Java or Scala class.</p> <ul> <li>Java: <code>src/main/java/com/github/pflooky/plan/MyAdvancedMetadataSourceJavaPlanRun.java</code></li> <li>Scala: <code>src/main/scala/com/github/pflooky/plan/MyAdvancedMetadataSourcePlanRun.scala</code></li> </ul> <p>Make sure your class extends <code>PlanRun</code>.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.java.api.PlanRun;\n...\n\npublic class MyAdvancedMetadataSourceJavaPlanRun extends PlanRun {\n{\nvar conf = configuration().enableGeneratePlanAndTasks(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n}\n}\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.PlanRun\n...\n\nclass MyAdvancedMetadataSourcePlanRun extends PlanRun {\nval conf = configuration.enableGeneratePlanAndTasks(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n}\n</code></pre> <p>We will enable generate plan and tasks so that we can read from external sources for metadata and save the reports under a folder we can easily 
access.</p>"},{"location":"setup/guide/data-source/marquez-metadata-source/#schema","title":"Schema","text":"<p>We can point the schema of a data source to our Marquez instance. For the Postgres data source, we will point to a <code>namespace</code>, which in Marquez or OpenLineage, represents a set of datasets. For the CSV data source, we will point to a specific <code>namespace</code> and <code>dataset</code>.</p>"},{"location":"setup/guide/data-source/marquez-metadata-source/#single-schema","title":"Single Schema","text":"JavaScala <pre><code>var csvTask = csv(\"my_csv\", \"/tmp/data/csv\", Map.of(\"saveMode\", \"overwrite\", \"header\", \"true\"))\n.schema(metadataSource().marquez(\"http://localhost:5001\", \"food_delivery\", \"public.delivery_7_days\"))\n.count(count().records(10));\n</code></pre> <pre><code>val csvTask = csv(\"my_csv\", \"/tmp/data/csv\", Map(\"saveMode\" -&gt; \"overwrite\", \"header\" -&gt; \"true\"))\n.schema(metadataSource.marquez(\"http://localhost:5001\", \"food_delivery\", \"public.delivery_7_days\"))\n.count(count.records(10))\n</code></pre> <p>The above defines that the schema will come from Marquez, which is a type of metadata source that contains information about schemas. 
Specifically, it points to the <code>food_delivery</code> namespace and the <code>public.delivery_7_days</code> dataset to retrieve the schema information from.</p>"},{"location":"setup/guide/data-source/marquez-metadata-source/#multiple-schemas","title":"Multiple Schemas","text":"JavaScala <pre><code>var postgresTask = postgres(\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/food_delivery\", \"postgres\", \"password\", Map.of())\n.schema(metadataSource().marquez(\"http://host.docker.internal:5001\", \"food_delivery\"))\n.count(count().records(10));\n</code></pre> <pre><code>val postgresTask = postgres(\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/food_delivery\", \"postgres\", \"password\")\n.schema(metadataSource.marquez(\"http://host.docker.internal:5001\", \"food_delivery\"))\n.count(count.records(10))\n</code></pre> <p>We have now pointed this Postgres instance to produce multiple schemas that are defined under the <code>food_delivery</code> namespace. Also note that we are using the database <code>food_delivery</code> in Postgres to push our generated data to, and we have set the number of records per sub data source (in this case, per table) to be 10.</p>"},{"location":"setup/guide/data-source/marquez-metadata-source/#run","title":"Run","text":"<p>Let's run it and see what happens.</p> <pre><code>cd ..\n./run.sh\n#input class MyAdvancedMetadataSourceJavaPlanRun or MyAdvancedMetadataSourcePlanRun\n#after completing\ndocker exec marquez-db psql -Upostgres -d food_delivery -c 'SELECT * FROM public.delivery_7_days'\n</code></pre> <p>It should look something like this.</p> <pre><code> order_id |     order_placed_on     |   order_dispatched_on   |   order_delivered_on    |         customer_email         |                     customer_address                     | menu_id | restaurant_id |                        restaurant_address\n   | menu_item_id | category_id | discount_id | city_id | 
driver_id\n----------+-------------------------+-------------------------+-------------------------+--------------------------------+----------------------------------------------------------+---------+---------------+---------------------------------------------------------------\n---+--------------+-------------+-------------+---------+-----------\n    38736 | 2023-02-05 06:05:23.755 | 2023-09-08 04:29:10.878 | 2023-09-03 23:58:34.285 | april.skiles@hotmail.com       | 5018 Lang Dam, Gaylordfurt, MO 35172                     |   59841 |         30971 | Suite 439 51366 Bartoletti Plains, West Lashawndamouth, CA 242\n42 |        55697 |       36370 |       21574 |   88022 |     16569\n4376 | 2022-12-19 14:39:53.442 | 2023-08-30 07:40:06.948 | 2023-03-15 20:38:26.11  | adelina.balistreri@hotmail.com | Apt. 340 9146 Novella Motorway, East Troyhaven, UT 34773 |   66195 |         42765 | Suite 670 8956 Rob Fork, Rennershire, CA 04524\n|        26516 |       81335 |       87615 |   27433 |     45649\n11083 | 2022-10-30 12:46:38.692 | 2023-06-02 13:05:52.493 | 2022-11-27 18:38:07.873 | johnny.gleason@gmail.com       | Apt. 385 99701 Lemke Place, New Irvin, RI 73305          |   66427 |         44438 | 1309 Danny Cape, Weimanntown, AL 15865\n|        41686 |       36508 |       34498 |   24191 |     92405\n58759 | 2023-07-26 14:32:30.883 | 2022-12-25 11:04:08.561 | 2023-04-21 17:43:05.86  | isabelle.ohara@hotmail.com     | 2225 Evie Lane, South Ardella, SD 90805                  |   27106 |         25287 | Suite 678 3731 Dovie Park, Port Luigi, ID 08250\n|        94205 |       66207 |       81051 |   52553 |     27483\n</code></pre> <p>You can also try query some other tables. 
Let's also check what is in the CSV file.</p> <pre><code>$ head docker/sample/csv/part-0000*\nmenu_item_id,category_id,discount_id,city_id,driver_id,order_id,order_placed_on,order_dispatched_on,order_delivered_on,customer_email,customer_address,menu_id,restaurant_id,restaurant_address\n72248,37098,80135,45888,5036,11090,2023-09-20T05:33:08.036+08:00,2023-05-16T23:10:57.119+08:00,2023-05-01T22:02:23.272+08:00,demetrice.rohan@hotmail.com,\"406 Harmony Rue, Wisozkburgh, MD 12282\",33762,9042,\"Apt. 751 0796 Ellan Flats, Lake Chetville, WI 81957\"\n41644,40029,48565,83373,89919,58359,2023-04-18T06:28:26.194+08:00,2022-10-15T18:17:48.998+08:00,2023-02-06T17:02:04.104+08:00,joannie.okuneva@yahoo.com,\"Suite 889 022 Susan Lane, Zemlakport, OR 56996\",27467,6216,\"Suite 016 286 Derick Grove, Dooleytown, NY 14664\"\n49299,53699,79675,40821,61764,72234,2023-07-16T21:33:48.739+08:00,2023-02-14T21:23:10.265+08:00,2023-09-18T02:08:51.433+08:00,ina.heller@yahoo.com,\"Suite 600 86844 Heller Island, New Celestinestad, DE 42622\",48002,12462,\"5418 Okuneva Mountain, East Blairchester, MN 04060\"\n83197,86141,11085,29944,81164,65382,2023-01-20T06:08:25.981+08:00,2023-01-11T13:24:32.968+08:00,2023-09-09T02:30:16.890+08:00,lakisha.bashirian@yahoo.com,\"Suite 938 534 Theodore Lock, Port Caitlynland, LA 67308\",69109,47727,\"4464 Stewart Tunnel, Marguritemouth, AR 56791\"\n</code></pre> <p>Looks like we have some data now. But we can do better and add some enhancements to it.</p> <p>What if we wanted the same records in Postgres <code>public.delivery_7_days</code> to also show up in the CSV file? That's where we can use a foreign key definition.</p>"},{"location":"setup/guide/data-source/marquez-metadata-source/#foreign-key","title":"Foreign Key","text":"<p>We can take a look at the report (under <code>docker/sample/report/index.html</code>) to see what we need to do to create the  foreign key. 
From the overview, you should see under <code>Tasks</code> there is a <code>my_postgres</code> task which has <code>food_delivery_public.delivery_7_days</code> as a step. Click on the link for <code>food_delivery_public.delivery_7_days</code> and it will take us to a page where we can find out about the columns used in this table. Click on the <code>Fields</code> button on the far right to see them.</p> <p>We can copy all or a subset of the fields that we want matched across the CSV file and Postgres. For this example, we will take all the fields.</p> JavaScala <pre><code>var myPlan = plan().addForeignKeyRelationship(\npostgresTask, List.of(\"key\", \"tmp_year\", \"tmp_name\", \"value\"),\nList.of(Map.entry(csvTask, List.of(\"account_number\", \"year\", \"name\", \"payload\")))\n);\n\nvar conf = ...\n\nexecute(myPlan, conf, postgresTask, csvTask);\n</code></pre> <pre><code>val foreignCols = List(\"order_id\", \"order_placed_on\", \"order_dispatched_on\", \"order_delivered_on\", \"customer_email\",\n\"customer_address\", \"menu_id\", \"restaurant_id\", \"restaurant_address\", \"menu_item_id\", \"category_id\", \"discount_id\",\n\"city_id\", \"driver_id\")\n\nval myPlan = plan.addForeignKeyRelationships(\ncsvTask, foreignCols,\nList(foreignField(postgresTask, \"food_delivery_public.delivery_7_days\", foreignCols))\n)\n\nval conf = ...\n\nexecute(myPlan, conf, postgresTask, csvTask)\n</code></pre> <p>Notice how we have defined the <code>csvTask</code> and <code>foreignCols</code> as the main foreign key, but for <code>postgresTask</code> we had to define it as a <code>foreignField</code>. This is because <code>postgresTask</code> has multiple tables within it, and we only want to define our foreign key with respect to the <code>public.delivery_7_days</code> table. We use the step name (which can be seen in the report) to specify the table to target. 
</p> <p>To test this out, we will truncate the <code>public.delivery_7_days</code> table in Postgres first, and then try run again.</p> <pre><code>docker exec marquez-db psql -Upostgres -d food_delivery -c 'TRUNCATE public.delivery_7_days'\n./run.sh\n#input class MyAdvancedMetadataSourceJavaPlanRun or MyAdvancedMetadataSourcePlanRun\ndocker exec marquez-db psql -Upostgres -d food_delivery -c 'SELECT * FROM public.delivery_7_days'\n</code></pre> <pre><code> order_id |     order_placed_on     |   order_dispatched_on   |   order_delivered_on    |        customer_email        |\ncustomer_address                     | menu_id | restaurant_id |                   restaurant_address                   | menu\n_item_id | category_id | discount_id | city_id | driver_id\n----------+-------------------------+-------------------------+-------------------------+------------------------------+-------------\n--------------------------------------------+---------+---------------+--------------------------------------------------------+-----\n---------+-------------+-------------+---------+-----------\n    53333 | 2022-10-15 08:40:23.394 | 2023-01-23 09:42:48.397 | 2023-08-12 08:50:52.397 | normand.aufderhar@gmail.com  | Apt. 036 449\n27 Wilderman Forge, Marvinchester, CT 15952 |   40412 |         70130 | Suite 146 98176 Schaden Village, Grahammouth, SD 12354 |\n90141 |       44210 |       83966 |   78614 |     77449\n</code></pre> <p>Let's grab the first email from the Postgres table and check whether the same record exists in the CSV file.</p> <pre><code>$ cat docker/sample/csv/part-0000* | grep normand.aufderhar\n90141,44210,83966,78614,77449,53333,2022-10-15T08:40:23.394+08:00,2023-01-23T09:42:48.397+08:00,2023-08-12T08:50:52.397+08:00,normand.aufderhar@gmail.com,\"Apt. 036 44927 Wilderman Forge, Marvinchester, CT 15952\",40412,70130,\"Suite 146 98176 Schaden Village, Grahammouth, SD 12354\"\n</code></pre> <p>Great! 
Now we have the ability to get schema information from an external source, add our own foreign keys and generate  data.</p> <p>Check out the full example under <code>AdvancedMetadataSourcePlanRun</code> in the example repo.</p>"},{"location":"setup/guide/data-source/open-metadata-source/","title":"OpenMetadata Source","text":"<p>Info</p> <p>Generating data based on an external metadata source is a paid feature. Try the free trial here.</p> <p>Creating a data generator for a JSON file based on metadata stored in OpenMetadata.</p>"},{"location":"setup/guide/data-source/open-metadata-source/#requirements","title":"Requirements","text":"<ul> <li>10 minutes</li> <li>Git</li> <li>Gradle</li> <li>Docker</li> </ul>"},{"location":"setup/guide/data-source/open-metadata-source/#get-started","title":"Get Started","text":"<p>First, we will clone the data-caterer-example repo which will already have the base project setup required.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\n</code></pre>"},{"location":"setup/guide/data-source/open-metadata-source/#openmetadata-setup","title":"OpenMetadata Setup","text":"<p>You can follow the local docker setup found here to help with setting up OpenMetadata in your local environment.</p> <p>If that page becomes outdated or the link doesn't work, below are the commands I used to run it:</p> <pre><code>mkdir openmetadata-docker &amp;&amp; cd openmetadata-docker\ncurl -sL https://github.com/open-metadata/OpenMetadata/releases/download/1.2.0-release/docker-compose.yml &gt; docker-compose.yml\ndocker compose -f docker-compose.yml up --detach\n</code></pre> <p>Check that the following url works and login with <code>admin:admin</code>. 
Then you should see some data  like below:</p> <p></p>"},{"location":"setup/guide/data-source/open-metadata-source/#plan-setup","title":"Plan Setup","text":"<p>Create a new Java or Scala class.</p> <ul> <li>Java: <code>src/main/java/com/github/pflooky/plan/MyAdvancedOpenMetadataSourceJavaPlanRun.java</code></li> <li>Scala: <code>src/main/scala/com/github/pflooky/plan/MyAdvancedOpenMetadataSourcePlanRun.scala</code></li> </ul> <p>Make sure your class extends <code>PlanRun</code>.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.java.api.PlanRun;\n...\n\npublic class MyAdvancedOpenMetadataSourceJavaPlanRun extends PlanRun {\n{\nvar conf = configuration().enableGeneratePlanAndTasks(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n}\n}\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.PlanRun\n...\n\nclass MyAdvancedOpenMetadataSourcePlanRun extends PlanRun {\nval conf = configuration.enableGeneratePlanAndTasks(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n}\n</code></pre> <p>We will enable generate plan and tasks so that we can read from external sources for metadata and save the reports under a folder we can easily access.</p>"},{"location":"setup/guide/data-source/open-metadata-source/#schema","title":"Schema","text":"<p>We can point the schema of a data source to our OpenMetadata instance. 
We will use a JSON data source so that we can show how nested data types are handled and how we could customise it.</p>"},{"location":"setup/guide/data-source/open-metadata-source/#single-schema","title":"Single Schema","text":"JavaScala <pre><code>import com.github.pflooky.datacaterer.api.model.Constants;\n...\n\nvar jsonTask = json(\"my_json\", \"/opt/app/data/json\", Map.of(\"saveMode\", \"overwrite\"))\n.schema(metadataSource().openMetadataJava(\n\"http://localhost:8585/api\",                                                              //url\nConstants.OPEN_METADATA_AUTH_TYPE_OPEN_METADATA(),                                        //auth type\nMap.of(                                                                                   //additional options (including auth options)\nConstants.OPEN_METADATA_JWT_TOKEN(), \"abc123\",                                        //get from settings/bots/ingestion-bot\nConstants.OPEN_METADATA_TABLE_FQN(), \"sample_data.ecommerce_db.shopify.raw_customer\"  //table fully qualified name\n)\n))\n.count(count().records(10));\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.model.Constants.{OPEN_METADATA_AUTH_TYPE_OPEN_METADATA, OPEN_METADATA_JWT_TOKEN, OPEN_METADATA_TABLE_FQN, SAVE_MODE}\n...\n\nval jsonTask = json(\"my_json\", \"/opt/app/data/json\", Map(\"saveMode\" -&gt; \"overwrite\"))\n.schema(metadataSource.openMetadata(\n\"http://localhost:8585/api\",                                                  //url\nOPEN_METADATA_AUTH_TYPE_OPEN_METADATA,                                        //auth type\nMap(                                                                          //additional options (including auth options)\nOPEN_METADATA_JWT_TOKEN -&gt; \"abc123\",                                        //get from settings/bots/ingestion-bot\nOPEN_METADATA_TABLE_FQN -&gt; \"sample_data.ecommerce_db.shopify.raw_customer\"  //table fully qualified name\n)\n))\n.count(count.records(10))\n</code></pre> <p>The above 
defines that the schema will come from OpenMetadata, which is a type of metadata source that contains information about schemas. Specifically, it points to the <code>sample_data.ecommerce_db.shopify.raw_customer</code> table. You can check out the schema here to see what it looks like.</p>"},{"location":"setup/guide/data-source/open-metadata-source/#run","title":"Run","text":"<p>Let's try run and see what happens.</p> <pre><code>cd ..\n./run.sh\n#input class MyAdvancedOpenMetadataSourceJavaPlanRun or MyAdvancedOpenMetadataSourcePlanRun\n#after completing\ncat docker/sample/json/part-00000-*\n</code></pre> <p>It should look something like this.</p> <pre><code>{\n\"comments\": \"Mh6jqpD5e4M\",\n\"creditcard\": \"6771839575926717\",\n\"membership\": \"Za3wCQUl9E  EJj712\",\n\"orders\": [\n{\n\"product_id\": \"Aa6NG0hxfHVq\",\n\"price\": 16139,\n\"onsale\": false,\n\"tax\": 58134,\n\"weight\": 40734,\n\"others\": 45813,\n\"vendor\": \"Kh\"\n},\n{\n\"product_id\": \"zbHBY \",\n\"price\": 17903,\n\"onsale\": false,\n\"tax\": 39526,\n\"weight\": 9346,\n\"others\": 52035,\n\"vendor\": \"jbkbnXAa\"\n},\n{\n\"product_id\": \"5qs3gakppd7Nw5\",\n\"price\": 48731,\n\"onsale\": true,\n\"tax\": 81105,\n\"weight\": 2004,\n\"others\": 20465,\n\"vendor\": \"nozCDMSXRPH Ev\"\n},\n{\n\"product_id\": \"CA6h17ANRwvb\",\n\"price\": 62102,\n\"onsale\": true,\n\"tax\": 96601,\n\"weight\": 78849,\n\"others\": 79453,\n\"vendor\": \" ihVXEJz7E2EFS\"\n}\n],\n\"platform\": \"GLt9\",\n\"preference\": {\n\"key\": \"nmPmsPjg C\",\n\"value\": true\n},\n\"shipping_address\": [\n{\n\"name\": \"Loren Bechtelar\",\n\"street_address\": \"Suite 526 293 Rohan Road, Wunschshire, NE 25532\",\n\"city\": \"South Norrisland\",\n\"postcode\": \"56863\"\n}\n],\n\"shipping_date\": \"2022-11-03\",\n\"transaction_date\": \"2023-02-01\",\n\"customer\": {\n\"username\": \"lance.murphy\",\n\"name\": \"Zane Brakus DVM\",\n\"sex\": \"7HcAaPiO\",\n\"address\": \"594 Loida Haven, Gilland, MA 26071\",\n\"mail\": 
\"Un3fhbvK2rEbenIYdnq\",\n\"birthdate\": \"2023-01-31\"\n}\n}\n</code></pre> <p>Looks like we have some data now. But we can do better and add some enhancements to it.</p>"},{"location":"setup/guide/data-source/open-metadata-source/#custom-metadata","title":"Custom metadata","text":"<p>We can see from the data generated, that it isn't quite what we want. The metadata is not sufficient for us to produce production-like data yet. Let's try to add some enhancements to it.</p> <p>Let's make the <code>platform</code> field a choice field that can only be a set of certain values and the nested field <code>customer.sex</code> is also from a predefined set of values.</p> JavaScala <pre><code>var jsonTask = json(\"my_json\", \"/opt/app/data/json\", Map.of(\"saveMode\", \"overwrite\"))\n.schema(\nmetadata...\n))\n.schema(\nfield().name(\"platform\").oneOf(\"website\", \"mobile\"),\nfield().name(\"customer\").schema(field().name(\"sex\").oneOf(\"M\", \"F\", \"O\"))\n)\n.count(count().records(10));\n</code></pre> <pre><code>val jsonTask = json(\"my_json\", \"/opt/app/data/json\", Map(\"saveMode\" -&gt; \"overwrite\"))\n.schema(\nmetadata...\n))\n.schema(\nfield.name(\"platform\").oneOf(\"website\", \"mobile\"),\nfield.name(\"customer\").schema(field.name(\"sex\").oneOf(\"M\", \"F\", \"O\"))\n)\n.count(count.records(10))\n</code></pre> <p>Let's test it out by running it again</p> <pre><code>./run.sh\n#input class MyAdvancedMetadataSourceJavaPlanRun or MyAdvancedMetadataSourcePlanRun\ncat docker/sample/json/part-00000-*\n</code></pre> <pre><code>{\n\"comments\": \"vqbPUm\",\n\"creditcard\": \"6304867705548636\",\n\"membership\": \"GZ1xOnpZSUOKN\",\n\"orders\": [\n{\n\"product_id\": \"rgOokDAv\",\n\"price\": 77367,\n\"onsale\": false,\n\"tax\": 61742,\n\"weight\": 87855,\n\"others\": 26857,\n\"vendor\": \"04XHR64ImMr9T\"\n}\n],\n\"platform\": \"mobile\",\n\"preference\": {\n\"key\": \"IB5vNdWka\",\n\"value\": true\n},\n\"shipping_address\": [\n{\n\"name\": \"Isiah 
Bins\",\n\"street_address\": \"36512 Ross Spurs, Hillhaven, IA 18760\",\n\"city\": \"Averymouth\",\n\"postcode\": \"75818\"\n},\n{\n\"name\": \"Scott Prohaska\",\n\"street_address\": \"26573 Haley Ports, Dariusland, MS 90642\",\n\"city\": \"Ashantimouth\",\n\"postcode\": \"31792\"\n},\n{\n\"name\": \"Rudolf Stamm\",\n\"street_address\": \"Suite 878 0516 Danica Path, New Christiaport, ID 10525\",\n\"city\": \"Doreathaport\",\n\"postcode\": \"62497\"\n}\n],\n\"shipping_date\": \"2023-08-24\",\n\"transaction_date\": \"2023-02-01\",\n\"customer\": {\n\"username\": \"jolie.cremin\",\n\"name\": \"Fay Klein\",\n\"sex\": \"O\",\n\"address\": \"Apt. 174 5084 Volkman Creek, Hillborough, PA 61959\",\n\"mail\": \"BiTmzb7\",\n\"birthdate\": \"2023-04-07\"\n}\n}\n</code></pre> <p>Great! Now we have the ability to get schema information from an external source, add our own metadata and generate  data.</p>"},{"location":"setup/guide/data-source/open-metadata-source/#data-validation","title":"Data validation","text":"<p>Another aspect of OpenMetadata that can be leveraged is the definition of data quality rules. These rules can be  incorporated into your Data Caterer job as well by enabling data validations via <code>enableGenerateValidations</code> in  <code>configuration</code>.</p> JavaScala <pre><code>var conf = configuration().enableGeneratePlanAndTasks(true)\n.enableGenerateValidations(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n\nexecute(conf, jsonTask);\n</code></pre> <pre><code>val conf = configuration.enableGeneratePlanAndTasks(true)\n.enableGenerateValidations(true)\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n\nexecute(conf, jsonTask)\n</code></pre> <p>Check out the full example under <code>AdvancedOpenMetadataSourcePlanRun</code> in the example repo.</p>"},{"location":"setup/guide/data-source/solace/","title":"Solace","text":"<p>Info</p> <p>Writing data to Solace is a paid feature. 
Try the free trial here.</p> <p>Creating a data generator for Solace. You will build a Docker image that will be able to populate data in Solace for the queues/topics you configure.</p> <p></p>"},{"location":"setup/guide/data-source/solace/#requirements","title":"Requirements","text":"<ul> <li>20 minutes</li> <li>Git</li> <li>Gradle</li> <li>Docker</li> <li>Solace</li> </ul>"},{"location":"setup/guide/data-source/solace/#get-started","title":"Get Started","text":"<p>First, we will clone the data-caterer-example repo which will already have the base project setup required.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\n</code></pre> <p>If you already have a Solace instance running, you can skip to this step.</p>"},{"location":"setup/guide/data-source/solace/#solace-setup","title":"Solace Setup","text":"<p>Next, let's make sure you have an instance of Solace up and running in your local environment. This will make it easy for us to iterate and check our changes.</p> <pre><code>cd docker\ndocker-compose up -d solace\n</code></pre> <p>Open up localhost:8080, log in with <code>admin:admin</code> and check that the <code>default</code> VPN exists like below. Notice there are 2 queues/topics created. 
If you do not see 2 created, try to run the script found under <code>docker/data/solace/setup_solace.sh</code> and change the <code>host</code> to <code>localhost</code>.</p> <p></p>"},{"location":"setup/guide/data-source/solace/#plan-setup","title":"Plan Setup","text":"<p>Create a new Java or Scala class.</p> <ul> <li>Java: <code>src/main/java/com/github/pflooky/plan/MyAdvancedSolaceJavaPlan.java</code></li> <li>Scala: <code>src/main/scala/com/github/pflooky/plan/MyAdvancedSolacePlan.scala</code></li> </ul> <p>Make sure your class extends <code>PlanRun</code>.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.java.api.PlanRun;\n\npublic class MyAdvancedSolaceJavaPlan extends PlanRun {\n}\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.PlanRun\n\nclass MyAdvancedSolacePlan extends PlanRun {\n}\n</code></pre> <p>This class defines where we need to define all of our configurations for generating data. There are helper variables and methods defined to make it simple and easy to use.</p>"},{"location":"setup/guide/data-source/solace/#connection-configuration","title":"Connection Configuration","text":"<p>Within our class, we can start by defining the connection properties to connect to Solace.</p> JavaScala <pre><code>var accountTask = solace(\n\"my_solace\",                        //name\n\"smf://host.docker.internal:55554\", //url\nMap.of()                            //optional additional connection options\n);\n</code></pre> <p>Additional connection options can be found here.</p> <pre><code>val accountTask = solace(\n\"my_solace\",                        //name\n\"smf://host.docker.internal:55554\", //url\nMap()                               //optional additional connection options\n)\n</code></pre> <p>Additional connection options can be found here.</p>"},{"location":"setup/guide/data-source/solace/#schema","title":"Schema","text":"<p>Let's create a task for inserting data into the <code>rest_test_queue</code> or 
<code>rest_test_topic</code> that is already created for us from this step.</p> <p>Trimming the connection details to work with the docker-compose Solace, we have a base Solace connection to define the JNDI destination we will publish to. Let's define each field along with their corresponding data type. You will notice that the <code>text</code> fields do not have a data type defined. This is because the default data type is <code>StringType</code>.</p> JavaScala <pre><code>{\nvar solaceTask = solace(\"my_solace\", \"smf://host.docker.internal:55554\")\n.destination(\"/JNDI/Q/rest_test_queue\")\n.schema(\nfield().name(\"value\").sql(\"TO_JSON(content)\"),\n//field().name(\"partition\").type(IntegerType.instance()),   //can define message JMS priority here\nfield().name(\"headers\")                                     //set message properties via headers field\n.type(HeaderType.getType())\n.sql(\n\"ARRAY(\" +\n\"NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\" +\n\"NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\" +\n\")\"\n),\nfield().name(\"content\")\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield().name(\"year\").type(IntegerType.instance()).min(2021).max(2023),\nfield().name(\"amount\").type(DoubleType.instance()),\nfield().name(\"details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"first_txn_date\").type(DateType.instance()).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield().name(\"updated_by\")\n.schema(\nfield().name(\"user\"),\nfield().name(\"time\").type(TimestampType.instance())\n)\n),\nfield().name(\"transactions\").type(ArrayType.instance())\n.schema(\nfield().name(\"txn_date\").type(DateType.instance()).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield().name(\"amount\").type(DoubleType.instance())\n)\n)\n)\n.count(count().records(10));\n}\n</code></pre> <pre><code>val 
solaceTask = solace(\"my_solace\", \"smf://host.docker.internal:55554\")\n.destination(\"/JNDI/Q/rest_test_queue\")\n.schema(\nfield.name(\"value\").sql(\"TO_JSON(content)\"),\n//field.name(\"partition\").`type`(IntegerType),  //can define message JMS priority here\nfield.name(\"headers\")                           //set message properties via headers field\n.`type`(HeaderType.getType)\n.sql(\n\"\"\"ARRAY(\n          |  NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\n          |  NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\n          |)\"\"\".stripMargin\n),\nfield.name(\"content\")\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\"),\nfield.name(\"year\").`type`(IntegerType).min(2021).max(2023),\nfield.name(\"amount\").`type`(DoubleType),\nfield.name(\"details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"first_txn_date\").`type`(DateType).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield.name(\"updated_by\")\n.schema(\nfield.name(\"user\"),\nfield.name(\"time\").`type`(TimestampType),\n),\n),\nfield.name(\"transactions\").`type`(ArrayType)\n.schema(\nfield.name(\"txn_date\").`type`(DateType).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield.name(\"amount\").`type`(DoubleType),\n)\n),\n).count(count.records(10))\n</code></pre>"},{"location":"setup/guide/data-source/solace/#fields","title":"Fields","text":"<p>The schema defined for Solace has a format that needs to be followed as noted above. 
Specifically, the required fields are:</p> <ul> <li>value</li> </ul> <p>The other fields are optional:</p> <ul> <li>partition - refers to JMS priority of the message</li> <li>headers - refers to JMS message properties</li> </ul>"},{"location":"setup/guide/data-source/solace/#headers","title":"headers","text":"<p><code>headers</code> follows a particular pattern: it is of type <code>HeaderType.getType</code> which, behind the scenes, translates to <code>array&lt;struct&lt;key: string,value: binary&gt;&gt;</code>. To be able to generate data for this data type, we need to use an SQL expression like the one below. You will notice that the <code>value</code> part refers to <code>content.account_id</code>, where <code>content</code> is another field defined at the top level of the schema. This allows you to reference other values that have already been generated.</p> JavaScala <pre><code>field().name(\"headers\")\n.type(HeaderType.getType())\n.sql(\n\"ARRAY(\" +\n\"NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\" +\n\"NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\" +\n\")\"\n)\n</code></pre> <pre><code>field.name(\"headers\")\n.`type`(HeaderType.getType)\n.sql(\n\"\"\"ARRAY(\n      |  NAMED_STRUCT('key', 'account-id', 'value', TO_BINARY(content.account_id, 'utf-8')),\n      |  NAMED_STRUCT('key', 'updated', 'value', TO_BINARY(content.details.updated_by.time, 'utf-8'))\n      |)\"\"\".stripMargin\n)\n</code></pre>"},{"location":"setup/guide/data-source/solace/#transactions","title":"transactions","text":"<p><code>transactions</code> is an array that contains an inner structure of <code>txn_date</code> and <code>amount</code>. 
The size of the array generated can be controlled via <code>arrayMinLength</code> and <code>arrayMaxLength</code>.</p> JavaScala <pre><code>field().name(\"transactions\").type(ArrayType.instance())\n.schema(\nfield().name(\"txn_date\").type(DateType.instance()).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield().name(\"amount\").type(DoubleType.instance())\n)\n</code></pre> <pre><code>field.name(\"transactions\").`type`(ArrayType)\n.schema(\nfield.name(\"txn_date\").`type`(DateType).min(Date.valueOf(\"2021-01-01\")).max(\"2021-12-31\"),\nfield.name(\"amount\").`type`(DoubleType),\n)\n</code></pre>"},{"location":"setup/guide/data-source/solace/#details","title":"details","text":"<p><code>details</code> is another example of a nested schema structure where it also has a nested structure itself in <code>updated_by</code>. One thing to note here is the <code>first_txn_date</code> field has a reference to the <code>content.transactions</code> array where it will sort the array by <code>txn_date</code> and get the first element.</p> JavaScala <pre><code>field().name(\"details\")\n.schema(\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"first_txn_date\").type(DateType.instance()).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield().name(\"updated_by\")\n.schema(\nfield().name(\"user\"),\nfield().name(\"time\").type(TimestampType.instance())\n)\n)\n</code></pre> <pre><code>field.name(\"details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"first_txn_date\").`type`(DateType).sql(\"ELEMENT_AT(SORT_ARRAY(content.transactions.txn_date), 1)\"),\nfield.name(\"updated_by\")\n.schema(\nfield.name(\"user\"),\nfield.name(\"time\").`type`(TimestampType),\n),\n)\n</code></pre>"},{"location":"setup/guide/data-source/solace/#additional-configurations","title":"Additional Configurations","text":"<p>At the end of data generation, a report gets generated that summarises the actions it performed. 
We can control the output folder of that report via configurations.</p> JavaScala <pre><code>var config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n</code></pre> <pre><code>val config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n</code></pre>"},{"location":"setup/guide/data-source/solace/#execute","title":"Execute","text":"<p>To tell Data Caterer that we want to run with the configurations along with the <code>solaceTask</code>, we have to call <code>execute</code>.</p>"},{"location":"setup/guide/data-source/solace/#run","title":"Run","text":"<p>Now we can run the class we just created via the <code>./run.sh</code> script in the top level directory of <code>data-caterer-example</code>.</p> <pre><code>./run.sh\n#input class AdvancedSolaceJavaPlanRun or AdvancedSolacePlanRun\n#after completing, check http://localhost:8080 from browser\n</code></pre> <p>Your output should look like this.</p> <p></p> <p>Unfortunately, there is no easy way to see the message content. You can check the message content from your application or service that consumes these messages.</p> <p>Also check the generated HTML report, found at <code>docker/sample/report/index.html</code>, to get an overview of what was executed. Or view the sample report found here.</p>"},{"location":"setup/guide/scenario/auto-generate-connection/","title":"Auto Generate From Data Connection","text":"<p>Info</p> <p>Auto data generation from data connection is a paid feature. 
Try the free trial here.</p> <p>Creating a data generator based on only a data connection to Postgres.</p>"},{"location":"setup/guide/scenario/auto-generate-connection/#requirements","title":"Requirements","text":"<ul> <li>5 minutes</li> <li>Git</li> <li>Gradle</li> <li>Docker</li> </ul>"},{"location":"setup/guide/scenario/auto-generate-connection/#get-started","title":"Get Started","text":"<p>First, we will clone the data-caterer-example repo which will already have the base project setup required.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\n</code></pre>"},{"location":"setup/guide/scenario/auto-generate-connection/#plan-setup","title":"Plan Setup","text":"<p>Create a new Java or Scala class.</p> <ul> <li>Java: <code>src/main/java/com/github/pflooky/plan/MyAdvancedAutomatedJavaPlanRun.java</code></li> <li>Scala: <code>src/main/scala/com/github/pflooky/plan/MyAdvancedAutomatedPlanRun.scala</code></li> </ul> <p>Make sure your class extends <code>PlanRun</code>.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.java.api.PlanRun;\n...\n\npublic class MyAdvancedAutomatedJavaPlanRun extends PlanRun {\n{\nvar autoRun = configuration()\n.postgres(\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/customer\")  (1)\n.enableGeneratePlanAndTasks(true)                                                 (2)\n.generatedPlanAndTaskFolderPath(\"/opt/app/data/generated\")                        (3)\n.enableUniqueCheck(true)                                                          (4)\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n\nexecute(autoRun);\n}\n}\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.PlanRun\n...\n\nclass MyAdvancedAutomatedPlanRun extends PlanRun {\n\nval autoRun = configuration\n.postgres(\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/customer\")  (1)\n.enableGeneratePlanAndTasks(true)                                                 
(2)\n.generatedPlanAndTaskFolderPath(\"/opt/app/data/generated\")                        (3)\n.enableUniqueCheck(true)                                                          (4)\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n\nexecute(configuration = autoRun)\n}\n</code></pre> <p>In the above code, we note the following:</p> <ol> <li>Data source configuration to a Postgres data source called <code>my_postgres</code></li> <li>We have enabled the flag <code>enableGeneratePlanAndTasks</code> which tells Data Caterer to go to <code>my_postgres</code> and generate    data for all the tables found under the database <code>customer</code> (which is defined in the connection string).</li> <li>The config <code>generatedPlanAndTaskFolderPath</code> defines where the metadata that is gathered from <code>my_postgres</code> should be    saved at so that we could re-use it later.</li> <li><code>enableUniqueCheck</code> is set to true to ensure that generated data is unique based on primary key or foreign key    definitions.</li> </ol> <p>Note</p> <p>Unique check will only ensure generated data is unique. Any existing data in your data source is not taken into  account, so generated data may fail to insert depending on the data source restrictions</p>"},{"location":"setup/guide/scenario/auto-generate-connection/#postgres-setup","title":"Postgres Setup","text":"<p>If you don't have your own Postgres up and running, you can set up and run an instance configured in the <code>docker</code> folder via.</p> <pre><code>cd docker\ndocker-compose up -d postgres\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c '\\dt+ account.*'\n</code></pre> <p>This will create the tables found under <code>docker/data/sql/postgres/customer.sql</code>. You can change this file to contain your own tables. 
We can see there are 4 tables created for us, <code>accounts, balances, transactions and mapping</code>.</p>"},{"location":"setup/guide/scenario/auto-generate-connection/#run","title":"Run","text":"<p>Let's try run.</p> <pre><code>cd ..\n./run.sh\n#input class MyAdvancedAutomatedJavaPlanRun or MyAdvancedAutomatedPlanRun\n#after completing\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c 'select * from account.accounts limit 1;'\n</code></pre> <p>It should look something like this.</p> <pre><code>   id   | account_number  | account_status | created_by | created_by_fixed_length | customer_id_int | customer_id_smallint | customer_id_bigint |   customer_id_decimal    | customer_id_real | customer_id_double | open_date  |     open_timestamp      | last_opened_time |                                                           payload_bytes\n--------+-----------------+----------------+------------+-------------------------+-----------------+----------------------+--------------------+--------------------------+------------------+--------------------+------------+-------------------------+------------------+------------------------------------------------------------------------------------------------------------------------------------\n 100414 | 5uROOVOUyQUbubN | h3H            | SfA0eZJcTm | CuRw                    |              13 |                   42 |               6041 | 76987.745612542900000000 |         91866.78 |  66400.37433202339 | 2023-03-05 | 2023-08-14 11:33:11.343 | 23:58:01.736     | \\x604d315d4547616e6a233050415373317274736f5e682d516132524f3d23233c37463463322f342d34376d597e665d6b3d395b4238284028622b7d6d2b4f5042\n(1 row)\n</code></pre> <p>The data that gets inserted will follow the foreign keys that are defined within Postgres and also ensure the insertion order is correct.</p> <p>Also check the HTML report that gets generated under <code>docker/sample/report/index.html</code>. 
You can see a summary of what was generated along with other metadata.</p> <p>You can now play around with other tables or data sources and auto generate data for them.</p>"},{"location":"setup/guide/scenario/auto-generate-connection/#additional-topics","title":"Additional Topics","text":""},{"location":"setup/guide/scenario/auto-generate-connection/#learn-from-existing-data","title":"Learn From Existing Data","text":"<p>If you have any existing data within your data source, Data Caterer will gather metadata about the existing data to help guide it when generating new data. There are configurations that can help tune the metadata analysis found here.</p>"},{"location":"setup/guide/scenario/auto-generate-connection/#filter-out-schematables","title":"Filter Out Schema/Tables","text":"<p>As part of your connection definition, you can define any schemas and/or tables you don't want to generate data for. In the example below, it will not generate any data for any tables under the <code>history</code> and <code>audit</code> schemas. 
Any table with the name <code>balances</code> or <code>transactions</code> in any schema will also not have data generated.</p> JavaScala <pre><code>var autoRun = configuration()\n.postgres(\n\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/customer\",\nMap.of(\n\"filterOutSchema\", \"history, audit\",\n\"filterOutTable\", \"balances, transactions\"\n)\n);\n</code></pre> <pre><code>val autoRun = configuration\n.postgres(\n\"my_postgres\",\n\"jdbc:postgresql://host.docker.internal:5432/customer\",\nMap(\n\"filterOutSchema\" -&gt; \"history, audit\",\n\"filterOutTable\" -&gt; \"balances, transactions\"\n)\n)\n</code></pre>"},{"location":"setup/guide/scenario/auto-generate-connection/#define-record-count","title":"Define record count","text":"<p>You can control the record count per sub data source via <code>numRecordsPerStep</code>.</p> JavaScala <pre><code>var autoRun = configuration()\n...\n.numRecordsPerStep(100);\n\nexecute(autoRun);\n</code></pre> <pre><code>val autoRun = configuration\n...\n.numRecordsPerStep(100)\n\nexecute(configuration = autoRun)\n</code></pre>"},{"location":"setup/guide/scenario/batch-and-event/","title":"Generate Batch and Event Data","text":"<p>Info</p> <p>Generating event data is a paid feature. 
Try the free trial here.</p> <p>Creating a data generator for Kafka topic with matching records in a CSV file.</p>"},{"location":"setup/guide/scenario/batch-and-event/#requirements","title":"Requirements","text":"<ul> <li>5 minutes</li> <li>Git</li> <li>Gradle</li> <li>Docker</li> </ul>"},{"location":"setup/guide/scenario/batch-and-event/#get-started","title":"Get Started","text":"<p>First, we will clone the data-caterer-example repo which will already have the base project setup required.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\n</code></pre>"},{"location":"setup/guide/scenario/batch-and-event/#kafka-setup","title":"Kafka Setup","text":"<p>If you don't have your own Kafka up and running, you can set up and run an instance configured in the <code>docker</code> folder via.</p> <pre><code>cd docker\ndocker-compose up -d kafka\ndocker exec docker-kafkaserver-1 kafka-topics --bootstrap-server localhost:9092 --list\n</code></pre> <p>Let's create a task for inserting data into the <code>account-topic</code> that is already defined under<code>docker/data/kafka/setup_kafka.sh</code>.</p>"},{"location":"setup/guide/scenario/batch-and-event/#plan-setup","title":"Plan Setup","text":"<p>Create a new Java or Scala class.</p> <ul> <li>Java: <code>src/main/java/com/github/pflooky/plan/MyAdvancedBatchEventJavaPlanRun.java</code></li> <li>Scala: <code>src/main/scala/com/github/pflooky/plan/MyAdvancedBatchEventPlanRun.scala</code></li> </ul> <p>Make sure your class extends <code>PlanRun</code>.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.java.api.PlanRun;\n...\n\npublic class MyAdvancedBatchEventJavaPlanRun extends PlanRun {\n{\nvar kafkaTask = new AdvancedKafkaJavaPlanRun().getKafkaTask();\n}\n}\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.PlanRun\n...\n\nclass MyAdvancedBatchEventPlanRun extends PlanRun {\nval kafkaTask = new AdvancedKafkaPlanRun().kafkaTask\n}\n</code></pre> <p>We will borrow the Kafka 
task that is already defined under the class <code>AdvancedKafkaPlanRun</code> or <code>AdvancedKafkaJavaPlanRun</code>. You can go through the Kafka guide here for more details.</p>"},{"location":"setup/guide/scenario/batch-and-event/#schema","title":"Schema","text":"<p>Let us set up the corresponding schema for the CSV file where we want to match the values that are generated for the Kafka messages.</p> JavaScala <pre><code>var kafkaTask = new AdvancedKafkaJavaPlanRun().getKafkaTask();\n\nvar csvTask = csv(\"my_csv\", \"/opt/app/data/csv/account\")\n.schema(\nfield().name(\"account_number\"),\nfield().name(\"year\"),\nfield().name(\"name\"),\nfield().name(\"payload\")\n);\n</code></pre> <pre><code>val kafkaTask = new AdvancedKafkaPlanRun().kafkaTask\n\nval csvTask = csv(\"my_csv\", \"/opt/app/data/csv/account\")\n.schema(\nfield.name(\"account_number\"),\nfield.name(\"year\"),\nfield.name(\"name\"),\nfield.name(\"payload\")\n)\n</code></pre> <p>This is a simple schema where we want to use the values and metadata that are already defined in the <code>kafkaTask</code> to determine what the data will look like for the CSV file. Even if we defined some metadata here, it would be overridden when we define our foreign key relationships.</p>"},{"location":"setup/guide/scenario/batch-and-event/#foreign-keys","title":"Foreign Keys","text":"<p>Comparing the above CSV schema against the Kafka schema, we note the following:</p> <ul> <li><code>account_number</code> in CSV needs to match with the <code>account_id</code> in Kafka<ul> <li>We see that <code>account_id</code> is referred to in the <code>key</code> column as <code>field.name(\"key\").sql(\"content.account_id\")</code></li> </ul> </li> <li><code>year</code> needs to match with <code>content.year</code> in Kafka, which is a nested field<ul> <li>We can only do foreign key relationships with top level fields, not nested fields. 
So we define a new column   called <code>tmp_year</code> which will not appear in the final output for the Kafka messages but is used as an intermediate   step <code>field.name(\"tmp_year\").sql(\"content.year\").omit(true)</code></li> </ul> </li> <li><code>name</code> needs to match with <code>content.details.name</code> in Kafka, also a nested field<ul> <li>Using the same logic as above, we define a temporary column called <code>tmp_name</code> which will take the value of the   nested field but will be omitted <code>field.name(\"tmp_name\").sql(\"content.details.name\").omit(true)</code></li> </ul> </li> <li><code>payload</code> represents the whole JSON message sent to Kafka, which matches to <code>value</code> column</li> </ul> <p>Our foreign keys are therefore defined like below. Order is important when defining the list of columns. The index needs to match with the corresponding column in the other data source.</p> JavaScala <pre><code>var myPlan = plan().addForeignKeyRelationship(\nkafkaTask, List.of(\"key\", \"tmp_year\", \"tmp_name\", \"value\"),\nList.of(Map.entry(csvTask, List.of(\"account_number\", \"year\", \"name\", \"payload\")))\n);\n\nvar conf = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n\nexecute(myPlan, conf, kafkaTask, csvTask);\n</code></pre> <pre><code>val myPlan = plan.addForeignKeyRelationship(\nkafkaTask, List(\"key\", \"tmp_year\", \"tmp_name\", \"value\"),\nList(csvTask -&gt; List(\"account_number\", \"year\", \"name\", \"payload\"))\n)\n\nval conf = configuration.generatedReportsFolderPath(\"/opt/app/data/report\")\n\nexecute(myPlan, conf, kafkaTask, csvTask)\n</code></pre>"},{"location":"setup/guide/scenario/batch-and-event/#run","title":"Run","text":"<p>Let's try run.</p> <pre><code>cd ..\n./run.sh\n#input class MyAdvancedBatchEventJavaPlanRun or MyAdvancedBatchEventPlanRun\n#after completing\ndocker exec docker-kafkaserver-1 kafka-console-consumer --bootstrap-server localhost:9092 --topic account-topic 
--from-beginning\n</code></pre> <p>It should look something like this.</p> <pre><code>{\"account_id\":\"ACC03093143\",\"year\":2023,\"amount\":87990.37196728592,\"details\":{\"name\":\"Nadine Heidenreich Jr.\",\"first_txn_date\":\"2021-11-09\",\"updated_by\":{\"user\":\"YfEyJCe8ohrl0j IfyT\",\"time\":\"2022-09-26T20:47:53.404Z\"}},\"transactions\":[{\"txn_date\":\"2021-11-09\",\"amount\":97073.7914706189}]}\n{\"account_id\":\"ACC08764544\",\"year\":2021,\"amount\":28675.58758765888,\"details\":{\"name\":\"Delila Beer\",\"first_txn_date\":\"2021-05-19\",\"updated_by\":{\"user\":\"IzB5ksXu\",\"time\":\"2023-01-26T20:47:26.389Z\"}},\"transactions\":[{\"txn_date\":\"2021-10-01\",\"amount\":80995.23818711648},{\"txn_date\":\"2021-05-19\",\"amount\":92572.40049217848},{\"txn_date\":\"2021-12-11\",\"amount\":99398.79832225188}]}\n{\"account_id\":\"ACC62505420\",\"year\":2023,\"amount\":96125.3125884202,\"details\":{\"name\":\"Shawn Goodwin\",\"updated_by\":{\"user\":\"F3dqIvYp2pFtena4\",\"time\":\"2023-02-11T04:38:29.832Z\"}},\"transactions\":[]}\n</code></pre> <p>Let's also check if there is a corresponding record in the CSV file.</p> <pre><code>$ cat docker/sample/csv/account/part-0000* | grep ACC03093143\nACC03093143,2023,Nadine Heidenreich Jr.,\"{\\\"account_id\\\":\\\"ACC03093143\\\",\\\"year\\\":2023,\\\"amount\\\":87990.37196728592,\\\"details\\\":{\\\"name\\\":\\\"Nadine Heidenreich Jr.\\\",\\\"first_txn_date\\\":\\\"2021-11-09\\\",\\\"updated_by\\\":{\\\"user\\\":\\\"YfEyJCe8ohrl0j IfyT\\\",\\\"time\\\":\\\"2022-09-26T20:47:53.404Z\\\"}},\\\"transactions\\\":[{\\\"txn_date\\\":\\\"2021-11-09\\\",\\\"amount\\\":97073.7914706189}]}\"\n</code></pre> <p>Great! 
The account, year, name and payload all match up.</p>"},{"location":"setup/guide/scenario/batch-and-event/#additional-topics","title":"Additional Topics","text":""},{"location":"setup/guide/scenario/batch-and-event/#order-of-execution","title":"Order of execution","text":"<p>You may notice that the events are generated first, then the CSV file. This is because we passed <code>kafkaTask</code> to the <code>execute</code> function before <code>csvTask</code>. You can change the order of execution by passing <code>csvTask</code> before <code>kafkaTask</code> to the <code>execute</code> function.</p>"},{"location":"setup/guide/scenario/data-validation/","title":"Data Validations","text":"<p>Creating a data validator for a JSON file.</p> <p></p>"},{"location":"setup/guide/scenario/data-validation/#requirements","title":"Requirements","text":"<ul> <li>5 minutes</li> <li>Git</li> <li>Gradle</li> <li>Docker</li> </ul>"},{"location":"setup/guide/scenario/data-validation/#get-started","title":"Get Started","text":"<p>First, we will clone the data-caterer-example repo which will already have the base project setup required.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\n</code></pre>"},{"location":"setup/guide/scenario/data-validation/#data-setup","title":"Data Setup","text":"<p>To aid in showing the functionality of data validations, we will first generate some data that our validations will run against. 
Run the below command and it will generate JSON files under <code>docker/sample/json</code> folder.</p> <pre><code>./run.sh JsonPlan\n</code></pre>"},{"location":"setup/guide/scenario/data-validation/#plan-setup","title":"Plan Setup","text":"<p>Create a new Java or Scala class.</p> <ul> <li>Java: <code>src/main/java/com/github/pflooky/plan/MyValidationJavaPlan.java</code></li> <li>Scala: <code>src/main/scala/com/github/pflooky/plan/MyValidationPlan.scala</code></li> </ul> <p>Make sure your class extends <code>PlanRun</code>.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.java.api.PlanRun;\n...\n\npublic class MyValidationJavaPlan extends PlanRun {\n{\nvar jsonTask = json(\"my_json\", \"/opt/app/data/json\");\n\nvar config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableValidation(true)\n.enableGenerateData(false);\n\nexecute(config, jsonTask);\n}\n}\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.PlanRun\n...\n\nclass MyValidationPlan extends PlanRun {\nval jsonTask = json(\"my_json\", \"/opt/app/data/json\")\n\nval config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableValidation(true)\n.enableGenerateData(false)\n\nexecute(config, jsonTask)\n}\n</code></pre> <p>As noted above, we create a JSON task that points to where the JSON data has been created at folder <code>/opt/app/data/json</code> . 
We also note that <code>enableValidation</code> is set to <code>true</code> and <code>enableGenerateData</code> to <code>false</code> to tell Data Caterer that we only want to validate data.</p>"},{"location":"setup/guide/scenario/data-validation/#validations","title":"Validations","text":"<p>For reference, the schema we will be validating against looks like the below.</p> <pre><code>.schema(\nfield.name(\"account_id\"),\n  field.name(\"year\").`type`(IntegerType),\n  field.name(\"balance\").`type`(DoubleType),\n  field.name(\"date\").`type`(DateType),\n  field.name(\"status\"),\n  field.name(\"update_history\").`type`(ArrayType)\n.schema(\nfield.name(\"updated_time\").`type`(TimestampType),\n      field.name(\"status\").oneOf(\"open\", \"closed\", \"pending\", \"suspended\"),\n    ),\n  field.name(\"customer_details\")\n.schema(\nfield.name(\"name\").expression(\"#{Name.name}\"),\n      field.name(\"age\").`type`(IntegerType),\n      field.name(\"city\").expression(\"#{Address.city}\")\n)\n)\n</code></pre>"},{"location":"setup/guide/scenario/data-validation/#basic-validation","title":"Basic Validation","text":"<p>Let's say our goal is to validate the <code>customer_details.name</code> field to ensure it conforms to the regex pattern <code>[A-Z][a-z]+ [A-Z][a-z]+</code>. Given the diversity in naming conventions across cultures and countries, variations such as middle names, suffixes, prefixes, or language-specific differences are tolerated to a certain extent. 
The validation allows for an acceptable error threshold before it is marked as failed.</p>"},{"location":"setup/guide/scenario/data-validation/#validation-criteria","title":"Validation Criteria","text":"<ul> <li>Field to Validate: <code>customer_details.name</code></li> <li>Regex Pattern: <code>[A-Z][a-z]+ [A-Z][a-z]+</code></li> <li>Error Tolerance: If more than 10% of records do not match the regex, the validation fails.</li> </ul>"},{"location":"setup/guide/scenario/data-validation/#considerations","title":"Considerations","text":"<ul> <li>Customisation<ul> <li>Adjust the regex pattern and error threshold based on your specific data schema and validation requirements.</li> <li>For the full list of types of basic validations that can be used, check this page.</li> </ul> </li> <li>Understanding Tolerance<ul> <li>Be mindful of the error threshold, as it directly influences what percentage of deviations from the pattern is acceptable.</li> </ul> </li> </ul> JavaScala <pre><code>validation().col(\"customer_details.name\")\n.matches(\"[A-Z][a-z]+ [A-Z][a-z]+\")\n.errorThreshold(0.1)                                      //&lt;=10% failure rate is acceptable\n.description(\"Names generally follow the same pattern\"),  //description to add context in the report or for other developers\n</code></pre> <pre><code>validation.col(\"customer_details.name\")\n.matches(\"[A-Z][a-z]+ [A-Z][a-z]+\")\n.errorThreshold(0.1)                                      //&lt;=10% failure rate is acceptable\n.description(\"Names generally follow the same pattern\"),  //description to add context in the report or for other developers\n</code></pre>"},{"location":"setup/guide/scenario/data-validation/#custom-validation","title":"Custom Validation","text":"<p>There will be situations where you have a complex data setup and require your own custom logic for data validation. You can achieve this by setting your own SQL expression that returns a boolean value. 
In the example below, we check that every entry in the <code>update_history</code> array has an <code>updated_time</code> greater than a certain timestamp.</p> JavaScala <pre><code>validation().expr(\"FORALL(update_history, x -&gt; x.updated_time &gt; TIMESTAMP('2022-01-01 00:00:00'))\"),\n</code></pre> <pre><code>validation.expr(\"FORALL(update_history, x -&gt; x.updated_time &gt; TIMESTAMP('2022-01-01 00:00:00'))\"),\n</code></pre> <p>If you want to know what other SQL functions are available for you to use, check this page.</p>"},{"location":"setup/guide/scenario/data-validation/#group-by-validation","title":"Group By Validation","text":"<p>There are scenarios where you want to validate against grouped values or the whole dataset via aggregations. An example would be validating that the sum of each customer's transactions is greater than 0.</p>"},{"location":"setup/guide/scenario/data-validation/#validation-criteria_1","title":"Validation Criteria","text":"<p>Line 1: <code>validation.groupBy().count().isEqual(100)</code></p> <ul> <li>Method Chaining<ul> <li><code>groupBy()</code>: Groups the whole dataset.</li> <li><code>count()</code>: Counts the number of records in the dataset.</li> <li><code>isEqual(100)</code>: Checks if the count is equal to 100.</li> </ul> </li> <li>Validation Rule<ul> <li>This line ensures that the count of the total dataset is exactly 100.</li> </ul> </li> </ul> <p>Line 2: <code>validation.groupBy(\"account_id\").max(\"balance\").lessThan(900)</code></p> <ul> <li>Method Chaining<ul> <li><code>groupBy(\"account_id\")</code>: Groups the data based on the <code>account_id</code> field.</li> <li><code>max(\"balance\")</code>: Calculates the maximum value of the <code>balance</code> field within each group.</li> <li><code>lessThan(900)</code>: Checks if the maximum balance in each group is less than 900.</li> </ul> </li> <li>Validation Rule<ul> <li>This line ensures that, for each group identified by <code>account_id</code>, the maximum balance is 
less than 900.</li> </ul> </li> </ul>"},{"location":"setup/guide/scenario/data-validation/#considerations_1","title":"Considerations","text":"<ul> <li>Adjust the <code>errorThreshold</code> or validation to your specification scenario. The full list   of types of validations can be found here.</li> <li>For the full list of types of group by validations that can be   used, check this page.</li> </ul> JavaScala <pre><code>validation().groupBy().count().isEqual(100),\nvalidation().groupBy(\"account_id\").max(\"balance\").lessThan(900)\n</code></pre> <pre><code>validation.groupBy().count().isEqual(100),\nvalidation.groupBy(\"account_id\").max(\"balance\").lessThan(900)\n</code></pre>"},{"location":"setup/guide/scenario/data-validation/#sample-validation","title":"Sample Validation","text":"<p>To try cover the majority of validation cases, the below has been created.</p> JavaScala <pre><code>var jsonTask = json(\"my_json\", \"/opt/app/data/json\")\n.validations(\nvalidation().col(\"customer_details.name\").matches(\"[A-Z][a-z]+ [A-Z][a-z]+\").errorThreshold(0.1).description(\"Names generally follow the same pattern\"),\nvalidation().col(\"date\").isNotNull().errorThreshold(10),\nvalidation().col(\"balance\").greaterThan(500),\nvalidation().expr(\"YEAR(date) == year\"),\nvalidation().col(\"status\").in(\"open\", \"closed\", \"pending\").errorThreshold(0.2).description(\"Could be new status introduced\"),\nvalidation().col(\"customer_details.age\").greaterThan(18),\nvalidation().expr(\"FORALL(update_history, x -&gt; x.updated_time &gt; TIMESTAMP('2022-01-01 00:00:00'))\"),\nvalidation().col(\"update_history\").greaterThanSize(2),\nvalidation().unique(\"account_id\"),\nvalidation().groupBy().count().isEqual(1000),\nvalidation().groupBy(\"account_id\").max(\"balance\").lessThan(900)\n);\n\nvar config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableValidation(true)\n.enableGenerateData(false);\n\nexecute(config, jsonTask);\n</code></pre> 
<pre><code>val jsonTask = json(\"my_json\", \"/opt/app/data/json\")\n.validations(\nvalidation.col(\"customer_details.name\").matches(\"[A-Z][a-z]+ [A-Z][a-z]+\").errorThreshold(0.1).description(\"Names generally follow the same pattern\"),\nvalidation.col(\"date\").isNotNull.errorThreshold(10),\nvalidation.col(\"balance\").greaterThan(500),\nvalidation.expr(\"YEAR(date) == year\"),\nvalidation.col(\"status\").in(\"open\", \"closed\", \"pending\").errorThreshold(0.2).description(\"Could be new status introduced\"),\nvalidation.col(\"customer_details.age\").greaterThan(18),\nvalidation.expr(\"FORALL(update_history, x -&gt; x.updated_time &gt; TIMESTAMP('2022-01-01 00:00:00'))\"),\nvalidation.col(\"update_history\").greaterThanSize(2),\nvalidation.unique(\"account_id\"),\nvalidation.groupBy().count().isEqual(1000),\nvalidation.groupBy(\"account_id\").max(\"balance\").lessThan(900)\n)\n\nval config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableValidation(true)\n.enableGenerateData(false)\n\nexecute(config, jsonTask)\n</code></pre>"},{"location":"setup/guide/scenario/data-validation/#run","title":"Run","text":"<p>Let's try run.</p> <pre><code>./run.sh\n#input class MyValidationJavaPlan or MyValidationPlan\n#after completing, check report at docker/sample/report/index.html\n</code></pre> <p>It should look something like this.</p> <p>Check the full example at <code>ValidationPlanRun</code> inside the examples repo.</p>"},{"location":"setup/guide/scenario/delete-generated-data/","title":"Delete Generated Data","text":"<p>Info</p> <p>Delete generated data is a paid feature. 
Try the free trial here.</p> <p>Creating a data generator for Postgres and delete the generated data after using it.</p>"},{"location":"setup/guide/scenario/delete-generated-data/#requirements","title":"Requirements","text":"<ul> <li>5 minutes</li> <li>Git</li> <li>Gradle</li> <li>Docker</li> </ul>"},{"location":"setup/guide/scenario/delete-generated-data/#get-started","title":"Get Started","text":"<p>First, we will clone the data-caterer-example repo which will already have the base project setup required.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\n</code></pre>"},{"location":"setup/guide/scenario/delete-generated-data/#plan-setup","title":"Plan Setup","text":"<p>Create a new Java or Scala class.</p> <ul> <li>Java: <code>src/main/java/com/github/pflooky/plan/MyAdvancedDeleteJavaPlanRun.java</code></li> <li>Scala: <code>src/main/scala/com/github/pflooky/plan/MyAdvancedDeletePlanRun.scala</code></li> </ul> <p>Make sure your class extends <code>PlanRun</code>.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.java.api.PlanRun;\n...\n\npublic class MyAdvancedDeleteJavaPlanRun extends PlanRun {\n{\nvar autoRun = configuration()\n.postgres(\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/customer\")  (1)\n.enableGeneratePlanAndTasks(true)                                                 (2)\n.enableRecordTracking(true)                                                       (3)\n.enableDeleteGeneratedRecords(false)                                              (4)\n.enableUniqueCheck(true)\n.generatedPlanAndTaskFolderPath(\"/opt/app/data/generated\")                        (5)\n.recordTrackingFolderPath(\"/opt/app/data/recordTracking\")                         (6)\n.generatedReportsFolderPath(\"/opt/app/data/report\");\n\nexecute(autoRun);\n}\n}\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.PlanRun\n...\n\nclass MyAdvancedDeletePlanRun extends PlanRun {\n\nval autoRun = 
configuration\n.postgres(\"my_postgres\", \"jdbc:postgresql://host.docker.internal:5432/customer\")  (1)\n.enableGeneratePlanAndTasks(true)                                                 (2)\n.enableRecordTracking(true)                                                       (3)\n.enableDeleteGeneratedRecords(false)                                              (4)\n.enableUniqueCheck(true)\n.generatedPlanAndTaskFolderPath(\"/opt/app/data/generated\")                        (5)\n.recordTrackingFolderPath(\"/opt/app/data/recordTracking\")                         (6)\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n\nexecute(configuration = autoRun)\n}\n</code></pre> <p>In the above code we note the following:</p> <ol> <li>We have defined a Postgres connection called <code>my_postgres</code></li> <li><code>enableGeneratePlanAndTasks</code> is enabled to auto generate data for all tables under <code>customer</code> database</li> <li><code>enableRecordTracking</code> is enabled to ensure that all generated records are tracked. This will get used when we want    to delete data afterwards</li> <li><code>enableDeleteGeneratedRecords</code> is disabled for now. We want to see the generated data first and delete sometime after</li> <li><code>generatedPlanAndTaskFolderPath</code> is the folder path where we saved the metadata we have gathered from <code>my_postgres</code></li> <li><code>recordTrackingFolderPath</code> is the folder path where record tracking is maintained. 
We need to persist this data to    ensure it is still available when we want to delete data</li> </ol>"},{"location":"setup/guide/scenario/delete-generated-data/#postgres-setup","title":"Postgres Setup","text":"<p>If you don't have your own Postgres up and running, you can set up and run an instance configured in the <code>docker</code> folder via.</p> <pre><code>cd docker\ndocker-compose up -d postgres\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c '\\dt+ account.*'\n</code></pre> <p>This will create the tables found under <code>docker/data/sql/postgres/customer.sql</code>. You can change this file to contain your own tables. We can see there are 4 tables created for us, <code>accounts, balances, transactions and mapping</code>.</p>"},{"location":"setup/guide/scenario/delete-generated-data/#run","title":"Run","text":"<p>Let's try run.</p> <pre><code>cd ..\n./run.sh\n#input class MyAdvancedDeleteJavaPlanRun or MyAdvancedDeletePlanRun\n#after completing\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c 'select * from account.accounts limit 1'\n</code></pre> <p>It should look something like this.</p> <pre><code>   id   | account_number  | account_status | created_by | created_by_fixed_length | customer_id_int | customer_id_smallint | customer_id_bigint |   customer_id_decimal    | customer_id_real | customer_id_double | open_date  |     open_timestamp      | last_opened_time |                                                           payload_bytes\n--------+-----------------+----------------+------------+-------------------------+-----------------+----------------------+--------------------+--------------------------+------------------+--------------------+------------+-------------------------+------------------+------------------------------------------------------------------------------------------------------------------------------------\n 100414 | 5uROOVOUyQUbubN | h3H            | SfA0eZJcTm | CuRw                    |   
           13 |                   42 |               6041 | 76987.745612542900000000 |         91866.78 |  66400.37433202339 | 2023-03-05 | 2023-08-14 11:33:11.343 | 23:58:01.736     | \\x604d315d4547616e6a233050415373317274736f5e682d516132524f3d23233c37463463322f342d34376d597e665d6b3d395b4238284028622b7d6d2b4f5042\n(1 row)\n</code></pre> <p>The data that gets inserted will follow the foreign keys that are defined within Postgres and also ensure the insertion order is correct.</p> <p>Check the number of records via:</p> <pre><code>docker exec docker-postgresserver-1 psql -Upostgres -d customer -c 'select count(1) from account.accounts'\n#open report under docker/sample/report/index.html\n</code></pre>"},{"location":"setup/guide/scenario/delete-generated-data/#delete","title":"Delete","text":"<p>We are now at a stage where we want to delete the data that was generated. All we need to do is flip two flags.</p> <pre><code>.enableDeleteGeneratedRecords(true)\n.enableGenerateData(false)  //we need to explicitly disable generating data\n</code></pre> <p>Enable delete generated records and disable generating data. </p> <p>Before we run again, let us insert a record manually to see if that data will survive after running the job to delete the generated data.</p> <pre><code>docker exec docker-postgresserver-1 psql -Upostgres -d customer -c \"insert into account.accounts (account_number) values ('my_account_number')\"\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c \"select count(1) from account.accounts\"\n</code></pre> <p>We now should have 1001 records in our <code>account.accounts</code> table. 
Let's delete the generated data now.</p> <pre><code>./run.sh\n#input class MyAdvancedDeleteJavaPlanRun or MyAdvancedDeletePlanRun\n#after completing\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c 'select * from account.accounts limit 1'\ndocker exec docker-postgresserver-1 psql -Upostgres -d customer -c 'select count(1) from account.accounts'\n</code></pre> <p>You should see that only 1 record is left, the one that we manually inserted. Great, now we can generate data reliably and also clean it up.</p>"},{"location":"setup/guide/scenario/delete-generated-data/#additional-topics","title":"Additional Topics","text":""},{"location":"setup/guide/scenario/delete-generated-data/#one-class-for-generating-another-for-deleting","title":"One class for generating, another for deleting?","text":"<p>Yes, this is possible. There are two requirements:</p> <ul> <li>the connection names used need to be the same across both classes</li> <li><code>recordTrackingFolderPath</code> needs to be set to the same value</li> </ul>"},{"location":"setup/guide/scenario/delete-generated-data/#define-record-count","title":"Define record count","text":"<p>You can control the record count per sub data source via <code>numRecordsPerStep</code>.</p> JavaScala <pre><code>var autoRun = configuration()\n...\n.numRecordsPerStep(100);\n\nexecute(autoRun);\n</code></pre> <pre><code>val autoRun = configuration\n...\n.numRecordsPerStep(100)\n\nexecute(configuration = autoRun)\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/","title":"First Data Generation","text":"<p>Creating a data generator for a CSV file.</p> <p></p>"},{"location":"setup/guide/scenario/first-data-generation/#requirements","title":"Requirements","text":"<ul> <li>20 minutes</li> <li>Git</li> <li>Gradle</li> <li>Docker</li> </ul>"},{"location":"setup/guide/scenario/first-data-generation/#get-started","title":"Get Started","text":"<p>First, we will clone the data-caterer-example repo which will already have the 
base project setup required.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#plan-setup","title":"Plan Setup","text":"<p>Create a new Java or Scala class.</p> <ul> <li>Java: <code>src/main/java/com/github/pflooky/plan/MyCsvJavaPlan.java</code></li> <li>Scala: <code>src/main/scala/com/github/pflooky/plan/MyCsvPlan.scala</code></li> </ul> <p>Make sure your class extends <code>PlanRun</code>.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.java.api.PlanRun;\n\npublic class MyCsvJavaPlan extends PlanRun {\n}\n</code></pre> <pre><code>import com.github.pflooky.datacaterer.api.PlanRun\n\nclass MyCsvPlan extends PlanRun {\n}\n</code></pre> <p>This class is where we define all of our configurations for generating data. There are helper variables and methods defined to make it simple and easy to use.</p>"},{"location":"setup/guide/scenario/first-data-generation/#connection-configuration","title":"Connection Configuration","text":"<p>When dealing with CSV files, we need to define a path for our generated CSV files to be saved at, along with any other high level configurations.</p> JavaScala <pre><code>csv(\n\"customer_accounts\",              //name\n\"/opt/app/data/customer/account\", //path\nMap.of(\"header\", \"true\")          //optional additional options\n)\n</code></pre> <p>Other additional options for CSV can be found here</p> <pre><code>csv(\n\"customer_accounts\",              //name\n\"/opt/app/data/customer/account\", //path\nMap(\"header\" -&gt; \"true\")           //optional additional options\n)\n</code></pre> <p>Other additional options for CSV can be found here</p>"},{"location":"setup/guide/scenario/first-data-generation/#schema","title":"Schema","text":"<p>The CSV file that we generate should adhere to a defined schema where we can also define data types.</p> <p>Let's define each field along with its corresponding data type. 
You will notice that the <code>string</code> fields do not have a data type defined. This is because the default data type is <code>StringType</code>.</p> JavaScala <pre><code>var accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"balance\").type(DoubleType.instance()),\nfield().name(\"created_by\"),\nfield().name(\"name\"),\nfield().name(\"open_time\").type(TimestampType.instance()),\nfield().name(\"status\")\n);\n</code></pre> <pre><code>val accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map(\"header\" -&gt; \"true\"))\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"balance\").`type`(DoubleType),\nfield.name(\"created_by\"),\nfield.name(\"name\"),\nfield.name(\"open_time\").`type`(TimestampType),\nfield.name(\"status\")\n)\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#field-metadata","title":"Field Metadata","text":"<p>We could stop here and generate random data for the accounts table. But wouldn't it be more useful if we produced data that is closer to the structure of the data that would come in production? We can do this by defining various metadata attributes that add guidelines that the data generator will understand when generating data.</p>"},{"location":"setup/guide/scenario/first-data-generation/#account_id","title":"account_id","text":"<ul> <li><code>account_id</code> follows a particular pattern that where it starts with <code>ACC</code> and has 8 digits after it.   This can be defined via a regex like below. 
Alongside this, we also mark the field as unique to ensure that unique values are generated.</li> </ul> JavaScala <pre><code>field().name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\n</code></pre> <pre><code>field.name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#balance","title":"balance","text":"<ul> <li><code>balance</code> should not be too large, so we define a min and max so that the generated numbers fall between <code>1</code> and <code>1000</code>.</li> </ul> JavaScala <pre><code>field().name(\"balance\").type(DoubleType.instance()).min(1).max(1000),\n</code></pre> <pre><code>field.name(\"balance\").`type`(DoubleType).min(1).max(1000),\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#name","title":"name","text":"<ul> <li><code>name</code> is a string that also follows a certain pattern, so we could also define a regex, but here we will choose to leverage the DataFaker library and create an <code>expression</code> to generate realistic-looking names. All possible faker expressions can be found here</li> </ul> JavaScala <pre><code>field().name(\"name\").expression(\"#{Name.name}\"),\n</code></pre> <pre><code>field.name(\"name\").expression(\"#{Name.name}\"),\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#open_time","title":"open_time","text":"<ul> <li><code>open_time</code> is a timestamp that we want to have a value greater than a specific date. We can define a min date by using 
We can define a min date by   using   <code>java.sql.Date</code> like below.</li> </ul> JavaScala <pre><code>field().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\n</code></pre> <pre><code>field.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#status","title":"status","text":"<ul> <li><code>status</code> is a field that can only obtain one of four values, <code>open, closed, suspended or pending</code>.</li> </ul> JavaScala <pre><code>field().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n</code></pre> <pre><code>field.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#created_by","title":"created_by","text":"<ul> <li><code>created_by</code> is a field that is based on the <code>status</code> field where it follows the   logic: <code>if status is open or closed, then   it is created_by eod else created_by event</code>. 
This can be achieved by defining a SQL expression like below.</li> </ul> JavaScala <pre><code>field().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\n</code></pre> <pre><code>field.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\n</code></pre> <p>Putting it all the fields together, our class should now look like this.</p> JavaScala <pre><code>var accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\nfield().name(\"balance\").type(DoubleType.instance()).min(1).max(1000),\nfield().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n);\n</code></pre> <pre><code>val accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map(\"header\" -&gt; \"true\"))\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\nfield.name(\"balance\").`type`(DoubleType).min(1).max(1000),\nfield.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n)\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#record-count","title":"Record Count","text":"<p>We only want to generate 100 records, so that we can see what the output looks like. This is controlled at the <code>accountTask</code> level like below. 
If you want to generate more records, set it to the value you want.</p> JavaScala <pre><code>var accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map.of(\"header\", \"true\"))\n.schema(\n...\n)\n.count(count().records(100));\n</code></pre> <pre><code>val accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map(\"header\" -&gt; \"true\"))\n.schema(\n...\n)\n.count(count.records(100))\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#additional-configurations","title":"Additional Configurations","text":"<p>At the end of data generation, a report gets generated that summarises the actions it performed. We can control the output folder of that report via configurations. We will also enable the unique check to ensure any unique fields will have unique values generated.</p> JavaScala <pre><code>var config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true);\n</code></pre> <pre><code>val config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true)\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#execute","title":"Execute","text":"<p>To tell Data Caterer that we want to run with the configurations along with the <code>accountTask</code>, we have to call <code>execute</code> . 
So our full plan run will look like this.</p> JavaScala <pre><code>public class MyCsvJavaPlan extends PlanRun {\n{\nvar accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\nfield().name(\"balance\").type(DoubleType.instance()).min(1).max(1000),\nfield().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n);\n\nvar config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true);\n\nexecute(config, accountTask);\n}\n}\n</code></pre> <pre><code>class MyCsvPlan extends PlanRun {\n\nval accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map(\"header\" -&gt; \"true\"))\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\nfield.name(\"balance\").`type`(DoubleType).min(1).max(1000),\nfield.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n)\nval config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true)\n\nexecute(config, accountTask)\n}\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#run","title":"Run","text":"<p>Now we can run via the script <code>./run.sh</code> that is in the top level directory of the <code>data-caterer-example</code> to run the class we just created.</p> <pre><code>./run.sh\n#input class MyCsvJavaPlan or 
MyCsvPlan\n#after completing\nhead docker/sample/customer/account/part-00000*\n</code></pre> <p>Your output should look like this.</p> <pre><code>account_id,balance,created_by,name,open_time,status\nACC06192462,853.9843359645766,eod,Hoyt Kertzmann MD,2023-07-22T11:17:01.713Z,closed\nACC15350419,632.5969895326234,eod,Dr. Claude White,2022-12-13T21:57:56.840Z,open\nACC25134369,592.0958847218986,eod,Fabian Rolfson,2023-04-26T04:54:41.068Z,open\nACC48021786,656.6413439322964,eod,Dewayne Stroman,2023-05-17T06:31:27.603Z,open\nACC26705211,447.2850352884595,event,Garrett Funk,2023-07-14T03:50:22.746Z,pending\nACC03150585,750.4568929015996,event,Natisha Reichel,2023-04-11T11:13:10.080Z,suspended\nACC29834210,686.4257811608622,event,Gisele Ondricka,2022-11-15T22:09:41.172Z,suspended\nACC39373863,583.5110618128994,event,Thaddeus Ortiz,2022-09-30T06:33:57.193Z,suspended\nACC39405798,989.2623959059525,eod,Shelby Reinger,2022-10-23T17:29:17.564Z,open\n</code></pre> <p>Also check the HTML report, found at <code>docker/sample/report/index.html</code>, that gets generated to get an overview of what was executed.</p> <p></p>"},{"location":"setup/guide/scenario/first-data-generation/#join-with-another-csv","title":"Join With Another CSV","text":"<p>Now that we have generated some accounts, let's also try to generate a set of transactions for those accounts in CSV format as well. 
The transactions could be in any other format, but to keep this simple, we will continue using CSV.</p> <p>We can define our schema the same way along with any additional metadata.</p> JavaScala <pre><code>var transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"full_name\"),\nfield().name(\"amount\").type(DoubleType.instance()).min(1).max(100),\nfield().name(\"time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"date\").type(DateType.instance()).sql(\"DATE(time)\")\n);\n</code></pre> <pre><code>val transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -&gt; \"true\"))\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"full_name\"),\nfield.name(\"amount\").`type`(DoubleType).min(1).max(100),\nfield.name(\"time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"date\").`type`(DateType).sql(\"DATE(time)\")\n)\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#records-per-column","title":"Records Per Column","text":"<p>Usually, for a given <code>account_id, full_name</code> pair, there should be multiple records, as we want to simulate a customer having multiple transactions. 
We can achieve this through defining the number of records to generate in the <code>count</code> function.</p> JavaScala <pre><code>var transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\n...\n)\n.count(count().recordsPerColumn(5, \"account_id\", \"full_name\"));\n</code></pre> <pre><code>val transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -&gt; \"true\"))\n.schema(\n...\n)\n.count(count.recordsPerColumn(5, \"account_id\", \"full_name\"))\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#random-records-per-column","title":"Random Records Per Column","text":"<p>Above, you will notice that we are generating 5 records per <code>account_id, full_name</code>. This is okay but still not quite reflective of the real world. Sometimes, people have accounts with no transactions in them, or they could have many. We can accommodate for this via defining a random number of records per column.</p> JavaScala <pre><code>var transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\n...\n)\n.count(count().recordsPerColumnGenerator(generator().min(0).max(5), \"account_id\", \"full_name\"));\n</code></pre> <pre><code>val transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -&gt; \"true\"))\n.schema(\n...\n)\n.count(count.recordsPerColumnGenerator(generator.min(0).max(5), \"account_id\", \"full_name\"))\n</code></pre> <p>Here we set the minimum number of records per column to be 0 and the maximum to 5.</p>"},{"location":"setup/guide/scenario/first-data-generation/#foreign-key","title":"Foreign Key","text":"<p>In this scenario, we want to match the <code>account_id</code> in <code>account</code> to match the same column values in <code>transaction</code>. 
We also want to match <code>name</code> in <code>account</code> to <code>full_name</code> in <code>transaction</code>. This can be done via plan configuration like below.</p> JavaScala <pre><code>var myPlan = plan().addForeignKeyRelationship(\naccountTask, List.of(\"account_id\", \"name\"), //the task and columns we want linked\nList.of(Map.entry(transactionTask, List.of(\"account_id\", \"full_name\"))) //list of other tasks and their respective column names we want matched\n);\n</code></pre> <pre><code>val myPlan = plan.addForeignKeyRelationship(\naccountTask, List(\"account_id\", \"name\"),  //the task and columns we want linked\nList(transactionTask -&gt; List(\"account_id\", \"full_name\"))  //list of other tasks and their respective column names we want matched\n)\n</code></pre> <p>Now, stitching it all together for the <code>execute</code> function, our final plan should look like this.</p> JavaScala <pre><code>public class MyCsvJavaPlan extends PlanRun {\n{\nvar accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\nfield().name(\"balance\").type(DoubleType.instance()).min(1).max(1000),\nfield().name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield().name(\"name\").expression(\"#{Name.name}\"),\nfield().name(\"open_time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n)\n.count(count().records(100));\n\nvar transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", 
\"true\"))\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"full_name\"),\nfield().name(\"amount\").type(DoubleType.instance()).min(1).max(100),\nfield().name(\"time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"date\").type(DateType.instance()).sql(\"DATE(time)\")\n)\n.count(count().recordsPerColumnGenerator(generator().min(0).max(5), \"account_id\", \"full_name\"));\n\nvar config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true);\n\nvar myPlan = plan().addForeignKeyRelationship(\naccountTask, List.of(\"account_id\", \"name\"),\nList.of(Map.entry(transactionTask, List.of(\"account_id\", \"full_name\")))\n);\n\nexecute(myPlan, config, accountTask, transactionTask);\n}\n}\n</code></pre> <pre><code>class MyCsvPlan extends PlanRun {\n\nval accountTask = csv(\"customer_accounts\", \"/opt/app/data/customer/account\", Map(\"header\" -&gt; \"true\"))\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\").unique(true),\nfield.name(\"balance\").`type`(DoubleType).min(1).max(1000),\nfield.name(\"created_by\").sql(\"CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END\"),\nfield.name(\"name\").expression(\"#{Name.name}\"),\nfield.name(\"open_time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"status\").oneOf(\"open\", \"closed\", \"suspended\", \"pending\")\n)\n.count(count.records(100))\n\nval transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -&gt; \"true\"))\n.schema(\nfield.name(\"account_id\"),\nfield.name(\"full_name\"),\nfield.name(\"amount\").`type`(DoubleType).min(1).max(100),\nfield.name(\"time\").`type`(TimestampType).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield.name(\"date\").`type`(DateType).sql(\"DATE(time)\")\n)\n.count(count.recordsPerColumnGenerator(generator.min(0).max(5), \"account_id\", \"full_name\"))\n\nval config = 
configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true)\n\nval myPlan = plan.addForeignKeyRelationship(\naccountTask, List(\"account_id\", \"name\"),\nList(transactionTask -&gt; List(\"account_id\", \"full_name\"))\n)\n\nexecute(myPlan, config, accountTask, transactionTask)\n}\n</code></pre> <p>Let's try running it again.</p> <pre><code>#clean up old data\nrm -rf docker/sample/customer/account\n./run.sh\n#input class MyCsvJavaPlan or MyCsvPlan\n#after completing, let's pick an account and check the transactions for that account\naccount=$(tail -1 docker/sample/customer/account/part-00000* | awk -F \",\" '{print $1 \",\" $4}')\n#quote the variable as the name contains a space\necho \"$account\"\ncat docker/sample/customer/transaction/part-00000* | grep \"$account\"\n</code></pre> <p>It should look something like this.</p> <pre><code>ACC29117767,Willodean Sauer\nACC29117767,Willodean Sauer,84.99145871948083,2023-05-14T09:55:51.439Z,2023-05-14\nACC29117767,Willodean Sauer,58.89345733567232,2022-11-22T07:38:20.143Z,2022-11-22\n</code></pre> <p>Congratulations! You have now made a data generator that has simulated a real-world data scenario. You can check the <code>DocumentationJavaPlanRun.java</code> or <code>DocumentationPlanRun.scala</code> files as well to check that your plan is the same.</p> <p>We can now look to consume this CSV data from a job or service. Usually, once we have consumed the data, we would also want to check and validate that our consumer has correctly ingested the data.</p>"},{"location":"setup/guide/scenario/first-data-generation/#validate","title":"Validate","text":"<p>In this scenario, our consumer will read in the CSV file, do some transformations, and then save the data to Postgres. Let's try to configure data validations for the data that gets pushed into Postgres.</p>"},{"location":"setup/guide/scenario/first-data-generation/#postgres-setup","title":"Postgres setup","text":"<p>First, we define our connection properties for Postgres. 
You can check out the full options available here.</p> JavaScala <pre><code>var postgresValidateTask = postgres(\n\"my_postgres\",                                          //connection name\n\"jdbc:postgresql://host.docker.internal:5432/customer\", //url\n\"postgres\",                                             //username\n\"password\"                                              //password\n).table(\"account\", \"transactions\");\n</code></pre> <pre><code>val postgresValidateTask = postgres(\n\"my_postgres\",                                          //connection name\n\"jdbc:postgresql://host.docker.internal:5432/customer\", //url\n\"postgres\",                                             //username\n\"password\"                                              //password\n).table(\"account\", \"transactions\")\n</code></pre> <p>We can connect and access the data inside the table <code>account.transactions</code>. Now to define our data validations.</p>"},{"location":"setup/guide/scenario/first-data-generation/#validations","title":"Validations","text":"<p>For full information about validation options and configurations, check here. 
Below, we have an example that should give you a good understanding of what validations are possible.</p> JavaScala <pre><code>var postgresValidateTask = postgres(...)\n.table(\"account\", \"transactions\")\n.validations(\nvalidation().col(\"account_id\").isNotNull(),\nvalidation().col(\"name\").matches(\"[A-Z][a-z]+ [A-Z][a-z]+\").errorThreshold(0.2).description(\"Some names have different formats\"),\nvalidation().col(\"balance\").greaterThanOrEqual(0).errorThreshold(10).description(\"Account can have negative balance if overdraft\"),\nvalidation().expr(\"CASE WHEN status == 'closed' THEN isNotNull(close_date) ELSE isNull(close_date) END\"),\nvalidation().unique(\"account_id\", \"name\"),\nvalidation().groupBy(\"account_id\", \"name\").max(\"login_retry\").lessThan(10)\n);\n</code></pre> <pre><code>val postgresValidateTask = postgres(...)\n.table(\"account\", \"transactions\")\n.validations(\nvalidation.col(\"account_id\").isNotNull,\nvalidation.col(\"name\").matches(\"[A-Z][a-z]+ [A-Z][a-z]+\").errorThreshold(0.2).description(\"Some names have different formats\"),\nvalidation.col(\"balance\").greaterThanOrEqual(0).errorThreshold(10).description(\"Account can have negative balance if overdraft\"),\nvalidation.expr(\"CASE WHEN status == 'closed' THEN isNotNull(close_date) ELSE isNull(close_date) END\"),\nvalidation.unique(\"account_id\", \"name\"),\nvalidation.groupBy(\"account_id\", \"name\").max(\"login_retry\").lessThan(10)\n)\n</code></pre>"},{"location":"setup/guide/scenario/first-data-generation/#name_1","title":"name","text":"<p>For all values in the <code>name</code> column, we check if they match the regex <code>[A-Z][a-z]+ [A-Z][a-z]+</code>. As we know in the real world, names do not always follow the same pattern, so we allow for an <code>errorThreshold</code> before marking the validation as failed. 
Here, we define the <code>errorThreshold</code> to be <code>0.2</code>, which means the validation fails if the error percentage is greater than 20%. We also add a helpful description so other developers/users can understand the context of the validation.</p>"},{"location":"setup/guide/scenario/first-data-generation/#balance_1","title":"balance","text":"<p>We check that all <code>balance</code> values are greater than or equal to 0. This time, we have a slightly different <code>errorThreshold</code> as it is set to <code>10</code>, which means the validation fails if the number of errors is greater than 10.</p>"},{"location":"setup/guide/scenario/first-data-generation/#expr","title":"expr","text":"<p>Sometimes, we may need to include the values of multiple columns to validate a certain condition. This is where we can use <code>expr</code> to define a SQL expression that returns a boolean. In this scenario, we check that when the <code>status</code> column has value <code>closed</code>, <code>close_date</code> is not null; otherwise, <code>close_date</code> should be null.</p>"},{"location":"setup/guide/scenario/first-data-generation/#unique","title":"unique","text":"<p>We check whether the combination of <code>account_id</code> and <code>name</code> is unique within the dataset. You can define one or more columns for <code>unique</code> validations.</p>"},{"location":"setup/guide/scenario/first-data-generation/#groupby","title":"groupBy","text":"<p>There may be some business rule that states the number of <code>login_retry</code> should be less than 10 for each account. We can check this via a group by validation where we group by the <code>account_id, name</code>, take the maximum value for <code>login_retry</code> per <code>account_id, name</code> combination, then check if it is less than 10.</p> <p>You can now look to play around with other configurations or data sources to meet your needs. 
Also, make sure to explore the docs further as it can guide you on what can be configured.</p>"},{"location":"setup/guide/scenario/records-per-column/","title":"Multiple Records Per Column","text":"<p>Creating a data generator for a CSV file where there are multiple records per column values.</p>"},{"location":"setup/guide/scenario/records-per-column/#requirements","title":"Requirements","text":"<ul> <li>5 minutes</li> <li>Git</li> <li>Gradle</li> <li>Docker</li> </ul>"},{"location":"setup/guide/scenario/records-per-column/#get-started","title":"Get Started","text":"<p>First, we will clone the data-caterer-example repo which will already have the base project setup required.</p> <pre><code>git clone git@github.com:pflooky/data-caterer-example.git\n</code></pre>"},{"location":"setup/guide/scenario/records-per-column/#plan-setup","title":"Plan Setup","text":"<p>Create a new Java or Scala class.</p> <ul> <li>Java: <code>src/main/java/com/github/pflooky/plan/MyMultipleRecordsPerColJavaPlan.java</code></li> <li>Scala: <code>src/main/scala/com/github/pflooky/plan/MyMultipleRecordsPerColPlan.scala</code></li> </ul> <p>Make sure your class extends <code>PlanRun</code>.</p> JavaScala <pre><code>import com.github.pflooky.datacaterer.java.api.PlanRun;\n...\n\npublic class MyMultipleRecordsPerColJavaPlan extends PlanRun {\n{\nvar transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\nfield().name(\"account_id\"),\nfield().name(\"full_name\"),\nfield().name(\"amount\").type(DoubleType.instance()).min(1).max(100),\nfield().name(\"time\").type(TimestampType.instance()).min(java.sql.Date.valueOf(\"2022-01-01\")),\nfield().name(\"date\").type(DateType.instance()).sql(\"DATE(time)\")\n);\n\nvar config = configuration()\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n.enableUniqueCheck(true);\n\nexecute(config, transactionTask);\n}\n}\n</code></pre> <pre><code>import 
com.github.pflooky.datacaterer.api.PlanRun\n...\n\nclass MyMultipleRecordsPerColPlan extends PlanRun {\n\nval transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -&gt; \"true\"))\n.schema(\nfield.name(\"account_id\").regex(\"ACC[0-9]{8}\"), field.name(\"full_name\").expression(\"#{Name.name}\"), field.name(\"amount\").`type`(DoubleType.instance).min(1).max(100),\nfield.name(\"time\").`type`(TimestampType.instance).min(java.sql.Date.valueOf(\"2022-01-01\")), field.name(\"date\").`type`(DateType.instance).sql(\"DATE(time)\")\n)\n\nval config = configuration\n.generatedReportsFolderPath(\"/opt/app/data/report\")\n\nexecute(config, transactionTask)\n}\n</code></pre>"},{"location":"setup/guide/scenario/records-per-column/#record-count","title":"Record Count","text":"<p>By default, tasks will generate 1000 records. You can alter this value via the <code>count</code> configuration which can be applied to individual tasks. For example, in Scala, <code>csv(...).count(count.records(100))</code> to generate only 100 records.</p>"},{"location":"setup/guide/scenario/records-per-column/#records-per-column","title":"Records Per Column","text":"<p>In this scenario, for a given <code>account_id, full_name</code>, there should be multiple records for it as we want to simulate a customer having multiple transactions. 
We can achieve this through defining the number of records to generate in the <code>count</code> function.</p> JavaScala <pre><code>var transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\n...\n)\n.count(count().recordsPerColumn(5, \"account_id\", \"full_name\"));\n</code></pre> <pre><code>val transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -&gt; \"true\"))\n.schema(\n...\n)\n.count(count.recordsPerColumn(5, \"account_id\", \"full_name\"))\n</code></pre> <p>This will generate <code>1000 * 5 = 5000</code> records as the default number of records is set (1000) and per <code>account_id, full_name</code> from the initial 1000 records, 5 records will be generated.</p>"},{"location":"setup/guide/scenario/records-per-column/#random-records-per-column","title":"Random Records Per Column","text":"<p>Generating 5 records per column is okay but still not quite reflective of the real world. Sometimes, people have accounts with no transactions in them, or they could have many. We can accommodate for this via defining a random number of records per column.</p> JavaScala <pre><code>var transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map.of(\"header\", \"true\"))\n.schema(\n...\n)\n.count(count().recordsPerColumnGenerator(generator().min(0).max(5), \"account_id\", \"full_name\"));\n</code></pre> <pre><code>val transactionTask = csv(\"customer_transactions\", \"/opt/app/data/customer/transaction\", Map(\"header\" -&gt; \"true\"))\n.schema(\n...\n)\n.count(count.recordsPerColumnGenerator(generator.min(0).max(5), \"account_id\", \"full_name\"))\n</code></pre> <p>Here we set the minimum number of records per column to be 0 and the maximum to 5. This will follow a uniform distribution so the average number of records per account is 2.5. 
We could also define other metadata, just like we did with fields, when defining the generator. For example, we could set <code>standardDeviation</code> and <code>mean</code> for the number of records generated per column to follow a normal distribution.</p>"},{"location":"setup/guide/scenario/records-per-column/#run","title":"Run","text":"<p>Let's try running it.</p> <pre><code>#clean up old data\nrm -rf docker/sample/customer/account\n./run.sh\n#input class MyMultipleRecordsPerColJavaPlan or MyMultipleRecordsPerColPlan\n#after completing\nhead docker/sample/customer/transaction/part-00000*\n</code></pre> <p>It should look something like this.</p> <pre><code>ACC29117767,Willodean Sauer\nACC29117767,Willodean Sauer,84.99145871948083,2023-05-14T09:55:51.439Z,2023-05-14\nACC29117767,Willodean Sauer,58.89345733567232,2022-11-22T07:38:20.143Z,2022-11-22\n</code></pre> <p>You can now look to play around with other count configurations found here.</p>"},{"location":"setup/validation/basic-validation/","title":"Basic Validations","text":"<p>Run validations on a column to ensure the values adhere to your requirements. You can also define more complex validation logic via a SQL expression if needed (see here).</p>"},{"location":"setup/validation/basic-validation/#equal","title":"Equal","text":"<p>Ensure all data in column is equal to certain value. Value can be of any data type.</p> JavaScalaYAML <pre><code>validation().col(\"year\").isEqual(2021)\n</code></pre> <pre><code>validation.col(\"year\").isEqual(2021)\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"year == 2021\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#not-equal","title":"Not Equal","text":"<p>Ensure all data in column is not equal to certain value. 
Value can be of any data type.</p> JavaScalaYAML <pre><code>validation().col(\"year\").isNotEqual(2021)\n</code></pre> <pre><code>validation.col(\"year\").isNotEqual(2021)\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"year != 2021\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#null","title":"Null","text":"<p>Ensure all data in column is null.</p> JavaScalaYAML <pre><code>validation().col(\"year\").isNull()\n</code></pre> <pre><code>validation.col(\"year\").isNull\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"ISNULL(year)\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#not-null","title":"Not Null","text":"<p>Ensure all data in column is not null.</p> JavaScalaYAML <pre><code>validation().col(\"year\").isNotNull()\n</code></pre> <pre><code>validation.col(\"year\").isNotNull\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"ISNOTNULL(year)\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#contains","title":"Contains","text":"<p>Ensure all data in column contains certain string. Column has to have type string.</p> JavaScalaYAML <pre><code>validation().col(\"name\").contains(\"peter\")\n</code></pre> <pre><code>validation.col(\"name\").contains(\"peter\")\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"CONTAINS(name, 'peter')\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#not-contains","title":"Not Contains","text":"<p>Ensure all data in column does not contain certain string. 
Column has to have type string.</p> JavaScalaYAML <pre><code>validation().col(\"name\").notContains(\"peter\")\n</code></pre> <pre><code>validation.col(\"name\").notContains(\"peter\")\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"!CONTAINS(name, 'peter')\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#unqiue","title":"Unique","text":"<p>Ensure all data in the given column(s) is unique.</p> JavaScalaYAML <pre><code>validation().unique(\"account_id\", \"name\")\n</code></pre> <pre><code>validation.unique(\"account_id\", \"name\")\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- unique: [\"account_id\", \"name\"]\n</code></pre>"},{"location":"setup/validation/basic-validation/#less-than","title":"Less Than","text":"<p>Ensure all data in column is less than certain value.</p> JavaScalaYAML <pre><code>validation().col(\"amount\").lessThan(100)\n</code></pre> <pre><code>validation.col(\"amount\").lessThan(100)\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"amount &lt; 100\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#less-than-or-equal","title":"Less Than Or Equal","text":"<p>Ensure all data in column is less than or equal to certain value.</p> JavaScalaYAML <pre><code>validation().col(\"amount\").lessThanOrEqual(100)\n</code></pre> <pre><code>validation.col(\"amount\").lessThanOrEqual(100)\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"amount &lt;= 100\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#greater-than","title":"Greater Than","text":"<p>Ensure all data in column is greater than certain value.</p> JavaScalaYAML <pre><code>validation().col(\"amount\").greaterThan(100)\n</code></pre> <pre><code>validation.col(\"amount\").greaterThan(100)\n</code></pre> <pre><code>---\nname: 
\"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"amount &gt; 100\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#greater-than-or-equal","title":"Greater Than Or Equal","text":"<p>Ensure all data in column is greater than or equal to certain value.</p> JavaScalaYAML <pre><code>validation().col(\"amount\").greaterThanOrEqual(100)\n</code></pre> <pre><code>validation.col(\"amount\").greaterThanOrEqual(100)\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"amount &gt;= 100\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#between","title":"Between","text":"<p>Ensure all data in column is between two values.</p> JavaScalaYAML <pre><code>validation().col(\"amount\").between(100, 200)\n</code></pre> <pre><code>validation.col(\"amount\").between(100, 200)\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"amount BETWEEN 100 AND 200\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#not-between","title":"Not Between","text":"<p>Ensure all data in column is not between two values.</p> JavaScalaYAML <pre><code>validation().col(\"amount\").notBetween(100, 200)\n</code></pre> <pre><code>validation.col(\"amount\").notBetween(100, 200)\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"amount NOT BETWEEN 100 AND 200\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#in","title":"In","text":"<p>Ensure all data in column is in set of defined values.</p> JavaScalaYAML <pre><code>validation().col(\"status\").in(\"open\", \"closed\")\n</code></pre> <pre><code>validation.col(\"status\").in(\"open\", \"closed\")\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"status IN ('open', 'closed')\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#matches","title":"Matches","text":"<p>Ensure all data in 
column matches certain regex.</p> JavaScalaYAML <pre><code>validation().col(\"account_id\").matches(\"ACC[0-9]{8}\")\n</code></pre> <pre><code>validation.col(\"account_id\").matches(\"ACC[0-9]{8}\")\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"REGEXP(account_id, 'ACC[0-9]{8}')\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#not-matches","title":"Not Matches","text":"<p>Ensure all data in column does not match certain regex.</p> JavaScalaYAML <pre><code>validation().col(\"account_id\").notMatches(\"^acc.*\")\n</code></pre> <pre><code>validation.col(\"account_id\").notMatches(\"^acc.*\")\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"!REGEXP(account_id, '^acc.*')\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#starts-with","title":"Starts With","text":"<p>Ensure all data in column starts with certain string. Column has to have type string.</p> JavaScalaYAML <pre><code>validation().col(\"account_id\").startsWith(\"ACC\")\n</code></pre> <pre><code>validation.col(\"account_id\").startsWith(\"ACC\")\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"STARTSWITH(account_id, 'ACC')\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#not-starts-with","title":"Not Starts With","text":"<p>Ensure all data in column does not start with certain string. Column has to have type string.</p> JavaScalaYAML <pre><code>validation().col(\"account_id\").notStartsWith(\"ACC\")\n</code></pre> <pre><code>validation.col(\"account_id\").notStartsWith(\"ACC\")\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"!STARTSWITH(account_id, 'ACC')\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#ends-with","title":"Ends With","text":"<p>Ensure all data in column ends with certain string. 
Column has to have type string.</p> JavaScalaYAML <pre><code>validation().col(\"account_id\").endsWith(\"ACC\")\n</code></pre> <pre><code>validation.col(\"account_id\").endsWith(\"ACC\")\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"ENDSWITH(account_id, 'ACC')\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#not-ends-with","title":"Not Ends With","text":"<p>Ensure all data in column does not end with certain string. Column has to have type string.</p> JavaScalaYAML <pre><code>validation().col(\"account_id\").notEndsWith(\"ACC\")\n</code></pre> <pre><code>validation.col(\"account_id\").notEndsWith(\"ACC\")\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"!ENDSWITH(account_id, 'ACC')\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#size","title":"Size","text":"<p>Ensure all data in column has certain size. Column has to have type array or map.</p> JavaScalaYAML <pre><code>validation().col(\"transactions\").size(5)\n</code></pre> <pre><code>validation.col(\"transactions\").size(5)\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"SIZE(transactions) == 5\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#not-size","title":"Not Size","text":"<p>Ensure all data in column does not have certain size. Column has to have type array or map.</p> JavaScalaYAML <pre><code>validation().col(\"transactions\").notSize(5)\n</code></pre> <pre><code>validation.col(\"transactions\").notSize(5)\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"SIZE(transactions) != 5\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#less-than-size","title":"Less Than Size","text":"<p>Ensure all data in column has size less than certain value. 
Column has to have type array or map.</p> JavaScalaYAML <pre><code>validation().col(\"transactions\").lessThanSize(5)\n</code></pre> <pre><code>validation.col(\"transactions\").lessThanSize(5)\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"SIZE(transactions) &lt; 5\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#less-than-or-equal-size","title":"Less Than Or Equal Size","text":"<p>Ensure all data in column has size less than or equal to certain value. Column has to have type array or map.</p> JavaScalaYAML <pre><code>validation().col(\"transactions\").lessThanOrEqualSize(5)\n</code></pre> <pre><code>validation.col(\"transactions\").lessThanOrEqualSize(5)\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"SIZE(transactions) &lt;= 5\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#greater-than-size","title":"Greater Than Size","text":"<p>Ensure all data in column has size greater than certain value. Column has to have type array or map.</p> JavaScalaYAML <pre><code>validation().col(\"transactions\").greaterThanSize(5)\n</code></pre> <pre><code>validation.col(\"transactions\").greaterThanSize(5)\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"SIZE(transactions) &gt; 5\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#greater-than-or-equal-size","title":"Greater Than Or Equal Size","text":"<p>Ensure all data in column has size greater than or equal to certain value. 
Column has to have type array or map.</p> JavaScalaYAML <pre><code>validation().col(\"transactions\").greaterThanOrEqualSize(5)\n</code></pre> <pre><code>validation.col(\"transactions\").greaterThanOrEqualSize(5)\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"SIZE(transactions) &gt;= 5\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#luhn-check","title":"Luhn Check","text":"<p>Ensure all data in column passes luhn check. Luhn check is used to validate credit card numbers and certain identification numbers (see here for more details).</p> JavaScalaYAML <pre><code>validation().col(\"credit_card\").luhnCheck()\n</code></pre> <pre><code>validation.col(\"credit_card\").luhnCheck\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"LUHN_CHECK(credit_card)\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#has-type","title":"Has Type","text":"<p>Ensure all data in column has certain data type.</p> JavaScalaYAML <pre><code>validation().col(\"id\").hasType(\"string\")\n</code></pre> <pre><code>validation.col(\"id\").hasType(\"string\")\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\n...\nvalidations:\n- expr: \"TYPEOF(id) == 'string'\"\n</code></pre>"},{"location":"setup/validation/basic-validation/#expression","title":"Expression","text":"<p>Ensure all data in column adheres to SQL expression defined that returns back a boolean. 
You can define complex logic in here that could combine multiple columns.</p> <p>For example, <code>CASE WHEN status == 'open' THEN balance &gt; 0 ELSE balance == 0 END</code> would check all rows with <code>status</code> open to have <code>balance</code> greater than 0, otherwise, check the <code>balance</code> is 0.</p> JavaScalaYAML <pre><code>var csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validations(\nvalidation().expr(\"amount &lt; 100\"),\nvalidation().expr(\"year == 2021\").errorThreshold(0.1),  //equivalent to if error percentage is &gt; 10%, then fail\nvalidation().expr(\"REGEXP_LIKE(name, 'Peter .*')\").errorThreshold(200)  //equivalent to if number of errors is &gt; 200, then fail\n);\n\nvar conf = configuration().enableValidation(true);\n</code></pre> <pre><code>val csvTxns = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -&gt; \"true\"))\n.validations(\nvalidation.expr(\"amount &lt; 100\"),\nvalidation.expr(\"year == 2021\").errorThreshold(0.1),  //equivalent to if error percentage is &gt; 10%, then fail\nvalidation.expr(\"REGEXP_LIKE(name, 'Peter .*')\").errorThreshold(200)  //equivalent to if number of errors is &gt; 200, then fail\n)\n\nval conf = configuration.enableValidation(true)\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nvalidations:\n- expr: \"amount &lt; 100\"\n- expr: \"year == 2021\"\nerrorThreshold: 0.1   #equivalent to if error percentage is &gt; 10%, then fail\n- expr: \"REGEXP_LIKE(name, 'Peter .*')\"\nerrorThreshold: 200   #equivalent to if number of errors is &gt; 200, then fail\ndescription: \"Should be lots of Peters\"\n\n#enableValidation inside application.conf\n</code></pre>"},{"location":"setup/validation/group-by-validation/","title":"Group By Validation","text":"<p>If you want to run aggregations based on a particular set of columns or just the whole dataset, you can do so via group by validations. 
An example would be checking that the sum of <code>amount</code> is less than 1000 per <code>account_id, year</code>. The validations applied can be any of the validations from the basic validation set found here.</p>"},{"location":"setup/validation/group-by-validation/#record-count","title":"Record count","text":"<p>Check the number of records across the whole dataset.</p> JavaScala <pre><code>validation().groupBy().count().lessThan(1000)\n</code></pre> <pre><code>validation.groupBy().count().lessThan(1000)\n</code></pre>"},{"location":"setup/validation/group-by-validation/#record-count-per-group","title":"Record count per group","text":"<p>Check the number of records for each group.</p> JavaScala <pre><code>validation().groupBy(\"account_id\", \"year\").count().lessThan(10)\n</code></pre> <pre><code>validation.groupBy(\"account_id\", \"year\").count().lessThan(10)\n</code></pre>"},{"location":"setup/validation/group-by-validation/#sum","title":"Sum","text":"<p>Check that the sum of a column's values for each group adheres to the validation.</p> JavaScala <pre><code>validation().groupBy(\"account_id\", \"year\").sum(\"amount\").lessThan(1000)\n</code></pre> <pre><code>validation.groupBy(\"account_id\", \"year\").sum(\"amount\").lessThan(1000)\n</code></pre>"},{"location":"setup/validation/group-by-validation/#count","title":"Count","text":"<p>Check that the count for each group adheres to the validation.</p> JavaScala <pre><code>validation().groupBy(\"account_id\", \"year\").count(\"amount\").lessThan(10)\n</code></pre> <pre><code>validation.groupBy(\"account_id\", \"year\").count(\"amount\").lessThan(10)\n</code></pre>"},{"location":"setup/validation/group-by-validation/#min","title":"Min","text":"<p>Check that the min for each group adheres to the validation.</p> JavaScala <pre><code>validation().groupBy(\"account_id\", \"year\").min(\"amount\").greaterThan(0)\n</code></pre> <pre><code>validation.groupBy(\"account_id\", 
\"year\").min(\"amount\").greaterThan(0)\n</code></pre>"},{"location":"setup/validation/group-by-validation/#max","title":"Max","text":"<p>Check the max for each group adheres to validation.</p> JavaScala <pre><code>validation().groupBy(\"account_id\", \"year\").max(\"amount\").lessThanOrEqual(100)\n</code></pre> <pre><code>validation.groupBy(\"account_id\", \"year\").max(\"amount\").lessThanOrEqual(100)\n</code></pre>"},{"location":"setup/validation/group-by-validation/#average","title":"Average","text":"<p>Check the average for each group adheres to validation.</p> JavaScala <pre><code>validation().groupBy(\"account_id\", \"year\").avg(\"amount\").between(40, 60)\n</code></pre> <pre><code>validation.groupBy(\"account_id\", \"year\").avg(\"amount\").between(40, 60)\n</code></pre>"},{"location":"setup/validation/group-by-validation/#standard-deviation","title":"Standard deviation","text":"<p>Check the standard deviation for each group adheres to validation.</p> JavaScala <pre><code>validation().groupBy(\"account_id\", \"year\").stddev(\"amount\").between(0.5, 0.6)\n</code></pre> <pre><code>validation.groupBy(\"account_id\", \"year\").stddev(\"amount\").between(0.5, 0.6)\n</code></pre>"},{"location":"setup/validation/validation/","title":"Validations","text":"<p>Validations can be used to run data checks after you have run the data generator or even as a standalone task. 
A report summarising the success or failure of the validations is produced and can be examined for further investigation.</p> <ul> <li>Basic - Basic column level validations</li> <li>Group by/Aggregate - Run aggregates over grouped data, then validate</li> <li>[Relationship (Coming soon)] - Ensure record values exist in other datasets based on relationships</li> <li>[Data Profile (Coming soon)] - Score how close the data profile of generated data is against the target data profile</li> </ul> <p>Currently, SQL expression validations are supported (see here for a reference of which other expressions are valid), but support will later be extended to other validations such as aggregates (group by account_number, sum of amounts should be greater than 100), ordering (transaction dates should be in descending order), relationships (at least one transaction per account_number) or data profiling (how close the produced data profile is to the expected data profile).</p>"},{"location":"setup/validation/validation/#define-validations","title":"Define Validations","text":"<p>A full validation example can be found below. 
For more details, check out each of the subsections defined further below.</p> JavaScalaYAML <pre><code>var csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validations(\nvalidation().col(\"amount\").lessThan(100),\nvalidation().col(\"year\").isEqual(2021).errorThreshold(0.1),  //equivalent to if error percentage is &gt; 10%, then fail\nvalidation().col(\"name\").matches(\"Peter .*\").errorThreshold(200)  //equivalent to if number of errors is &gt; 200, then fail\n)\n.validationWait(waitCondition().pause(1));\n\nvar conf = configuration().enableValidation(true);\n</code></pre> <pre><code>val csvTxns = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -&gt; \"true\"))\n.validations(\nvalidation.col(\"amount\").lessThan(100),\nvalidation.col(\"year\").isEqual(2021).errorThreshold(0.1),  //equivalent to if error percentage is &gt; 10%, then fail\nvalidation.col(\"name\").matches(\"Peter .*\").errorThreshold(200)  //equivalent to if number of errors is &gt; 200, then fail\n)  .validationWait(waitCondition.pause(1))\n\nval conf = configuration.enableValidation(true)\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nvalidations:\n- expr: \"amount &lt; 100\"\n- expr: \"year == 2021\"\nerrorThreshold: 0.1   #equivalent to if error percentage is &gt; 10%, then fail\n- expr: \"REGEXP_LIKE(name, 'Peter .*')\"\nerrorThreshold: 200   #equivalent to if number of errors is &gt; 200, then fail\ndescription: \"Should be lots of Peters\"\nwaitCondition:\npauseInSeconds: 1\n</code></pre>"},{"location":"setup/validation/validation/#wait-condition","title":"Wait Condition","text":"<p>Once data has been generated, you may want to wait for a certain condition to be met before starting the data validations. 
This can be via:</p> <ul> <li>Pause for seconds</li> <li>When file is available</li> <li>Data exists</li> <li>Webhook</li> </ul>"},{"location":"setup/validation/validation/#pause","title":"Pause","text":"JavaScalaYAML <pre><code>var csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validationWait(waitCondition().pause(1));\n</code></pre> <pre><code>val csvTxns = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -&gt; \"true\"))\n.validationWait(waitCondition.pause(1))\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nwaitCondition:\npauseInSeconds: 1\n</code></pre>"},{"location":"setup/validation/validation/#data-exists","title":"Data exists","text":"JavaScalaYAML <pre><code>var csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validationWaitDataExists(\"updated_date &gt; DATE('2023-01-01')\");\n</code></pre> <pre><code>val csvTxns = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -&gt; \"true\"))\n.validationWaitDataExists(\"updated_date &gt; DATE('2023-01-01')\")\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nwaitCondition:\ndataSourceName: \"transactions\"\noptions:\npath: \"/tmp/csv\"\nexpr: \"updated_date &gt; DATE('2023-01-01')\"\n</code></pre>"},{"location":"setup/validation/validation/#webhook","title":"Webhook","text":"JavaScalaYAML <pre><code>var csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validationWait(waitCondition().webhook(\"http://localhost:8080/finished\")); //by default, GET request successful when 200 status code\n\n//or\n\nvar csvTxnsWithStatusCodes = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validationWait(waitCondition().webhook(\"http://localhost:8080/finished\", \"GET\", 200, 202));  //successful if 200 or 202 status code\n\n//or\n\nvar csvTxnsWithExistingHttpConnection = 
csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validationWait(waitCondition().webhook(\"my_http\", \"http://localhost:8080/finished\"));  //use connection configuration from existing 'my_http' connection definition\n</code></pre> <pre><code>val csvTxns = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -&gt; \"true\"))\n.validationWait(waitCondition.webhook(\"http://localhost:8080/finished\"))  //by default, GET request successful when 200 status code\n\n//or\n\nval csvTxnsWithStatusCodes = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -&gt; \"true\"))\n.validationWait(waitCondition.webhook(\"http://localhost:8080/finished\", \"GET\", 200, 202)) //successful if 200 or 202 status code\n\n//or\n\nval csvTxnsWithExistingHttpConnection = csv(\"transactions\", \"/tmp/csv\", Map(\"header\" -&gt; \"true\"))\n.validationWait(waitCondition.webhook(\"my_http\", \"http://localhost:8080/finished\")) //use connection configuration from existing 'my_http' connection definition\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nwaitCondition:\nurl: \"http://localhost:8080/finished\" #by default, GET request successful when 200 status code\n\n#or\n\n---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nwaitCondition:\nurl: \"http://localhost:8080/finished\"\nmethod: \"GET\"\nstatusCodes: [200, 202] #successful if 200 or 202 status code\n\n#or\n\n---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nwaitCondition:\ndataSourceName: \"my_http\" #use connection configuration from existing 'my_http' connection definition\nurl: \"http://localhost:8080/finished\"\n</code></pre>"},{"location":"setup/validation/validation/#file-available","title":"File available","text":"JavaScalaYAML <pre><code>var csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", 
\"true\"))\n.validationWait(waitCondition().file(\"/tmp/json\"));\n</code></pre> <pre><code>val csvTxns = csv(\"transactions\", \"/tmp/csv\", Map.of(\"header\", \"true\"))\n.validationWait(waitCondition.file(\"/tmp/json\"))\n</code></pre> <pre><code>---\nname: \"account_checks\"\ndataSources:\ntransactions:\noptions:\npath: \"/tmp/csv\"\nwaitCondition:\npath: \"/tmp/json\"\n</code></pre>"},{"location":"setup/validation/validation/#report","title":"Report","text":"<p>Once run, it will produce a report like this.</p>"},{"location":"use-case/business-value/","title":"Business Value","text":"<p>Below is a list of the business related benefits from using Data Caterer which may be applicable for your use case.</p> Problem Data Caterer Solution Resources Effects Reliable test data creation - Profile existing data- Create scenarios- Generate data Software Engineers, QA, Testers Cost reduction in labor, more time spent on development, more bugs caught before production Faster development cycles - Generate data in local, test, UAT, pre-prod- Run different scenarios Software Engineers, QA, Testers More defects caught in lower environments, features pushed to production faster, common framework used across all environments Data compliance - Profiling existing data- Generate based on metadata- No complex masking- No production data used in lower environments Audit and compliance No chance for production data breaches Storage costs - Delete generated data- Test specific scenarios Infrastructure Lower data storage costs, less time spent on data management and clean up Schema evolution - Create metadata from data sources- Generate data based off fresh metadata Software Engineers, QA, Testers Less time spent altering tests due to schema changes, ease of use between environments and application versions"},{"location":"use-case/comparison/","title":"Comparison to similar tools","text":"<p>I have tried to include all the companies found in the list here from Mostly AI blog post and 
used information that is publicly available.</p> <p>The companies/products not shown below either have:</p> <ul> <li>a website with insufficient information about the technology side of data generation/validation</li> <li>no/little documentation</li> <li>don't have a free, no sign-up version of their app to use</li> </ul>"},{"location":"use-case/comparison/#data-generation","title":"Data Generation","text":"Tool Description Cost Pros Cons Clearbox AI Python based data generation tool via ML Unclear  Python SDK UI interface Detect private data Report generation  Batch data only No data clean up Limited/no documentation Curiosity Software Platform solution for test data management Unclear  Extensive documentation Generate data based off test cases UI interface Web/API/UI/mobile testing  No quick start No SDK Many components that may not be required No event generation support DataCebo Synthetic Data Vault Python based data generation tool via ML Unclear  Python SDK Report generation Data quality checks Business logic constraints  No data connection support No data clean up No foreign key support Datafaker Realistic data generation library Free  SDK for many languages Simple, easy to use Extensible Open source Generate realistic values  No data connection support No data clean up No validation No foreign key support DBLDatagen Python based data generation tool Free  Python SDK Open source Good documentation Customisable scenarios Customisable column generation Generate from existing data/schemas Plugin third-party libraries  Limited support if issues Code required No data clean up No data validation Gatling HTTP API load testing tool Free (Open Source)Gatling Enterprise, usage based, starts from \u20ac89 per month, 1 user, 6.25 hours of testing  Kotlin, Java &amp; Scala SDK Widely used Open source Clear documentation Extensive testing/validation support Customisable scenarios Report generation  Only supports HTTP, JMS and JDBC No data clean up Data feeders not based 
off metadata Gretel Python based data generation tool via ML Usage based, starts from $295 per month, $2.20 per credit, assumed USD  CLI &amp; Python SDK UI interface Training and re-use of models Detect private data Customisable scenarios  Batch data only No relationships between data sources Only simple foreign key relations defined No data clean up Charge by usage Howso Python based data generation tool via ML Unclear  Python SDK Playground to try Open source library Customisable scenarios  No support for data sources No data validation No data clean up Mostly AI Python based data generation tool via ML Usage based, Enterprise 1 user, 100 columns, 100K rows $3,100 per month, assumed USD  Report generation Non-technical users can use UI Customisable scenarios  Charge by usage Batch data only No data clean up Confusing use of 'smart select' for multiple foreign keys Limited custom column generation logic Multiple deployment components No SDK Octopize Python based data generation tool via ML Unclear  Python &amp; R SDK Report generation API for metadata Customisable scenarios  Input data source is only CSV Multiple manual steps before starting Quickstart is not a quickstart Documentation lacks code examples Synthesized Python based data generation tool via ML Unclear  CLI &amp; Python SDK API for metadata IDE setup Data quality checks  Not sure what is SDK &amp; TDK Charge by usage No report of what was generated No relationships between data sources Tonic Platform solution for generating data Unclear  UI interface Good documentation Detect private data Support for encrypted columns Report generation Alerting  Batch data only Multiple deployment components No relationships between data sources No data validation No data clean up No SDK (only API) Difficult to embed complex business logic YData Python based data generation tool via ML. 
Platform solution as well Unclear  Python SDK Open source Detect private data Compare datasets Report generation  No data connection support Batch data only No data clean up Separate data generation and data validation No foreign key support"},{"location":"use-case/comparison/#use-of-ml-models","title":"Use of ML models","text":"<p>You may notice that the majority of data generators use machine learning (ML) models to learn from your existing datasets to generate new data. Below are some pros and cons to the approach.</p> <p>Pros</p> <ul> <li>Simple setup</li> <li>Ability to reproduce complex logic</li> <li>Flexible to accept all types of data</li> </ul> <p>Cons</p> <ul> <li>Long time for model learning</li> <li>Black box of logic</li> <li>Maintain, store and update of ML models</li> <li>Restriction on input data lengths</li> <li>May not maintain referential integrity</li> <li>Require deeper understanding of ML models for fine-tuning</li> <li>Accuracy may be worse than non-ML models</li> </ul>"},{"location":"use-case/roadmap/","title":"Roadmap","text":"<ul> <li>Support for other data sources<ul> <li>GCP and Azure related data services ( cloud storage)</li> <li>Deltalake</li> <li>RabbitMQ</li> <li>ActiveMQ</li> <li>MongoDB</li> <li>Airflow</li> <li>DBT</li> </ul> </li> <li>Further support for metadata discovery<ul> <li> HTTP (OpenAPI spec)</li> <li>JMS</li> <li>Read from samples</li> </ul> </li> <li> API for developers and testers<ul> <li> Scala</li> <li> Java</li> </ul> </li> <li>UI for metadata and data generation</li> <li> Report for data generated and validation rules</li> <li>Metadata stored in database</li> <li>Integration with existing metadata services (i.e. 
Amundsen, Datahub, Schema Registry, DBT)<ul> <li>Populate metadata back to metadata services</li> <li> OpenLineage metadata (Marquez)</li> <li> OpenMetadata</li> </ul> </li> <li>Integration with existing data validations<ul> <li>Great Expectations</li> <li>DBT constraints</li> <li>SodaCL</li> <li>Monte Carlo</li> </ul> </li> <li> Suggest data validations</li> <li>Data dictionary<ul> <li>Business definitions of fields that can be referenced for metadata across all data sources</li> </ul> </li> <li> Verification rules after data generation</li> <li> Validation waiting conditions<ul> <li> Webhook</li> <li> File exists</li> <li> Data exists via SQL expression</li> <li> Pause</li> </ul> </li> <li>Extend validation types<ul> <li> Aggregates (sum of amount per account is &gt; 500)</li> <li>Ordering (transactions are ordered by date)</li> <li>Relationship (at least one account entry in history table per account in accounts table)</li> <li>Data profile (how close the generated data profile is compared to the expected data profile)</li> </ul> </li> <li>Extend count<ul> <li>Cover all possible cases (i.e. 
record for each combination of oneOf values, positive/negative values etc.)</li> <li>Similar to edge cases</li> </ul> </li> <li>Alerting<ul> <li>Slack</li> <li>Email</li> </ul> </li> <li>Overriding tasks<ul> <li>Can customise tasks without copying whole schema definitions</li> <li>Easier to create scenarios</li> </ul> </li> <li>Gradle plugin</li> <li>Metadata improvements<ul> <li>PII detection (can integrate with Presidio)</li> <li>Relationship detection across data sources</li> <li>SQL generation</li> <li>Ordering information</li> </ul> </li> <li>Code generation</li> <li>Schema generation from Scala/Java class</li> <li>Ordering within data sources that support order for insertion</li> <li>Clean up data in consumer data sinks</li> <li> Trial app to try out all features</li> <li>HTTP response data validation</li> </ul>"},{"location":"use-case/use-case/","title":"Use cases","text":""},{"location":"use-case/use-case/#replicate-production-in-lower-environment","title":"Replicate production in lower environment","text":"<p>Having a stable and reliable test environment is a challenge for a number of companies, especially where teams are asynchronously deploying and testing changes at faster rates. Data Caterer can help alleviate these issues by doing the following:</p> <ol> <li>Generates data with the latest schema changes and production like field values</li> <li>Run as a job on a daily/regular basis to replicate production traffic or data flows</li> <li>Validate data to ensure your system runs as expected</li> <li>Clean up data to avoid build up of generated data</li> </ol> <p></p>"},{"location":"use-case/use-case/#local-development","title":"Local development","text":"<p>Similar to the above, being able to replicate production like data in your local environment can be key to developing more reliable code as you can test directly against data in your local computer. 
This has a number of benefits including:</p> <ol> <li>Fewer assumptions or ambiguities when the developer codes</li> <li>Direct feedback loop in local computer rather than waiting for test environment for more reliable test data</li> <li>No domain expertise required to understand the data</li> <li>Easy for new developers to be onboarded and developing/testing code for jobs/services</li> </ol>"},{"location":"use-case/use-case/#systemintegration-testing","title":"System/integration testing","text":"<p>When working with third-party, external or internal data providers, it can be difficult to have all setup ready to produce reliable data that abides by relationship contracts between each of the systems. You have to rely on these data providers in order for you to run your tests which may not align to their priorities. With Data Caterer, you can generate the same data that they would produce, along with maintaining referential integrity across the data providers, so that you can run your tests without relying on their systems being up and reliable in their corresponding lower environments.</p>"},{"location":"use-case/use-case/#scenario-testing","title":"Scenario testing","text":"<p>If you want to set up particular data scenarios, you can customise the generated data to fit your scenario. Once the data gets generated and is consumed, you can also run validations to ensure your system has consumed the data correctly. These scenarios can be put together from existing tasks or data sources can be enabled/disabled based on your requirement. 
Built into Data Caterer, and controlled via feature flags, is the ability to test edge cases based on the data type of the fields used for data generation (<code>enableEdgeCases</code> flag within <code>&lt;field&gt;.generator.options</code>, see more here).</p>"},{"location":"use-case/use-case/#data-debugging","title":"Data debugging","text":"<p>When data related issues occur in production, they may be difficult to replicate in a lower or local environment. The issue could relate to specific fields not containing expected values, the data being too large, or corresponding referenced data being missing. Replicating the data becomes key to resolving the issue, as you can code directly against the exact data scenario and have confidence that your code changes will fix the problem. Data Caterer can be used to generate the appropriate data in whichever environment you want to test your changes against.</p>"},{"location":"use-case/use-case/#data-profiling","title":"Data profiling","text":"<p>When using Data Caterer with the feature flag <code>enableGeneratePlanAndTasks</code> enabled (see here), metadata relating to all the fields defined in the data sources you have configured will be generated via data profiling. You can run this as a standalone job (you can disable <code>enableGenerateData</code>) so that you can focus on the profile of the data you are utilising. This can be run against your production data sources to ensure the metadata can be used to accurately generate data in other environments. This is a key feature of Data Caterer, as no direct production connections need to be maintained to generate data in other environments (such connections can lead to serious concerns about data security, as seen here).</p>"},{"location":"use-case/use-case/#schema-gathering","title":"Schema gathering","text":"<p>When using Data Caterer with the feature flag <code>enableGeneratePlanAndTasks</code> enabled (see here), all schemas of the data sources defined will be tracked in a common format (as tasks). 
This data, along with the data profiling metadata, could then feed back into your schema registries to help keep them up to date with your system.</p>"}]}
\ No newline at end of file
diff --git a/site/setup/advanced/advanced/index.html b/site/setup/advanced/advanced/index.html
index 8e3f1311..a040df1f 100644
--- a/site/setup/advanced/advanced/index.html
+++ b/site/setup/advanced/advanced/index.html
@@ -674,6 +674,58 @@
   
   
   
+    <li class="md-nav__item">
+      <a href="../../guide/scenario/batch-and-event/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Foreign Keys Across Data Sources
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../guide/scenario/data-validation/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Data Validations
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
     <li class="md-nav__item">
       <a href="../../guide/scenario/auto-generate-connection/" class="md-nav__link">
         
diff --git a/site/setup/configuration/index.html b/site/setup/configuration/index.html
index f70f67fe..08058c6e 100644
--- a/site/setup/configuration/index.html
+++ b/site/setup/configuration/index.html
@@ -674,6 +674,58 @@
   
   
   
+    <li class="md-nav__item">
+      <a href="../guide/scenario/batch-and-event/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Foreign Keys Across Data Sources
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../guide/scenario/data-validation/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Data Validations
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
     <li class="md-nav__item">
       <a href="../guide/scenario/auto-generate-connection/" class="md-nav__link">
         
diff --git a/site/setup/connection/connection/index.html b/site/setup/connection/connection/index.html
index d25b5e38..6b82bb89 100644
--- a/site/setup/connection/connection/index.html
+++ b/site/setup/connection/connection/index.html
@@ -674,6 +674,58 @@
   
   
   
+    <li class="md-nav__item">
+      <a href="../../guide/scenario/batch-and-event/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Foreign Keys Across Data Sources
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../guide/scenario/data-validation/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Data Validations
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
     <li class="md-nav__item">
       <a href="../../guide/scenario/auto-generate-connection/" class="md-nav__link">
         
diff --git a/site/setup/deployment/deployment/index.html b/site/setup/deployment/deployment/index.html
index c525f968..cf886d89 100644
--- a/site/setup/deployment/deployment/index.html
+++ b/site/setup/deployment/deployment/index.html
@@ -674,6 +674,58 @@
   
   
   
+    <li class="md-nav__item">
+      <a href="../../guide/scenario/batch-and-event/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Foreign Keys Across Data Sources
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../guide/scenario/data-validation/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Data Validations
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
     <li class="md-nav__item">
       <a href="../../guide/scenario/auto-generate-connection/" class="md-nav__link">
         
diff --git a/site/setup/foreign-key/foreign-key/index.html b/site/setup/foreign-key/foreign-key/index.html
index ecce0de5..213bf633 100644
--- a/site/setup/foreign-key/foreign-key/index.html
+++ b/site/setup/foreign-key/foreign-key/index.html
@@ -674,6 +674,58 @@
   
   
   
+    <li class="md-nav__item">
+      <a href="../../guide/scenario/batch-and-event/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Foreign Keys Across Data Sources
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../guide/scenario/data-validation/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Data Validations
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
     <li class="md-nav__item">
       <a href="../../guide/scenario/auto-generate-connection/" class="md-nav__link">
         
diff --git a/site/setup/generator/count/index.html b/site/setup/generator/count/index.html
index f886ee56..f3704b12 100644
--- a/site/setup/generator/count/index.html
+++ b/site/setup/generator/count/index.html
@@ -674,6 +674,58 @@
   
   
   
+    <li class="md-nav__item">
+      <a href="../../guide/scenario/batch-and-event/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Foreign Keys Across Data Sources
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../guide/scenario/data-validation/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Data Validations
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
     <li class="md-nav__item">
       <a href="../../guide/scenario/auto-generate-connection/" class="md-nav__link">
         
diff --git a/site/setup/generator/generator/index.html b/site/setup/generator/generator/index.html
index 9150dd6b..ff7c63e9 100644
--- a/site/setup/generator/generator/index.html
+++ b/site/setup/generator/generator/index.html
@@ -674,6 +674,58 @@
   
   
   
+    <li class="md-nav__item">
+      <a href="../../guide/scenario/batch-and-event/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Foreign Keys Across Data Sources
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../guide/scenario/data-validation/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Data Validations
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
     <li class="md-nav__item">
       <a href="../../guide/scenario/auto-generate-connection/" class="md-nav__link">
         
diff --git a/site/setup/generator/report/index.html b/site/setup/generator/report/index.html
index 25826690..04c6d269 100644
--- a/site/setup/generator/report/index.html
+++ b/site/setup/generator/report/index.html
@@ -674,6 +674,58 @@
   
   
   
+    <li class="md-nav__item">
+      <a href="../../guide/scenario/batch-and-event/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Foreign Keys Across Data Sources
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../guide/scenario/data-validation/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Data Validations
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
     <li class="md-nav__item">
       <a href="../../guide/scenario/auto-generate-connection/" class="md-nav__link">
         
diff --git a/site/setup/guide/data-source/cassandra/index.html b/site/setup/guide/data-source/cassandra/index.html
index 869d9102..3439f153 100644
--- a/site/setup/guide/data-source/cassandra/index.html
+++ b/site/setup/guide/data-source/cassandra/index.html
@@ -676,6 +676,58 @@
   
   
   
+    <li class="md-nav__item">
+      <a href="../../scenario/batch-and-event/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Foreign Keys Across Data Sources
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../scenario/data-validation/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Data Validations
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
     <li class="md-nav__item">
       <a href="../../scenario/auto-generate-connection/" class="md-nav__link">
         
diff --git a/site/setup/guide/data-source/http/index.html b/site/setup/guide/data-source/http/index.html
index 2edda07f..f573a4ec 100644
--- a/site/setup/guide/data-source/http/index.html
+++ b/site/setup/guide/data-source/http/index.html
@@ -678,6 +678,58 @@
   
   
   
+    <li class="md-nav__item">
+      <a href="../../scenario/batch-and-event/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Foreign Keys Across Data Sources
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../scenario/data-validation/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Data Validations
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
     <li class="md-nav__item">
       <a href="../../scenario/auto-generate-connection/" class="md-nav__link">
         
diff --git a/site/setup/guide/data-source/kafka/index.html b/site/setup/guide/data-source/kafka/index.html
index 1341ac65..0428497e 100644
--- a/site/setup/guide/data-source/kafka/index.html
+++ b/site/setup/guide/data-source/kafka/index.html
@@ -676,6 +676,58 @@
   
   
   
+    <li class="md-nav__item">
+      <a href="../../scenario/batch-and-event/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Foreign Keys Across Data Sources
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../scenario/data-validation/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Data Validations
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
     <li class="md-nav__item">
       <a href="../../scenario/auto-generate-connection/" class="md-nav__link">
         
diff --git a/site/setup/guide/data-source/marquez-metadata-source/index.html b/site/setup/guide/data-source/marquez-metadata-source/index.html
index 585b5d08..afd71c0c 100644
--- a/site/setup/guide/data-source/marquez-metadata-source/index.html
+++ b/site/setup/guide/data-source/marquez-metadata-source/index.html
@@ -676,6 +676,58 @@
   
   
   
+    <li class="md-nav__item">
+      <a href="../../scenario/batch-and-event/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Foreign Keys Across Data Sources
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../scenario/data-validation/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Data Validations
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
     <li class="md-nav__item">
       <a href="../../scenario/auto-generate-connection/" class="md-nav__link">
         
diff --git a/site/setup/guide/data-source/open-metadata-source/index.html b/site/setup/guide/data-source/open-metadata-source/index.html
index 870c9470..0b7f97f8 100644
--- a/site/setup/guide/data-source/open-metadata-source/index.html
+++ b/site/setup/guide/data-source/open-metadata-source/index.html
@@ -676,6 +676,58 @@
   
   
   
+    <li class="md-nav__item">
+      <a href="../../scenario/batch-and-event/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Foreign Keys Across Data Sources
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../scenario/data-validation/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Data Validations
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
     <li class="md-nav__item">
       <a href="../../scenario/auto-generate-connection/" class="md-nav__link">
         
diff --git a/site/setup/guide/data-source/solace/index.html b/site/setup/guide/data-source/solace/index.html
index 6d5f9663..2c106180 100644
--- a/site/setup/guide/data-source/solace/index.html
+++ b/site/setup/guide/data-source/solace/index.html
@@ -678,6 +678,58 @@
   
   
   
+    <li class="md-nav__item">
+      <a href="../../scenario/batch-and-event/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Foreign Keys Across Data Sources
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../scenario/data-validation/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Data Validations
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
     <li class="md-nav__item">
       <a href="../../scenario/auto-generate-connection/" class="md-nav__link">
         
diff --git a/site/setup/guide/index.html b/site/setup/guide/index.html
index 713fa157..44a4011c 100644
--- a/site/setup/guide/index.html
+++ b/site/setup/guide/index.html
@@ -835,6 +835,58 @@
   
   
   
+    <li class="md-nav__item">
+      <a href="scenario/batch-and-event/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Foreign Keys Across Data Sources
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="scenario/data-validation/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Data Validations
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
     <li class="md-nav__item">
       <a href="scenario/auto-generate-connection/" class="md-nav__link">
         
@@ -1916,7 +1968,7 @@ <h2 id="scenarios">Scenarios</h2>
 <li><strong><a href="scenario/first-data-generation/">First Data Generation</a></strong> - If you are new, this is the place to start</li>
 <li><strong><a href="scenario/records-per-column/">Multiple Records Per Column Value</a></strong> - How you can generate multiple records per set of columns</li>
 <li><strong><a href="scenario/batch-and-event/">Foreign Keys Across Data Sources</a></strong> - Generate matching values across generated data sets</li>
-<li><strong><a href="scenario/first-data-generation/">Data Validations</a></strong> - (Soon to document) Run data validations after generating data</li>
+<li><strong><a href="scenario/data-validation/">Data Validations</a></strong> - Run data validations after generating data</li>
 <li><strong><a href="scenario/auto-generate-connection/">Auto Generate From Data Connection</a></strong> - Automatically generating data from just defining data sources</li>
 <li><strong><a href="scenario/delete-generated-data/">Delete Generated Data</a></strong> - Delete the generated data whilst leaving other data</li>
 <li><strong><a href="scenario/batch-and-event/">Generate Batch and Event Data</a></strong> - Generate matching batch and event data</li>
diff --git a/site/setup/guide/scenario/auto-generate-connection/index.html b/site/setup/guide/scenario/auto-generate-connection/index.html
index 944e2839..7e87ab7c 100644
--- a/site/setup/guide/scenario/auto-generate-connection/index.html
+++ b/site/setup/guide/scenario/auto-generate-connection/index.html
@@ -11,7 +11,7 @@
         <link rel="canonical" href="https://data.catering/setup/guide/scenario/auto-generate-connection/">
       
       
-        <link rel="prev" href="../records-per-column/">
+        <link rel="prev" href="../data-validation/">
       
       
         <link rel="next" href="../delete-generated-data/">
@@ -677,6 +677,58 @@
                 
   
   
+  
+    <li class="md-nav__item">
+      <a href="../batch-and-event/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Foreign Keys Across Data Sources
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../data-validation/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Data Validations
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
     
   
   
@@ -2104,7 +2156,7 @@ <h4 id="define-record-count">Define record count</h4>
       <nav class="md-footer__inner md-grid" aria-label="Footer" >
         
           
-          <a href="../records-per-column/" class="md-footer__link md-footer__link--prev" aria-label="Previous: Multiple Records Per Column Value">
+          <a href="../data-validation/" class="md-footer__link md-footer__link--prev" aria-label="Previous: Data Validations">
             <div class="md-footer__button md-icon">
               
               <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="M20 11v2H8l5.5 5.5-1.42 1.42L4.16 12l7.92-7.92L13.5 5.5 8 11h12Z"/></svg>
@@ -2114,7 +2166,7 @@ <h4 id="define-record-count">Define record count</h4>
                 Previous
               </span>
               <div class="md-ellipsis">
-                Multiple Records Per Column Value
+                Data Validations
               </div>
             </div>
           </a>
diff --git a/site/setup/guide/scenario/batch-and-event/index.html b/site/setup/guide/scenario/batch-and-event/index.html
index 8a891498..0c257f16 100644
--- a/site/setup/guide/scenario/batch-and-event/index.html
+++ b/site/setup/guide/scenario/batch-and-event/index.html
@@ -678,6 +678,58 @@
   
   
   
+    <li class="md-nav__item">
+      <a href="./" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Foreign Keys Across Data Sources
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../data-validation/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Data Validations
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
     <li class="md-nav__item">
       <a href="../auto-generate-connection/" class="md-nav__link">
         
diff --git a/site/setup/guide/scenario/data-validation/index.html b/site/setup/guide/scenario/data-validation/index.html
new file mode 100644
index 00000000..e61fc6f7
--- /dev/null
+++ b/site/setup/guide/scenario/data-validation/index.html
@@ -0,0 +1,2526 @@
+
+<!doctype html>
+<html lang="en" class="no-js">
+  <head>
+    
+      <meta charset="utf-8">
+      <meta name="viewport" content="width=device-width,initial-scale=1">
+      
+        <meta name="description" content="Validate data via basic checks and group by aggregates across columns and the whole dataset.">
+      
+      
+      
+        <link rel="canonical" href="https://data.catering/setup/guide/scenario/data-validation/">
+      
+      
+        <link rel="prev" href="../batch-and-event/">
+      
+      
+        <link rel="next" href="../auto-generate-connection/">
+      
+      
+        
+      
+      
+      <link rel="icon" href="../../../../diagrams/logo/data_catering_transparent.svg">
+      <meta name="generator" content="mkdocs-1.5.3, mkdocs-material-9.4.2+insiders-4.42.0">
+    
+    
+      
+        <title>Data Validations - Data Catering</title>
+      
+    
+    
+      <link rel="stylesheet" href="../../../../assets/stylesheets/main.63a9ca82.min.css">
+      
+        
+        <link rel="stylesheet" href="../../../../assets/stylesheets/palette.46987102.min.css">
+      
+      
+
+
+    
+    
+      
+    
+    
+      
+        
+        
+        <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
+        <link rel="stylesheet" href="https://fonts.googleapis.com/css?family=Roboto:300,300i,400,400i,700,700i%7CRoboto+Mono:400,400i,700,700i&display=fallback">
+        <style>:root{--md-text-font:"Roboto";--md-code-font:"Roboto Mono"}</style>
+      
+    
+    
+      <link rel="stylesheet" href="../../../../stylesheets/extra.css">
+    
+    <script>__md_scope=new URL("../../../..",location),__md_hash=e=>[...e].reduce((e,_)=>(e<<5)-e+_.charCodeAt(0),0),__md_get=(e,_=localStorage,t=__md_scope)=>JSON.parse(_.getItem(t.pathname+"."+e)),__md_set=(e,_,t=localStorage,a=__md_scope)=>{try{t.setItem(a.pathname+"."+e,JSON.stringify(_))}catch(e){}}</script>
+    
+      
+  
+
+
+  
+  
+
+<script id="__analytics">function __md_analytics(){function n(){dataLayer.push(arguments)}window.dataLayer=window.dataLayer||[],n("js",new Date),n("config","G-4098CTH5TX"),document.addEventListener("DOMContentLoaded",function(){document.forms.search&&document.forms.search.query.addEventListener("blur",function(){this.value&&n("event","search",{search_term:this.value})}),document$.subscribe(function(){var a=document.forms.feedback;if(void 0!==a)for(var e of a.querySelectorAll("[type=submit]"))e.addEventListener("click",function(e){e.preventDefault();var t=document.location.pathname,e=this.getAttribute("data-md-value");n("event","feedback",{page:t,data:e}),a.firstElementChild.disabled=!0;e=a.querySelector(".md-feedback__note [data-md-value='"+e+"']");e&&(e.hidden=!1)}),a.hidden=!1}),location$.subscribe(function(e){n("config","G-4098CTH5TX",{page_path:e.pathname})})});var e=document.createElement("script");e.async=!0,e.src="https://www.googletagmanager.com/gtag/js?id=G-4098CTH5TX",document.getElementById("__analytics").insertAdjacentElement("afterEnd",e)}</script>
+  
+    <script>var consent;"undefined"==typeof __md_analytics||(consent=__md_get("__consent"))&&consent.analytics&&__md_analytics()</script>
+  
+
+    
+    
+  </head>
+  
+  
+    
+    
+      
+    
+    
+    
+    
+    <body dir="ltr" data-md-color-scheme="default" data-md-color-primary="deep-orange" data-md-color-accent="orange">
+  
+    
+    <input class="md-toggle" data-md-toggle="drawer" type="checkbox" id="__drawer" autocomplete="off">
+    <input class="md-toggle" data-md-toggle="search" type="checkbox" id="__search" autocomplete="off">
+    <label class="md-overlay" for="__drawer"></label>
+    <div data-md-component="skip">
+      
+        
+        <a href="#data-validations" class="md-skip">
+          Skip to content
+        </a>
+      
+    </div>
+    <div data-md-component="announce">
+      
+    </div>
+    
+    
+      
+
+<header class="md-header" data-md-component="header">
+  <nav class="md-header__inner md-grid" aria-label="Header">
+    <a href="../../../.." title="Data Catering" class="md-header__button md-logo" aria-label="Data Catering" data-md-component="logo">
+      
+  <img src="../../../../diagrams/logo/data_catering_transparent.svg" alt="logo">
+
+    </a>
+    <label class="md-header__button md-icon" for="__drawer">
+      
+      <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="M3 6h18v2H3V6m0 5h18v2H3v-2m0 5h18v2H3v-2Z"/></svg>
+    </label>
+    <div class="md-header__title" data-md-component="header-title">
+      <div class="md-header__ellipsis">
+        <div class="md-header__topic">
+          <span class="md-ellipsis">
+            Data Catering
+          </span>
+        </div>
+        <div class="md-header__topic" data-md-component="header-topic">
+          <span class="md-ellipsis">
+            
+              Data Validations
+            
+          </span>
+        </div>
+      </div>
+    </div>
+    
+      
+        <form class="md-header__option" data-md-component="palette">
+  
+    
+    
+    
+    <input class="md-option" data-md-color-media="(prefers-color-scheme)" data-md-color-scheme="default" data-md-color-primary="deep-orange" data-md-color-accent="orange"  aria-label="Switch to light mode"  type="radio" name="__palette" id="__palette_0">
+    
+      <label class="md-header__button md-icon" title="Switch to light mode" for="__palette_1" hidden>
+        <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="m14.3 16-.7-2h-3.2l-.7 2H7.8L11 7h2l3.2 9h-1.9M20 8.69V4h-4.69L12 .69 8.69 4H4v4.69L.69 12 4 15.31V20h4.69L12 23.31 15.31 20H20v-4.69L23.31 12 20 8.69m-9.15 3.96h2.3L12 9l-1.15 3.65Z"/></svg>
+      </label>
+    
+  
+    
+    
+    
+    <input class="md-option" data-md-color-media="(prefers-color-scheme: light)" data-md-color-scheme="default" data-md-color-primary="deep-orange" data-md-color-accent="orange"  aria-label="Switch to dark mode"  type="radio" name="__palette" id="__palette_1">
+    
+      <label class="md-header__button md-icon" title="Switch to dark mode" for="__palette_2" hidden>
+        <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="M12 8a4 4 0 0 0-4 4 4 4 0 0 0 4 4 4 4 0 0 0 4-4 4 4 0 0 0-4-4m0 10a6 6 0 0 1-6-6 6 6 0 0 1 6-6 6 6 0 0 1 6 6 6 6 0 0 1-6 6m8-9.31V4h-4.69L12 .69 8.69 4H4v4.69L.69 12 4 15.31V20h4.69L12 23.31 15.31 20H20v-4.69L23.31 12 20 8.69Z"/></svg>
+      </label>
+    
+  
+    
+    
+    
+    <input class="md-option" data-md-color-media="(prefers-color-scheme: dark)" data-md-color-scheme="slate" data-md-color-primary="deep-orange" data-md-color-accent="orange"  aria-label="Switch to system preference"  type="radio" name="__palette" id="__palette_2">
+    
+      <label class="md-header__button md-icon" title="Switch to system preference" for="__palette_0" hidden>
+        <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="M12 18c-.89 0-1.74-.2-2.5-.55C11.56 16.5 13 14.42 13 12c0-2.42-1.44-4.5-3.5-5.45C10.26 6.2 11.11 6 12 6a6 6 0 0 1 6 6 6 6 0 0 1-6 6m8-9.31V4h-4.69L12 .69 8.69 4H4v4.69L.69 12 4 15.31V20h4.69L12 23.31 15.31 20H20v-4.69L23.31 12 20 8.69Z"/></svg>
+      </label>
+    
+  
+</form>
+      
+    
+    
+      <script>var media,input,key,value,palette=__md_get("__palette");if(palette&&palette.color){"(prefers-color-scheme)"===palette.color.media&&(media=matchMedia("(prefers-color-scheme: light)"),input=document.querySelector(media.matches?"[data-md-color-media='(prefers-color-scheme: light)']":"[data-md-color-media='(prefers-color-scheme: dark)']"),palette.color.media=input.getAttribute("data-md-color-media"),palette.color.scheme=input.getAttribute("data-md-color-scheme"),palette.color.primary=input.getAttribute("data-md-color-primary"),palette.color.accent=input.getAttribute("data-md-color-accent"));for([key,value]of Object.entries(palette.color))document.body.setAttribute("data-md-color-"+key,value)}</script>
+    
+    
+    
+      <label class="md-header__button md-icon" for="__search">
+        
+        <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="M9.5 3A6.5 6.5 0 0 1 16 9.5c0 1.61-.59 3.09-1.56 4.23l.27.27h.79l5 5-1.5 1.5-5-5v-.79l-.27-.27A6.516 6.516 0 0 1 9.5 16 6.5 6.5 0 0 1 3 9.5 6.5 6.5 0 0 1 9.5 3m0 2C7 5 5 7 5 9.5S7 14 9.5 14 14 12 14 9.5 12 5 9.5 5Z"/></svg>
+      </label>
+      <div class="md-search" data-md-component="search" role="dialog">
+  <label class="md-search__overlay" for="__search"></label>
+  <div class="md-search__inner" role="search">
+    <form class="md-search__form" name="search">
+      <input type="text" class="md-search__input" name="query" aria-label="Search" placeholder="Search" autocapitalize="off" autocorrect="off" autocomplete="off" spellcheck="false" data-md-component="search-query" required>
+      <label class="md-search__icon md-icon" for="__search">
+        
+        <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="M9.5 3A6.5 6.5 0 0 1 16 9.5c0 1.61-.59 3.09-1.56 4.23l.27.27h.79l5 5-1.5 1.5-5-5v-.79l-.27-.27A6.516 6.516 0 0 1 9.5 16 6.5 6.5 0 0 1 3 9.5 6.5 6.5 0 0 1 9.5 3m0 2C7 5 5 7 5 9.5S7 14 9.5 14 14 12 14 9.5 12 5 9.5 5Z"/></svg>
+        
+        <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="M20 11v2H8l5.5 5.5-1.42 1.42L4.16 12l7.92-7.92L13.5 5.5 8 11h12Z"/></svg>
+      </label>
+      <nav class="md-search__options" aria-label="Search">
+        
+        <button type="reset" class="md-search__icon md-icon" title="Clear" aria-label="Clear" tabindex="-1">
+          
+          <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="M19 6.41 17.59 5 12 10.59 6.41 5 5 6.41 10.59 12 5 17.59 6.41 19 12 13.41 17.59 19 19 17.59 13.41 12 19 6.41Z"/></svg>
+        </button>
+      </nav>
+      
+    </form>
+    <div class="md-search__output">
+      <div class="md-search__scrollwrap" data-md-scrollfix>
+        <div class="md-search-result" data-md-component="search-result">
+          <div class="md-search-result__meta">
+            Initializing search
+          </div>
+          <ol class="md-search-result__list" role="presentation"></ol>
+        </div>
+      </div>
+    </div>
+  </div>
+</div>
+    
+    
+      <div class="md-header__source">
+        <a href="https://github.com/pflooky/data-caterer-docs" title="Go to repository" class="md-source" data-md-component="source">
+  <div class="md-source__icon md-icon">
+    
+    <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 448 512"><!--! Font Awesome Free 6.4.2 by @fontawesome - https://fontawesome.com License - https://fontawesome.com/license/free (Icons: CC BY 4.0, Fonts: SIL OFL 1.1, Code: MIT License) Copyright 2023 Fonticons, Inc.--><path d="M439.55 236.05 244 40.45a28.87 28.87 0 0 0-40.81 0l-40.66 40.63 51.52 51.52c27.06-9.14 52.68 16.77 43.39 43.68l49.66 49.66c34.23-11.8 61.18 31 35.47 56.69-26.49 26.49-70.21-2.87-56-37.34L240.22 199v121.85c25.3 12.54 22.26 41.85 9.08 55a34.34 34.34 0 0 1-48.55 0c-17.57-17.6-11.07-46.91 11.25-56v-123c-20.8-8.51-24.6-30.74-18.64-45L142.57 101 8.45 235.14a28.86 28.86 0 0 0 0 40.81l195.61 195.6a28.86 28.86 0 0 0 40.8 0l194.69-194.69a28.86 28.86 0 0 0 0-40.81z"/></svg>
+  </div>
+  <div class="md-source__repository">
+    data-caterer-docs
+  </div>
+</a>
+      </div>
+    
+  </nav>
+  
+</header>
+    
+    <div class="md-container" data-md-component="container">
+      
+      
+        
+          
+            
+<nav class="md-tabs" aria-label="Tabs" data-md-component="tabs">
+  <div class="md-grid">
+    <ul class="md-tabs__list">
+      
+        
+  
+  
+  
+    <li class="md-tabs__item">
+      <a href="../../../.." class="md-tabs__link">
+        
+  
+    
+  
+  Home
+
+      </a>
+    </li>
+  
+
+      
+        
+  
+  
+  
+    <li class="md-tabs__item">
+      <a href="../../../../get-started/docker/" class="md-tabs__link">
+        
+  
+    
+  
+  Get Started
+
+      </a>
+    </li>
+  
+
+      
+        
+  
+  
+    
+  
+  
+    
+    
+      <li class="md-tabs__item md-tabs__item--active">
+        <a href="../../../" class="md-tabs__link">
+          
+  
+  Setup
+
+        </a>
+      </li>
+    
+  
+
+      
+        
+  
+  
+  
+    
+    
+      <li class="md-tabs__item">
+        <a href="../../../../use-case/use-case/" class="md-tabs__link">
+          
+  
+  Use Case and Comparison
+
+        </a>
+      </li>
+    
+  
+
+      
+        
+  
+  
+  
+    
+    
+      <li class="md-tabs__item">
+        <a href="../../../../about/" class="md-tabs__link">
+          
+  
+  About
+
+        </a>
+      </li>
+    
+  
+
+      
+    </ul>
+  </div>
+</nav>
+          
+        
+      
+      <main class="md-main" data-md-component="main">
+        <div class="md-main__inner md-grid">
+          
+            
+              
+              <div class="md-sidebar md-sidebar--primary" data-md-component="sidebar" data-md-type="navigation" >
+                <div class="md-sidebar__scrollwrap">
+                  <div class="md-sidebar__inner">
+                    
+
+
+  
+
+
+  
+
+<nav class="md-nav md-nav--primary md-nav--lifted md-nav--integrated" aria-label="Navigation" data-md-level="0">
+  <label class="md-nav__title" for="__drawer">
+    <a href="../../../.." title="Data Catering" class="md-nav__button md-logo" aria-label="Data Catering" data-md-component="logo">
+      
+  <img src="../../../../diagrams/logo/data_catering_transparent.svg" alt="logo">
+
+    </a>
+    Data Catering
+  </label>
+  
+    <div class="md-nav__source">
+      <a href="https://github.com/pflooky/data-caterer-docs" title="Go to repository" class="md-source" data-md-component="source">
+  <div class="md-source__icon md-icon">
+    
+    <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 448 512"><!--! Font Awesome Free 6.4.2 by @fontawesome - https://fontawesome.com License - https://fontawesome.com/license/free (Icons: CC BY 4.0, Fonts: SIL OFL 1.1, Code: MIT License) Copyright 2023 Fonticons, Inc.--><path d="M439.55 236.05 244 40.45a28.87 28.87 0 0 0-40.81 0l-40.66 40.63 51.52 51.52c27.06-9.14 52.68 16.77 43.39 43.68l49.66 49.66c34.23-11.8 61.18 31 35.47 56.69-26.49 26.49-70.21-2.87-56-37.34L240.22 199v121.85c25.3 12.54 22.26 41.85 9.08 55a34.34 34.34 0 0 1-48.55 0c-17.57-17.6-11.07-46.91 11.25-56v-123c-20.8-8.51-24.6-30.74-18.64-45L142.57 101 8.45 235.14a28.86 28.86 0 0 0 0 40.81l195.61 195.6a28.86 28.86 0 0 0 40.8 0l194.69-194.69a28.86 28.86 0 0 0 0-40.81z"/></svg>
+  </div>
+  <div class="md-source__repository">
+    data-caterer-docs
+  </div>
+</a>
+    </div>
+  
+  <ul class="md-nav__list" data-md-scrollfix>
+    
+      
+      
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../../.." class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Home
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+    
+      
+      
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../../../get-started/docker/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Get Started
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+    
+      
+      
+  
+  
+    
+  
+  
+    
+    
+    
+    
+    
+      
+      
+    
+    <li class="md-nav__item md-nav__item--active md-nav__item--section md-nav__item--nested">
+      
+        
+        
+        
+        <input class="md-nav__toggle md-toggle " type="checkbox" id="__nav_3" checked>
+        
+          
+          <label class="md-nav__link" for="__nav_3" id="__nav_3_label" tabindex="">
+            
+  
+  <span class="md-ellipsis">
+    
+  
+    Setup
+  
+
+    
+  </span>
+  
+  
+
+            <span class="md-nav__icon md-icon"></span>
+          </label>
+        
+        <nav class="md-nav" data-md-level="1" aria-labelledby="__nav_3_label" aria-expanded="true">
+          <label class="md-nav__title" for="__nav_3">
+            <span class="md-nav__icon md-icon"></span>
+            
+  
+    Setup
+  
+
+          </label>
+          <ul class="md-nav__list" data-md-scrollfix>
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../../" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Setup
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+    
+  
+  
+    
+    
+    
+    
+    
+      
+      
+    
+    <li class="md-nav__item md-nav__item--active md-nav__item--section md-nav__item--nested">
+      
+        
+        
+        
+        <input class="md-nav__toggle md-toggle " type="checkbox" id="__nav_3_2" checked>
+        
+          
+          <label class="md-nav__link" for="__nav_3_2" id="__nav_3_2_label" tabindex="">
+            
+  
+  <span class="md-ellipsis">
+    
+  
+    Guide
+  
+
+    
+  </span>
+  
+  
+
+            <span class="md-nav__icon md-icon"></span>
+          </label>
+        
+        <nav class="md-nav" data-md-level="2" aria-labelledby="__nav_3_2_label" aria-expanded="true">
+          <label class="md-nav__title" for="__nav_3_2">
+            <span class="md-nav__icon md-icon"></span>
+            
+  
+    Guide
+  
+
+          </label>
+          <ul class="md-nav__list" data-md-scrollfix>
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Guides
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+    
+  
+  
+    
+    
+    
+    
+    
+    <li class="md-nav__item md-nav__item--active md-nav__item--nested">
+      
+        
+        
+        
+        <input class="md-nav__toggle md-toggle " type="checkbox" id="__nav_3_2_2" checked>
+        
+          
+          <label class="md-nav__link" for="__nav_3_2_2" id="__nav_3_2_2_label" tabindex="0">
+            
+  
+  <span class="md-ellipsis">
+    
+  
+    Scenario
+  
+
+    
+  </span>
+  
+  
+
+            <span class="md-nav__icon md-icon"></span>
+          </label>
+        
+        <nav class="md-nav" data-md-level="3" aria-labelledby="__nav_3_2_2_label" aria-expanded="true">
+          <label class="md-nav__title" for="__nav_3_2_2">
+            <span class="md-nav__icon md-icon"></span>
+            
+  
+    Scenario
+  
+
+          </label>
+          <ul class="md-nav__list" data-md-scrollfix>
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../first-data-generation/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    First Data Generation
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../records-per-column/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Multiple Records Per Column Value
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../batch-and-event/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Foreign Keys Across Data Sources
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+    
+  
+  
+    <li class="md-nav__item md-nav__item--active">
+      
+      <input class="md-nav__toggle md-toggle" type="checkbox" id="__toc">
+      
+      
+        
+      
+      
+        <label class="md-nav__link md-nav__link--active" for="__toc">
+          
+  
+  <span class="md-ellipsis">
+    
+  
+    Data Validations
+  
+
+    
+  </span>
+  
+  
+
+          <span class="md-nav__icon md-icon"></span>
+        </label>
+      
+      <a href="./" class="md-nav__link md-nav__link--active">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Data Validations
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+      
+        
+
+<nav class="md-nav md-nav--secondary" aria-label="Table of contents">
+  
+  
+  
+    
+  
+  
+    <label class="md-nav__title" for="__toc">
+      <span class="md-nav__icon md-icon"></span>
+      Table of contents
+    </label>
+    <ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
+      
+        <li class="md-nav__item">
+  <a href="#requirements" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Requirements
+      
+    </span>
+  </a>
+  
+</li>
+      
+        <li class="md-nav__item">
+  <a href="#get-started" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Get Started
+      
+    </span>
+  </a>
+  
+    <nav class="md-nav" aria-label="Get Started">
+      <ul class="md-nav__list">
+        
+          <li class="md-nav__item">
+  <a href="#data-setup" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Data Setup
+      
+    </span>
+  </a>
+  
+</li>
+        
+          <li class="md-nav__item">
+  <a href="#plan-setup" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Plan Setup
+      
+    </span>
+  </a>
+  
+</li>
+        
+          <li class="md-nav__item">
+  <a href="#validations" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Validations
+      
+    </span>
+  </a>
+  
+    <nav class="md-nav" aria-label="Validations">
+      <ul class="md-nav__list">
+        
+          <li class="md-nav__item">
+  <a href="#basic-validation" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Basic Validation
+      
+    </span>
+  </a>
+  
+    <nav class="md-nav" aria-label="Basic Validation">
+      <ul class="md-nav__list">
+        
+          <li class="md-nav__item">
+  <a href="#validation-criteria" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Validation Criteria
+      
+    </span>
+  </a>
+  
+</li>
+        
+          <li class="md-nav__item">
+  <a href="#considerations" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Considerations
+      
+    </span>
+  </a>
+  
+</li>
+        
+          <li class="md-nav__item">
+  <a href="#custom-validation" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Custom Validation
+      
+    </span>
+  </a>
+  
+</li>
+        
+      </ul>
+    </nav>
+  
+</li>
+        
+          <li class="md-nav__item">
+  <a href="#group-by-validation" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Group By Validation
+      
+    </span>
+  </a>
+  
+    <nav class="md-nav" aria-label="Group By Validation">
+      <ul class="md-nav__list">
+        
+          <li class="md-nav__item">
+  <a href="#validation-criteria_1" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Validation Criteria
+      
+    </span>
+  </a>
+  
+</li>
+        
+          <li class="md-nav__item">
+  <a href="#considerations_1" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Considerations
+      
+    </span>
+  </a>
+  
+</li>
+        
+      </ul>
+    </nav>
+  
+</li>
+        
+          <li class="md-nav__item">
+  <a href="#sample-validation" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Sample Validation
+      
+    </span>
+  </a>
+  
+</li>
+        
+      </ul>
+    </nav>
+  
+</li>
+        
+          <li class="md-nav__item">
+  <a href="#run" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Run
+      
+    </span>
+  </a>
+  
+</li>
+        
+      </ul>
+    </nav>
+  
+</li>
+      
+    </ul>
+  
+</nav>
+      
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../auto-generate-connection/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Auto Generate From Data Connection
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../delete-generated-data/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Delete Generated Data
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../batch-and-event/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Generate Batch and Event Data
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+          </ul>
+        </nav>
+      
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    
+    
+    
+    
+    
+    <li class="md-nav__item md-nav__item--nested">
+      
+        
+        
+        
+        <input class="md-nav__toggle md-toggle " type="checkbox" id="__nav_3_2_3" >
+        
+          
+          <label class="md-nav__link" for="__nav_3_2_3" id="__nav_3_2_3_label" tabindex="0">
+            
+  
+  <span class="md-ellipsis">
+    
+  
+    Data Sources
+  
+
+    
+  </span>
+  
+  
+
+            <span class="md-nav__icon md-icon"></span>
+          </label>
+        
+        <nav class="md-nav" data-md-level="3" aria-labelledby="__nav_3_2_3_label" aria-expanded="false">
+          <label class="md-nav__title" for="__nav_3_2_3">
+            <span class="md-nav__icon md-icon"></span>
+            
+  
+    Data Sources
+  
+
+          </label>
+          <ul class="md-nav__list" data-md-scrollfix>
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../first-data-generation/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Files (CSV, JSON, ORC, Parquet)
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../data-source/cassandra/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Cassandra
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../data-source/kafka/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Kafka
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../data-source/solace/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Solace
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../data-source/marquez-metadata-source/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Marquez Metadata Source
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../data-source/open-metadata-source/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    OpenMetadata Source
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../data-source/http/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    HTTP
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+          </ul>
+        </nav>
+      
+    </li>
+  
+
+              
+            
+          </ul>
+        </nav>
+      
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../../configuration/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Configuration
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../../connection/connection/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Connection
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    
+    
+    
+    
+    
+      
+      
+    
+    <li class="md-nav__item md-nav__item--section md-nav__item--nested">
+      
+        
+        
+        
+        <input class="md-nav__toggle md-toggle " type="checkbox" id="__nav_3_5" >
+        
+          
+          <label class="md-nav__link" for="__nav_3_5" id="__nav_3_5_label" tabindex="">
+            
+  
+  <span class="md-ellipsis">
+    
+  
+    Generator
+  
+
+    
+  </span>
+  
+  
+
+            <span class="md-nav__icon md-icon"></span>
+          </label>
+        
+        <nav class="md-nav" data-md-level="2" aria-labelledby="__nav_3_5_label" aria-expanded="false">
+          <label class="md-nav__title" for="__nav_3_5">
+            <span class="md-nav__icon md-icon"></span>
+            
+  
+    Generator
+  
+
+          </label>
+          <ul class="md-nav__list" data-md-scrollfix>
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../../generator/generator/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Data Generator
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../../generator/count/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Record Count
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../../generator/report/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Report
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+          </ul>
+        </nav>
+      
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    
+    
+    
+    
+    
+      
+      
+    
+    <li class="md-nav__item md-nav__item--section md-nav__item--nested">
+      
+        
+        
+        
+        <input class="md-nav__toggle md-toggle " type="checkbox" id="__nav_3_6" >
+        
+          
+          <label class="md-nav__link" for="__nav_3_6" id="__nav_3_6_label" tabindex="">
+            
+  
+  <span class="md-ellipsis">
+    
+  
+    Validation
+  
+
+    
+  </span>
+  
+  
+
+            <span class="md-nav__icon md-icon"></span>
+          </label>
+        
+        <nav class="md-nav" data-md-level="2" aria-labelledby="__nav_3_6_label" aria-expanded="false">
+          <label class="md-nav__title" for="__nav_3_6">
+            <span class="md-nav__icon md-icon"></span>
+            
+  
+    Validation
+  
+
+          </label>
+          <ul class="md-nav__list" data-md-scrollfix>
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../../validation/validation/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Validations
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../../validation/basic-validation/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Basic
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../../validation/group-by-validation/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Group by/Aggregate
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+          </ul>
+        </nav>
+      
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../../foreign-key/foreign-key/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Foreign Keys
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../../deployment/deployment/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Deployment
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../../advanced/advanced/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Advanced
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+          </ul>
+        </nav>
+      
+    </li>
+  
+
+    
+      
+      
+  
+  
+  
+    
+    
+    
+    
+    
+      
+      
+    
+    <li class="md-nav__item md-nav__item--section md-nav__item--nested">
+      
+        
+        
+        
+        <input class="md-nav__toggle md-toggle " type="checkbox" id="__nav_4" >
+        
+          
+          <label class="md-nav__link" for="__nav_4" id="__nav_4_label" tabindex="">
+            
+  
+  <span class="md-ellipsis">
+    
+  
+    Use Case and Comparison
+  
+
+    
+  </span>
+  
+  
+
+            <span class="md-nav__icon md-icon"></span>
+          </label>
+        
+        <nav class="md-nav" data-md-level="1" aria-labelledby="__nav_4_label" aria-expanded="false">
+          <label class="md-nav__title" for="__nav_4">
+            <span class="md-nav__icon md-icon"></span>
+            
+  
+    Use Case and Comparison
+  
+
+          </label>
+          <ul class="md-nav__list" data-md-scrollfix>
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../../../use-case/use-case/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Use cases
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../../../use-case/business-value/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Business Value
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../../../use-case/comparison/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Comparison
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../../../use-case/roadmap/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Roadmap
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+          </ul>
+        </nav>
+      
+    </li>
+  
+
+    
+      
+      
+  
+  
+  
+    
+    
+    
+    
+    
+      
+      
+    
+    <li class="md-nav__item md-nav__item--section md-nav__item--nested">
+      
+        
+        
+        
+        <input class="md-nav__toggle md-toggle " type="checkbox" id="__nav_5" >
+        
+          
+          <label class="md-nav__link" for="__nav_5" id="__nav_5_label" tabindex="">
+            
+  
+  <span class="md-ellipsis">
+    
+  
+    About
+  
+
+    
+  </span>
+  
+  
+
+            <span class="md-nav__icon md-icon"></span>
+          </label>
+        
+        <nav class="md-nav" data-md-level="1" aria-labelledby="__nav_5_label" aria-expanded="false">
+          <label class="md-nav__title" for="__nav_5">
+            <span class="md-nav__icon md-icon"></span>
+            
+  
+    About
+  
+
+          </label>
+          <ul class="md-nav__list" data-md-scrollfix>
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../../../about/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    About
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../../../legal/terms-of-service/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Terms of service
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../../../legal/privacy-policy/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Privacy policy
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../../../pricing/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Pricing
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+          </ul>
+        </nav>
+      
+    </li>
+  
+
+    
+  </ul>
+</nav>
+                  </div>
+                </div>
+              </div>
+            
+            
+          
+          
+            <div class="md-content" data-md-component="content">
+              
+              <article class="md-content__inner md-typeset">
+                
+                  
+
+  
+  
+
+
+<h1 id="data-validations">Data Validations</h1>
+<p>Creating a data validator for a JSON file.</p>
+<p><img alt="Example data validation report" src="../../../../diagrams/data_validation_report.png" /></p>
+<h2 id="requirements">Requirements</h2>
+<ul>
+<li>5 minutes</li>
+<li>Git</li>
+<li>Gradle</li>
+<li>Docker</li>
+</ul>
+<h2 id="get-started">Get Started</h2>
+<p>First, we will clone the data-caterer-example repo, which already contains the required base project setup.</p>
+<div class="highlight"><pre><span></span><code>git<span class="w"> </span>clone<span class="w"> </span>git@github.com:pflooky/data-caterer-example.git
+</code></pre></div>
+<h3 id="data-setup">Data Setup</h3>
+<p>To demonstrate data validations, we will first generate some data for our validations to run
+against. Running the command below generates JSON files under the <code>docker/sample/json</code> folder.</p>
+<div class="highlight"><pre><span></span><code>./run.sh<span class="w"> </span>JsonPlan
+</code></pre></div>
+<h3 id="plan-setup">Plan Setup</h3>
+<p>Create a new Java or Scala class.</p>
+<ul>
+<li>Java: <code>src/main/java/com/github/pflooky/plan/MyValidationJavaPlan.java</code></li>
+<li>Scala: <code>src/main/scala/com/github/pflooky/plan/MyValidationPlan.scala</code></li>
+</ul>
+<p>Make sure your class extends <code>PlanRun</code>.</p>
+<div class="tabbed-set tabbed-alternate" data-tabs="1:2"><input checked="checked" id="__tabbed_1_1" name="__tabbed_1" type="radio" /><input id="__tabbed_1_2" name="__tabbed_1" type="radio" /><div class="tabbed-labels"><label for="__tabbed_1_1">Java</label><label for="__tabbed_1_2">Scala</label></div>
+<div class="tabbed-content">
+<div class="tabbed-block">
+<div class="highlight"><pre><span></span><code><span class="kn">import</span><span class="w"> </span><span class="nn">com.github.pflooky.datacaterer.java.api.PlanRun</span><span class="p">;</span>
+<span class="p">...</span>
+
+<span class="kd">public</span><span class="w"> </span><span class="kd">class</span> <span class="nc">MyValidationJavaPlan</span><span class="w"> </span><span class="kd">extends</span><span class="w"> </span><span class="n">PlanRun</span><span class="w"> </span><span class="p">{</span>
+<span class="w">    </span><span class="p">{</span>
+<span class="w">        </span><span class="kd">var</span><span class="w"> </span><span class="n">jsonTask</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">json</span><span class="p">(</span><span class="s">&quot;my_json&quot;</span><span class="p">,</span><span class="w"> </span><span class="s">&quot;/opt/app/data/json&quot;</span><span class="p">);</span>
+
+<span class="w">        </span><span class="kd">var</span><span class="w"> </span><span class="n">config</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">configuration</span><span class="p">()</span>
+<span class="w">                </span><span class="p">.</span><span class="na">generatedReportsFolderPath</span><span class="p">(</span><span class="s">&quot;/opt/app/data/report&quot;</span><span class="p">)</span>
+<span class="w">                </span><span class="p">.</span><span class="na">enableValidation</span><span class="p">(</span><span class="kc">true</span><span class="p">)</span>
+<span class="w">                </span><span class="p">.</span><span class="na">enableGenerateData</span><span class="p">(</span><span class="kc">false</span><span class="p">);</span>
+
+<span class="w">        </span><span class="n">execute</span><span class="p">(</span><span class="n">config</span><span class="p">,</span><span class="w"> </span><span class="n">jsonTask</span><span class="p">);</span>
+<span class="w">    </span><span class="p">}</span>
+<span class="p">}</span>
+</code></pre></div>
+</div>
+<div class="tabbed-block">
+<div class="highlight"><pre><span></span><code><span class="k">import</span><span class="w"> </span><span class="nn">com</span><span class="p">.</span><span class="nn">github</span><span class="p">.</span><span class="nn">pflooky</span><span class="p">.</span><span class="nn">datacaterer</span><span class="p">.</span><span class="nn">api</span><span class="p">.</span><span class="nc">PlanRun</span>
+<span class="p">...</span>
+
+<span class="k">class</span><span class="w"> </span><span class="nc">MyValidationPlan</span><span class="w"> </span><span class="k">extends</span><span class="w"> </span><span class="nc">PlanRun</span><span class="w"> </span><span class="p">{</span>
+<span class="w">  </span><span class="kd">val</span><span class="w"> </span><span class="n">jsonTask</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">json</span><span class="p">(</span><span class="s">&quot;my_json&quot;</span><span class="p">,</span><span class="w"> </span><span class="s">&quot;/opt/app/data/json&quot;</span><span class="p">)</span>
+
+<span class="w">  </span><span class="kd">val</span><span class="w"> </span><span class="n">config</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">configuration</span>
+<span class="w">    </span><span class="p">.</span><span class="n">generatedReportsFolderPath</span><span class="p">(</span><span class="s">&quot;/opt/app/data/report&quot;</span><span class="p">)</span>
+<span class="w">    </span><span class="p">.</span><span class="n">enableValidation</span><span class="p">(</span><span class="kc">true</span><span class="p">)</span>
+<span class="w">    </span><span class="p">.</span><span class="n">enableGenerateData</span><span class="p">(</span><span class="kc">false</span><span class="p">)</span>
+
+<span class="w">  </span><span class="n">execute</span><span class="p">(</span><span class="n">config</span><span class="p">,</span><span class="w"> </span><span class="n">jsonTask</span><span class="p">)</span>
+<span class="p">}</span>
+</code></pre></div>
+</div>
+</div>
+</div>
+<p>As shown above, we create a JSON task that points to the folder where the JSON data was generated, <code>/opt/app/data/json</code>.
+We also set <code>enableValidation</code> to <code>true</code> and <code>enableGenerateData</code> to <code>false</code> to tell Data Caterer that we
+only want to validate data.</p>
+<h3 id="validations">Validations</h3>
+<p>For reference, the schema we will be validating against is shown below.</p>
+<div class="highlight"><pre><span></span><code>.schema<span class="o">(</span>
+<span class="w">  </span>field.name<span class="o">(</span><span class="s2">&quot;account_id&quot;</span><span class="o">)</span>,
+<span class="w">  </span>field.name<span class="o">(</span><span class="s2">&quot;year&quot;</span><span class="o">)</span>.<span class="sb">`</span><span class="nb">type</span><span class="sb">`</span><span class="o">(</span>IntegerType<span class="o">)</span>,
+<span class="w">  </span>field.name<span class="o">(</span><span class="s2">&quot;balance&quot;</span><span class="o">)</span>.<span class="sb">`</span><span class="nb">type</span><span class="sb">`</span><span class="o">(</span>DoubleType<span class="o">)</span>,
+<span class="w">  </span>field.name<span class="o">(</span><span class="s2">&quot;date&quot;</span><span class="o">)</span>.<span class="sb">`</span><span class="nb">type</span><span class="sb">`</span><span class="o">(</span>DateType<span class="o">)</span>,
+<span class="w">  </span>field.name<span class="o">(</span><span class="s2">&quot;status&quot;</span><span class="o">)</span>,
+<span class="w">  </span>field.name<span class="o">(</span><span class="s2">&quot;update_history&quot;</span><span class="o">)</span>.<span class="sb">`</span><span class="nb">type</span><span class="sb">`</span><span class="o">(</span>ArrayType<span class="o">)</span>
+<span class="w">    </span>.schema<span class="o">(</span>
+<span class="w">      </span>field.name<span class="o">(</span><span class="s2">&quot;updated_time&quot;</span><span class="o">)</span>.<span class="sb">`</span><span class="nb">type</span><span class="sb">`</span><span class="o">(</span>TimestampType<span class="o">)</span>,
+<span class="w">      </span>field.name<span class="o">(</span><span class="s2">&quot;status&quot;</span><span class="o">)</span>.oneOf<span class="o">(</span><span class="s2">&quot;open&quot;</span>,<span class="w"> </span><span class="s2">&quot;closed&quot;</span>,<span class="w"> </span><span class="s2">&quot;pending&quot;</span>,<span class="w"> </span><span class="s2">&quot;suspended&quot;</span><span class="o">)</span>,
+<span class="w">    </span><span class="o">)</span>,
+<span class="w">  </span>field.name<span class="o">(</span><span class="s2">&quot;customer_details&quot;</span><span class="o">)</span>
+<span class="w">    </span>.schema<span class="o">(</span>
+<span class="w">      </span>field.name<span class="o">(</span><span class="s2">&quot;name&quot;</span><span class="o">)</span>.expression<span class="o">(</span><span class="s2">&quot;#{Name.name}&quot;</span><span class="o">)</span>,
+<span class="w">      </span>field.name<span class="o">(</span><span class="s2">&quot;age&quot;</span><span class="o">)</span>.<span class="sb">`</span><span class="nb">type</span><span class="sb">`</span><span class="o">(</span>IntegerType<span class="o">)</span>,
+<span class="w">      </span>field.name<span class="o">(</span><span class="s2">&quot;city&quot;</span><span class="o">)</span>.expression<span class="o">(</span><span class="s2">&quot;#{Address.city}&quot;</span><span class="o">)</span>
+<span class="w">    </span><span class="o">)</span>
+<span class="o">)</span>
+</code></pre></div>
+<h4 id="basic-validation">Basic Validation</h4>
+<p>Let's say our goal is to validate that the <code>customer_details.name</code> field conforms to the regex
+pattern <code>[A-Z][a-z]+ [A-Z][a-z]+</code>. Naming conventions differ across cultures and countries, so some variation
+(middle names, suffixes, prefixes, or language-specific differences) should be tolerated. To allow for this, the
+validation is only marked as failed once the share of non-matching records exceeds an error threshold.</p>
+<h5 id="validation-criteria">Validation Criteria</h5>
+<ul>
+<li>Field to Validate: <code>customer_details.name</code></li>
+<li>Regex Pattern: <code>[A-Z][a-z]+ [A-Z][a-z]+</code></li>
+<li>Error Tolerance: Fail the validation if more than 10% of records do not match the regex.</li>
+</ul>
+<h5 id="considerations">Considerations</h5>
+<ul>
+<li>Customisation<ul>
+<li>Adjust the regex pattern and error threshold based on your specific data schema and validation requirements.</li>
+<li>For the full list of types of basic validations that can be
+  used, <a href="../../../validation/basic-validation/">check this page</a>.</li>
+</ul>
+</li>
+<li>Understanding Tolerance<ul>
+<li>Be mindful of the error threshold, as it sets the percentage of records that may deviate from the pattern
+  before the validation is marked as failed.</li>
+</ul>
+</li>
+</ul>
+<div class="tabbed-set tabbed-alternate" data-tabs="2:2"><input checked="checked" id="__tabbed_2_1" name="__tabbed_2" type="radio" /><input id="__tabbed_2_2" name="__tabbed_2" type="radio" /><div class="tabbed-labels"><label for="__tabbed_2_1">Java</label><label for="__tabbed_2_2">Scala</label></div>
+<div class="tabbed-content">
+<div class="tabbed-block">
+<div class="highlight"><pre><span></span><code><span class="n">validation</span><span class="p">().</span><span class="na">col</span><span class="p">(</span><span class="s">&quot;customer_details.name&quot;</span><span class="p">)</span>
+<span class="w">    </span><span class="p">.</span><span class="na">matches</span><span class="p">(</span><span class="s">&quot;[A-Z][a-z]+ [A-Z][a-z]+&quot;</span><span class="p">)</span>
+<span class="w">    </span><span class="p">.</span><span class="na">errorThreshold</span><span class="p">(</span><span class="mf">0.1</span><span class="p">)</span><span class="w">                                      </span><span class="c1">//&lt;=10% failure rate is acceptable</span>
+<span class="w">    </span><span class="p">.</span><span class="na">description</span><span class="p">(</span><span class="s">&quot;Names generally follow the same pattern&quot;</span><span class="p">),</span><span class="w">  </span><span class="c1">//description to add context in the report or for other developers</span>
+</code></pre></div>
+</div>
+<div class="tabbed-block">
+<div class="highlight"><pre><span></span><code><span class="n">validation</span><span class="p">.</span><span class="n">col</span><span class="p">(</span><span class="s">&quot;customer_details.name&quot;</span><span class="p">)</span>
+<span class="w">  </span><span class="p">.</span><span class="n">matches</span><span class="p">(</span><span class="s">&quot;[A-Z][a-z]+ [A-Z][a-z]+&quot;</span><span class="p">)</span>
+<span class="w">  </span><span class="p">.</span><span class="n">errorThreshold</span><span class="p">(</span><span class="mf">0.1</span><span class="p">)</span><span class="w">                                      </span><span class="c1">//&lt;=10% failure rate is acceptable</span>
+<span class="w">  </span><span class="p">.</span><span class="n">description</span><span class="p">(</span><span class="s">&quot;Names generally follow the same pattern&quot;</span><span class="p">),</span><span class="w">  </span><span class="c1">//description to add context in the report or for other developers</span>
+</code></pre></div>
+</div>
+</div>
+</div>
+<h5 id="custom-validation">Custom Validation</h5>
+<p>There will be situations where you have a complex data setup and require your own custom logic for data validation.
+You can achieve this by defining your own SQL expression that returns a boolean value. In the example below,
+we check that every entry in the <code>update_history</code> array has an <code>updated_time</code> later than a given timestamp.</p>
+<div class="tabbed-set tabbed-alternate" data-tabs="3:2"><input checked="checked" id="__tabbed_3_1" name="__tabbed_3" type="radio" /><input id="__tabbed_3_2" name="__tabbed_3" type="radio" /><div class="tabbed-labels"><label for="__tabbed_3_1">Java</label><label for="__tabbed_3_2">Scala</label></div>
+<div class="tabbed-content">
+<div class="tabbed-block">
+<div class="highlight"><pre><span></span><code><span class="n">validation</span><span class="p">().</span><span class="na">expr</span><span class="p">(</span><span class="s">&quot;FORALL(update_history, x -&gt; x.updated_time &gt; TIMESTAMP(&#39;2022-01-01 00:00:00&#39;))&quot;</span><span class="p">),</span>
+</code></pre></div>
+</div>
+<div class="tabbed-block">
+<div class="highlight"><pre><span></span><code><span class="n">validation</span><span class="p">.</span><span class="n">expr</span><span class="p">(</span><span class="s">&quot;FORALL(update_history, x -&gt; x.updated_time &gt; TIMESTAMP(&#39;2022-01-01 00:00:00&#39;))&quot;</span><span class="p">),</span>
+</code></pre></div>
+</div>
+</div>
+</div>
+<p>If you want to know what other SQL functions are available for you to
+use, <a href="https://spark.apache.org/docs/latest/api/sql/">check this page</a>.</p>
+<h4 id="group-by-validation">Group By Validation</h4>
+<p>There are scenarios where you want to validate against grouped values or the whole dataset via aggregations. An example
+would be validating that the sum of each customer's transactions is greater than 0.</p>
+<h5 id="validation-criteria_1">Validation Criteria</h5>
+<p>Line 1: <code>validation.groupBy().count().isEqual(100)</code></p>
+<ul>
+<li>Method Chaining<ul>
+<li><code>groupBy()</code>: Groups the whole dataset, as no fields are specified.</li>
+<li><code>count()</code>: Counts the number of records in the dataset.</li>
+<li><code>isEqual(100)</code>: Checks if the count is equal to 100.</li>
+</ul>
+</li>
+<li>Validation Rule<ul>
+<li>This line ensures that the dataset contains exactly 100 records.</li>
+</ul>
+</li>
+</ul>
+<p>Line 2: <code>validation.groupBy("account_id").max("balance").lessThan(900)</code></p>
+<ul>
+<li>Method Chaining<ul>
+<li><code>groupBy("account_id")</code>: Groups the data based on the <code>account_id</code> field.</li>
+<li><code>max("balance")</code>: Calculates the maximum value of the <code>balance</code> field within each group.</li>
+<li><code>lessThan(900)</code>: Checks if the maximum balance in each group is less than 900.</li>
+</ul>
+</li>
+<li>Validation Rule<ul>
+<li>This line ensures that, for each group identified by <code>account_id</code>, the maximum balance is less than 900.</li>
+</ul>
+</li>
+</ul>
+<h5 id="considerations_1">Considerations</h5>
+<ul>
+<li>Adjust the <code>errorThreshold</code> or validation to your specific scenario. The full list
+  of <a href="../../../validation/validation/">types of validations can be found here</a>.</li>
+<li>For the full list of types of group by validations that can be
+  used, <a href="../../../validation/group-by-validation/">check this page</a>.</li>
+</ul>
+<div class="tabbed-set tabbed-alternate" data-tabs="4:2"><input checked="checked" id="__tabbed_4_1" name="__tabbed_4" type="radio" /><input id="__tabbed_4_2" name="__tabbed_4" type="radio" /><div class="tabbed-labels"><label for="__tabbed_4_1">Java</label><label for="__tabbed_4_2">Scala</label></div>
+<div class="tabbed-content">
+<div class="tabbed-block">
+<div class="highlight"><pre><span></span><code><span class="n">validation</span><span class="p">().</span><span class="na">groupBy</span><span class="p">().</span><span class="na">count</span><span class="p">().</span><span class="na">isEqual</span><span class="p">(</span><span class="mi">100</span><span class="p">),</span>
+<span class="n">validation</span><span class="p">().</span><span class="na">groupBy</span><span class="p">(</span><span class="s">&quot;account_id&quot;</span><span class="p">).</span><span class="na">max</span><span class="p">(</span><span class="s">&quot;balance&quot;</span><span class="p">).</span><span class="na">lessThan</span><span class="p">(</span><span class="mi">900</span><span class="p">)</span>
+</code></pre></div>
+</div>
+<div class="tabbed-block">
+<div class="highlight"><pre><span></span><code><span class="n">validation</span><span class="p">.</span><span class="n">groupBy</span><span class="p">().</span><span class="n">count</span><span class="p">().</span><span class="n">isEqual</span><span class="p">(</span><span class="mi">100</span><span class="p">),</span>
+<span class="n">validation</span><span class="p">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s">&quot;account_id&quot;</span><span class="p">).</span><span class="n">max</span><span class="p">(</span><span class="s">&quot;balance&quot;</span><span class="p">).</span><span class="n">lessThan</span><span class="p">(</span><span class="mi">900</span><span class="p">)</span>
+</code></pre></div>
+</div>
+</div>
+</div>
+<h4 id="sample-validation">Sample Validation</h4>
+<p>The sample below has been created to cover the majority of validation cases.</p>
+<div class="tabbed-set tabbed-alternate" data-tabs="5:2"><input checked="checked" id="__tabbed_5_1" name="__tabbed_5" type="radio" /><input id="__tabbed_5_2" name="__tabbed_5" type="radio" /><div class="tabbed-labels"><label for="__tabbed_5_1">Java</label><label for="__tabbed_5_2">Scala</label></div>
+<div class="tabbed-content">
+<div class="tabbed-block">
+<div class="highlight"><pre><span></span><code><span class="kd">var</span><span class="w"> </span><span class="n">jsonTask</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">json</span><span class="p">(</span><span class="s">&quot;my_json&quot;</span><span class="p">,</span><span class="w"> </span><span class="s">&quot;/opt/app/data/json&quot;</span><span class="p">)</span>
+<span class="w">        </span><span class="p">.</span><span class="na">validations</span><span class="p">(</span>
+<span class="w">                </span><span class="n">validation</span><span class="p">().</span><span class="na">col</span><span class="p">(</span><span class="s">&quot;customer_details.name&quot;</span><span class="p">).</span><span class="na">matches</span><span class="p">(</span><span class="s">&quot;[A-Z][a-z]+ [A-Z][a-z]+&quot;</span><span class="p">).</span><span class="na">errorThreshold</span><span class="p">(</span><span class="mf">0.1</span><span class="p">).</span><span class="na">description</span><span class="p">(</span><span class="s">&quot;Names generally follow the same pattern&quot;</span><span class="p">),</span>
+<span class="w">                </span><span class="n">validation</span><span class="p">().</span><span class="na">col</span><span class="p">(</span><span class="s">&quot;date&quot;</span><span class="p">).</span><span class="na">isNotNull</span><span class="p">().</span><span class="na">errorThreshold</span><span class="p">(</span><span class="mi">10</span><span class="p">),</span>
+<span class="w">                </span><span class="n">validation</span><span class="p">().</span><span class="na">col</span><span class="p">(</span><span class="s">&quot;balance&quot;</span><span class="p">).</span><span class="na">greaterThan</span><span class="p">(</span><span class="mi">500</span><span class="p">),</span>
+<span class="w">                </span><span class="n">validation</span><span class="p">().</span><span class="na">expr</span><span class="p">(</span><span class="s">&quot;YEAR(date) == year&quot;</span><span class="p">),</span>
+<span class="w">                </span><span class="n">validation</span><span class="p">().</span><span class="na">col</span><span class="p">(</span><span class="s">&quot;status&quot;</span><span class="p">).</span><span class="na">in</span><span class="p">(</span><span class="s">&quot;open&quot;</span><span class="p">,</span><span class="w"> </span><span class="s">&quot;closed&quot;</span><span class="p">,</span><span class="w"> </span><span class="s">&quot;pending&quot;</span><span class="p">).</span><span class="na">errorThreshold</span><span class="p">(</span><span class="mf">0.2</span><span class="p">).</span><span class="na">description</span><span class="p">(</span><span class="s">&quot;Could be new status introduced&quot;</span><span class="p">),</span>
+<span class="w">                </span><span class="n">validation</span><span class="p">().</span><span class="na">col</span><span class="p">(</span><span class="s">&quot;customer_details.age&quot;</span><span class="p">).</span><span class="na">greaterThan</span><span class="p">(</span><span class="mi">18</span><span class="p">),</span>
+<span class="w">                </span><span class="n">validation</span><span class="p">().</span><span class="na">expr</span><span class="p">(</span><span class="s">&quot;FORALL(update_history, x -&gt; x.updated_time &gt; TIMESTAMP(&#39;2022-01-01 00:00:00&#39;))&quot;</span><span class="p">),</span>
+<span class="w">                </span><span class="n">validation</span><span class="p">().</span><span class="na">col</span><span class="p">(</span><span class="s">&quot;update_history&quot;</span><span class="p">).</span><span class="na">greaterThanSize</span><span class="p">(</span><span class="mi">2</span><span class="p">),</span>
+<span class="w">                </span><span class="n">validation</span><span class="p">().</span><span class="na">unique</span><span class="p">(</span><span class="s">&quot;account_id&quot;</span><span class="p">),</span>
+<span class="w">                </span><span class="n">validation</span><span class="p">().</span><span class="na">groupBy</span><span class="p">().</span><span class="na">count</span><span class="p">().</span><span class="na">isEqual</span><span class="p">(</span><span class="mi">1000</span><span class="p">),</span>
+<span class="w">                </span><span class="n">validation</span><span class="p">().</span><span class="na">groupBy</span><span class="p">(</span><span class="s">&quot;account_id&quot;</span><span class="p">).</span><span class="na">max</span><span class="p">(</span><span class="s">&quot;balance&quot;</span><span class="p">).</span><span class="na">lessThan</span><span class="p">(</span><span class="mi">900</span><span class="p">)</span>
+<span class="w">        </span><span class="p">);</span>
+
+<span class="kd">var</span><span class="w"> </span><span class="n">config</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">configuration</span><span class="p">()</span>
+<span class="w">        </span><span class="p">.</span><span class="na">generatedReportsFolderPath</span><span class="p">(</span><span class="s">&quot;/opt/app/data/report&quot;</span><span class="p">)</span>
+<span class="w">        </span><span class="p">.</span><span class="na">enableValidation</span><span class="p">(</span><span class="kc">true</span><span class="p">)</span>
+<span class="w">        </span><span class="p">.</span><span class="na">enableGenerateData</span><span class="p">(</span><span class="kc">false</span><span class="p">);</span>
+
+<span class="n">execute</span><span class="p">(</span><span class="n">config</span><span class="p">,</span><span class="w"> </span><span class="n">jsonTask</span><span class="p">);</span>
+</code></pre></div>
+</div>
+<div class="tabbed-block">
+<div class="highlight"><pre><span></span><code><span class="kd">val</span><span class="w"> </span><span class="n">jsonTask</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">json</span><span class="p">(</span><span class="s">&quot;my_json&quot;</span><span class="p">,</span><span class="w"> </span><span class="s">&quot;/opt/app/data/json&quot;</span><span class="p">)</span>
+<span class="w">  </span><span class="p">.</span><span class="n">validations</span><span class="p">(</span>
+<span class="w">    </span><span class="n">validation</span><span class="p">.</span><span class="n">col</span><span class="p">(</span><span class="s">&quot;customer_details.name&quot;</span><span class="p">).</span><span class="n">matches</span><span class="p">(</span><span class="s">&quot;[A-Z][a-z]+ [A-Z][a-z]+&quot;</span><span class="p">).</span><span class="n">errorThreshold</span><span class="p">(</span><span class="mf">0.1</span><span class="p">).</span><span class="n">description</span><span class="p">(</span><span class="s">&quot;Names generally follow the same pattern&quot;</span><span class="p">),</span>
+<span class="w">    </span><span class="n">validation</span><span class="p">.</span><span class="n">col</span><span class="p">(</span><span class="s">&quot;date&quot;</span><span class="p">).</span><span class="n">isNotNull</span><span class="p">.</span><span class="n">errorThreshold</span><span class="p">(</span><span class="mi">10</span><span class="p">),</span>
+<span class="w">    </span><span class="n">validation</span><span class="p">.</span><span class="n">col</span><span class="p">(</span><span class="s">&quot;balance&quot;</span><span class="p">).</span><span class="n">greaterThan</span><span class="p">(</span><span class="mi">500</span><span class="p">),</span>
+<span class="w">    </span><span class="n">validation</span><span class="p">.</span><span class="n">expr</span><span class="p">(</span><span class="s">&quot;YEAR(date) == year&quot;</span><span class="p">),</span>
+<span class="w">    </span><span class="n">validation</span><span class="p">.</span><span class="n">col</span><span class="p">(</span><span class="s">&quot;status&quot;</span><span class="p">).</span><span class="n">in</span><span class="p">(</span><span class="s">&quot;open&quot;</span><span class="p">,</span><span class="w"> </span><span class="s">&quot;closed&quot;</span><span class="p">,</span><span class="w"> </span><span class="s">&quot;pending&quot;</span><span class="p">).</span><span class="n">errorThreshold</span><span class="p">(</span><span class="mf">0.2</span><span class="p">).</span><span class="n">description</span><span class="p">(</span><span class="s">&quot;Could be new status introduced&quot;</span><span class="p">),</span>
+<span class="w">    </span><span class="n">validation</span><span class="p">.</span><span class="n">col</span><span class="p">(</span><span class="s">&quot;customer_details.age&quot;</span><span class="p">).</span><span class="n">greaterThan</span><span class="p">(</span><span class="mi">18</span><span class="p">),</span>
+<span class="w">    </span><span class="n">validation</span><span class="p">.</span><span class="n">expr</span><span class="p">(</span><span class="s">&quot;FORALL(update_history, x -&gt; x.updated_time &gt; TIMESTAMP(&#39;2022-01-01 00:00:00&#39;))&quot;</span><span class="p">),</span>
+<span class="w">    </span><span class="n">validation</span><span class="p">.</span><span class="n">col</span><span class="p">(</span><span class="s">&quot;update_history&quot;</span><span class="p">).</span><span class="n">greaterThanSize</span><span class="p">(</span><span class="mi">2</span><span class="p">),</span>
+<span class="w">    </span><span class="n">validation</span><span class="p">.</span><span class="n">unique</span><span class="p">(</span><span class="s">&quot;account_id&quot;</span><span class="p">),</span>
+<span class="w">    </span><span class="n">validation</span><span class="p">.</span><span class="n">groupBy</span><span class="p">().</span><span class="n">count</span><span class="p">().</span><span class="n">isEqual</span><span class="p">(</span><span class="mi">1000</span><span class="p">),</span>
+<span class="w">    </span><span class="n">validation</span><span class="p">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s">&quot;account_id&quot;</span><span class="p">).</span><span class="n">max</span><span class="p">(</span><span class="s">&quot;balance&quot;</span><span class="p">).</span><span class="n">lessThan</span><span class="p">(</span><span class="mi">900</span><span class="p">)</span>
+<span class="w">  </span><span class="p">)</span>
+
+<span class="kd">val</span><span class="w"> </span><span class="n">config</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">configuration</span>
+<span class="w">  </span><span class="p">.</span><span class="n">generatedReportsFolderPath</span><span class="p">(</span><span class="s">&quot;/opt/app/data/report&quot;</span><span class="p">)</span>
+<span class="w">  </span><span class="p">.</span><span class="n">enableValidation</span><span class="p">(</span><span class="kc">true</span><span class="p">)</span>
+<span class="w">  </span><span class="p">.</span><span class="n">enableGenerateData</span><span class="p">(</span><span class="kc">false</span><span class="p">)</span>
+
+<span class="n">execute</span><span class="p">(</span><span class="n">config</span><span class="p">,</span><span class="w"> </span><span class="n">jsonTask</span><span class="p">)</span>
+</code></pre></div>
+</div>
+</div>
+</div>
+<h3 id="run">Run</h3>
+<p>Let's try running it.</p>
+<div class="highlight"><pre><span></span><code>./run.sh
+<span class="c1">#input class MyValidationJavaPlan or MyValidationPlan</span>
+<span class="c1">#after completing, check report at docker/sample/report/index.html</span>
+</code></pre></div>
+<p>It should look something like this.</p>
+<video src="https://user-images.githubusercontent.com/26299147/283040918-5de0c992-cddf-4ab1-a501-273ceef0cb30.mov" data-canonical-src="https://user-images.githubusercontent.com/26299147/283040918-5de0c992-cddf-4ab1-a501-273ceef0cb30.mov" controls="controls" muted="muted" style="max-height:640px; min-height: 200px"></video>
+
+<p>Check the full example at <code>ValidationPlanRun</code> inside the examples repo.</p>
+
+  
+  
+
+
+
+  
+
+
+
+
+                
+              </article>
+            </div>
+          
+          
+  <script>var tabs=__md_get("__tabs");if(Array.isArray(tabs))e:for(var set of document.querySelectorAll(".tabbed-set")){var tab,labels=set.querySelector(".tabbed-labels");for(tab of tabs)for(var label of labels.getElementsByTagName("label"))if(label.innerText.trim()===tab){var input=document.getElementById(label.htmlFor);input.checked=!0;continue e}}</script>
+
+<script>var target=document.getElementById(location.hash.slice(1));target&&target.name&&(target.checked=target.name.startsWith("__tabbed_"))</script>
+        </div>
+        
+          <button type="button" class="md-top md-icon" data-md-component="top" hidden>
+  
+  <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="M13 20h-2V8l-5.5 5.5-1.42-1.42L12 4.16l7.92 7.92-1.42 1.42L13 8v12Z"/></svg>
+  Back to top
+</button>
+        
+      </main>
+      
+        <footer class="md-footer">
+  
+    
+      
+      <nav class="md-footer__inner md-grid" aria-label="Footer" >
+        
+          
+          <a href="../batch-and-event/" class="md-footer__link md-footer__link--prev" aria-label="Previous: Foreign Keys Across Data Sources">
+            <div class="md-footer__button md-icon">
+              
+              <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="M20 11v2H8l5.5 5.5-1.42 1.42L4.16 12l7.92-7.92L13.5 5.5 8 11h12Z"/></svg>
+            </div>
+            <div class="md-footer__title">
+              <span class="md-footer__direction">
+                Previous
+              </span>
+              <div class="md-ellipsis">
+                Foreign Keys Across Data Sources
+              </div>
+            </div>
+          </a>
+        
+        
+          
+          <a href="../auto-generate-connection/" class="md-footer__link md-footer__link--next" aria-label="Next: Auto Generate From Data Connection">
+            <div class="md-footer__title">
+              <span class="md-footer__direction">
+                Next
+              </span>
+              <div class="md-ellipsis">
+                Auto Generate From Data Connection
+              </div>
+            </div>
+            <div class="md-footer__button md-icon">
+              
+              <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24"><path d="M4 11v2h12l-5.5 5.5 1.42 1.42L19.84 12l-7.92-7.92L10.5 5.5 16 11H4Z"/></svg>
+            </div>
+          </a>
+        
+      </nav>
+    
+  
+  <div class="md-footer-meta md-typeset">
+    <div class="md-footer-meta__inner md-grid">
+      <div class="md-copyright">
+  
+  
+    Made with
+    <a href="https://squidfunk.github.io/mkdocs-material/" target="_blank" rel="noopener">
+      Material for MkDocs Insiders
+    </a>
+  
+</div>
+      
+        <div class="md-social">
+  
+    
+    
+    
+    
+      
+      
+    
+    <a href="https://join.slack.com/t/data-catering/shared_invite/zt-2664ylbpi-w3n7lWAO~PHeOG9Ujpm~~w" target="_blank" rel="noopener" title="join.slack.com" class="md-social__link">
+      <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 448 512"><!--! Font Awesome Free 6.4.2 by @fontawesome - https://fontawesome.com License - https://fontawesome.com/license/free (Icons: CC BY 4.0, Fonts: SIL OFL 1.1, Code: MIT License) Copyright 2023 Fonticons, Inc.--><path d="M94.12 315.1c0 25.9-21.16 47.06-47.06 47.06S0 341 0 315.1c0-25.9 21.16-47.06 47.06-47.06h47.06v47.06zm23.72 0c0-25.9 21.16-47.06 47.06-47.06s47.06 21.16 47.06 47.06v117.84c0 25.9-21.16 47.06-47.06 47.06s-47.06-21.16-47.06-47.06V315.1zm47.06-188.98c-25.9 0-47.06-21.16-47.06-47.06S139 32 164.9 32s47.06 21.16 47.06 47.06v47.06H164.9zm0 23.72c25.9 0 47.06 21.16 47.06 47.06s-21.16 47.06-47.06 47.06H47.06C21.16 243.96 0 222.8 0 196.9s21.16-47.06 47.06-47.06H164.9zm188.98 47.06c0-25.9 21.16-47.06 47.06-47.06 25.9 0 47.06 21.16 47.06 47.06s-21.16 47.06-47.06 47.06h-47.06V196.9zm-23.72 0c0 25.9-21.16 47.06-47.06 47.06-25.9 0-47.06-21.16-47.06-47.06V79.06c0-25.9 21.16-47.06 47.06-47.06 25.9 0 47.06 21.16 47.06 47.06V196.9zM283.1 385.88c25.9 0 47.06 21.16 47.06 47.06 0 25.9-21.16 47.06-47.06 47.06-25.9 0-47.06-21.16-47.06-47.06v-47.06h47.06zm0-23.72c-25.9 0-47.06-21.16-47.06-47.06 0-25.9 21.16-47.06 47.06-47.06h117.84c25.9 0 47.06 21.16 47.06 47.06 0 25.9-21.16 47.06-47.06 47.06H283.1z"/></svg>
+    </a>
+  
+    
+    
+    
+    
+      
+      
+    
+    <a href="https://github.com/pflooky/data-caterer-example" target="_blank" rel="noopener" title="github.com" class="md-social__link">
+      <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 496 512"><!--! Font Awesome Free 6.4.2 by @fontawesome - https://fontawesome.com License - https://fontawesome.com/license/free (Icons: CC BY 4.0, Fonts: SIL OFL 1.1, Code: MIT License) Copyright 2023 Fonticons, Inc.--><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg>
+    </a>
+  
+    
+    
+    
+    
+      
+      
+    
+    <a href="https://medium.com/@pflooky" target="_blank" rel="noopener" title="medium.com" class="md-social__link">
+      <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 640 512"><!--! Font Awesome Free 6.4.2 by @fontawesome - https://fontawesome.com License - https://fontawesome.com/license/free (Icons: CC BY 4.0, Fonts: SIL OFL 1.1, Code: MIT License) Copyright 2023 Fonticons, Inc.--><path d="M180.5 74.262C80.813 74.262 0 155.633 0 256s80.819 181.738 180.5 181.738S361 356.373 361 256 280.191 74.262 180.5 74.262Zm288.25 10.646c-49.845 0-90.245 76.619-90.245 171.095s40.406 171.1 90.251 171.1 90.251-76.619 90.251-171.1H559c0-94.503-40.4-171.095-90.248-171.095Zm139.506 17.821c-17.526 0-31.735 68.628-31.735 153.274s14.2 153.274 31.735 153.274S640 340.631 640 256c0-84.649-14.215-153.271-31.742-153.271Z"/></svg>
+    </a>
+  
+    
+    
+    
+    
+      
+      
+    
+    <a href="https://www.linkedin.com/in/peter-flook/" target="_blank" rel="noopener" title="www.linkedin.com" class="md-social__link">
+      <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 448 512"><!--! Font Awesome Free 6.4.2 by @fontawesome - https://fontawesome.com License - https://fontawesome.com/license/free (Icons: CC BY 4.0, Fonts: SIL OFL 1.1, Code: MIT License) Copyright 2023 Fonticons, Inc.--><path d="M416 32H31.9C14.3 32 0 46.5 0 64.3v383.4C0 465.5 14.3 480 31.9 480H416c17.6 0 32-14.5 32-32.3V64.3c0-17.8-14.4-32.3-32-32.3zM135.4 416H69V202.2h66.5V416zm-33.2-243c-21.3 0-38.5-17.3-38.5-38.5S80.9 96 102.2 96c21.2 0 38.5 17.3 38.5 38.5 0 21.3-17.2 38.5-38.5 38.5zm282.1 243h-66.4V312c0-24.8-.5-56.7-34.5-56.7-34.6 0-39.9 27-39.9 54.9V416h-66.4V202.2h63.7v29.2h.9c8.9-16.8 30.6-34.5 62.9-34.5 67.2 0 79.7 44.3 79.7 101.9V416z"/></svg>
+    </a>
+  
+    
+    
+    
+    
+    <a href="mailto:peter.flook@data.catering" target="_blank" rel="noopener" title="" class="md-social__link">
+      <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 512 512"><!--! Font Awesome Free 6.4.2 by @fontawesome - https://fontawesome.com License - https://fontawesome.com/license/free (Icons: CC BY 4.0, Fonts: SIL OFL 1.1, Code: MIT License) Copyright 2023 Fonticons, Inc.--><path d="M498.1 5.6c10.1 7 15.4 19.1 13.5 31.2l-64 416c-1.5 9.7-7.4 18.2-16 23s-18.9 5.4-28 1.6L284 427.7l-68.5 74.1c-8.9 9.7-22.9 12.9-35.2 8.1S160 493.2 160 480v-83.6c0-4 1.5-7.8 4.2-10.7l167.6-182.9c5.8-6.3 5.6-16-.4-22s-15.7-6.4-22-.7L106 360.8l-88.3-44.2C7.1 311.3.3 300.7 0 288.9s5.9-22.8 16.1-28.7l448-256c10.7-6.1 23.9-5.5 34 1.4z"/></svg>
+    </a>
+  
+</div>
+      
+    </div>
+  </div>
+</footer>
+      
+    </div>
+    <div class="md-dialog" data-md-component="dialog">
+      <div class="md-dialog__inner md-typeset"></div>
+    </div>
+    
+    
+      <div class="md-consent" data-md-component="consent" id="__consent" hidden>
+        <div class="md-consent__overlay"></div>
+        <aside class="md-consent__inner">
+          <form class="md-consent__form md-grid md-typeset" name="consent">
+            
+
+  
+    
+  
+
+
+  
+    
+  
+
+
+
+  
+
+
+<h4>Cookie consent</h4>
+<p>We use cookies to recognize your repeated visits and preferences, as well as to measure the effectiveness of our documentation and whether users find what they're searching for. With your consent, you're helping us to make our documentation better.</p>
+<input class="md-toggle" type="checkbox" id="__settings" >
+<div class="md-consent__settings">
+  <ul class="task-list">
+    
+      
+      
+        
+        
+      
+      <li class="task-list-item">
+        <label class="task-list-control">
+          <input type="checkbox" name="analytics" checked>
+          <span class="task-list-indicator"></span>
+          Google Analytics
+        </label>
+      </li>
+    
+      
+      
+        
+        
+      
+      <li class="task-list-item">
+        <label class="task-list-control">
+          <input type="checkbox" name="github" checked>
+          <span class="task-list-indicator"></span>
+          GitHub
+        </label>
+      </li>
+    
+  </ul>
+</div>
+<div class="md-consent__controls">
+  
+    
+      <button class="md-button md-button--primary">Accept</button>
+    
+    
+    
+  
+    
+    
+    
+      <label class="md-button" for="__settings">Manage settings</label>
+    
+  
+</div>
+          </form>
+        </aside>
+      </div>
+      <script>var consent=__md_get("__consent");if(consent)for(var input of document.forms.consent.elements)input.name&&(input.checked=consent[input.name]||!1);else"file:"!==location.protocol&&setTimeout(function(){document.querySelector("[data-md-component=consent]").hidden=!1},250);var action,form=document.forms.consent;for(action of["submit","reset"])form.addEventListener(action,function(e){if(e.preventDefault(),"reset"===e.type)for(var n of document.forms.consent.elements)n.name&&(n.checked=!1);__md_set("__consent",Object.fromEntries(Array.from(new FormData(form).keys()).map(function(e){return[e,!0]}))),location.hash="",location.reload()})</script>
+    
+    <script id="__config" type="application/json">{"base": "../../../..", "features": ["navigation.tabs", "navigation.sections", "navigation.instant", "navigation.tracking", "navigation.top", "navigation.footer", "content.code.copy", "content.code.select", "content.tabs.link", "toc.integrate", "toc.follow"], "search": "../../../../assets/javascripts/workers/search.f2da59ea.min.js", "translations": {"clipboard.copied": "Copied to clipboard", "clipboard.copy": "Copy to clipboard", "search.result.more.one": "1 more on this page", "search.result.more.other": "# more on this page", "search.result.none": "No matching documents", "search.result.one": "1 matching document", "search.result.other": "# matching documents", "search.result.placeholder": "Type to start searching", "search.result.term.missing": "Missing", "select.version": "Select version"}}</script>
+    
+    
+      <script src="../../../../assets/javascripts/bundle.ef51cdc3.min.js"></script>
+      
+        <script src="../../../../js/open_in_new_tab.js"></script>
+      
+    
+  </body>
+</html>
\ No newline at end of file
diff --git a/site/setup/guide/scenario/delete-generated-data/index.html b/site/setup/guide/scenario/delete-generated-data/index.html
index f2642eba..3db53b42 100644
--- a/site/setup/guide/scenario/delete-generated-data/index.html
+++ b/site/setup/guide/scenario/delete-generated-data/index.html
@@ -678,6 +678,58 @@
   
   
   
+    <li class="md-nav__item">
+      <a href="../batch-and-event/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Foreign Keys Across Data Sources
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../data-validation/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Data Validations
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
     <li class="md-nav__item">
       <a href="../auto-generate-connection/" class="md-nav__link">
         
diff --git a/site/setup/guide/scenario/first-data-generation/index.html b/site/setup/guide/scenario/first-data-generation/index.html
index f23a60ba..7218ca33 100644
--- a/site/setup/guide/scenario/first-data-generation/index.html
+++ b/site/setup/guide/scenario/first-data-generation/index.html
@@ -678,6 +678,58 @@
   
   
   
+    <li class="md-nav__item">
+      <a href="../batch-and-event/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Foreign Keys Across Data Sources
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../data-validation/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Data Validations
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
     <li class="md-nav__item">
       <a href="../auto-generate-connection/" class="md-nav__link">
         
diff --git a/site/setup/guide/scenario/records-per-column/index.html b/site/setup/guide/scenario/records-per-column/index.html
index e01b071c..3e5abd8a 100644
--- a/site/setup/guide/scenario/records-per-column/index.html
+++ b/site/setup/guide/scenario/records-per-column/index.html
@@ -14,7 +14,7 @@
         <link rel="prev" href="../first-data-generation/">
       
       
-        <link rel="next" href="../auto-generate-connection/">
+        <link rel="next" href="../batch-and-event/">
       
       
         
@@ -813,6 +813,58 @@
   
   
   
+    <li class="md-nav__item">
+      <a href="../batch-and-event/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Foreign Keys Across Data Sources
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../data-validation/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Data Validations
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
     <li class="md-nav__item">
       <a href="../auto-generate-connection/" class="md-nav__link">
         
@@ -1919,7 +1971,7 @@ <h3 id="plan-setup">Plan Setup</h3>
 <span class="w">                        </span><span class="n">field</span><span class="p">().</span><span class="na">name</span><span class="p">(</span><span class="s">&quot;amount&quot;</span><span class="p">).</span><span class="na">type</span><span class="p">(</span><span class="n">DoubleType</span><span class="p">.</span><span class="na">instance</span><span class="p">()).</span><span class="na">min</span><span class="p">(</span><span class="mi">1</span><span class="p">).</span><span class="na">max</span><span class="p">(</span><span class="mi">100</span><span class="p">),</span>
 <span class="w">                        </span><span class="n">field</span><span class="p">().</span><span class="na">name</span><span class="p">(</span><span class="s">&quot;time&quot;</span><span class="p">).</span><span class="na">type</span><span class="p">(</span><span class="n">TimestampType</span><span class="p">.</span><span class="na">instance</span><span class="p">()).</span><span class="na">min</span><span class="p">(</span><span class="n">java</span><span class="p">.</span><span class="na">sql</span><span class="p">.</span><span class="na">Date</span><span class="p">.</span><span class="na">valueOf</span><span class="p">(</span><span class="s">&quot;2022-01-01&quot;</span><span class="p">)),</span>
 <span class="w">                        </span><span class="n">field</span><span class="p">().</span><span class="na">name</span><span class="p">(</span><span class="s">&quot;date&quot;</span><span class="p">).</span><span class="na">type</span><span class="p">(</span><span class="n">DateType</span><span class="p">.</span><span class="na">instance</span><span class="p">()).</span><span class="na">sql</span><span class="p">(</span><span class="s">&quot;DATE(time)&quot;</span><span class="p">)</span>
-<span class="w">                </span><span class="p">)</span>
+<span class="w">                </span><span class="p">);</span>
 
 <span class="w">        </span><span class="kd">var</span><span class="w"> </span><span class="n">config</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">configuration</span><span class="p">()</span>
 <span class="w">                </span><span class="p">.</span><span class="na">generatedReportsFolderPath</span><span class="p">(</span><span class="s">&quot;/opt/app/data/report&quot;</span><span class="p">)</span>
@@ -1936,7 +1988,7 @@ <h3 id="plan-setup">Plan Setup</h3>
 
 <span class="k">class</span><span class="w"> </span><span class="nc">MyMultipleRecordsPerColPlan</span><span class="w"> </span><span class="k">extends</span><span class="w"> </span><span class="nc">PlanRun</span><span class="w"> </span><span class="p">{</span>
 
-<span class="w">  </span><span class="kd">val</span><span class="w"> </span><span class="n">transactionTask</span><span class="p">:</span><span class="w"> </span><span class="nc">ConnectionTaskBuilder</span><span class="p">[</span><span class="nc">FileBuilder</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">csv</span><span class="p">(</span><span class="s">&quot;customer_transactions&quot;</span><span class="p">,</span><span class="w"> </span><span class="s">&quot;/opt/app/data/customer/transaction&quot;</span><span class="p">,</span><span class="w"> </span><span class="nc">Map</span><span class="p">(</span><span class="s">&quot;header&quot;</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="s">&quot;true&quot;</span><span class="p">))</span>
+<span class="w">  </span><span class="kd">val</span><span class="w"> </span><span class="n">transactionTask</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">csv</span><span class="p">(</span><span class="s">&quot;customer_transactions&quot;</span><span class="p">,</span><span class="w"> </span><span class="s">&quot;/opt/app/data/customer/transaction&quot;</span><span class="p">,</span><span class="w"> </span><span class="nc">Map</span><span class="p">(</span><span class="s">&quot;header&quot;</span><span class="w"> </span><span class="o">-&gt;</span><span class="w"> </span><span class="s">&quot;true&quot;</span><span class="p">))</span>
 <span class="w">    </span><span class="p">.</span><span class="n">schema</span><span class="p">(</span>
 <span class="w">      </span><span class="n">field</span><span class="p">.</span><span class="n">name</span><span class="p">(</span><span class="s">&quot;account_id&quot;</span><span class="p">).</span><span class="n">regex</span><span class="p">(</span><span class="s">&quot;ACC[0-9]{8}&quot;</span><span class="p">),</span><span class="w"> </span>
 <span class="w">      </span><span class="n">field</span><span class="p">.</span><span class="n">name</span><span class="p">(</span><span class="s">&quot;full_name&quot;</span><span class="p">).</span><span class="n">expression</span><span class="p">(</span><span class="s">&quot;#{Name.name}&quot;</span><span class="p">),</span><span class="w"> </span>
@@ -2076,13 +2128,13 @@ <h3 id="run">Run</h3>
         
         
           
-          <a href="../auto-generate-connection/" class="md-footer__link md-footer__link--next" aria-label="Next: Auto Generate From Data Connection">
+          <a href="../batch-and-event/" class="md-footer__link md-footer__link--next" aria-label="Next: Foreign Keys Across Data Sources">
             <div class="md-footer__title">
               <span class="md-footer__direction">
                 Next
               </span>
               <div class="md-ellipsis">
-                Auto Generate From Data Connection
+                Foreign Keys Across Data Sources
               </div>
             </div>
             <div class="md-footer__button md-icon">
diff --git a/site/setup/index.html b/site/setup/index.html
index d6f508ce..049ad36c 100644
--- a/site/setup/index.html
+++ b/site/setup/index.html
@@ -753,6 +753,58 @@
   
   
   
+    <li class="md-nav__item">
+      <a href="guide/scenario/batch-and-event/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Foreign Keys Across Data Sources
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="guide/scenario/data-validation/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Data Validations
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
     <li class="md-nav__item">
       <a href="guide/scenario/auto-generate-connection/" class="md-nav__link">
         
diff --git a/site/setup/validation/basic-validation/index.html b/site/setup/validation/basic-validation/index.html
index 10859a72..28ec3259 100644
--- a/site/setup/validation/basic-validation/index.html
+++ b/site/setup/validation/basic-validation/index.html
@@ -674,6 +674,58 @@
   
   
   
+    <li class="md-nav__item">
+      <a href="../../guide/scenario/batch-and-event/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Foreign Keys Across Data Sources
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../guide/scenario/data-validation/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Data Validations
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
     <li class="md-nav__item">
       <a href="../../guide/scenario/auto-generate-connection/" class="md-nav__link">
         
diff --git a/site/setup/validation/group-by-validation/index.html b/site/setup/validation/group-by-validation/index.html
index c08d525b..f86debb2 100644
--- a/site/setup/validation/group-by-validation/index.html
+++ b/site/setup/validation/group-by-validation/index.html
@@ -674,6 +674,58 @@
   
   
   
+    <li class="md-nav__item">
+      <a href="../../guide/scenario/batch-and-event/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Foreign Keys Across Data Sources
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../guide/scenario/data-validation/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Data Validations
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
     <li class="md-nav__item">
       <a href="../../guide/scenario/auto-generate-connection/" class="md-nav__link">
         
@@ -1352,6 +1404,28 @@
     </label>
     <ul class="md-nav__list" data-md-component="toc" data-md-scrollfix>
       
+        <li class="md-nav__item">
+  <a href="#record-count" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Record count
+      
+    </span>
+  </a>
+  
+</li>
+      
+        <li class="md-nav__item">
+  <a href="#record-count-per-group" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Record count per group
+      
+    </span>
+  </a>
+  
+</li>
+      
         <li class="md-nav__item">
   <a href="#sum" class="md-nav__link">
     <span class="md-ellipsis">
@@ -1405,6 +1479,17 @@
     </span>
   </a>
   
+</li>
+      
+        <li class="md-nav__item">
+  <a href="#standard-deviation" class="md-nav__link">
+    <span class="md-ellipsis">
+      
+        Standard deviation
+      
+    </span>
+  </a>
+  
 </li>
       
     </ul>
@@ -1849,12 +1934,40 @@
 
 
 <h1 id="group-by-validation">Group By Validation</h1>
-<p>If you want to run aggregations based on a particular set of columns, you can do so via group by validations. An example
-would be checking that the sum of <code>amount</code> is less than 1000 per <code>account_id, year</code>. The validations applied can
-be one of the validations from above.</p>
+<p>To run aggregations over a particular set of columns, or over the whole dataset, use group by validations.
+For example, you can check that the sum of <code>amount</code> is less than 1000 per <code>account_id, year</code>. The
+validation applied to the aggregated result can be any one from the <a href="../basic-validation/">basic validation set found here</a>.</p>
+<h2 id="record-count">Record count</h2>
+<p>Check the number of records across the whole dataset.</p>
+<div class="tabbed-set tabbed-alternate" data-tabs="1:2"><input checked="checked" id="__tabbed_1_1" name="__tabbed_1" type="radio" /><input id="__tabbed_1_2" name="__tabbed_1" type="radio" /><div class="tabbed-labels"><label for="__tabbed_1_1">Java</label><label for="__tabbed_1_2">Scala</label></div>
+<div class="tabbed-content">
+<div class="tabbed-block">
+<div class="highlight"><pre><span></span><code><span class="n">validation</span><span class="p">().</span><span class="na">groupBy</span><span class="p">().</span><span class="na">count</span><span class="p">().</span><span class="na">lessThan</span><span class="p">(</span><span class="mi">1000</span><span class="p">)</span>
+</code></pre></div>
+</div>
+<div class="tabbed-block">
+<div class="highlight"><pre><span></span><code><span class="n">validation</span><span class="p">.</span><span class="n">groupBy</span><span class="p">().</span><span class="n">count</span><span class="p">().</span><span class="n">lessThan</span><span class="p">(</span><span class="mi">1000</span><span class="p">)</span>
+</code></pre></div>
+</div>
+</div>
+</div>
+<h2 id="record-count-per-group">Record count per group</h2>
+<p>Check the number of records for each group.</p>
+<div class="tabbed-set tabbed-alternate" data-tabs="2:2"><input checked="checked" id="__tabbed_2_1" name="__tabbed_2" type="radio" /><input id="__tabbed_2_2" name="__tabbed_2" type="radio" /><div class="tabbed-labels"><label for="__tabbed_2_1">Java</label><label for="__tabbed_2_2">Scala</label></div>
+<div class="tabbed-content">
+<div class="tabbed-block">
+<div class="highlight"><pre><span></span><code><span class="n">validation</span><span class="p">().</span><span class="na">groupBy</span><span class="p">(</span><span class="s">&quot;account_id&quot;</span><span class="p">,</span><span class="w"> </span><span class="s">&quot;year&quot;</span><span class="p">).</span><span class="na">count</span><span class="p">().</span><span class="na">lessThan</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
+</code></pre></div>
+</div>
+<div class="tabbed-block">
+<div class="highlight"><pre><span></span><code><span class="n">validation</span><span class="p">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s">&quot;account_id&quot;</span><span class="p">,</span><span class="w"> </span><span class="s">&quot;year&quot;</span><span class="p">).</span><span class="n">count</span><span class="p">().</span><span class="n">lessThan</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
+</code></pre></div>
+</div>
+</div>
+</div>
 <h2 id="sum">Sum</h2>
 <p>Check the sum of a column's values for each group adheres to validation.</p>
-<div class="tabbed-set tabbed-alternate" data-tabs="1:2"><input checked="checked" id="__tabbed_1_1" name="__tabbed_1" type="radio" /><input id="__tabbed_1_2" name="__tabbed_1" type="radio" /><div class="tabbed-labels"><label for="__tabbed_1_1">Java</label><label for="__tabbed_1_2">Scala</label></div>
+<div class="tabbed-set tabbed-alternate" data-tabs="3:2"><input checked="checked" id="__tabbed_3_1" name="__tabbed_3" type="radio" /><input id="__tabbed_3_2" name="__tabbed_3" type="radio" /><div class="tabbed-labels"><label for="__tabbed_3_1">Java</label><label for="__tabbed_3_2">Scala</label></div>
 <div class="tabbed-content">
 <div class="tabbed-block">
 <div class="highlight"><pre><span></span><code><span class="n">validation</span><span class="p">().</span><span class="na">groupBy</span><span class="p">(</span><span class="s">&quot;account_id&quot;</span><span class="p">,</span><span class="w"> </span><span class="s">&quot;year&quot;</span><span class="p">).</span><span class="na">sum</span><span class="p">(</span><span class="s">&quot;amount&quot;</span><span class="p">).</span><span class="na">lessThan</span><span class="p">(</span><span class="mi">1000</span><span class="p">)</span>
@@ -1868,7 +1981,7 @@ <h2 id="sum">Sum</h2>
 </div>
 <h2 id="count">Count</h2>
 <p>Check the count for each group adheres to validation.</p>
-<div class="tabbed-set tabbed-alternate" data-tabs="2:2"><input checked="checked" id="__tabbed_2_1" name="__tabbed_2" type="radio" /><input id="__tabbed_2_2" name="__tabbed_2" type="radio" /><div class="tabbed-labels"><label for="__tabbed_2_1">Java</label><label for="__tabbed_2_2">Scala</label></div>
+<div class="tabbed-set tabbed-alternate" data-tabs="4:2"><input checked="checked" id="__tabbed_4_1" name="__tabbed_4" type="radio" /><input id="__tabbed_4_2" name="__tabbed_4" type="radio" /><div class="tabbed-labels"><label for="__tabbed_4_1">Java</label><label for="__tabbed_4_2">Scala</label></div>
 <div class="tabbed-content">
 <div class="tabbed-block">
 <div class="highlight"><pre><span></span><code><span class="n">validation</span><span class="p">().</span><span class="na">groupBy</span><span class="p">(</span><span class="s">&quot;account_id&quot;</span><span class="p">,</span><span class="w"> </span><span class="s">&quot;year&quot;</span><span class="p">).</span><span class="na">count</span><span class="p">(</span><span class="s">&quot;amount&quot;</span><span class="p">).</span><span class="na">lessThan</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
@@ -1882,7 +1995,7 @@ <h2 id="count">Count</h2>
 </div>
 <h2 id="min">Min</h2>
 <p>Check the min for each group adheres to validation.</p>
-<div class="tabbed-set tabbed-alternate" data-tabs="3:2"><input checked="checked" id="__tabbed_3_1" name="__tabbed_3" type="radio" /><input id="__tabbed_3_2" name="__tabbed_3" type="radio" /><div class="tabbed-labels"><label for="__tabbed_3_1">Java</label><label for="__tabbed_3_2">Scala</label></div>
+<div class="tabbed-set tabbed-alternate" data-tabs="5:2"><input checked="checked" id="__tabbed_5_1" name="__tabbed_5" type="radio" /><input id="__tabbed_5_2" name="__tabbed_5" type="radio" /><div class="tabbed-labels"><label for="__tabbed_5_1">Java</label><label for="__tabbed_5_2">Scala</label></div>
 <div class="tabbed-content">
 <div class="tabbed-block">
 <div class="highlight"><pre><span></span><code><span class="n">validation</span><span class="p">().</span><span class="na">groupBy</span><span class="p">(</span><span class="s">&quot;account_id&quot;</span><span class="p">,</span><span class="w"> </span><span class="s">&quot;year&quot;</span><span class="p">).</span><span class="na">min</span><span class="p">(</span><span class="s">&quot;amount&quot;</span><span class="p">).</span><span class="na">greaterThan</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
@@ -1896,7 +2009,7 @@ <h2 id="min">Min</h2>
 </div>
 <h2 id="max">Max</h2>
 <p>Check the max for each group adheres to validation.</p>
-<div class="tabbed-set tabbed-alternate" data-tabs="4:2"><input checked="checked" id="__tabbed_4_1" name="__tabbed_4" type="radio" /><input id="__tabbed_4_2" name="__tabbed_4" type="radio" /><div class="tabbed-labels"><label for="__tabbed_4_1">Java</label><label for="__tabbed_4_2">Scala</label></div>
+<div class="tabbed-set tabbed-alternate" data-tabs="6:2"><input checked="checked" id="__tabbed_6_1" name="__tabbed_6" type="radio" /><input id="__tabbed_6_2" name="__tabbed_6" type="radio" /><div class="tabbed-labels"><label for="__tabbed_6_1">Java</label><label for="__tabbed_6_2">Scala</label></div>
 <div class="tabbed-content">
 <div class="tabbed-block">
 <div class="highlight"><pre><span></span><code><span class="n">validation</span><span class="p">().</span><span class="na">groupBy</span><span class="p">(</span><span class="s">&quot;account_id&quot;</span><span class="p">,</span><span class="w"> </span><span class="s">&quot;year&quot;</span><span class="p">).</span><span class="na">max</span><span class="p">(</span><span class="s">&quot;amount&quot;</span><span class="p">).</span><span class="na">lessThanOrEqual</span><span class="p">(</span><span class="mi">100</span><span class="p">)</span>
@@ -1910,7 +2023,7 @@ <h2 id="max">Max</h2>
 </div>
 <h2 id="average">Average</h2>
 <p>Check the average for each group adheres to validation.</p>
-<div class="tabbed-set tabbed-alternate" data-tabs="5:2"><input checked="checked" id="__tabbed_5_1" name="__tabbed_5" type="radio" /><input id="__tabbed_5_2" name="__tabbed_5" type="radio" /><div class="tabbed-labels"><label for="__tabbed_5_1">Java</label><label for="__tabbed_5_2">Scala</label></div>
+<div class="tabbed-set tabbed-alternate" data-tabs="7:2"><input checked="checked" id="__tabbed_7_1" name="__tabbed_7" type="radio" /><input id="__tabbed_7_2" name="__tabbed_7" type="radio" /><div class="tabbed-labels"><label for="__tabbed_7_1">Java</label><label for="__tabbed_7_2">Scala</label></div>
 <div class="tabbed-content">
 <div class="tabbed-block">
 <div class="highlight"><pre><span></span><code><span class="n">validation</span><span class="p">().</span><span class="na">groupBy</span><span class="p">(</span><span class="s">&quot;account_id&quot;</span><span class="p">,</span><span class="w"> </span><span class="s">&quot;year&quot;</span><span class="p">).</span><span class="na">avg</span><span class="p">(</span><span class="s">&quot;amount&quot;</span><span class="p">).</span><span class="na">between</span><span class="p">(</span><span class="mi">40</span><span class="p">,</span><span class="w"> </span><span class="mi">60</span><span class="p">)</span>
@@ -1922,6 +2035,20 @@ <h2 id="average">Average</h2>
 </div>
 </div>
 </div>
+<h2 id="standard-deviation">Standard deviation</h2>
+<p>Check the standard deviation for each group adheres to validation.</p>
+<div class="tabbed-set tabbed-alternate" data-tabs="8:2"><input checked="checked" id="__tabbed_8_1" name="__tabbed_8" type="radio" /><input id="__tabbed_8_2" name="__tabbed_8" type="radio" /><div class="tabbed-labels"><label for="__tabbed_8_1">Java</label><label for="__tabbed_8_2">Scala</label></div>
+<div class="tabbed-content">
+<div class="tabbed-block">
+<div class="highlight"><pre><span></span><code><span class="n">validation</span><span class="p">().</span><span class="na">groupBy</span><span class="p">(</span><span class="s">&quot;account_id&quot;</span><span class="p">,</span><span class="w"> </span><span class="s">&quot;year&quot;</span><span class="p">).</span><span class="na">stddev</span><span class="p">(</span><span class="s">&quot;amount&quot;</span><span class="p">).</span><span class="na">between</span><span class="p">(</span><span class="mf">0.5</span><span class="p">,</span><span class="w"> </span><span class="mf">0.6</span><span class="p">)</span>
+</code></pre></div>
+</div>
+<div class="tabbed-block">
+<div class="highlight"><pre><span></span><code><span class="n">validation</span><span class="p">.</span><span class="n">groupBy</span><span class="p">(</span><span class="s">&quot;account_id&quot;</span><span class="p">,</span><span class="w"> </span><span class="s">&quot;year&quot;</span><span class="p">).</span><span class="n">stddev</span><span class="p">(</span><span class="s">&quot;amount&quot;</span><span class="p">).</span><span class="n">between</span><span class="p">(</span><span class="mf">0.5</span><span class="p">,</span><span class="w"> </span><span class="mf">0.6</span><span class="p">)</span>
+</code></pre></div>
+</div>
+</div>
+</div>
 
 
 
diff --git a/site/setup/validation/validation/index.html b/site/setup/validation/validation/index.html
index dc83004a..c418ca8b 100644
--- a/site/setup/validation/validation/index.html
+++ b/site/setup/validation/validation/index.html
@@ -674,6 +674,58 @@
   
   
   
+    <li class="md-nav__item">
+      <a href="../../guide/scenario/batch-and-event/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Foreign Keys Across Data Sources
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../guide/scenario/data-validation/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Data Validations
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
     <li class="md-nav__item">
       <a href="../../guide/scenario/auto-generate-connection/" class="md-nav__link">
         
diff --git a/site/sitemap.xml b/site/sitemap.xml
index 4e9c4054..4774fdea 100644
--- a/site/sitemap.xml
+++ b/site/sitemap.xml
@@ -2,172 +2,177 @@
 <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
     <url>
          <loc>https://data.catering/</loc>
-         <lastmod>2023-11-13</lastmod>
+         <lastmod>2023-11-15</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/about/</loc>
-         <lastmod>2023-11-13</lastmod>
+         <lastmod>2023-11-15</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/pricing/</loc>
-         <lastmod>2023-11-13</lastmod>
+         <lastmod>2023-11-15</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/get-started/docker/</loc>
-         <lastmod>2023-11-13</lastmod>
+         <lastmod>2023-11-15</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/legal/privacy-policy/</loc>
-         <lastmod>2023-11-13</lastmod>
+         <lastmod>2023-11-15</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/legal/terms-of-service/</loc>
-         <lastmod>2023-11-13</lastmod>
+         <lastmod>2023-11-15</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/</loc>
-         <lastmod>2023-11-13</lastmod>
+         <lastmod>2023-11-15</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/configuration/</loc>
-         <lastmod>2023-11-13</lastmod>
+         <lastmod>2023-11-15</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/advanced/advanced/</loc>
-         <lastmod>2023-11-13</lastmod>
+         <lastmod>2023-11-15</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/connection/connection/</loc>
-         <lastmod>2023-11-13</lastmod>
+         <lastmod>2023-11-15</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/deployment/deployment/</loc>
-         <lastmod>2023-11-13</lastmod>
+         <lastmod>2023-11-15</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/foreign-key/foreign-key/</loc>
-         <lastmod>2023-11-13</lastmod>
+         <lastmod>2023-11-15</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/generator/count/</loc>
-         <lastmod>2023-11-13</lastmod>
+         <lastmod>2023-11-15</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/generator/generator/</loc>
-         <lastmod>2023-11-13</lastmod>
+         <lastmod>2023-11-15</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/generator/report/</loc>
-         <lastmod>2023-11-13</lastmod>
+         <lastmod>2023-11-15</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/guide/</loc>
-         <lastmod>2023-11-13</lastmod>
+         <lastmod>2023-11-15</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/guide/data-source/cassandra/</loc>
-         <lastmod>2023-11-13</lastmod>
+         <lastmod>2023-11-15</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/guide/data-source/http/</loc>
-         <lastmod>2023-11-13</lastmod>
+         <lastmod>2023-11-15</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/guide/data-source/kafka/</loc>
-         <lastmod>2023-11-13</lastmod>
+         <lastmod>2023-11-15</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/guide/data-source/marquez-metadata-source/</loc>
-         <lastmod>2023-11-13</lastmod>
+         <lastmod>2023-11-15</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/guide/data-source/open-metadata-source/</loc>
-         <lastmod>2023-11-13</lastmod>
+         <lastmod>2023-11-15</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/guide/data-source/solace/</loc>
-         <lastmod>2023-11-13</lastmod>
+         <lastmod>2023-11-15</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/guide/scenario/auto-generate-connection/</loc>
-         <lastmod>2023-11-13</lastmod>
+         <lastmod>2023-11-15</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/guide/scenario/batch-and-event/</loc>
-         <lastmod>2023-11-13</lastmod>
+         <lastmod>2023-11-15</lastmod>
+         <changefreq>daily</changefreq>
+    </url>
+    <url>
+         <loc>https://data.catering/setup/guide/scenario/data-validation/</loc>
+         <lastmod>2023-11-15</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/guide/scenario/delete-generated-data/</loc>
-         <lastmod>2023-11-13</lastmod>
+         <lastmod>2023-11-15</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/guide/scenario/first-data-generation/</loc>
-         <lastmod>2023-11-13</lastmod>
+         <lastmod>2023-11-15</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/guide/scenario/records-per-column/</loc>
-         <lastmod>2023-11-13</lastmod>
+         <lastmod>2023-11-15</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/validation/basic-validation/</loc>
-         <lastmod>2023-11-13</lastmod>
+         <lastmod>2023-11-15</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/validation/group-by-validation/</loc>
-         <lastmod>2023-11-13</lastmod>
+         <lastmod>2023-11-15</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/setup/validation/validation/</loc>
-         <lastmod>2023-11-13</lastmod>
+         <lastmod>2023-11-15</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/use-case/business-value/</loc>
-         <lastmod>2023-11-13</lastmod>
+         <lastmod>2023-11-15</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/use-case/comparison/</loc>
-         <lastmod>2023-11-13</lastmod>
+         <lastmod>2023-11-15</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/use-case/roadmap/</loc>
-         <lastmod>2023-11-13</lastmod>
+         <lastmod>2023-11-15</lastmod>
          <changefreq>daily</changefreq>
     </url>
     <url>
          <loc>https://data.catering/use-case/use-case/</loc>
-         <lastmod>2023-11-13</lastmod>
+         <lastmod>2023-11-15</lastmod>
          <changefreq>daily</changefreq>
     </url>
 </urlset>
\ No newline at end of file
diff --git a/site/sitemap.xml.gz b/site/sitemap.xml.gz
index b8cef87d..f1f3546f 100644
Binary files a/site/sitemap.xml.gz and b/site/sitemap.xml.gz differ
diff --git a/site/use-case/business-value/index.html b/site/use-case/business-value/index.html
index f797d80a..4e517f6f 100644
--- a/site/use-case/business-value/index.html
+++ b/site/use-case/business-value/index.html
@@ -672,6 +672,58 @@
   
   
   
+    <li class="md-nav__item">
+      <a href="../../setup/guide/scenario/batch-and-event/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Foreign Keys Across Data Sources
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../setup/guide/scenario/data-validation/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Data Validations
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
     <li class="md-nav__item">
       <a href="../../setup/guide/scenario/auto-generate-connection/" class="md-nav__link">
         
diff --git a/site/use-case/comparison/index.html b/site/use-case/comparison/index.html
index e4051d9a..8e07f043 100644
--- a/site/use-case/comparison/index.html
+++ b/site/use-case/comparison/index.html
@@ -674,6 +674,58 @@
   
   
   
+    <li class="md-nav__item">
+      <a href="../../setup/guide/scenario/batch-and-event/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Foreign Keys Across Data Sources
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../setup/guide/scenario/data-validation/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Data Validations
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
     <li class="md-nav__item">
       <a href="../../setup/guide/scenario/auto-generate-connection/" class="md-nav__link">
         
diff --git a/site/use-case/roadmap/index.html b/site/use-case/roadmap/index.html
index 901172f0..03c12360 100644
--- a/site/use-case/roadmap/index.html
+++ b/site/use-case/roadmap/index.html
@@ -672,6 +672,58 @@
   
   
   
+    <li class="md-nav__item">
+      <a href="../../setup/guide/scenario/batch-and-event/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Foreign Keys Across Data Sources
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../setup/guide/scenario/data-validation/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Data Validations
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
     <li class="md-nav__item">
       <a href="../../setup/guide/scenario/auto-generate-connection/" class="md-nav__link">
         
@@ -1842,6 +1894,7 @@ <h1 id="roadmap">Roadmap</h1>
 <li>Ordering within data sources that support order for insertion</li>
 <li>Clean up data in consumer data sinks</li>
 <li><img alt="✅" class="twemoji" src="https://cdn.jsdelivr.net/gh/jdecked/twemoji@14.1.2/assets/svg/2705.svg" title=":white_check_mark:" /> Trial app to try out all features</li>
+<li>HTTP response data validation</li>
 </ul>
 
 
diff --git a/site/use-case/use-case/index.html b/site/use-case/use-case/index.html
index cc6da5a3..18b508e4 100644
--- a/site/use-case/use-case/index.html
+++ b/site/use-case/use-case/index.html
@@ -672,6 +672,58 @@
   
   
   
+    <li class="md-nav__item">
+      <a href="../../setup/guide/scenario/batch-and-event/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Foreign Keys Across Data Sources
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
+    <li class="md-nav__item">
+      <a href="../../setup/guide/scenario/data-validation/" class="md-nav__link">
+        
+  
+  <span class="md-ellipsis">
+    
+  
+    Data Validations
+  
+
+    
+  </span>
+  
+  
+
+      </a>
+    </li>
+  
+
+              
+            
+              
+                
+  
+  
+  
     <li class="md-nav__item">
       <a href="../../setup/guide/scenario/auto-generate-connection/" class="md-nav__link">