Skip to content

Using Great Assertions for production data monitoring

serialbandicoot edited this page Jan 10, 2022 · 5 revisions

Using GA it is possible to test the quality of data in higher environments, therefore it would be possible to send the results in order to monitor, analyse and display the results.

Background

The data, which is being generated by failures can also be used to support data quality. For example if we know that a certain assertion has failed by a particular range, then we can also provide feedback to how far that range was out. We can raise awareness about the magnitude of the quality and we can also display the results in various formats.

Data Monitoring can be split into two sections, saving and displaying the data.

Saving the data

The following list is an example of the types of storage, which are often found in Data Engineering environments. Depending on access, info-sec or other environmental constraints these are a list of options to consider for saving or storing the validation results.

  • Log Analytics (Azure)
  • File XML (Junit)
  • Storage e.g. Databricks

Displaying the results

For each of the options for storing the data, will need a separate solution. Therefore for this stage of the issue the easiest solution is to cater for a Database. Using a notebook in a Databricks, we can retrieve the assertion results data and use tools such as Pandas to display the results.

For Log Analytics its assumed that the engineer will be able to query the data at their discretion.

For File XML, its assumed that generic reporting tools, which already can use this file format can be re-used.

Save Coding

Currently to create a results as below, a result.to_barh() would generate a graph. Using the same approach, we can write the results object into a DataFrame and this DataFrame can be saved/merged.

The save method will take in an argument, which will allow the user to specify the type. e.g. result.save(type = databricks) It may be possible to put several. For this PR, it should be a single option.

class DisplayTest(GreatAssertions):
    def test_pass1(self):
        assert True is True

    def test_fail(self):
        assert "Hello" == "World"

suite = unittest.TestLoader().loadTestsFromTestCase(DisplayTest)
test_runner = unittest.runner.TextTestRunner(resultclass = GreatAssertionResult)
result = test_runner.run(suite)

Display Code

GA will provide a sample Notebook, which can be dropped into an existing project, with a small amount of configuration the default code should be able to render the results set. GA will also provide a series of analytical reports/charts etc, which can give instance out of the box information about the data.

Example Reporting