
Rename references to old github organisation #2

Merged (1 commit, Sep 11, 2024)
4 changes: 2 additions & 2 deletions README.md
@@ -1,14 +1,14 @@
# Data-Seedling

- This repository serves as an example Data Pipeline designed to work with [FlowEHR](https://github.com/UCLH-Foundry/FlowEHR).
+ This repository serves as an example Data Pipeline designed to work with [FlowEHR](https://github.com/SAFEHR-hdata/FlowEHR).

It showcases how to write, test, deploy, and run data pipelines in production, with a focus on long-term maintainability.

## FlowEHR

FlowEHR is a framework for creating secure data transformation pipelines in Azure. FlowEHR supports authoring multiple data pipelines through multiple repositories and automates the deployment of cloud resources to run these pipelines.

- If you don't yet know about FlowEHR, please read [the FlowEHR README](https://github.com/UCLH-Foundry/FlowEHR/blob/main/README.md).
+ If you don't yet know about FlowEHR, please read [the FlowEHR README](https://github.com/SAFEHR-hdata/FlowEHR/blob/main/README.md).

## Quick Start

4 changes: 2 additions & 2 deletions docs/quick_start.md
@@ -2,7 +2,7 @@

This repository is designed to work with an instance of FlowEHR. To work with Data Pipelines in FlowEHR, you will need to have both the FlowEHR repo and the Data Seedling repo opened in VSCode. This document describes how to set them up and test your changes in your personal development deployment.

- 1. Clone the [FlowEHR repo](https://github.com/UCLH-Foundry/FlowEHR) and deploy an instance of FlowEHR following the [steps outlined in the README](https://github.com/UCLH-Foundry/FlowEHR#getting-started).
+ 1. Clone the [FlowEHR repo](https://github.com/SAFEHR-hdata/FlowEHR) and deploy an instance of FlowEHR following the [steps outlined in the README](https://github.com/SAFEHR-hdata/FlowEHR#getting-started).
> Note that the resource group in which almost all resources will be created will have a name that looks like `rg-${flowehr_id}-dev`, e.g. `rg-myflwr-dev`.

2. Create a new repository using this template, as shown in the screenshot. Pick a name for your data pipeline repository. Do not clone the repository yet.
@@ -54,7 +54,7 @@ Now, you can trigger your updated pipeline. You can do so from the ADF instance.

![Trigger ADF Pipelines](../assets/TriggerADFPipelines.png)

- > Note: There is currently a [bug](https://github.com/UCLH-Foundry/FlowEHR/issues/197) in FlowEHR that means the deployed pipeline code might not get updated when you run the above command. Currently, a workaround for this is to either delete the FlowEHR cluster before you re-deploy the pipeline code, or increase the Python wheel version in [pyproject.toml](../example_transform/pyproject.toml) and [pipeline.json](../example_transform/pipeline.json).
+ > Note: There is currently a [bug](https://github.com/SAFEHR-hdata/FlowEHR/issues/197) in FlowEHR that means the deployed pipeline code might not get updated when you run the above command. Currently, a workaround for this is to either delete the FlowEHR cluster before you re-deploy the pipeline code, or increase the Python wheel version in [pyproject.toml](../example_transform/pyproject.toml) and [pipeline.json](../example_transform/pipeline.json).
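As a sketch of the wheel-version workaround mentioned in the note above, the bump in `example_transform/pyproject.toml` could look like the following; the version numbers are illustrative, the exact table name depends on the build backend used, and `pipeline.json` must be updated to reference the matching wheel version.

```toml
# example_transform/pyproject.toml (illustrative fragment; real values may differ)
[project]
name = "example_transform"
version = "0.1.1"  # bumped from 0.1.0 so the redeployed wheel is treated as new
```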

10. You will need to change one setting for the metrics to be displayed correctly. Head to the Application Insights resource deployed in your resource group; it should have a name like `transform-ai-${flowehr_id}-dev`. Head to `Usage and estimated costs`, click on `Custom metrics (preview)`, and make sure custom metrics are sent to Azure with dimensions enabled:

2 changes: 1 addition & 1 deletion patient_notes/config.local.yaml
@@ -6,7 +6,7 @@ environment: dev
transform:
spark_version: 3.4
repositories:
-     - url: https://github.com/UCLH-Foundry/Data-Seedling.git
+     - url: https://github.com/SAFEHR-hdata/Data-Seedling.git
datalake:
zones:
- bronze
4 changes: 2 additions & 2 deletions patient_notes/docs/design_doc.md
@@ -44,13 +44,13 @@ Tables in Gold contain additional columns as extracted during Feature Extraction

Delta Lake is open-source software that extends Parquet data files with a file-based transaction log for ACID transactions and scalable metadata handling. Delta Tables are natively supported in Azure Databricks, Azure Synapse and the upcoming [Microsoft Fabric](https://learn.microsoft.com/en-us/fabric/get-started/microsoft-fabric-overview) platform.

- Delta Tables have versioning enabled by default which allows tracking of all updates to the tables. They can also [track row-level changes](https://learn.microsoft.com/en-us/azure/databricks/delta/delta-change-data-feed) which can be read downstream for incremental processing. [Automatic schema validation](https://learn.microsoft.com/en-us/azure/databricks/delta/schema-validation) and [schema evolution](https://docs.databricks.com/delta/update-schema.html) both provide flexibility in managing evolving data schema. For more details, see the [design doc](https://github.com/UCLH-Foundry/Garden-Path/blob/main/designs/data-lake.md) for enabling Data Lake in FlowEHR.
+ Delta Tables have versioning enabled by default which allows tracking of all updates to the tables. They can also [track row-level changes](https://learn.microsoft.com/en-us/azure/databricks/delta/delta-change-data-feed) which can be read downstream for incremental processing. [Automatic schema validation](https://learn.microsoft.com/en-us/azure/databricks/delta/schema-validation) and [schema evolution](https://docs.databricks.com/delta/update-schema.html) both provide flexibility in managing evolving data schema. For more details, see the [design doc](https://github.com/SAFEHR-hdata/Garden-Path/blob/main/designs/data-lake.md) for enabling Data Lake in FlowEHR.
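The row-level change tracking described above can be sketched with the Delta change data feed API. This is a minimal sketch, assuming an active Databricks Spark session `spark` with Delta Lake available and a table that has change data feed enabled; the table name and starting version are hypothetical.

```python
# Sketch: incremental processing via the Delta change data feed.
# Assumes a Databricks/Spark session `spark` with Delta Lake, and a table
# created with delta.enableChangeDataFeed = true. Names are hypothetical.
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 2)  # read changes committed since table version 2
    .table("silver.patient_notes")
)

# Each row carries _change_type, _commit_version and _commit_timestamp
# metadata columns; a downstream job can process only what changed.
updates = changes.filter(changes["_change_type"].isin("insert", "update_postimage"))
```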

### Unity Catalog

In addition to being saved in a Storage account as Delta tables, the data in the Gold layer is materialized as an External Table in Databricks Unity Catalog.

- Databricks Unity Catalog is optionally enabled during the deployment of FlowEHR (see [this pull request](https://github.com/UCLH-Foundry/FlowEHR/pull/326)). It enables us to save and query tables there directly and use features such as SQL Warehouse and SQL Query Editor in Databricks. Saving Gold tables in Unity Catalog allows administrators of the system to enable fine-grained permission control on these tables, and simplifies data analysis and discovery.
+ Databricks Unity Catalog is optionally enabled during the deployment of FlowEHR (see [this pull request](https://github.com/SAFEHR-hdata/FlowEHR/pull/326)). It enables us to save and query tables there directly and use features such as SQL Warehouse and SQL Query Editor in Databricks. Saving Gold tables in Unity Catalog allows administrators of the system to enable fine-grained permission control on these tables, and simplifies data analysis and discovery.
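Materializing a Gold Delta table as a Unity Catalog external table can be sketched as follows. This is a sketch under stated assumptions: it requires an active Databricks `spark` session, and the catalog, schema, table name, and ADLS storage path are all hypothetical.

```python
# Sketch: register a Gold-layer Delta table as an external table in
# Unity Catalog. Catalog/schema names and the abfss path are hypothetical.
spark.sql("""
    CREATE TABLE IF NOT EXISTS gold_catalog.patient_notes.notes_features
    USING DELTA
    LOCATION 'abfss://gold@examplestorage.dfs.core.windows.net/notes_features'
""")
```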

## Processing
