diff --git a/README.md b/README.md
index 2bd5ba1..5a0618c 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
 # Data-Seedling
-This repository serves as an example Data Pipeline designed to work with [FlowEHR](https://github.com/UCLH-Foundry/FlowEHR).
+This repository serves as an example Data Pipeline designed to work with [FlowEHR](https://github.com/SAFEHR-hdata/FlowEHR).
 It showcases the way to write, test, deploy, and run data pipelines in production, and does so with a focus on long-term maintainability.
@@ -8,7 +8,7 @@ It showcases the way to write, test, deploy, and run data pipelines in productio
 FlowEHR is a framework for creating secure data transformation pipelines in Azure. FlowEHR supports authoring multiple data pipelines through multiple repositories and automates the deployment of cloud resources to run these pipelines.
-If you don't yet know about FlowEHR, please read [the FlowEHR README](https://github.com/UCLH-Foundry/FlowEHR/blob/main/README.md).
+If you don't yet know about FlowEHR, please read [the FlowEHR README](https://github.com/SAFEHR-hdata/FlowEHR/blob/main/README.md).
 ## Quick Start
diff --git a/docs/quick_start.md b/docs/quick_start.md
index 51eb7a1..87bd163 100644
--- a/docs/quick_start.md
+++ b/docs/quick_start.md
@@ -2,7 +2,7 @@
 This repository is designed to work with an instance of FlowEHR. To work with Data Pipelines in FlowEHR, you will need to have both the FlowEHR repo and the Data Seedling repo opened in VSCode. This document describes the way to set them up and test your changes in your personal development deployment.
-1. Clone the [FlowEHR repo](https://github.com/UCLH-Foundry/FlowEHR) and deploy an instance of FlowEHR following the [steps outlined in the README](https://github.com/UCLH-Foundry/FlowEHR#getting-started).
+1. Clone the [FlowEHR repo](https://github.com/SAFEHR-hdata/FlowEHR) and deploy an instance of FlowEHR following the [steps outlined in the README](https://github.com/SAFEHR-hdata/FlowEHR#getting-started).
 > Note that the resource group in which almost all resources will be created will have a name that looks like `rg-${flowehr_id}-dev`, e.g. `rg-myflwr-dev`.
 2. Create a new repository using this template, as shown in the screenshot. Pick a name for your data pipeline repository. Do not clone the repository yet.
@@ -54,7 +54,7 @@ Now, you can trigger your updated pipeline. You can do so from the ADF instance.
 ![Trigger ADF Pipelines](../assets/TriggerADFPipelines.png)
-> Note: There is currently a [bug](https://github.com/UCLH-Foundry/FlowEHR/issues/197) in FlowEHR that means the deployed pipeline code might not get updated when you run the above command. Currently, a workaround for this is to either delete the FlowEHR cluster before you re-deploy the pipeline code, or increase the Python wheel version in [pyproject.toml](../example_transform/pyproject.toml) and [pipeline.json](../example_transform/pipeline.json).
+> Note: There is currently a [bug](https://github.com/SAFEHR-hdata/FlowEHR/issues/197) in FlowEHR that means the deployed pipeline code might not get updated when you run the above command. Currently, a workaround for this is to either delete the FlowEHR cluster before you re-deploy the pipeline code, or increase the Python wheel version in [pyproject.toml](../example_transform/pyproject.toml) and [pipeline.json](../example_transform/pipeline.json).
 10. You will need to change one setting for the metrics to be displayed correctly. Head to the Application Insights resource deployed in your resource group, it should have a name like `transform-ai-${flowehr_id}-dev`. Head to `Usage and estimated costs`, click on `Custom metrics (preview)`, and make sure custom metrics are sent to Azure with dimensions enabled:
diff --git a/patient_notes/config.local.yaml b/patient_notes/config.local.yaml
index cbce6a1..fb445f1 100644
--- a/patient_notes/config.local.yaml
+++ b/patient_notes/config.local.yaml
@@ -6,7 +6,7 @@ environment: dev
 transform:
   spark_version: 3.4
   repositories:
-    - url: https://github.com/UCLH-Foundry/Data-Seedling.git
+    - url: https://github.com/SAFEHR-hdata/Data-Seedling.git
 datalake:
   zones:
     - bronze
diff --git a/patient_notes/docs/design_doc.md b/patient_notes/docs/design_doc.md
index 6cd817f..5bbe1dd 100644
--- a/patient_notes/docs/design_doc.md
+++ b/patient_notes/docs/design_doc.md
@@ -44,13 +44,13 @@ Tables in Gold contain additional columns as extracted during Feature Extraction
 Delta Lake is open-source software that extends Parquet data files with a file-based transaction log for ACID transactions and scalable metadata handling. Delta Tables are natively supported in Azure Databricks, Azure Synapse and the upcoming [Microsoft Fabric](https://learn.microsoft.com/en-us/fabric/get-started/microsoft-fabric-overview) platform.
-Delta Tables have versioning enabled by default which allows tracking of all updates to the tables. They can also [track row-level changes](https://learn.microsoft.com/en-us/azure/databricks/delta/delta-change-data-feed) which can be read downstream for incremental processing. [Automatic schema validation](https://learn.microsoft.com/en-us/azure/databricks/delta/schema-validation) and [schema evolution](https://docs.databricks.com/delta/update-schema.html) both provide flexibility in managing evolving data schema. For more details, see the [design doc](https://github.com/UCLH-Foundry/Garden-Path/blob/main/designs/data-lake.md) for enabling Data Lake in FlowEHR.
+Delta Tables have versioning enabled by default which allows tracking of all updates to the tables. They can also [track row-level changes](https://learn.microsoft.com/en-us/azure/databricks/delta/delta-change-data-feed) which can be read downstream for incremental processing. [Automatic schema validation](https://learn.microsoft.com/en-us/azure/databricks/delta/schema-validation) and [schema evolution](https://docs.databricks.com/delta/update-schema.html) both provide flexibility in managing evolving data schema. For more details, see the [design doc](https://github.com/SAFEHR-hdata/Garden-Path/blob/main/designs/data-lake.md) for enabling Data Lake in FlowEHR.
 ### Unity Catalog
 In addition to being saved in a Storage account as Delta tables, the data in the Gold layer is materialized as an External Table in Databricks Unity Catalog.
-Databricks Unity Catalog is optionally enabled during the deployment of FlowEHR (see [this pull request](https://github.com/UCLH-Foundry/FlowEHR/pull/326)). It enables us to save and query tables there directly and use features such as SQL Warehouse and SQL Query Editor in Databricks. Saving Gold tables in Unity Catalogue enables administrators of the system to enable fine-grained permission control on these tables, and simplifies data analysis and discovery.
+Databricks Unity Catalog is optionally enabled during the deployment of FlowEHR (see [this pull request](https://github.com/SAFEHR-hdata/FlowEHR/pull/326)). It enables us to save and query tables there directly and use features such as SQL Warehouse and SQL Query Editor in Databricks. Saving Gold tables in Unity Catalogue enables administrators of the system to enable fine-grained permission control on these tables, and simplifies data analysis and discovery.
 ## Processing
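The note in the docs/quick_start.md hunk works around the stale-deployment bug by bumping the Python wheel version in both `pyproject.toml` and `pipeline.json`, which only helps if the two files stay in sync. A minimal pre-deploy sanity check could look like the sketch below; the `wheel_version` key in `pipeline.json` is an assumption for illustration (the real file may reference the wheel differently), and `tomllib` requires Python 3.11+:

```python
import json
import tomllib


def versions_match(pyproject_text: str, pipeline_text: str) -> bool:
    """Compare the wheel version declared in pyproject.toml with the
    version referenced in pipeline.json.

    The "wheel_version" key is hypothetical; adapt it to however the
    real pipeline.json names the wheel.
    """
    pyproject = tomllib.loads(pyproject_text)
    pipeline = json.loads(pipeline_text)
    return pyproject["project"]["version"] == pipeline["wheel_version"]


# Example inputs, stand-ins for the real files under example_transform/
PYPROJECT = """
[project]
name = "example-transform"
version = "0.2.0"
"""
PIPELINE = '{"wheel_version": "0.2.0"}'

print(versions_match(PYPROJECT, PIPELINE))  # prints True when versions agree
```

Running a check like this before re-deploying would catch the case where only one of the two files was bumped.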
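The design_doc.md hunk above describes Delta Lake's file-based transaction log, which is what makes default table versioning and "as of version" reads possible. The idea can be illustrated with a deliberately tiny, stdlib-only model — this is not Delta's actual log format, and every name below is invented for illustration:

```python
import json


class ToyVersionedTable:
    """Toy stand-in for a transaction-log-backed table: each commit
    appends one JSON entry to the log, and any past version can be
    rebuilt by replaying the log up to that version."""

    def __init__(self) -> None:
        self._log: list[str] = []  # one serialized commit per version

    def commit(self, rows: dict) -> int:
        """Record a set of upserted rows; returns the new version number."""
        self._log.append(json.dumps({"upsert": rows}))
        return len(self._log) - 1

    def as_of(self, version: int) -> dict:
        """Reconstruct table state by replaying commits 0..version."""
        state: dict = {}
        for entry in self._log[: version + 1]:
            state.update(json.loads(entry)["upsert"])
        return state


table = ToyVersionedTable()
table.commit({"note_1": "raw text"})      # version 0
table.commit({"note_1": "cleaned text"})  # version 1
print(table.as_of(0))  # {'note_1': 'raw text'}
print(table.as_of(1))  # {'note_1': 'cleaned text'}
```

Real Delta Tables layer ACID guarantees, Parquet data files, and row-level change tracking on top of this same replay-the-log principle; see the linked change data feed and schema validation docs for the actual feature set.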