Data Enrichment for Data Pipelines at Scale
- Richard Murphy
- Feb 21, 2023
On February 27 and 28, the CDC Foundation is hosting the first joint CDC-ONC Industry Day at the HHS building in Washington, DC. The theme for this event is Accelerating Public Health Data Modernization through Public Private Partnerships. Sophron submitted an abstract on data enrichment for data pipelines and will be presenting on Tuesday, Feb. 28. Please join us!
Below is the abstract/overview of the presentation:
Introduction:
Over the last 18 months, Sophron.io has been conducting R&D on cloud data pipelines and data linkage services as part of our ongoing efforts to inform and enhance our professional services and consulting engagements with CDC and other agencies. This research has focused on three areas: cloud data pipelines, data linkage, and GIS services related to data pipelines.
Research:
Initial R&D efforts ran in parallel, focusing on optimizing data pipelines and on building user-friendly data linkage interfaces. Early pipeline testing with Azure Logic Apps and Azure Data Factory eventually led to a focus on Delta Lake in Azure Databricks as the platform of choice for further optimization testing, due to performance improvements and other factors. At the same time, the data linkage project produced SLINK, an R package that provides a graphical user interface to the R FASTLink data linking library. SLINK was transformed from an internal project into an open source project and released on CRAN and GitHub in late 2022.
The learnings from these projects have fed into the current R&D effort, which is focused on creating data linkage services that can operate in a streaming data pipeline or on batches from a data lake. To optimize the linkage services and increase data quality, probabilistic algorithms were chosen because they offer a wider range of accuracy and quality options. Pre-processing the data with address normalization has also been identified as increasing match rate and data quality. An additional benefit of the address normalization functionality is that geocoding services can be provided from the same software solution. Packaging these scalable services into a solution offering is underway.
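To illustrate why address normalization can raise the match rate of probabilistic linkage, here is a minimal Python sketch. It is not the SLINK or FASTLink implementation. The abbreviation table, the field names, the per-field agreement probabilities, and the helpers `normalize_address`, `prepare`, and `match_weight` are all assumptions made for illustration. The scoring follows the general Fellegi-Sunter idea of summing log-likelihood weights per field.

```python
import math
import re

# Hypothetical abbreviation table. A production normalizer would follow a
# full postal addressing standard rather than this handful of entries.
ABBREVIATIONS = {
    "street": "st", "avenue": "ave", "road": "rd", "drive": "dr",
    "north": "n", "south": "s", "east": "e", "west": "w",
    "apartment": "apt", "suite": "ste",
}

# Assumed (m, u) probabilities per field:
#   m = P(field agrees | records are a true match)
#   u = P(field agrees | records are not a match)
# In practice these are estimated from the data, not hard-coded.
FIELD_PARAMS = {
    "last_name": (0.95, 0.01),
    "birth_year": (0.98, 0.05),
    "address": (0.90, 0.02),
}


def normalize_address(raw: str) -> str:
    """Lowercase, replace punctuation with spaces, abbreviate common words."""
    cleaned = re.sub(r"[^\w\s]", " ", raw.lower())
    tokens = [ABBREVIATIONS.get(token, token) for token in cleaned.split()]
    return " ".join(tokens)


def prepare(record: dict) -> dict:
    """Return a copy of the record with the address field normalized."""
    return {**record, "address": normalize_address(record["address"])}


def match_weight(rec_a: dict, rec_b: dict) -> float:
    """Sum per-field log2 weights: positive on agreement, negative otherwise."""
    total = 0.0
    for field, (m, u) in FIELD_PARAMS.items():
        if rec_a[field] == rec_b[field]:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


# Two made-up records that describe the same person with differently
# formatted addresses.
record_a = {"last_name": "smith", "birth_year": 1980,
            "address": "123 North Main Street, Apt. 4"}
record_b = {"last_name": "smith", "birth_year": 1980,
            "address": "123 N. Main St Apartment 4"}

raw_weight = match_weight(record_a, record_b)
normalized_weight = match_weight(prepare(record_a), prepare(record_b))
print(f"weight without normalization: {raw_weight:.2f}")
print(f"weight with normalization:    {normalized_weight:.2f}")
```

In this sketch the raw address strings disagree, so the address field subtracts from the total weight. After normalization both addresses reduce to the same string and the field adds to it. A real service would typically use approximate string comparison and estimated probabilities instead of exact equality and fixed values.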
Conclusions:
Pipelines for health record data can benefit from data enrichment services such as geocoding, address normalization, and data linkage. Building services designed specifically for high-volume data processing, and using GIS software that was traditionally used in mainframe environments and the financial services industry, has been shown to scale better for the increased volumes of health data. These approaches seem to fit well with the recently announced Northstar architecture and could provide multiple benefits to future implementations.
Researchers:
Richard Murphy, MBA Technology Management; CTO, Sophron.io
Beijun Desai, BS Computer Science, BS Applied Mathematics, University of Georgia, Senior
Brandon Yau, BS Computer Science, University of Georgia, 2022 graduate