An architecture for federated data discovery & lineage with Apache Atlas
Comcast's Streaming Data platform comprises a variety of ingest, transformation, and storage services in the public cloud. Peer-reviewed Apache Avro schemas support end-to-end data governance. We have previously reported (DataWorks Summit 2017) on how we extended Atlas with custom entity and process types for discovery and lineage in the AWS public cloud. Custom Lambda functions notify Atlas of the creation of new entities and new lineage links via asynchronous Kafka messaging.

Recently we were presented with the challenge of providing integrated data discovery and lineage across our public cloud data sources and on-prem data sources, both Hadoop-based and traditional data warehouses and RDBMSs. Can Apache Atlas meet this challenge? A resounding yes! This talk will present our federated architecture, with Atlas providing SQL-like, free-text, and graph search across select metadata from all on-prem and public cloud data sources in our purview. Lightweight, custom connectors/bridges identify metadata/lineage changes in underlying sources and publish them to Atlas via the asynchronous API. A portal layer provides Atlas query access and a federation of UIs. Once data of interest is identified via Atlas queries, interfaces specific to underlying sources may be used for special-purpose metadata mining.

While metadata repositories for data discovery and lineage abound, none of them has built-in connectors and listeners for the entire complement of data sources that Comcast and many other large enterprises use to support their business needs. Teams that build in-house solutions typically underestimate the cost of development and maintenance, and the results often suffer from architecture-by-accretion. Atlas' commitment to extensibility, its built-in provision of typed, free-text, and graph search, and its REST and asynchronous APIs position it uniquely in the build-vs-buy sweet spot.

Speaker: Barbara Eckman, Principal Data Architect, Comcast
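The connector/bridge pattern described in the abstract (detect metadata changes in a source, then publish them asynchronously) can be illustrated with a minimal sketch. This is not Comcast's implementation and does not reproduce the exact Atlas notification schema: the function names, the `rdbms_table` type name, the event-type labels, and the sample table names are all hypothetical. The only details borrowed from Atlas are that an entity carries a `typeName` and an `attributes` map with a unique `qualifiedName`. The `send` callable stands in for whatever asynchronous transport is used, such as a Kafka producer.

```python
import json
from typing import Callable, Dict, List

# Hypothetical snapshot shape: qualifiedName -> attributes of one dataset.
Snapshot = Dict[str, Dict[str, str]]


def to_entity(type_name: str, qualified_name: str, attrs: Dict[str, str]) -> dict:
    """Shape a source record like an Atlas entity (typeName + attributes)."""
    return {
        "typeName": type_name,
        "attributes": {"qualifiedName": qualified_name, **attrs},
    }


def diff_snapshots(type_name: str, old: Snapshot, new: Snapshot) -> List[dict]:
    """Compare two metadata snapshots and return illustrative change events."""
    events = []
    for qn, attrs in new.items():
        if qn not in old:
            events.append({"type": "ENTITY_CREATE",
                           "entity": to_entity(type_name, qn, attrs)})
        elif old[qn] != attrs:
            events.append({"type": "ENTITY_UPDATE",
                           "entity": to_entity(type_name, qn, attrs)})
    for qn in old:
        if qn not in new:
            events.append({"type": "ENTITY_DELETE",
                           "entity": to_entity(type_name, qn, {})})
    return events


def publish_changes(events: List[dict], send: Callable[[str], None]) -> int:
    """Serialize each event to JSON and hand it to the transport."""
    for event in events:
        send(json.dumps(event, sort_keys=True))
    return len(events)


# Usage: a list stands in for the message queue (assumption for illustration).
old = {"warehouse.sales.orders": {"owner": "finance"}}
new = {
    "warehouse.sales.orders": {"owner": "sales-ops"},
    "warehouse.sales.refunds": {"owner": "finance"},
}
outbox: List[str] = []
sent = publish_changes(diff_snapshots("rdbms_table", old, new), outbox.append)
```

Keeping the diff logic separate from the transport is one way to keep such connectors lightweight: the same change-detection code can be reused per source, while only `send` needs to know about the messaging system.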


