Backfill Streaming Data Pipelines in Kappa Architecture

Streaming data pipelines can fail for many reasons. Because source data, such as Kafka topics, often has limited retention, a prolonged job failure can lead to data loss. Streaming jobs therefore need to be backfillable at all times. One solution is to increase the source's retention so that backfilling is simply a replay of the source streams, but extending Kafka retention is very costly at Netflix's data sizes. Another solution is to reprocess the copy of the source data stored in the data warehouse (DWH) with a batch job, an approach commonly known as the Lambda architecture. However, this method introduces significant code duplication, as it requires engineers to maintain a separate, equivalent batch job. At Netflix, we have created the Iceberg Source Connector to provide backfilling capabilities to Flink streaming applications. It allows Flink to stream data stored in Apache Iceberg while mirroring Kafka's ordering semantics, enabling us to backfill large-scale stateful Flink pipelines at low retention cost.
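To make "mirroring Kafka's ordering semantics" concrete, here is a minimal Python sketch, not the actual Iceberg Source Connector. It assumes that records archived in a table still carry their original Kafka partition and offset, and it replays them so that order within each partition is preserved while partitions are interleaved roughly by event time. The `Record` type, its field names, and `replay_in_kafka_like_order` are hypothetical and exist only for illustration.

```python
import heapq
from dataclasses import dataclass
from typing import Iterator, List


@dataclass(frozen=True)
class Record:
    # Hypothetical shape of an archived record; field names are assumptions.
    partition: int       # Kafka partition the record was originally written to
    offset: int          # position within that partition
    event_time_ms: int   # event timestamp in milliseconds
    payload: str


def replay_in_kafka_like_order(partitions: List[List[Record]]) -> Iterator[Record]:
    """Replay archived records with Kafka-like ordering guarantees.

    Each partition is first sorted by offset. heapq.merge only ever consumes
    the head of each input, so the per-partition offset order is preserved even
    when event times inside a partition are out of order. Across partitions,
    the record with the smallest event time among the current heads goes first.
    """
    streams = [sorted(p, key=lambda r: r.offset) for p in partitions]
    return heapq.merge(
        *streams,
        key=lambda r: (r.event_time_ms, r.partition, r.offset),
    )


if __name__ == "__main__":
    p0 = [Record(0, 0, 100, "a"), Record(0, 1, 300, "c")]
    p1 = [Record(1, 0, 200, "b"), Record(1, 1, 250, "d")]
    for rec in replay_in_kafka_like_order([p0, p1]):
        print(rec.partition, rec.offset, rec.event_time_ms, rec.payload)
```

The design choice mirrors what Kafka itself guarantees: ordering is strict only within a partition, so a backfill source needs no global sort, only a per-partition order plus a reasonable interleaving across partitions.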

Watermarks: Time and Progress in Apache Beam and Beyond

So Fresh and So Clean: Learn How to Build Real-Time Warehouses on Lakehouse

Event-Driven Architectures Done Right, Apache Kafka • Tim Berglund • Devoxx Poland 2021

Delta Live Tables A to Z: Best Practices for Modern Data Pipelines

Kappa vs Lambda Architectures and Technology Comparison

Data Warehousing on the Lakehouse

Shift Left Stream Processing for Better Data Governance and Quality | Life Is But A Stream Podcast

Data Lake Modeling: 100 TBs into 5 TBs at Airbnb with Parquet + Run Length Encoding - DataExpert.io

Kafka Tutorial for Beginners | Everything you need to get started

Beyond Monitoring: The Rise of Data Observability

Modern Architecture 101 for New Engineers & Forgetful Experts - Jerry Nixon - NDC Copenhagen 2025

System Design Explained: APIs, Databases, Caching, CDNs, Load Balancing & Production Infra

Dive Deeper into Data Engineering on Databricks

Rethinking Orchestration as Reconciliation: Software-Defined Assets in Dagster

Streaming Concepts & Introduction to Flink - Event Time and Watermarks

Streaming from Apache Iceberg - Building Low-Latency and Cost-Effective Data Pipelines

Making Apache Spark™ Better with Delta Lake

Real Time Streaming with Azure Databricks and Event Hubs

MLOps on Databricks: A How-To Guide

Tableflow: Materialize Apache Kafka® Topics as Apache Iceberg™ and Delta Lake Tables With Zero ETL
