Building Identity Graphs over Heterogeneous Data

Customers and service providers (social networks, ad targeting, retail, and so on) interact across many modes and channels, such as browsers, apps, and devices. In each interaction, the user is identified by a token, possibly a different one for each mode or channel. Examples of such identity tokens include cookies and app IDs. As a user engages more with these services, linkages are generated between tokens belonging to the same user; each linkage connects multiple identity tokens together. The challenge is to unify a user's identities into a single connected component, providing a unified identity view. This capability needs to extend beyond individual channels to achieve true unification of identity.

Since every interaction or transaction event contains some form of identity, a highly scalable platform is required to identify and link the identities belonging to a user as a connected component. We therefore built the Identity Graph platform on the Spark processing engine, using a distributed version of the union-find algorithm with path compression.

We would like to present the following:
- The journey of building a highly scalable Identity Graph platform that handles 25+ billion vertices and 30+ billion edges, with an incremental 200M new linkages every day.
- Why we chose to build our own graph processing framework on Spark instead of using a distributed graph database.
- How we handle data quality challenges.
- The optimization strategies we implemented to overcome the scalability and performance challenges of building and traversing the graph.
- A peek into the online version of the Identity Graph, which enables real-time graph building, querying, and traversal.

Takeaways:
- The feasibility of building a highly scalable graph framework using Spark.
- The idea of building and leveraging the graph in real time to achieve freshness.
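
To make the core idea concrete, here is a minimal single-machine sketch of union-find with path compression applied to identity linkages. It is not the distributed Spark implementation described in the talk, whose details are not given here. The class name `UnionFind`, the helper `build_identity_components`, and the token strings are all hypothetical, chosen only for illustration; union by size is an added assumption that keeps the trees shallow.

```python
from collections import defaultdict


class UnionFind:
    """Single-machine union-find over hashable identity tokens (illustrative only)."""

    def __init__(self):
        self.parent = {}  # token -> parent token (a root is its own parent)
        self.size = {}    # root token -> number of tokens in its component

    def find(self, token):
        # Register unseen tokens as singleton components.
        if token not in self.parent:
            self.parent[token] = token
            self.size[token] = 1
            return token
        # First pass: walk up to the root.
        root = token
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass (path compression): point every visited token at the root.
        while self.parent[token] != root:
            next_token = self.parent[token]
            self.parent[token] = root
            token = next_token
        return root

    def union(self, a, b):
        # Merge the components containing a and b; attach smaller under larger.
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return root_a
        if self.size[root_a] < self.size[root_b]:
            root_a, root_b = root_b, root_a
        self.parent[root_b] = root_a
        self.size[root_a] += self.size[root_b]
        return root_a

    def components(self):
        # Group all known tokens by their root.
        groups = defaultdict(set)
        for token in list(self.parent):
            groups[self.find(token)].add(token)
        return list(groups.values())


def build_identity_components(linkages):
    """Treat each (token_a, token_b) linkage as an edge and return the components."""
    uf = UnionFind()
    for token_a, token_b in linkages:
        uf.union(token_a, token_b)
    return uf.components()


if __name__ == "__main__":
    # Hypothetical linkages: each pair was observed together for one user.
    example_linkages = [
        ("cookie:abc", "app:123"),
        ("app:123", "device:xyz"),
        ("cookie:def", "app:456"),
    ]
    for component in build_identity_components(example_linkages):
        print(sorted(component))
```

Each resulting set corresponds to one unified identity. At the scale quoted above, the parent map would not fit on one machine, which is presumably why a distributed variant is needed.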
