Building Identity Graphs over Heterogeneous Data

Customers and service providers (e.g., social networks, ad targeting, retail) interact across many modes and channels, such as browsers, apps, and devices. In each interaction, the user is identified by a token, possibly a different one for each mode or channel. Examples of such identity tokens include cookies and app IDs. As a user engages more with these services, linkages are generated between tokens belonging to the same user; each linkage connects multiple identity tokens. The challenge is to unify a user's identities into a single connected component that provides a unified identity view. This capability needs to extend beyond channels and create true unification of identity.

Since every interaction or transaction event contains some form of identity, a highly scalable platform is required to identify and link the identities belonging to a user as a connected component. We therefore built the Identity Graph platform on the Spark processing engine, using a distributed version of the union-find algorithm with path compression.

We would like to present the following:
- The journey of building a highly scalable Identity Graph platform that handles 25+ billion vertices and 30+ billion edges, with an incremental 200M new linkages every day.
- Why we chose to build our own graph processing framework on Spark instead of using other distributed graph databases.
- How we handle data quality challenges.
- The optimization strategies we implemented to overcome the scalability and performance challenges of building and traversing the graph.
- A peek into the online version of the Identity Graph, which enables real-time graph building, querying, and traversals.

Takeaways:
- The feasibility of building a highly scalable graph framework using Spark.
- The idea of building and leveraging the graph in real time to achieve freshness.
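The abstract names union-find with path compression as the core algorithm but does not describe the distributed Spark implementation. As a rough illustration of the underlying idea only, here is a minimal single-machine sketch in Python. The class and function names, the union-by-size heuristic, and the token strings are all assumptions made for this example and are not taken from the talk.

```python
from collections import defaultdict


class UnionFind:
    """Single-machine union-find with path compression and union by size.

    Illustrative only: the platform described above uses a distributed
    variant on Spark, which this sketch does not attempt to reproduce.
    """

    def __init__(self):
        self.parent = {}  # token -> parent token
        self.size = {}    # root token -> number of tokens in its component

    def find(self, token):
        # Register an unseen token as its own singleton component.
        if token not in self.parent:
            self.parent[token] = token
            self.size[token] = 1
            return token
        # First pass: walk up to the root of the component.
        root = token
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass (path compression): point each visited token at the root.
        while self.parent[token] != root:
            self.parent[token], token = root, self.parent[token]
        return root

    def union(self, a, b):
        # Merge the components containing a and b; return the surviving root.
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra
        # Union by size (an assumed heuristic): attach the smaller tree
        # under the larger one to keep trees shallow.
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return ra


def connected_components(linkages):
    """Group tokens into components given (token_a, token_b) linkages."""
    uf = UnionFind()
    for a, b in linkages:
        uf.union(a, b)
    groups = defaultdict(set)
    for token in list(uf.parent):
        groups[uf.find(token)].add(token)
    return list(groups.values())


# Hypothetical linkages between identity tokens.
example_linkages = [
    ("cookie:c1", "app:a1"),
    ("app:a1", "email:e1"),
    ("cookie:c2", "device:d9"),
]
components = connected_components(example_linkages)
```

In this example the first two linkages share the token `app:a1`, so `cookie:c1`, `app:a1`, and `email:e1` end up in one component, while `cookie:c2` and `device:d9` form a second. Each component corresponds to one unified identity in the sense used above.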