GeoSpark: Manage Big Geospatial Data in Apache Spark

Jia Yu, Mohamed Sarwat

A presentation from ApacheCon @Home 2020
https://apachecon.com/acah2020/

The volume of spatial data is growing at a staggering rate. This talk takes a comprehensive look at how GeoSpark extends Apache Spark to manage massive-scale spatial data. We first provide background on the characteristics of spatial data and the history of distributed spatial data management systems. The second section presents the core components of GeoSpark, such as spatial data partitioning, indexing, and query algorithms. The third section introduces the latest updates in GeoSpark, including geospatial visualization, integration with Apache Zeppelin, and Python and R wrappers. The fourth section concludes the talk with a summary of the main points and directions for future research.

Jia Yu: Jia Yu is an Assistant Professor at the Washington State University School of EECS. He obtained his Ph.D. in Computer Science from Arizona State University in summer 2020. Jia’s research focuses on database systems and geospatial data management. In particular, he has worked on distributed data management systems, database indexing, and data visualization. He is the main contributor to several open-source research projects such as Apache Sedona (incubating), a cluster computing framework for processing big spatial data.

Mohamed Sarwat: Mohamed is an assistant professor of computer science at Arizona State University. Dr. Sarwat is a recipient of the 2019 National Science Foundation CAREER award. His general research interest lies in developing robust and scalable data systems for spatial and spatiotemporal applications.
His research has been recognized with two best research paper awards, at the IEEE International Conference on Mobile Data Management (MDM 2015) and the International Symposium on Spatial and Temporal Databases (SSTD 2011), a best-of-conference citation at the IEEE International Conference on Data Engineering (ICDE 2012), and a best vision paper award (3rd place) at SSTD 2017. Beyond his scientific publications, Mohamed is also the co-architect of several software artifacts, including GeoSpark (a scalable system for processing big geospatial data), which is used by major tech companies. He is an associate editor for the GeoInformatica journal and has served as an organizer, reviewer, and program committee member for major data management and spatial computing venues. In June 2019, Dr. Sarwat was named an Early Career Distinguished Lecturer by the IEEE Mobile Data Management community.