Demystifying CXL Memory Computation
Yongil Jung, XCENA

Modern analytical engines demand massive memory capacity and bandwidth for large-scale scans, aggregations, and joins, yet conventional server architectures face hard limits in slot count, density, and cost. CXL offers a compelling alternative by enabling cache-coherent access to device-attached memory over PCIe. Memory expansion lets more datasets reside in memory; memory pooling shares capacity across multiple hosts, improving utilization and reducing inter-host data transfer overhead; and near-data processing pushes query operations closer to where the data resides, reducing unnecessary movement. In this talk, we introduce the MX1, a CXL computational memory device that goes beyond expansion by offloading columnar query operations (decompression, filtering, aggregation, and string search) directly to the memory controller. We present microbenchmark results showing up to 5× throughput and 19× energy-efficiency improvements over host-CPU execution using CXL memory, and demonstrate how these kernels compose into TPC-H query plans with end-to-end performance gains. We then share our experience integrating with Velox, describing how we used its extensibility interfaces to offload query operators to XFLARE, our Rust-based OLAP query engine built for MX1 acceleration. We discuss what worked, the extensibility challenges we encountered, and future directions, including a proposed contribution for CXL-aware memory allocation.