"Why We Built Our Own Distributed Column Store" by Sam Stokes
How do you understand the behaviour of complex distributed systems in production? Distributed systems can fail in unpredictable, hard-to-detect ways. To track down problems quickly, you need to look for patterns and correlations in your data, trying different ways of breaking it down. "Does the problem occur on just one host, or one partition, or for particular customers?"

Sub-second complex queries over large data volumes in real time: sounds like a tall order. The Scuba paper from Facebook describes an architecture that can do it: a low-latency, distributed, schemaless database. Scuba achieves fast queries by storing all data in memory. It stores the raw events, and fans out queries to multiple nodes, so it can support complex queries including aggregates (like mean and percentile statistics) and breakdowns by fields of arbitrary cardinality.

Building Honeycomb, we needed a database with these properties, but we had additional constraints: multi-tenancy, cost to serve, and the limited resources of a startup. This talk describes Retriever, a custom-built database inspired by Scuba. Retriever ingests events from Kafka, and chooses disk over memory, using an efficient column-oriented storage model. I'll discuss interesting aspects of the implementation, and lessons learned from operating a hand-rolled database at production scale with paying customers.

Sam Stokes, Honeycomb

Sam Stokes is a software engineer who can't leave well enough alone. He's compelled to fix broken things, whether they are software systems, engineering processes or cultures. After watching too many systems catch fire, he's building better smoke detectors at Honeycomb; in a past life he cofounded Rapportive and built recommendation systems at LinkedIn.

