Distributed Systems Decoded 1: What Is a Distributed System and Why It's Secretly Brutal

Free to reuse. Free to remix. No attribution required.

QUICK SUMMARY
A distributed system is a collection of independent computers that coordinate only by passing messages, so that together they look like a single machine. The whole field exists to fight one enemy: partial failure, where one part dies while another keeps running and no survivor can tell whether a silent machine is dead, slow, or unreachable. This is the course-overview episode, from the unanswered text message all the way to a database that runs on atomic clocks.

KEY CONCEPTS
1. The Distributed System Illusion - Thousands of independent machines passing messages to pretend they are one coherent computer.
2. The Three Lost Comforts - Splitting one program across machines repossesses shared memory, a global clock, and clean all-or-nothing failure.
3. Partial Failure - The boss enemy: half the system can be dead while the live half cannot tell that anything died. Dead and slow look identical.
4. Logical Time - Order events by cause and effect using the happens-before relation instead of trusting clocks that lie.
5. The Impossibility Walls - CAP and FLP are proven theorems that say certain things you want are flatly impossible, not just hard.
6. Consensus - Getting many unreliable machines to agree on one value anyway, via Paxos and Raft.

DEFINITIONS
Distributed System: Independent computers coordinating only by messages to appear as one coherent system. Lamport: "a system in which the failure of a computer you did not even know existed can render your own computer unusable."
Partial Failure: A failure mode where some nodes fail while others keep running, and the survivors cannot reliably tell who died.
Happens-Before: Lamport's partial ordering: event A happens-before B if A could have caused B. Events with no causal link are concurrent.
CAP Theorem: During a network partition you must sacrifice either consistency or availability; you cannot keep both. Conjectured by Brewer in 2000, proven by Gilbert and Lynch in 2002.
FLP Impossibility: In an asynchronous system, no deterministic protocol can guarantee that consensus is reached if even one process may crash. Proven by Fischer, Lynch, and Paterson in 1985.
Consensus: Getting a group of machines to agree on a single value despite crashes, lost messages, and delays. Solved in practice by Paxos and Raft.
TrueTime: Google Spanner's clock service, which reports the current time as a bounded uncertainty window derived from atomic clocks and GPS. Spanner then waits the uncertainty out before committing.

HOW IT WORKS
1. One program on one machine quietly enjoys shared memory, a single clock, and clean failure.
2. Split it across machines and all three comforts vanish: reads become messages, clocks disagree, and failure becomes partial.
3. Engineers rebuild time from logic using happens-before, making the lying clocks irrelevant instead of fixing them.
4. Mathematics draws hard walls: CAP and FLP prove some goals are impossible, not merely difficult.
5. Consensus protocols slip through the FLP loophole to make thousands of machines agree millions of times a second.
6. Spanner assembles every idea, using atomic clocks and TrueTime to make a planet-scale database act like one machine.

KEY ARGUMENTS
1. The unanswered text is the same problem as every multi-computer system: acting on a state you cannot observe.
2. A single computer hands you three invisible gifts; distribution repossesses all three at once.
3. Partial failure is the root difficulty because dead and slow wear the same costume.
4. You cannot out-engineer the missing clock with a better network: clocks are physics and crashes are unavoidable.
5. Logical clocks beat the clock problem without synchronizing anything, by ordering only causally related events.
6. CAP and FLP are real proofs, yet real systems agree constantly because partial synchrony is the loophole.
7. Spanner is not magic; it is every idea in the course assembled into one machine.
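
WORKED SKETCH
The logical-clock idea (order events by cause and effect, not by wall-clock time) can be sketched in a few lines. This is a minimal illustration of a Lamport-style counter, not code from the episode: the class name LamportClock, its method names, and the two-process demo are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class LamportClock:
    """Hypothetical helper: one integer counter per process, no wall clock."""
    time: int = 0

    def tick(self) -> int:
        """Local event: advance the counter by one."""
        self.time += 1
        return self.time

    def send(self) -> int:
        """Sending is an event; the returned timestamp travels with the message."""
        return self.tick()

    def receive(self, msg_time: int) -> int:
        """Jump past the sender's timestamp so the receive is ordered after the send."""
        self.time = max(self.time, msg_time) + 1
        return self.time


def demo() -> dict:
    """Two processes, A and B, exchange one message each way."""
    a, b = LamportClock(), LamportClock()
    a_local = a.tick()              # A does something on its own
    a_send = a.send()               # A sends a message to B
    b_local = b.tick()              # B does something, unaware of A
    b_receive = b.receive(a_send)   # B receives A's message
    b_reply = b.send()              # B replies
    a_receive = a.receive(b_reply)  # A receives the reply
    return {
        "a_local": a_local,
        "a_send": a_send,
        "b_local": b_local,
        "b_receive": b_receive,
        "b_reply": b_reply,
        "a_receive": a_receive,
    }


if __name__ == "__main__":
    for event, stamp in demo().items():
        print(f"{event}: {stamp}")
```

Every causal chain in the demo gets strictly increasing timestamps: send, receive, reply, receive. The reverse does not hold. A's first local event and B's first local event both get timestamp 1, yet neither caused the other; they are concurrent, and a plain counter like this cannot tell you that on its own.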

KEY TAKEAWAYS
A distributed system is an illusion of oneness built on a foundation of permanent uncertainty.
The deepest difficulty is partial failure: the live half cannot tell which half died.
Time and ordering can be rebuilt from causality alone, no trustworthy clock required.
Some properties are provably impossible, but real systems route around the proofs through partial synchrony.
Consensus, turning unreliable machines into one reliable mind, is the crown jewel of the field.

MEMORY HOOKS
An orchestra sealed in soundproof booths, passing notes under the door, any musician free to walk out unnoticed.
A detective reading order from a muddy footprint on top of broken glass when no clock is on the wall.

SOURCE
https://en.wikipedia.org/wiki/Fallaci...

#distributedsystems #computerscience #systemdesign #cap #consensus #raft #paxos #spanner #lamport #softwareengineering #coding #techinterview #madscilecture #decoded #pilot #science