I was wrong — the talk follow-up.


I criticize myself more harshly than anyone else. I advise my bosses to be direct and honest with their feedback whenever possible, so that it leaves no room for me to misinterpret it and fall down a rabbit hole of self-loathing. Maybe this is what Xanax is supposed to help with, I’m not entirely sure.

Last week at HashiConf I gave the most intense talk I’ve given yet, “You. Must. Build. A. Raft!”. It was fun and lighthearted, meant to explain concepts such as Byzantine fault tolerance and Paxos in a way my dog could understand. I wanted to give the talk some kind of storyline, so I framed it around a discussion I had with my lead a while back.

I drool over redundancy. I don’t know why, but the more reliable and redundant a system is, the more I want to find out how I _could_ break it. I was discussing with my lead how I wanted to implement a multi-master, multi-region setup for an application we were working on. He advised against it: because the application uses Raft, latency between regions could cause issues. I heeded his advice, but a voice in the back of my head wondered whether the problem was Raft or the application. And so my talk was born.

For the sake of everyone reading, I’ll only focus on the Raft portion of my talk. Each node in Raft has a randomized election timer; when a node’s timer expires, it attempts to start a leader election. The leader prevents these timeouts by sending heartbeats to each node, essentially refilling their timers.
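To make the timer-and-heartbeat dance concrete, here’s a minimal sketch of a single follower’s election timer. The `Follower` class is hypothetical (not from any real Raft library), and the 150–300 ms range is the illustrative one the Raft paper suggests:

```python
import random
import time

# Randomized election timeout range in milliseconds (illustrative values).
ELECTION_TIMEOUT_MS = (150, 300)

class Follower:
    """Toy model of one Raft follower's election timer."""

    def __init__(self):
        self.reset_timer()

    def reset_timer(self):
        # Each reset draws a fresh random timeout, which keeps the nodes
        # from all standing for election at the same instant.
        timeout_s = random.uniform(*ELECTION_TIMEOUT_MS) / 1000
        self.deadline = time.monotonic() + timeout_s

    def on_heartbeat(self):
        # A heartbeat from the leader "refills" the timer.
        self.reset_timer()

    def timed_out(self):
        # Once the deadline passes, this node would become a candidate.
        return time.monotonic() >= self.deadline

# With regular heartbeats, the timer never expires...
f = Follower()
for _ in range(5):
    time.sleep(0.05)   # leader heartbeat every 50 ms, well under the timeout
    f.on_heartbeat()
assert not f.timed_out()

# ...but once the leader goes silent longer than the maximum timeout,
# the follower times out and would start an election.
time.sleep(0.35)
assert f.timed_out()
```

The key detail for the cross-region question is that every missed heartbeat window, whatever the cause, looks identical to a dead leader from the follower’s point of view.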

Since the timers are randomized, I believed that if we used Raft across multiple regions, any latency issues would be solved by the leader election process.

You. Must. Build. A. Raft!

If latency occurs and a timeout happens, a new term begins and the leader election process kicks off. I figured this leader election process was supposed to be flexible and to happen often.
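That timeout-to-new-term step can be sketched as well. This is a toy, single-threaded model of Raft’s voting rule (the `Node` class and helper names are my own, and there’s no real RPC layer here), showing why a candidate needs a majority of the whole cluster to win:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    id: int
    current_term: int = 0
    voted_for: Optional[int] = None

def request_vote(voter: Node, term: int, candidate_id: int) -> bool:
    # A node steps into any newer term it sees, forgetting its old vote.
    if term > voter.current_term:
        voter.current_term = term
        voter.voted_for = None
    # It grants at most one vote per term.
    if term == voter.current_term and voter.voted_for in (None, candidate_id):
        voter.voted_for = candidate_id
        return True
    return False

def start_election(candidate: Node, peers: list[Node]) -> bool:
    # Timeout fired: bump the term and vote for yourself...
    candidate.current_term += 1
    candidate.voted_for = candidate.id
    # ...then ask every peer for a vote.
    votes = 1 + sum(request_vote(p, candidate.current_term, candidate.id)
                    for p in peers)
    # Only a majority of the full cluster wins the election, so a node cut
    # off with a minority of its peers can never crown itself leader.
    return votes > (len(peers) + 1) // 2

nodes = [Node(i) for i in range(5)]
print(start_election(nodes[0], nodes[1:]), nodes[0].current_term)  # → True 1
```

In a five-node cluster the candidate needs three votes, including its own, so elections are cheap to trigger but impossible to win from the small side of a partition.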

I was wrong. A week later, the GitHub outage occurred. GitHub was down for 24 hours and 11 minutes (hugs to their Ops team).

To summarize briefly: latency caused a leader to be promoted in another region, which led to inconsistent data and latency for users.

GitHub Outage 2018

I ended my talk last week saying my lead and I were both correct, but leaning towards myself being more correct, based on how Raft should handle leader election. The problem is that ideas and practice are two completely different monsters. Judging from GitHub’s outage report, I was not correct.

Raft at this scale, without private networks with controllable and predictable latency, is on a path to destruction. Were there things that could have prevented this from happening? Maybe, but with replication at this scale across regions, that’s a hell of a problem.

I’m sending this out so that before my talk is released on YouTube, and three people watch it in full, and one of those three calls me out on this, I’ll have a retort. For those of you who do watch my talk: if what happened to GitHub is as they described, it could be seen as a Byzantine failure, given the loss of consensus and the inconsistency between reports. Which is pretty frickin’ cool.