Build Jeeps Not Ferraris
This video dates back to around 2010 from an unknown source but remains my favorite. I give a version of this talk at every company I join. It's nothing new, but if you haven't heard it you need to hear it.
Mean time to recovery is better than mean time between failures for most types of failures. This talk also has some service-oriented architecture, as a way to improve your failure domains, thrown in for good measure.
Also, my car is a 2008 Honda Civic. It is "not a Jeep" either, but it doesn't take a week to fix most problems.
I love this video because it captures one of my favorite concepts in software development: mean time to recovery is better than mean time between failures for most types of failures. There's a whole blog post about this by John Allspaw, one of our industry's leaders on failure. He studies how we fail and how our systems fail. The takeaway I want you to have is that we should be building Jeeps, not Ferraris. (Or Rolls-Royces, from Allspaw's post, which I just realized he stole from Artur Bergman.)
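To put rough numbers on that idea, here is a minimal sketch using the standard steady-state availability formula, MTBF / (MTBF + MTTR). The formula is textbook reliability math rather than something from the video, and the hours below are made up purely for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical numbers, chosen only for illustration.
baseline = availability(mtbf_hours=500.0, mttr_hours=4.0)
doubled_mtbf = availability(mtbf_hours=1000.0, mttr_hours=4.0)  # fail half as often
halved_mttr = availability(mtbf_hours=500.0, mttr_hours=2.0)    # recover twice as fast

print(f"baseline:     {baseline:.5f}")
print(f"doubled MTBF: {doubled_mtbf:.5f}")
print(f"halved MTTR:  {halved_mttr:.5f}")
```

Under this formula, halving recovery time buys the same availability as doubling the time between failures. The case for the Jeep is that shaving recovery time is usually the cheaper and more predictable of the two to achieve.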
What we see in this video is that they're taking a Jeep completely apart and putting it back together in under three and a half minutes. Every single component there is understood. It has a simple interface and it's isolated from the rest of the Jeep. They are able to recover from any kind of failure this Jeep has, as long as they have the spare parts, in less than four minutes. Compare this to, say, a Ferrari, which might be in the shop for weeks to repair critical components. Only specific parts of a Ferrari are designed to be serviced with any speed. Engine issues may take a long time to debug and a long time to service. It also has very specialized parts built to tight tolerances, and many parts that are buried deep inside and difficult to access.
I had to replace the starter in my car (a 2008 Civic) and you gotta take the whole car apart to do it. It doesn't pop out. To replace the starter in that Jeep? It's not going to take you more than three and a half minutes. So, what does a Jeep look like in a distributed systems world?
In our distributed systems world, we have services or interfaces to different areas of concern with very clear boundaries. And this is actually hard to build. Specifically, simple is hard. There's a wonderful talk called Simple Made Easy by Rich Hickey. He pitches that we achieve this by only passing data between these different areas of concern. So our interfaces are purely data and that's simple because data doesn't have behavior.
Data is very easy to understand. Data doesn't care about who's reading it or who wrote it. It just is. And so a service boundary helps us enforce that data separation between concerns by design. It is a very easy way to do it. It also gets us other things. We can have different teams working on different services at different times. You can build something that works, and then different teams can work on improving performance, availability, features, etc., without impacting other parts of your system. And in theory nobody is blocking each other.
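As a minimal sketch of what "interfaces are purely data" can look like, here is a hypothetical message passed between two services. The `PartOrder` type, its fields, and the `encode`/`decode` helpers are all invented for illustration; the only thing that crosses the boundary is a JSON string.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PartOrder:
    """Hypothetical message: plain fields only, no behavior attached."""
    part_id: str
    quantity: int


def encode(order: PartOrder) -> str:
    """Sender side: turn the message into plain JSON text."""
    return json.dumps(asdict(order), sort_keys=True)


def decode(payload: str) -> PartOrder:
    """Receiver side: rebuild the message, knowing nothing about the sender."""
    fields = json.loads(payload)
    return PartOrder(part_id=str(fields["part_id"]), quantity=int(fields["quantity"]))


wire = encode(PartOrder(part_id="starter", quantity=1))
received = decode(wire)
print(wire)
print(received)
```

Neither side needs the other's code, only agreement on the shape of the data. That is what lets one team swap out what sits behind `decode` without the sender noticing.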
It's not that service-oriented architectures are a panacea. It's that they enforce interfaces between components and bring constraints to our systems design that can help us build complex systems with teams of people. (Similar to how type systems can bring constraints that help avoid specific kinds of errors in programming.) These interfaces and constraints are what's important. Not the services themselves. They're important especially when you're growing a company and especially when you're scaling a monolith. That monolith is a Ferrari, and in order to make changes in there, you need to make sure everything is tuned and everything works well in concert. It's hard to make changes because it's hard to test each piece in isolation, as it's not built that way.
Of course, it is also hard to know how everything works in concert when you have services, but the interactions become clearer as their interfaces are strictly enforced. You can model them as SLOs, and you can measure latency, throughput and request size. You're afforded a level of observability that's potentially higher. And even though it's overall more complex (I'm quoted heavily saying "adding the network to a system has never made it easier to understand"), the interactions between teams become simpler. It's easier to organize our design, planning, communication and execution with services. At least it's easier than guessing if the tweak inside the engine block of your Ferrari has the right impact.
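To make the SLO point a bit more concrete, here is a rough sketch of checking measured latencies at a service boundary against a target. The sample data, the 50 ms threshold, and the helper names are all assumptions for illustration, and the percentile uses the simple nearest-rank method rather than whatever your metrics system computes.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_slo(latencies_ms: list[float], pct: int, threshold_ms: float) -> bool:
    """True if the given latency percentile is at or under the threshold."""
    return percentile(latencies_ms, pct) <= threshold_ms


# Hypothetical request latencies (milliseconds) observed at one service boundary.
latencies_ms = [12, 15, 11, 240, 14, 13, 16, 12, 18, 15]

print("p90:", percentile(latencies_ms, 90), "ms")
print("p90 within 50 ms:", meets_slo(latencies_ms, 90, 50))
print("p99 within 50 ms:", meets_slo(latencies_ms, 99, 50))
```

Because the measurement sits on the interface, a team can tell whether a change helped or hurt without opening up anyone else's engine block.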
So I hope this rant is a little elucidating. And if you remember anything, it's that we should be building Jeeps, not Ferraris. And I hope you enjoyed the video.