Mastering the Cloud: A Guide to Distributed Systems

by Samarth Shah, January 7th, 2025


Modern technology relies heavily on distributed systems to achieve scalability, resilience, and always-on availability. But the concept of distributed systems can often overwhelm even the most seasoned engineers. This article explores how conceptual frameworks can help simplify the design and understanding of distributed systems, making them easier to work with.


It is important to think about your system in a conceptual manner - above the code. Thinking conceptually helps engineers predict system behaviors better, troubleshoot issues, and design systems effectively. Think about a complex distributed system as a finely tuned orchestra, if you will. Musicians represent individual components performing their own parts independently but in harmony. The sheet music connects these components (musicians) together. And finally, the conductor makes sure there is synchronization and direction to it all.


In simple words, a Distributed System should:

  • Prevent undesirable outcomes such as split brain.
  • Make progress eventually and converge to a correct state after failures.
  • Scale to meet your application's SLAs.
  • Remain available despite the failure of individual components.
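
One common way to prevent split brain, sketched here under simplifying assumptions, is to let a group of nodes act only while it can reach a strict majority of the cluster. The function name `has_quorum` and the five-node cluster are illustrative and not taken from any particular system.

```python
def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """Return True if the reachable nodes (including self) form a strict majority."""
    return reachable_nodes > cluster_size // 2


# A hypothetical 5-node cluster splits into groups of 3 and 2.
cluster_size = 5
majority_side = has_quorum(3, cluster_size)  # may keep or elect a leader
minority_side = has_quorum(2, cluster_size)  # must stop accepting writes
```

Because two disjoint groups can never both hold a strict majority, at most one side of a partition keeps making decisions.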

Challenges in Distributed Systems

Syncing between local and global state

One of the biggest challenges is keeping each component's local state in sync with the system's global state so that the system stays consistent. Two problems stand out: handling the state inconsistencies that arise during network partitions, and reconciling conflicting updates once components can talk to each other again.
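
As one illustration of reconciling local states into a global one, here is a minimal sketch of a grow-only counter (a simple CRDT). This technique is not named in the article; it is a swapped-in example, and the helper names `increment`, `merge`, and `value` are hypothetical.

```python
from typing import Dict

Counter = Dict[str, int]  # node id -> number of increments seen from that node


def increment(state: Counter, node_id: str) -> Counter:
    """Record one local increment on behalf of node_id."""
    new_state = dict(state)
    new_state[node_id] = new_state.get(node_id, 0) + 1
    return new_state


def merge(a: Counter, b: Counter) -> Counter:
    """Element-wise max: commutative, associative, and idempotent."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}


def value(state: Counter) -> int:
    """The global count is the sum of all per-node counts."""
    return sum(state.values())


# Two replicas diverge during a partition...
replica_a = increment(increment({}, "a"), "a")  # two local increments on a
replica_b = increment({}, "b")                  # one local increment on b
# ...and converge once they exchange state, in either order.
merged = merge(replica_a, replica_b)
```

Since `merge` gives the same result regardless of order or repetition, replicas can exchange state whenever the network allows and still agree in the end.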

The CAP Juggle

The CAP theorem states that a distributed data store can guarantee at most two of three properties at the same time: consistency, availability, and partition tolerance. Because network partitions cannot be ruled out in practice, the real choice usually arises during a partition: either keep serving requests and risk inconsistent data, or refuse some requests to stay consistent. This challenges engineers to make strategic trade-offs based on their system’s priorities and constraints. Imagine trying to juggle three flaming torches (consistency, availability, and partition tolerance) while riding a unicycle across a tightrope: you can only drop one without everything falling apart!
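
The trade-off can be sketched with a toy replica that is configured to favor either consistency or availability while it is cut off from its peers. The `Replica` class and its `"CP"`/`"AP"` modes are hypothetical and only illustrate the choice; real databases expose it in far more nuanced ways.

```python
class Replica:
    def __init__(self, mode: str):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.partitioned = False
        self.data = {}
        self.pending = []  # writes accepted locally but not yet replicated

    def write(self, key, value) -> bool:
        if self.partitioned and self.mode == "CP":
            return False  # refuse the write: stay consistent, give up availability
        self.data[key] = value
        if self.partitioned:
            # AP: accept now, reconcile with peers after the partition heals
            self.pending.append((key, value))
        return True


cp, ap = Replica("CP"), Replica("AP")
cp.partitioned = ap.partitioned = True
cp_accepted = cp.write("x", 1)  # rejected during the partition
ap_accepted = ap.write("x", 1)  # accepted, but may conflict with other replicas
```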


If you’re interested in how enterprise systems deal with such juggling, check out Google Spanner’s example. Google Spanner is a globally distributed database system. Spanner uses TrueTime, a clock API backed by GPS receivers and atomic clocks that reports the current time as an interval with bounded uncertainty, to maintain strong consistency (or “external” consistency) across distributed nodes.


Their clever "time travel" trick allows Spanner to order operations with tightly bounded timestamps, ensuring consistency. Strictly speaking, Spanner gives up availability when a partition occurs, but it is engineered so that partitions are rare and availability stays high in practice. For more entertaining insights, check out the Spanner lecture from MIT’s 6.824 course.
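
The core idea behind the timestamp trick is often described as "commit wait": pick a commit timestamp at the upper edge of the clock's uncertainty interval, then wait until that timestamp is guaranteed to be in the past everywhere. The sketch below is a toy simulation under stated assumptions, not Spanner's implementation; `SimulatedTrueTime`, `commit`, and the millisecond values are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Interval:
    earliest: int
    latest: int


class SimulatedTrueTime:
    """Toy stand-in for a clock that reports bounded uncertainty (in ms)."""

    def __init__(self, epsilon_ms: int):
        self.epsilon_ms = epsilon_ms
        self._true_time_ms = 0  # simulated "real" time

    def now(self) -> Interval:
        return Interval(self._true_time_ms - self.epsilon_ms,
                        self._true_time_ms + self.epsilon_ms)

    def advance(self, ms: int) -> None:
        self._true_time_ms += ms


def commit(tt: SimulatedTrueTime, tick_ms: int = 1) -> Tuple[int, int]:
    """Pick a commit timestamp, then wait until it is definitely in the past."""
    commit_ts = tt.now().latest
    waited_ms = 0
    while tt.now().earliest <= commit_ts:  # commit wait
        tt.advance(tick_ms)                # a real system would sleep here
        waited_ms += tick_ms
    return commit_ts, waited_ms


tt = SimulatedTrueTime(epsilon_ms=4)
commit_ts, waited_ms = commit(tt)
```

In this simulation the wait is roughly twice the clock uncertainty, which is why keeping that uncertainty small matters so much for write latency.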


For any given system, engineers have to prioritize one property over the others depending on what the application demands. There is no one-size-fits-all answer here.

Designing Robust Distributed Systems

So, what are the tricks up our sleeves for the above challenges?

Leverage Multiple Perspectives

Think about the state conceptually, and look at the same system from more than one angle. For example:

  • Visualizing in-flight messages between services makes latencies and failures concrete.
  • Viewing the same messages as state transitions shifts the focus to processing logic.
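
The two perspectives above can be sketched over the same list of messages. The message kinds, the order states, and the helper names `delivery_report` and `apply` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Message:
    kind: str                  # e.g. "OrderPlaced", "PaymentReceived"
    latency_ms: Optional[int]  # None means the message was lost in flight


# View 1: messages in flight. The questions are about delivery.
def delivery_report(messages: List[Message]) -> dict:
    delivered = [m for m in messages if m.latency_ms is not None]
    return {
        "lost": len(messages) - len(delivered),
        "max_latency_ms": max((m.latency_ms for m in delivered), default=0),
    }


# View 2: the same messages as state transitions. The questions are about logic.
TRANSITIONS = {
    ("new", "OrderPlaced"): "pending",
    ("pending", "PaymentReceived"): "paid",
}


def apply(state: str, message: Message) -> str:
    # Unknown (state, kind) pairs leave the state unchanged.
    return TRANSITIONS.get((state, message.kind), state)


messages = [
    Message("OrderPlaced", 12),
    Message("PaymentReceived", None),  # lost in flight
    Message("PaymentReceived", 40),    # retry that arrives
]
report = delivery_report(messages)
state = "new"
for m in messages:
    if m.latency_ms is not None:  # only delivered messages change state
        state = apply(state, m)
```

The first view surfaces the lost message and the slow retry; the second shows that the order still ends up in the expected state.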

Focus on Abstractions (Beyond Code)

As a software engineer, it is often hard to think beyond existing services and existing code. And while prototyping via code helps, abstractions like state machines and consensus algorithms allow engineers to understand broader system dynamics such as deadlocks or race conditions.
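
As a small illustration of reasoning above the code, the sketch below models two processes that each read a shared counter and then write back the value plus one, and enumerates every valid interleaving of their steps. The model and its names (`STEPS`, `valid`, `run`) are hypothetical; the point is that exploring the abstract state space exposes the race without running any real service.

```python
from itertools import permutations

# Each process performs two steps on a shared counter: read it, then write read + 1.
STEPS = [("p1", "read"), ("p1", "write"), ("p2", "read"), ("p2", "write")]


def valid(schedule) -> bool:
    """A process must read before it writes."""
    return all(
        schedule.index((p, "read")) < schedule.index((p, "write"))
        for p in ("p1", "p2")
    )


def run(schedule) -> int:
    """Execute one interleaving and return the final counter value."""
    counter = 0
    local = {}  # value each process last read
    for process, action in schedule:
        if action == "read":
            local[process] = counter
        else:
            counter = local[process] + 1
    return counter


# Collect the final counter value for every valid interleaving.
outcomes = {run(s) for s in permutations(STEPS) if valid(s)}
```

If `outcomes` contains more than one value, some interleavings lose an update, which is exactly the kind of race condition that is hard to spot by reading either process's code in isolation.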

Conclusion

Modern applications demand systems that can handle explosive growth, carefully juggle CAP to meet your goals, and recover from inevitable failures. Distributed systems meet these demands through redundancy and coordination, making them core to cloud computing and large-scale platforms.


If you liked this, please read my other blog where I aim to demystify complex things like Kubernetes.

References

  1. Leslie Lamport, Paxos Made Simple.
  2. Martin Kleppmann, Designing Data-Intensive Applications.