System Design: Difference between revisions
(66 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
=Internal= | =Internal= | ||
* [[Software Engineering]] | * [[Software Engineering]] | ||
* [[System Design Interview Resources]] | |||
=Overview= | =Overview= | ||
The goals of system design is to build software systems that first and foremost are correct, in that the system correctly implements the functions it was built to implement. Additionally, they should aim to maximize reliability, scalability and maintainability. | The goals of system design is to build software systems that first and foremost are correct, in that the system correctly implements the functions it was built to implement. Additionally, they should aim to maximize reliability, scalability and maintainability. | ||
Line 13: | Line 12: | ||
Depending on the actual requirements on the system, these bits and pieces are implemented by different products, which are integrated into the system and stitched together with application code. The exercise of developing the system consists in combining standard building blocks into a structure that provides custom functionality. When you combine several tools to provide a custom service, the service's API hides implementation details from clients, so in effect you create a special-purpose system from general-purpose components, which provides specific functionality and reliability guarantees. Building such system require system design skills, and the result of the process should accomplish the four main [[#System_Design_Goals|system design goals]]. | Depending on the actual requirements on the system, these bits and pieces are implemented by different products, which are integrated into the system and stitched together with application code. The exercise of developing the system consists in combining standard building blocks into a structure that provides custom functionality. When you combine several tools to provide a custom service, the service's API hides implementation details from clients, so in effect you create a special-purpose system from general-purpose components, which provides specific functionality and reliability guarantees. Building such system require system design skills, and the result of the process should accomplish the four main [[#System_Design_Goals|system design goals]]. | ||
::::::[[File:ATypicalSystem.png| | ::::::[[File:ATypicalSystem.png|689px]] | ||
==Clients== | ==Clients== | ||
Line 19: | Line 18: | ||
==Frontend/Backend Communication== | ==Frontend/Backend Communication== | ||
In both cases, the clients send requests over HTTP. The server returns the response as part of the same HTTP request/response pair, either a HTML page to be rendered or [[JSON]]-serialized data. | In both cases, the clients send requests over HTTP. The server returns the response as part of the same HTTP request/response pair, either a HTML page to be rendered or [[JSON]]-serialized data. | ||
==Load Balancer== | |||
A load balancer distributes traffic among a backend pool of services according to its configured load balancing policies. Usually the load balancer communicates with the backend services via private IPs. Load balancers can be used to address [[#Scalability|scalability]] issues, by distributing traffic to a larger service set, and also [[#Reliability|reliability]] issues, by increasing [[#High-Availability|high-availability]]. | |||
==API Gateway== | |||
An API gateway is a fully managed service that supports rate limiting, SSL termination, authentication, IP whitelisting, servicing static content, etc. | |||
==Applications== | ==Applications== | ||
It is worth trying to keep application services stateless, so they can scale horizontally. State can be maintained in the database, and for situation in which the state is transient - such as session state - it can be maintained in a cache or replicated cache. | |||
Once a service instance is created and the state is cached in the service's memory, it is worth to direct requests belonging to the same session to the same service instance. This mechanism is called "sticky session" and to implement it, the [[#Load_Balancer|load balancer]] must cooperate. Alternatively, the services maintain the session state externally in the above mentioned cache, and sticky session becomes unnecessary. | |||
==Database== | ==Database== | ||
The database can be [[Relational_Databases#Overview|relational]] or NoSQL. The | The [[Databases|database]] can be [[Relational_Databases#Overview|relational]] or NoSQL. The following article describes the situations when a NoSQL database may be preferable over a relational database: {{Internal|NoSQL#Overview|NoSQL}} | ||
Special measures need to be taken to protect the database against hardware and other faults, thus improving its [[#Reliability|reliability]] and [[#High-Availability|high-availability]], and also to increase its scalability in presence of higher [[Performance_Concepts#Load|load]]. More details on this subject are available in {{Internal|Databases#Increasing_Database_Reliability_and_Scalability|Increasing Database Reliability and Scalability}} | |||
==Cache== | ==Cache== | ||
A cache is a component that provides temporary storage to store in memory frequently accessed database data or the results of expensive computations. The goal is to speed up subsequent requests that need the same data. The cache tier is much faster than the [[#Database|database]], so using a cache tier improves the system's [[Performance_Concepts#Performance|performance]] and also reduces the [[Performance_Concepts#Load|load]] on the database. The cache tier can be scaled independently. {{Internal|Cache#Overview|Cache}} | |||
==Content Delivery Network (CDN)== | |||
Applications that serve a lot of static content may consider using a CDN. A CDN is a network of geographically dispersed servers optimized to delivery such static content. | |||
==Data Centers== | |||
Multiple data centers allow positioning data and logic close to users. | |||
==Search Index== | ==Search Index== | ||
Full-text search servers: [[Elasticsearch]] and [[Solr]]. | Full-text search servers: [[Elasticsearch]] and [[Solr]]. | ||
==Stream Processing== | ==Stream Processing== | ||
{{Internal| | {{Internal|Stream Processing|Stream Processing}} | ||
Also see: | |||
<span id='Asynchronous_Communication'></span>{{Internal|Asynchronous Communication|Asynchronous Communication}} | |||
==Batch Processing== | ==Batch Processing== | ||
<font color=darkkhaki> | |||
TODO: | |||
* [[Hadoop]] | |||
* [[Spark]] | |||
* [[Flink]] | |||
</font> | |||
=System Design Goals= | =System Design Goals= | ||
==Correctness== | ==Correctness== | ||
The system should behave correctly, providing the expected functionality. Correctness is ensured during application development via '''testing'''. Aside from uncovering functional inconsistencies, testing can discover conditions leading to [[#System_Failure|system failures]] <font color=darkkhaki>(see [https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems by Ding Yuan, Yu Luo]).</font> By deliberately inducing faults it can be ensured that the fault-tolerance machinery is continuously exercised and tested, which can increase the confidence that faults will be handled correctly when they occur naturally. Also see: {{Internal|Software_Testing_Concepts|Software Testing Concepts}} | The system should behave correctly, providing the expected functionality. Correctness in this context has the same semantics as [[Algorithms#Algorithm_Correctness|algorithm correctness]]. Correctness is ensured during application development via '''testing'''. Aside from uncovering functional inconsistencies, testing can discover conditions leading to [[#System_Failure|system failures]] <font color=darkkhaki>(see [https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems by Ding Yuan, Yu Luo]).</font> By deliberately inducing faults it can be ensured that the fault-tolerance machinery is continuously exercised and tested, which can increase the confidence that faults will be handled correctly when they occur naturally. Also see: {{Internal|Software_Testing_Concepts|Software Testing Concepts}} | ||
==Reliability== | ==Reliability== | ||
{{Note|'''Reliability''' describes the capacity of a system to work [[#Correctness|correctly]] at the desired level of performance even in presence of a certain amount hardware and software faults or human error.}} | {{Note|'''Reliability''' describes the capacity of a system to work [[#Correctness|correctly]] at the desired level of performance even in presence of a certain amount hardware and software faults or human error.}} | ||
Reliability implies that the persisted data survives storage faults, | <span id='High-Availability'></span>Reliability implies that the persisted data survives storage faults, but also that the system as a whole remains available in presence of networking or other kinds of faults, so reliability implies '''high-availability'''. | ||
A '''fault''' is defined as '''one''' component of the system deviating from its specification. A <span id='System_Failure'></span>'''failure''' is defined as a system as a whole stopping providing the service. A fault can lead to other faults, a failure or neither. Faults can be caused by hardware, software bugs that make processes crash on bad input, resource leaks followed by resource starvation, operations errors, etc. | A '''fault''' is defined as '''one''' component of the system deviating from its specification. A <span id='System_Failure'></span>'''failure''' is defined as a system as a whole stopping providing the service. A fault can lead to other faults, a failure or neither. Faults can be caused by hardware, software bugs that make processes crash on bad input, resource leaks followed by resource starvation, operations errors, etc. | ||
Line 54: | Line 78: | ||
==Scalability== | ==Scalability== | ||
{{Note| | {{Note|[[Performance_Concepts#Scalability|Scalability]] is a measure of how adding resources affects the [[Performance_Concepts#Performance|performance]] of the system and describes the ability of the system to cope with increased [[Performance_Concepts#Load|load]].}} | ||
As the system grows in data volume, traffic or complexity, there should be a reasonable way of dealing with growth. The most common reason for performance degradation, and a scalability concern, is increased load. | As the system grows in data volume, traffic or complexity, there should be a reasonable way of dealing with growth. The most common reason for performance degradation, and a scalability concern, is increased load. | ||
[[Performance_Concepts#Load|Load]] is a statement of how much stress a system is under and can be described numerically with [[Performance_Concepts#Load_Parameters|load parameters]]. The best choice of parameters depends on the architecture of the system. In case of a web server, an essential load parameter is the number of requests per second. For a database, it could be the ratio of reads to writes. For a cache, it is the miss rate. | [[Performance_Concepts#Load|Load]] is a statement of how much stress a system is under and can be described numerically with [[Performance_Concepts#Load_Parameters|load parameters]]. The best choice of parameters depends on the architecture of the system. In case of a web server, an essential load parameter is the number of requests per second. For a database, it could be the ratio of reads to writes. For a cache, it is the miss rate. | ||
To understand how a specific amount of load affects the system, we need to describe the | To understand how a specific amount of load affects the system, we need to describe the [[Performance_Concepts#Performance|performance]] of the system by collecting and analyzing [[Performance_Concepts#Performance_Metrics|performance metrics]]. For an on-line system, a good performance metric is [[Performance_Concepts#Response_Time|response time]]. For a batch system, [[Performance_Concepts#Throughput|throughput]] is more relevant. More details: {{Internal|Performance Concepts|Performance Concepts}} | ||
Once the load and the performance metrics of a system are understood and measured, you can investigate what happens when the load increases. This can be approached it in two ways: ① How does the performance degrade, as reflected by the performance metrics, if the load, as reflected by load parameters, increases while maintaining the resources of the system (CPUs, memory, network bandwidth) constant. ② What resources do we need to increase to maintain the performance metrics constant under increased load. | |||
Typical solutions to address an increase in load and the associated decrease in performance are: | |||
* '''Vertical scaling''' (scaling up). Vertical scaling means adding more resources to the existing machine and making it more powerful. A system that can run on a single machine is often simpler. However, high end machines can become very expensive and ultimately, vertical scaling has hard limits to how much the system can be extended. | |||
* '''Horizontal scaling''' (scaling out). Horizontal scaling means distributing the load across multiple smaller machines and adding more machines to the pool. The distributed systems are more difficult to design and build, especially if the system requires to distribute services that share state, but they extend beyond the limits of what vertical scaling can achieve. | |||
An architecture that scales well for a particular application is built around assumptions of which operations will be common and which will be rare - the [[Performance_Concepts#Load_Parameters|load parameters]]. | |||
Also, an architecture that is appropriate for one level of load is unlikely to cope with ten times that load, so it is likely that you will need to rethink the architecture on every order of magnitude of load increase. | |||
==Maintainability== | ==Maintainability== | ||
It should be possible to keep working on the system productively as the system evolves over time. | It should be possible to keep working on the system productively as the system evolves over time. | ||
[[#Operability|Operability]], '''simplicity''' and '''evolvability'''. | |||
===Operability=== | |||
Operability includes observability, monitoring, log aggregation and processing, identity management. | |||
=Request Flow= | =Designing Modular Systems= | ||
{{Internal|Designing Modular Systems#Overview|Designing Modular Systems}} | |||
=Request Flow and Data Access Patterns= | |||
<font color=darkkhaki> | <font color=darkkhaki> | ||
Write path | * Write path | ||
* Read path | |||
* Read-to-write ratio. | |||
* Write-intensive system (time-based logs). | |||
* Read-intensive system (user profile) | |||
* Returned data is always unique (search queries) | |||
</font> | </font> | ||
=Operations= | =Operations= | ||
Operations include log management, monitoring, deployment, etc. | |||
{{Internal|Operations#Overview|Operations}} | |||
= | =Capacity Estimation= | ||
{{External|Google Pro Tip: Use Back-Of-The-Envelope-Calculations To Choose The Best Design http://highscalability.com/blog/2011/1/26/google-pro-tip-use-back-of-the-envelope-calculations-to-choo.html}} | |||
{{External|Numbers Everyone Should Know http://highscalability.com/numbers-everyone-should-know}} | |||
{{Internal|Powers_of_2|Powers of 2}} | |||
Back-of-the-envelope calculations are estimates you create using a combination of thought experiments and common performance numbers to a get a good feel for which designs will meet your requirements. | |||
=Reference Systems= | =Reference Systems= | ||
* [[Twitter System Design|Twitter]] | * [[Twitter System Design|Twitter]] | ||
* [[Rate Limiter System Design|Rate Limiter]] | |||
* [[Consistent Hashing System Design|Consistent Hashing]] | |||
=Subjects= | |||
* [[Data Encoding and Evolution]] | |||
* [[Serialization]] | |||
* [[Forward and Backward Compatibility]] | |||
=Organizatorium= | =Organizatorium= | ||
Line 124: | Line 166: | ||
* Michael Jurewitz: The Human Impact of Bugs http://jury.me/blog/2013/3/14/the-human-impact-of-bugs | * Michael Jurewitz: The Human Impact of Bugs http://jury.me/blog/2013/3/14/the-human-impact-of-bugs | ||
* Raffi Krikorian: Timelines at Scale https://www.infoq.com/presentations/Twitter-Timeline-Scalability/ | * Raffi Krikorian: Timelines at Scale https://www.infoq.com/presentations/Twitter-Timeline-Scalability/ | ||
* Active-Active for Multi-Regional Resiliency Active-Active for Multi-Regional Resiliency by Ruslan Meshenberg, Naresh Gopalani, and Luke Kosewski (Netflix Technology Blog) https://netflixtechblog.com/active-active-for-multi-regional-resiliency-c47719f6685b | |||
* https://github.com/donnemartin/system-design-primer | |||
</font> | </font> |
Latest revision as of 23:37, 8 May 2024
Internal
Overview
The goals of system design is to build software systems that first and foremost are correct, in that the system correctly implements the functions it was built to implement. Additionally, they should aim to maximize reliability, scalability and maintainability.
A Typical System
Many web and mobile applications in use today are conceptually similar to the generic system described below. The functionality differs, as well as reliability, scalability and maintainability requirements, but most systems share at least several of the elements described here.
The applications have mobile and browser clients that communicate with the backend via HTTP or WebSocket protocols. They include a business logic layer deployed as a monolith or a set of microservices, in most cases in containers managed by a container orchestration system like Kubernetes. They persist their data in databases, either relational or NoSQL. They may need to remember results of expensive operations to speed up reads, so they can use caches for that. In case they need to allow users to search data by keyword, search indexes are available and can be integrated. They may need to rely on message brokers or streaming systems for asynchronous processing and better decoupling. If they need to periodically process large amounts of accumulated data, they can use batch processing tools.
Depending on the actual requirements on the system, these bits and pieces are implemented by different products, which are integrated into the system and stitched together with application code. The exercise of developing the system consists in combining standard building blocks into a structure that provides custom functionality. When you combine several tools to provide a custom service, the service's API hides implementation details from clients, so in effect you create a special-purpose system from general-purpose components, which provides specific functionality and reliability guarantees. Building such system require system design skills, and the result of the process should accomplish the four main system design goals.
Clients
The clients can be web applications or mobile applications. A web application uses a combination of server-side logic written in a language as Java or Python and packaged and deployed as containerized services to handle business logic and storage, and a client-side language as HTML or JavaScript for presentation. A mobile application uses the same type of backend infrastructure, but a device-specific language to implement the client.
Frontend/Backend Communication
In both cases, the clients send requests over HTTP. The server returns the response as part of the same HTTP request/response pair, either a HTML page to be rendered or JSON-serialized data.
Load Balancer
A load balancer distributes traffic among a backend pool of services according to its configured load balancing policies. Usually the load balancer communicates with the backend services via private IPs. Load balancers can be used to address scalability issues, by distributing traffic to a larger service set, and also reliability issues, by increasing high-availability.
API Gateway
An API gateway is a fully managed service that supports rate limiting, SSL termination, authentication, IP whitelisting, servicing static content, etc.
Applications
It is worth trying to keep application services stateless, so they can scale horizontally. State can be maintained in the database, and for situation in which the state is transient - such as session state - it can be maintained in a cache or replicated cache.
Once a service instance is created and the state is cached in the service's memory, it is worth to direct requests belonging to the same session to the same service instance. This mechanism is called "sticky session" and to implement it, the load balancer must cooperate. Alternatively, the services maintain the session state externally in the above mentioned cache, and sticky session becomes unnecessary.
Database
The database can be relational or NoSQL. The following article describes the situations when a NoSQL database may be preferable over a relational database:
Special measures need to be taken to protect the database against hardware and other faults, thus improving its reliability and high-availability, and also to increase its scalability in presence of higher load. More details on this subject are available in
Cache
A cache is a component that provides temporary storage to store in memory frequently accessed database data or the results of expensive computations. The goal is to speed up subsequent requests that need the same data. The cache tier is much faster than the database, so using a cache tier improves the system's performance and also reduces the load on the database. The cache tier can be scaled independently.
Content Delivery Network (CDN)
Applications that serve a lot of static content may consider using a CDN. A CDN is a network of geographically dispersed servers optimized to delivery such static content.
Data Centers
Multiple data centers allow positioning data and logic close to users.
Search Index
Full-text search servers: Elasticsearch and Solr.
Stream Processing
Also see:
Batch Processing
TODO:
System Design Goals
Correctness
The system should behave correctly, providing the expected functionality. Correctness in this context has the same semantics as algorithm correctness. Correctness is ensured during application development via testing. Aside from uncovering functional inconsistencies, testing can discover conditions leading to system failures (see Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems by Ding Yuan, Yu Luo). By deliberately inducing faults it can be ensured that the fault-tolerance machinery is continuously exercised and tested, which can increase the confidence that faults will be handled correctly when they occur naturally. Also see:
Reliability
Reliability describes the capacity of a system to work correctly at the desired level of performance even in presence of a certain amount hardware and software faults or human error.
Reliability implies that the persisted data survives storage faults, but also that the system as a whole remains available in presence of networking or other kinds of faults, so reliability implies high-availability.
A fault is defined as one component of the system deviating from its specification. A failure is defined as a system as a whole stopping providing the service. A fault can lead to other faults, a failure or neither. Faults can be caused by hardware, software bugs that make processes crash on bad input, resource leaks followed by resource starvation, operations errors, etc.
It is impossible to reduce the probability of a fault to zero, therefore it is usually best to design mechanism that prevent faults from causing failures. A system that experiences faults may continue to provide its service, that is, not fail. Such a system is said to be fault tolerant or resilient. The observable effect of a fault at the system boundary is called a symptom. The most extreme symptom of a fault is failure, but it might also be something as benign as a high reading on a temperature gauge. For more terminology see A Conceptual Framework for System Fault Tolerance by Walter L. Heimerdinger and Charles B. Weinstock.
System design includes techniques for building reliable systems from unreliable parts. Depending on the type of faults, these techniques include:
- Add redundancy to individual hardware components to prevent hardware faults. Example: RAID disks, dual power supplies and hot swappable CPUs.
- Add software redundancy systems that can make the loss of entire machines tolerable.
- Carefully think about assumptions and interactions in the system to reduce the probability of defects in software.
- Automatically test the software.
- Design the systems in a way that minimizes opportunities for human error during operations. For example, well-designed abstractions, APIs and user interfaces should make it easy to do "the right thing" and discourage "the wrong thing". However, if the interfaces are too restrictive, the people will work around them, negating the benefit.
- Setup monitoring that alerts the operator on faults.
- Put in place reliable deployment operations that allow quick rollback and gradual release.
Scalability
Scalability is a measure of how adding resources affects the performance of the system and describes the ability of the system to cope with increased load.
As the system grows in data volume, traffic or complexity, there should be a reasonable way of dealing with growth. The most common reason for performance degradation, and a scalability concern, is increased load.
Load is a statement of how much stress a system is under and can be described numerically with load parameters. The best choice of parameters depends on the architecture of the system. In case of a web server, an essential load parameter is the number of requests per second. For a database, it could be the ratio of reads to writes. For a cache, it is the miss rate.
To understand how a specific amount of load affects the system, we need to describe the performance of the system by collecting and analyzing performance metrics. For an on-line system, a good performance metric is response time. For a batch system, throughput is more relevant. More details:
Once the load and the performance metrics of a system are understood and measured, you can investigate what happens when the load increases. This can be approached it in two ways: ① How does the performance degrade, as reflected by the performance metrics, if the load, as reflected by load parameters, increases while maintaining the resources of the system (CPUs, memory, network bandwidth) constant. ② What resources do we need to increase to maintain the performance metrics constant under increased load.
Typical solutions to address an increase in load and the associated decrease in performance are:
- Vertical scaling (scaling up). Vertical scaling means adding more resources to the existing machine and making it more powerful. A system that can run on a single machine is often simpler. However, high end machines can become very expensive and ultimately, vertical scaling has hard limits to how much the system can be extended.
- Horizontal scaling (scaling out). Horizontal scaling means distributing the load across multiple smaller machines and adding more machines to the pool. The distributed systems are more difficult to design and build, especially if the system requires to distribute services that share state, but they extend beyond the limits of what vertical scaling can achieve.
An architecture that scales well for a particular application is built around assumptions of which operations will be common and which will be rare - the load parameters.
Also, an architecture that is appropriate for one level of load is unlikely to cope with ten times that load, so it is likely that you will need to rethink the architecture on every order of magnitude of load increase.
Maintainability
It should be possible to keep working on the system productively as the system evolves over time.
Operability, simplicity and evolvability.
Operability
Operability includes observability, monitoring, log aggregation and processing, identity management.
Designing Modular Systems
Request Flow and Data Access Patterns
- Write path
- Read path
- Read-to-write ratio.
- Write-intensive system (time-based logs).
- Read-intensive system (user profile)
- Returned data is always unique (search queries)
Operations
Operations include log management, monitoring, deployment, etc.
Capacity Estimation
- Google Pro Tip: Use Back-Of-The-Envelope-Calculations To Choose The Best Design http://highscalability.com/blog/2011/1/26/google-pro-tip-use-back-of-the-envelope-calculations-to-choo.html
- Numbers Everyone Should Know http://highscalability.com/numbers-everyone-should-know
Back-of-the-envelope calculations are estimates you create using a combination of thought experiments and common performance numbers to a get a good feel for which designs will meet your requirements.
Reference Systems
Subjects
Organizatorium
- Process and redistribute: Distributed Systems
- Clean Architecture https://www.amazon.com/Clean-Architecture-Craftsmans-Software-Structure/dp/0134494164/
- https://medium.com/@i.gorton/six-rules-of-thumb-for-scaling-software-architectures-a831960414f9
- http://highscalability.com
- Harvard Scalability Class David Malan https://www.youtube.com/watch?v=-W9F__D3oY4
- https://www.lecloud.net/post/7295452622/scalability-for-dummies-part-1-clones
- https://www.lecloud.net/post/7994751381/scalability-for-dummies-part-2-database
- https://www.lecloud.net/post/9246290032/scalability-for-dummies-part-3-cache
- https://www.lecloud.net/post/9699762917/scalability-for-dummies-part-4-asynchronism
- https://www.hiredintech.com/classrooms/system-design/lesson/61
- Learning/System Design/*.pdf
- https://github.com/checkcheckzz/system-design-interview
- https://github.com/donnemartin/system-design-primer
- https://github.com/mmcgrana/services-engineering
- https://github.com/orrsella/soft-eng-interview-prep/blob/master/topics/system-architecture.md
- https://queue.acm.org/detail.cfm?id=3480470
- https://blog.pramp.com/how-to-succeed-in-a-system-design-interview-27b35de0df26
- https://developers.redhat.com/articles/2021/09/21/distributed-transaction-patterns-microservices-compared
- State Transition Diagram
- Schemaless: Fowler https://martinfowler.com/articles/schemaless/
Papers and Articles Referred from "Designing Data-intensive Applications"
- “One Size Fits All”: An Idea Whose Time Has Come and Gone https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.68.9136&rep=rep1&type=pdf
- Yury Izrailevsky and Ariel Tseitlin: The Netflix Simian Army https://netflixtechblog.com/the-netflix-simian-army-16e57fbab116
- Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, et al. "What Bugs Live in the Cloud?" https://ucare.cs.uchicago.edu/pdf/socc14-cbs.pdf
- Richard Cook: How complex systems fail https://www.researchgate.net/publication/228797158_How_complex_systems_fail
- Jay Kreps: Getting Real About Distributed System Reliability https://blog.empathybox.com/post/19574936361/getting-real-about-distributed-system-reliability
- Nathan Marz: Principles of Software Engineering, Part 1 http://nathanmarz.com/blog/principles-of-software-engineering-part-1.html
- Michael Jurewitz: The Human Impact of Bugs http://jury.me/blog/2013/3/14/the-human-impact-of-bugs
- Raffi Krikorian: Timelines at Scale https://www.infoq.com/presentations/Twitter-Timeline-Scalability/
- Active-Active for Multi-Regional Resiliency Active-Active for Multi-Regional Resiliency by Ruslan Meshenberg, Naresh Gopalani, and Luke Kosewski (Netflix Technology Blog) https://netflixtechblog.com/active-active-for-multi-regional-resiliency-c47719f6685b
- https://github.com/donnemartin/system-design-primer