External

Episode 06 Unqualified Engineer - Jackson Gabbard: Intro to Architecture and Systems Design Interviews https://www.youtube.com/watch?v=ZgdS0EUmn70

Internal

Software Engineering

Overview

The goal of system design is to build software systems that are, first and foremost, correct: the system correctly implements the functions it was built to implement. Beyond correctness, the design should aim to maximize reliability, scalability and maintainability.

A Typical System

Many web and mobile applications in use today are conceptually similar to the generic system described below. The functionality differs, as well as reliability, scalability and maintainability requirements, but most systems share at least several of the elements described here.

The applications have mobile and browser clients that communicate with the backend via the HTTP or WebSocket protocols. They include a business logic layer deployed as a monolith or as a set of microservices, in most cases in containers managed by a container orchestration system like Kubernetes. They persist their data in databases, either relational or NoSQL. To speed up reads, they may use caches to remember the results of expensive operations. If users need to search data by keyword, search indexes can be integrated. They may rely on message brokers or streaming systems for asynchronous processing and better decoupling. If they need to periodically process large amounts of accumulated data, they can use batch processing tools.

Depending on the actual requirements of the system, these pieces are implemented by different products, which are integrated into the system and stitched together with application code. Developing the system consists of combining standard building blocks into a structure that provides custom functionality. When you combine several tools to provide a custom service, the service's API hides implementation details from clients, so in effect you create a special-purpose system from general-purpose components, one that provides specific functionality and reliability guarantees. Building such a system requires system design skills, and the result should meet the four main system design goals.
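As a minimal sketch of how a service API can hide its building blocks, the hypothetical ProfileService below combines a primary store and a cache behind two methods. All names (SlowStore, ProfileService, get_profile, update_profile) are illustrative assumptions, and in-memory dictionaries stand in for a real database and a real cache.

```python
from typing import Dict, Optional


class SlowStore:
    """Stands in for a database; counts reads so the cache effect is visible."""

    def __init__(self) -> None:
        self._rows: Dict[str, dict] = {}
        self.reads = 0

    def read(self, key: str) -> Optional[dict]:
        self.reads += 1
        return self._rows.get(key)

    def write(self, key: str, value: dict) -> None:
        self._rows[key] = value


class ProfileService:
    """Hypothetical special-purpose service built from two generic parts."""

    def __init__(self, store: SlowStore) -> None:
        self._store = store
        self._cache: Dict[str, dict] = {}  # stands in for an external cache

    def get_profile(self, user_id: str) -> Optional[dict]:
        # Read path: serve from the cache, fall back to the store on a miss.
        if user_id in self._cache:
            return self._cache[user_id]
        profile = self._store.read(user_id)
        if profile is not None:
            self._cache[user_id] = profile
        return profile

    def update_profile(self, user_id: str, profile: dict) -> None:
        # Write path: persist first, then invalidate the cached copy.
        self._store.write(user_id, profile)
        self._cache.pop(user_id, None)
```

Clients only see get_profile and update_profile. Whether a read was served from the cache or from the store is an implementation detail, which is what allows the components behind the API to be replaced later.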

Clients

The clients can be web applications or mobile applications. A web application combines server-side logic, written in a language such as Java or Python and packaged and deployed as containerized services that handle business logic and storage, with client-side languages such as HTML and JavaScript for presentation. A mobile application uses the same type of backend infrastructure, but a device-specific language to implement the client.

Frontend/Backend Communication

In both cases, the clients send requests over HTTP. The server returns the response as part of the same HTTP request/response pair, either as an HTML page to be rendered or as JSON-serialized data.
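The two response shapes can be sketched with a single handler function. This is an illustration only: build_response is a hypothetical name, and choosing the format from the request's Accept header is an assumption, not something this page prescribes.

```python
import html
import json
from typing import Dict, Tuple


def build_response(accept: str, data: Dict[str, str]) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, body) for one request/response pair."""
    if "application/json" in accept:
        # API-style client: send the data itself, JSON-serialized.
        body = json.dumps(data).encode("utf-8")
        content_type = "application/json"
    else:
        # Browser-style client: send an HTML page ready to be rendered.
        items = "".join(
            f"<li>{html.escape(key)}: {html.escape(value)}</li>"
            for key, value in data.items()
        )
        body = f"<html><body><ul>{items}</ul></body></html>".encode("utf-8")
        content_type = "text/html; charset=utf-8"
    headers = {"Content-Type": content_type, "Content-Length": str(len(body))}
    return 200, headers, body
```

A mobile client would typically request the JSON form and render it natively, while a browser that navigates to the page would receive the HTML form.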

Applications

Database

The database can be relational or NoSQL. The NoSQL article describes why, in specific cases, a NoSQL database may be preferable to a relational database.

Cache

Search Index

Full-text search servers: Elasticsearch and Solr.

Stream Processing

Message_Brokers_and_Stream_Processing_TO_REFACTOR

Batch Processing

Dataflow

Request flow. Write path. Read path.

System Design Goals

Correctness

The system should behave correctly, providing the expected functionality. Correctness is ensured during application development via testing. Aside from uncovering functional inconsistencies, testing can discover conditions leading to system failures (see Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems by Ding Yuan, Yu Luo, et al.). Deliberately inducing faults ensures that the fault-tolerance machinery is continuously exercised and tested, which increases confidence that faults will be handled correctly when they occur naturally. Also see:
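Deliberate fault induction can be sketched as a wrapper that makes a call fail with a configurable probability, so that the retry logic around it is exercised even when nothing is actually broken. The names InjectedFault, with_fault_injection and call_with_retries are hypothetical, and a real setup would inject faults at the infrastructure level, not inside the process.

```python
import random
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class InjectedFault(Exception):
    """Raised only by the fault injector, never by real code."""


def with_fault_injection(
    func: Callable[..., T], fault_rate: float, rng: random.Random
) -> Callable[..., T]:
    """Wrap func so that a fraction fault_rate of calls raise InjectedFault."""

    def wrapper(*args, **kwargs):
        if rng.random() < fault_rate:
            raise InjectedFault("deliberately induced fault")
        return func(*args, **kwargs)

    return wrapper


def call_with_retries(func: Callable[[], T], attempts: int) -> T:
    """The fault-tolerance machinery under test: retry up to attempts times."""
    last_error: Optional[Exception] = None
    for _ in range(attempts):
        try:
            return func()
        except Exception as error:
            last_error = error
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error
```

Passing the random number generator in as a parameter keeps the sketch testable: a test can supply a seeded or scripted generator and assert that the retry path was actually taken.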

Software Testing Concepts

Reliability

Reliability means that the system should continue to work correctly, at the desired level of performance, even in the presence of a certain amount of hardware faults, software faults or human error. Reliability implies that the persisted data survives storage faults and that the system remains available, so reliability implies high availability.

A fault is defined as one component of the system deviating from its specification. A failure is defined as the system as a whole ceasing to provide its service. A fault can lead to other faults, to a failure, or to neither. Faults can be caused by hardware, by software bugs that make processes crash on bad input, by resource leaks followed by resource starvation, or by operational errors.

It is impossible to reduce the probability of a fault to zero, so it is usually best to design mechanisms that prevent faults from causing failures. A system that experiences faults may continue to provide its service, that is, not fail. Such a system is said to be fault tolerant or resilient. The observable effect of a fault at the system boundary is called a symptom. The most extreme symptom of a fault is failure, but a symptom might also be something as benign as a high reading on a temperature gauge. For more terminology see A Conceptual Framework for System Fault Tolerance by Walter L. Heimerdinger and Charles B. Weinstock.
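The distinction between a fault and a failure can be illustrated with a read that fails over between replicas. This is a sketch under assumptions: ReplicaFault, ServiceFailure, read_with_failover and the two stand-in replicas are hypothetical names, and each replica is modeled as a plain function.

```python
from typing import Callable, List, Sequence


class ReplicaFault(Exception):
    """One component (a replica) deviating from its specification."""


class ServiceFailure(Exception):
    """The system as a whole can no longer provide the service."""


def read_with_failover(replicas: Sequence[Callable[[str], str]], key: str) -> str:
    """Try each replica in turn; only fail if every one of them faults."""
    faults: List[ReplicaFault] = []
    for replica in replicas:
        try:
            return replica(key)
        except ReplicaFault as fault:
            # A fault in one replica is tolerated: record it and move on.
            faults.append(fault)
    raise ServiceFailure(f"all {len(faults)} replicas faulted for key {key!r}")


def healthy_replica(key: str) -> str:
    return f"value-of-{key}"


def faulty_replica(key: str) -> str:
    raise ReplicaFault("simulated disk error")
```

With one faulty and one healthy replica the caller still gets a value, so the fault never becomes a failure. Only when every replica faults does the error surface at the system boundary.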

System design includes techniques for building reliable systems from unreliable parts. Depending on the type of faults, these techniques include:

Add redundancy to individual hardware components to tolerate hardware faults. Examples: RAID disks, dual power supplies and hot-swappable CPUs.
Add redundancy at the software level, so that the loss of entire machines becomes tolerable.
Carefully think about assumptions and interactions in the system to reduce the probability of defects in software.
Automatically test software.
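The effect of redundancy can be estimated with a simple calculation: if each of n redundant components is available with probability a, and the service works as long as at least one of them works, the combined availability is 1 - (1 - a)^n. The function name below is hypothetical, and the formula assumes that components fail independently, which correlated faults (a shared rack, a shared software bug) violate.

```python
def combined_availability(component_availability: float, replicas: int) -> float:
    """Availability of `replicas` redundant components, assuming independent faults."""
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("component_availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The service is unavailable only when every replica is down at once.
    return 1.0 - (1.0 - component_availability) ** replicas
```

For example, two replicas that are each available 99% of the time give a combined availability of about 99.99% under the independence assumption.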

Scalability

As the system grows in data volume, traffic or complexity, there should be a reasonable way of dealing with growth.

Maintainability

It should be possible to keep working on the system productively as the system evolves over time.

Operations

Logs

Monitoring

Deployment

Capacity Estimation

Organizatorium

Papers and Articles Referred from "Designing Data-intensive Applications"

Michael Stonebraker and Uğur Çetintemel: “One Size Fits All”: An Idea Whose Time Has Come and Gone https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.68.9136&rep=rep1&type=pdf
Yury Izrailevsky and Ariel Tseitlin: The Netflix Simian Army https://netflixtechblog.com/the-netflix-simian-army-16e57fbab116
Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, et al.: What Bugs Live in the Cloud? https://ucare.cs.uchicago.edu/pdf/socc14-cbs.pdf
Richard Cook: How Complex Systems Fail https://www.researchgate.net/publication/228797158_How_complex_systems_fail
Jay Kreps: Getting Real About Distributed System Reliability https://blog.empathybox.com/post/19574936361/getting-real-about-distributed-system-reliability

System Design
