Performance Concepts
Internal
Load
Load is a statement of how much stress a system is under. Load can be numerically described with load parameters.
Load Parameters
A load parameter is a numerical representation of a system's load. In the case of a web server, an essential load parameter is the number of requests per second. For a database, it could be the ratio of reads to writes. For a cache, it is the miss rate.
Understanding the load parameters of a specific system is important during the system design phase. An architecture that scales well for a particular application is built around assumptions about load parameters - which operations will be common and which will be rare.
Performance
The performance of the system is described by performance metrics.
Performance Metrics
Response Time
The response time is the time between a client sending a request and receiving a response. It includes the time the request travels over the network from the client to the backend, the time the request waits in the backend queue, the service time, and the time the response takes to travel back to the client. Some monitoring systems describe the response time as only the time the backend takes to process the request; in that case the network travel time is not accounted for. Response time and latency are sometimes used interchangeably.
The response time is relevant for online systems, such as a website or a mobile application backend.
A single response time value is not that informative; it makes more sense to think of response time as a distribution of values that can be measured. In a system that works well, most requests over a given time interval are reasonably fast, but there are occasional outliers that take much longer. This can happen because the requests in question are intrinsically more expensive, but the additional latency can also be introduced by infrastructure-related factors: context switches, TCP packet loss and retransmission, garbage collection pauses, page faults, etc.
Average Response Time
The arithmetic mean: given n request values, add up all the values and divide by n. This is not a very good metric because it does not reflect the "typical" response time; it does not tell you how many users actually experienced a given delay.
Median Response Time
The median response time for an interval is the response time of the request for which 50% of the requests are faster, and 50% of the requests are slower. The median is also known as the 50th percentile.
Response Time Percentiles
The nth percentile, or quantile (e.g. the 99th, abbreviated "p99"), is the response time threshold at which n% (99%) of requests are faster than the threshold (and (100-n)% are slower). See DDIA Ch. 1, Reliable, Scalable and Maintainable Applications → Scalability → Describing Performance. Also see Percentiles below.
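The difference between the mean, the median, and a high percentile can be sketched with made-up response time values; the single slow outlier pulls the mean far away from what a typical user experienced, while the tail percentile exposes it:

```python
# Sketch: why the mean hides the typical experience while percentiles
# expose the tail. The response times below are made-up values in ms.
import statistics

response_times_ms = [12, 14, 15, 15, 16, 17, 18, 20, 22, 950]  # one outlier

mean = statistics.mean(response_times_ms)      # pulled up by the outlier
median = statistics.median(response_times_ms)  # p50: what a typical user saw

def percentile(values, p):
    """Nearest-rank percentile: the value below which p% of samples fall."""
    ordered = sorted(values)
    rank = max(0, round(p / 100 * len(ordered)) - 1)
    return ordered[rank]

print(mean)                               # 109.9 - unlike any actual request
print(median)                             # 16.5
print(percentile(response_times_ms, 99))  # 950 - the outlier dominates p99
```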
Latency
Latency is the minimum time required to get any form of response, even if the work to be done is nonexistent (Martin Fowler, Patterns of Enterprise Application Architecture). Latency is relevant in remote systems because it includes the time the request and response take to make their way across the wire, and also the time the request spends waiting to be handled on the backend - during which it is latent, awaiting service. Latency and response time are often used synonymously for the length of time it takes for something interesting to happen, while some authors argue that they are not synonymous (DDIA).
Standard deviation is not a meaningful statistic for a latency dataset: latency distributions are heavily skewed, not normal, so the standard deviation does not describe them. Latency must also be measured in the context of load; measuring latency without load is misleading.
To Process:
- Everything You Know About Latency Is Wrong by Tyler Treat https://bravenewgeek.com/everything-you-know-about-latency-is-wrong/
- How NOT to Measure Latency https://www.infoq.com/presentations/latency-response-time
- Define saturation.
- Identify where the saturation point is. Don't run a system at saturation or over.
Throughput
Throughput is the rate at which something can be produced, consumed, or processed per unit of time. Throughput is usually relevant for batch processing systems, such as Hadoop, where it describes the number of records that can be processed per second.
Scalability
Scalability is a measure of how adding resources (usually hardware) affects performance; it describes the ability of a system to cope with increased load.
Percentiles
The nth percentile, or quantile (ex: 99th, abbreviated "p99") is the value of the performance metric threshold at which n% (99%) of the measurements are better than the particular threshold (and (100-n)% are worse).
Averaging percentiles - whether to reduce time resolution or to combine data from several machines - is mathematically meaningless. The right way to aggregate performance metric data is to add the histograms (see https://www.vividcortex.com/blog/why-percentiles-dont-work-the-way-you-think).
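Adding histograms can be sketched as summing per-bucket counts and then reading the percentile off the merged histogram. The bucket layout below (bucket upper bounds in ms) is an assumption for illustration:

```python
# Sketch: aggregate latency data from several machines by summing histogram
# bucket counts, then compute the percentile on the merged histogram -
# instead of averaging per-machine percentiles. Bucket bounds are made up.
from collections import Counter

# Each machine reports request counts per latency bucket (upper bound, ms).
machine_a = Counter({10: 900, 50: 80, 100: 15, 1000: 5})
machine_b = Counter({10: 400, 50: 500, 100: 90, 1000: 10})

merged = machine_a + machine_b  # bucket-wise addition is always valid

def histogram_percentile(hist, p):
    """Return the bucket upper bound that contains the p-th percentile."""
    total = sum(hist.values())
    threshold = p / 100 * total
    cumulative = 0
    for upper_bound in sorted(hist):
        cumulative += hist[upper_bound]
        if cumulative >= threshold:
            return upper_bound

print(histogram_percentile(merged, 99))  # 100: the bucket holding p99
```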
A naive implementation of a percentile computation algorithm is to maintain a list of all performance metric readings for a time window and sort the list periodically. Better algorithms are:
- Forward decay (http://dimacs.rutgers.edu/~graham/pubs/papers/fwddecay.pdf)
- T-digest (https://github.com/tdunning/t-digest)
- HdrHistogram (http://www.hdrhistogram.org)
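The naive approach described above - keep every reading for the current window and sort on demand - can be sketched as follows; the listed algorithms (forward decay, t-digest, HdrHistogram) replace exactly this at scale:

```python
# Naive windowed percentile computation: a bounded buffer of raw readings,
# sorted on every query. Fine for small windows; O(n log n) per query is
# the cost the smarter algorithms avoid.
from collections import deque

class WindowedPercentiles:
    def __init__(self, max_readings=10_000):
        # Bounded buffer: the oldest readings fall off as new ones arrive.
        self.readings = deque(maxlen=max_readings)

    def record(self, value):
        self.readings.append(value)

    def percentile(self, p):
        ordered = sorted(self.readings)
        index = min(len(ordered) - 1, int(p / 100 * len(ordered)))
        return ordered[index]

w = WindowedPercentiles()
for ms in [5, 7, 8, 9, 11, 13, 14, 20, 35, 250]:
    w.record(ms)
print(w.percentile(50))  # 13
print(w.percentile(99))  # 250
```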
Queueing Theory
TODO:
- https://en.wikipedia.org/wiki/Queueing_theory
- Response Time in Queueing Theory.
- Service Time in Queueing Theory.
Response time and service time diverge as the system approaches saturation: response time is the service time plus the time spent waiting in the queue, and the queueing component grows rapidly under heavy load.
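This divergence can be illustrated with a small single-server queue simulation (an assumption for illustration, not a general queueing model): at low utilization the response time is close to the service time, while near saturation the waiting component dominates.

```python
# Single-server queue with Poisson arrivals and a fixed service time.
# Response time = queue wait + service time; the wait term explodes as
# utilization approaches 1.
import random

def simulate(arrival_rate, service_time=1.0, n=50_000, seed=42):
    random.seed(seed)
    clock = 0.0           # arrival time of the current request
    server_free_at = 0.0  # when the server finishes its current work
    total_response = 0.0
    for _ in range(n):
        clock += random.expovariate(arrival_rate)  # Poisson arrivals
        start = max(clock, server_free_at)         # wait if server is busy
        server_free_at = start + service_time
        total_response += server_free_at - clock   # wait + service
    return total_response / n

print(simulate(arrival_rate=0.5))   # 50% utilization: small wait component
print(simulate(arrival_rate=0.95))  # near saturation: wait dominates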
Organizatorium
- xth percentile (quantiles) - the value of the performance parameter at which x% of the requests are better; https://www.vividcortex.com/blog/why-percentiles-dont-work-the-way-you-think
- Tail latency amplification. See: Jeffrey Dean and Luiz André Barroso: "The Tail at Scale" https://cacm.acm.org/magazines/2013/2/160173-the-tail-at-scale/fulltext
- Don't censor bad data, don't throw away data selectively.
- Never average percentiles.
- Coordinated omission. Coordinated omission usually makes something that you think is a response time metric actually represent only the service time component.
- Response Time in Queueing Theory.
- Service Time in Queueing Theory.
Load Generators
The load generators need to keep sending requests independently of the response time. If the client waits for the previous request to complete before sending the next one, that behavior artificially keeps the queues shorter in the test than they would be in reality, which skews the measurement.
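An open-model load generator along these lines can be sketched as follows; `send_request` is a hypothetical placeholder for the real client call:

```python
# Open-model load generator sketch: requests fire on a fixed schedule,
# never waiting for earlier responses to complete.
import threading
import time

def send_request(i):
    # Hypothetical stand-in for a real (possibly slow) backend call.
    time.sleep(0.05)

def open_loop_load(rate_per_sec, duration_sec):
    """Fire requests at a fixed rate, regardless of response times."""
    interval = 1.0 / rate_per_sec
    start = time.monotonic()
    next_send = start
    threads = []
    sent = 0
    while next_send < start + duration_sec:
        # Sleep until the scheduled send time - not until the last response.
        time.sleep(max(0.0, next_send - time.monotonic()))
        t = threading.Thread(target=send_request, args=(sent,))
        t.start()
        threads.append(t)
        sent += 1
        # Advance the schedule from the plan, not from "now", so a slow
        # response cannot stretch the inter-arrival gap.
        next_send += interval
    for t in threads:
        t.join()
    return sent

sent = open_loop_load(rate_per_sec=20, duration_sec=0.5)
print(sent)  # 10: one request every 50 ms for half a second
```

Scheduling each send from the original plan (`next_send += interval`) rather than from the current time is what avoids coordinated omission: a slow backend cannot silently lower the offered load.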