Performance Concepts: Difference between revisions

From NovaOrdis Knowledge Base
 
(53 intermediate revisions by the same user not shown)
Line 1: Line 1:
=External=
* Gil Tene on How NOT to Measure Latency https://www.infoq.com/presentations/latency-response-time https://www.youtube.com/watch?v=lJ8ydIuPFeU
=Internal=
* [[Performance#Subjects|Performance]]
* [[Statistical Concepts]]
* [[System_Design#Scalability|System Design]]
=Load=
* [[Software_Testing_Concepts#Load|Software Testing Concepts]]
Load is a statement of how much stress a system is under. Load can be numerically described with [[#Load_Parameters|load parameters]].
==Load Parameters==
A load parameter is a numerical representation of a system's [[#Load|load]]. For example, in case of a web server, an essential load parameter is the number of requests per second. For a database, it could be the ratio of reads to writes. For a cache, it is the miss rate.
 
Understanding load parameters of a specific system is important during the [[System_Design#Scalability|system design phase]]. An architecture that scales well for a particular application is built around assumptions on load parameters - which operations will be common and which will be rare.


=Performance=
The performance of the system is described by [[#Performance_Metrics|performance metrics]].
==Performance Metrics==
===Response Time===
===<span id='Resources'></span>Resource Consumption===
The '''response time''' is the time between a client sending a request and receiving a response. It includes the time the request travels over the network from the client to the backend, the time the request is awaiting service in the backend queue, the service time and the time the response takes to travel back to the client. Some monitoring systems describe the request time as the time the backend takes to process the request, and in this case the travel time is not accounted for. Response time and [[#Latency|latency]] are sometimes used interchangeably.
CPU, memory, disk I/O.
===<span id='Response_Time'></span><span id='Latency'></span>Latency (Response Time)===
The '''latency''', or '''response time''', is the minimum time required to get any form of response from a service, even if the work to be done is nonexistent (Martin Fowler, Patterns of Enterprise Application Architecture). Another definition is the length of time for something interesting to happen. The latency can be measured practically as the time between a client finishing sending a request and fully receiving a response. This measured interval includes the time the request travels over the network from the client to the backend, the time the request is awaiting service in the backend queue, the service time and the time the response takes to travel back to the client. The name comes from the fact that, from the client perspective, once the request is fully submitted, it is '''latent''', awaiting service. The latency is an important parameter for an on-line system, such as a web site or a mobile application backend.


The response time is relevant for on-line systems, such as a web site or a mobile application backend.
Latency and response time are often used synonymously, but some authors argue that they are not synonymous ([[DDIA]]).


A single response time value is not that relevant; it makes more sense to think of response time as a distribution of values that can be measured. For a system that works well, most requests over a specific time interval are reasonably fast, but there are occasional outliers that take much longer. This can happen because the requests in question are intrinsically more expensive, but the additional latency can also be introduced by infrastructure-related factors: context switches, TCP packet loss and retransmission, garbage collection pauses, page faults, etc.
Latency is especially relevant in remote systems, because the time spent propagating the request over the network, and the response back, is not negligible, and in many cases it constitutes the majority of the measured time. Some monitoring systems describe the request time as the time the backend takes to process the request, and in this case the travel time is not accounted for.
 
A single response time value is not that relevant for a system; it makes more sense to think of response time as a distribution of values measured over significant intervals of time. There is no single value for latency: a latency dataset is made of a large number of data points, measured over an interval of time, and trying to characterize such a dataset with only one number ("the average latency is 100 ms") is in most cases misleading. For a system that works well, most requests over the measurement interval are reasonably fast, but there are occasional outliers that take much longer. This can happen because the requests in question are intrinsically more expensive, but the additional latency can also be introduced by infrastructure-related factors: context switches, TCP packet loss and retransmission, garbage collection pauses, page faults, etc.
 
The challenge is to come up with a characterization of the latency that is expressive enough to be useful, for all requests and over time. Showing a P95 value simply throws away the worst 5% of the data, which is what should be investigated first, so for troubleshooting, do not throw away the max values. The max values show you how "bad the bad stuff is".
 
Standard deviation is not a meaningful statistic for a dataset that describes latency. Latency must be measured in the context of [[#Load|load]]; measuring the latency without load is misleading.


====Average Response Time====
Line 26: Line 32:
====Median Response Time====


The median response time for an interval is the response time of the request for which 50% of the requests are faster, and 50% of the requests are slower. The median is also known as the '''50<sup>th</sup> percentile'''.
The median response time for an interval is the response time of the request for which 50% of the requests are faster, and 50% of the requests are slower. The median is also known as the 50<sup>th</sup> percentile or P50.


====Response Time Percentiles====
n<sup>th</sup> percentile, or quantile (ex: 99<sup>th</sup>, abbreviated "p99") is the response time threshold at which n% (99%) of requests are faster than the particular threshold (and (100-n)% are slower). <font color=darkkhaki>[[DDIA]] Cap 1 Reliable, Scalable and Maintainable Applications → Scalability →  Describing Performance</font>. Also see [[#Percentiles|Percentiles]] below.
n<sup>th</sup> percentile, or quantile (ex: 99<sup>th</sup>, abbreviated P99) is the response time threshold at which n% (99%) of requests are faster than the particular threshold (and (100-n)% are slower). <font color=darkkhaki>[[DDIA]] Cap 1 Reliable, Scalable and Maintainable Applications → Scalability →  Describing Performance</font>.  
 
===Latency===


Latency is the minimum time required to get any form of response, even if the work to be done is nonexistent (Martin Fowler, Patterns of Enterprise Application Architecture). Latency is relevant in remote systems, because it includes the time the request and response take to make their way across the wire, and also the time the request is waiting to be handled on the backend, during which it is '''latent''', awaiting service. Latency and [[#Response_Time|response time]] are often used synonymously, as the length of time it takes something interesting to happen, while some authors argue that they are not synonymous ([[DDIA]]).
Also see: {{Internal|Percentile#Overview|Percentiles}}
 
Standard deviation is not a meaningful statistic for a dataset that describes latency. Latency must be measured in the context of [[#Load|load]]; measuring the latency without load is misleading.


====Articles and Talks on Latency====
<font color=darkkhaki>
<font color=darkkhaki>
TO Process:
To Process:
* Your Load Generator is Probably Lying to You - Take the Red Pill and Find Out Why https://highscalability.com/your-load-generator-is-probably-lying-to-you-take-the-red-pi/
* Everything You Know About Latency Is Wrong by Tyler Treat https://bravenewgeek.com/everything-you-know-about-latency-is-wrong/
* How NOT to Measure Latency https://www.infoq.com/presentations/latency-response-time
* Define saturation.
* Identify where the saturation point is. Don't run a system at saturation or over.
</font>
</font>


==Throughput==
===Throughput===
Throughput is the rate at which something can be produced, consumed or processed, in a time unit. Throughput is usually relevant in case of [[System_Design#Batch_Processing|batch processing systems]], such as Hadoop, where it describes the number of records that can be processed per second.
===Saturation===
<font color=darkkhaki>
Define saturation. Identify where the saturation point is. Don't run a system at saturation or over.
</font>
=Load=
Load is a statement of how much stress a system is under. Load can be numerically described with [[#Load_Parameters|load parameters]].
==Load Parameters==
A load parameter is a numerical representation of a system's [[#Load|load]]. For example, in case of a web server, an essential load parameter is the number of requests per second (RPS) or queries per second (QPS). For a database, it could be the ratio of reads to writes. For a cache, it is the miss rate. Understanding load parameters of a specific system is important during the [[System_Design#Scalability|system design phase]]. An architecture that scales well for a particular application is built around assumptions on load parameters, and requires an understanding of which operations will be common and which will be rare.
==Load Testing==
'''Load testing''' is the analysis of a system under load while producing different kinds of load, by varying [[#Load_Parameters|load parameters]]. For example, one way to vary the load is to modify the request rate while keeping the request type (payload, processing cost, etc.) constant. Another way is to modify request type (payload, processing cost, etc.) while keeping the request rate constant.
===Coordinated Omission===
The coordinated omission problem is a systematic error introduced by some load testing tools, which do not produce correct data in the presence of a certain type of failure of the system under test. The coordinated omission problem skews the data towards the good results, so you are looking at a percentile of good things instead of a percentile of all things.
The problem happens in two common ways.
One is with load generators, as shown below, and the other one is with self-monitoring systems.
For load generators, the simplest way to describe it is that the load generator does not do the next thing in the sequence until the previous one comes back. This generalizes to load generators that use multiple threads: the same behavior occurs on each thread.
The load generators need to keep sending requests independently of the response time. If the client waits for the previous request to complete before sending the next one, this is fine, and the load generator produces correct data, as long as the response comes back before the client is supposed to send the next request. However, if the system under test starts to misbehave and the response comes back after the client is supposed to send the next request, and the client waits (or each thread waits), in effect we send requests at a lower rate than we are supposed to. This behavior artificially keeps the queues shorter on the service under test exactly when it misbehaves. Effectively, we cut it some slack. We are coordinating with the system under load and avoiding measuring it specifically during a bad time. That skews the measurement, because in fact we apply a "lighter" load than we think. This is why we call it "coordinated" omission, as opposed to random omission, which would not be that bad.
For example, if the load client has 10 threads, each sending one request per second and getting 10 ms response times, and then the system under test freezes for an hour, all 10 threads get stuck waiting for 10 responses from a frozen system, for an hour. When the system unfreezes after 3600 seconds, the load client will have produced only 10 very bad results (3,600,000 ms vs 10 ms), which get lost in the very high percentiles, while it should have produced 10 x 3600 = 36,000 bad results. This is the coordinated omission problem. The load client must be specifically coded to avoid this problem, and [[Vegeta#Vegeta_and_Coordinated_Omission|Vegeta]] is one of the load tools that avoids it.
<font color=darkkhaki>TODO: https://www.scylladb.com/2021/04/22/on-coordinated-omission/</font>
===<span id='Tools'></span>Load Testing Tools===
* [[Apache_JMeter#Overview|Apache JMeter]]
* [[SoapUI]]
* [[Gatling]]
* ApacheBench https://httpd.apache.org/docs/2.4/programs/ab.html
====Written in Go====
(in descending order of stars on GitHub):
* https://github.com/grafana/k6
* [[Vegeta#Overview|Vegeta]]
* https://github.com/getanteon/anteon
* https://github.com/codesenberg/bombardier
* [[Fortio#Overview|Fortio]]
* https://github.com/rogerwelin/cassowary


=Benchmarking=
Benchmarking is testing the system under load at peak capacity.
=Scalability=
Scalability is a measure of how adding resources (usually hardware) affects performance and describes the ability of a system to cope with increased [[#Load|load]]. Also see: {{Internal|System_Design#Scalability|System Design &#124; Scalability}}
=Percentiles=
=<span id='HDRHistogram'></span>HDR Histogram=
The n<sup>th</sup> percentile, or quantile (ex: 99<sup>th</sup>, abbreviated "p99") is the value of the performance metric threshold at which n% (99%) of the measurements are better than the particular threshold (and (100-n)% are worse).
{{External|http://hdrhistogram.org}}
 
<font color=darkkhaki>Parse: https://www.infoq.com/presentations/latency-response-time/</font>
Averaging percentiles - reducing time resolution or combining data from several machines - is mathematically meaningless. The right way to aggregate performance metric data is to add the histograms <font color=darkkhaki>(see: https://www.vividcortex.com/blog/why-percentiles-dont-work-the-way-you-think)</font>.
 
A naive implementation of a percentile computation algorithm is to maintain a list of all performance metric readings for a time window and sort the list periodically. Better algorithms are:
* Forward decay (http://dimacs.rutgers.edu/~graham/pubs/papers/fwddecay.pdf)
* T-digest (https://github.com/tdunning/t-digest)
* HdrHistogram (http://www.hdrhistogram.org)
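The point about adding histograms instead of averaging percentiles can be illustrated with a minimal sketch. The two per-machine histograms below are made-up data, and <code>percentile_from_histogram</code> is a hypothetical helper using the nearest-rank method; it is not the API of HdrHistogram or t-digest.

```python
import math
from collections import Counter


def percentile_from_histogram(hist, p):
    """Nearest-rank percentile over a {latency_ms: count} histogram."""
    total = sum(hist.values())
    if total == 0:
        raise ValueError("empty histogram")
    rank = math.ceil(p * total / 100)  # 1-based rank of the wanted sample
    seen = 0
    for value in sorted(hist):
        seen += hist[value]
        if seen >= rank:
            return value
    raise ValueError("p must be at most 100")


# Hypothetical data: machine A served 100 requests, machine B served 1000.
machine_a = Counter({10: 99, 500: 1})
machine_b = Counter({10: 900, 500: 100})

p99_a = percentile_from_histogram(machine_a, 99)
p99_b = percentile_from_histogram(machine_b, 99)
averaged_p99 = (p99_a + p99_b) / 2

# Adding the histograms keeps every measurement and its weight.
merged = machine_a + machine_b
merged_p99 = percentile_from_histogram(merged, 99)

print(f"averaged P99 = {averaged_p99} ms, merged P99 = {merged_p99} ms")
```

With this data the averaged value lands between the two machines and corresponds to no real measurement, while the merged histogram reports the P99 of all 1100 requests.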


=Queueing Theory=
Line 78: Line 114:
* Never average percentiles.
* Coordinated omission. Coordinated omission usually makes something that you think is a response time metric only represent a service time component.
* Response Time in Queueing Theory.
* Service Time in Queueing Theory.
</font>
</font>
=Load Generators=
The load generators need to keep sending requests independently of the response time. If the client waits for the previous request to complete before sending the next one, that behavior has the effect of artificially keeping the queues shorter in the test than they would be in reality, which skews the measurement.
* [[SoapUI]]
* [[Gatling]]

Latest revision as of 02:09, 1 August 2024

External

Internal

Performance

The performance of the system is described by performance metrics.

Performance Metrics

Resource Consumption

CPU, memory, disk I/O.

Latency (Response Time)

The latency, or response time, is the minimum time required to get any form of response from a service, even if the work to be done is nonexistent (Martin Fowler, Patterns of Enterprise Application Architecture). Another definition is the length of time for something interesting to happen. The latency can be measured practically as the time between a client finishing sending a request and fully receiving a response. This measured interval includes the time the request travels over the network from the client to the backend, the time the request is awaiting service in the backend queue, the service time and the time the response takes to travel back to the client. The name comes from the fact that, from the client perspective, once the request is fully submitted, it is latent, awaiting service. The latency is an important parameter for an on-line system, such as a web site or a mobile application backend.

Latency and response time are often used synonymously, but some authors argue that they are not synonymous (DDIA).

Latency is especially relevant in remote systems, because the time spent propagating the request over the network, and the response back, is not negligible, and in many cases it constitutes the majority of the measured time. Some monitoring systems describe the request time as the time the backend takes to process the request, and in this case the travel time is not accounted for.

A single response time value is not that relevant for a system; it makes more sense to think of response time as a distribution of values measured over significant intervals of time. There is no single value for latency: a latency dataset is made of a large number of data points, measured over an interval of time, and trying to characterize such a dataset with only one number ("the average latency is 100 ms") is in most cases misleading. For a system that works well, most requests over the measurement interval are reasonably fast, but there are occasional outliers that take much longer. This can happen because the requests in question are intrinsically more expensive, but the additional latency can also be introduced by infrastructure-related factors: context switches, TCP packet loss and retransmission, garbage collection pauses, page faults, etc.

The challenge is to come up with a characterization of the latency that is expressive enough to be useful, for all requests and over time. Showing a P95 value simply throws away the worst 5% of the data, which is what should be investigated first, so for troubleshooting, do not throw away the max values. The max values show you how "bad the bad stuff is".

Standard deviation is not a meaningful statistic for a dataset that describes latency. Latency must be measured in the context of load; measuring the latency without load is misleading.

Average Response Time

The arithmetic mean: given n request response times, add up all the values and divide by n. This is not a very good metric because it does not reflect the "typical" response time: it does not tell you how many users actually experienced the delay.
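
As a small illustration of why the mean can mislead, the sketch below uses ten made-up response times with a single slow outlier. The numbers are hypothetical and only meant to show the effect.

```python
from statistics import mean, median

# Hypothetical response times in milliseconds: nine fast requests, one outlier.
response_times_ms = [10, 11, 9, 10, 12, 10, 11, 9, 10, 1000]

avg = mean(response_times_ms)
med = median(response_times_ms)

# Count how many requests were actually slower than the mean.
slower_than_avg = sum(1 for t in response_times_ms if t > avg)

print(f"mean = {avg} ms, median = {med} ms, slower than mean: {slower_than_avg}")
```

The mean comes out around 109 ms, although nine out of ten requests completed in about 10 ms and only one request was slower than the mean.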

Median Response Time

The median response time for an interval is the response time of the request for which 50% of the requests are faster, and 50% of the requests are slower. The median is also known as the 50th percentile or P50.

Response Time Percentiles

nth percentile, or quantile (ex: 99th, abbreviated P99) is the response time threshold at which n% (99%) of requests are faster than the particular threshold (and (100-n)% are slower). DDIA Cap 1 Reliable, Scalable and Maintainable Applications → Scalability → Describing Performance.
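
The definition above can be sketched with the nearest-rank method, which is one of several common ways to compute a percentile. The function name and the sample data are assumptions made for this illustration.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p% of the samples are less than or equal to it."""
    if not values:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical sample: 98 fast requests, one slow and one very slow request.
samples_ms = [10] * 98 + [50, 2000]

p50 = percentile(samples_ms, 50)
p99 = percentile(samples_ms, 99)
worst = max(samples_ms)

print(f"P50 = {p50} ms, P99 = {p99} ms, max = {worst} ms")
```

In this sample P99 is 50 ms while the maximum is 2000 ms, which is why the maximum should be reported next to the percentiles.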

Also see:

Percentiles

Articles and Talks on Latency

To Process:

Throughput

Throughput is the rate at which something can be produced, consumed or processed, in a time unit. Throughput is usually relevant in case of batch processing systems, such as Hadoop, where it describes the number of records that can be processed per second.

Saturation

Define saturation. Identify where the saturation point is. Don't run a system at saturation or over.

Load

Load is a statement of how much stress a system is under. Load can be numerically described with load parameters.

Load Parameters

A load parameter is a numerical representation of a system's load. For example, in case of a web server, an essential load parameter is the number of requests per second (RPS) or queries per second (QPS). For a database, it could be the ratio of reads to writes. For a cache, it is the miss rate. Understanding load parameters of a specific system is important during the system design phase. An architecture that scales well for a particular application is built around assumptions on load parameters, and requires an understanding of which operations will be common and which will be rare.
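
A minimal sketch of how the load parameters named above could be derived from raw counters collected over a measurement window. The Counters fields and the sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Counters:
    """Hypothetical raw counters collected over one measurement window."""
    requests: int
    reads: int
    writes: int
    cache_hits: int
    cache_misses: int
    window_seconds: float


def load_parameters(c):
    lookups = c.cache_hits + c.cache_misses
    return {
        # Web server: requests per second (RPS).
        "requests_per_second": c.requests / c.window_seconds,
        # Database: ratio of reads to writes.
        "read_write_ratio": c.reads / c.writes if c.writes else float("inf"),
        # Cache: fraction of lookups that missed.
        "cache_miss_rate": c.cache_misses / lookups if lookups else 0.0,
    }


params = load_parameters(
    Counters(requests=6000, reads=900, writes=100,
             cache_hits=450, cache_misses=50, window_seconds=60)
)
print(params)
```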

Load Testing

Load testing is the analysis of a system under load while producing different kinds of load, by varying load parameters. For example, one way to vary the load is to modify the request rate while keeping the request type (payload, processing cost, etc.) constant. Another way is to modify request type (payload, processing cost, etc.) while keeping the request rate constant.

Coordinated Omission

The coordinated omission problem is a systematic error introduced by some load testing tools, which do not produce correct data in the presence of a certain type of failure of the system under test. The coordinated omission problem skews the data towards the good results, so you are looking at a percentile of good things instead of a percentile of all things.

The problem happens in two common ways.

One is with load generators, as shown below, and the other one is with self-monitoring systems.

For load generators, the simplest way to describe it is that the load generator does not do the next thing in the sequence until the previous one comes back. This generalizes to load generators that use multiple threads: the same behavior occurs on each thread.

The load generators need to keep sending requests independently of the response time. If the client waits for the previous request to complete before sending the next one, this is fine, and the load generator produces correct data, as long as the response comes back before the client is supposed to send the next request. However, if the system under test starts to misbehave and the response comes back after the client is supposed to send the next request, and the client waits (or each thread waits), in effect we send requests at a lower rate than we are supposed to. This behavior artificially keeps the queues shorter on the service under test exactly when it misbehaves. Effectively, we cut it some slack. We are coordinating with the system under load and avoiding measuring it specifically during a bad time. That skews the measurement, because in fact we apply a "lighter" load than we think. This is why we call it "coordinated" omission, as opposed to random omission, which would not be that bad.

For example, if the load client has 10 threads, each sending one request per second and getting 10 ms response times, and then the system under test freezes for an hour, all 10 threads get stuck waiting for 10 responses from a frozen system, for an hour. When the system unfreezes after 3600 seconds, the load client will have produced only 10 very bad results (3,600,000 ms vs 10 ms), which get lost in the very high percentiles, while it should have produced 10 x 3600 = 36,000 bad results. This is the coordinated omission problem. The load client must be specifically coded to avoid this problem, and Vegeta is one of the load tools that avoids it.
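
The example can be reproduced for a single thread with a small simulation that does not send any real requests. This is only a sketch under simplifying assumptions that are not in the original text: the system answers in 10 ms outside the freeze, every request that arrives during the freeze is answered 10 ms after the freeze ends, and the backlog drains instantly. The function and parameter names are hypothetical.

```python
def simulate(freeze_start_s, freeze_len_s, total_s, interval_s=1.0, service_s=0.010):
    """Return (closed_loop, intended) latency lists, in seconds, for one thread."""
    freeze_end_s = freeze_start_s + freeze_len_s

    def completion_time(send_s):
        # A request sent during the freeze is only served once the freeze ends.
        start_s = freeze_end_s if freeze_start_s <= send_s < freeze_end_s else send_s
        return start_s + service_s

    # Closed loop: the next request is not sent until the previous one returns.
    closed_loop = []
    t = 0.0
    while t < total_s:
        done = completion_time(t)
        closed_loop.append(done - t)
        t = max(t + interval_s, done)

    # Intended schedule: one request every interval_s, regardless of responses.
    intended = []
    k = 0
    while k * interval_s < total_s:
        send_s = k * interval_s
        intended.append(completion_time(send_s) - send_s)
        k += 1
    return closed_loop, intended


closed, intended = simulate(freeze_start_s=10, freeze_len_s=3600, total_s=3700)
bad_closed = sum(1 for latency in closed if latency > 1.0)
bad_intended = sum(1 for latency in intended if latency > 1.0)
print(f"bad results: closed loop = {bad_closed}, intended schedule = {bad_intended}")
```

For one thread, the closed-loop client records a single bad result, while the intended schedule contains 3600 of them; multiplied by 10 threads this gives the 10 versus 36,000 of the example.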

TODO: https://www.scylladb.com/2021/04/22/on-coordinated-omission/

Load Testing Tools

Written in Go

(in descending order of stars on GitHub):

Benchmarking

Benchmarking is testing the system under load at peak capacity.

Scalability

Scalability is a measure of how adding resources (usually hardware) affects performance and describes the ability of a system to cope with increased load. Also see:

System Design | Scalability

HDR Histogram

http://hdrhistogram.org

Parse: https://www.infoq.com/presentations/latency-response-time/

Queueing Theory

TODO:

Response time and service time diverge as saturation becomes worse.
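
One way to see this divergence is the textbook M/M/1 queue, where the mean response time is R = S / (1 - U), with S the mean service time and U the utilization. The M/M/1 model (single server, Poisson arrivals, exponential service times) is an assumption chosen for this sketch, not something the notes above specify.

```python
def mm1_response_time(service_time_s, arrival_rate_per_s):
    """Mean response time of an M/M/1 queue: R = S / (1 - U), U = arrival rate * S."""
    utilization = arrival_rate_per_s * service_time_s
    if utilization >= 1:
        raise ValueError("at or beyond saturation: the queue grows without bound")
    return service_time_s / (1 - utilization)


service_time_s = 0.010  # hypothetical 10 ms mean service time
for rate in (10, 50, 90, 99):
    r = mm1_response_time(service_time_s, rate)
    print(f"{rate:3d} req/s -> mean response time {r * 1000:7.1f} ms")
```

The service time stays at 10 ms in every row, while the mean response time grows without bound as the arrival rate approaches 100 requests per second, the saturation point of this hypothetical server.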

Organizatorium