Amazon Kinesis

=External=
* Amazon Kinesis Whitepaper https://d0.awsstatic.com/whitepapers/whitepaper-streaming-data-solutions-on-aws-with-amazon-kinesis.pdf
* https://docs.aws.amazon.com/kinesis/
* Amazon Kinesis Data Streams Developer Guide https://docs.aws.amazon.com/streams/latest/dev/introduction.html


=Internal=
* [[Amazon AWS#Subjects|Amazon AWS]]
* [[Stream Processing]]
* [[Spring Cloud Stream]]
=Subjects=
* [[Amazon Kinesis Operations|Operations]]


=Overview=


Kinesis acts as a highly available conduit to stream messages between data producers and data consumers. The Kinesis service can be integrated and exposed externally via the [[Amazon_API_Gateway_Concepts#Amazon_API_Gateway|Amazon API Gateway]].


=Concepts=
==Stream==

A Kinesis data stream is a named set of [[#Shard|shards]]. Streams can be created from the AWS Management Console, with the [[AWS CLI]] and via the [[#Kinesis_Data_Streams_API|Kinesis Data Streams API]].
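
A minimal creation sketch using the boto3 Python SDK (the stream name "test-stream" and the shard count are arbitrary choices for illustration, not values from this article):

<syntaxhighlight lang="python">
import boto3

client = boto3.client("kinesis")

# Create a stream with two shards; name and shard count are arbitrary choices.
client.create_stream(StreamName="test-stream", ShardCount=2)

# Stream creation is asynchronous; wait until the stream becomes ACTIVE.
client.get_waiter("stream_exists").wait(StreamName="test-stream")
</syntaxhighlight>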
 
===Stream Name===
 
The namespace is defined by the [[Amazon_AWS_Concepts#Account|AWS Account]] and [[Amazon_AWS_Concepts#Region|AWS region]]: for the same account, streams with the same name can exist in different regions.
 
===Data Stream===
 
Data streams are associated with [[#Amazon_Kinesis_Streams|Kinesis Streams]].
 
===Delivery Stream===
 
Delivery streams are associated with [[#Amazon_Kinesis_Firehose|Kinesis Firehose]].


==Shard==
A shard is a uniquely identified sequence of [[#Record|data records]] in a stream. Shards are identified in a [[#Stream|stream]] by their [[#Partition_Key|partition key]]. They also automatically get a Shard ID, which can be obtained with the [[Amazon_Kinesis_Streams#Get_Details_about_a_Specific_Stream|AWS CLI describe-stream command]]. Shards have a fixed capacity: a shard supports up to 5 transactions per second for reads, up to a maximum total data read rate of 2 MB per second, and up to 1,000 records per second for writes, up to a maximum total data write rate of 1 MB per second (including partition keys). To increase the capacity of a stream, add more shards.
===Shard ID===
The shard ID is different from the [[#Partition_Key|partition key]].
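
The shard IDs (and the hash key range owned by each shard) can also be retrieved programmatically. A minimal sketch using the boto3 Python SDK, assuming a stream named "test-stream" already exists:

<syntaxhighlight lang="python">
import boto3

client = boto3.client("kinesis")

# "test-stream" is an assumed name; replace it with an existing stream.
description = client.describe_stream(StreamName="test-stream")
for shard in description["StreamDescription"]["Shards"]:
    # Each shard gets an automatically assigned shard ID, e.g. "shardId-000000000000",
    # and owns a hash key range used to map partition keys to the shard.
    print(shard["ShardId"], shard["HashKeyRange"])
</syntaxhighlight>
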
==Shard Iterator==
{{External|[https://docs.aws.amazon.com/kinesis/latest/APIReference/API_GetShardIterator.html#Kinesis-GetShardIterator-request-ShardIteratorType Kinesis API Reference - Shard Iterators]}}
A shard iterator represents the position in the stream and shard from which the consumer will read.
{{Warn|The choice of shard iterator is important. Different shard iterators may yield different behaviors while receiving records, including not receiving some records.}}
Shard iterator types (a minimal read sketch follows the list):
===LATEST===
Start reading just after the most recent record in the shard, so that you always read the most recent data in the shard. If a consumer using the LATEST shard iterator and a producer start concurrently, at about the same time - which may be the case in testing - receiving the record is prone to race conditions: the record may or may not be received, depending on whether the consumer was started before or after the producer.
===TRIM_HORIZON===
Start reading at the last untrimmed record in the shard, which is the oldest data record in the shard. If all untrimmed records in the stream are read with one TRIM_HORIZON iterator, and then a new TRIM_HORIZON iterator is created and used, the same records are received again.
===AT_SEQUENCE_NUMBER===
Start reading from the position denoted by a specific sequence number, provided in the value StartingSequenceNumber.
===AFTER_SEQUENCE_NUMBER===
Start reading right after the position denoted by a specific sequence number, provided in the value StartingSequenceNumber.
===AT_TIMESTAMP===
Start reading from the position denoted by a specific time stamp, provided in the value Timestamp.
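
The sketch below (boto3 Python SDK, stream name assumed) applies to all of the iterator types above: it requests a TRIM_HORIZON iterator for one shard and reads a single batch of records from it; the comments note which extra parameter the other types require.

<syntaxhighlight lang="python">
import boto3

client = boto3.client("kinesis")

# Assumed stream name; read from its first shard only.
stream = "test-stream"
shard_id = client.describe_stream(StreamName=stream)[
    "StreamDescription"]["Shards"][0]["ShardId"]

# TRIM_HORIZON: start at the oldest untrimmed record in the shard.
# AT_SEQUENCE_NUMBER / AFTER_SEQUENCE_NUMBER additionally require
# StartingSequenceNumber=..., AT_TIMESTAMP requires Timestamp=...,
# LATEST takes no extra parameter.
iterator = client.get_shard_iterator(
    StreamName=stream,
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

# Read one batch of records from the chosen position.
response = client.get_records(ShardIterator=iterator, Limit=100)
for record in response["Records"]:
    print(record["SequenceNumber"], record["PartitionKey"], record["Data"])
</syntaxhighlight>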


==Record==

A record is a unit of data stored in a stream. A record has a [[#Sequence_Number|sequence number]], a [[#Partition_Key|partition key]] and a [[#Data_Blob|data blob]]. After the data blob is stored in a record, Kinesis does not inspect, interpret or change it in any way.


==Data Blob==

The data blob is the immutable sequence of bytes constituting the payload of a [[#Record|record]]. Kinesis Data Streams does not inspect, interpret, or change the data in the blob in any way. A data blob can be up to 1 MB.


==Partition Key==

The partition key is used to identify different [[#Shard|shards]] in a [[#Stream|stream]], and allows a data producer to distribute data across shards. The partition key associated with each data record is used to determine which shard a given data record belongs to. Partition keys are Unicode strings with a maximum length limit of 256 bytes. An MD5 hash function is used to map partition keys to 128-bit integer values and to map associated data records to shards. When an application puts data into a stream, it must specify a partition key - setting the partition key is mandatory when using the API.
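
The mapping can be illustrated client-side. A sketch in Python (boto3), assuming a stream named "test-stream": it hashes a sample partition key with MD5 and looks for the shard whose hash key range contains the resulting 128-bit value.

<syntaxhighlight lang="python">
import hashlib
import boto3

client = boto3.client("kinesis")

partition_key = "customer-42"  # arbitrary example key

# MD5 maps the partition key to a 128-bit integer value.
hash_value = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)

# Each shard owns a contiguous hash key range; find the shard whose range
# contains the hashed key. "test-stream" is an assumed stream name.
for shard in client.list_shards(StreamName="test-stream")["Shards"]:
    start = int(shard["HashKeyRange"]["StartingHashKey"])
    end = int(shard["HashKeyRange"]["EndingHashKey"])
    if start <= hash_value <= end:
        print(partition_key, "maps to", shard["ShardId"])
</syntaxhighlight>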


==Sequence Number==

A sequence number is a unique identifier for [[#Record|records]] inserted into a shard. The sequence number is assigned by Kinesis after the data is written to the stream with client.putRecords or client.putRecord. Sequence numbers increase monotonically, and are specific to individual shards. When consuming via [[Spring Cloud Stream]], the sequence number of a record is accessible as the "aws_receivedSequenceNumber" header.
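
A minimal boto3 sketch (the stream name is an assumption; boto3's put_record corresponds to the SDK's putRecord mentioned above) showing that the sequence number is only available in the response, after the write:

<syntaxhighlight lang="python">
import boto3

client = boto3.client("kinesis")

# The producer does not choose the sequence number; Kinesis assigns it and
# returns it in the PutRecord response. "test-stream" is an assumed name.
response = client.put_record(
    StreamName="test-stream",
    Data=b'{"event": "example"}',
    PartitionKey="customer-42",
)
print(response["ShardId"], response["SequenceNumber"])
</syntaxhighlight>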
 
==Producer==
 
Producers put data [[#Record|records]] into streams, and can continually push data to Kinesis Data Streams.
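
A producer sketch using boto3's batch call put_records (stream name and payloads are assumptions for illustration):

<syntaxhighlight lang="python">
import boto3

client = boto3.client("kinesis")

# Send several records in one call; each record carries its own partition key.
response = client.put_records(
    StreamName="test-stream",
    Records=[
        {"Data": b'{"reading": 1}', "PartitionKey": "sensor-a"},
        {"Data": b'{"reading": 2}', "PartitionKey": "sensor-b"},
    ],
)
# FailedRecordCount indicates how many records in the batch were rejected
# and should be retried.
print(response["FailedRecordCount"])
</syntaxhighlight>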
 
==Consumer==
 
Consumers process the data in real time. A Kinesis Data Streams consumer can be a custom application running on Amazon EC2 or an Amazon Kinesis Data Firehose delivery stream.
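
A sketch of a custom consumer in Python (boto3), polling a single shard of an assumed stream with a LATEST iterator and following NextShardIterator; resharding, checkpointing and error handling are omitted:

<syntaxhighlight lang="python">
import time
import boto3

client = boto3.client("kinesis")

# "test-stream" is an assumed name; a real consumer would handle every shard.
shard_id = client.describe_stream(StreamName="test-stream")[
    "StreamDescription"]["Shards"][0]["ShardId"]

iterator = client.get_shard_iterator(
    StreamName="test-stream",
    ShardId=shard_id,
    ShardIteratorType="LATEST",
)["ShardIterator"]

while iterator is not None:
    response = client.get_records(ShardIterator=iterator, Limit=100)
    for record in response["Records"]:
        print(record["PartitionKey"], record["Data"])
    iterator = response.get("NextShardIterator")
    time.sleep(1)  # stay well under the 5 read transactions/second/shard limit
</syntaxhighlight>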
 
==Retention Period==
 
The retention period is the length of time that data records are accessible after they are added to the stream. The default retention period is 24 hours, and it can be increased up to 168 hours (7 days).
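
The retention period can be changed programmatically; a minimal boto3 sketch (stream name assumed) raising it from the default to the 168-hour maximum:

<syntaxhighlight lang="python">
import boto3

client = boto3.client("kinesis")

# Raise the retention period of an assumed stream from 24 hours (default) to 7 days.
client.increase_stream_retention_period(
    StreamName="test-stream",
    RetentionPeriodHours=168,
)
</syntaxhighlight>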
 
==Limits==
 
{{External|https://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html}}


=Services=
==Amazon Kinesis Streams==

{{Internal|Amazon Kinesis Streams#Overview|Amazon Kinesis Streams}}

Kinesis Streams is aimed at users who want to build custom applications to process or analyze streaming data.


==Amazon Kinesis Firehose==

{{Internal|Amazon Kinesis Firehose#Overview|Amazon Kinesis Firehose}}

Kinesis Firehose is a solution for loading streaming data from a wide range of sources (web applications, mobile applications, IoT devices, telemetry) directly into AWS storage, without the need to write applications or manage resources.


==Amazon Kinesis Analytics==

{{Internal|Amazon Kinesis Analytics#Overview|Amazon Kinesis Analytics}}


=Kinesis Data Streams API=
