Spark
=External=
* https://spark.apache.org
* https://spark.apache.org/docs/latest/index.html
* https://www.macrometa.com/event-stream-processing/spark-vs-flink
=Internal=
* [[Distributed_Systems#Distributed_Computation|Distributed Systems]]
* [[Stream Processing]]
* [[Flink]]
* [[Beam]]
* [[Iceberg]]
* [[Alluxio]]
* [[Spark Operator]]
* [[Genie]]
* [[Livy]]
* [[dbt]]
=Overview=
Spark is a third-generation unified analytics engine for large-scale data processing. It natively supports [[System_Design#Batch_Processing|batch processing]] and [[System_Design#Stream_Processing|stream processing]]. Stream processing is implemented as micro-batching: the incoming stream is cut into small batches, and each batch is processed as a regular batch job. Streaming state is checkpointed to [[HDFS]] (or an HDFS-compatible file system), which acts as the state backend.
=Subjects=
* [[Spark Concepts|Concepts]]
=Organizatorium=
* <span id='Spark_SQL'></span>Spark SQL
* PySpark/Spark SQL in interactive mode on [[JupyterHub]].
* Spark batch and streaming.
* Spark job.
* Spark UI
* Spark history server
* Spark remote shuffle service
* [[Spark Operator]]
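The micro-batching model mentioned in the Overview can be illustrated with a minimal plain-Python sketch. It does not use the Spark API: the names <code>micro_batches</code> and <code>word_counts</code>, the event shape and the one-second interval are all hypothetical, chosen only for illustration. It also buckets events by their own timestamps for simplicity, whereas a real engine typically cuts batches on a processing-time trigger.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical event shape: (timestamp in seconds, payload).
Event = Tuple[float, str]


def micro_batches(events: Iterable[Event], interval: float) -> List[List[Event]]:
    """Group events into consecutive fixed-length time windows (micro-batches).

    Windows that receive no events are skipped.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    buckets: Dict[int, List[Event]] = defaultdict(list)
    for ts, payload in events:
        # The window index is the number of whole intervals before the event.
        buckets[int(ts // interval)].append((ts, payload))
    return [buckets[k] for k in sorted(buckets)]


def word_counts(batch: List[Event]) -> Dict[str, int]:
    """The 'batch job' applied to each micro-batch: count payload occurrences."""
    counts: Dict[str, int] = defaultdict(int)
    for _, payload in batch:
        counts[payload] += 1
    return dict(counts)


# A small in-memory "stream"; in practice events would arrive continuously.
stream = [(0.2, "a"), (0.7, "b"), (1.1, "a"), (1.9, "a"), (3.4, "b")]

# Each micro-batch is processed independently with ordinary batch logic.
results = [word_counts(batch) for batch in micro_batches(stream, interval=1.0)]
```

The point of the sketch is that the streaming path reuses the batch path: the same <code>word_counts</code> function runs unchanged on every micro-batch, and latency is bounded below by the batch interval.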
Latest revision as of 16:25, 17 May 2022