Spark: Difference between revisions

From NovaOrdis Knowledge Base
=Internal=
* [[Distributed_Systems#Distributed_Computation|Distributed Systems]]
* [[Stream Processing]]
* [[Flink]]
* [[Beam]]
* [[Iceberg]]
* [[Alluxio]]
* [[Spark Operator]]
* [[Genie]]
* [[Livy]]
* [[dbt]]


=Overview=
* [[Spark Concepts|Concepts]]
=Organizatorium=
* <span id='Spark_SQL'></span>Spark SQL
* PySpark/Spark SQL in interactive mode on [[JupyterHub]].
* Spark batch and streaming.
* Spark history server
* Spark remote shuffle service
* [[Spark Operator]]

Latest revision as of 16:25, 17 May 2022

External

Internal

Overview

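Spark is a third-generation unified analytics engine for large-scale data processing. It natively supports both batch processing and stream processing. Stream processing is implemented as micro-batching: the incoming stream is cut into small batches, and each batch is processed as a short batch job. Streaming state is persisted to HDFS (or an HDFS-compatible file system), which acts as the state backend.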
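
The micro-batching idea can be illustrated with a toy model in plain Python. This is a conceptual sketch only, not Spark's API or implementation: the names `micro_batches` and `run_word_count` are hypothetical, and a local directory stands in for the HDFS state backend. The stream is cut into small batches, each batch updates a running word count, and the state is checkpointed after every batch so that a restarted run can resume from it.

```python
import json
import tempfile
from collections import Counter
from itertools import islice
from pathlib import Path


def micro_batches(records, batch_size):
    """Cut a (possibly unbounded) iterable of records into small lists."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def run_word_count(records, checkpoint_dir, batch_size=3):
    """Process records one micro-batch at a time, checkpointing the state.

    checkpoint_dir is a local directory here; it is an assumed stand-in
    for a durable store such as HDFS.
    """
    checkpoint = Path(checkpoint_dir) / "state.json"
    # Recover the state left by a previous run, if there is one.
    if checkpoint.exists():
        counts = Counter(json.loads(checkpoint.read_text()))
    else:
        counts = Counter()
    for batch in micro_batches(records, batch_size):
        for line in batch:
            counts.update(line.split())
        # Persist the state after each batch so a restart can resume from it.
        checkpoint.write_text(json.dumps(counts))
    return counts


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as state_dir:
        run_word_count(["a b", "b c", "c"], state_dir)
        # A second run picks up the checkpointed counts and keeps going.
        print(run_word_count(["a"], state_dir))
```

The design choice worth noting is that state lives outside the processing loop: because every batch ends with a durable write, a failure costs at most the batch in flight.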

Subjects

Organizatorium

  • Spark SQL
  • PySpark/Spark SQL in interactive mode on JupyterHub.
  • Spark batch and streaming.
  • Spark job.
  • Spark UI
  • Spark history server
  • Spark remote shuffle service
  • Spark Operator