[Code refactor](master): New series: Spark basics

Spark basics summary
master
土豆兄弟 2 years ago
parent 81ab198024
commit f84029c438

@ -9,29 +9,19 @@
- Version used for study [as of 2022-07-20]
## 2. Getting Started with Local Development
## 3. Flink Deployment
## 4. Flink Core Real-Time Processing APIs
## 5. Flink Time Semantics and the Window API
## 6. Flink Watermark
## 7. Flink State Management
## 8. Flink DataSet
## 9. Flink Table & SQL API
## 10. Flink Version Upgrades
## Chapter 1: Introduction to Apache Flink
### 01 | Course Introduction
### 02 | Content Overview
### 03 | Overview of Stream Processing Technology
### 04 | Flink History and Application Scenarios
### 05 | Flink Core Features
## Chapter 2: Flink Deployment and Applications
## Chapter 3: Flink DataStream API in Practice
## Chapter 4: Flink State Management and Fault Tolerance
## Chapter 5: Flink Table & SQL in Practice
## Chapter 6: Flink Runtime Design and Implementation
## Chapter 7: Flink Monitoring and Performance Tuning
## Chapter 8: The Flink Component Stack and Its Usage
## Chapter 9: Hands-On Project: Building a Real-Time Data Pipeline for a Recommender System with Flink

@ -0,0 +1,191 @@
<h1><div style="text-align: center; color: #fa4861">Spark Core Handbook (Self-Compiled)</div></h1>
## 0. Table of Contents
- Spark Basics
- Spark SQL
- Spark MLlib
- Structured Streaming
- Spark Performance Tuning
## 1. Basic Concepts
- Official Quick Start: https://spark.apache.org/
- Official documentation: https://spark.apache.org/docs/latest/
- Latest version [as of 2022-07-20]
- Version used for study [as of 2022-07-20]
## 2. Basics
### 2.1 Spark's Hello World
- IDEA development setup
- Install the Scala plugin + [Project Structure -> Libraries -> Add JARs -> select the files under Spark's jars directory]
#### A. Word Count Implementation
- Let's walk through the three steps of Word Count: reading the input, tokenizing, and counting by group.
- **Step 1: read the input**
- First, we call SparkContext's textFile method to read the source file, wikiOfSpark.txt:
```scala
import org.apache.spark.rdd.RDD

// Placeholder: fill in the directory that contains wikiOfSpark.txt
val rootPath: String = _
val file: String = s"${rootPath}/wikiOfSpark.txt"
// Read the file contents
val lineRDD: RDD[String] = spark.sparkContext.textFile(file)
```
- Three new concepts appear here: spark, sparkContext, and RDD.
- spark and sparkContext are two different entry-point instances:
- spark is an instance of SparkSession, the development entry point; in spark-shell it is created automatically by the system;
- sparkContext is an instance of SparkContext, the other entry point.
- As Spark evolved, SparkSession replaced SparkContext as the unified entry point starting with version 2.0. In other words, to develop a Spark application you must first create a SparkSession (a minimal sketch of creating one explicitly follows below).
- RDD stands for Resilient Distributed Dataset. An RDD is Spark's unified abstraction over distributed data; it defines a set of basic properties and processing methods for distributed datasets.
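- In spark-shell both entry points are pre-created; in a standalone application you would build them yourself. The following is only a minimal sketch: the app name "WordCountDemo" and the local[*] master are illustrative assumptions, not part of the original notes.
```scala
import org.apache.spark.sql.SparkSession

// Build (or reuse) the SparkSession; appName and master are placeholder choices.
val spark = SparkSession.builder()
  .appName("WordCountDemo")
  .master("local[*]")
  .getOrCreate()

// The SparkContext is reachable from the session; the RDD API hangs off it.
val sc = spark.sparkContext
```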
- **Step 2: tokenize**
- "Tokenizing" means breaking the line elements of the "array" apart into words. To do this, we can call the RDD's flatMap method. Logically, flatMap consists of two steps: **map and flatten**.
```scala
// Tokenize each line into words
val wordRDD: RDD[String] = lineRDD.flatMap(line => line.split(" "))
```
- To turn lineRDD's line elements into words, we first split each line element on a delimiter (Split); here the delimiter is a space.
- After splitting, each line element becomes an array of words, and the element type changes from String to Array[String]. A transformation like this, carried out element by element, is what we call a "map".
- The RDD type changes from the original RDD[String] to RDD[Array[String]].
- If you think of RDD[String] as an "array", then RDD[Array[String]] is a "two-dimensional array" whose innermost elements are words.
- ![Tokenizing line by line](pic/以行为单位做分词.png)
- To group the words later, we still need to flatten this "two-dimensional array", that is, remove the inner nesting and restore it to a "one-dimensional array", as shown in the figure below.
- ![Flattening after tokenization](pic/分词后做展平.png)
- In this way, under the flatMap operator, lineRDD, whose elements were lines, is transformed into wordRDD, whose elements are words.
- Note, however, that splitting sentences on spaces may produce empty strings. So after the "map" and "flatten" steps we have to filter out these empty-string "words", which we do with the RDD's filter method:
```scala
// Filter out empty strings
val cleanWordRDD: RDD[String] = wordRDD.filter(word => !word.equals(""))
```
- With that, the tokenization stage gives us a word "array" with the empty strings filtered out, of type RDD[String]. Next, we can get ready to count by group.
- **Step 3: count by group**
- In the RDD framework, aggregation operations (such as counting, summing, and averaging) rely on key-value-pair (Key, Value) elements, that is, "array" elements of the form (Key, Value).
- Therefore, before calling an aggregation operator to count by group, we first have to convert the RDD elements into (Key, Value) form, i.e., map RDD[String] into RDD[(String, Int)].
- Here we uniformly set every Value to 1. That way, for occurrences of the same word, the later counting step only has to accumulate the Values, like this:
- ![Converting elements to Key-Value form](pic/把元素转换为Key-Value形式.png)
- The corresponding code is:
```scala
// Convert RDD elements to (Key, Value) form
val kvRDD: RDD[(String, Int)] = cleanWordRDD.map(word => (word, 1))
```
- With that, the RDD is converted from cleanWordRDD, which stores String elements, to kvRDD, which stores (String, Int) pairs.
- Having converted the form, we can now do the actual group-and-count. Counting by group is really two steps: first "group", then "count". Below, we use the aggregation operator reduceByKey to perform both at once.
- For the key-value "array" kvRDD, reduceByKey first groups by Key (the word); after grouping, each word has a corresponding list of Values. Then, using the aggregation function supplied by the user, it performs a reduce over all the Values of the same Key.
- You can think of reduce here as a computation step, or a computation method. Given the aggregation function, it folds a list of multiple elements down to a single value, which is how the count of each distinct element is produced.
- In the Word Count example, the code that calls reduceByKey to implement grouped counting is as follows:
```scala
// Group by word and count
val wordCounts: RDD[(String, Int)] = kvRDD.reduceByKey((x, y) => x + y)
```
- reduceByKey: grouped aggregation
- As you can see, the aggregation function we pass to reduceByKey is (x, y) => x + y, i.e., an accumulator. So after the words are grouped, reduce uses this accumulator to fold over all the elements of each Value list in turn, eventually collapsing the list into the word's frequency. The reduce computation is the same for every word, as shown in the figure below.
- ![How the reduce step works](pic/reduce操作示意图.png)
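- To make the folding concrete, here is a plain-Scala sketch of what the aggregation function does to one word's Value list; the list of four 1s is a made-up example.
```scala
// Hypothetical Value list for one word after grouping: one 1 per occurrence.
val values = List(1, 1, 1, 1)

// reduceByKey applies (x, y) => x + y pairwise, folding the list into a single count.
val count = values.reduce((x, y) => x + y)  // 4
```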
- After reduceByKey finishes, what we get is still an RDD of type RDD[(String, Int)]. Unlike kvRDD, though, the Value of each wordCounts element records that word's frequency.
- ![The reduceByKey transformation](pic/reduceByKey转换示意图.png)
- At the end of the program, we sort wordCounts by frequency and print the 5 most frequent words to the screen, as shown below.
```scala
// Take the 5 most frequent words (in spark-shell the REPL displays the returned array)
wordCounts.map{case (k, v) => (v, k)}.sortByKey(false).take(5)
```
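- Note that take(5) only returns an Array[(Int, String)] to the driver; in a compiled application nothing is printed yet. A small follow-up sketch, reusing the variables above, that formats and prints the result:
```scala
// take(5) ships the five most frequent (count, word) pairs back to the driver.
val top5: Array[(Int, String)] = wordCounts
  .map { case (k, v) => (v, k) }
  .sortByKey(ascending = false)
  .take(5)

// Print them as "word: count" lines on the driver.
top5.foreach { case (count, word) => println(s"$word: $count") }
```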
- Full code: HelloWorldDemo
### 2.2 RDD and the Programming Model: What Is Lazy Evaluation About?
#### A. What is an RDD
- RDD is the cornerstone on which Spark's distributed in-memory compute engine is built. Many core Spark concepts and components, such as the DAG and the scheduling system, derive from RDDs. A deep understanding of RDDs therefore helps you learn how Spark works in a more complete and systematic way.
- Although the RDD API is used less and less, and most people are now used to the DataFrame and Dataset APIs, whichever API or language you use, your application is ultimately turned into distributed computation over RDDs inside Spark.
- In other words, if you want a better grip on your Spark jobs, the prerequisite is knowing RDDs well enough.
- In one sentence: **an RDD is an abstraction, Spark's abstraction over distributed datasets, covering all distributed data entities in memory and on disk**.
- An RDD is similar to an array; here is how the two compare:
- ![Comparing RDDs and arrays](pic/RDD与数组的对比.png)
- First, at the level of the concepts themselves, an array is a concrete entity, a data structure that stores elements of the same type, whereas an RDD is an abstraction that covers distributed datasets in a distributed computing environment.
- The second difference is the scope of activity: an array's "scope" is narrow, confined to a single process on a single compute node, while the dataset an RDD represents spans processes and nodes; its "scope" is the whole cluster.
- The third difference is how data is located. In an array, the basic unit that carries data is the element; in an RDD, it is the data partition. In a distributed environment, a complete dataset is split into multiple partitions according to some rule, and these partitions are distributed evenly across the cluster's compute nodes and executor processes, which is what enables distributed parallel computation.
- From this comparison it is easy to see that **data partitions (Partitions)** are one of the key properties of the RDD abstraction.
- Now let's switch perspective and look at the RDD's key properties to understand RDDs more deeply. To really internalize RDDs, we have to master their 4 core properties (a small inspection sketch follows the list):
- partitions: the data partitions
- partitioner: the rule used to split the data into partitions
- dependencies: the parent RDD(s) this RDD depends on
- compute: the transformation function
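- Three of the four properties can be inspected directly through the RDD API (compute is an internal method and is not exposed). A minimal sketch, reusing the Word Count RDDs from above:
```scala
// Number of partitions backing the RDD (the "partitions" property).
println(kvRDD.partitions.length)

// The partitioning rule, if any (the "partitioner" property): plain map/flatMap output
// has None, while the result of reduceByKey carries a HashPartitioner.
println(kvRDD.partitioner)
println(wordCounts.partitioner)

// The parent RDD(s) this RDD was derived from (the "dependencies" property).
println(wordCounts.dependencies.map(_.rdd))

// toDebugString prints the whole lineage in one go.
println(wordCounts.toDebugString)
```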
#### B. The 4 RDD Properties, as Seen in a Potato-Chip Workshop
- ![An everyday analogy for RDDs](pic/RDD的生活化类比.png)
- To make full use of every potato and keep production costs down, the workshop runs 3 production lines in parallel, producing canned chips in 3 different sizes. The 3 lines can process 3 potatoes at the same time, and every line follows the same sequence of steps:
- washing, slicing, baking, dispatching, and canning.
- The dispatching step separates the chips into the 3 sizes, small, medium, and large, which are sent to lines 1, 2, and 3 respectively.
- Now, if we treat each production line as a compute node in a distributed environment and use the chip-making process as an analogy for Spark's distributed computation, what interesting observations can we make?
- Each form the ingredients take, such as "muddy potatoes", "clean potatoes", or "potato slices", can be seen as an RDD.
- And the chip-making process is really a sequence of transformations from one ingredient form to the next.
- Observe the workshop's process in the figure from top to bottom.
- For each ingredient form, there are several physical items on the lines that correspond to it; for example, "muddy potatoes" is one ingredient form, and there are 3 dirt-covered potatoes on the lines that all belong to that form.
- If we regard "muddy potatoes" as an RDD, then the RDD's partitions property covers exactly those dirt-covered potatoes in the sack. Likewise, all the washed potatoes on the lines together make up the partitions property of the "clean potatoes" RDD.
- Next, look at the RDD's partitioner property, which defines the rule for splitting the original dataset into partitions. In the workshop example, the splitting rule of the "muddy potatoes" RDD is random pick-up: take a dirt-covered potato from the sack at random and put it on a line. The later ingredient forms, such as "clean potatoes", "potato slices", and "ready-to-eat chips", keep using the splitting rule of the "muddy potatoes" RDD; in other words, each of these later RDDs inherits the partitioner property of its predecessor.
- The odd one out is the "dispatched chips". Clearly, the "dispatched chips" are obtained by sorting the "ready-to-eat chips" into large, medium, and small. That is, for the "dispatched chips", the partitioner property redefines how this RDD's data is partitioned: the previous RDD's partitions are broken up and new partitions are built by chip size.
- This example shows that the layout of the data partitions is determined by the RDD's partitioner. The partitions property of an RDD is therefore tightly coupled to its partitioner property.
- Next, observe the workshop's process horizontally, from left to right.
- Each ingredient form on the line is obtained by applying some operation to the previous form. For example, the "potato slices" depend on the "clean potatoes", and the operation in between is "slicing". Looking back at the RDD transformations in Word Count, we find the same pattern.
- ![RDD transformations in Word Count](pic/WordCount中的RDD转换.png)
- During these transformations of data form, each RDD records, in its dependencies property, the one or more RDDs it depends on (its "parent RDDs"). At the same time, the RDD uses its compute property to record the transformation from the parent RDD(s) to the current RDD.
- Take wordRDD in Word Count as an example: its parent RDD is lineRDD, so its dependencies property records lineRDD. The transformation from lineRDD to wordRDD relies on flatMap, so wordRDD's compute property records the flatMap transformation function.
- To sum up, the chip-making process maps one-to-one onto the RDD concept and its 4 properties:
- the different ingredient forms, such as muddy potatoes, potato slices, and ready-to-eat chips, correspond to the RDD concept itself;
- the concrete items of one ingredient form across the different lines are the RDD's partitions property;
- the rule that decides which line an ingredient goes to corresponds to the RDD's partitioner property;
- each ingredient form depends on the previous one, and that dependency corresponds to the RDD's dependencies property;
- and the processing method at each step corresponds to the RDD's compute property.
- The programming model tells us how to write the code, and lazy evaluation is the foundation of Spark's distributed execution. Only once you understand both can you develop on Spark fluently, implementing business logic without planting performance pitfalls.
#### C. Programming Model and Lazy Evaluation
- What do the operators map, filter, flatMap, and reduceByKey have in common?
- First, all 4 operators are applied (Apply) on RDDs and are used to transform one RDD into another. For example, flatMap is applied on lineRDD and turns lineRDD into wordRDD.
- Second, these operators are themselves functions, and their parameters are functions too. Functions whose parameters or return values are functions are collectively called "higher-order functions" (Higher-order Functions).
- Here, let's focus on the first commonality: RDD transformation.
- An RDD is Spark's abstraction over distributed datasets, and every RDD represents **one distributed form of the data**. For example, lineRDD means the data exists in the cluster as lines (Line), while wordRDD means the data's form is words distributed across the cluster.
- Since an RDD represents a distributed form of the data, **a transformation from one RDD to another is, in essence, a transformation of data form (Transformations)**.
- In the RDD programming model there are two kinds of operators: Transformations and Actions.
- Developers use Transformations to define and describe the transformations of data form, and then call Actions to collect the results or materialize them to disk.
- Under this programming model, Spark's runtime computation is divided into two phases:
- building the computation graph (DAG, Directed Acyclic Graph) from the transformations between the different data forms;
- triggering the execution of that graph, by backtracking through it, via the Actions operators.
- In other words, the Transformations you call do not execute any computation immediately; only when you call an Action do the previously invoked transformations actually run. In the industry this execution model has a name: "lazy evaluation" (Lazy Evaluation).
- Why does Word Count spend almost all of its time on the last line of code, while the earlier lines finish instantly?
- The answer is exactly Spark's lazy evaluation. flatMap, filter, and map only build the computation graph, so when you type them into spark-shell, the shell returns immediately. Only when you type the last line, the one containing take, does Spark trigger the end-to-end computation, which is why that line looks like the most expensive one (the sketch after the diagram below makes this visible).
- The overall execution flow of a Spark program is shown below:
- ![Lazy evaluation](pic/延迟计算.png)
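- A minimal sketch of the same effect, assuming a SparkContext sc is already available (for example the one from the SparkSession sketch earlier): the transformation line returns instantly, and the work only happens when the action runs.
```scala
// Transformations only record the computation; this line returns immediately
// even for a large input, because nothing is actually read or mapped yet.
val squares = sc.parallelize(1 to 1000000).map(x => x.toLong * x)

// The action is what triggers the job and pulls a result back to the driver.
val total = squares.reduce(_ + _)
println(total)
```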
- Commonly used RDD operators are categorized in the table below for quick reference:
- ![RDD operators by category](pic/RDD算子归类.png)
- Reference: https://spark.apache.org/docs/latest/rdd-programming-guide.html

@ -0,0 +1,23 @@
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// An object (rather than a class) is needed so that main can be run directly.
object HelloWorldDemo {
  def main(args: Array[String]): Unit = {
    // Placeholder: fill in the directory that contains wikiOfSpark.txt
    val rootPath: String = ""
    val file: String = s"${rootPath}/wikiOfSpark.txt"
    // Read the file contents
    // fixme oldAPI : val lineRDD: RDD[String] = spark.sparkContext.textFile(file)
    // Create or reuse a SparkContext; local[*] is for running inside the IDE
    val sc = SparkContext.getOrCreate(new SparkConf().setAppName("HelloWorldDemo").setMaster("local[*]"))
    val lineRDD: RDD[String] = sc.textFile(file)
    // Tokenize each line into words
    val wordRDD: RDD[String] = lineRDD.flatMap(line => line.split(" "))
    // Filter out empty strings
    val cleanWordRDD: RDD[String] = wordRDD.filter(word => !word.equals(""))
    // Convert RDD elements to (Key, Value) form
    val kvRDD: RDD[(String, Int)] = cleanWordRDD.map(word => (word, 1))
    // Group by word and count
    val wordCounts: RDD[(String, Int)] = kvRDD.reduceByKey((x, y) => x + y)
    // Print the 5 most frequent words
    wordCounts.map { case (k, v) => (v, k) }.sortByKey(ascending = false).take(5).foreach(println)
  }
}

@ -0,0 +1,20 @@
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<parent>
<artifactId>dev-protocol</artifactId>
<groupId>org.example</groupId>
<version>1.0-SNAPSHOT</version>
<relativePath>../../../pom.xml</relativePath>
</parent>
<modelVersion>4.0.0</modelVersion>
<artifactId>best-spark</artifactId>
<properties>
<maven.compiler.source>8</maven.compiler.source>
<maven.compiler.target>8</maven.compiler.target>
</properties>
</project>

@ -0,0 +1,190 @@
Apache Spark
From Wikipedia, the free encyclopedia
Jump to navigationJump to search
Apache Spark
Spark Logo
Original author(s) Matei Zaharia
Developer(s) Apache Spark
Initial release May 26, 2014; 6 years ago
Stable release
3.1.1 / March 2, 2021; 2 months ago
Repository Spark Repository
Written in Scala[1]
Operating system Microsoft Windows, macOS, Linux
Available in Scala, Java, SQL, Python, R, C#, F#
Type Data analytics, machine learning algorithms
License Apache License 2.0
Website spark.apache.org Edit this at Wikidata
Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.
Contents
1 Overview
1.1 Spark Core
1.2 Spark SQL
1.3 Spark Streaming
1.4 MLlib Machine Learning Library
1.5 GraphX
1.6 Language support
2 History
2.1 Developers
3 See also
4 Notes
5 References
6 External links
Overview
Apache Spark has its architectural foundation in the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines, that is maintained in a fault-tolerant way.[2] The Dataframe API was released as an abstraction on top of the RDD, followed by the Dataset API. In Spark 1.x, the RDD was the primary application programming interface (API), but as of Spark 2.x use of the Dataset API is encouraged[3] even though the RDD API is not deprecated.[4][5] The RDD technology still underlies the Dataset API.[6][7]
Spark and its RDDs were developed in 2012 in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear dataflow structure on distributed programs: MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store reduction results on disk. Spark's RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory.[8]
Spark facilitates the implementation of both iterative algorithms, which visit their data set multiple times in a loop, and interactive/exploratory data analysis, i.e., the repeated database-style querying of data. The latency of such applications may be reduced by several orders of magnitude compared to Apache Hadoop MapReduce implementation.[2][9] Among the class of iterative algorithms are the training algorithms for machine learning systems, which formed the initial impetus for developing Apache Spark.[10]
Apache Spark requires a cluster manager and a distributed storage system. For cluster management, Spark supports standalone (native Spark cluster, where you can launch a cluster either manually or use the launch scripts provided by the install package. It is also possible to run these daemons on a single machine for testing), Hadoop YARN, Apache Mesos or Kubernetes. [11] For distributed storage, Spark can interface with a wide variety, including Alluxio, Hadoop Distributed File System (HDFS),[12] MapR File System (MapR-FS),[13] Cassandra,[14] OpenStack Swift, Amazon S3, Kudu, Lustre file system,[15] or a custom solution can be implemented. Spark also supports a pseudo-distributed local mode, usually used only for development or testing purposes, where distributed storage is not required and the local file system can be used instead; in such a scenario, Spark is run on a single machine with one executor per CPU core.
Spark Core
Spark Core is the foundation of the overall project. It provides distributed task dispatching, scheduling, and basic I/O functionalities, exposed through an application programming interface (for Java, Python, Scala, .NET[16] and R) centered on the RDD abstraction (the Java API is available for other JVM languages, but is also usable for some other non-JVM languages that can connect to the JVM, such as Julia[17]). This interface mirrors a functional/higher-order model of programming: a "driver" program invokes parallel operations such as map, filter or reduce on an RDD by passing a function to Spark, which then schedules the function's execution in parallel on the cluster.[2] These operations, and additional ones such as joins, take RDDs as input and produce new RDDs. RDDs are immutable and their operations are lazy; fault-tolerance is achieved by keeping track of the "lineage" of each RDD (the sequence of operations that produced it) so that it can be reconstructed in the case of data loss. RDDs can contain any type of Python, .NET, Java, or Scala objects.
Besides the RDD-oriented functional style of programming, Spark provides two restricted forms of shared variables: broadcast variables reference read-only data that needs to be available on all nodes, while accumulators can be used to program reductions in an imperative style.[2]
A typical example of RDD-centric functional programming is the following Scala program that computes the frequencies of all words occurring in a set of text files and prints the most common ones. Each map, flatMap (a variant of map) and reduceByKey takes an anonymous function that performs a simple operation on a single data item (or a pair of items), and applies its argument to transform an RDD into a new RDD.
val conf = new SparkConf().setAppName("wiki_test") // create a spark config object
val sc = new SparkContext(conf) // Create a spark context
val data = sc.textFile("/path/to/somedir") // Read files from "somedir" into an RDD of (filename, content) pairs.
val tokens = data.flatMap(_.split(" ")) // Split each file into a list of tokens (words).
val wordFreq = tokens.map((_, 1)).reduceByKey(_ + _) // Add a count of one to each token, then sum the counts per word type.
wordFreq.sortBy(s => -s._2).map(x => (x._2, x._1)).top(10) // Get the top 10 words. Swap word and count to sort by count.
Spark SQL
Spark SQL is a component on top of Spark Core that introduced a data abstraction called DataFrames,[a] which provides support for structured and semi-structured data. Spark SQL provides a domain-specific language (DSL) to manipulate DataFrames in Scala, Java, Python or .NET.[16] It also provides SQL language support, with command-line interfaces and ODBC/JDBC server. Although DataFrames lack the compile-time type-checking afforded by RDDs, as of Spark 2.0, the strongly typed DataSet is fully supported by Spark SQL as well.
import org.apache.spark.sql.SparkSession
val url = "jdbc:mysql://yourIP:yourPort/test?user=yourUsername;password=yourPassword" // URL for your database server.
val spark = SparkSession.builder().getOrCreate() // Create a Spark session object
val df = spark
.read
.format("jdbc")
.option("url", url)
.option("dbtable", "people")
.load()
df.printSchema() // Looks the schema of this DataFrame.
val countsByAge = df.groupBy("age").count() // Counts people by age
//or alternatively via SQL:
//df.createOrReplaceTempView("people")
//val countsByAge = spark.sql("SELECT age, count(*) FROM people GROUP BY age")
Spark Streaming
Spark Streaming uses Spark Core's fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD transformations on those mini-batches of data. This design enables the same set of application code written for batch analytics to be used in streaming analytics, thus facilitating easy implementation of lambda architecture.[19][20] However, this convenience comes with the penalty of latency equal to the mini-batch duration. Other streaming data engines that process event by event rather than in mini-batches include Storm and the streaming component of Flink.[21] Spark Streaming has support built-in to consume from Kafka, Flume, Twitter, ZeroMQ, Kinesis, and TCP/IP sockets.[22]
In Spark 2.x, a separate technology based on Datasets, called Structured Streaming, that has a higher-level interface is also provided to support streaming.[23]
Spark can be deployed in a traditional on-premises data center as well as in the cloud.
MLlib Machine Learning Library
Spark MLlib is a distributed machine-learning framework on top of Spark Core that, due in large part to the distributed memory-based Spark architecture, is as much as nine times as fast as the disk-based implementation used by Apache Mahout (according to benchmarks done by the MLlib developers against the alternating least squares (ALS) implementations, and before Mahout itself gained a Spark interface), and scales better than Vowpal Wabbit.[24] Many common machine learning and statistical algorithms have been implemented and are shipped with MLlib which simplifies large scale machine learning pipelines, including:
summary statistics, correlations, stratified sampling, hypothesis testing, random data generation[25]
classification and regression: support vector machines, logistic regression, linear regression, naive Bayes classification, Decision Tree, Random Forest, Gradient-Boosted Tree
collaborative filtering techniques including alternating least squares (ALS)
cluster analysis methods including k-means, and latent Dirichlet allocation (LDA)
dimensionality reduction techniques such as singular value decomposition (SVD), and principal component analysis (PCA)
feature extraction and transformation functions
optimization algorithms such as stochastic gradient descent, limited-memory BFGS (L-BFGS)
GraphX
GraphX is a distributed graph-processing framework on top of Apache Spark. Because it is based on RDDs, which are immutable, graphs are immutable and thus GraphX is unsuitable for graphs that need to be updated, let alone in a transactional manner like a graph database.[26] GraphX provides two separate APIs for implementation of massively parallel algorithms (such as PageRank): a Pregel abstraction, and a more general MapReduce-style API.[27] Unlike its predecessor Bagel, which was formally deprecated in Spark 1.6, GraphX has full support for property graphs (graphs where properties can be attached to edges and vertices).[28]
GraphX can be viewed as being the Spark in-memory version of Apache Giraph, which utilized Hadoop disk-based MapReduce.[29]
Like Apache Spark, GraphX initially started as a research project at UC Berkeley's AMPLab and Databricks, and was later donated to the Apache Software Foundation and the Spark project.[30]
Language support
Apache Spark has built-in support for Scala, Java, R, and Python with 3rd party support for the .net languages,[31] Julia,[32] and more.
History
Spark was initially started by Matei Zaharia at UC Berkeley's AMPLab in 2009, and open sourced in 2010 under a BSD license.[33]
In 2013, the project was donated to the Apache Software Foundation and switched its license to Apache 2.0. In February 2014, Spark became a Top-Level Apache Project.[34]
In November 2014, Spark founder M. Zaharia's company Databricks set a new world record in large scale sorting using Spark.[35][33]
Spark had in excess of 1000 contributors in 2015,[36] making it one of the most active projects in the Apache Software Foundation[37] and one of the most active open source big data projects.
Version Original release date Latest version Release date
0.5 2012-06-12 0.5.1 2012-10-07
0.6 2012-10-14 0.6.2 2013-02-07
0.7 2013-02-27 0.7.3 2013-07-16
0.8 2013-09-25 0.8.1 2013-12-19
0.9 2014-02-02 0.9.2 2014-07-23
1.0 2014-05-26 1.0.2 2014-08-05
1.1 2014-09-11 1.1.1 2014-11-26
1.2 2014-12-18 1.2.2 2015-04-17
1.3 2015-03-13 1.3.1 2015-04-17
1.4 2015-06-11 1.4.1 2015-07-15
1.5 2015-09-09 1.5.2 2015-11-09
1.6 2016-01-04 1.6.3 2016-11-07
2.0 2016-07-26 2.0.2 2016-11-14
2.1 2016-12-28 2.1.3 2018-06-26
2.2 2017-07-11 2.2.3 2019-01-11
2.3 2018-02-28 2.3.4 2019-09-09
2.4 LTS 2018-11-02 2.4.7 2020-10-12[38]
3.0 2020-06-18 3.0.2 2020-02-19[39]
3.1 2021-03-02 3.1.1 2021-03-02[40]
Legend:Old versionOlder version, still maintainedLatest versionLatest preview version
Developers
Apache Spark is developed by a community. The project is managed by a group called the "Project Management Committee" (PMC). The current PMC is Aaron Davidson, Andy Konwinski, Andrew Or, Ankur Dave, Robert Joseph Evans, DB Tsai, Dongjoon Hyun, Felix Cheung, Hyukjin Kwon, Haoyuan Li, Ram Sriharsha, Holden Karau, Herman van Hövell, Imran Rashid, Jason Dai, Joseph Kurata Bradley, Joseph E. Gonzalez, Josh Rosen, Jerry Shao, Kay Ousterhout, Cheng Lian, Xiao Li, Mark Hamstra, Michael Armbrust, Matei Zaharia, Xiangrui Meng, Nicholas Pentreath, Mosharaf Chowdhury, Mridul Muralidharan, Prashant Sharma, Patrick Wendell, Reynold Xin, Ryan LeCompte, Shane Huang, Shivaram Venkataraman, Sean McNamara, Sean R. Owen, Stephen Haberman, Tathagata Das, Thomas Graves, Thomas Dudziak, Takuya Ueshin, Marcelo Masiero Vanzin, Wenchen Fan, Charles Reiss, Andrew Xia, Yin Huai, Yanbo Liang, Shixiong Zhu.[41]
See also
List of concurrent and parallel programming APIs/Frameworks
Notes
Called SchemaRDDs before Spark 1.3[18]
References
"Spark Release 2.0.0". MLlib in R: SparkR now offers MLlib APIs [..] Python: PySpark now offers many more MLlib algorithms"
Zaharia, Matei; Chowdhury, Mosharaf; Franklin, Michael J.; Shenker, Scott; Stoica, Ion. Spark: Cluster Computing with Working Sets (PDF). USENIX Workshop on Hot Topics in Cloud Computing (HotCloud).
"Spark 2.2.0 Quick Start". apache.org. 2017-07-11. Retrieved 2017-10-19. we highly recommend you to switch to use Dataset, which has better performance than RDD
"Spark 2.2.0 deprecation list". apache.org. 2017-07-11. Retrieved 2017-10-10.
Damji, Jules (2016-07-14). "A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets: When to use them and why". databricks.com. Retrieved 2017-10-19.
Chambers, Bill (2017-08-10). "12". Spark: The Definitive Guide. O'Reilly Media. virtually all Spark code you run, where DataFrames or Datasets, compiles down to an RDD
"What is Apache Spark? Spark Tutorial Guide for Beginner". janbasktraining.com. 2018-04-13. Retrieved 2018-04-13.
Zaharia, Matei; Chowdhury, Mosharaf; Das, Tathagata; Dave, Ankur; Ma, Justin; McCauley, Murphy; J., Michael; Shenker, Scott; Stoica, Ion (2010). Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing (PDF). USENIX Symp. Networked Systems Design and Implementation.
Xin, Reynold; Rosen, Josh; Zaharia, Matei; Franklin, Michael; Shenker, Scott; Stoica, Ion (June 2013). "Shark: SQL and Rich Analytics at Scale" (PDF). arXiv:1211.6176. Bibcode:2012arXiv1211.6176X. Unknown parameter |conference= ignored (help);
Harris, Derrick (28 June 2014). "4 reasons why Spark could jolt Hadoop into hyperdrive". Gigaom.
"Cluster Mode Overview - Spark 2.4.0 Documentation - Cluster Manager Types". apache.org. Apache Foundation. 2019-07-09. Retrieved 2019-07-09.
Figure showing Spark in relation to other open-source Software projects including Hadoop
MapR ecosystem support matrix
Doan, DuyHai (2014-09-10). "Re: cassandra + spark / pyspark". Cassandra User (Mailing list). Retrieved 2014-11-21.
Wang, Yandong; Goldstone, Robin; Yu, Weikuan; Wang, Teng (May 2014). "Characterization and Optimization of Memory-Resident MapReduce on HPC Systems". 2014 IEEE 28th International Parallel and Distributed Processing Symposium. IEEE. pp. 799–808. doi:10.1109/IPDPS.2014.87. ISBN 978-1-4799-3800-1. S2CID 11157612.
dotnet/spark, .NET Platform, 2020-09-14, retrieved 2020-09-14
"GitHub - DFDX/Spark.jl: Julia binding for Apache Spark". 2019-05-24.
"Spark Release 1.3.0 | Apache Spark".
"Applying the Lambda Architecture with Spark, Kafka, and Cassandra | Pluralsight". www.pluralsight.com. Retrieved 2016-11-20.
Shapira, Gwen (29 August 2014). "Building Lambda Architecture with Spark Streaming". cloudera.com. Cloudera. Archived from the original on 14 June 2016. Retrieved 17 June 2016. re-use the same aggregates we wrote for our batch application on a real-time data stream
Chintapalli, Sanket; Dagit, Derek; Evans, Bobby; Farivar, Reza; Graves, Thomas; Holderbaugh, Mark; Liu, Zhuo; Nusbaum, Kyle; Patil, Kishorkumar; Peng, Boyang Jerry; Poulosky, Paul (May 2016). "Benchmarking Streaming Computation Engines: Storm, Flink and Spark Streaming". 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE. pp. 1789–1792. doi:10.1109/IPDPSW.2016.138. ISBN 978-1-5090-3682-0. S2CID 2180634.
Kharbanda, Arush (17 March 2015). "Getting Data into Spark Streaming". sigmoid.com. Sigmoid (Sunnyvale, California IT product company). Archived from the original on 15 August 2016. Retrieved 7 July 2016.
Zaharia, Matei (2016-07-28). "Structured Streaming In Apache Spark: A new high-level API for streaming". databricks.com. Retrieved 2017-10-19.
Sparks, Evan; Talwalkar, Ameet (2013-08-06). "Spark Meetup: MLbase, Distributed Machine Learning with Spark". slideshare.net. Spark User Meetup, San Francisco, California. Retrieved 10 February 2014.
"MLlib | Apache Spark". spark.apache.org. Retrieved 2016-01-18.
Malak, Michael (14 June 2016). "Finding Graph Isomorphisms In GraphX And GraphFrames: Graph Processing vs. Graph Database". slideshare.net. sparksummit.org. Retrieved 11 July 2016.
Malak, Michael (1 July 2016). Spark GraphX in Action. Manning. p. 89. ISBN 9781617292521. Pregel and its little sibling aggregateMessages() are the cornerstones of graph processing in GraphX. ... algorithms that require more flexibility for the terminating condition have to be implemented using aggregateMessages()
Malak, Michael (14 June 2016). "Finding Graph Isomorphisms In GraphX And GraphFrames: Graph Processing vs. Graph Database". slideshare.net. sparksummit.org. Retrieved 11 July 2016.
Malak, Michael (1 July 2016). Spark GraphX in Action. Manning. p. 9. ISBN 9781617292521. Giraph is limited to slow Hadoop Map/Reduce
Gonzalez, Joseph; Xin, Reynold; Dave, Ankur; Crankshaw, Daniel; Franklin, Michael; Stoica, Ion (Oct 2014). "GraphX: Graph Processing in a Distributed Dataflow Framework" (PDF). Unknown parameter |conference= ignored (help);
[1]
[2]
Clark, Lindsay. "Apache Spark speeds up big data decision-making". ComputerWeekly.com. Retrieved 2018-05-16.
"The Apache Software Foundation Announces Apache&#8482 Spark&#8482 as a Top-Level Project". apache.org. Apache Software Foundation. 27 February 2014. Retrieved 4 March 2014.
Spark officially sets a new record in large-scale sorting
Open HUB Spark development activity
"The Apache Software Foundation Announces Apache&#8482 Spark&#8482 as a Top-Level Project". apache.org. Apache Software Foundation. 27 February 2014. Retrieved 4 March 2014.
"Spark News". apache.org.
"Spark News". apache.org.
"Spark News". apache.org.
https://projects.apache.org/committee.html?spark
External links
Official website Edit this at Wikidata
vte
Apache Software Foundation
vte
Parallel computing
Categories: Apache Software Foundation projectsBig data productsCluster computingData mining and machine learning softwareFree software programmed in ScalaHadoopJava platformSoftware using the Apache licenseUniversity of California, Berkeley

Binary image files added under pic/ (10 files; content not shown).

@ -44,6 +44,7 @@
<module>longpolling/demo/demo3/dev-protocol-netty-server</module>
<module>longpolling/demo/demo3/dev-protocol-netty-common</module>
<module>code-language/java/java-demo</module>
<module>bigdata/spark/best-spark</module>
</modules>
<properties>
