A Complete Spark Getting-Started Tutorial

1. Download Spark 2.1.0 from https://spark.apache.org/downloads.html

2. Upload the archive to a Linux server and extract it; that is already enough for simple use. To verify that the installation works:

Step 1: Change into Spark's bin directory.

Step 2: Start the shell: ./spark-shell

If startup succeeds, the shell prints the Spark banner and drops you at a scala> prompt, with a ready-made SparkContext available as sc.

3. Count words with a simple Spark program written in Java.

Environment: Eclipse + Maven + JDK 1.7 or later.

Step 1: Create a Maven Java project.

Step 2: Declare the dependencies in pom.xml:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.cn</groupId>
    <artifactId>spark</artifactId>
    <packaging>jar</packaging>
    <version>1.0</version>
    <name>spark Maven Webapp</name>

    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>3.8.1</version>
            <scope>test</scope>
        </dependency>

        <!-- match the Spark 2.1.0 installation on the server; the 2.x API is
             also what the word-count code below compiles against -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.1.0</version>
            <scope>provided</scope>
        </dependency>

        <dependency>
            <groupId>net.sf.jopt-simple</groupId>
            <artifactId>jopt-simple</artifactId>
            <version>4.3</version>
        </dependency>

        <dependency>
            <groupId>joda-time</groupId>
            <artifactId>joda-time</artifactId>
            <version>2.0</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <!-- bundle the project and its non-provided dependencies into a single jar -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.3</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

Once the POM is complete, run mvn install to download the dependencies and compile the project.
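Note that the maven-shade-plugin execution above is bound to the package phase, so mvn install (or mvn package) also produces an uber-jar, target/spark-1.0.jar, which is the file uploaded to the server in step 4. Because spark-core is declared with provided scope, Spark itself is not bundled; the Spark installation on the server supplies it at run time.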
Step 3: Write the word-count code:

package com.databricks.test_spark;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

public class TestSpark {
    public static void main(String[] args) {
        // the app name can be anything you like
        SparkConf conf = new SparkConf().setAppName("wordCount");
        JavaSparkContext sc = new JavaSparkContext(conf);

        List<String> list = new ArrayList<String>();
        list.add("a b c d e");
        list.add("a b c d e");
        JavaRDD<String> RddList = sc.parallelize(list);

        // first split each line into words and flatten the result
        JavaRDD<String> words = RddList.flatMap(
            new FlatMapFunction<String, String>() {
                public Iterator<String> call(String x) {
                    // Important: in Spark 2.x the result must be returned as an Iterator;
                    // otherwise Spark reports an error as soon as the job is submitted.
                    return Arrays.asList(x.split(" ")).iterator();
                }
            }
        );

        // then turn each word into a (word, 1) pair and sum the counts per key
        JavaPairRDD<String, Integer> counts = words.mapToPair(
            new PairFunction<String, String, Integer>() {
                public Tuple2<String, Integer> call(String x) throws Exception {
                    return new Tuple2<String, Integer>(x, 1);
                }
            }
        ).reduceByKey(new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer x, Integer y) throws Exception {
                return x + y;
            }
        });

        // write the result as text files under this directory
        counts.saveAsTextFile("/home/spark/spark_data");
        sc.close();
    }
}
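The class above leaves the master URL to spark-submit, so it cannot be run directly from the IDE. For quick experiments a local-mode variant can be convenient; the sketch below is only an illustration (the class name TestSparkLocal is made up, and it assumes Spark 2.1.0 with JDK 8 so that lambdas can be used):

package com.databricks.test_spark;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class TestSparkLocal {
    public static void main(String[] args) {
        // local[*] runs Spark inside this JVM using all cores,
        // so no cluster or spark-submit is needed while developing
        SparkConf conf = new SparkConf()
                .setAppName("wordCountLocal")
                .setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        List<String> list = Arrays.asList("a b c d e", "a b c d e");
        JavaRDD<String> lines = sc.parallelize(list);

        // the same flatMap / mapToPair / reduceByKey chain, written with lambdas;
        // the flatMap lambda still has to return an Iterator in Spark 2.x
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);

        // print the result to the console instead of writing to disk
        counts.collect().forEach(System.out::println);

        sc.close();
    }
}

Hard-coding setMaster is only for local debugging; the jar you submit to a cluster should leave the master to spark-submit, as in the original class.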

4. Upload the project jar to any path on the server and test submitting the job with spark-submit:

# --class is the fully qualified main class;
# the last argument is the path of the uploaded jar on the server
./spark-submit \
  --class com.databricks.test_spark.TestSpark \
  /home/spark/spark-1.0.jar
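Unless a master is configured elsewhere (for example in conf/spark-defaults.conf), spark-submit without --master runs the job in local mode. When it finishes, the word counts appear as part-* text files under /home/spark/spark_data; note that saveAsTextFile refuses to run if that directory already exists, so remove it before re-submitting.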

I am only a beginner myself; I hope this helps others who are just starting out and don't know where to begin.
