1. Download Spark 2.1.0 from https://spark.apache.org/downloads.html
2. Upload it to a Linux server and extract it; that is enough for a simple trial. To verify that it works:
Step 1: change into Spark's bin directory.
Step 2: run the Spark shell with ./spark-shell
If it starts successfully, the Spark banner and the scala> prompt appear.
3. A simple Spark word count written in Java.
Environment: Eclipse + Maven + JDK 1.7 or later
Step 1: create a Maven Java project.
Step 2: declare the dependencies in pom.xml:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.cn</groupId>
  <artifactId>spark</artifactId>
  <packaging>jar</packaging>
  <version>1.0</version>
  <name>spark Maven Webapp</name>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
    <!-- match the Spark version installed on the server (2.1.0 here);
         scope "provided" because the cluster supplies it at runtime -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>2.1.0</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>net.sf.jopt-simple</groupId>
      <artifactId>jopt-simple</artifactId>
      <version>4.3</version>
    </dependency>
    <dependency>
      <groupId>joda-time</groupId>
      <artifactId>joda-time</artifactId>
      <version>2.0</version>
    </dependency>
  </dependencies>
  <build>
    <plugins>
      <!-- the shade plugin bundles the non-provided dependencies into one submittable jar -->
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>2.3</version>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>
Once the pom is complete, run mvn install to download the dependencies and build the project.
Step 3: write the word-count code.
package com.databricks.test_spark;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

public class TestSpark {
    public static void main(String[] args) {
        // the app name can be anything; the master URL is supplied by spark-submit
        SparkConf conf = new SparkConf().setAppName("wordCount");
        JavaSparkContext sc = new JavaSparkContext(conf);

        List<String> list = new ArrayList<String>();
        list.add("a b c d e");
        list.add("a b c d e");
        JavaRDD<String> rddList = sc.parallelize(list);

        // first split each line into words and flatten the result
        JavaRDD<String> words = rddList.flatMap(
            new FlatMapFunction<String, String>() {
                public Iterator<String> call(String x) {
                    // important: in Spark 2.x this must return an Iterator;
                    // returning a List/Iterable (the 1.x signature) makes the
                    // submitted job fail immediately -- this cost me real time
                    return Arrays.asList(x.split(" ")).iterator();
                }
            }
        );

        // then map each word to a (word, 1) pair and sum the counts per key
        JavaPairRDD<String, Integer> counts = words.mapToPair(
            new PairFunction<String, String, Integer>() {
                public Tuple2<String, Integer> call(String x) throws Exception {
                    return new Tuple2<String, Integer>(x, 1);
                }
            }
        ).reduceByKey(new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer x, Integer y) throws Exception {
                return x + y;
            }
        });

        // note: the output directory must not already exist, or the save fails
        counts.saveAsTextFile("/home/spark/spark_data");
        sc.close();
    }
}
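The RDD pipeline above is just a distributed version of an ordinary map-based word count. As a plain-Java sketch of what flatMap, mapToPair, and reduceByKey compute for the sample input (no Spark required; the class name here is illustrative only):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class WordCountSketch {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a b c d e", "a b c d e");
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        for (String line : lines) {
            // flatMap step: split each line into words
            for (String word : line.split(" ")) {
                // mapToPair + reduceByKey step: emit (word, 1), sum per key
                Integer c = counts.get(word);
                counts.put(word, c == null ? 1 : c + 1);
            }
        }
        System.out.println(counts); // {a=2, b=2, c=2, d=2, e=2}
    }
}
```

Spark performs the same aggregation, but partitioned across executors, which is why each step is expressed as a serializable function object rather than a loop.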
4. Upload the project's jar to any path on the server and test submitting the job to Spark:
./spark-submit \
  --class com.databricks.test_spark.TestSpark \
  /home/spark/spark-1.0.jar
Here --class takes the fully qualified main class, and the last argument is the path of the jar on the server.
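After the job finishes, saveAsTextFile writes each Tuple2 to the part files using its toString, so the output lines look like (a,2). A tiny plain-Java sketch of that formatting (illustrative only, no Spark involved):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OutputFormatSketch {
    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        counts.put("a", 2);
        counts.put("b", 2);
        // one "(key,value)" line per pair, mirroring Tuple2.toString
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.println("(" + e.getKey() + "," + e.getValue() + ")");
        }
    }
}
```

Checking a file such as /home/spark/spark_data/part-00000 on the server should show lines in this shape.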
I am a beginner myself; I hope this helps others who are just getting started and don't know where to begin.