5-1. Setting Up the Spark Environment

Copyright notice: the joy of sharing. https://blog.csdn.net/baolibin528/article/details/50865575

2. Spark Environment Setup

2.1 Download from the Official Site

Spark official site: http://spark.apache.org/

The downloaded package:

Installing and deploying Spark on Linux requires:

JDK

Scala

SSH

Hadoop

Spark

2.2 Installation Modes

Local mode (for learning and testing)

Standalone mode (Spark's built-in resource management and scheduling framework)

Mesos (Apache's resource management framework)

YARN (Hadoop's resource management framework)

Which mode an application runs in is determined by the master URL it is given, as the sketch below shows.
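A minimal Scala sketch of how the master URL selects the mode (the host names and ports here are placeholders, not values from this article):

import org.apache.spark.SparkConf

object MasterUrlDemo {
  def main(args: Array[String]): Unit = {
    // Local mode: everything runs in one JVM, one worker thread per core.
    val local = new SparkConf().setAppName("demo").setMaster("local[*]")

    // Standalone mode: connect to Spark's built-in master (default port 7077).
    val standalone = new SparkConf().setAppName("demo").setMaster("spark://192.168.6.2:7077")

    // Mesos / YARN: resource management is delegated to the external framework.
    val mesos = new SparkConf().setAppName("demo").setMaster("mesos://mesos-master:5050")
    val yarn  = new SparkConf().setAppName("demo").setMaster("yarn-client") // Spark 1.x style

    println(local.get("spark.master")) // prints local[*]
  }
}

In practice the master is usually left out of the code and passed to spark-submit with --master instead, which is what section 2.6 does.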

2.3 Local Mode

Software used:

JDK 1.8

CentOS 6.5

Hadoop 2.6.0

spark-1.5.2-bin-hadoop2.6.tgz

1. Unpack the archive, then create and edit the spark-env.sh file:

[root@spark0 conf]# cp spark-env.sh.template spark-env.sh
[root@spark0 conf]# vim spark-env.sh
SPARK_MASTER_IP=192.168.6.2

2. Set the master node IP address:

3. Set the worker node IP addresses:

Add the IP addresses:

4. Start it:

sbin/start-all.sh

Check the running processes:

5. Start the spark-shell:

bin/spark-shell

6. Run a simple example:

val rdd = sc.textFile("/home/spark0/spark.txt").collect
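Since collect is an action, rdd here is actually an Array[String] holding the file's lines on the driver, not an RDD. A slightly fuller sketch of the same session (assuming the same /home/spark0/spark.txt file exists):

val lines = sc.textFile("/home/spark0/spark.txt") // transformation: nothing runs yet
println(lines.count())                            // action: number of lines in the file
lines.take(3).foreach(println)                    // action: print the first three lines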

7. View the job in the web UI at http://192.168.6.2:4040/jobs/:

2.4 Standalone Mode

Install Hadoop in advance.

I prepared two nodes, with the JDK and Hadoop already installed on each.

I used two nodes because my machine cannot handle more; a three-node demo would be better.

1. Unpack the archive.

2. Edit the configuration files:

[root@spark0 conf]# cp spark-env.sh.template spark-env.sh
[root@spark0 conf]# vim spark-env.sh
SPARK_MASTER_IP=192.168.6.2

[root@spark0 conf]# cp slaves.template slaves
[root@spark0 conf]# vim slaves
# A Spark Worker will be started on each of the machines listed below.
192.168.6.2
192.168.6.3
3. Copy the configured directory to the other node:

[root@spark0 local]# scp -r spark spark1:/usr/local/

4. Check the master node's processes: besides Worker there is also a Master process.

5. Check the other node: besides the Hadoop processes there is only a Worker, no Master process.

6. For the example, first create a file locally.

7. Open the spark-shell and run a small example:

scala> val rdd = sc.textFile("/home/spark0/spark.txt").collect
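One caveat: with more than one worker, a local path like /home/spark0/spark.txt has to exist on every node, or tasks scheduled on the other node will fail to read it. The usual fix is to put the file in HDFS and read it from there; a sketch (the namenode address is a placeholder for whatever fs.defaultFS is in your Hadoop configuration):

scala> val rdd = sc.textFile("hdfs://192.168.6.2:9000/bigdata/spark.txt")
scala> rdd.count()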

8. Stop the two-node Spark cluster:

2.5 Setting Up a Scala Environment in IntelliJ IDEA

1. Install the Scala plugin:

Select the plugin to install:

Click Install:

Installation progress:

After the plugin is installed, restart IntelliJ IDEA:

2. Configure the JDK:

Load the local JDK:

The JDK and Scala loaded:

3. Create a Scala project:

Project name and location:

Right-click the src directory and create a package and a class:

Write some simple test code:

The code is as follows:

/**
 * Created by admin on 2015/12/15.
 */
object Demo1 {
  def main(args: Array[String]) {
    val a = 123
    println(a)
  }
}

Right-click the class and run the code:

The run result:

4. Package the project:

Next step:

Select the main class:

Next step:

Click Apply, then OK:

Click Next through the remaining dialogs:

The JAR built locally:

2.6 Developing in IntelliJ IDEA and Submitting Spark Code to the Cluster

1. Right-click the project:

Choose Open Module Settings (fourth from the bottom):

2. Add the Spark dependency JARs:

Add them:

3. Progress:

4. Write the word-count (WC) code:

Create a new Scala class and start writing the Spark code:

import org.apache.spark.{SparkContext, SparkConf}

/**
 * Created by admin on 2015/12/15.
 */
object SparkDemo1 {
  def main(args: Array[String]) {
    if (args.length < 1) {
      System.err.print("Usage <File>")
      System.exit(1)
    }
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    val line = sc.textFile(args(0))

    // Split each line on tabs, count each word, and print the totals on the driver.
    line.flatMap(_.split("\t")).map((_, 1)).reduceByKey(_ + _).collect().foreach(println)

    sc.stop()
  }
}
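Note that the SparkConf above sets no master or application name on purpose: spark-submit supplies them at launch time. For debugging inside the IDE before going to the cluster, a common variant (hypothetical, not from the original article) is to hard-code local mode:

val conf = new SparkConf()
  .setAppName("SparkDemo1-local") // name shown in the web UI on port 4040
  .setMaster("local[*]")          // run in-process, one thread per core
val sc = new SparkContext(conf)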

The IntelliJ IDEA view looks like this:

5. Package it:

The built JAR:

The Standalone cluster is used:

6. Run the code on the cluster:

[root@spark0 spark]# bin/spark-submit --class SparkDemo1 /usr/local/spark/ScalaDemo1.jar /usr/local/spark-1.5.2-bin-hadoop2.6/bigdata/wcDemo1.txt

Here --class names the main class, the first positional argument is the application JAR, and everything after it is passed to main, so args(0) is the input file wcDemo1.txt.

The run result:

The complete run output (note how reduceByKey introduces a shuffle, splitting the job into ShuffleMapStage 0 and ResultStage 1):

[root@spark0 spark]# bin/spark-submit --class SparkDemo1 /usr/local/spark/ScalaDemo1.jar /usr/local/spark-1.5.2-bin-hadoop2.6/bigdata/wcDemo1.txt
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/12/15 14:44:21 INFO SparkContext: Running Spark version 1.5.2
15/12/15 14:44:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/12/15 14:44:21 INFO SecurityManager: Changing view acls to: root
15/12/15 14:44:21 INFO SecurityManager: Changing modify acls to: root
15/12/15 14:44:21 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
15/12/15 14:44:23 INFO Slf4jLogger: Slf4jLogger started
15/12/15 14:44:23 INFO Remoting: Starting remoting
15/12/15 14:44:23 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.6.2:40172]
15/12/15 14:44:23 INFO Utils: Successfully started service 'sparkDriver' on port 40172.
15/12/15 14:44:23 INFO SparkEnv: Registering MapOutputTracker
15/12/15 14:44:23 INFO SparkEnv: Registering BlockManagerMaster
15/12/15 14:44:23 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-dbd5e9f3-bfa1-4c03-9bbe-9ff2661049f2
15/12/15 14:44:23 INFO MemoryStore: MemoryStore started with capacity 534.5 MB
15/12/15 14:44:23 INFO HttpFileServer: HTTP File server directory is /tmp/spark-ffda6806-746c-4740-9558-bd3a047fafe2/httpd-0ded5c18-fb57-4ab7-bcd2-3ef6a5fa550d
15/12/15 14:44:23 INFO HttpServer: Starting HTTP Server
15/12/15 14:44:23 INFO Utils: Successfully started service 'HTTP file server' on port 40716.
15/12/15 14:44:23 INFO SparkEnv: Registering OutputCommitCoordinator
15/12/15 14:44:24 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/12/15 14:44:24 INFO SparkUI: Started SparkUI at http://192.168.6.2:4040
15/12/15 14:44:24 INFO SparkContext: Added JAR file:/usr/local/spark/ScalaDemo1.jar at http://192.168.6.2:40716/jars/ScalaDemo1.jar with timestamp 1450161864561
15/12/15 14:44:24 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
15/12/15 14:44:24 INFO Executor: Starting executor ID driver on host localhost
15/12/15 14:44:25 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 40076.
15/12/15 14:44:25 INFO NettyBlockTransferService: Server created on 40076
15/12/15 14:44:25 INFO BlockManagerMaster: Trying to register BlockManager
15/12/15 14:44:25 INFO BlockManagerMasterEndpoint: Registering block manager localhost:40076 with 534.5 MB RAM, BlockManagerId(driver, localhost, 40076)
15/12/15 14:44:25 INFO BlockManagerMaster: Registered BlockManager
15/12/15 14:44:29 INFO MemoryStore: ensureFreeSpace(130448) called with curMem=0, maxMem=560497950
15/12/15 14:44:29 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 127.4 KB, free 534.4 MB)
15/12/15 14:44:29 INFO MemoryStore: ensureFreeSpace(14276) called with curMem=130448, maxMem=560497950
15/12/15 14:44:29 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 13.9 KB, free 534.4 MB)
15/12/15 14:44:29 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:40076 (size: 13.9 KB, free: 534.5 MB)
15/12/15 14:44:29 INFO SparkContext: Created broadcast 0 from textFile at SparkDemo1.scala:14
15/12/15 14:44:30 INFO FileInputFormat: Total input paths to process : 1
15/12/15 14:44:30 INFO SparkContext: Starting job: collect at SparkDemo1.scala:16
15/12/15 14:44:30 INFO DAGScheduler: Registering RDD 3 (map at SparkDemo1.scala:16)
15/12/15 14:44:30 INFO DAGScheduler: Got job 0 (collect at SparkDemo1.scala:16) with 1 output partitions
15/12/15 14:44:30 INFO DAGScheduler: Final stage: ResultStage 1 (collect at SparkDemo1.scala:16)
15/12/15 14:44:30 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
15/12/15 14:44:30 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
15/12/15 14:44:30 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at map at SparkDemo1.scala:16), which has no missing parents
15/12/15 14:44:30 INFO MemoryStore: ensureFreeSpace(4064) called with curMem=144724, maxMem=560497950
15/12/15 14:44:30 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.0 KB, free 534.4 MB)
15/12/15 14:44:30 INFO MemoryStore: ensureFreeSpace(2328) called with curMem=148788, maxMem=560497950
15/12/15 14:44:30 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.3 KB, free 534.4 MB)
15/12/15 14:44:30 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:40076 (size: 2.3 KB, free: 534.5 MB)
15/12/15 14:44:30 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:861
15/12/15 14:44:30 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at map at SparkDemo1.scala:16)
15/12/15 14:44:30 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
15/12/15 14:44:30 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 2213 bytes)
15/12/15 14:44:30 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
15/12/15 14:44:30 INFO Executor: Fetching http://192.168.6.2:40716/jars/ScalaDemo1.jar with timestamp 1450161864561
15/12/15 14:44:31 INFO Utils: Fetching http://192.168.6.2:40716/jars/ScalaDemo1.jar to /tmp/spark-ffda6806-746c-4740-9558-bd3a047fafe2/userFiles-e4174a0b-997b-424c-893b-0570668f9320/fetchFileTemp8876302010385452933.tmp
15/12/15 14:44:36 INFO Executor: Adding file:/tmp/spark-ffda6806-746c-4740-9558-bd3a047fafe2/userFiles-e4174a0b-997b-424c-893b-0570668f9320/ScalaDemo1.jar to class loader
15/12/15 14:44:36 INFO HadoopRDD: Input split: file:/usr/local/spark-1.5.2-bin-hadoop2.6/bigdata/wcDemo1.txt:0+142
15/12/15 14:44:36 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
15/12/15 14:44:36 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
15/12/15 14:44:36 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
15/12/15 14:44:36 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
15/12/15 14:44:36 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
15/12/15 14:44:37 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2253 bytes result sent to driver
15/12/15 14:44:37 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 7131 ms on localhost (1/1)
15/12/15 14:44:37 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/12/15 14:44:38 INFO DAGScheduler: ShuffleMapStage 0 (map at SparkDemo1.scala:16) finished in 7.402 s
15/12/15 14:44:38 INFO DAGScheduler: looking for newly runnable stages
15/12/15 14:44:38 INFO DAGScheduler: running: Set()
15/12/15 14:44:38 INFO DAGScheduler: waiting: Set(ResultStage 1)
15/12/15 14:44:38 INFO DAGScheduler: failed: Set()
15/12/15 14:44:38 INFO DAGScheduler: Missing parents for ResultStage 1: List()
15/12/15 14:44:38 INFO DAGScheduler: Submitting ResultStage 1 (ShuffledRDD[4] at reduceByKey at SparkDemo1.scala:16), which is now runnable
15/12/15 14:44:38 INFO MemoryStore: ensureFreeSpace(2288) called with curMem=151116, maxMem=560497950
15/12/15 14:44:38 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.2 KB, free 534.4 MB)
15/12/15 14:44:38 INFO MemoryStore: ensureFreeSpace(1375) called with curMem=153404, maxMem=560497950
15/12/15 14:44:38 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1375.0 B, free 534.4 MB)
15/12/15 14:44:38 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:40076 (size: 1375.0 B, free: 534.5 MB)
15/12/15 14:44:38 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:861
15/12/15 14:44:38 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (ShuffledRDD[4] at reduceByKey at SparkDemo1.scala:16)
15/12/15 14:44:38 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
15/12/15 14:44:38 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, PROCESS_LOCAL, 1955 bytes)
15/12/15 14:44:38 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
15/12/15 14:44:38 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
15/12/15 14:44:38 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 96 ms
15/12/15 14:44:38 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1541 bytes result sent to driver
15/12/15 14:44:38 INFO DAGScheduler: ResultStage 1 (collect at SparkDemo1.scala:16) finished in 0.335 s
15/12/15 14:44:38 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 335 ms on localhost (1/1)
15/12/15 14:44:38 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
15/12/15 14:44:38 INFO DAGScheduler: Job 0 finished: collect at SparkDemo1.scala:16, took 8.480429 s
(spark,5)
(hive,3)
(hadoop,5)
(docker,1)
(flume,1)
(solr,1)
(storm,1)
(elasticsearch,1)
(kafka,1)
(sqoop,1)
(redis,1)
(hbase,1)
15/12/15 14:44:39 INFO SparkUI: Stopped Spark web UI at http://192.168.6.2:4040
15/12/15 14:44:39 INFO DAGScheduler: Stopping DAGScheduler
15/12/15 14:44:42 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
15/12/15 14:44:42 INFO MemoryStore: MemoryStore cleared
15/12/15 14:44:42 INFO BlockManager: BlockManager stopped
15/12/15 14:44:42 INFO BlockManagerMaster: BlockManagerMaster stopped
15/12/15 14:44:42 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
15/12/15 14:44:42 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
15/12/15 14:44:42 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
15/12/15 14:44:42 INFO SparkContext: Successfully stopped SparkContext
15/12/15 14:44:42 INFO ShutdownHookManager: Shutdown hook called
15/12/15 14:44:42 INFO ShutdownHookManager: Deleting directory /tmp/spark-ffda6806-746c-4740-9558-bd3a047fafe2
[root@spark0 spark]#

7. Stop the Spark cluster:

2.7 Demonstrating Spark union and join Operations

A brief introduction to union:

In SQL, when we want the results of two SELECT statements displayed as a single whole, we use the UNION or UNION ALL keyword. UNION (the set union) merges multiple result sets and displays them together.

UNION: takes the union of two result sets, excluding duplicate rows, and applies the default sort order;

(UNION ALL: takes the union of two result sets, including duplicate rows, without sorting.)

Joins:

SQL has roughly these kinds of JOIN:

cross join: cross join (Cartesian product)

inner join: inner join

left outer join: keeps everything the left side has; where the right side has no match, its columns are filled with NULL

right outer join: right outer join (the mirror image)

full outer join: full join

The RDD counterparts of the outer variants are shown in the sketch after the join demo below.

Create rdd1:

scala> val rdd1 = sc.parallelize(List(('a',2),('b',4),('c',6),('d',9)))
rdd1: org.apache.spark.rdd.RDD[(Char, Int)] = ParallelCollectionRDD[6] at parallelize at <console>:15

Create rdd2:

scala> val rdd2 = sc.parallelize(List(('c',6),('c',7),('d',8),('e',10)))
rdd2: org.apache.spark.rdd.RDD[(Char, Int)] = ParallelCollectionRDD[7] at parallelize at <console>:15

Run union:

scala> val unionrdd = rdd1 union rdd2
unionrdd: org.apache.spark.rdd.RDD[(Char, Int)] = UnionRDD[8] at union at <console>:19

scala> unionrdd.collect
res2: Array[(Char, Int)] = Array((a,2), (b,4), (c,6), (d,9), (c,6), (c,7), (d,8), (e,10))
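Note that (c,6) appears twice in the result: RDD union behaves like SQL UNION ALL and keeps duplicates. To get SQL UNION semantics, deduplicate explicitly; a sketch against the same session:

scala> val dedup = rdd1.union(rdd2).distinct() // shuffles to drop duplicate pairs
scala> dedup.collect                           // seven pairs; the repeated (c,6) appears only once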

Run join:

scala> val joinrdd = rdd1 join rdd2
joinrdd: org.apache.spark.rdd.RDD[(Char, (Int, Int))] = MapPartitionsRDD[11] at join at <console>:19

scala> joinrdd.collect
res3: Array[(Char, (Int, Int))] = Array((d,(9,8)), (c,(6,6)), (c,(6,7)))
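RDD join is the inner join: only keys present in both RDDs (here c and d) survive, with one output pair per matching combination. A sketch of the other join types from the list above, using the same rdd1 and rdd2 (result types in the comments; element order may vary):

scala> rdd1.leftOuterJoin(rdd2).collect
// Array[(Char, (Int, Option[Int]))]: every key of rdd1; a and b pair with None

scala> rdd1.rightOuterJoin(rdd2).collect
// Array[(Char, (Option[Int], Int))]: every key of rdd2; e pairs with None

scala> rdd1.fullOuterJoin(rdd2).collect
// Array[(Char, (Option[Int], Option[Int]))]: keys from either side

scala> rdd1.cartesian(rdd2).collect
// Array[((Char, Int), (Char, Int))]: the cross join, all 4 x 4 = 16 pairings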
