Setting up a Spark on YARN development environment and debugging it in IDEA
Copyright notice: this is an original post by the author and may not be reproduced without permission. https://blog.csdn.net/qq_31806205/article/details/80451743

1. Import the YARN and HDFS configuration files

Spark on YARN depends on YARN and HDFS, so the first prerequisite is getting hold of their configuration files. Copy core-site.xml, hdfs-site.xml, and yarn-site.xml into the resources directory of your IDEA project, as shown below:

[Screenshot: the project's resources directory containing core-site.xml, hdfs-site.xml, and yarn-site.xml]
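As a quick sanity check (a minimal sketch, not part of the original steps), you can confirm that the copied files really are picked up from the classpath. It only needs the hadoop-common dependency declared in the next step; Configuration loads core-site.xml automatically, and the other two files are added by resource name:

import org.apache.hadoop.conf.Configuration

object ClasspathConfigCheck {
  def main(args: Array[String]): Unit = {
    // core-site.xml is loaded from the classpath automatically;
    // hdfs-site.xml and yarn-site.xml are added here by resource name,
    // which also resolves against the classpath (i.e. the resources directory).
    val conf = new Configuration()
    conf.addResource("hdfs-site.xml")
    conf.addResource("yarn-site.xml")

    // If these print your cluster's values instead of defaults or null,
    // the XML files are being read correctly.
    println("fs.defaultFS                  = " + conf.get("fs.defaultFS"))
    println("yarn.resourcemanager.hostname = " + conf.get("yarn.resourcemanager.hostname"))
  }
}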

2. Add the project dependencies

Besides the dependencies you declare in the pom:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <!--<scope>test</scope>-->
    <version>2.7.3</version>
</dependency>

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <!--<scope>test</scope>-->
    <version>2.7.3</version>
</dependency>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <!--<scope>test</scope>-->
    <version>2.2.1</version>
</dependency>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <!--<scope>test</scope>-->
    <version>2.2.1</version>
</dependency>

<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.34</version>
</dependency>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.11</artifactId>
    <version>2.2.1</version>
</dependency>

you also have to add the spark-yarn jar to your module's dependencies, as shown below:
[Screenshots: adding the spark-yarn jar to the module dependencies in IDEA]

This is just to give you the idea: in my case the missing jar happened to be spark-yarn_2.11-2.2.1.jar; add whatever your own setup turns out to be missing. All of the jars Spark depends on live under ${SPARK_HOME}/jars, so look for them there. If all else fails, just add the whole directory with *; brute force, but it works.
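As an alternative to copying the jar around by hand, the same classes can also be pulled in through Maven; spark-yarn is published under the same coordinates as the other Spark modules, so a dependency along these lines (matching your Scala and Spark versions) should serve the same purpose:

<!-- alternative to adding spark-yarn_2.11-2.2.1.jar manually from ${SPARK_HOME}/jars -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-yarn_2.11</artifactId>
    <version>2.2.1</version>
</dependency>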

If you don't add it, you will get an error like this:

Caused by: org.apache.spark.SparkException: Unable to load YARN support
    at org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:413)
    at org.apache.spark.deploy.SparkHadoopUtil$.yarn$lzycompute(SparkHadoopUtil.scala:408)
    at org.apache.spark.deploy.SparkHadoopUtil$.yarn(SparkHadoopUtil.scala:408)
    at org.apache.spark.deploy.SparkHadoopUtil$.get(SparkHadoopUtil.scala:433)
    at org.apache.spark.util.Utils$.getSparkOrYarnConfig(Utils.scala:2381)
    at org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:156)
    at org.apache.spark.SparkEnv$.create(SparkEnv.scala:351)
    at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:175)
    at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:257)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:432)
    at com.timanetworks.spark.faw.CommonStaticConst$.loadHdfsConfig(CommonStaticConst.scala:37)
    at com.timanetworks.spark.faw.CommonStaticConst$.<init>(CommonStaticConst.scala:23)
    at com.timanetworks.spark.faw.CommonStaticConst$.<clinit>(CommonStaticConst.scala)
    ... 3 more
Caused by: java.lang.ClassNotFoundException: org.apache.spark.deploy.yarn.YarnSparkHadoopUtil
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.apache.spark.util.Utils$.classForName(Utils.scala:230)
    at org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:409)

3. Modify the following setting in core-site.xml

Comment out the following configuration in core-site.xml:
[Screenshot: the net.topology.script.file.name property in core-site.xml]

In plain terms, comment out this:

<property>
  <name>net.topology.script.file.name</name>
  <value>/etc/hadoop/conf/topology_script.py</value>
</property>
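Commenting it out in the copy under your resources directory simply means wrapping the property in an XML comment:

<!--
<property>
  <name>net.topology.script.file.name</name>
  <value>/etc/hadoop/conf/topology_script.py</value>
</property>
-->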

If you're tempted to copy that py script down from the Linux box and point the property at a Windows path instead, I can tell you in all seriousness: I tried it... it does not work!

<property>
  <name>net.topology.script.file.name</name>
  <value>D:\spark\spark-2.2.1-bin-hadoop2.7\topology_script.py</value>
</property>

That approach fails too; the property really does have to be commented out!

Otherwise you will run into this error:

    java.io.IOException: Cannot run program "/etc/hadoop/conf/topology_script.py" (in directory "D:\workspace\fawmc-new44\operation-report-calc"): CreateProcess error=2, The system cannot find the file specified.
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:520)
        at org.apache.hadoop.util.Shell.run(Shell.java:479)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
        at org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.runResolveCommand(ScriptBasedMapping.java:251)
        at org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.resolve(ScriptBasedMapping.java:188)
        at org.apache.hadoop.net.CachedDNSToSwitchMapping.resolve(CachedDNSToSwitchMapping.java:119)
        at org.apache.hadoop.yarn.util.RackResolver.coreResolve(RackResolver.java:101)
        at org.apache.hadoop.yarn.util.RackResolver.resolve(RackResolver.java:81)
        at org.apache.spark.scheduler.cluster.YarnScheduler.getRackForHost(YarnScheduler.scala:37)
        at org.apache.spark.scheduler.TaskSetManager$$anonfun$addPendingTask$1.apply(TaskSetManager.scala:225)
		at org.apache.spark.scheduler.TaskSetManager$$anonfun$addPendingTask$1.apply(TaskSetManager.scala:206)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at org.apache.spark.scheduler.TaskSetManager.addPendingTask(TaskSetManager.scala:206)
        at org.apache.spark.scheduler.TaskSetManager$$anonfun$1.apply$mcVI$sp(TaskSetManager.scala:178)
		at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:166)
		at org.apache.spark.scheduler.TaskSetManager.<init>(TaskSetManager.scala:177)
		at org.apache.spark.scheduler.TaskSchedulerImpl.createTaskSetManager(TaskSchedulerImpl.scala:229)
		at org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:193)
		at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1055)
		at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:930)
        at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:874)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1695)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    Caused by: java.io.IOException: CreateProcess error=2, The system cannot find the file specified.
        at java.lang.ProcessImpl.create(Native Method)
        at java.lang.ProcessImpl.<init>(ProcessImpl.java:386)
        at java.lang.ProcessImpl.start(ProcessImpl.java:137)
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
        ... 26 more

Or this one:

"D:\spark\spark-2.2.1-bin-hadoop2.7\topology_script.py" (in directory "D:\workspace\fawmc-new44\operation-report-calc"): CreateProcess error=193, %1 不是有效的 Win32 应用程序。

And that's it: with the steps above done, you can debug Spark on YARN straight from IDEA. One more note: debugging is normally done in yarn-client mode.
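For reference, a driver for this kind of yarn-client debugging might look like the minimal sketch below. The spark.yarn.jars value is a hypothetical HDFS path standing in for wherever your Spark jars are available; adjust it to your cluster, or leave it unset and let Spark fall back to uploading the jars under SPARK_HOME on each run:

import org.apache.spark.sql.SparkSession

object YarnClientDebug {
  def main(args: Array[String]): Unit = {
    // Minimal sketch of a driver launched from IDEA in yarn-client mode.
    // Assumes the three XML files from step 1 are on the classpath and that
    // the spark-yarn classes from step 2 are available.
    val spark = SparkSession.builder()
      .appName("yarn-client-debug")
      .master("yarn")
      .config("spark.submit.deployMode", "client")
      .config("spark.yarn.jars", "hdfs:///spark/jars/*.jar") // hypothetical location
      .getOrCreate()

    // A trivial job, just to confirm that executors actually start on the cluster.
    val sum = spark.sparkContext.parallelize(1 to 100).reduce(_ + _)
    println(s"sum = $sum")

    spark.stop()
  }
}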

The result:

[Screenshot: the job running successfully on YARN]

That's all. If you also run into HDFS permission problems (not being able to create, read, or write files and so on), there is plenty of material online; it mostly boils down to the two commands hadoop fs -chmod and hadoop fs -chown, which are used much like their Linux counterparts.
If you hit Windows-specific permission problems, have a look at my other post:

Common problems when setting up a hadoop/spark environment on Windows
https://blog.csdn.net/qq_31806205/article/details/79819724
