Application recovery when the Spark master or Spark workers go down
2019-02-11 13:28:45
Tags: spark, master, worker, crash, application, recovery

We can distinguish five scenarios:

1. The Spark master process is down before the application is submitted

2. The Spark master goes down while the application is running

3. All Spark workers are down before the job is submitted

4. A Spark worker goes down while the application is running

5. All Spark workers go down while the application is running


1. The Spark master process is down before the application is submitted

The application cannot be submitted at all, so there is no recovery to consider.
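For context, a minimal Scala sketch of how a standalone application points at the master (the master URL and names below are placeholders, not values from this post): if no master is listening at that address, the SparkContext cannot be created and no application is ever registered.

import org.apache.spark.{SparkConf, SparkContext}

object SubmitDemo {
  def main(args: Array[String]): Unit = {
    // "spark://master-host:7077" is a placeholder standalone master URL.
    // If the master process is down, SparkContext creation fails and no
    // application is registered, so there is nothing to recover.
    val conf = new SparkConf()
      .setMaster("spark://master-host:7077")
      .setAppName("submit-demo")
    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 10).sum())
    sc.stop()
  }
}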

2. The Spark master goes down while the application is running

This does not affect the running application: the work is executed by the executors on the workers, which return results directly to the driver without going through the master; the master is only needed when new resources have to be allocated.
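As a rough illustration (the master URL and names are placeholders), the job below keeps running even if the standalone master is killed after the executors have been granted, because the driver exchanges tasks and results with the executors directly.

import org.apache.spark.{SparkConf, SparkContext}

object MasterLossDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("spark://master-host:7077") // placeholder master URL
      .setAppName("master-loss-demo")
    val sc = new SparkContext(conf)

    // A deliberately slow job: each task sleeps for a second, leaving time
    // to kill the master while tasks are still running on the executors.
    val result = sc.parallelize(1 to 100, 10)
      .map { i => Thread.sleep(1000); i * 2 }
      .sum()

    println(s"sum = $result")
    sc.stop()
  }
}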

3. All Spark workers are down before the job is submitted

The warning below is logged; once the workers are started, the application recovers and runs normally.

17/01/04 19:31:13 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
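The warning means the application is registered with the master but no live worker can currently satisfy its resource request. A hedged sketch of the settings involved (the master URL, names, and values are illustrative, not from this post):

import org.apache.spark.{SparkConf, SparkContext}

object ResourceRequestDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("spark://master-host:7077") // placeholder master URL
      .setAppName("resource-request-demo")
      // Illustrative requests: until at least one registered worker can
      // offer this many cores and this much memory, tasks stay pending and
      // the "Initial job has not accepted any resources" warning repeats.
      .set("spark.cores.max", "2")
      .set("spark.executor.memory", "1g")
    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 10).count())
    sc.stop()
  }
}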

4. A Spark worker goes down while the application is running

The following error is logged:

17/01/04 19:41:50 ERROR TaskSchedulerImpl: Lost executor 0 on 192.168.91.128: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.

The driver removes the lost RPC client (the connection to the dead executor) and checks whether any output needed by the DAG has been lost, recomputing it from the lineage if so. For the "Lost executor 0" case above, the failed executor is removed from the BlockManagerMaster and its tasks are re-dispatched to executors on the surviving workers.
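A small sketch of the lineage mechanism this recovery relies on (the dataset, names, and master URL are made up): partitions cached on the lost executor disappear from the block manager, and the next action recomputes just those partitions from the RDD's lineage on the surviving executors instead of failing the job.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object LineageRecoveryDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("spark://master-host:7077") // placeholder master URL
      .setAppName("lineage-recovery-demo")
    val sc = new SparkContext(conf)

    // Lineage: parallelize -> map; cached blocks live in executor memory.
    val squares = sc.parallelize(1 to 1000, 8)
      .map(i => i.toLong * i)
      .persist(StorageLevel.MEMORY_ONLY)

    println(squares.sum()) // materializes the cache

    // If an executor is lost here, its cached partitions are dropped from
    // the BlockManagerMaster and the next action recomputes them from the
    // lineage on the remaining executors.
    println(squares.max())
    sc.stop()
  }
}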

5. All Spark workers go down while the application is running

The following error is logged:

17/01/04 19:34:16 ERROR TaskSchedulerImpl: Lost executor 1 on 192.168.91.128: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.

The application stalls, waiting for a worker to register.

After the workers are restarted, the executors re-register.

The driver removes the unavailable executors and registers the new ones:

CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 0

CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(null) (192.168.91.128:55126) with ID 1

After the workers come back up, the application recovers and returns its result normally, but the following error is still logged (this bug is fixed in Spark versions after 2.1.0):

org.apache.spark.SparkException: Could not find CoarseGrainedScheduler
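To watch this remove-and-re-register sequence from the driver, one option is to attach a SparkListener; a minimal sketch (the object name and master URL are placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded, SparkListenerExecutorRemoved}

object ExecutorWatchDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("spark://master-host:7077") // placeholder master URL
      .setAppName("executor-watch-demo")
    val sc = new SparkContext(conf)

    // Logs every executor loss and (re-)registration the driver sees,
    // which is exactly what happens when workers die and come back.
    sc.addSparkListener(new SparkListener {
      override def onExecutorRemoved(e: SparkListenerExecutorRemoved): Unit =
        println(s"executor removed: ${e.executorId}, reason: ${e.reason}")
      override def onExecutorAdded(e: SparkListenerExecutorAdded): Unit =
        println(s"executor added: ${e.executorId} on ${e.executorInfo.executorHost}")
    })

    // A slow job, so there is time to kill and restart workers while it runs.
    println(sc.parallelize(1 to 100, 10).map { i => Thread.sleep(500); i }.sum())
    sc.stop()
  }
}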

The next post will cover setting up a Spark source-code debugging environment.

