Copyright notice: this is an original article by the blogger and may not be reproduced without the blogger's permission. https://blog.csdn.net/mtj66/article/details/79094518
For cluster setup, see the previous article.
1. Export the environment variables that specify the Python path
export PYTHON_ROOT=/data/Python
export PYSPARK_PYTHON=${PYTHON_ROOT}/bin/python
export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=Python/bin/python"
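A quick sanity check before submitting can catch a wrong PYTHON_ROOT early. The following is only a minimal sketch: check_python is a hypothetical helper name, not part of Spark or TensorFlowOnSpark, and it only verifies that the local interpreter exists and is executable.

```shell
#!/usr/bin/env bash
# check_python: hypothetical helper (an assumption, not a Spark tool).
# Succeeds only if <root>/bin/python exists and is executable.
check_python() {
    local root="$1"
    if [ -x "${root}/bin/python" ]; then
        echo "ok: ${root}/bin/python"
    else
        echo "missing: ${root}/bin/python" >&2
        return 1
    fi
}

# Example call, assuming PYTHON_ROOT was exported as shown above:
#   check_python "${PYTHON_ROOT}"
```

This only checks the driver-side path; the executors use the relative path Python/bin/python inside the unpacked archive.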
2. Submit the job
Permission problems are common here; work through them one layer at a time.
hdfs dfs -chmod 777 /user/hdfs
hdfs dfs -ls /user/hdfs
hdfs dfs -mkdir /user/hdfs/mnist_model
chown -R hdfs:hdfs /data/TensorFlowOnSpark
Because the output directory is created by YARN, make sure the path has execute as well as read/write permissions.
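The permission requirement can be checked mechanically. The sketch below assumes the permission string is taken from the first column of `hdfs dfs -ls -d <path>`; is_world_rwx is a hypothetical helper name introduced only for illustration.

```shell
#!/usr/bin/env bash
# is_world_rwx: hypothetical helper (an assumption, not an HDFS command).
# Takes a permission string such as drwxrwxrwx and succeeds only if it
# describes a directory whose "others" bits are rwx.
is_world_rwx() {
    case "$1" in
        # d, then 6 characters for user and group, then rwx for others
        d??????rwx*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example call on a cluster node (requires the hdfs client):
#   is_world_rwx "$(hdfs dfs -ls -d /user/hdfs | awk '{print $1}')" \
#       || echo "others lack rwx on /user/hdfs"
```

Note that a mode of 766 (drwxrw-rw-) fails this check, because others lose the execute bit that is needed to traverse the directory.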
spark-submit --master yarn --deploy-mode cluster --num-executors 3 --executor-memory 2g \
--queue default \
--py-files TensorFlowOnSpark/tfspark.zip,TensorFlowOnSpark/examples/mnist/tf/mnist_dist.py \
--conf spark.dynamicAllocation.enabled=false --conf spark.yarn.maxAppAttempts=1 \
--archives hdfs:///user/${USER}/Python.zip#Python \
--conf spark.executorEnv.LD_LIBRARY_PATH="/data/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib64:/data/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib64/libhdfs.so:/usr/java/jdk1.8.0_131/jre/lib/amd64/server/" \
TensorFlowOnSpark/examples/mnist/tf/mnist_spark.py \
--images mnist/tfr/train \
--format tfr \
--mode train \
--model mnist_model
spark-submit --master yarn --deploy-mode cluster --queue default \
--num-executors 3 \
--executor-memory 3g \
--py-files /data/TensorFlowOnSpark/tfspark.zip,/data/TensorFlowOnSpark/examples/mnist/tf/mnist_dist.py \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.yarn.maxAppAttempts=1 \
--archives hdfs:///user/${USER}/Python.zip#Python \
--conf spark.executorEnv.LD_LIBRARY_PATH="/data/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib64:/usr/java/jdk1.8.0_131/jre/lib/amd64/server/" \
/data/TensorFlowOnSpark/examples/mnist/tf/mnist_spark.py --images mnist/tfr/test --mode inference \
--model mnist_model \
-o predictions2
Notes
/usr/java/jdk1.8.0_131/jre/lib/amd64/server/libjvm.so
/data/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib64/libhdfs.so
/data/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib/hadoop-hdfs/lib/native/libhdfs.so
The TensorFlowOnSpark framework dependencies, plus the mnist_dist.py needed to run MNIST in distributed mode:
--py-files /data/TensorFlowOnSpark/tfspark.zip,/data/TensorFlowOnSpark/examples/mnist/tf/mnist_dist.py \
The path of the precompiled Python. Here USER is hdfs; the variable is already set in the environment once you switch to the hdfs user.
--archives hdfs:///user/${USER}/Python.zip#Python
The native libraries required for HDFS file operations:
--conf spark.executorEnv.LD_LIBRARY_PATH="/data/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib64:/usr/java/jdk1.8.0_131/jre/lib/amd64/server/" \
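Before submitting, it may help to confirm that the directories passed in LD_LIBRARY_PATH really contain libhdfs.so and libjvm.so. This is a rough sketch only; find_lib is a hypothetical helper name, and it checks the local node, not every executor host.

```shell
#!/usr/bin/env bash
# find_lib: hypothetical helper (an assumption, not a Spark or Hadoop tool).
# Prints the first directory in a colon-separated list that contains the
# named file; returns 1 if no directory contains it.
find_lib() {
    local name="$1" list="$2" dir
    local IFS=':'
    for dir in $list; do
        # Strip a trailing slash so the printed directory is normalized
        if [ -e "${dir%/}/${name}" ]; then
            echo "${dir%/}"
            return 0
        fi
    done
    return 1
}

# Example call with the two directories used above:
#   LIBS="/data/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib64:/usr/java/jdk1.8.0_131/jre/lib/amd64/server/"
#   find_lib libhdfs.so "$LIBS" || echo "libhdfs.so not found"
#   find_lib libjvm.so  "$LIBS" || echo "libjvm.so not found"
```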
Platform error
NotFoundError: /data/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib/hadoop-hdfs/lib/native/libhdfs.so: cannot open shared object file: No such file or directory
ansible test_hadoop -m shell -a "mkdir -p /data/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib/hadoop-hdfs/lib/native/"
ansible test_hadoop -m shell -a "ln -s /data/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib64/libhdfs.so /data/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/lib/hadoop-hdfs/lib/native/libhdfs.so"
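The same fix can be sketched for a single node without ansible. link_libhdfs is a hypothetical helper name; unlike a plain `ln -s`, this version uses `ln -sfn` so that running it a second time does not fail when the link already exists.

```shell
#!/usr/bin/env bash
# link_libhdfs: hypothetical helper (an assumption, not a CDH tool).
# Creates <dest_dir>/libhdfs.so as a symlink to <src>, creating the
# directory first and replacing any existing link.
link_libhdfs() {
    local src="$1" dest_dir="$2"
    mkdir -p "$dest_dir" || return 1
    # -s symbolic, -f replace an existing entry, -n do not follow an existing link
    ln -sfn "$src" "${dest_dir}/libhdfs.so"
}

# Example call with the paths from the error message above:
#   CDH=/data/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29
#   link_libhdfs "$CDH/lib64/libhdfs.so" "$CDH/lib/hadoop-hdfs/lib/native"
```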