--master yarn)时,可以通过http://hadoop000:8088页面监控到整个job的执行过程;
注:如果在$SPARK_HOME/conf/spark-defaults.conf中配置了spark.master spark://hadoop000:7077,那么在启动spark-sql时不指定master也是运行在standalone集群之上。
spark-sql使用
启动spark-sql: 由于我已经在spark-defaults.conf中配置了spark.master spark://hadoop000:7077,就没在spark-sql启动时指定master了
cd $SPARK_HOME/bin
spark-sql
SELECT track_time, url, session_id, referer, ip, end_user_id, city_id FROM page_views WHERE city_id = -1000 limit 10;
SELECT session_id, count(*) c FROM page_views group by session_id order by c desc limit 10;
上面两个sql语句用到的表现在存在hive中了,如果没有则手工创建下,创建脚本以及导入数据脚本如下:
create table page_views(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
load data local inpath '/home/spark/software/data/page_views.dat' overwrite into table page_views;