Project01---Building Connected Components from Edges with GraphX (connectedComponents)


2019-05-11 00:44:01

Tags: Project01---GraphX, connected components, connectedComponents

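Goal: build connected components from table a (fromID, toID) and output rows of the form (fromID, toID, minID), where minID is the smallest vertex ID in the connected component the edge belongs to. The result is then loaded into a Hive table for statistics.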
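[Figure: example graph with two connected components, {1, 2, 3, 4} and {5, 6, 7}]
As the figure above shows, minID is the smallest vertex ID in each connected component. The final table covering all connected components is:
(fromID, toID, minID): (1,2,1) (3,1,1) (1,4,1) (2,4,1) (5,6,5) (7,5,5)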
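
To make the target output concrete, here is a minimal plain-Scala sketch that reproduces the table above without Spark. It uses a union-find structure as a stand-in; this is not how GraphX computes connected components, and the names `MinIdSketch` and `tagEdges` are made up for illustration.

```scala
import scala.collection.mutable

object MinIdSketch {
  // Tag every edge with the smallest vertex ID of its connected component.
  // Illustrative union-find stand-in, not the GraphX implementation.
  def tagEdges(edges: Seq[(Long, Long)]): Seq[(Long, Long, Long)] = {
    // parent(v) always points to a smaller ID in the same component.
    val parent = mutable.Map.empty[Long, Long]

    // Follow parent pointers up to the representative of v's component.
    def find(v: Long): Long = {
      var root = v
      while (parent.getOrElse(root, root) != root) root = parent(root)
      root
    }

    // Merge two components, keeping the smaller ID as the representative.
    def union(a: Long, b: Long): Unit = {
      val (ra, rb) = (find(a), find(b))
      if (ra != rb) parent(math.max(ra, rb)) = math.min(ra, rb)
    }

    edges.foreach { case (from, to) => union(from, to) }
    edges.map { case (from, to) => (from, to, find(from)) }
  }

  def main(args: Array[String]): Unit = {
    val edges = Seq((1L, 2L), (3L, 1L), (1L, 4L), (2L, 4L), (5L, 6L), (7L, 5L))
    tagEdges(edges).foreach(println) // prints (1,2,1) ... (7,5,5), one row per edge
  }
}
```

Because the representative of a merged component is always the smaller of the two roots, the root of each component ends up being its minimum ID, which is the same labeling the post expects from `connectedComponents`.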

Export the required data from the Hive table

insert overwrite directory '/output' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' select fromID,toID from tablename;

	Note: export the data with "\t" as the field separator, because the Scala code below splits each line on "\t".
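
The separator matters because the Scala job later in this post calls `split("\t")` on each line and converts the IDs with `toLong`. As a rough sketch (the `EdgeLineSketch` object and `parseEdge` helper are hypothetical and not part of the original job), a defensive parser could reject lines that are not two tab-separated Long values:

```scala
import scala.util.Try

object EdgeLineSketch {
  // Parse one exported line into (fromID, toID).
  // Returns None unless the line holds exactly two tab-separated Long values.
  def parseEdge(line: String): Option[(Long, Long)] =
    line.split("\t") match {
      case Array(from, to) => Try((from.trim.toLong, to.trim.toLong)).toOption
      case _               => None
    }

  def main(args: Array[String]): Unit = {
    // "6 3" uses a space instead of a tab; "x\t7" has a non-numeric ID.
    val lines = Seq("4\t1", "1\t2", "6 3", "x\t7")
    lines.foreach(l => println(parseEdge(l)))
  }
}
```

In a Spark job, a helper like this could be applied with `flatMap` so that malformed lines are dropped instead of failing the task.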

Write the Spark GraphX code
Sample data:
(fromID, toID)
4 1
1 2
6 3
7 3
7 6
6 7
3 7
Scala code:

package com.arua.relstionship
import org.apache.spark.graphx.{Graph, GraphLoader, VertexId, VertexRDD}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object GreateGraph {
  def main(args: Array[String]): Unit = {
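    // Entry point: initialize the SparkContext (the master URL is passed as args(0))
    val conf = new SparkConf().setMaster(args(0)).setAppName("GreateGraph")
    val sc = new SparkContext(conf)
    // Build the graph from the edge list exported from the table
    val graph: Graph[Int, Int] = GraphLoader.edgeListFile(sc,"/Jerry/a.txt")
    // Compute connected components; each vertex is labeled with the smallest vertex ID in its component
    val vertices: VertexRDD[VertexId] = graph.connectedComponents().vertices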
    /*
    * vertices:
    * (4,1)
    * (6,3)
    * (2,1)
    * (1,1)
    * (3,3)
    * (7,3)
    * */
    // Re-read the edge file as (fromID, toID) pairs keyed by fromID, so it can be joined with vertices
    val edgesID: RDD[(VertexId, String)] = sc.textFile("/Jerry/a.txt").map(line => {
      val ff: Array[String] = line.split("\t")
      (ff(0).toLong, ff(1)) // key: fromID, value: toID
    })
    /*
    * edgesID:
    * (4,1)
    * (1,2)
    * (6,3)
    * (7,3)
    * (7,6)
    * (6,7)
    * (3,7)
    * */
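    // Join the two RDDs on fromID and write one line per edge,
    // tab-separated to match the Hive table definition used later
    edgesID.join(vertices).map{
      case (fromID,(toID,minID))=>(fromID,toID,minID)
    }.map(t=>{
      t._1+"\t"+t._2+"\t"+t._3
    }).saveAsTextFile(args(1))
    /* Result rows as (fromID, toID, minID); written tab-separated, row order may vary:
    * (4,1,1)
    * (1,2,1)
    * (6,3,3)
    * (7,3,3)
    * (7,6,3)
    * (6,7,3)
    * (3,7,3)
    * */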
    sc.stop()
  }
}
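
The join step can be tried out without a cluster. The following sketch mimics it with plain Scala collections, assuming the sample edges and the vertex labels listed in the `vertices` comment above; `JoinSketch` and `joinEdges` are illustrative names, and a `Map` lookup stands in for the RDD join.

```scala
object JoinSketch {
  // Attach to each (fromID, toID) edge the component ID of its fromID vertex.
  // Like an inner join, edges whose fromID has no label are dropped.
  def joinEdges(edges: Seq[(Long, Long)],
                component: Map[Long, Long]): Seq[(Long, Long, Long)] =
    edges.flatMap { case (from, to) =>
      component.get(from).map(minId => (from, to, minId))
    }

  def main(args: Array[String]): Unit = {
    // Sample edges from this post.
    val edges = Seq((4L, 1L), (1L, 2L), (6L, 3L), (7L, 3L), (7L, 6L), (6L, 7L), (3L, 7L))
    // Vertex -> smallest ID in its component, copied from the vertices comment.
    val component = Map(4L -> 1L, 6L -> 3L, 2L -> 1L, 1L -> 1L, 3L -> 3L, 7L -> 3L)
    joinEdges(edges, component).foreach(println)
  }
}
```

Joining on fromID alone is enough here because both endpoints of an edge always belong to the same component, so they carry the same label.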

	Note: VertexId is a Long, so the IDs in the table must be of type Long or of a type that can be converted to Long.

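Build the jar, upload it to the cluster, and submit the job

Note: in the Maven Projects panel, (1) run clean to remove the old jar, then (2) run package to build a new one.

Note: (1) find the target directory in the Project panel; (2) right-click the jar and choose Show in Explorer to find where it is stored on disk, then upload it to the cluster.
Submit the job on Linux (alternatively, put the command in a shell script and run it from the script's directory with "./script-name"):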

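/home/hadoop/apps/spark-2.3.1-bin-hadoop2.7/bin/spark-submit \
--master spark://hadoop02:7077,hadoop03:7077 \
--class com.arua.relstionship.GreateGraph \
--executor-memory 512M \
--total-executor-cores 1 \
/home/hadoop/Jerry/hello-1.0-SNAPSHOT.jar \
spark://hadoop02:7077,hadoop03:7077 \
/output1234
The parts of the command, in order:
spark-submit: full path of the spark-submit launcher on the Linux machine
--master: the master node(s)
--class: fully qualified name of the main class
--executor-memory: memory for each executor process (adjust as needed)
--total-executor-cores: total number of cores used by the executors (adjust as needed)
/home/hadoop/Jerry/hello-1.0-SNAPSHOT.jar: location of the jar
spark://hadoop02:7077,hadoop03:7077: the master URL, passed to the program as args(0)
/output1234: output path for the results, passed to the program as args(1)
Note: after integrating Spark with YARN, change the scheduling option to --master yarn; the master argument passed to the program changes to yarn as well.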
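
The last two values on the command line are plain program arguments: the job reads the master URL from `args(0)` and the output path from `args(1)`. As a small sketch of how these could be checked before the SparkContext is created (`SubmitArgsSketch`, `JobArgs`, and `parse` are hypothetical names, not part of the original job):

```scala
object SubmitArgsSketch {
  // Hypothetical holder for the two positional arguments the job expects.
  final case class JobArgs(master: String, outputPath: String)

  // Accept exactly two arguments; otherwise return a usage message.
  def parse(args: Array[String]): Either[String, JobArgs] = args match {
    case Array(master, outputPath) => Right(JobArgs(master, outputPath))
    case _                         => Left("usage: GreateGraph <master> <outputPath>")
  }

  def main(args: Array[String]): Unit = {
    println(parse(Array("spark://hadoop02:7077,hadoop03:7077", "/output1234")))
    println(parse(Array("yarn"))) // missing output path, so a usage message is returned
  }
}
```

Failing early with a usage message is easier to diagnose than an `ArrayIndexOutOfBoundsException` raised from `args(1)` inside the job.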
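Load the generated result files (on the cluster) into a Hive table
1. Create the table (use bigint instead of int if the IDs can exceed the int range, since VertexId is a Long)
create external table graphx_relationship(fromid int,toid int,vertexid int) row format delimited fields terminated by "\t";
2. Load the data into the Hive table
load data inpath "path of the result files on the cluster" into table graphx_relationship;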
3. Hive queries
  1. Count the relationship pairs (distinct (fromid, toid) combinations):
```
	select count(*) from (select count(*) from graphx_relationship group by fromid,toid) a;
```
  2. Count the vertices (distinct IDs that appear as fromid or toid):
```
	select count(*) from (select * from (select fromid from graphx_relationship union select toid from graphx_relationship ) a) b;
```
  3. Count the connected components that were built:
```
	select count(*) from (select vertexid,count(vertexid) num from graphx_relationship group by vertexid) a;
```
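
To show what the three queries compute, the sketch below evaluates the same counts with plain Scala collections on the six example rows from the start of this post. `StatsSketch` and its helper names are illustrative only, and the sketch assumes the rows have already been read into memory as (fromid, toid, vertexid) tuples.

```scala
object StatsSketch {
  type Row = (Long, Long, Long) // (fromid, toid, vertexid)

  // Query 1: number of distinct (fromid, toid) pairs.
  def pairCount(rows: Seq[Row]): Int =
    rows.map { case (from, to, _) => (from, to) }.distinct.size

  // Query 2: number of distinct IDs appearing as fromid or toid.
  def vertexCount(rows: Seq[Row]): Int =
    rows.flatMap { case (from, to, _) => Seq(from, to) }.distinct.size

  // Query 3: number of connected components, i.e. distinct vertexid values.
  def componentCount(rows: Seq[Row]): Int =
    rows.map(_._3).distinct.size

  def main(args: Array[String]): Unit = {
    val rows: Seq[Row] =
      Seq((1L, 2L, 1L), (3L, 1L, 1L), (1L, 4L, 1L), (2L, 4L, 1L), (5L, 6L, 5L), (7L, 5L, 5L))
    println(s"pairs=${pairCount(rows)} vertices=${vertexCount(rows)} components=${componentCount(rows)}")
    // prints: pairs=6 vertices=7 components=2
  }
}
```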

