Using C++ with Hadoop

How the Hadoop framework works:

The flow is: the input is converted into the context format used by the mapper; after the mapper processes it, it is converted into the context format used by the reducer; after the reducer processes it, the output is produced.

C++ library and header files:

Hadoop ships with a C++ API library and headers. After installing Hadoop, the libraries are under hadoop/hadoop-2.8.0/lib/native and the headers under hadoop/hadoop-2.8.0/include. Copy them to the system directories /usr/lib64 and /usr/include, copy them straight into your own project, or point a Makefile at them.
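For example, one way to make them visible system-wide (the Hadoop installation path below is the one used in this article; adjust it to your own setup):

cp /root/hadoop/hadoop-2.8.0/lib/native/* /usr/lib64/
cp /root/hadoop/hadoop-2.8.0/include/* /usr/include/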

C++ programming model:

You write a mapper class that publicly inherits from HadoopPipes::Mapper and a reducer class that publicly inherits from HadoopPipes::Reducer; the skeleton looks like this:

class localmapper : public HadoopPipes::Mapper
{
public:
    localmapper(HadoopPipes::TaskContext & context){}
    void map(HadoopPipes::MapContext & context){}
};

class localreducer : public HadoopPipes::Reducer
{
public:
    localreducer(HadoopPipes::TaskContext & context){}
    void reduce(HadoopPipes::ReduceContext & context){}
};

The map method of localmapper decides which key/value pairs are emitted into the shuffle, and the reduce method of localreducer decides what is finally written to the output. A worked example:

1. Write the input, here tmp.txt:

[root@master helloworld]# cat tmp.txt 
a:067
b:066
a:100
b:089
b:099

2. Write the map and reduce methods:

#include <limits.h>
#include <stdint.h>
#include <string.h>
#include <string>     /* std::string */
#include <algorithm>  /* std::max */

/* Hadoop Pipes headers */
#include <Pipes.hh>
#include <TemplateFactory.hh>
#include <StringUtils.hh>

using namespace std;

/* Hadoop's Mapper, Reducer, and the context each of them uses */
using HadoopPipes::TaskContext;
using HadoopPipes::Mapper;
using HadoopPipes::MapContext;
using HadoopPipes::Reducer;
using HadoopPipes::ReduceContext;

/* Two helpers from the Hadoop utility functions */
using HadoopUtils::toInt;
using HadoopUtils::toString;

/* Hadoop run entry point */
using HadoopPipes::TemplateFactory;
using HadoopPipes::runTask;

/* Publicly inherits Hadoop's Mapper */
class LocalMapper : public Mapper
{
public:
	LocalMapper(TaskContext & context){}
	/* map function, uses MapContext */
	void map(MapContext & context)
	{
		/* Read one input line from the text */
		string line  = context.getInputValue();
		string key	 = line.substr(0, 1);
		string value = line.substr(2, 3);
		/* Filter before emitting into the shuffle: here, drop records whose value is 100 */
		if (value != "100")
		{
			context.emit(key, value);
		}
	}
};

/* Publicly inherits Reducer */
class LocalReducer : public Reducer
{
public:
	LocalReducer(TaskContext & context){}
	/* reduce function, uses ReduceContext */
	void reduce(ReduceContext & context)
	{
		int max_value = 0;
		/* Iterate over all values of one key and emit the chosen result, here the maximum */
		while (context.nextValue())
		{
			max_value = max(max_value, toInt(context.getInputValue()));
		}
		context.emit(context.getInputKey(), toString(max_value));
	}
};

int main()
{
	return runTask(TemplateFactory<LocalMapper, LocalReducer>());
}
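Note that the substr offsets in the map function above assume every input line has exactly the form x:nnn (a one-character key, a colon, a three-digit value). A slightly more tolerant variant of the same map function is sketched below; it splits each line on the colon using HadoopUtils::splitString from StringUtils.hh (the helper used by the stock Pipes wordcount example) and additionally needs #include <vector>:

	/* Split "key:value" on the colon instead of relying on fixed offsets */
	void map(MapContext & context)
	{
		std::vector<std::string> parts = HadoopUtils::splitString(context.getInputValue(), ":");
		if (parts.size() != 2)
		{
			return; /* skip malformed lines */
		}
		if (parts[1] != "100")
		{
			context.emit(parts[0], parts[1]);
		}
	}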

3. Compile with the following command:

g++ helloworld.cpp -lcrypto -lssl -L/root/hadoop/hadoop-2.8.0/lib/native -lhadooppipes -lhadooputils -lpthread

Remember to link the pthread library (the -lpthread flag above), because Hadoop Pipes runs threads internally. Compilation produces a.out.
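If the headers were not copied into /usr/include, the compiler also has to be told where the Hadoop include directory is; the same command with an extra -I flag (the path matching the installation used here) would be:

g++ helloworld.cpp -I/root/hadoop/hadoop-2.8.0/include -lcrypto -lssl -L/root/hadoop/hadoop-2.8.0/lib/native -lhadooppipes -lhadooputils -lpthread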

4. Upload a.out and tmp.txt to HDFS:

hdfs dfs -mkdir /helloworld
hdfs dfs -put tmp.txt /helloworld
hdfs dfs -put a.out /helloworld

The first line creates a directory named helloworld in HDFS; the next two lines put the executable and the input file into that directory.
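You can confirm that both files are in place by listing the directory:

hdfs dfs -ls /helloworld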

5. Start the job with the following script:

[root@master helloworld]# cat start.sh 
hadoop pipes -Dhadoop.pipes.java.recordreader=true -D hadoop.pipes.java.recordwriter=true -input /helloworld/tmp.txt -output output -program /helloworld/a.out
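One note before running: as explained below, the output directory must not already exist, so when re-running the job, remove the previous output first:

hdfs dfs -rm -r output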

The -input parameter gives the path of the input file; -output gives the location in HDFS where the output will be written (it is created fresh, so this directory must not already exist); -program gives the path of the executable. Output like the following indicates success:

[root@master helloworld]# ./start.sh 

18/09/25 15:54:21 INFO client.RMProxy: Connecting to ResourceManager at master/10.1.108.64:8032
18/09/25 15:54:21 INFO client.RMProxy: Connecting to ResourceManager at master/10.1.108.64:8032
18/09/25 15:54:22 INFO mapred.FileInputFormat: Total input files to process : 1
18/09/25 15:54:22 INFO mapreduce.JobSubmitter: number of splits:2
18/09/25 15:54:23 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1537857363076_0004
18/09/25 15:54:24 INFO impl.YarnClientImpl: Submitted application application_1537857363076_0004
18/09/25 15:54:24 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1537857363076_0004/
18/09/25 15:54:24 INFO mapreduce.Job: Running job: job_1537857363076_0004
18/09/25 15:54:38 INFO mapreduce.Job: Job job_1537857363076_0004 running in uber mode : false
18/09/25 15:54:38 INFO mapreduce.Job:  map 0% reduce 0%
18/09/25 15:54:59 INFO mapreduce.Job:  map 100% reduce 0%
18/09/25 15:55:14 INFO mapreduce.Job:  map 100% reduce 100%
18/09/25 15:55:14 INFO mapreduce.Job: Job job_1537857363076_0004 completed successfully
18/09/25 15:55:14 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=38
		FILE: Number of bytes written=414120
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=223
		HDFS: Number of bytes written=10
		HDFS: Number of read operations=9
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters 
		Launched map tasks=2
		Launched reduce tasks=1
		Data-local map tasks=2
		Total time spent by all maps in occupied slots (ms)=34605
		Total time spent by all reduces in occupied slots (ms)=11779
		Total time spent by all map tasks (ms)=34605
		Total time spent by all reduce tasks (ms)=11779
		Total vcore-milliseconds taken by all map tasks=34605
		Total vcore-milliseconds taken by all reduce tasks=11779
		Total megabyte-milliseconds taken by all map tasks=35435520
		Total megabyte-milliseconds taken by all reduce tasks=12061696
	Map-Reduce Framework
		Map input records=5
		Map output records=4
		Map output bytes=24
		Map output materialized bytes=44
		Input split bytes=178
		Combine input records=0
		Combine output records=0
		Reduce input groups=2
		Reduce shuffle bytes=44
		Reduce input records=4
		Reduce output records=2
		Spilled Records=8
		Shuffled Maps =2
		Failed Shuffles=0
		Merged Map outputs=2
		GC time elapsed (ms)=453
		CPU time spent (ms)=3630
		Physical memory (bytes) snapshot=548720640
		Virtual memory (bytes) snapshot=6198038528
		Total committed heap usage (bytes)=378470400
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=45
	File Output Format Counters 
		Bytes Written=10
18/09/25 15:55:14 INFO util.ExitUtil: Exiting with status 0

This line shows that the job has one input file:

18/09/25 15:54:22 INFO mapred.FileInputFormat: Total input files to process : 1

This line shows that the input was divided into two splits, so two map tasks are launched (this cluster has two datanodes):

18/09/25 15:54:22 INFO mapreduce.JobSubmitter: number of splits:2

The following three lines show the framework's processing flow: map runs first, and only after all map tasks finish does reduce run:

18/09/25 15:54:38 INFO mapreduce.Job:  map 0% reduce 0%
18/09/25 15:54:59 INFO mapreduce.Job:  map 100% reduce 0%
18/09/25 15:55:14 INFO mapreduce.Job:  map 100% reduce 100%

6. View the result:

[root@master helloworld]# hdfs dfs -cat output/*
a	67
b	99

Here the record with value 100 under key a was filtered out, so the largest remaining value for a is 67; the largest value for b is 99.

7. Where job logs are stored

hadoop-2.8.0/logs/userlogs

The logs of every job are stored here, for example:

[root@master userlogs]# ls
application_1537857363076_0002  application_1537857363076_0003  application_1537857363076_0004

Inside each job, every container directory contains stderr, stdout, and syslog files; the useful content is usually in syslog.
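For example, to read the syslog of one container (the application and container IDs below come from the listings in the next step; use whichever container directory exists on that node):

cat hadoop-2.8.0/logs/userlogs/application_1537857363076_0003/container_1537857363076_0003_01_000001/syslog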

8. Check how many tasks each datanode handled:

To see the effect, run a.out twice. There are two datanodes here; the logs on the first datanode look like this:

[root@master application_1537857363076_0003]# ls
container_1537857363076_0003_01_000001  container_1537857363076_0003_01_000004

The logs on the second datanode:

[root@slave1 application_1537857363076_0003]# ls
container_1537857363076_0003_01_000002  container_1537857363076_0003_01_000003

Note, however, that the work is not split evenly between the two datanodes. Run the job again and check the logs:

First datanode:

[root@master application_1537857363076_0004]# ls
container_1537857363076_0004_01_000004

Second datanode:

[root@slave1 application_1537857363076_0004]# ls
container_1537857363076_0004_01_000001  container_1537857363076_0004_01_000002  container_1537857363076_0004_01_000003

As you can see, master handled only one container while slave1 handled three.
