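Installing a Hadoop cluster typically involves unpacking the software on all the machines in the cluster, or installing it through the packaging system appropriate for your operating system. It is important to divide the hardware up by function.
Typically one machine in the cluster is designated as the NameNode and another machine as the ResourceManager. These are the masters. Other services (such as the Web App Proxy Server or the MapReduce Job History Server) can run either on dedicated hardware or on shared infrastructure, depending on the load.
The rest of the machines in the cluster act as both DataNode and NodeManager. These are the slaves.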

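Configuring Hadoop in Non-Secure Mode

Hadoop's Java configuration is driven by two types of important configuration files:

Read-only default configuration: core-default.xml, hdfs-default.xml, yarn-default.xml and mapred-default.xml
Site-specific configuration: etc/hadoop/core-site.xml, etc/hadoop/hdfs-site.xml, etc/hadoop/yarn-site.xml and etc/hadoop/mapred-site.xml
In addition, you can control the scripts in the bin directory by changing the values in etc/hadoop/hadoop-env.sh and etc/hadoop/yarn-env.sh (the previous post mentioned a case where an environment variable could not be found).
To configure a Hadoop cluster, you need to configure both the environment in which the Hadoop daemons execute and the runtime parameters of the daemons.
The HDFS daemons are NameNode, SecondaryNameNode and DataNode. The YARN daemons are ResourceManager, NodeManager and WebAppProxy. If MapReduce is to be used, the MapReduce Job History Server will also be running. For large installations, these generally run on separate hosts.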

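Configuring the Environment of Hadoop Daemons

Administrators should use etc/hadoop/hadoop-env.sh, and optionally etc/hadoop/mapred-env.sh and etc/hadoop/yarn-env.sh, to customize the process environment of the Hadoop daemons.
At the very least, you must specify JAVA_HOME so that it is correctly defined on each remote node (I missed this the first time I set up a cluster and, sure enough, ran into problems).
Administrators can configure individual daemons using the configuration options shown in the table below: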

Daemon	Environment Variable
NameNode	HADOOP_NAMENODE_OPTS
DataNode	HADOOP_DATANODE_OPTS
Secondary NameNode	HADOOP_SECONDARYNAMENODE_OPTS
ResourceManager	YARN_RESOURCEMANAGER_OPTS
NodeManager	YARN_NODEMANAGER_OPTS
WebAppProxy	YARN_PROXYSERVER_OPTS
Map Reduce Job History Server	HADOOP_JOB_HISTORYSERVER_OPTS

For example, to configure the NameNode to use parallelGC, add the following line to hadoop-env.sh:

export HADOOP_NAMENODE_OPTS="-XX:+UseParallelGC"

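(Clearly I do not understand what the setting above is for.)
See etc/hadoop/hadoop-env.sh for other examples (JAVA_HOME is the only one I understand).
Other useful parameters include:

HADOOP_PID_DIR - The directory where the daemons' process ID files are stored.
HADOOP_LOG_DIR - The directory where the daemons' log files are stored. Log files are created automatically if they do not exist.
HADOOP_HEAPSIZE / YARN_HEAPSIZE - The maximum heap size to use, in MB. If the variable is set to 1000, the heap is set to 1000MB. This is used to configure the heap size of the daemons. The default value is 1000. To configure the heap size separately for each daemon, use the per-daemon variables listed in the table further below.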

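In most cases you should specify HADOOP_PID_DIR and HADOOP_LOG_DIR such that they can only be written to by the users that are going to run the Hadoop daemons. Otherwise there is the potential for a symlink attack.
It is also traditional to configure HADOOP_PREFIX in the system-wide shell environment. For example, a simple script inside /etc/profile.d: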
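The settings discussed so far can be pulled together in one place. The following is only a sketch under stated assumptions: the JDK path and the base directory are placeholders, not Hadoop defaults, and HADOOP_RUN_BASE is a name made up for this example. The export lines are the kind of thing that would go into etc/hadoop/hadoop-env.sh, while the mkdir/chmod step would be run once by the user that starts the daemons.

```shell
#!/bin/sh
# Sketch of settings that could go in etc/hadoop/hadoop-env.sh.
# All paths below are placeholders (assumptions), not Hadoop defaults.

# Hypothetical JDK location; point this at the JDK installed on every node.
export JAVA_HOME="${JAVA_HOME:-/usr/lib/jvm/default-java}"

# Append the JVM's parallel garbage collector flag to any NameNode options already set.
export HADOOP_NAMENODE_OPTS="-XX:+UseParallelGC ${HADOOP_NAMENODE_OPTS:-}"

# Maximum daemon heap size in MB (1000 is the default quoted above).
export HADOOP_HEAPSIZE=1000

# HADOOP_RUN_BASE is a hypothetical helper variable; a temp dir keeps the sketch runnable anywhere.
HADOOP_RUN_BASE="${HADOOP_RUN_BASE:-$(mktemp -d)}"
export HADOOP_PID_DIR="$HADOOP_RUN_BASE/pids"
export HADOOP_LOG_DIR="$HADOOP_RUN_BASE/logs"

# Create both directories and make them writable only by the current user (mode 0700).
mkdir -p "$HADOOP_PID_DIR" "$HADOOP_LOG_DIR"
chmod 700 "$HADOOP_PID_DIR" "$HADOOP_LOG_DIR"
```

Mode 0700 is one simple way to meet the "writable only by the daemon user" requirement; a real deployment might prefer a dedicated group and a different mode.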

HADOOP_PREFIX=/path/to/hadoop
export HADOOP_PREFIX

Daemon	Environment Variable
ResourceManager	YARN_RESOURCEMANAGER_HEAPSIZE
NodeManager	YARN_NODEMANAGER_HEAPSIZE
WebAppProxy	YARN_PROXYSERVER_HEAPSIZE
Map Reduce Job History Server	HADOOP_JOB_HISTORYSERVER_HEAPSIZE

Configuring the Hadoop Daemons

This section covers the important parameters in the following configuration files:

etc/hadoop/core-site.xml

Parameter	Value	Notes
fs.defaultFS	NameNode URI	hdfs://host:port
io.file.buffer.size	131072	Size of the read/write buffer used in SequenceFiles.
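As a rough illustration of how the two parameters above end up in a file, the sketch below writes a minimal core-site.xml into a scratch directory. The NameNode host name and port are hypothetical, and CONF_DIR is a variable introduced only for this example.

```shell
#!/bin/sh
# Writes a minimal core-site.xml into a scratch directory (not a real Hadoop install).
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"
NAMENODE_URI="hdfs://namenode.example.com:9000"   # hypothetical host and port

cat > "$CONF_DIR/core-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>$NAMENODE_URI</value>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
  </property>
</configuration>
EOF

echo "wrote $CONF_DIR/core-site.xml"
```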

etc/hadoop/hdfs-site.xml

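Configuration for the NameNode

Parameter	Value	Notes
dfs.namenode.name.dir	Path on the local filesystem where the NameNode stores the namespace and transaction logs persistently.	If this is a comma-delimited list of directories, the name table is replicated in all of the directories for redundancy.
dfs.hosts / dfs.hosts.exclude	List of permitted/excluded DataNodes.	If necessary, use these files to control the list of allowable DataNodes.
dfs.blocksize	268435456	HDFS block size of 256MB for large file systems.
dfs.namenode.handler.count	100	More NameNode server threads to handle RPCs from a large number of DataNodes.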
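The block size in the table is given in bytes, which is easy to mistype. The sketch below derives it from 256MB and writes the NameNode settings into a scratch hdfs-site.xml. The `property` helper and the two name directories are made up for illustration.

```shell
#!/bin/sh
# Emits one Hadoop <property> element; the helper name is illustrative only.
property() {  # property <name> <value>
  printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$1" "$2"
}

# 256MB expressed in bytes, matching the dfs.blocksize value in the table.
BLOCK_SIZE=$((256 * 1024 * 1024))

# Hypothetical pair of local directories; listing two gives a redundant copy of the name table.
NAME_DIRS="/data/1/dfs/nn,/data/2/dfs/nn"

HDFS_SITE="$(mktemp -d)/hdfs-site.xml"
{
  echo '<configuration>'
  property dfs.namenode.name.dir "$NAME_DIRS"
  property dfs.blocksize "$BLOCK_SIZE"
  property dfs.namenode.handler.count 100
  echo '</configuration>'
} > "$HDFS_SITE"

cat "$HDFS_SITE"
```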

etc/hadoop/yarn-site.xml

Configuration for the ResourceManager

Parameter	Value	Notes
yarn.resourcemanager.address	ResourceManager host:port for clients to submit jobs.	host:port If set, overrides the hostname set in yarn.resourcemanager.hostname.
yarn.resourcemanager.scheduler.address	ResourceManager host:port for ApplicationMasters to talk to the Scheduler to obtain resources.	Same as above.
yarn.resourcemanager.resource-tracker.address	ResourceManager host:port for NodeManagers.	Same as above.
yarn.resourcemanager.admin.address	ResourceManager host:port for administrative commands.	Same as above.
yarn.resourcemanager.webapp.address	ResourceManager web-ui host:port.	Same as above.
yarn.resourcemanager.hostname	ResourceManager host. (I did not follow this one.)	host Single hostname that can be set in place of setting all yarn.resourcemanager*address resources. Results in default ports for ResourceManager components.
yarn.resourcemanager.scheduler.class	ResourceManager Scheduler class.	CapacityScheduler (recommended), FairScheduler (also recommended), or FifoScheduler. Use a fully qualified class name, e.g., org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.
yarn.scheduler.minimum-allocation-mb	Minimum limit of memory to allocate to each container request at the ResourceManager.	In MBs
yarn.scheduler.maximum-allocation-mb	Maximum limit of memory to allocate to each container request at the ResourceManager.	In MBs
yarn.resourcemanager.nodes.include-path / yarn.resourcemanager.nodes.exclude-path	List of permitted/excluded NodeManagers.	If necessary, use these files to control the list of allowable NodeManagers.
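The yarn.resourcemanager.hostname row may be easier to follow with concrete values: setting only the hostname leaves every *address property on its default port. The sketch below prints the resulting addresses for a hypothetical host. The port numbers are the ones I recall from yarn-default.xml in Hadoop 2.x, so treat them as an assumption and check them against your release.

```shell
#!/bin/sh
# Prints the ResourceManager addresses implied by setting only the hostname.
# Ports are the Hadoop 2.x defaults as recalled from yarn-default.xml (assumption).
rm_default_addresses() {  # rm_default_addresses <hostname>
  for entry in address:8032 scheduler.address:8030 \
               resource-tracker.address:8031 admin.address:8033 \
               webapp.address:8088; do
    # ${entry%%:*} is the property suffix, ${entry##*:} is the port.
    echo "yarn.resourcemanager.${entry%%:*} = $1:${entry##*:}"
  done
}

rm_default_addresses rm.example.com   # hypothetical ResourceManager host
```

Setting any of the individual *address properties explicitly takes precedence over the hostname for that one component, which is what the "overrides" note in the first row refers to.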

Configuration for the NodeManager (not read closely; I will come back to it when I need it)

Parameter	Value	Notes
yarn.nodemanager.resource.memory-mb	Resource i.e. available physical memory, in MB, for given NodeManager	Defines total available resources on the NodeManager to be made available to running containers
yarn.nodemanager.vmem-pmem-ratio	Maximum ratio by which virtual memory usage of tasks may exceed physical memory	The virtual memory usage of each task may exceed its physical memory limit by this ratio. The total amount of virtual memory used by tasks on the NodeManager may exceed its physical memory usage by this ratio.
yarn.nodemanager.local-dirs	Comma-separated list of paths on the local filesystem where intermediate data is written.	Multiple paths help spread disk i/o.
yarn.nodemanager.log-dirs	Comma-separated list of paths on the local filesystem where logs are written.	Multiple paths help spread disk i/o.
yarn.nodemanager.log.retain-seconds	10800	Default time (in seconds) to retain log files on the NodeManager. Only applicable if log-aggregation is disabled.
yarn.nodemanager.remote-app-log-dir	/logs	HDFS directory where the application logs are moved on application completion. Need to set appropriate permissions. Only applicable if log-aggregation is enabled.
yarn.nodemanager.remote-app-log-dir-suffix	logs	Suffix appended to the remote log dir. Logs will be aggregated to ${yarn.nodemanager.remote-app-log-dir}/${user}/${thisParam}. Only applicable if log-aggregation is enabled.
yarn.nodemanager.aux-services	mapreduce_shuffle	Shuffle service that needs to be set for Map Reduce applications.

Configurations for History Server (Needs to be moved elsewhere)

Parameter	Value	Notes
yarn.log-aggregation.retain-seconds	-1	How long to keep aggregation logs before deleting them. -1 disables. Be careful, set this too small and you will spam the name node.
yarn.log-aggregation.retain-check-interval-seconds	-1	Time between checks for aggregated log retention. If set to 0 or a negative value then the value is computed as one-tenth of the aggregated log retention time. Be careful, set this too small and you will spam the name node.

etc/hadoop/mapred-site.xml

Configuration for MapReduce applications:

Parameter	Value	Notes
mapreduce.framework.name	yarn	Execution framework set to Hadoop YARN.
mapreduce.map.memory.mb	1536	Larger resource limit for maps.
mapreduce.map.java.opts	-Xmx1024M	Larger heap-size for child jvms of maps.
mapreduce.reduce.memory.mb	3072	Larger resource limit for reduces.
mapreduce.reduce.java.opts	-Xmx2560M	Larger heap-size for child jvms of reduces.
mapreduce.task.io.sort.mb	512	Higher memory-limit while sorting data for efficiency.
mapreduce.task.io.sort.factor	100	More streams merged at once while sorting files.
mapreduce.reduce.shuffle.parallelcopies	50	Higher number of parallel copies run by reduces to fetch outputs from very large number of maps.
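The memory.mb and java.opts rows above are related: the JVM heap (-Xmx) has to fit inside the container limit with some room left for non-heap memory. The sketch below checks that relationship for the values in the table and estimates how many such containers fit on one node. The 8192MB node size is a hypothetical yarn.nodemanager.resource.memory-mb value, and the per-node count ignores vcores and any other containers.

```shell
#!/bin/sh
# Values copied from the table above.
MAP_MEMORY_MB=1536;    MAP_XMX_MB=1024
REDUCE_MEMORY_MB=3072; REDUCE_XMX_MB=2560

# Hypothetical yarn.nodemanager.resource.memory-mb for one node (assumption).
NODE_MEMORY_MB=8192

report() {  # report <label> <container_mb> <heap_mb>
  if [ "$3" -ge "$2" ]; then
    echo "$1: heap ${3}MB does not fit in a ${2}MB container"
    return 1
  fi
  # Headroom is what remains for non-heap JVM memory; division is integer division.
  echo "$1: headroom $(( $2 - $3 ))MB, $(( NODE_MEMORY_MB / $2 )) containers per node"
}

report map "$MAP_MEMORY_MB" "$MAP_XMX_MB"
report reduce "$REDUCE_MEMORY_MB" "$REDUCE_XMX_MB"
```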

Configuration for the MapReduce JobHistory Server

Parameter	Value	Notes
mapreduce.jobhistory.address	MapReduce JobHistory Server host:port	Default port is 10020.
mapreduce.jobhistory.webapp.address	MapReduce JobHistory Server Web UI host:port	Default port is 19888.
mapreduce.jobhistory.intermediate-done-dir	/mr-history/tmp	Directory where history files are written by MapReduce jobs.
mapreduce.jobhistory.done-dir	/mr-history/done	Directory where history files are managed by the MR JobHistory Server.

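Monitoring Health of NodeManagers

Hadoop provides a mechanism by which administrators can configure the NodeManager to run an administrator-supplied script periodically to determine whether a node is healthy.
Administrators can determine whether the node is in a healthy state by performing any checks of their choice in the script. If the script detects the node to be in an unhealthy state, it must print a line to standard output beginning with the string ERROR. The NodeManager spawns the script periodically and checks its output. If the script's output contains the string ERROR, as described above, the node's status is reported as unhealthy and the node is blacklisted by the ResourceManager. No further tasks are assigned to this node. However, the NodeManager continues to run the script, so that if the node becomes healthy again, it is removed from the blacklisted nodes on the ResourceManager automatically. The node's health, along with the output of the script if it is unhealthy, is available to the administrator in the ResourceManager web interface. The time since the node was healthy is also displayed on the web interface.
The following parameters in etc/hadoop/yarn-site.xml can be used to control the node health monitoring script. (As you can see, the table is all in English, which means I did not read it closely =.=)

Parameter	Value	Notes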
yarn.nodemanager.health-checker.script.path	Node health script	Script to check for node’s health status.
yarn.nodemanager.health-checker.script.opts	Node health script options	Options for script to check for node’s health status.
yarn.nodemanager.health-checker.interval-ms	Node health script interval	Time interval for running health script.
yarn.nodemanager.health-checker.script.timeout-ms	Node health script timeout interval	Timeout for health script execution.
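A health script following the ERROR convention described above could look roughly like this. It is only a sketch: the check itself (that a directory the jobs depend on exists) and the REQUIRED_DIR variable are made up for illustration, and a real script would test whatever matters on your nodes.

```shell
#!/bin/sh
# Hypothetical node health script: prints a line starting with ERROR
# when a directory the jobs depend on is missing, and prints nothing otherwise.
REQUIRED_DIR="${REQUIRED_DIR:-/tmp}"   # placeholder path (assumption)

check_required_dir() {  # check_required_dir <path>
  if [ ! -d "$1" ]; then
    # The leading ERROR is what the NodeManager looks for in the output.
    echo "ERROR: required directory $1 is missing"
  fi
}

check_required_dir "$REQUIRED_DIR"
```

The path of such a script would then be set in yarn.nodemanager.health-checker.script.path. The result is reported through standard output only, so the sketch does not rely on the exit code.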

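Operating the Hadoop Cluster

Once all the necessary configuration is complete, distribute the files to the HADOOP_CONF_DIR directory on all the machines. This should be the same directory on all machines.
In general, it is recommended that HDFS and YARN run as separate users. In the majority of installations, HDFS processes execute as 'hdfs'. YARN typically uses the 'yarn' account.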
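The distribution step described above is often scripted. The sketch below is a dry run under stated assumptions: it only prints one rsync command per host instead of running it, the host names are hypothetical, and the hosts file is a temporary stand-in for a file such as etc/hadoop/slaves.

```shell
#!/bin/sh
# Prints (does not run) one rsync command per host listed in a slaves-style file.
HADOOP_CONF_DIR="${HADOOP_CONF_DIR:-/path/to/hadoop/etc/hadoop}"   # placeholder path

print_sync_commands() {  # print_sync_commands <hosts_file>
  while IFS= read -r host; do
    # Skip blank lines and comment lines.
    case "$host" in ''|'#'*) continue ;; esac
    # Same source and destination path, since the directory should match on all machines.
    echo "rsync -a $HADOOP_CONF_DIR/ $host:$HADOOP_CONF_DIR/"
  done < "$1"
}

# Temporary hosts file with two hypothetical workers.
HOSTS_FILE="$(mktemp)"
printf 'worker1.example.com\n# spare node, not in use\nworker2.example.com\n' > "$HOSTS_FILE"

print_sync_commands "$HOSTS_FILE"
```

Printing the commands first makes it easy to review them before piping the output to sh or replacing echo with the real call.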

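Hadoop Startup

To start a Hadoop cluster you need to start both the HDFS and the YARN cluster.
The first time you bring up HDFS, it must be formatted. Format a new distributed filesystem as hdfs: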
[hdfs]$ $HADOOP_PREFIX/bin/hdfs namenode -format <cluster_name>

The command used earlier in single-node or pseudo-distributed mode was:
$ bin/hdfs namenode -format

Start the HDFS NameNode with the following command on the designated node as hdfs:
[hdfs]$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start namenode

Start an HDFS DataNode with the following command on each designated node as hdfs:
[hdfs]$ $HADOOP_PREFIX/sbin/hadoop-daemons.sh --config $HADOOP_CONF_DIR --script hdfs start datanode

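If etc/hadoop/slaves and ssh trusted access are configured (see Single Node Setup), all of the HDFS processes can be started with the following utility script, as hdfs: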
[hdfs]$ $HADOOP_PREFIX/sbin/start-dfs.sh

Start YARN with the following command, run on the designated ResourceManager as yarn:
[yarn]$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start resourcemanager

Run a script to start a NodeManager on each designated host as yarn:
[yarn]$ $HADOOP_YARN_HOME/sbin/yarn-daemons.sh --config $HADOOP_CONF_DIR start nodemanager

Start a standalone WebAppProxy server. Run on the WebAppProxy server as yarn. If multiple servers are used with load balancing it should be run on each of them:
[yarn]$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start proxyserver

If etc/hadoop/slaves and ssh trusted access are configured (see Single Node Setup), all of the YARN processes can be started with the following utility script, as yarn:
[yarn]$ $HADOOP_PREFIX/sbin/start-yarn.sh

Start the MapReduce JobHistory Server with the following command, run on the designated server as mapred:
[mapred]$ $HADOOP_PREFIX/sbin/mr-jobhistory-daemon.sh --config $HADOOP_CONF_DIR start historyserver

Hadoop Shutdown

Each start script has a corresponding stop script; see the shutdown commands.
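The start/stop correspondence can be made concrete by rewriting the start commands from the previous section. The sketch below does that with sed and only prints the resulting command lines; nothing is executed, and the `to_stop_command` helper is made up for this example. It assumes the stop commands follow the same naming as the start ones (stop in place of start, stop-dfs.sh in place of start-dfs.sh), which matches the scripts shipped in sbin for Hadoop 2.x as far as I know.

```shell
#!/bin/sh
# Turns a start command line into the matching stop command line (text only, nothing is run).
to_stop_command() {  # to_stop_command <start command line>
  printf '%s\n' "$1" | sed -e 's/ start / stop /' -e 's|/start-|/stop-|'
}

# Single quotes keep $HADOOP_PREFIX and friends literal, as they appear in the post.
to_stop_command '$HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start namenode'
to_stop_command '$HADOOP_YARN_HOME/sbin/yarn-daemons.sh --config $HADOOP_CONF_DIR start nodemanager'
to_stop_command '$HADOOP_PREFIX/sbin/start-dfs.sh'
```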

