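Introduction to Flume

Every running system produces a large volume of log data. Before those logs can be analyzed, they have to be collected from the production machines they are scattered across. Flume is a log collection system built for this purpose.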

Figure: a Flume Agent

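An agent has three main components:
Source: consumes data from an external data source such as a web system (typically the logs that system produces). The external source sends events to Flume in a format that the Flume source recognizes. Common source types include avro, exec, jms, spooling directory, kafka, and netcat.
Channel: when a Flume source receives data from the external source, it stores the data in one or more channels. The data stays in the channel until a sink consumes it. The main channel types are memory, jdbc, kafka, and file.
Sink: consumes data from the channel and either writes it to an external persistent store or forwards it to the source of the next agent. The main sink types are HDFS, Hive, Avro, File Roll, Kafka, HBase, and ElasticSearch.
The official Apache documentation describes this as follows: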
A Flume source consumes events delivered to it by an external source like a web server. The external source sends events to Flume in a format that is recognized by the target Flume source. For example, an Avro Flume source can be used to receive Avro events from Avro clients or other Flume agents in the flow that send events from an Avro sink. A similar flow can be defined using a Thrift Flume Source to receive events from a Thrift Sink or a Flume Thrift Rpc Client or Thrift clients written in any language generated from the Flume thrift protocol. When a Flume source receives an event, it stores it into one or more channels. The channel is a passive store that keeps the event until it’s consumed by a Flume sink. The file channel is one example – it is backed by the local filesystem. The sink removes the event from the channel and puts it into an external repository like HDFS (via Flume HDFS sink) or forwards it to the Flume source of the next Flume agent (next hop) in the flow. The source and sink within the given agent run asynchronously with the events staged in the channel.

Chaining Agents Across Multiple Tiers

Figure: multi-tier agent connection

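In some setups, many servers generate large volumes of log files. In that case, first deploy a tier of Flume agents on those servers, each with a sink of type avro. This sink sends the collected data onward to a destination specified in the configuration file. The next tier of Flume agents uses a source of type avro to receive the logs sent from the previous tier, and this second tier then writes the data into HDFS.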

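Example: simulating log collection with multiple agents

Assume three hosts: bigdata02, bigdata03, and bigdata04. bigdata02 and bigdata04 form the first collection tier, and bigdata03 is the second tier.

First, the Flume configuration file tail-avro.conf for bigdata02 and bigdata04: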

# Command to run: bin/flume-ng agent -c conf -f conf/tail-avro.conf -n a1
########
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/hadoop/flumedata/log/test.log
a1.sources.r1.channels = c1

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = bigdata03
a1.sinks.k1.port = 4141

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

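This configuration tails the file /home/hadoop/flumedata/log/test.log. Whenever new lines are appended, the exec source picks them up and the avro sink forwards them to bigdata03 on port 4141.

The configuration for bigdata03, avro-file_roll.conf:

# Receive data on the avro port and sink it to local files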
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141

# Describe the sink
a1.sinks.k1.type = file_roll
a1.sinks.k1.sink.directory = /tmp/test
a1.sinks.k1.channel = c1 
a1.sinks.k1.sink.rollInterval = 10

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

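This agent acts as the server side and receives events over Avro, so the source binds to 0.0.0.0 to listen on all of the host's IP addresses. The collected data is written to the local file system. To write to HDFS instead, change the sink type to hdfs and adjust the sink's properties accordingly. See the official Apache Flume documentation for the details.

On bigdata02 and bigdata04, use a shell loop to simulate log generation: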
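As a rough illustration of the HDFS variant mentioned above, the helper below prints a replacement sink section for avro-file_roll.conf. This is a minimal sketch, not a tested production setup: the function name print_hdfs_sink_conf is hypothetical, and the NameNode address namenode:9000 and the target directory /flume/events are placeholders that must be replaced with values from your own cluster. The property names are taken from the HDFS sink section of the Flume User Guide.

```shell
#!/bin/bash
# Sketch: print an HDFS sink section that could replace the file_roll sink.
# Assumption: namenode:9000 and /flume/events are placeholders, not real values.
print_hdfs_sink_conf() {
    cat <<'EOF'
# Describe the sink (HDFS instead of file_roll)
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode:9000/flume/events
# Write plain text events instead of the default SequenceFile
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
# Roll to a new file every 10 seconds; disable size- and count-based rolling
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0
EOF
}

# Usage: print_hdfs_sink_conf > hdfs-sink-section.conf
```

The 10-second roll interval mirrors the sink.rollInterval used by the file_roll sink in this example. Source and channel settings stay the same.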

#!/bin/bash
# Run on bigdata02: append one numbered line to the tailed log roughly every 0.3 seconds
i=1
while true
do
    echo "Flume get log to hdfs bigdata02$i" >> /home/hadoop/flumedata/log/test.log
    sleep 0.3
    i=$((i + 1))
done

#!/bin/bash
# Run on bigdata04: append one numbered line to the tailed log roughly every 0.3 seconds
i=1
while true
do
    echo "Flume get log to hdfs bigdata04$i" >> /home/hadoop/flumedata/log/test.log
    sleep 0.3
    i=$((i + 1))
done
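
Before the generated lines can show up on bigdata03, the agents have to be running. The helper below is a minimal sketch of how they might be started, assuming it is run from the Flume installation directory and that the configuration files sit in conf/ under the names used in this example. The function name start_agent and the DRY_RUN switch are hypothetical conveniences; the -Dflume.root.logger=INFO,console option is the console logging flag shown in the Flume User Guide.

```shell
#!/bin/bash
# Sketch: start the agent named a1 with the given config file.
# Set DRY_RUN=1 to print the command instead of executing it.
start_agent() {
    conf_file="$1"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "bin/flume-ng agent -c conf -f conf/${conf_file} -n a1 -Dflume.root.logger=INFO,console"
    else
        bin/flume-ng agent -c conf -f "conf/${conf_file}" -n a1 -Dflume.root.logger=INFO,console
    fi
}

# On bigdata03 (second tier), so that the Avro source is listening on port 4141:
#   start_agent avro-file_roll.conf
# On bigdata02 and bigdata04 (first tier):
#   start_agent tail-avro.conf
```

Starting bigdata03 first is a design choice rather than a strict requirement: it keeps the first-tier Avro sinks from logging connection failures while nothing is listening yet.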

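The results are as follows:
Figure: the output directory of the log collection system

Figure: the contents of a collected log file

The file does contain data from both bigdata02 and bigdata04.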
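
One way to check this without opening the files by hand is to count the collected lines per origin host. The sketch below assumes the file_roll output directory from this example (/tmp/test on bigdata03) and relies on each generated line containing its host name. The function name count_by_host is hypothetical.

```shell
#!/bin/bash
# Sketch: count collected lines per origin host in a file_roll output directory.
count_by_host() {
    out_dir="$1"
    for host in bigdata02 bigdata04; do
        # grep -c prints the number of matching lines (0 if there are none)
        n=$(cat "$out_dir"/* 2>/dev/null | grep -c "$host")
        echo "$host $n"
    done
}

# Usage on bigdata03: count_by_host /tmp/test
```

If both counts keep growing while the generator scripts run, events from both first-tier agents are reaching the second tier.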

