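Flume has three main components:
Source: consumes data from an external data source such as a web system (typically the logs that the web system produces). The external source sends events to Flume in a format that the target Flume source recognizes. Common source types include avro, exec, jms, spooling directory, kafka, and netcat.
Channel: when a Flume source receives data from the external source, it first stores that data in one or more channels. The data stays in the channel until a sink consumes it. The main channel types are memory, jdbc, kafka, and file.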
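Sink: consumes data from the channel and writes it to an external repository such as a persistent file system, or forwards it to the source of the next Flume agent. The main sink types are HDFS, Hive, Avro, File Roll, Kafka, HBase, and ElasticSearch.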
The official Apache documentation explains it as follows:
A Flume source consumes events delivered to it by an external source like a web server. The external source sends events to Flume in a format that is recognized by the target Flume source. For example, an Avro Flume source can be used to receive Avro events from Avro clients or other Flume agents in the flow that send events from an Avro sink. A similar flow can be defined using a Thrift Flume Source to receive events from a Thrift Sink or a Flume Thrift Rpc Client or Thrift clients written in any language generated from the Flume thrift protocol. When a Flume source receives an event, it stores it into one or more channels. The channel is a passive store that keeps the event until it’s consumed by a Flume sink. The file channel is one example – it is backed by the local filesystem. The sink removes the event from the channel and puts it into an external repository like HDFS (via Flume HDFS sink) or forwards it to the Flume source of the next Flume agent (next hop) in the flow. The source and sink within the given agent run asynchronously with the events staged in the channel.
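To make the source, channel, and sink relationship concrete, here is a minimal single-agent configuration sketch, modeled on the single-node example in the Flume user guide. The agent name a1 and the component names r1, c1, and k1 are arbitrary labels chosen for illustration, and the netcat source, memory channel, and logger sink are just one possible combination of the types listed above.

```properties
# Name the components on this agent (a1, r1, k1, c1 are illustrative labels)
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Source: listen for lines of text on a local TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: buffer events in memory until the sink takes them
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Sink: log each event (stands in for HDFS, Kafka, etc.)
a1.sinks.k1.type = logger

# Wire the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Note that the source property is `channels` (plural) while the sink property is `channel` (singular): a source can write to one or more channels, whereas a sink drains exactly one. Assuming the file is saved as example.conf, the agent could be started with something like `bin/flume-ng agent --conf conf --conf-file example.conf --name a1`.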