Tuesday, July 23, 2013

Build and run a Flume agent to work with the Twitter Streaming API



Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Its main goal is to deliver data from applications to Apache Hadoop's HDFS. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic applications (source)

The data flow model is well described in the official documentation and consists of three components: a source, which receives data from an external system and delivers it into Flume; a channel, which transports data from the source to the sink (think of the channel as a queue; it also decouples the source and sink so they can run asynchronously); and a sink, the destination. Together these components are called an agent, and agents can be chained to build complex, fail-over flows.

Several days ago I wrote a simple source to load specific Twitter streaming data into HDFS for future processing. Let's go through the main steps...
First of all, you need to create your own source (if none of the standard sources is suitable). You will need to override several methods such as configure, start and stop (see example); a minimal sketch follows below.
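Here is a rough sketch of what such a source can look like, assuming the twitter4j library is used for the streaming connection. The class name TwitterSource and the property names (consumerKey, consumerSecret, accessToken, accessTokenSecret) are placeholders of my own, not taken from the linked example:

import org.apache.flume.Context;
import org.apache.flume.EventDrivenSource;
import org.apache.flume.channel.ChannelProcessor;
import org.apache.flume.conf.Configurable;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.source.AbstractSource;

import twitter4j.StallWarning;
import twitter4j.Status;
import twitter4j.StatusDeletionNotice;
import twitter4j.StatusListener;
import twitter4j.TwitterStream;
import twitter4j.TwitterStreamFactory;
import twitter4j.conf.ConfigurationBuilder;

public class TwitterSource extends AbstractSource
        implements EventDrivenSource, Configurable {

    private TwitterStream twitterStream;
    private String consumerKey;
    private String consumerSecret;
    private String accessToken;
    private String accessTokenSecret;

    @Override
    public void configure(Context context) {
        // OAuth credentials are read from the agent configuration file;
        // these property names are illustrative placeholders.
        consumerKey = context.getString("consumerKey");
        consumerSecret = context.getString("consumerSecret");
        accessToken = context.getString("accessToken");
        accessTokenSecret = context.getString("accessTokenSecret");
    }

    @Override
    public void start() {
        final ChannelProcessor channel = getChannelProcessor();

        StatusListener listener = new StatusListener() {
            @Override
            public void onStatus(Status status) {
                // Wrap each tweet in a Flume event and hand it to the channel.
                channel.processEvent(
                        EventBuilder.withBody(status.getText().getBytes()));
            }
            @Override public void onDeletionNotice(StatusDeletionNotice n) {}
            @Override public void onTrackLimitationNotice(int limited) {}
            @Override public void onScrubGeo(long userId, long upToStatusId) {}
            @Override public void onStallWarning(StallWarning warning) {}
            @Override public void onException(Exception e) {}
        };

        ConfigurationBuilder cb = new ConfigurationBuilder()
                .setOAuthConsumerKey(consumerKey)
                .setOAuthConsumerSecret(consumerSecret)
                .setOAuthAccessToken(accessToken)
                .setOAuthAccessTokenSecret(accessTokenSecret);

        twitterStream = new TwitterStreamFactory(cb.build()).getInstance();
        twitterStream.addListener(listener);
        twitterStream.sample(); // or filter(...) to track specific keywords

        super.start();
    }

    @Override
    public void stop() {
        twitterStream.shutdown();
        super.stop();
    }
}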

Put the jar into the lib folder under your Flume home directory.
Put the configuration into the conf directory (see the configuration file example); a sketch is shown below.
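As a rough illustration, a configuration for such an agent could look like the following. The fully qualified class name, the HDFS path and the credential values are assumptions matching the source sketch above, not the original configuration file:

TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

TwitterAgent.sources.Twitter.type = com.example.flume.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <consumer key>
TwitterAgent.sources.Twitter.consumerSecret = <consumer secret>
TwitterAgent.sources.Twitter.accessToken = <access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <access token secret>

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000

TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:8020/user/flume/tweets
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream

Note that the agent name (TwitterAgent here) must match the -n flag used when starting Flume.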

After that you are ready to start Flume:
./flume-ng agent -f ../conf/flume.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent

-f path to the config file
-n name of the agent to run (must match the agent name in the config)
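Once the agent is running, tweets should start landing in HDFS. Assuming the sink path from the configuration sketch above, you can verify this with a standard HDFS listing:

hadoop fs -ls /user/flume/tweets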
