In this tutorial I explain how to install and configure Flume NG on a single system, how to configure Flume NG to copy data to HDFS, and then the configuration for copying data to HBase.
Before going to the configurations, let's understand what Flume NG (1.x.x) is:
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. The system is centrally managed and allows for intelligent dynamic management. It uses a simple extensible data model that allows for online analytic application.
In this tutorial I have used:
Hadoop 0.20.X (Apache or Cloudera)
Flume 1.1.0
HBase 0.90.4
Ubuntu 11.10
I am assuming that you have a Hadoop cluster ready. If you don't have Hadoop installed, you can refer to:
- For Hadoop in pseudo-distributed mode, please click here
- For Hadoop in distributed mode, please click here
Download Flume NG from
https://www.apache.org/dyn/closer.cgi/incubator/flume/flume-1.1.0-incubating/
Or
http://archive.cloudera.com/cdh4/cdh/4/
Untar:
tar xzf flume-1.1.0-cdh4.0.0.tar.gz
cd flume-1.1.0-cdh4.0.0
cp conf/flume-conf.properties.template conf/flume.conf
cp conf/flume-env.sh.template conf/flume-env.sh
Configuration:
The following is the configuration for copying data from a file on the local file system to HDFS.
Edit flume.conf and add the following entries:
agent1.sources = tail
agent1.channels = Channel-2
agent1.sinks = HDFS

agent1.sources.tail.type = exec
agent1.sources.tail.command = tail -F /var/log/apache2/access.log
agent1.sources.tail.channels = Channel-2

agent1.sinks.HDFS.channel = Channel-2
agent1.sinks.HDFS.type = hdfs
agent1.sinks.HDFS.hdfs.path = hdfs://localhost:9000/flume
agent1.sinks.HDFS.hdfs.fileType = DataStream

agent1.channels.Channel-2.type = memory
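The memory channel buffers events in RAM between the source and the sink, so events are lost if the agent dies. You can optionally bound its size; capacity is a standard memory-channel property, and the value below is only an illustration:
agent1.channels.Channel-2.capacity = 1000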
Start Flume to copy data to HDFS:
bin/flume-ng agent --conf ./conf/ -f conf/flume.conf -n agent1
Note that the agent name is specified by -n agent1 and must match an agent name given in -f conf/flume.conf.
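Since the source tails the Apache access log, you can generate a few test events by simply making requests to the local web server (this assumes Apache is running and logging on this machine; any logged request will do):
curl http://localhost/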
You can check your data in HDFS either from the web console (http://localhost:50070) or from the command prompt.
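For example, from the command prompt (assuming the hadoop binary is on your PATH; the exact file names under /flume will vary):
hadoop fs -ls /flume
hadoop fs -cat /flume/<file-name>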
Now let's configure Flume NG to copy data to HBase.
Here I am assuming that you have HBase installed and ready.
Create the table in HBase into which you want to copy the data (from the HBase shell):
create 'myTab', 'cf'
Edit flume.conf and add the following entries:
hbase-agent.sources=tail
hbase-agent.sinks=sink1
hbase-agent.channels=ch1

hbase-agent.sources.tail.type=exec
hbase-agent.sources.tail.command=tail -F /tmp/test05
hbase-agent.sources.tail.channels=ch1

hbase-agent.sinks.sink1.type=org.apache.flume.sink.hbase.HBaseSink
hbase-agent.sinks.sink1.channel=ch1
hbase-agent.sinks.sink1.table=myTab
hbase-agent.sinks.sink1.columnFamily=cf
hbase-agent.sinks.sink1.column=c1
hbase-agent.sinks.sink1.serializer=org.apache.flume.sink.hbase.SimpleHbaseEventSerializer
hbase-agent.sinks.sink1.serializer.payloadColumn=coll
hbase-agent.sinks.sink1.serializer.incrementColumn=coll
hbase-agent.sinks.sink1.serializer.rowPrefix=1+
hbase-agent.sinks.sink1.serializer.suffix=timestamp

hbase-agent.channels.ch1.type=memory
Set the following variables in .bashrc:
export HADOOP_HOME=YOUR-PATH
export HADOOP_PREFIX=YOUR-PATH
export PATH=YOUR-PATH:$PATH
export FLUME_HOME=YOUR-PATH
export FLUME_CONF_DIR=YOUR-PATH
export HBASE_HOME=YOUR-PATH
export CLASSPATH=$CLASSPATH:$HBASE_HOME/conf:$HADOOP_HOME/conf
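Alternatively, since we copied conf/flume-env.sh earlier, you can keep the Flume-specific settings there instead of .bashrc; the flume-ng script picks up FLUME_CLASSPATH from that file (the paths are placeholders, same as above):
export JAVA_HOME=YOUR-PATH
export FLUME_CLASSPATH=$HBASE_HOME/conf:$HADOOP_HOME/conf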
Now start Flume to copy data to HBase:
bin/flume-ng agent --conf ./conf/ -f conf/flume.conf -n hbase-agent
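Since this agent tails /tmp/test05, you can feed it a few test events by appending lines to that file:
echo "flume test event 1" >> /tmp/test05
echo "flume test event 2" >> /tmp/test05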
Check whether the data gets copied into HBase (from the HBase shell):
scan 'myTab'
Note that the SimpleHbaseEventSerializer specified in the configuration is just an example of an event serializer; you should write your own class depending on your requirements.
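If you do write your own serializer, point the sink at your class instead; the class name below is only a hypothetical placeholder, and the jar containing it must be on Flume's classpath:
hbase-agent.sinks.sink1.serializer = com.example.flume.MyHbaseEventSerializer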
I have 2 nodes: agent1 and a collector. When I configure agent1 as text("XXXX") agentSink("localhost",35853) and the collector as collectorSource(35853) console, it is failing in the agent sink. Is there any additional setting?
Below are the error details:
12/06/27 11:23:37 INFO debug.TextFileSource: File /home/hadoop/as/ash opened
12/06/27 11:23:37 INFO agent.LogicalNode: Node config successfully set to FlumeConfigData: {srcVer:'Wed Jun 27 11:23:33 IST 2012' snkVer:'Wed Jun 27 11:23:33 IST 2012' ts='Wed Jun 27 11:23:33 IST 2012' flowId:'default-flow' source:'text( "/home/hadoop/as/ash" )' sink:'agentSink( "localhost", 35853 )' }
12/06/27 11:23:37 INFO durability.NaiveFileWALManager: NaiveFileWALManager is now open
12/06/27 11:23:37 INFO rolling.RollSink: Created RollSink: trigger=[TimeTrigger: maxAge=10000 tagger=com.cloudera.flume.handlers.rolling.ProcessTagger@1a1c42f] checkPeriodMs = 250 spec='ackingWal'
12/06/27 11:23:37 INFO rolling.RollSink: opening RollSink 'ackingWal'
12/06/27 11:23:37 INFO hdfs.SeqfileEventSink: constructed new seqfile event sink: file=/tmp/flume-hadoop/agent/localhost/writing/20120627-112337056+0530.1716541877172.00000019
12/06/27 11:23:37 WARN rolling.RollSink: Failure when attempting to open initial sink
java.io.IOException: failure to login
Hi Jeet,
I think you are getting confused between Flume OG and Flume NG. In Flume NG the architecture has been completely changed: the concept of a collector is removed, and the concept of an agent has changed; an agent is now a JVM process. I think you should refer to:
https://cwiki.apache.org/FLUME/flume-ng.html
In Flume NG, how do I start a master?
I did all the settings from https://cwiki.apache.org/confluence/display/FLUME/Getting+Started.
Now what command should I write to start my source and sink?
Hi Jeet,
Again you are getting confused between Flume NG and Flume OG; there is no master in NG. If you look at the architecture diagram shown above, there is simply a source, a sink, and a channel connecting the source and sink. All the details about the commands you should run are mentioned in the tutorial above. To start Flume you need to run:
"bin/flume-ng agent --conf ./conf/ -f conf/flume.conf -n agent1"
Hi Rahul, I am done with the steps above. I changed
agent1.sources.tail.command = tail -F /var/log/apache2/access.log
to
agent1.sources.tail.command = text -F /home/hadoop/as/ash (it is on my local FS).
Now I am getting an error with the command below:
bin/flume-ng agent --conf ./conf/ -f conf/flume.conf -n agent1
Unknown or unspecified command 'agent'
usage: bin/flume-ng [COMMAND] [OPTION]...
So what should I write in place of agent?
Please reply.
Thanks
Ashok
In your command, what is "text"?
agent1.sources.tail.command = text -F /home/hadoop/as/ash
That should be tail or some other Unix command. The rest all looks fine.
Let me change it and try.
I did it: tail -F /home/hadoop/as/ash
But I still get the same error:
bin/flume-ng agent --conf ./conf/ -f conf/flume.conf -n agent1
Unknown or unspecified command 'agent'
usage: bin/flume-ng [COMMAND] [OPTION]...
Please post the contents of your configuration file "flume.conf".
# The channel can be defined as follows.
foo.sources.seqGenSrc.channels = memoryChannel
# Each sink's type must be defined
foo.sinks.loggerSink.type = logger
#Specify the channel the sink should use
foo.sinks.loggerSink.channel = memoryChannel
# Each channel's type is defined.
foo.channels.memoryChannel.type = memory
# Other config values specific to each type of channel(sink or source)
# can be defined as well
# In this case, it specifies the capacity of the memory channel
foo.channels.memoryChannel.capacity = 100
# Define a memory channel called ch1 on agent1
agent1.channels.ch1.type = memory
# Define an Avro source called avro-source1 on agent1 and tell it
# to bind to 0.0.0.0:41414. Connect it to channel ch1.
agent1.sources.avro-source1.channels = ch1
agent1.sources.avro-source1.type = avro
agent1.sources.avro-source1.bind = 0.0.0.0
agent1.sources.avro-source1.port = 41414
# Define a logger sink that simply logs all events it receives
# and connect it to the other end of the same channel.
agent1.sinks.log-sink1.channel = ch1
agent1.sinks.log-sink1.type = logger
# Finally, now that we've defined all of our components, tell
# agent1 which ones we want to activate.
agent1.channels = ch1
agent1.sources = avro-source1
agent1.sinks = log-sink1
This is my new flume.conf file. Please tell me which command I should run.
Hi,
I am really very new to this technology. I tried to configure Flume using your configuration with HBase as the sink, but I got:
ERROR properties.PropertiesFileConfigurationProvider: Failed to start agent because dependencies were not found in classpath.
I have included the Flume conf path in FLUME_CLASSPATH and have also set the JAVA_HOME path. What else do I need to include in env.sh?
Please guide me.
Thanks
Dear Rahul,
Thanks for the example. I simplified your .conf file to write to the console with the logger sink (since I was not able to get tail to write to HDFS), but nothing appears on the console. Even if I add new lines to the log file, there is no output on the console. My tail2logger.conf file is as follows:
# list sources, sinks and channels in the source agent
agent1.sources = tail1
agent1.channels = memoryChannel
agent1.sinks = sink1
#Describe the source
agent1.sources.tail1.type = exec
agent1.sources.tail1.command = tail -F /home/hduser/flume/exlog.txt
#Describe the sink
agent1.sinks.sink1.type = logger
#Describe the channel
agent1.channels.memoryChannel.type = memory
# Bind the source and sink to the channel
agent1.sources.tail1.channels = memoryChannel
agent1.sinks.sink1.channel = memoryChannel
and I use the command line:
flume-ng agent --conf-file tail2logger.conf --name agent1 -Dflume.root.logger=INFO,console
Appreciate your help
Regards