Saturday, June 23, 2012

Deploy Apache Flume NG (1.x.x)

In this tutorial I explain how to install and configure Flume NG on a single system, how to configure it to copy data to HDFS, and then how to configure it to copy data to HBase.

Before getting to the configuration, let's understand what Flume NG (1.x.x) is:
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows, and it is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic applications. Note that, unlike Flume OG, Flume NG has no central master: each agent is a single JVM process hosting sources, channels, and sinks.

[Flume NG architecture diagram: a source feeds events into a channel, and a sink drains the channel, all within a single agent]
In this tutorial I have used:
Hadoop 0.20.X (Apache or Cloudera)
Flume 1.1.0 
HBase 0.90.4
Ubuntu 11.10

I am assuming that you have a Hadoop cluster ready; if you don't have Hadoop installed, you can refer to:
  • For Hadoop in pseudo distributed mode please click here
  • For Hadoop in distributed mode please click here
Let’s Deploy Flume NG:
Download Flume NG from 
https://www.apache.org/dyn/closer.cgi/incubator/flume/flume-1.1.0-incubating/
Or
http://archive.cloudera.com/cdh4/cdh/4/

Untar:
tar xzf flume-1.1.0-cdh4.0.0.tar.gz
Copy the configuration templates:
cp conf/flume-conf.properties.template conf/flume.conf
cp conf/flume-env.sh.template conf/flume-env.sh

Configuration:
The following configuration copies data from a file on the local file system to HDFS.
Edit flume.conf and add the following entries:


agent1.sources = tail
agent1.channels = Channel-2
agent1.sinks = HDFS

agent1.sources.tail.type = exec
agent1.sources.tail.command = tail -F /var/log/apache2/access.log
agent1.sources.tail.channels = Channel-2

agent1.sinks.HDFS.channel = Channel-2
agent1.sinks.HDFS.type = hdfs
agent1.sinks.HDFS.hdfs.path = hdfs://localhost:9000/flume
agent1.sinks.HDFS.hdfs.fileType = DataStream

agent1.channels.Channel-2.type = memory
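The memory channel above runs with Flume's defaults, which are quite small. If the tail source produces bursts of events, you can size the channel explicitly. A hedged sketch (capacity and transactionCapacity are standard memory-channel properties in Flume 1.x; the values here are only illustrative):

```properties
# Hold up to 10000 events in memory; move up to 100 events per transaction.
agent1.channels.Channel-2.capacity = 10000
agent1.channels.Channel-2.transactionCapacity = 100
```

Keep in mind that a memory channel loses buffered events if the agent dies; it trades durability for speed.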


Start Flume to copy data to HDFS:
bin/flume-ng agent --conf ./conf/ -f conf/flume.conf -n agent1

Note that the agent name is specified by -n agent1 and must match an agent name defined in conf/flume.conf

You can check your data in HDFS either from the web console (http://localhost:50070) or from the command line (e.g. hadoop fs -ls /flume)


Now let's configure Flume NG to copy data to HBase.
Here I am assuming that you have HBase installed and running.

Create a table in HBase into which you want to copy the data (from the HBase shell):

create 'myTab', 'cf'

Edit flume.conf and add the following entries:


hbase-agent.sources=tail
hbase-agent.sinks=sink1
hbase-agent.channels=ch1


hbase-agent.sources.tail.type=exec
hbase-agent.sources.tail.command=tail -F /tmp/test05
hbase-agent.sources.tail.channels=ch1


hbase-agent.sinks.sink1.type=org.apache.flume.sink.hbase.HBaseSink
hbase-agent.sinks.sink1.channel=ch1
hbase-agent.sinks.sink1.table=myTab
hbase-agent.sinks.sink1.columnFamily=cf
hbase-agent.sinks.sink1.column=c1
hbase-agent.sinks.sink1.serializer=org.apache.flume.sink.hbase.SimpleHbaseEventSerializer
hbase-agent.sinks.sink1.serializer.payloadColumn=coll
hbase-agent.sinks.sink1.serializer.incrementColumn=coll
hbase-agent.sinks.sink1.serializer.rowPrefix=1+
hbase-agent.sinks.sink1.serializer.suffix=timestamp


hbase-agent.channels.ch1.type=memory
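The exec source above tails /tmp/test05, so that file must exist and receive data for anything to reach HBase. A minimal sketch to generate some test input (the path is the one used in flume.conf above):

```shell
# Append a few sample lines to the file the exec source is tailing.
TESTFILE=/tmp/test05
for i in 1 2 3 4 5; do
  echo "test event $i" >> "$TESTFILE"
done
# Show what the source will pick up next.
tail -n 5 "$TESTFILE"
```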


Set the following variables in your .bashrc:


export HADOOP_HOME=YOUR-PATH
export HADOOP_PREFIX=YOUR-PATH
export PATH=YOUR-PATH:$PATH
export FLUME_HOME=YOUR-PATH
export FLUME_CONF_DIR=YOUR-PATH
export HBASE_HOME=YOUR-PATH
export CLASSPATH=$CLASSPATH:$HBASE_HOME/conf:$HADOOP_HOME/conf
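To quickly verify that these variables are actually exported in your current shell, here is a small bash sketch (it uses bash's ${!v} indirect expansion; the variable names are the ones from the list above):

```shell
# Print each required variable, flagging any that are unset.
check_env() {
  for v in "$@"; do
    if [ -z "${!v}" ]; then
      echo "$v is NOT set"
    else
      echo "$v=${!v}"
    fi
  done
}
check_env HADOOP_HOME FLUME_HOME FLUME_CONF_DIR HBASE_HOME
```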


Now start Flume to copy data to HBase:

bin/flume-ng agent --conf ./conf/ -f conf/flume.conf -n hbase-agent

Check whether the data was copied into HBase (from the HBase shell):
scan 'myTab'

Note that the SimpleHbaseEventSerializer specified in the configuration is just an example of an event serializer; you should write your own serializer class depending on your requirements.

13 comments:

  1. i have 2 node....agent 1 and collecter....when i configure...agent 1- as text ("XXXX") agentsink("localhost,35853")....and collecter
    collectorSource(35853) console .....its getting fail in agent sink....is there any additional setting...

    below is the error details

    12/06/27 11:23:37 INFO debug.TextFileSource: File /home/hadoop/as/ash opened
    12/06/27 11:23:37 INFO agent.LogicalNode: Node config successfully set to FlumeConfigData: {srcVer:'Wed Jun 27 11:23:33 IST 2012' snkVer:'Wed Jun 27 11:23:33 IST 2012' ts='Wed Jun 27 11:23:33 IST 2012' flowId:'default-flow' source:'text( "/home/hadoop/as/ash" )' sink:'agentSink( "localhost", 35853 )' }
    12/06/27 11:23:37 INFO durability.NaiveFileWALManager: NaiveFileWALManager is now open
    12/06/27 11:23:37 INFO rolling.RollSink: Created RollSink: trigger=[TimeTrigger: maxAge=10000 tagger=com.cloudera.flume.handlers.rolling.ProcessTagger@1a1c42f] checkPeriodMs = 250 spec='ackingWal'
    12/06/27 11:23:37 INFO rolling.RollSink: opening RollSink 'ackingWal'
    12/06/27 11:23:37 INFO hdfs.SeqfileEventSink: constructed new seqfile event sink: file=/tmp/flume-hadoop/agent/localhost/writing/20120627-112337056+0530.1716541877172.00000019
    12/06/27 11:23:37 WARN rolling.RollSink: Failure when attempting to open initial sink
    java.io.IOException: failure to login

  2. Hi Jeet,
    I think you are confusing Flume OG with Flume NG.

    In Flume NG the architecture has been completely changed: the concept of a collector has been removed, and an agent is now a single JVM process.

    I think you should refer to
    https://cwiki.apache.org/FLUME/flume-ng.html

  3. in flume NG how to start a master ?
    i did all setting from https://cwiki.apache.org/confluence/display/FLUME/Getting+Started..
    now what the command i should write to start mysource and sink

    Replies
    1. Hi Jeet,
      Again, you are confusing Flume NG with Flume OG.

      There is no master in NG. If you look at the architecture diagram shown above, there is simply a source, a sink, and a channel connecting them.

      All the details about the commands you should run are in the tutorial above; to start Flume you need to run:
      "bin/flume-ng agent --conf ./conf/ -f conf/flume.conf -n agent1"

  4. Hi rahul...i am done with step in above...now i run below command //// and i changed agent1.sources.tail.command = tail -F /var/log/apache2/access.log to

    agent1.sources.tail.command = text -F /home/hadoop/as/ash (it is in my local fs)

    now in below command i am getting error as

    bin/flume-ng agent --conf ./conf/ -f conf/flume.conf -n agent1

    Unknown or unspecified command 'agent'
    usage: bin/flume-ng [COMMAND] [OPTION]...

    So what should i write in place of agent
    ////please reply..

    Thanks
    Ashok

    Replies
    1. In your command, what is "text"?
      agent1.sources.tail.command = text -F /home/hadoop/as/ash

      That should be tail or some other Unix command; the rest all looks fine.

  5. let me change and try

  6. i did it tail -F /home/hadoop/as/ash


    but still same error.

    bin/flume-ng agent --conf ./conf/ -f conf/flume.conf -n agent1

    Unknown or unspecified command 'agent'
    usage: bin/flume-ng [COMMAND] [OPTION]...

    Replies
    1. Please post the contents of your configuration file "flume.conf".

  7. # The channel can be defined as follows.
    foo.sources.seqGenSrc.channels = memoryChannel

    # Each sink's type must be defined
    foo.sinks.loggerSink.type = logger

    #Specify the channel the sink should use
    foo.sinks.loggerSink.channel = memoryChannel

    # Each channel's type is defined.
    foo.channels.memoryChannel.type = memory

    # Other config values specific to each type of channel(sink or source)
    # can be defined as well
    # In this case, it specifies the capacity of the memory channel
    foo.channels.memoryChannel.capacity = 100
    # Define a memory channel called ch1 on agent1
    agent1.channels.ch1.type = memory

    # Define an Avro source called avro-source1 on agent1 and tell it
    # to bind to 0.0.0.0:41414. Connect it to channel ch1.
    agent1.sources.avro-source1.channels = ch1
    agent1.sources.avro-source1.type = avro
    agent1.sources.avro-source1.bind = 0.0.0.0
    agent1.sources.avro-source1.port = 41414

    # Define a logger sink that simply logs all events it receives
    # and connect it to the other end of the same channel.
    agent1.sinks.log-sink1.channel = ch1
    agent1.sinks.log-sink1.type = logger

    # Finally, now that we've defined all of our components, tell
    # agent1 which ones we want to activate.
    agent1.channels = ch1
    agent1.sources = avro-source1
    agent1.sinks = log-sink1

  8. this is my new flume.conf file....please tell me which command i should run????

  9. hi
    i am really very new to this technology.
    i tried to configure flume using ur configuration using hbase as sink.
    but i got ERROR properties.PropertiesFileConfigurationProvider: Failed to start agent because dependencies were not found in classpath.

    i have included flume conf path in FLUME_CLASSPATH
    also have given JAVA_HOME path.
    what elso do i need to include in env.sh
    pls guide me.
    thanks

  10. Dear Rahul,
    Thanks for the example. I simplified your .conf file to write to console with logger (since I was not able to get tail to write to hdfs), but nothing appears on the console. Even if I add new lines to the log file, there is no output on the console. My tail2logger.conf file is as follows:
    # list sources, sinks and channels in the source agent
    agent1.sources = tail1
    agent1.channels = memoryChannel
    agent1.sinks = sink1

    #Describe the source
    agent1.sources.tail1.type = exec
    agent1.sources.tail1.command = tail -F /home/hduser/flume/exlog.txt

    #Describe the sink
    agent1.sinks.sink1.type = logger

    #Describe the channel
    agent1.channels.memoryChannel.type = memory

    # Bind the source and sink to the channel
    agent1.sources.tail1.channels = memoryChannel
    agent1.sinks.sink1.channel = memoryChannel

    and I use the command-line:
    flume-ng agent --conf-file tail2logger.conf --name agent1 -Dflume.root.logger=INFO,console

    Appreciate your help
    Regards
