Flume-Solr Integration

This post integrates Flume with Solr. I have created a new sink. This sink is usually used with the RegexAll decorator, which performs a light transformation of event data into attributes. These attributes are converted into a Solr document and committed to Solr.
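As an illustration of the last step, here is a minimal, self-contained sketch of how event attributes could be turned into a Solr XML update message for the /update handler. The class and method names are hypothetical; the actual sink in this post may instead build documents through the SolrJ client.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: convert a map of Flume event attributes into the
// XML <add><doc> message that Solr's /update handler accepts.
public class SolrDocBuilder {

    // Escape XML special characters that may appear in attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Build one <field> element per attribute inside an <add><doc> envelope.
    static String toSolrAddXml(Map<String, String> attrs) {
        StringBuilder sb = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> e : attrs.entrySet()) {
            sb.append("<field name=\"").append(e.getKey()).append("\">")
              .append(escape(e.getValue()))
              .append("</field>");
        }
        sb.append("</doc></add>");
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("host", "web01");
        attrs.put("message", "GET /index.html 200");
        // prints: <add><doc><field name="host">web01</field><field name="message">GET /index.html 200</field></doc></add>
        System.out.println(toSolrAddXml(attrs));
    }
}
```

A real sink would then POST this XML to Solr's /update endpoint and issue a commit.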


What is Solr
Solr is an open source enterprise search server based on Lucene. Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Tomcat. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language.


What is Flume
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming
data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. The system is centrally managed and allows for intelligent dynamic management.


I have used flume-0.9.3 and apache-solr-3.1.0 for this POC.


The RegexAllExtractor decorator prepares events whose attributes are ready to be written to Solr. Implementing a RegexAllExtractor decorator is very simple.
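A minimal sketch of the idea behind such a decorator, assuming it applies a configured regex to the event body and emits one attribute per capture group (group names are passed in separately, since Java regexes of the Flume 0.9.x era had no named groups; the class below is illustrative, not the actual Flume decorator):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of what a RegexAllExtractor does: match a regex against
// an event body and turn each capture group into a named attribute.
public class RegexAllExtractor {
    private final Pattern pattern;
    private final String[] groupNames;

    RegexAllExtractor(String regex, String... groupNames) {
        this.pattern = Pattern.compile(regex);
        this.groupNames = groupNames;
    }

    // Returns one attribute per capture group, or an empty map when no match.
    Map<String, String> extract(String body) {
        Map<String, String> attrs = new LinkedHashMap<>();
        Matcher m = pattern.matcher(body);
        if (m.find()) {
            for (int i = 0; i < groupNames.length && i < m.groupCount(); i++) {
                attrs.put(groupNames[i], m.group(i + 1));
            }
        }
        return attrs;
    }

    public static void main(String[] args) {
        // Pull client, timestamp, method, and path out of an access-log line.
        RegexAllExtractor ex = new RegexAllExtractor(
            "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"(\\S+) (\\S+)",
            "client", "timestamp", "method", "path");
        System.out.println(ex.extract(
            "10.0.0.1 - - [01/Jan/2012:00:00:01 +0000] \"GET /index.html HTTP/1.1\" 200"));
    }
}
```

The resulting attribute map is exactly what the Solr sink would then convert into a document.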


S3 instead of HDFS with Hadoop

In this article we will discuss using S3 as a replacement for HDFS (Hadoop Distributed File System) on AWS (Amazon Web Services), and why S3 is needed at all. Before coming to the actual use case and the performance of S3 with Hadoop, let's understand what Hadoop is and what S3 is.

Let’s try to understand what the exact problems are and why HDFS is not typically used in the cloud. When new instances are launched in the cloud to build a Hadoop cluster, they do not have any data associated with them. One approach is to copy the entire huge dataset onto them, which is not feasible for several reasons, including bandwidth, the time taken to copy, and the associated cost. Secondly, after the jobs complete you will need to copy the results back before terminating the cluster machines; otherwise the results will be lost when the instances are terminated and you will not get anything. Also, because of the cost involved, keeping the entire cluster running just to hold data is not feasible.
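To make the replacement concrete: Hadoop of this era ships with S3 filesystem implementations (the "native" s3n scheme and the block-based s3 scheme), so pointing a cluster at a bucket is a configuration change. A minimal core-site.xml fragment might look like the following; the bucket name and credentials are placeholders.

```xml
<!-- Hypothetical core-site.xml fragment: use an S3 bucket (via the native
     s3n filesystem) as Hadoop's default filesystem in place of HDFS. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>s3n://your-bucket-name</value>
  </property>
  <property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value>YOUR_AWS_ACCESS_KEY_ID</value>
  </property>
  <property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value>YOUR_AWS_SECRET_ACCESS_KEY</value>
  </property>
</configuration>
```

With this in place, job input and output paths resolve against the bucket, so results survive cluster termination without a separate copy step.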

Save Data in EBS Volume

This tutorial will guide you through creating an Amazon EBS (Elastic Block Store) volume, attaching it to your running instance, and saving your data to Amazon EBS. Before that, let's understand what Amazon EBS volumes are and what features they provide.

Amazon Elastic Block Store (EBS) provides block level storage volumes for use with Amazon EC2 instances. We can imagine it as attaching an external hard drive to your system to store data. We can attach multiple EBS volumes to an instance, but a volume can be attached to only a single instance at a time. Data will remain saved in the volume after your instance is terminated.

Some Features of Amazon EBS Volumes
·        Amazon EBS allows you to create storage volumes from 1 GB to 1 TB
·        Amazon EBS volumes placed in a specific availability zone can then be attached to instances in that same availability zone.
·        Each storage volume is automatically replicated within the same Availability Zone. This prevents data loss due to failure of any single hardware component.
·        Amazon EBS also provides the ability to create point-in-time snapshots of volumes, which are persisted to Amazon S3. These snapshots can be used as the starting point for instantiating new Amazon EBS volumes.
·        AWS also enables you to create new volumes from AWS hosted public data sets.
·        Amazon CloudWatch exposes performance metrics for EBS volumes, giving you insight into bandwidth, throughput, latency, and queue depth.