Retrieve tweets with the Twitter Streaming API and the Hortonworks Sandbox

Classic but mandatory: I'll talk about social media and big data in future articles. This first one describes how to retrieve live tweets and store them in HDFS.

Prerequisites

We'll use Apache Flume to retrieve the tweet stream.

A Flume event is defined as a unit of data flow having a byte payload and an optional set of string attributes. A Flume agent is a (JVM) process that hosts the components through which events flow from an external source to the next destination (hop).

[Figure: Flume agent component diagram]

A Flume data flow is made up of five main components: Events, Sources, Channels, Sinks, and Agents.

Events

An event is the basic unit of data that is moved using Flume. It is similar to a message in JMS and is generally small. It is made up of headers and a byte-array body.

Sources

The source receives the event from some external entity and stores it in a channel. The source must understand the type of event that is sent to it: an Avro event requires an Avro source.

Channels

A channel is an internal passive store with certain specific characteristics. An in-memory channel, for example, can move events very quickly, but does not provide persistence. A file-based channel provides persistence. A source stores an event in the channel where it stays until it is consumed by a sink. This temporary storage allows source and sink to run asynchronously.

Sinks

The sink removes the event from the channel and forwards it on either to a destination, like HDFS, or to another agent/dataflow. The sink must output an event that is appropriate to the destination.

Agents

An agent is the container for a Flume data flow. It is any physical JVM running Flume. The same agent can run multiple sources, sinks, and channels. A particular data flow path is set up through the configuration process.
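
To make this wiring concrete, here is a minimal sketch of a single-agent configuration using Flume's built-in netcat source and logger sink (the names a1, r1, c1 and k1 are arbitrary; the Twitter setup below follows exactly the same pattern):

# One agent (a1) hosting one source, one channel and one sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: reads lines of text from a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Channel: in-memory buffer between source and sink (fast, but not persistent)
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Sink: writes events to the log (useful for testing)
a1.sinks.k1.type = logger

# Bind source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1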

Installation

On the Hortonworks Sandbox, connect to the console with PuTTY and type:

yum install flume
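
To check that the installation went well, flume-ng ships with a version command that simply prints the installed Flume version:

flume-ng version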

Twitter application

Before any configuration, you have to register a Twitter application.

Log into dev.twitter.com and select “Create a new application”. After creating a new application, Twitter will provide the following information:

• consumerKey
• consumerSecret
• accessToken
• accessTokenSecret

You will need these values in a later step.

Flume Configuration

Download the flume-sources-1.0-SNAPSHOT.jar file and place it in /usr/lib/flume/lib.

Note: to keep things simple, we use the Cloudera package for the Twitter source.

Create two new folders in HDFS:

hadoop fs -mkdir /user/flume
hadoop fs -mkdir /user/flume/tweets
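
A quick check that the folders were created:

hadoop fs -ls /user/flume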

Edit /etc/flume/conf/flume.conf and add the following. Note that the agent's sources, channels and sinks must be declared, and the source bound to its channel; the agent won't start without these lines:

TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = {consumerKey value}
TwitterAgent.sources.Twitter.consumerSecret = {consumerSecret value}
TwitterAgent.sources.Twitter.accessToken = {accessToken value}
TwitterAgent.sources.Twitter.accessTokenSecret = {accessTokenSecret value}
TwitterAgent.sources.Twitter.keywords = {your keywords}

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100

TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://sandbox.hortonworks.com:8020/user/flume/tweets/%Y/%m/%d/%H/

Note: as you can see, the path uses the sandbox VM's fully qualified name; it's essential to use that name and not localhost. The default sandbox NameNode port is 8020.

Now we just have to start the Flume agent:

flume-ng agent --conf-file /etc/flume/conf/flume.conf --name TwitterAgent > flume_twitteragent.log

If you want to be able to close your terminal session while the agent keeps running, launch the command with nohup, as shown below.
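
For example (the log file name is arbitrary; 2>&1 also redirects error output to the log, and & sends the process to the background):

nohup flume-ng agent --conf-file /etc/flume/conf/flume.conf --name TwitterAgent > flume_twitteragent.log 2>&1 &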


Now you just have to wait and the folder will fill up with tweets. To check, type:

hadoop fs -ls /user/flume/tweets/{year}/{month}/{day}/{hour}
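
To peek at one of the stored tweets, cat one of the generated files (Flume's HDFS sink names them FlumeData.{timestamp} by default; use a real path taken from the listing above):

hadoop fs -cat /user/flume/tweets/{year}/{month}/{day}/{hour}/FlumeData.{timestamp} | head -n 1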

Structured, semi-structured and unstructured data

Three concepts come with big data: structured, semi-structured and unstructured data.

Structured Data

For geeks and developers (not the same thing ^^), structured data is nothing new. It covers all data that can be stored in a SQL database, in tables with rows and columns. It has relational keys and can easily be mapped into pre-designed fields. Today, structured data is the kind most often processed in development and the simplest way to manage information.

But structured data represents only 5 to 10% of all the data in IT. So let's introduce semi-structured data.

Semi-structured data

Semi-structured data is information that doesn't reside in a relational database but does have some organizational properties that make it easier to analyze. With some processing you can store it in a relational database (this can be very hard for some kinds of semi-structured data), but the semi-structure exists to save space and improve clarity or computation…

Examples of semi-structured data: CSV files, XML and JSON documents; NoSQL databases are also considered semi-structured.
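
For example, here is a simplified, hypothetical JSON document loosely modeled on a stored tweet; the fields are named and some are nested, but there is no fixed relational schema:

{
  "id": 1234567890,
  "text": "Testing Flume on the Hortonworks Sandbox",
  "user": {
    "screen_name": "some_user",
    "followers_count": 42
  },
  "hashtags": ["bigdata", "flume"]
}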

Like structured data, semi-structured data represents only a small share of all data (5 to 10%), so the last data type is the dominant one: unstructured data.

Unstructured data

Unstructured data represents around 80% of all data. It often includes text and multimedia content. Examples include e-mail messages, word processing documents, videos, photos, audio files, presentations, webpages and many other kinds of business documents. Note that while these sorts of files may have an internal structure, they are still considered "unstructured" because the data they contain doesn't fit neatly in a database.

Unstructured data is everywhere. In fact, most individuals and organizations conduct their lives around unstructured data. Just as with structured data, unstructured data is either machine generated or human generated.

Here are some examples of machine-generated unstructured data:

  • Satellite images: This includes weather data or the data that the government captures in its satellite surveillance imagery. Just think about Google Earth, and you get the picture.

  • Scientific data: This includes seismic imagery, atmospheric data, and high energy physics.

  • Photographs and video: This includes security, surveillance, and traffic video.

  • Radar or sonar data: This includes vehicular, meteorological, and oceanographic seismic profiles.

The following list shows a few examples of human-generated unstructured data:

  • Text internal to your company: Think of all the text within documents, logs, survey results, and e-mails. Enterprise information actually represents a large percent of the text information in the world today.

  • Social media data: This data is generated from the social media platforms such as YouTube, Facebook, Twitter, LinkedIn, and Flickr.

  • Mobile data: This includes data such as text messages and location information.

  • Website content: This comes from any site delivering unstructured content, like YouTube, Flickr, or Instagram.

And the list goes on.

Unstructured data is growing more quickly than the other types, and exploiting it can help drive business decisions.

A group called the Organization for the Advancement of Structured Information Standards (OASIS) has published the Unstructured Information Management Architecture (UIMA) standard. The UIMA "defines platform-independent data representations and interfaces for software components or services called analytics, which analyze unstructured information and assign semantics to regions of that unstructured information."

Many industry watchers say that Hadoop has become the de facto industry standard for managing Big Data.