Classical but mandatory: I'll talk about social media and big data in further articles. This first one describes how to retrieve live tweets and store them in HDFS.
Pre-requisite
We'll use Apache Flume to retrieve the tweet stream.
A Flume event is defined as a unit of data flow having a byte payload and an optional set of string attributes. A Flume agent is a (JVM) process that hosts the components through which events flow from an external source to the next destination (hop).
A Flume data flow is made up of five main components: Events, Sources, Channels, Sinks, and Agents.
Events
An event is the basic unit of data that is moved using Flume. It is similar to a message in JMS and is generally small. It is made up of headers and a byte-array body.
Sources
The source receives the event from some external entity and stores it in a channel. The source must understand the type of event that is sent to it: an Avro event requires an Avro source.
Channels
A channel is an internal passive store with certain specific characteristics. An in-memory channel, for example, can move events very quickly, but does not provide persistence. A file based channel provides persistence. A source stores an event in the channel where it stays until it is consumed by a sink. This temporary storage allows source and sink to run asynchronously.
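The speed-versus-persistence trade-off shows up directly in configuration. As a sketch (the agent and channel names here are illustrative, not part of this tutorial's setup):

```properties
# Fast but volatile: events are lost if the agent process dies
agent.channels.MemChannel.type = memory
agent.channels.MemChannel.capacity = 10000

# Durable: events survive an agent restart, at the cost of disk I/O
agent.channels.FileChannel.type = file
agent.channels.FileChannel.checkpointDir = /var/flume/checkpoint
agent.channels.FileChannel.dataDirs = /var/flume/data
```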
Sinks
The sink removes the event from the channel and forwards it on either to a destination, like HDFS, or to another agent/dataflow. The sink must output an event that is appropriate to the destination.
Agents
An agent is the container for a Flume data flow. It is any physical JVM running Flume. The same agent can run multiple sources, sinks, and channels. A particular data flow path is set up through the configuration process.
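To make these components concrete, here is a minimal agent configuration in the style of the Flume quick-start example, wiring a netcat source to a logger sink through a memory channel. The agent name `a1`, component names, and port are illustrative only:

```properties
# Declare the components hosted by agent a1
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: receives lines of text on a TCP port, one event per line
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# Channel: in-memory buffer between source and sink
a1.channels.c1.type = memory

# Sink: writes each event to the agent's log at INFO level
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
```

Note that a source is attached to one or more channels (plural `channels`), while a sink drains exactly one (singular `channel`).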
Installation
On the Hortonworks Sandbox, connect to the console with PuTTY and type:
yum install flume
Twitter application
Before any configuration, you have to register a Twitter application.
Log into dev.twitter.com and select “Create a new application”. After creating a new application, Twitter will provide the following information:
• consumerKey
• consumerSecret
• accessToken
• accessTokenSecret
You will need these values in the next step.
Flume Configuration
Download the flume-sources-1.0-SNAPSHOT.jar file and place it in /usr/lib/flume/lib
Note: to keep things simple, we use the Cloudera Twitter source package.
Create two new folders in HDFS:
hadoop fs -mkdir /user/flume
hadoop fs -mkdir /user/flume/tweets
Edit /etc/flume/conf/flume.conf and add the following:
TwitterAgent.sources.Twitter.consumerKey = {consumerKey value}
TwitterAgent.sources.Twitter.consumerSecret = {consumerSecret value}
TwitterAgent.sources.Twitter.accessToken = {accessToken value}
TwitterAgent.sources.Twitter.accessTokenSecret = {accessTokenSecret value}
TwitterAgent.sources.Twitter.keywords = {your keyword}
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://sandbox.hortonworks.com:8020/user/flume/tweets/%Y/%m/%d/%H/
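Note that the lines above only set properties on a source named Twitter and a sink named HDFS; for the agent to start, flume.conf also needs the component declarations and the channel definition. In Cloudera's Twitter example they look roughly like this (verify the source class name against the JAR you downloaded):

```properties
# Declare the components hosted by TwitterAgent
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

# Custom source from the Cloudera flume-sources JAR
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel

# In-memory channel between the Twitter source and the HDFS sink
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
```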
Note: as you can see, the path uses the sandbox VM's fully qualified name. It is essential to use that name rather than localhost. The default sandbox port is 8020.
Now we just have to start the Flume agent:
flume-ng agent --conf-file /etc/flume/conf/flume.conf --name TwitterAgent > flume_twitteragent.log
If you want to be able to close your terminal session, prefix the command with nohup.
Now you just have to wait, and the folder will fill with tweets. To check, type:
hadoop fs -ls /user/flume/tweets/{year}/{month}/{day}/{hour}