Retrieve tweets using Twitter streaming with the Hortonworks Sandbox

A classic but mandatory topic: I’ll talk about social media and big data in upcoming articles. This first one describes how to retrieve live tweets and store them in HDFS.


We’ll use Flume to retrieve the tweet stream.

A Flume event is defined as a unit of data flow having a byte payload and an optional set of string attributes. A Flume agent is a (JVM) process that hosts the components through which events flow from an external source to the next destination (hop).

Agent component diagram

A Flume data flow is made up of five main components: Events, Sources, Channels, Sinks, and Agents.


An event is the basic unit of data that is moved using Flume. It is similar to a message in JMS and is generally small. It is made up of headers and a byte-array body.


The source receives the event from some external entity and stores it in a channel. The source must understand the type of event that is sent to it: an Avro event requires an Avro source.


A channel is an internal passive store with certain specific characteristics. An in-memory channel, for example, can move events very quickly, but does not provide persistence. A file based channel provides persistence. A source stores an event in the channel where it stays until it is consumed by a sink. This temporary storage allows source and sink to run asynchronously.


The sink removes the event from the channel and forwards it on either to a destination, like HDFS, or to another agent/dataflow. The sink must output an event that is appropriate to the destination.


An agent is the container for a Flume data flow. It is any physical JVM running Flume. The same agent can run multiple sources, sinks, and channels. A particular data flow path is set up through the configuration process.
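To make the roles of these components concrete, here is a minimal Python sketch of one source, one in-memory channel and one sink running inside a single "agent". It is not Flume code: the names Event, source, sink and run_agent are hypothetical and only illustrate the flow described above.

```python
import queue
import threading
from dataclasses import dataclass, field


@dataclass
class Event:
    """A unit of data: a byte payload plus optional string headers."""
    body: bytes
    headers: dict = field(default_factory=dict)


def source(lines, channel):
    """Wrap each incoming line in an Event and store it in the channel."""
    for line in lines:
        channel.put(Event(body=line.encode("utf-8"), headers={"origin": "demo"}))
    channel.put(None)  # sentinel: tells the sink there is nothing more to read


def sink(channel, destination):
    """Remove events from the channel and forward them to the destination."""
    while True:
        event = channel.get()
        if event is None:
            break
        destination.append(event.body.decode("utf-8"))


def run_agent(lines):
    """Run one source and one sink around an in-memory channel."""
    channel = queue.Queue(maxsize=100)  # in-memory: fast, but not persistent
    destination = []
    producer = threading.Thread(target=source, args=(lines, channel))
    consumer = threading.Thread(target=sink, args=(channel, destination))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return destination


if __name__ == "__main__":
    print(run_agent(["tweet one", "tweet two"]))
```

The queue plays the part of the channel: the source and the sink never call each other directly, which is what lets them run asynchronously.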


On the Hortonworks Sandbox, connect to the console with PuTTY and type:

yum install flume

Twitter application

Before any configuration, you have to declare a Twitter application.

Log in to the Twitter developer site and select “Create a new application”. After creating the application, Twitter will provide the following information:

• consumerKey
• consumerSecret
• accessToken
• accessTokenSecret

You will need these values in a later step.

Flume Configuration

Download the flume-sources-1.0-SNAPSHOT.jar JAR file here and place it into /usr/lib/flume/lib

Note: to keep things simple, we use Cloudera’s Twitter source package.

Create two new folders in HDFS:

hadoop fs -mkdir /user/flume
hadoop fs -mkdir /user/flume/tweets

Edit /etc/flume/conf/flume.conf and add the following:

TwitterAgent.sources.Twitter.consumerKey = {consumerKey value}
TwitterAgent.sources.Twitter.consumerSecret = {consumerSecret value}
TwitterAgent.sources.Twitter.accessToken = {accessToken value}
TwitterAgent.sources.Twitter.accessTokenSecret = {accessTokenSecret value}
TwitterAgent.sources.Twitter.keywords = {your keyword}
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://

Note: the path must contain the sandbox VM’s fully qualified name. It is essential to use that name and not localhost. The default sandbox port is 8020.

Now we just have to start the Flume agent:

flume-ng agent --conf-file /etc/flume/conf/flume.conf --name TwitterAgent > flume_twitteragent.log

If you want to be able to close your terminal session, prefix the command with nohup and end it with & so that it keeps running in the background.


Now you just have to wait, and your folder will fill up with tweets. To check, type:

hadoop fs -ls /user/flume/tweets/{year}/{month}/{day}/{hour}
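If you check the folder often, a small helper can build the hour-level directory for you. The Python sketch below assumes that the sink writes under /user/flume/tweets with one sub-folder per year, month, day and hour (zero-padded), as the listing command suggests. The names BASE_DIR, tweets_dir and ls_command are hypothetical.

```python
from datetime import datetime

# Assumption: base folder created earlier with "hadoop fs -mkdir".
BASE_DIR = "/user/flume/tweets"


def tweets_dir(when: datetime) -> str:
    """Return the hour-level folder for the given date and time."""
    return f"{BASE_DIR}/{when:%Y/%m/%d/%H}"


def ls_command(when: datetime) -> str:
    """Return the hadoop command that lists that folder."""
    return f"hadoop fs -ls {tweets_dir(when)}"


if __name__ == "__main__":
    # Show the command for the current hour.
    print(ls_command(datetime.now()))
```

Keep in mind that the hour used by the sink may differ from your local clock, so the most recent folder is not necessarily the one for the current hour.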

Structured, semi structured and unstructured data

Three concepts come with big data: structured, semi-structured and unstructured data.

Structured Data

For geeks and developers (not the same thing ^^), structured data is very familiar. It covers all data that can be stored in a SQL database, in tables with rows and columns. Such data has relational keys and can easily be mapped into pre-designed fields. Today, it is the most commonly processed kind of data in development and the simplest way to manage information.

But structured data represents only 5 to 10% of all digital data. So let’s introduce semi-structured data.

Semi structured data

Semi-structured data is information that doesn’t reside in a relational database but that does have some organizational properties that make it easier to analyze. With some processing you can store it in a relational database (this can be very hard for some kinds of semi-structured data), but the semi-structured form exists to save space, improve clarity or ease computation.

Examples of semi-structured data: CSV, XML and JSON documents. NoSQL databases are also considered semi-structured.
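As an illustration, the Python sketch below reads a small CSV text and a small JSON document, then flattens the JSON document into column-like fields, which is the kind of processing needed before storing it in a relational table. The sample data and the names read_csv_rows and flatten are invented for this example.

```python
import csv
import io
import json

# Hypothetical sample data, invented for this example.
CSV_TEXT = "id,user,text\n1,alice,hello big data\n2,bob,flume is running\n"
JSON_TEXT = (
    '{"id": 3, "user": {"name": "carol", "followers": 42}, '
    '"tags": ["hdfs", "flume"]}'
)


def read_csv_rows(text):
    """CSV already looks like a table: the header line gives the columns."""
    return list(csv.DictReader(io.StringIO(text)))


def flatten(doc, prefix=""):
    """Turn nested JSON objects into flat, column-like names."""
    flat = {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        elif isinstance(value, list):
            # A list has no natural column: join it into one text field.
            flat[name] = ",".join(str(item) for item in value)
        else:
            flat[name] = value
    return flat


if __name__ == "__main__":
    print(read_csv_rows(CSV_TEXT))
    print(flatten(json.loads(JSON_TEXT)))
```

The CSV rows map directly to columns, while the JSON document needs a decision for each nested object and list. That extra decision is what makes some semi-structured data hard to store in a relational database.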

But like structured data, semi-structured data represents only a small share of all data (5 to 10%), so the last type is the dominant one: unstructured data.

Unstructured data

Unstructured data represents around 80% of all data. It often includes text and multimedia content. Examples include e-mail messages, word processing documents, videos, photos, audio files, presentations, webpages and many other kinds of business documents. Note that while these sorts of files may have an internal structure, they are still considered “unstructured” because the data they contain doesn’t fit neatly in a database.

Unstructured data is everywhere. In fact, most individuals and organizations conduct their lives around unstructured data. Just as with structured data, unstructured data is either machine generated or human generated.

Here are some examples of machine-generated unstructured data:

  • Satellite images: This includes weather data or the data that the government captures in its satellite surveillance imagery. Just think about Google Earth, and you get the picture.

  • Scientific data: This includes seismic imagery, atmospheric data, and high energy physics.

  • Photographs and video: This includes security, surveillance, and traffic video.

  • Radar or sonar data: This includes vehicular, meteorological, and oceanographic seismic profiles.

The following list shows a few examples of human-generated unstructured data:

  • Text internal to your company: Think of all the text within documents, logs, survey results, and e-mails. Enterprise information actually represents a large percent of the text information in the world today.

  • Social media data: This data is generated from the social media platforms such as YouTube, Facebook, Twitter, LinkedIn, and Flickr.

  • Mobile data: This includes data such as text messages and location information.

  • Website content: This comes from any site delivering unstructured content, like YouTube, Flickr, or Instagram.

And the list goes on.

Unstructured data is growing faster than the other types, and exploiting it could help with business decisions.

A group called the Organization for the Advancement of Structured Information Standards (OASIS) has published the Unstructured Information Management Architecture (UIMA) standard. The UIMA « defines platform-independent data representations and interfaces for software components or services called analytics, which analyze unstructured information and assign semantics to regions of that unstructured information. »

Many industry watchers say that Hadoop has become the de facto industry standard for managing Big Data.

Announcement: MABS August 2014 Update is now live!

The Microsoft BizTalk team has delivered a new version (click here for the original). Here is a copy:

Microsoft Azure BizTalk Services August 2014 Update is now live!

We are pleased to announce the August 2014 update of Microsoft Azure BizTalk Services and the associated SDK.

The release provides ability to configure and manage agreements and bridges separately, new features for EDI, enhanced encryption for AS2, supports advanced XML schema constructs in transforms and is Drummond certified.

While these features are immediately available in any new BizTalk services you create, existing services will be upgraded over in next few days. There would be no impact to availability of existing services as we roll out the upgrade. There are no pricing changes with this announcement.

Key features you can now leverage with this release are:


Ability to Configure and Manage Agreements and Bridge Separately

You can now change settings in a bridge (transforms, transport settings …) without having to redeploy the corresponding agreement. Additionally one bridge can now cater to multiple agreements. Agreements and bridges can be configured and managed separately enabling reuse of configurations and ease of management.

Visit BizTalk Portal now to try it out…

Note: all your existing agreements and bridges continue to function as before. This feature update decouples existing agreements and its associated bridges but does not break functionality in any way


EDI Delimiters at a Transaction Set Level

When sending messages to a B2B partner via an agreement you could use only one pair of delimiters for that partner. With this release Azure BizTalk Services allows configuration of delimiter set per message type on the outbound side. This is applicable for both X12 and EDIFACT protocols. Acknowledgements generated during receive side processing can also choose to use the delimiter set of the incoming message for which they were generated.


Enhanced digest and encryption algorithms for AS2

Supported symmetric key encryptions:

Existing: RC2, 3DES

New: AES-128, AES-192, and AES-256

Supported algorithms for MIC calculation:

Existing: MD5, SHA1

New: SHA2-256, SHA2-384, and SHA2-512
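For readers unfamiliar with the MIC (message integrity check), it is in essence a digest of the message content, transmitted in base64. The Python sketch below computes such a value with each of the algorithms listed above. It is only an illustration: the exact bytes covered by the MIC are defined by the AS2 specification, and the names MIC_ALGORITHMS and compute_mic are hypothetical, not part of BizTalk Services.

```python
import base64
import hashlib

# Algorithm names from the announcement mapped to hashlib names.
MIC_ALGORITHMS = {
    "MD5": "md5",
    "SHA1": "sha1",
    "SHA2-256": "sha256",
    "SHA2-384": "sha384",
    "SHA2-512": "sha512",
}


def compute_mic(content: bytes, algorithm: str) -> str:
    """Return the base64-encoded digest of content for the given algorithm."""
    digest = hashlib.new(MIC_ALGORITHMS[algorithm], content).digest()
    return base64.b64encode(digest).decode("ascii")


if __name__ == "__main__":
    payload = b"sample EDI payload"  # hypothetical content
    for name in MIC_ALGORITHMS:
        print(name, compute_mic(payload, name))
```

The newer SHA-2 algorithms produce longer digests (32, 48 and 64 bytes) than MD5 (16 bytes) and SHA1 (20 bytes).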


AS2 Drummond Certified

We are happy to announce that with this release Azure BizTalk Services is now AS2 Drummond Certified. See the same being announced by the Drummond Group here and here.


Support for Advanced XML Schema Constructs

All advanced XML schema constructs used in complex schemes, including derived types (xs:extension and xs:restriction types), are now fully supported in Transforms. These derived types are typically used in OAGIS, Salesforce and GJXDM schemas. When you create links from elements under a “derived type” node in the Source schema, that link is only executed when the “xsi:type” value in the input XML instance matches the type value for that specific node. For “derived type” node in the Target schema, the correct “xsi:type” is stamped automatically in the output XML instance based on the incoming mapping links.

Following is the list of the XML schema constructs that are newly supported.

  • Derived complex types – i.e., <xs:extension> and <xs:restriction> being used for a complex-type definition (with a complex type as its base type)
  • <xs:choice> constructs
  • <xs:group> constructs
  • <xs:attributeGroup> constructs
  • <xs:any>
  • <xs:anyAttribute>


Azure BizTalk Services Team

BizTalk 2013 R2 is available for MSDN subscribers

What’s new in BizTalk Server 2013 R2:

  •  Windows Server 2012 R2, Windows Server 2012, Windows 8.1, Windows 7 SP1.
  • Microsoft Office Excel 2013 or 2010.
  • .NET Framework 4.5 and .NET Framework 4.5.1
  • Visual Studio 2013
  • SQL Server 2014 or SQL Server 2012 SP1
  • SharePoint 2013 SP1
  • WCF-WebHttp adapter now supports sending and receiving JSON messages.
  • SFTP adapter now supports two-factor authentication
  • HL7 Accelerator now supports the following:
    • Provides capability to include free-text data as part of the message that can be processed by the HL7 pipelines.
    • 64-bit support for hosting the HL7 adapter.

Good luck, and play around with it. Again, you can download it through MSDN Subscription here.



Introduction to No-SQL

Hello everyone,

I’m starting a new series of blog posts about No-SQL. Why did I choose this domain? Because I think the future will involve No-SQL, and because everybody talks about big data. So before talking about big data, I think it is essential to first talk about No-SQL.

What is No-SQL ?

Before giving any definition, I will present a very important concept in the database world: the CAP theorem.

CAP Theorem

The CAP theorem explains that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees :

  • Consistency (all nodes see the same data at the same time)
  • Availability (a guarantee that every request receives a response about whether it was successful or failed)
  • Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)
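The trade-off can be illustrated with a toy Python model of two replicas. While the link between them is cut (a partition), a store that favors consistency refuses writes, and a store that favors availability accepts them and lets the replicas diverge. This is a simplified sketch, not the behavior of any real database, and the names Replica and TwoNodeStore are hypothetical.

```python
class Replica:
    """One node holding its own copy of the data."""

    def __init__(self):
        self.data = {}


class TwoNodeStore:
    """Two replicas that normally copy every write to each other."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a = Replica()
        self.b = Replica()
        self.partitioned = False  # True while the nodes cannot talk

    def write(self, node, key, value):
        """Write through one node; return True if the write is accepted."""
        local, remote = (self.a, self.b) if node == "a" else (self.b, self.a)
        if self.partitioned:
            if self.mode == "CP":
                return False  # stay consistent by giving up availability
            local.data[key] = value  # stay available, replicas now differ
            return True
        local.data[key] = value
        remote.data[key] = value
        return True

    def read(self, node, key):
        """Read the value as seen by one node."""
        replica = self.a if node == "a" else self.b
        return replica.data.get(key)


if __name__ == "__main__":
    for mode in ("CP", "AP"):
        store = TwoNodeStore(mode)
        store.write("a", "x", 1)
        store.partitioned = True
        accepted = store.write("a", "x", 2)
        print(mode, accepted, store.read("a", "x"), store.read("b", "x"))
```

In the CP case the second write is rejected and both nodes still agree. In the AP case the write succeeds, but the two nodes return different values until the partition heals.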

A distributed database system can provide only two of these guarantees at the same time, and every system (SQL and No-SQL alike) can be classified according to the CAP theorem:

cap theorem

As we can see, SQL and No-SQL coexist in the database world and answer different problems. You cannot simply abandon SQL systems in favor of No-SQL.

OK, got it, but what is hidden behind the term No-SQL?

Wikipedia says:

A NoSQL or Not Only SQL database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. Motivations for this approach include simplicity of design, horizontal scaling and finer control over availability. The data structure (e.g. key-value, graph, or document) differs from the RDBMS, and therefore some operations are faster in NoSQL and some in RDBMS. There are differences though and the particular suitability of a given NoSQL DB depends on the problem to be solved (e.g. does the solution use graph algorithms?). The appearance of mature NoSQL databases has reduced the rationale for Java content repository (JCR) implementations.

NoSQL databases are finding significant and growing industry use in big data and real-time web applications.[1] NoSQL systems are also referred to as « Not only SQL » to emphasize that they may in fact allow SQL-like query languages to be used. Many NoSQL stores compromise consistency (in the sense of the CAP theorem) in favor of availability and partition tolerance. Barriers to the greater adoption of NoSQL stores include the use of low-level query languages, the lack of standardized interfaces, and the huge investments already made in SQL by enterprises.[2] Most NoSQL stores lack true ACID transactions, although a few recent systems, such as FairCom c-treeACE, Google Spanner and FoundationDB, have made them central to their designs.

Remember the CAP theorem: No-SQL systems simply offer a different combination of two guarantees than a relational DBMS does.

In fact, No-SQL offers some very interesting properties, such as good response times on very large volumes (we are talking about petabytes), and there are no schema or data type problems to deal with. On the other hand, there is neither a join operator nor aggregation: you need to do this in your code.
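Here is what "doing it in your code" can look like. The Python sketch below uses two plain dictionaries as stand-ins for two key-value collections, then performs the join and the aggregation on the client side. The data and the name total_per_user are invented for this example.

```python
from collections import defaultdict

# Two key-value "collections" (hypothetical sample data).
users = {"u1": {"name": "alice"}, "u2": {"name": "bob"}}
orders = {
    "o1": {"user_id": "u1", "amount": 30},
    "o2": {"user_id": "u1", "amount": 12},
    "o3": {"user_id": "u2", "amount": 7},
}


def total_per_user(users, orders):
    """Client-side equivalent of a JOIN followed by SUM ... GROUP BY."""
    totals = defaultdict(int)
    for order in orders.values():
        # Aggregation: add up the amounts per user id.
        totals[order["user_id"]] += order["amount"]
    # Join: look up each user id in the other collection.
    return {users[uid]["name"]: total for uid, total in totals.items()}


if __name__ == "__main__":
    print(total_per_user(users, orders))
```

With a relational database the same result would come from a single query. Here the application reads both collections and combines them itself.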

Why I won’t compare performance between SQL and No-SQL

On the web, you will find a lot of comparisons between SQL and No-SQL systems. For me, that doesn’t make any sense because they don’t address the same segment. It’s like comparing apples and oranges. The only meaningful comparison is about volume, and SQL systems certainly handle the smaller volumes. But everyone will agree that volume is only one aspect of database system concerns.

Database Classification

Now that we have defined No-SQL, let’s classify these systems. There are four main categories, but more exist (you can find a list here).

Data Model | Performance | Scalability | Flexibility | Complexity | Functionality | Project Name
Key-value Store | high | high | high | moderate | associative array | Voldemort
Column Store | high | high | moderate | low | columnar database | HBase, Cassandra
Document Store | high | variable (high) | high | low | object model | MongoDB, SimpleDB
Graph Database | variable | variable | high | high | graph theory | Neo4j, AllegroGraph


For a more detailed table, click here.

In the next articles, I will try to introduce these database systems.


Extension Method

Hello everybody,

Extension methods are not well known, but they are very powerful.

This article is not a tutorial on extension methods but an overview of them.

When to use them

Extension methods must be used only for technical concerns, never for business requirements.

For example, you have a string and you want to split it into a list of fixed-length substrings (why doesn’t MS include this out of the box?)

using System.Collections.Generic;
public static class StringExtensions
{
    public static List<string> SplitIntoParts(this string input, int partLength)
    {
        var result = new List<string>();
        int partIndex = 0;
        int length = input.Length;
        while (length > 0)
        {
            // The last part may be shorter than partLength.
            var tempPartLength = length >= partLength ? partLength : length;
            var part = input.Substring(partIndex * partLength, tempPartLength);
            result.Add(part);
            partIndex++;
            length -= partLength;
        }
        return result;
    }
}

And here is how we call it:

string longString = "This is a very long string, which we want to split on smaller parts every max. 30 characters long."; // Length: 98
var partLength = 30;
var parts = longString.SplitIntoParts(partLength);

How to implement them

As I said previously, extensions are designed for technical cases. An extension’s code should not exceed ten or twenty lines.

In Visual Studio, I recommend storing extensions in a dedicated project. This project can be referenced from each artifact.

Moreover, rather than keeping your extensions inside a client’s project, keep them in your own: the String.SplitIntoParts method should be the same for client A and client B.

Where can I find extension methods

I recommend a great site

It does not only include .NET extensions. But don’t use any extension without reading its code: make sure you understand what it does, as not all methods may be bug-free.





News from Azure cloud

This morning we released a massive amount of enhancements to Microsoft Azure.  Today’s new capabilities and announcements include:

  • Virtual Machines: Integrated Security Extensions including Built-in Anti-Virus Support and Support for Capturing VM images in the portal
  • Networking: ExpressRoute General Availability, Multiple Site-to-Site VPNs, VNET-to-VNET Secure Connectivity, Reserved IPs, Internal Load Balancing
  • Storage: General Availability of Import/Export service and preview of new SMB file sharing support
  • Remote App: Public preview of Remote App Service – run client apps in the cloud
  • API Management: Preview of the new Azure API Management Service
  • Hybrid Connections: Easily integrate Azure Web Sites and Mobile Services with on-premises data+apps (free tier included)
  • Cache: Preview of new Redis Cache Service
  • Store: Support for Enterprise Agreement customers and channel partners


For more details on each new capability, please visit Scott Gu’s blog.