Three concepts come with big data: structured, semi-structured, and unstructured data.
For geeks and developers (not the same thing ^^), structured data is very familiar. It covers any data that can be stored in a SQL database, in tables with rows and columns. Such data has relational keys and can be easily mapped into pre-designed fields. Today, it is the most commonly processed kind of data in development and the simplest way to manage information.
But structured data represents only 5 to 10% of all computer data. So let's introduce semi-structured data.
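To make the "rows, columns, and relational keys" idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `customer` and `purchase` tables, their columns, and the sample values are hypothetical and only serve as an illustration.

```python
import sqlite3

# In-memory database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE purchase ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customer(id),"
    " amount REAL NOT NULL)"
)

# Every value lands in a pre-designed field of a fixed type.
conn.execute("INSERT INTO customer (id, name) VALUES (1, 'Alice')")
conn.executemany(
    "INSERT INTO purchase (customer_id, amount) VALUES (?, ?)",
    [(1, 19.5), (1, 5.0)],
)

# The relational key (purchase.customer_id) lets us join the two tables.
rows = conn.execute(
    "SELECT c.name, SUM(p.amount) FROM customer c"
    " JOIN purchase p ON p.customer_id = c.id GROUP BY c.id"
).fetchall()
print(rows)
```

Because the schema is fixed in advance, a query like the join above needs no guessing about where a value lives or what type it has.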
Semi-structured data
Semi-structured data is information that doesn't reside in a relational database but that does have some organizational properties that make it easier to analyze. With some processing you can store it in a relational database (which can be very hard for some kinds of semi-structured data), but the semi-structured form exists to save space, improve clarity, or ease computation.
Examples of semi-structured data: CSV files, and also XML and JSON documents. NoSQL databases are considered semi-structured as well.
But like structured data, semi-structured data represents only a small part of all data (5 to 10%), so the last data type is the dominant one: unstructured data.
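As a rough illustration of what "some organizational properties" means, the sketch below parses a small JSON document with Python's json module and flattens it into table-like rows. The records and field names are made up for the example, and the flattening rule (missing fields become None, nested values are kept as JSON text) is just one possible assumption, not a standard method.

```python
import json

# Hypothetical records: each one is self-describing, but the fields vary.
raw = '''
[
  {"name": "Alice", "email": "alice@example.com"},
  {"name": "Bob", "phones": ["555-0100", "555-0101"]},
  {"name": "Carol", "address": {"city": "Paris"}}
]
'''
records = json.loads(raw)

# Collect every top-level key seen, to build a table-like layout.
columns = sorted({key for record in records for key in record})

def to_cell(value):
    """Keep scalars as they are; store nested values as JSON text."""
    if isinstance(value, (dict, list)):
        return json.dumps(value)
    return value

# Missing fields become None, which is what makes the mapping awkward.
table = [[to_cell(record.get(col)) for col in columns] for record in records]

print(columns)
for row in table:
    print(row)
```

The result fits a table, but with many empty cells and nested values squeezed into text, which hints at why forcing this kind of data into a relational database can be hard.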
Unstructured data represents around 80% of all data. It often includes text and multimedia content. Examples include e-mail messages, word processing documents, videos, photos, audio files, presentations, webpages and many other kinds of business documents. Note that while these sorts of files may have an internal structure, they are still considered "unstructured" because the data they contain doesn't fit neatly in a database.
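The remark about internal structure can be seen with an e-mail. In this small sketch, which uses Python's standard email module on a made-up message, the headers parse cleanly into named fields, while the body, where the actual information sits, remains free text.

```python
from email import message_from_string

# A made-up message: structured headers, then a free-text body.
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Lunch?\n"
    "\n"
    "Hi Bob, are you free on Friday around noon?\n"
)
msg = message_from_string(raw)

# The headers map neatly to fields...
sender = msg["From"]
subject = msg["Subject"]

# ...but the body is just a string; its meaning is not in any field.
body = msg.get_payload()

print(sender, "|", subject)
print(body)
```

Finding out that this message proposes a meeting on Friday at noon would require analyzing the text itself, which is exactly what makes the content unstructured.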
Unstructured data is everywhere. In fact, most individuals and organizations conduct their lives around unstructured data. Just as with structured data, unstructured data is either machine generated or human generated.
Here are some examples of machine-generated unstructured data:
Satellite images: This includes weather data or the data that the government captures in its satellite surveillance imagery. Just think about Google Earth, and you get the picture.
Scientific data: This includes seismic imagery, atmospheric data, and high energy physics.
Photographs and video: This includes security, surveillance, and traffic video.
Radar or sonar data: This includes vehicular, meteorological, and oceanographic seismic profiles.
The following list shows a few examples of human-generated unstructured data:
Text internal to your company: Think of all the text within documents, logs, survey results, and e-mails. Enterprise information actually represents a large percentage of the text information in the world today.
Social media data: This data is generated from social media platforms such as YouTube, Facebook, Twitter, LinkedIn, and Flickr.
Mobile data: This includes data such as text messages and location information.
Website content: This comes from any site delivering unstructured content, like YouTube, Flickr, or Instagram.
And the list goes on.
Unstructured data is growing faster than the other types, and exploiting it could help with business decisions.
A group called the Organization for the Advancement of Structured Information Standards (OASIS) has published the Unstructured Information Management Architecture (UIMA) standard. The UIMA "defines platform-independent data representations and interfaces for software components or services called analytics, which analyze unstructured information and assign semantics to regions of that unstructured information."
Many industry watchers say that Hadoop has become the de facto industry standard for managing Big Data.