Examples to help clarify what’s unstructured data and what’s structured?


I have been reading about and researching BigData and BigData on the cloud. One concept that keeps coming up is that “Big Data is about analyzing unstructured data…”, so in this blog post I just want to show a few examples that will help you differentiate between structured data and unstructured data.

Before we begin, here’s the definition of Unstructured data:

Unstructured Data (or unstructured information) refers to information that either does not have a pre-defined data model and/or does not fit well into relational tables – Wikipedia

I also want to point out that data isn’t unstructured because you cannot fit it into a schema or model; it’s unstructured because fitting it into a model would not help. Consider an email body as an example. You can create a column called “EMAIL BODY”. Now think of the questions that are likely to be asked about it. Do they get answered? If not, then fitting it into a model and calling it structured does not make sense, does it? With that, here are the examples:

1. Word docs, PDFs & text files

Unstructured data

Examples: Books, Articles

2. Audio files

Unstructured data

Example: Call center conversations.

3. Email body

Unstructured data

Example: you don’t need an example here!

4. Videos

Unstructured data

Example: Video footage of criminal interrogation

5. A Data Mart / Data Warehouse

Structured Data

6. XML

Semi Structured Data
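To make the three categories a bit more concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the `orders` table, the XML snippet and the email text are hypothetical and don’t come from any real system, and the keyword check at the end is only a stand-in for real text analysis.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Structured: rows in a relational table with a fixed schema (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 20.0)],
)
# A question about this data maps directly onto a query.
total_for_alice = conn.execute(
    "SELECT SUM(amount) FROM orders WHERE customer = ?", ("alice",)
).fetchone()[0]

# Semi-structured: XML tags label the fields, but no table schema is enforced.
xml_doc = (
    "<email><from>alice@example.com</from>"
    "<body>My order arrived late.</body></email>"
)
sender = ET.fromstring(xml_doc).findtext("from")

# Unstructured: free text. Storing it in a column does not answer
# "is this customer unhappy?"; that needs some analysis of the text itself.
# The keyword check below is a crude placeholder for that analysis.
email_body = "My order arrived late and the box was damaged."
mentions_problem = any(word in email_body.lower() for word in ("late", "damaged"))

print(total_for_alice, sender, mentions_problem)
```

The point of the sketch is the difference in effort: the structured and semi-structured questions are answered by asking for a field, while the unstructured one needs a model of the text, however simple.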

A couple of applications for your brain cells:

1. Map disease patterns by analyzing medical records (Text)

2. Tune customer support by analyzing calls (Audio)

A few quotes about unstructured data that I liked:

80 percent of business-relevant information originates in unstructured form – Justin Langseth. URL (Wikipedia Article says that even Merrill Lynch cited this)

BUT someone else had a nice perspective about this 80%:

but managing it (this 80%) really isn’t a significant problem……………the innovation isn’t in structuring text, it’s in applying models to discover and exploit their inherent structure. Source

My Experience with Unstructured Data (in context of BigData) and Cloud:

I have been playing with MapReduce on Windows Azure (Project Daytona), Elastic MapReduce (Amazon Web Services) and Google’s BigQuery platform. To give you one example, I’ll use Microsoft’s Project Daytona. I uploaded data in an unstructured format, as plain text, and the goal was to run “Word Count”. It helps you answer questions like: which word has the highest frequency? Or which is the least popular word? You could also tweak the algorithm to consider only words with length greater than four (among other constraints). This is what happens when you run the algorithm: the MapReduce framework (an app deployed on Windows Azure in this case) analyzes the unstructured data (text in this case) and helps you answer the question you were asking. I hope that gives you an idea of how it works.
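For readers who have not seen Word Count before, here is a rough single-process sketch of the idea in Python. It is not Project Daytona’s API or code; the function names (`map_phase`, `reduce_phase`, `word_count`), the `min_length` parameter and the sample lines are all assumptions made for illustration. A real MapReduce framework would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict
import re


def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of text."""
    return [(word, 1) for word in re.findall(r"[a-z']+", line.lower())]


def reduce_phase(pairs, min_length=0):
    """Sum the counts per word, keeping only words longer than min_length."""
    counts = defaultdict(int)
    for word, n in pairs:
        if len(word) > min_length:
            counts[word] += n
    return dict(counts)


def word_count(lines, min_length=0):
    """Run the map step over every line, then reduce all emitted pairs."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return reduce_phase(pairs, min_length)


# Hypothetical unstructured input: just lines of plain text.
sample_lines = [
    "big data is about analyzing unstructured data",
    "unstructured data is everywhere",
]

counts = word_count(sample_lines)
most_frequent = max(counts, key=counts.get)

# The tweak mentioned above: only consider words longer than four letters.
long_words = word_count(sample_lines, min_length=4)

print(most_frequent, counts[most_frequent])
print(long_words)
```

Once the counts exist, "which word has the highest frequency?" becomes a simple lookup, which is the sense in which the framework turns unstructured text into something you can query.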

That’s about it for this post. Do you have an example or application of unstructured data? Please do post it in the comments!

17 thoughts on “Examples to help clarify what’s unstructured data and what’s structured?”

  1. Great post Paras! You addressed a very important distinction here. Would be a good reference for anyone starting to play around with BigData aka unstructured data. Thanks for sharing!

  2. Comment by Nakul on http://beyondrelational.com/modules/24/syndicated/404/posts/14867/examples-to-help-clarify-whats-unstructured-data-and-whats-structured.aspx :

    ” Paras,

    Good one. I agree with the nice perspective around the 80% statistics – the challenge always has been mining the data to get meaningful information out of it. Data is everywhere – information is not, and that’s what IT is all about! 🙂

    This post will be made compulsory by me for anyone who is getting started in IT – the difference between structured & unstructured data is one that has to be understood.

    Thank-you for taking the time out and writing about it! ”

    << Thanks!

  3. Tavis Reddick

    I think your article makes a fair summary, but I still disagree with some of the distinctions. A relational database model of press releases, for example, if it uses plain text for the story, may be less structured than an XML, HTML or word processing document that structures text into paragraphs, lists and other block and inline elements in order.

    To say that XML is semi-structured is to ignore that some XML documents will have schemas, and some XML schemas are more specific and granular than relational database schemas (for example, attributes can be used to qualify data elements, and complex types may be more powerfully described there from document down to the character level, and different vocabularies can be combined using namespaces).

    Recursion is sometimes better handled in document-oriented rather than relational databases.

    And with HTML5, there is more semantic structure in web pages than before.

    To characterize data as unstructured, as the Wikipedia article does, as “not helpful for the desired processing task” is to confuse the issue, as the property of structuredness then depends on the task, not on the data format, which (in both relational and document-oriented worlds) should be general purpose, not for a specific, foreseen usage.

    And considering that you can store XML within relational databases, for instance, shows that you can have hybrid solutions where you can potentially provide more structure than either component on its own.

  4. Betinho

    Fantastic blog! Do you have any tips and hints for aspiring writers? I’m planning to start my own site soon but I’m a little lost on everything. Would you suggest starting with a free platform like WordPress or going for a paid option? There are so many options out there that I’m totally overwhelmed. Any suggestions? Kudos!
