I have been reading and researching about BigData and BigData on the cloud. One of the concepts that keeps coming up is that “Big Data is about analyzing unstructured data…”, and in this blog post I just want to show a few examples that will help you differentiate between structured data and unstructured data.
Before we begin, here’s the definition of Unstructured data:
Unstructured Data (or unstructured information) refers to information that either does not have a pre-defined data model and/or does not fit well into relational tables – Wikipedia
I also want to point out that data is not unstructured just because you cannot fit it into a schema/model. It is unstructured when fitting it into a model would not help. Consider an email body as an example. You can create a column “EMAIL BODY”. Now think of the questions that are likely to be asked of that data. Do they get answered? If not, then fitting it into a model and calling it structured does not make sense, does it? With that, here are the examples:
1. Word docs, PDFs & text files
Unstructured data
Examples: Books, Articles
2. Audio files
Unstructured data
Example: Call center conversations.
3. Email body
Unstructured data
Example: you don’t need an example here!
4. Videos
Unstructured data
Example: Video footage of criminal interrogation
5. A Data Mart / Data Warehouse
Structured data
6. XML
Semi-structured data
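To make the email body point from before the list concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name `emails`, its columns and the sample rows are all made up for illustration. It shows that the structured columns answer questions directly, while a question about the content of the body only gets a crude keyword match:

```python
import sqlite3

# Hypothetical table: two structured columns plus a free-text "email body" column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emails (sender TEXT, sent_on TEXT, email_body TEXT)")
conn.executemany(
    "INSERT INTO emails VALUES (?, ?, ?)",
    [
        ("alice@example.com", "2012-03-01",
         "The delivery was late again and I want a refund."),
        ("bob@example.com", "2012-03-02",
         "Thanks, the new release works great!"),
    ],
)

# A structured question ("how many emails per sender?") maps directly to SQL.
count_by_sender = dict(
    conn.execute("SELECT sender, COUNT(*) FROM emails GROUP BY sender")
)

# A content question ("which customers are unhappy?") has no column to query.
# Plain SQL can only do a keyword match on the text, which is a rough proxy.
unhappy = [
    row[0]
    for row in conn.execute(
        "SELECT sender FROM emails WHERE email_body LIKE '%refund%'"
    )
]

print(count_by_sender)
print(unhappy)
```

The body sits in a column, but the model itself tells you nothing about what the text means, which is why I would still call it unstructured.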
A couple of applications for your brain cells:
1. Map disease patterns by analyzing medical records (Text)
2. Tune customer support by analyzing calls (Audio)
A few quotes about unstructured data that I liked:
80 percent of business-relevant information originates in unstructured form – Justin Langseth. URL (Wikipedia Article says that even Merrill Lynch cited this)
BUT someone else had a nice perspective about this 80%:
but managing it (this 80%) really isn’t a significant problem……………the innovation isn’t in structuring text, it’s in applying models to discover and exploit their inherent structure. Source
My Experience with Unstructured Data (in the context of BigData) and Cloud:
I have been playing with MapReduce on Windows Azure (Project Daytona), Elastic MapReduce (Amazon Web Services) and Google’s BigQuery platform. To give you one example, I’ll use Microsoft’s Project Daytona. I uploaded data in an unstructured format, plain text, and the goal was to run “Word Count”. It helps you answer questions like: which word has the highest frequency? Or which is the least popular word? You could also tweak the algorithm to consider only words with length greater than four (among other constraints). This is what happens when you run the algorithm: the MapReduce framework (an app deployed on Windows Azure in this case) does some analysis on the unstructured data (text in this case) and helps you answer the question you were asking. I hope that gives you an idea of how it works.
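To give a feel for the Word Count idea, here is a minimal single-machine sketch in plain Python. It is not Project Daytona's code or API. The function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`), the `longer_than` parameter and the sample lines are all my own assumptions, and a real MapReduce framework would run the map and reduce steps in parallel across many machines:

```python
import re
from collections import defaultdict


def map_words(line, longer_than=4):
    """Map step: emit (word, 1) for every word longer than `longer_than`."""
    for word in re.findall(r"[a-z']+", line.lower()):
        if len(word) > longer_than:
            yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_counts(word, counts):
    """Reduce step: add up the ones emitted for a single word."""
    return word, sum(counts)


def word_count(lines, longer_than=4):
    """Run map, shuffle and reduce in sequence over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_words(line, longer_than))
    return dict(reduce_counts(w, c) for w, c in shuffle(pairs).items())


# Hypothetical sample input standing in for the uploaded text file.
lines = [
    "Big data is about analyzing unstructured data",
    "Unstructured data does not fit well into relational tables",
]
counts = word_count(lines)
most_frequent = max(counts, key=counts.get)
print(most_frequent, counts[most_frequent])
```

With the "length greater than four" tweak, a short word like "data" is dropped before counting, so the most frequent word in this sample is "unstructured".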
That’s about it for this post. Do you have an example or application of unstructured data? Please do post it in the comments!
I like it! You did a great job at explaining structured data vs unstructured data. Thanks
Thanks DataNinja
Great post Paras! You addressed a very important distinction here. Would be a good reference for anyone starting to play around with BigData aka unstructured data. Thanks for sharing!
Glad you liked it, Thanks for the comment!
Comment by Nakul on http://beyondrelational.com/modules/24/syndicated/404/posts/14867/examples-to-help-clarify-whats-unstructured-data-and-whats-structured.aspx :
” Paras,
Good one. I agree with the nice perspective around the 80% statistics – the challenge always has been mining the data to get meaningful information out of it. Data is everywhere – information is not, and that’s what IT is all about! 🙂
This post will be made compulsory by me for anyone who is getting started in IT – the difference between structured & unstructured data is one that has to be understood.
Thank you for taking the time out and writing about it! ”
<< Thanks!
I think your article makes a fair summary, but I still disagree with some of the distinctions. A relational database model of press releases, for example, if it uses plain text for the story, may be less structured than an XML, HTML or word processing document that structures text into paragraphs, lists and other block and inline elements in order.
To say that XML is semi-structured is to ignore that some XML documents will have schemas, and some XML schemas are more specific and granular than relational database schemas (for example, attributes can be used to qualify data elements, and complex types may be more powerfully described there from document down to the character level, and different vocabularies can be combined using namespaces).
Recursion is sometimes better handled in document-oriented rather than relational databases.
And with HTML5, there is more semantic structure in web pages than before.
To characterize data as unstructured, as the Wikipedia article does, as “not helpful for the desired processing task” is to confuse the issue, as the property of structuredness then depends on the task, not on the data format, which (in both relational and document-oriented worlds) should be general purpose, not for a specific, foreseen usage.
And considering that you can store XML within relational databases, for instance, shows that you can have hybrid solutions where you can potentially provide more structure than either component on its own.
Tavis, thanks for the comment. You are right. I didn’t cover the nuances of XML. Thanks for bringing that up.
Is there any tool for structuring unstructured multimedia data?
Thanks for the comment. A couple of questions for clarification:
1) What are your data analysis needs?
2) What do you consider structured with respect to your business needs?