[video] Data Science is not NEW – it’s just that we live in a VERY special time!

  • Data Analysis is NOT new
  • Data Mining is NOT new
  • Predictive Analytics is NOT new
  • Machine Learning is NOT new
  • Statistics is NOT new
  • And Data Science is NOT new

So what’s new?

  • The rate at which data is produced.
  • The variety in Data that’s being produced.
  • The “amount” of data that’s being produced.

And we did not have the tools and techniques before, but now we do! Indeed, we live in a VERY special time!

Here’s a nice 5-minute video titled “Data Science: Beyond Intuition”.

Link to video: http://vimeo.com/48456421 (thanks to Ryan Swanstrom for sharing!)

What Chart should I use for effective graphical representation of data?


Data visualization is an art, no doubt about it. I admire professional artists who can create “beautiful” data visualizations. Data visualization involves more than one technique, one of which is representing data using charts. And if you have experience in this domain, you know that there are many charts out there.

Resource: You can browse the Google Charts Gallery to see the various options you have.

Now, how do you choose between these options when confronted with the challenge of creating effective charts? Wouldn’t it be great if we knew of a resource that could help us get started?

Well, I found a resource that I think can help you decide which charts to use:

[Image: chart chooser for data visualization]

Source: http://www.extremepresentation.com/design/charts/

Online Tool: http://labs.juiceanalytics.com/chartchooser.html

 

Data Mining: Classification VS Clustering (cluster analysis)


For someone who is new to data mining, classification and clustering can seem similar because both essentially “divide” a dataset into sub-datasets. But there is a difference between them, and in this blog post we’ll see exactly what it is:

CLASSIFICATION
  • We have a training set containing data that has been previously categorized
  • Based on this training set, the algorithm finds the category that new data points belong to
  • Since a training set exists, we describe this technique as supervised learning
  • Example: We use a training dataset in which customers that have churned are categorized. Based on this training set, we can classify whether a customer will churn or not.

CLUSTERING
  • We do not know the characteristics of similarity of the data in advance
  • Using statistical concepts, we split the dataset into sub-datasets such that each sub-dataset contains “similar” data
  • Since a training set is not used, we describe this technique as unsupervised learning
  • Example: We use a dataset of customers and split them into sub-datasets of customers with “similar” characteristics. This information can then be used to market a product to a specific segment of customers identified by the clustering algorithm.
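
To make the contrast above concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the customer features (monthly usage hours, support calls), the labels, the nearest-centroid classifier and the small k-means routine are stand-ins, not the algorithms any particular tool uses. The point is only that classification starts from labelled data, while clustering gets the same points with the labels dropped.

```python
from math import dist
from statistics import mean

def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    return tuple(mean(axis) for axis in zip(*points))

# --- Classification (supervised): labels are known in advance ---
# Hypothetical training set: (monthly_usage_hours, support_calls) -> label
training = [
    ((40.0, 1.0), "stays"),
    ((35.0, 0.0), "stays"),
    ((5.0, 6.0), "churns"),
    ((8.0, 7.0), "churns"),
]

def train(labelled):
    """Compute one centroid per known label."""
    by_label = {}
    for point, label in labelled:
        by_label.setdefault(label, []).append(point)
    return {label: centroid(pts) for label, pts in by_label.items()}

def classify(model, point):
    """Assign the label whose centroid is nearest to the point."""
    return min(model, key=lambda label: dist(model[label], point))

# --- Clustering (unsupervised): no labels, only the points ---
def kmeans(points, k, iterations=10):
    """Plain k-means, seeded with the first k points for repeatability."""
    centers = list(points[:k])
    groups = []
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(centers[i], p))
            groups[nearest].append(p)
        # Keep the old center if a group ends up empty.
        centers = [centroid(g) if g else centers[i] for i, g in enumerate(groups)]
    return groups

model = train(training)
new_customer_label = classify(model, (6.0, 5.0))

customers = [point for point, _ in training]  # same points, labels dropped
segments = kmeans(customers, k=2)
```

Note that `classify` returns a named category because the training set supplied the names, whereas `kmeans` only returns groups of similar points; deciding what each segment means is left to whoever reads the result.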

If you want to learn about data mining, check out the free book in PDF format: “Mining of Massive Datasets”.

Examples to help clarify what’s unstructured data and what’s structured?


I have been reading and researching about BigData and BigData on the cloud. One of the concepts that’s repeated is that “Big Data is about analyzing unstructured data…”, and in this blog post I just want to show a few examples that will help you differentiate between structured data and unstructured data.

Before we begin, here’s the definition of Unstructured data:

Unstructured Data (or unstructured information) refers to information that either does not have a pre-defined data model and/or does not fit well into relational tables – Wikipedia

Also, I want to point out that data is not unstructured merely because you cannot fit it into a schema/model; it is unstructured when fitting it into a model would not help. Consider an email body as an example of unstructured data. You can create a column “EMAIL BODY”. Now think of the questions that are likely to be asked. Do they get answered? If not, then fitting it into a model and calling it structured does not make sense, does it? With that, here are the examples:

1. Word docs, PDFs & text files

Unstructured data

Examples: Books, Articles

2. Audio files

Unstructured data

Example: Call center conversations.

3. Email body

Unstructured data

Example: you don’t need an example here!

4. Videos

Unstructured data

Example: Video footage of a criminal interrogation

5. A Data Mart / Data Warehouse

Structured Data

6. XML

Semi-Structured Data
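
As a rough illustration of why XML counts as semi-structured, the sketch below parses a small record with Python's standard `xml.etree.ElementTree` module. The record, its tag names and the keyword check are all made up for this example. The tagged fields can be picked out by name, much like columns, but the body is still free text that needs some further analysis before it answers a question.

```python
import xml.etree.ElementTree as ET

# Hypothetical record, invented for illustration.
doc = """
<email>
  <from>alice@example.com</from>
  <sent>2012-09-01</sent>
  <body>Hi, my order arrived late and the box was damaged.</body>
</email>
""".strip()

root = ET.fromstring(doc)

# Tagged fields can be addressed by name, much like columns in a table.
sender = root.findtext("from")
sent_on = root.findtext("sent")

# The body is free text: the tag tells us where it is, not what it means.
body = root.findtext("body")
mentions_damage = "damaged" in body.lower()  # crude keyword check, an assumption
```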

A couple of applications for your brain cells:

1. Map disease patterns by analyzing medical records (Text)

2. Tuning customer support by analyzing calls (Audio)

A few quotes about unstructured data that I liked:

80 percent of business-relevant information originates in unstructured form – Justin Langseth (the Wikipedia article says that even Merrill Lynch cited this)

But someone else had a nice perspective on this 80%:

but managing it (this 80%) really isn’t a significant problem……………the innovation isn’t in structuring text, it’s in applying models to discover and exploit their inherent structure. Source

My Experience with Unstructured Data (in the context of BigData) and the Cloud:

I have been playing with MapReduce on Windows Azure (Project Daytona), Elastic MapReduce (Amazon Web Services) and Google’s BigQuery platform. To give you one example, I’ll use Microsoft’s Project Daytona. Here I uploaded data in unstructured format, in the form of text, and the goal was to run “Word Count”. It helps you answer questions like: which word has the highest frequency? Or which is the least popular word? You could also tweak the algorithm to consider only words with length greater than four (among other constraints). This is what happens when you run the algorithm: the MapReduce framework (an app deployed on Windows Azure in this case) does some analysis on unstructured data (text in this case) and helps you answer the question you were asking. I hope this gives you an idea of how it works.
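
For readers who have not seen Word Count before, here is a minimal single-machine sketch of the idea in Python. It is not Project Daytona's API or code; the function names (`map_phase`, `shuffle`, `reduce_phase`), the sample text and the `min_length` parameter are assumptions made for illustration. A real framework would run the map and reduce steps in parallel across many machines, but the logic per step is the same.

```python
import re
from collections import defaultdict

def map_phase(line, min_length=5):
    """Emit (word, 1) for each word longer than four letters."""
    for word in re.findall(r"[a-z]+", line.lower()):
        if len(word) >= min_length:
            yield word, 1

def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(word, counts):
    """Sum the ones emitted for a single word."""
    return word, sum(counts)

def word_count(lines, min_length=5):
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line, min_length))
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())

# Hypothetical input text.
text = [
    "Unstructured data needs analysis",
    "analysis of text data",
]
counts = word_count(text)
most_frequent = max(counts, key=counts.get)
```

With this sample input, short words such as “data” and “text” are filtered out by the length constraint, and `most_frequent` answers the “which word has the highest frequency?” question from the remaining counts.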

That’s about it for this post. Do you have an example or application of unstructured data? Please do post it in the comments!