Asked on Quora:
When reading about Big Data, most discussions start with Gartner analyst Doug Laney's definition (the 3Vs). IBM often uses 4 dimensions by adding veracity. Some people use 6 or even up to 12 dimensions. I am wondering: what's the most frequently used definition?
Here's my "working" definition of Big Data: if your existing 1) tools and 2) processes don't support your data analysis needs, then you have a Big Data problem.
You can add as many V's as you want, but it all ties back to the notion that you need bigger and better tools and processes to support your data analysis needs as you grow.
#1. Social media data is BIG! It's text (variety), it's much bigger in size (volume), and it's all coming in very fast (velocity). AND the business wants to analyze customer sentiment on social. OK: we have a 3Vs problem and need a solution to support it. Maybe Hadoop is the answer. Maybe not. But you do have a "Big Data" problem.
#2: Your customer database is broken. The addresses aren't right. Google and Alphabet show up as two separate companies when they should be just one. Employee counts are outdated. All of these problems confuse your business users, and they don't TRUST the data anymore. You have a veracity problem, and so you have a BIG Data problem.
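To make the veracity example concrete, here is a minimal sketch in Python of reconciling duplicate company records. The field names and the alias table are illustrative assumptions, not any specific tool's API:

```python
# Minimal sketch of merging duplicate company records (illustrative only).
# Known parent-company aliases: Alphabet has been Google's parent since 2015.
ALIASES = {"google": "alphabet"}

def canonical_name(name: str) -> str:
    """Normalize a raw company name to one canonical key."""
    key = name.strip().lower().rstrip(".,")
    return ALIASES.get(key, key)

def merge_records(records):
    """Group records that refer to the same company under one key."""
    merged = {}
    for rec in records:
        key = canonical_name(rec["company"])
        merged.setdefault(key, []).append(rec)
    return merged

records = [
    {"company": "Google", "employees": 98000},
    {"company": "Alphabet", "employees": 156500},
]
merged = merge_records(records)
# Both rows now fall under the single key "alphabet".
```

Real entity-resolution is much harder than a lookup table, of course, which is exactly why people and process matter more than any one tool here.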
Everyone has a BIG DATA problem. It just depends on what their "V's" are, and in most cases tools alone will not solve the issue. You need PEOPLE and PROCESS to solve it. Here's my ranking of the ingredients that are key to solving BIG Data problems: 1) PEOPLE, 2) PROCESS, 3) PLATFORM (tools).
Someone asked on Quora how to learn and explore the field of Big Data algorithms. They also mentioned having some background in Python already and wanted ideas for a good project. With that context, here is my reply:
There are two broad roles available in Data/Big-Data world:
- Engineering-oriented: Data Engineers, Data Warehousing specialists, Big Data Engineers, Business Intelligence Engineers. All of these roles focus on building data pipelines, using code and tools to get the data into some centralized location.
- Business-oriented: Data Analysts, Data Scientists. All of these roles involve using data (from those centralized sources) to help business leaders make better decisions.*
*Smaller companies (or startups) tend to have small teams (or just one person) do it all, so the distinction is not as apparent.
Now, given your background in Python and programming, you might be a great fit for "Data Engineer" roles, and I would recommend learning Apache Spark (since you can use Python code with it) and starting to build data pipelines. As you gain a bit more experience, you can learn how to build and deploy end-to-end machine learning projects with Python and Apache Spark. If you acquire these skills and keep learning, then I am sure you will end up with a good project.
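To give a feel for what "building a data pipeline" means, here is a tiny extract-transform-load sketch in plain Python over made-up web log lines. The log format and field positions are assumptions for illustration; in practice you would express the same steps with Spark's DataFrame API so they run distributed:

```python
# Toy extract-transform-load pipeline over log lines (pure Python;
# the same extract/transform/aggregate steps map onto Spark's API).
from collections import Counter

def extract(lines):
    """Extract: split raw space-delimited log lines into fields."""
    return (line.split() for line in lines if line.strip())

def transform(rows):
    """Transform: keep (url, status) pairs; field positions are assumed."""
    return ((r[1], int(r[2])) for r in rows)

def load(pairs):
    """Load/aggregate: count server-error responses per URL."""
    errors = Counter(url for url, status in pairs if status >= 500)
    return dict(errors)

raw = [
    "GET /home 200",
    "GET /api 500",
    "GET /api 503",
]
result = load(transform(extract(raw)))
# result == {"/api": 2}
```

The value of a framework like Spark is that each of these stages can run in parallel across many machines while the code stays roughly this simple.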
Hope that helped and good luck!
SQL Server 2014 was announced in the TechEd keynote!
So while the focus of SQL Server 2012 was in-memory OLAP, the focus of this new release seems to be in-memory OLTP (along with cloud and Big Data).
Here's the blog post: http://blogs.technet.com/b/dataplatforminsider/archive/2013/06/03/sql-server-2014-unlocking-real-time-insights.aspx (also, thanks for the picture!)
I was selected to be a speaker at SQL Saturday Trinidad! And it was amazing, because not only did I get a chance to interact with the wonderful people who are part of the SQL Server community there, but I also visited some beautiful places on this Caribbean island!
I visited Trinidad in January, just before their carnival season! And even though people were busy preparing for carnival, it was great to see them attend an entire day of SQL Server training:
And here’s me presenting on “Why Big Data Matters”:
(Thanks Niko for the photo!)
And after the event, I also got a chance to experience the beauty of this Caribbean island!
Thank you SQL Saturday 185 Team for a memorable time!
Presentation slides: The slides were posted for the attendees prior to my presentation, and if you want, you can view them here:
HDFS and MapReduce inner workings in a nutshell.
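As a rough illustration of the MapReduce flow in the diagram, here is a word count in plain Python with the map, shuffle, and reduce phases separated out. This is a single-process simplification of what Hadoop distributes across a cluster, not Hadoop's actual API:

```python
# Word count broken into MapReduce's three phases (single-process
# simplification of what Hadoop runs across many nodes over HDFS).
from collections import defaultdict

def map_phase(docs):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in docs:
        for word in doc.lower().split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as Hadoop does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data is big", "data pipelines move data"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
# counts["data"] == 3 and counts["big"] == 2
```

The key idea is that the map and reduce functions are independent per key, which is what lets Hadoop run them in parallel on blocks of data stored in HDFS.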
In this post, I want to point out that HDInsight (Hadoop on Windows) comes with sample datasets (log files) that you can load using the following commands:
1. Open the Hadoop command line and navigate to c:\Hadoop\GettingStarted
2. Execute the following command:
powershell -ExecutionPolicy unrestricted -F importdata.ps1 w3c
After you have successfully executed the command, you can see the sample files in the /w3c/input folder:
Conclusion: In this post, we saw how to load some sample data into the Hadoop on Windows file system to get started. Your comments are very welcome.
Official Resource: http://gettingstarted.hadooponazure.com/loadingData.html