The goal of this newsletter is to promote continuous learning for data science and engineering professionals. To achieve this goal, I’ll be sharing articles from various sources that I found interesting. The following articles and videos made the cut for today’s newsletter.
(1) Data & AI landscape 2020
A really good review of the data & AI landscape in 2020. Look at all those logos representing the many companies tackling various data and AI challenges; it’s an exciting time to be in data! Read here
(1) The Modern Data Stack
An amazing article by Tristan Handy explaining the modern data stack. If you are familiar with tools such as Looker, Redshift, Snowflake, BigQuery, Fivetran, dbt, etc. but wondered how each of them fits into an overall architecture, then this is a must read! Read here
Why did Dropbox pick Apache Superset as its data exploration tool?
Apache Superset is gaining momentum, and if you want to understand the reasons behind that, you can start by reading this article here
Growth: Adjacent User Theory
I love the framing in this LinkedIn post here, where Nimit Jain says that great growth PM output looks like: “We discovered 2 new user segments who are struggling to proceed at 2 key steps in the funnel and simplified the product for them via A/B experiments. This lead to conversion improvement of 5-10% at these steps so far. We are now working to figure the next segment of users to focus on.” You can read about the adjacent user theory here
A really good comparison matrix of 2 popular ETL workflow platforms. Read here
A data engineer’s point of view on data democratization
If more people can easily access data that was previously not accessible to them, that’s a good thing. This is a good read on the various things to consider; read here
1. AWS re:Invent ML, Data and Analytics announcements
A really good recap of all the ML, data, and analytics announcements at AWS re:Invent 2020 here
2. How to build production workflow with SQL modeling
A really good example of how a data engineer at Shopify applied software engineering best practices to analytics code. Read here
3. Back to basics: What are different data pipeline components and types?
Must-know basic concepts for every data engineer here
4. Back to basics: SQL window functions
I was interviewing a senior candidate earlier this week, and it was unfortunate to see basic mistakes while writing SQL window functions. Don’t let that happen to you. Good tutorial here
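As a quick refresher on what window functions look like in practice, here is a minimal sketch of a running total per group using Python’s built-in sqlite3 module; the table and values are made up purely for illustration.

```python
import sqlite3

# Hypothetical sales table: region, month, and amount (invented sample data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, month INT, amount INT)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("east", 1, 100), ("east", 2, 150),
    ("west", 1, 80), ("west", 2, 120),
])

# SUM(...) OVER (PARTITION BY ... ORDER BY ...) computes a running total
# within each region, without collapsing the rows like GROUP BY would.
rows = conn.execute("""
    SELECT region, month, amount,
           SUM(amount) OVER (PARTITION BY region ORDER BY month) AS running_total
    FROM sales
    ORDER BY region, month
""").fetchall()

for row in rows:
    print(row)
# → ('east', 1, 100, 100)
#   ('east', 2, 150, 250)
#   ('west', 1, 80, 80)
#   ('west', 2, 120, 200)
```

Note that window functions require SQLite 3.25 or newer, which ships with any recent Python build.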
What is a data lake?
Good article on the basics of data lake architecture on Guru99 here
Data quality at Airbnb
A really good framework for thinking about data quality systematically, through examples and a mental model from Airbnb here
Monetization vs growth is a false choice
Good article from Reforge on the monetization-vs-growth mental model here
1. Why is a data scientist not a data engineer?
Good post on the difference between a data engineer and a data scientist, and why you need both roles on a data team. I chuckled when one of the sections explained why data engineering != Spark, since I completely agree that these roles should not be boxed around just one or two tools! Read the full post here
The purpose of this Insight Extractor’s newsletter is to promote continuous learning for data science and engineering professionals. To achieve this goal, I’ll be sharing articles across various sources that I found interesting. The following articles made the cut for today’s newsletter.
1. What does a Business Intelligence Engineer (BIE) do at Amazon?
Have you wondered what analytics professionals at top tech companies work on? Are you job hunting and wondering which data roles (data engineer, data scientist, or BI engineer) at Amazon are a great fit for your profile? If so, read Jamie Zhang’s (Sr. Business Intelligence Engineer at Amazon) post here
2. What are the 2 Data & Analytics Maturity models that you should absolutely know about?
If you have read my blog, you know that I am a fan of mental models. So, here are 2 mental models (frameworks) shared by Greg Coquillo that are worth reading/digesting here
3. Using Machine Learning to Predict Value of Homes On Airbnb
Really good case study by Airbnb Data scientist Robert Chang here
4. How does Netflix measure product success?
Really good post on how to define metrics to prove or disprove your hypotheses and measure progress in a quick and simple manner. To do this, the author, Gibson Biddle, shares a mechanism of proxy metrics, and it’s a really good approach. You can read the post here
Once you read the post above, I also suggest learning about leading vs lagging indicators. It’s a similar approach and something that all data teams should strive to build for their customers.
5. Leading vs lagging indicators
Kieran Flanagan and Brian Balfour talk about why your north star metric should be a leading indicator and, if it’s not, how to think about it. Read about it here
Thanks for reading! Now it’s your turn: Which article did you love the most and why?
1. What I love about Scrum for Data Science
I love the Scrum mechanism for all data roles: data engineering, data analytics, and data science. The author (Eugene) shares his perspective based on his experiences. I especially love the quote below from the blog, and you can read the full post here
Better to move in the right direction, albeit slower, than fast on the wrong path.
2. End-to-end analytics journey at a startup
One of the best articles on the end-to-end analytics journey at a startup, by Samson Hu. A must read! Go here (note that analytics architectures have changed since this post was published in 2015, so read it for the mental model rather than the exact tools mentioned)
3. GO-FAST: The Data Behind Ramadan:
A great example of data storytelling from Go-Jek BI team lead Crystal Widjaja. Read here
4. Why Robinhood uses Airflow:
Airflow is a popular data engineering tool, and this post provides really good context on its benefits and how it stacks up against other tools. Read here
5. Are dashboards dead?
Every new presentation-layer format in the data field leads some experts to question the value of dashboards. With the rise of Jupyter notebooks, most vendors have now added “notebooks” functionality, and with that comes the follow-up question: are dashboards dead? Here’s one such article. Read here
I am still not personally convinced that dashboards are “dead”, but I do think they should complement the other presentation formats that are out there. The post makes good points against dashboards (e.g., data is going portrait mode), and you should be aware of those to ensure that you are picking the right format for your customers. The author is also biased, since they work for a data vendor that is betting big on notebooks, so you might want to account for that bias while reading. Also, I had written about “Are dashboards dead?” in the context of chat-bots in 2016, and that hypothesis turned out to be true; you can read that here
If you are a data science professional and haven’t heard about bots, you will soon! Most of the big vendors (Microsoft, Qlik, etc.) have started adding capabilities and have shown signs of serious product investment in this category. So, let’s step back and reflect: how will bots impact the adoption of data platforms, and why should you care?
So, let’s start with this question: What do you need to drive a data-driven culture in an organization? You need to focus on three areas to be successful:
Data (you need to access it from multiple sources, merge/join it, clean it, and store it in a central location)
Modeling/algorithm layer (you need to add business logic, transform data, and/or apply machine learning algorithms to add business value to your data)
Workflow (you need to embed data & insights in business users’ workflows OR provide data/insights when they are in their decision-making process)
Over the past few years, there has been a really strong push for “self-service”, which was good for data professionals. A data team builds a platform for analysts and business users to self-serve whenever they need data, so instead of focusing on one-off requests, the team can focus on continuously growing the central data platform and satisfying many requests at once. This is all great. Any business with more than 50-ish employees should have a self-service platform, and if they don’t, they should consider building one. All the jazz comes after this! Data science, machine learning, predictive modeling, etc. are much easier if you have a solid data platform (aka data warehouse, operational data store) in place! Of course, I am talking at a pretty high level, and there are nuances and details we could go into, but self-service platforms were meant for business users and power users to “self-serve” their data needs, which is great!
Now, there is one problem with that! Self-service platforms don’t do a great job at the third piece, “workflow”: they are not embedded in every business user’s workflow, and the management team doesn’t always get the insights when they need to make a decision. Think of it this way: since it’s a self-service platform, users will turn to it to react to business problems and might not have the chance to be proactive. That may seem vague, so let me give you an example.
Let’s take the simple business workflow of a sales professional.
She has a call coming up with one of her key customers, since their account is about to expire. So she logs into the CRM (customer relationship management) software to learn about the customer. She looks at some information in the CRM system and then wants to learn about that customer’s product usage over the last 12 months.
She opens a new browser tab and logs into the data platform. It takes about 10 minutes to navigate to the data model/app that has that information. She filters the data to the customer of interest, and a chart comes up.
She goes back to the CRM system. She needs something else, so she goes back to the data platform. That searching takes another 10 minutes!
Wasn’t that painful? Having to switch between multiple applications and wasting 10 minutes each time just to answer a simple question. Business users will do this if it’s business-critical, but they will ignore your platform if it’s not.
So, to improve data-driven culture, you need to think about your business users’ workflows and think of ways to integrate data/insights into them. This is probably one of the most under-rated things, and it has exponential pay-offs!
So how do bots fit into all of this? We just talked about how workflows are important. To address this, tools have offered data alerts and embedded-report features, which work too, but now we have a new thing called “bots”, which enables deeper integration and helps you embed data/insights into a business user’s workflow.
Imagine this: in the previous example, instead of logging into the data platform, the business user could just ask a question in one of the chat applications: “show me the product usage of customer x”. And a chart shows up. Boom! We saved 10 minutes, but more importantly, by removing friction and adding delight, we gained a loyal user who is going to be more data-driven than ever before!
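The core of such a bot is surprisingly simple: parse the question, look up the data, and reply in the chat channel. Here is a toy sketch of that dispatch logic in plain Python; the customer names, usage numbers, and function names are all invented for illustration, and a real Slack bot would sit behind the Slack Events API or Bolt SDK rather than this stub.

```python
import re

# Hypothetical usage store; in practice this would be a query
# against your data warehouse, not an in-memory dict.
USAGE = {"acme": [120, 135, 160], "globex": [80, 75, 90]}

def handle_message(text: str) -> str:
    """Answer 'show me the product usage of customer <name>' style questions."""
    match = re.search(r"product usage of customer (\w+)", text.lower())
    if not match:
        return "Sorry, I didn't understand that."
    customer = match.group(1)
    usage = USAGE.get(customer)
    if usage is None:
        return f"No usage data found for '{customer}'."
    return f"{customer} usage (last {len(usage)} months): {usage}"

print(handle_message("Show me the product usage of customer acme"))
# → acme usage (last 3 months): [120, 135, 160]
```

The point of the sketch is the workflow: the user never leaves the chat application, and the bot does the navigating and filtering that used to cost 10 minutes per question.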
This is not fiction! Here’s a slack bot that a vendor built that does what I just talked about:
So, to wrap up, I think bots could have a tremendous impact on the adoption of data platforms, as they enable data professionals to work on the third pillar, “workflow”, to further empower business users.
And the increase in data consumption is great for both data engineers and data scientists. It’s great for data engineers because people might ask more questions, and you might have to integrate more data sources. It’s great for data scientists because if more people ask questions, then over time they will get to asking bigger and bolder questions, and you will be looped into those projects to help solve them.
What do you think? Do you think bots will impact the adoption of data platforms? If so, how? If not, why not? I am looking forward to hearing what you have to say! Please add your comments below.