All things data newsletter #16

(if this newsletter was forwarded to you then you can subscribe here: https://insightextractor.com/)

The goal of this newsletter is to promote continuous learning for data science and engineering professionals. To achieve this goal, I’ll be sharing articles across various sources that I found interesting. The following 5 articles/videos made the cut for today’s newsletter.

(1) Data & AI landscape 2020

Really good review of the 2020 data & AI landscape. Look at all those logos representing companies tackling various data and AI challenges. It’s an exciting time to be in data! Read here

2020 Data and AI Landscape
Image Source

(2) Self-Service Analytics

Tooling is the easy part; it’s the follow-up steps that are needed to truly achieve a culture that is independently data-driven. Read here

(3) What is the difference between data pipeline and ETL?

Really good back-to-basics video on the difference between a data pipeline and ETL.

(4) Delivering High Quality Analytics at Netflix

I loved this video! It talks about how to ensure data quality throughout your data stack.

(5) Introduction of data lakes and analytics on AWS

I have another great YouTube video for you. This one introduces you to various AWS tools for data and analytics.

Thanks for reading! Now it’s your turn: Which article did you love the most and why?

All things Data Newsletter #15 (#dataengineering #datascience #data #analytics)

The goal of this newsletter is to promote continuous learning for data science and engineering professionals. To achieve this goal, I’ll be sharing articles across various sources that I found interesting. The following 5 articles made the cut for today’s newsletter.

(1) Scaling data

Fantastic article by Crystal Widjaja on scaling data. It shares a really good framework for building analytics maturity and how to think about building capabilities to navigate each stage. Must read! Here

Image Source: Reforge

(2) Building startup’s data infrastructure in 1-Hour

Good video that touches multiple tools. Watch here: https://www.youtube.com/watch?v=WOSrRTaNIm0 (it’s a little outdated since it was shared in 2019 which is 2 years ago but the architecture is still helpful)

(3) Analytics lesson learned

If you haven’t read Lean Analytics, I recommend it! After that, you should read this free companion, which covers 12 good analytics case studies. Read here

(4) Organizing data teams

How do you organize data teams? Completely centralized under a data leader? Or decentralized, reporting into the leaders of business functions? Some good thoughts here

Image Source

(5) Metrics layer is a missing piece in modern data stack

This is a good article that encourages you to think about adding a metrics layer to your data stack. In the last newsletter, I also shared an article about Airbnb’s Minerva metrics layer, and this article does a good job of providing additional reasons to build something similar. Read here

Thanks for reading! Now it’s your turn: Which article did you love the most and why?

All things Data Newsletter #14 (#dataengineering #datascience #data #analytics)

The goal of this newsletter is to promote continuous learning for data science and engineering professionals. To achieve this goal, I’ll be sharing articles across various sources that I found interesting. The following 5 articles made the cut for today’s newsletter.

(1) Analytics is a mess

Fantastic article highlighting the importance of the “creative” side of analytics. It’s not always structured and that is also what makes it fun. Read here

(2) Achieving metric consistency at Scale — Airbnb Data

This is a great case study shared by Airbnb’s data team on how they achieved metric consistency at scale. Read here

Image Source

(3) Achieving metric consistency & standardization — Uber Data

Another great read on metrics standardization, this time at Uber. As you may notice, it’s a recurring problem at different organizations once they hit a certain growth threshold. In the initial growth stage, there’s a lot of focus on enabling folks to look at metrics in a manner that’s optimized for speed. Over time, teams go in different directions and end up defining the same thing in different ways, which doesn’t scale. After a certain stage, speed needs to be balanced with consistency and standardization. It’s a problem worth solving, and if you are interested, please read this article here

(4) Where is data engineering going in 5 years?

A good short post by Zach Wilson on LinkedIn talking about where data engineering is going over the next few years. Not surprised to see Data privacy in there! Read others here

(5) 3 foundations of successful data science team

An Amazon leader (Elvis Dieguez) talks about the 3 foundational pillars of a successful data team: 1) a data warehouse, 2) automated data pipelines, and 3) a self-service analytics tool. Read here

Thanks for reading! Now it’s your turn: Which article did you love the most and why?

All things data newsletter #13 (#dataengineer #datascience)

The goal of this newsletter is to promote continuous learning for data science and engineering professionals. To achieve this goal, I’ll be sharing articles across various sources that I found interesting. The following 5 articles made the cut for today’s newsletter.

(1) The Modern Data Stack

Amazing article by Tristan Handy explaining the modern data stack. If you are familiar with tools such as Looker, Redshift, Snowflake, BigQuery, Fivetran, dbt, etc. but have wondered how each of them fits into an overall architecture, then this is a must read! Read here

Image Source: GetDBT by Tristan Handy

(2) How can data engineering teams be productive?

Good mental model and tips to build a productive data engineering team. Read here

(3) Why is future of Business Intelligence open source?

From the founder of Apache Superset, on why he believes that the future of BI is open source. Read here.

(This is also a great marketing pitch for Apache Superset, so please read it with a grain of salt and be aware of the author’s bias on this topic)

(4) How Data and Design can work better together?

Diagnose with data and treat with design. Great article by Julie Zhuo here

(5) Zach Wilson believes that standups can be less productive in data engineering teams compared to software engineering teams

Interesting observations on his LinkedIn thread here

Thanks for reading! Now it’s your turn: Which article did you love the most and why?

All things data newsletter #12 (#dataengineer #datascience)

The goal of this newsletter is to promote continuous learning for data science and engineering professionals. To achieve this goal, I’ll be sharing articles across various sources that I found interesting. The following 5 articles made the cut for today’s newsletter.

Why did Dropbox pick Apache Superset as its data exploration tool?

Apache Superset is gaining momentum, and if you want to understand the reasons behind that, you can start by reading this article here

Growth: Adjacent User Theory

I love the framing via this LinkedIn post here where Nimit Jain says that Great Growth PM output looks like “We discovered 2 new user segments who are struggling to proceed at 2 key steps in the funnel and simplified the product for them via A/B experiments. This lead to conversion improvement of 5-10% at these steps so far. We are now working to figure the next segment of users to focus on.”; you can read about the Adjacent user theory here

SQL window functions

Need an intro to SQL window functions? Read this

Luigi vs Airflow

Really good matrix on comparing 2 popular ETL workflow platforms. Read here

A data engineer’s point of view on data democratization

If more people can easily access data that was previously not accessible to them then that’s a good thing. This is a good read on various things to consider, read here

Apache Superset growth within Dropbox:

Image Source: Dropbox Tech Blog

Thanks for reading! Now it’s your turn: Which article did you love the most and why?

Making your engineering team more data-driven

Background:

In December 2020, I was invited to lead a group discussion as part of Plato circles. The topic that I chose was “Making your engineering team more data-driven” and we had such a good discussion over 3 sessions as a group that I decided to open-source our notes. Please find them below.

Index:

In this post, we will explore why data-driven engineering is important and how to build a data-driven engineering culture within your teams and org.

Part #1: why is data-driven engineering important?

Part #2: learn about the build-measure-learn cycle and double-click on the “measure” component

Part #3: Tips on how to build a data-driven culture within the team and org

Part #1: Why is data-driven engineering important?

What is a data-driven culture?

Being data-driven means that you use data in your tactical and strategic decision-making process. If you are not using data, then you are relying on either intuition (gut feeling) or usability studies to make decisions. Neither is the best way to make decisions on its own. Here’s why:

1) Intuition: In tech, intuition doesn’t scale because you are catering to a diverse user base (usually millions of users), and you won’t always get it right. A simple exercise is to vote on a design change and then run an A/B test. How many times does your intuition win?

2) Usability studies: Usability studies are complementary to data-driven processes but shouldn’t be the only source for making decisions. They require a high-touch approach (1:1s or focus groups), they are expensive and, more importantly, it takes a while to gather the data. As your product grows, it becomes unscalable to keep reaching out to your customers to help with your decision-making process.

In the next section, let’s double-click on why being data-driven is important.

Why is data-driven important? 

In tech, we operate at an unprecedented scale, and this rollercoaster hasn’t ended yet. With that scale, the industry comes up with new buzzwords like “Big Data”. In simple terms, it means that we now have small teams that cater to billions of users (e.g. WhatsApp famously had just 50 people when Facebook acquired it; source), so we are building and running products that generate a lot of data. This data is gold because it is the voice of your customers at scale. If you want to be effective (doing the right things) and efficient (doing things right), then you need to listen to your customers, be obsessed with solving their pain points, and build the best solutions through your products. Big Data and being data-driven are your best mechanisms for achieving your desired outcomes.

What does it mean for an engineering team to be data-driven? 

For an engineering team, being data-driven means using data to make tactical and strategic decisions. For example (not exhaustive):

  1. Are you logging the right data at the right location with the right data quality bar so you can analyze your data later? 
  2. Are you running A/B tests to figure out which version of the product is performing against your goals? 
  3. Are you using metrics to figure out what you should be building next? 
  4. Are you using metrics to scale up/down your software development resourcing investments in existing product features? 
  5. Do you have alarms and monitors set up that alert if your product health metrics (latency, load time, etc) are spiking and causing detrimental customer experience? (Note: all engineering teams have alarms and monitors if the product goes down but few teams think about “health” metrics)
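
To make question 2 above a little more concrete, here is a minimal sketch of how you might check whether the difference between two variants in an A/B test is larger than what chance alone would explain. It uses a two-proportion z-test, which is one common choice and not the only one. The function name and the conversion numbers are made up for illustration.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates.

    conv_a / conv_b: number of converting users in variants A and B.
    n_a / n_b: number of users exposed to variants A and B.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis (no difference).
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF, via the error function.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical experiment: 4% vs 5% conversion with 5,000 users per variant.
z, p = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=250, n_b=5000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value (commonly below 0.05) suggests the difference is unlikely to be noise. In practice you would also decide the sample size and the success metric before starting the test.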

Who are the other partners that need to be data-driven for your engineering team to be successful at this approach? 

In tech, within a product team, you usually have multiple partners so it’s important to be data-driven and influence other partner teams to be data-driven too. These partners include business leaders, product managers, marketing managers, finance owners, data science & engineering owners, design owners, user research, and others. Below you will find a matrix of common “product” metrics and who should own them: 

Metric Category | Metric Description | Metric Example | Suggested Owner
--- | --- | --- | ---
Health | Is the product performing reliably? | Latency, uptime, load time | Engineering
Usage | How are customers using the product? | # of Activities, # of Active users (Daily Active User – DAU) | Engineering + Product + Design
New user Adoption | Are new users adopting the product? | # of new users | Marketing
Retention | Are users coming back? | # of retained users | Engineering + Product
Outcome | What is the overall business result? | Revenue | Business leader + Finance
Satisfaction | What is the overall customer satisfaction? | Net promoter score | User research
Engineering Productivity and Happiness | What is the engineering productivity and their happiness? | Cycle Time, Velocity, # of employees thumbs up, down, neutral last week/month | Engineering
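
As a rough illustration of the Usage and Retention rows, here is a small sketch that computes Daily Active Users and retained users from a raw activity log. The event log, user ids, and function names are all hypothetical, and "retained" is taken here to simply mean "active on both of two given days", which is only one of many possible definitions.

```python
from collections import defaultdict
from datetime import date

# Hypothetical activity log: (user_id, activity date).
events = [
    ("u1", date(2020, 12, 1)),
    ("u2", date(2020, 12, 1)),
    ("u1", date(2020, 12, 1)),  # repeat activity by the same user on the same day
    ("u1", date(2020, 12, 2)),
    ("u3", date(2020, 12, 2)),
]

def daily_active_users(events):
    """Count distinct users per day (DAU)."""
    users_by_day = defaultdict(set)
    for user_id, day in events:
        users_by_day[day].add(user_id)
    return {day: len(users) for day, users in users_by_day.items()}

def retained_users(events, first_day, later_day):
    """Users who were active on first_day and came back on later_day."""
    first = {u for u, d in events if d == first_day}
    later = {u for u, d in events if d == later_day}
    return first & later

dau = daily_active_users(events)
retained = retained_users(events, date(2020, 12, 1), date(2020, 12, 2))
print(dau)
print(retained)
```

The useful part is less the code and more the agreement it forces: every owner in the table needs to share the same definition of "active" and "retained".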

Note: your data science and data engineering team should build an underlying platform for you to track your metrics. Having strong data engineering and data science teams is a huge plus and a clear indicator that business leaders consider data-driven culture as a critical component for success.  

Part #2: Build-measure-learn cycle 

What is a build-measure-learn cycle?

The build-measure-learn cycle was popularized by Eric Ries in his book “The Lean Startup”. It provides a mental model to effectively approach startup development. There are three steps in this cycle:

Image source

  1. Build: You “build” your product. It could be an MVP (minimum viable product), MLP (minimum lovable product), experiment, or iteration on your existing product.
  2. Measure: The measure part should be kicked off on day 1, when you start architecting/designing your system. You should figure out a simple and robust data infrastructure from day 1. Your software developers should work backward from a system where it takes one line of code to send an event to the data infrastructure and one line of code to query the event. You don’t need to build your own solution here since there are plug-and-play options available. For example, if you are launching on the web, Google Analytics is a good solution. If you want to capture mobile events and get something a little more advanced, maybe look into Mixpanel or Amplitude. When you search for these vendors, you’ll get a list of 20 other tools. Which tool you end up picking is not that critical. It matters more that you have an architecture in place and are able to query your events, which unlocks the “measure” part of this cycle. Without this, you won’t be able to “learn” quickly.
  3. Learn: We “learned” (pun intended) in the last session that intuition isn’t always effective and that usability studies are time-consuming, so if you want to make decisions fast, move fast, learn fast and iterate, you need to be data-driven! The learning step means that you use the measuring step above to figure out what is not working so you can scale back, and what is working so you can double down. Sometimes, data might also help you figure out when to pivot. For example, Airbnb learned through data that listings with a professional photo were getting booked 2.5x as often (source) as listings without one, so they quickly doubled down on that and skyrocketed their growth!
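
To illustrate the "one line of code to send an event and one line of code to query the event" idea from the Measure step, here is a toy sketch backed by an in-memory SQLite database. The `EventStore` class, its `track` and `count` methods, and the event names are all made up for this example. A real setup would more likely use a vendor SDK or a proper event pipeline, as suggested above.

```python
import json
import sqlite3
import time

class EventStore:
    """Toy event store (hypothetical): one call to log, one call to query."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS events "
            "(ts REAL, user_id TEXT, name TEXT, props TEXT)"
        )

    def track(self, user_id, name, **props):
        # Extra properties are stored as a JSON string alongside the event.
        self.db.execute(
            "INSERT INTO events VALUES (?, ?, ?, ?)",
            (time.time(), user_id, name, json.dumps(props)),
        )
        self.db.commit()

    def count(self, name):
        row = self.db.execute(
            "SELECT COUNT(*) FROM events WHERE name = ?", (name,)
        ).fetchone()
        return row[0]

store = EventStore()
store.track("u1", "signup", plan="free")   # one line to send an event
store.track("u2", "signup", plan="pro")
store.track("u1", "checkout", amount=20)
print(store.count("signup"))               # one line to query it
```

The point of the sketch is the shape of the interface: if logging and querying are this cheap for developers, the "measure" step stops being a bottleneck.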

Image Source (Airbnb growth after starting professional photographs)

Why is it useful for engineering teams?

The most important thing for engineering teams to note here is that the build-measure-learn mental model is a cycle vs a one-off occurrence. As an engineering team, you’ll need to keep going through these cycles and spin this wheel as fast as you can and should. If you are unstoppable at this in building software by iterating and learning very quickly then you will eventually build a product that your customers LOVE! 

How do you implement the “measure” part? 

Deciding that you and your team will adopt this mental model was the hardest part. The easier part is to now figure out how to implement it. A few ideas and questions to think about (not exhaustive since you will need to build-measure-learn this on your own for what works best for you):

  1. Do you know your success metrics? (e.g. Are you optimizing for Daily Active Users and engagement rates?) 
  2. Do you understand customer behavior? Do you understand the common paths that the customers will take in your product? Based on that, do you know the events that you want to capture? Do you have a prioritized list of events that you should capture? 
  3. Do you know your system architecture on how to log these events? How big is your team and how many users will you have? If the product will have multiple millions of users soon, have you considered how your events logging solution will scale? Do you have a data engineer on the team?
  4. Are you aware that most Analytics projects fail? So it’s important to start, iterate and fail fast. (Read the “why most analytics fail” blog linked in the appendix after this session) 
  5. Do you have a data-store where you will store the events? (e.g. Redshift, PostgreSQL, etc)
  6. Do you have a tool to visualize your events data? (e.g. Tableau, Excel, Quicksight, etc.)
  7. Are you planning to just go with a vendor that gives you end-to-end analytics capabilities? Have you run a proof of concept? Have you talked to peers who might have experience with this? Are you aware that it’s hard to switch analytics vendors, so you need to choose wisely? (e.g. Google Analytics, Amplitude, Mixpanel)
  8. Have you thought about rapid A/B testing tooling? (e.g. In-house, Google Analytics, Visual Website Optimizer, Optimizely, Amplitude, etc.) 
  9. How do you plan to address complex use-cases that your engineers can’t analyze? Developers should be very good at analyzing one event at a time, but what if you need to analyze entire customer journeys and funnels? That will require a Data Scientist or Data Analyst on the team. Do you have one?
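
To show why funnels (question 9) are harder than single events, here is a minimal sketch that counts how many users reach each step of a funnel in order. The step names, the event list, and the `funnel_counts` function are hypothetical, and the sketch assumes the events are already sorted chronologically.

```python
# Hypothetical funnel definition, in the order users are expected to move through it.
FUNNEL_STEPS = ["visit", "signup", "checkout"]

# Hypothetical events as (user_id, event name), assumed to be in chronological order.
events = [
    ("u1", "visit"), ("u1", "signup"), ("u1", "checkout"),
    ("u2", "visit"), ("u2", "signup"),
    ("u3", "visit"),
    ("u4", "signup"),  # never logged "visit", so never enters the funnel
]

def funnel_counts(events, steps):
    """Count users who reached each step after completing all earlier steps in order."""
    progress = {}  # user_id -> number of funnel steps completed so far
    for user_id, name in events:
        idx = progress.get(user_id, 0)
        if idx < len(steps) and name == steps[idx]:
            progress[user_id] = idx + 1
    return [sum(1 for p in progress.values() if p > i) for i in range(len(steps))]

for step, count in zip(FUNNEL_STEPS, funnel_counts(events, FUNNEL_STEPS)):
    print(step, count)
```

Even this toy version has to make decisions (ordering, skipped steps, repeat events) that a single-event count never faces, which is where a Data Scientist or Data Analyst usually earns their keep.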

Note that the architecture you come up with for the measure part should help you and your engineering team keep moving fast. It should solve simple use-cases without external help as often as possible.

Part #3: How to build a data-driven culture?

What is Data Culture?

First, let’s define culture: “The set of shared values, goals, and practices that characterizes a group of people” (source). Building on that to define data culture: the shared value is that decisions will be made based on insights generated from data, and the group of people is all the decision makers in the organization. In other words: “An org that has a great data culture will have a group of decision makers that uses data & insights to make decisions.”

What are the ingredients for a successful data culture?

It’s the 3 P’s: Platform, Process and People, plus continuously iterating on and improving each of the P’s to improve data culture.

How to build data culture?

Here’s a mental model for a leader within an org:

  1. Understand data needs and prioritize
  2. Hire the right people
  3. Set team goals and define success
  4. Build something that people use
  5. Iterate on the data product and make it better
  6. Launch and communicate broadly
  7. Provide Training & Support
  8. Celebrate wins and communicate progress against goals
  9. Continue to build and identify next set of data needs

Appendix: 

Book recommendation: 
  1. Lean Analytics: Use Data to Build a Better Startup Faster https://www.amazon.com/Lean-Analytics-Better-Startup-Faster-ebook/dp/B00AG66LTM/ref=tmm_kin_swatch_0?_encoding=UTF8&qid=1607104319&sr=1-3; Free Video companion: https://www.udemy.com/user/leananalytics/ 
  2. Measure what matters: https://www.amazon.com/Measure-What-Matters-Google-Foundation-ebook-dp-B078FZ9SYB/dp/B078FZ9SYB/ref=mt_other?_encoding=UTF8&me=&qid= 
  3. An elegant Puzzle: https://www.amazon.com/gp/product/B07QYCHJ7V/ref=ppx_yo_dt_b_d_asin_title_o02?ie=UTF8&psc=1 
  4. The Lean Startup: https://www.amazon.com/Lean-Startup-Entrepreneurs-Continuous-Innovation/dp/0670921602/ref=asc_df_0670921602/?tag=hyprod-20&linkCode=df0&hvadid=312118059795&hvpos=&hvnetw=g&hvrand=6101413581120496701&hvpone=&hvptwo=&hvqmt=&hvdev=c&hvdvcmdl=&hvlocint=&hvlocphy=9033313&hvtargid=pla-364195445884&psc=1 

Blogs:
  1. Why do most analytics efforts fail? https://www.reforge.com/blog/why-most-analytics-efforts-fail
  2. Airbnb growth story: https://growthhackers.com/growth-studies/airbnb  
  3. Building data driven companies: https://insightextractor.com/2015/12/29/building-data-driven-companies-3-ps-framework/ 
  4. Data culture mental model: https://insightextractor.com/2019/04/05/data-culture-mental-model/ 

What is Plato’s circle?

1-on-1 mentorship is valuable but often not scalable. We value your time and we’d love for you to help multiple people at once (plus it’s a chance for participants to meet their peers). In a Circle, people connect virtually so they can (a) learn from you and (b) be in a safe space where they can share experiences, pain points, & successes. Typically, a Circle happens over 4 weeks over Zoom for 45 min per session.

Thanks to Plato Circle Attendees:

Julius (VP of Engineering @ Hubble), Sridhar (Director of engineering @ BlackHawk network), Ken (Software @ AES – EdTech), Danny Philayvanh (Director of Engineering @ Rakuten), Royce (Product @ Plaid), Daniel (Founder @ Canadian Entrepreneurs), Brian (Staff Research Scientist @ SambaTV), Yahia (Lead Staff Engineer @ Argo AI), Sky (Tech Lead @ Mindvalley), Minna (Senior software engineer @ Fable Tech Labs).

All things data newsletter #11 (#dataengineer, #datascience)

The goal of this newsletter is to promote continuous learning for data science and engineering professionals. To achieve this goal, I’ll be sharing articles across various sources that I found interesting. The following 5 articles made the cut for today’s newsletter.

1. AWS re:Invent ML, Data and Analytics announcements

Really good recap of all ML, Data and Analytics announcements at AWS re:Invent 2020 here

2. How to build production workflow with SQL modeling

A really good example of how a data engineer at Shopify applied software engineering best practices to analytics code. Read here

Image Source

3. Back to basics: What are different data pipeline components and types?

Must know basic concepts for every data engineer here

4. Back to basics: SQL window functions

I was interviewing a senior candidate earlier this week and it was unfortunate to see basic mistakes while writing SQL window functions. Don’t let that happen to you. Good tutorial here
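
If you want a quick refresher before reading the tutorial, here is a small sketch of two common window functions, `ROW_NUMBER()` and a running `SUM()`, run through Python’s built-in `sqlite3` module. The `orders` table and its rows are made up for illustration, and the query assumes your Python ships with SQLite 3.25 or newer, which is when window functions were added.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (customer TEXT, order_day TEXT, amount REAL)")
db.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("alice", "2020-12-01", 10.0),
        ("alice", "2020-12-03", 30.0),
        ("bob", "2020-12-02", 20.0),
        ("bob", "2020-12-04", 5.0),
    ],
)

# PARTITION BY restarts the window for each customer;
# ORDER BY inside OVER() defines the order within each partition.
rows = db.execute(
    """
    SELECT customer,
           order_day,
           amount,
           ROW_NUMBER() OVER (PARTITION BY customer ORDER BY order_day) AS order_seq,
           SUM(amount)  OVER (PARTITION BY customer ORDER BY order_day) AS running_total
    FROM orders
    ORDER BY customer, order_day
    """
).fetchall()

for row in rows:
    print(row)
```

Unlike `GROUP BY`, the window functions keep every input row and add the per-customer sequence number and running total alongside it, which is the distinction that tends to trip people up.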

5. 300+ data science interview questions

Good library of data science interview questions and answers

Thanks for reading! Now it’s your turn: Which article did you love the most and why?

All things data newsletter #9 (#dataengineer, #datascience)

The goal of this newsletter is to promote continuous learning for data science and engineering professionals. To achieve this goal, I’ll be sharing articles across various sources that I found interesting. The following 5 articles made the cut for today’s newsletter.

1 The Great Data Debate by a16z

a16z is a top venture capital firm and they recently published this amazing podcast. Must listen! here

2 Zen of Python!

Some really good tenets that the Python community lives by! Read here

Some of my favorites: “Practicality beats purity” and “If the implementation is hard to explain, it’s a bad idea”

3 Super intelligence: science or fiction?

A bit outdated (2017) but still a really fun conversation to listen to. Speakers include Elon Musk, Stuart Russell, Ray Kurzweil, Demis Hassabis, Sam Harris, Nick Bostrom, David Chalmers, Bart Selman, and Jaan Tallinn.

Watch here:

4 MUST READ! Data Quality at Airbnb; part 2

I included Part 1 in the previous newsletter #8 and in this one, you have the link to part 2 here

5 Some must know SQL concepts

Good list by Eric Weber on LinkedIn here

Elon Musk on Artificial Intelligence - YouTube
Source

Thanks for reading! Now it’s your turn: Which article did you love the most and why?

All things Data Engineering & Data Science Newsletter #8

The goal of this newsletter is to promote continuous learning for data science and engineering professionals. To achieve this goal, I’ll be sharing articles across various sources that I found interesting. The following 5 articles made the cut for today’s newsletter.

What is a data lake?

Good article on basics of data lake architecture on guru99 here

Data quality at Airbnb

Really good framework on how to think about data quality systematically through examples and mental-model from Airbnb here

Monetization vs growth is a false choice

Good article from Reforge for Monetization vs growth mental model here

Performance Tuning SQL queries

Really good basic post on tuning SQL queries here

Improving conversion rates through A/B testing

Good mental model to run effective A/B testing to improve metrics such as conversion rate here

Source: Difference Media Variations for A/B testing

Thanks for reading! Now it’s your turn: Which article did you love the most and why?

All things data engineering & science newsletter #7

The goal of this newsletter is to promote continuous learning for data science and engineering professionals. To achieve this goal, I’ll be sharing articles across various sources that I found interesting. The following 5 articles made the cut for today’s newsletter.

1. Why a data scientist is not a data engineer?

Good post on the difference between a data engineer and a data scientist and why you need both roles in a data team. I chuckled when one of the sections explained why data engineering != Spark, since I completely agree that these roles should not be boxed into just one or two tools! Read the full post here

2. Correlation vs Causation:

1 picture = 1000 words!

Image Source

3. Best Practices from Facebook’s growth team:

Read Chamath Palihapitiya’s and Andy Johns’s responses to this Quora question here

4. Simple mental model for handling “big data” workloads

Image Source

5. Five things to do as a data scientist in the first 90 days that will have a big impact

Eric Weber gives 5 tips on what to do as a new data scientist to have a big impact. Read here

Thanks for reading! Now it’s your turn: Which article did you love the most and why?