The Best Data Science Blogs to Follow
The decision by data-driven companies to offer a peek behind the curtain through their blogs has been a win-win. By publishing instructive tutorials and case studies, companies engage a specialized audience that in turn benefits from learning from the best in the field. It’s the antithesis of what, in a worse universe, could have been mere content-marketing chum.
Through these blogs, many companies carry on the long-term trend toward democratization in data-science education, as seen elsewhere in community competitions, MOOCs and other freely accessible resources. They often combine approachability with technical or academic rigor, and many publish handy primers on emerging concepts or approaches.
Below you’ll find a cross-section of longstanding must-reads, and several promising newcomers. There are tons of blogs worth following out there, but these 20 entries hopefully offer a good starting point.
Data Science and Machine Learning
The official blog of Google’s powerful machine-learning library is regularly updated with digestible how-tos, creative use-case spotlights and introductions to new open-source packages. Whimsical case-study spotlights should engage general-interest readers, while rollout explainers (like the recent series focused on the new, low-lift TensorFlow Recommenders package) will appeal to working data scientists and machine-learning engineers.
Stitch Fix’s MultiThreaded
The online personal styling service was a trailblazer in data-driven retail, and its MultiThreaded blog reliably keeps a finger on the pulse of real-world data science applications. Posts about the specialist-vs.-generalist debate, ETL structure and (more recently) multi-armed bandits all stirred conversation in data circles. And years after it was published, the company’s Algorithms Tour is still a relevant scrollytelling explainer of how Stitch Fix leverages its 145-plus (!) data scientists for recommender systems, demand modeling, style development and other business facets.
Instacart Machine Learning
Instacart’s three-million-order data set remains a Kaggle-competition go-to and a handy resource for anyone diving into product purchasing analysis. So perhaps it’s no surprise that the grocery service’s tech blog sports some thoughtful, instructive peeks behind the curtain into its work in machine learning and data science.
Spotify R&D: Engineering
You won’t find a recipe for Spotify’s recommendation-system secret sauce (a hybrid of content-based and collaborative filtering that’s been crucial to the streaming app’s success) on the company’s blog. But deep-dive contextual explainers of Spotify’s various new frameworks, like Lexikon, for data discovery, and the new Experimentation Platform, a more data-friendly alternative to traditional A/B testing, are reliably informative reads. Keep an eye on Spotify’s design blog, too, which also runs interesting data science-related reads.
Netflix Technology Blog
Netflix’s surfeit of user data has allowed for analytics-driven decisions both small (algorithmically personalized thumbnail art) and large (whether or not a production or title buy is greenlighted). It also means that, whenever Netflix reveals something about the inner workings of its data team, it’s usually worth a look. Recent technical highlights include how the company batch-moves data from data warehouses to key-value databases and — no biggie — the introduction of a new interdisciplinary field, dubbed computational causal inference.
Airbnb Engineering & Data Science
The home-rental pioneer was also a data front-runner: it famously incorporated data science from the outset at a time when few companies did, and it runs an internal data literacy “university” for employees. Not surprisingly, a new AI/ML or data science blog post from the company often attracts attention. A recent highlight is the two-part dive into data quality, but we also recommend checking out the classic A Beginner’s Guide to Data Engineering, which helped define the role as it is now commonly understood.
Facebook’s engineering blog has consistently noteworthy updates (such as the recent unveiling of the company’s data-discovery solution), but the really eye-widening stuff is over on the AI blog, where Facebook posts its research and academic publications — all of which have big implications far beyond social media. Recent notable entries include AI-accelerated MRI tech, ML-turbocharged effect-rendering and the object-recognition system GrokNet.
Speaking of Facebook Research, perhaps the most notable tool to emerge from FB R&D — the deep-learning framework PyTorch — sports a host of relevant content on its dedicated blog. There are plenty of interesting case studies (from Datarock to Disney), and a plethora of resources and community support for building and productionizing neural networks.
Wayfair may have only recently posted a profit, but the digital-only furniture retailer’s early bet on data appears to have paid dividends on personalization, price modeling, computer vision-driven categorization and other key areas. Unfortunately, blog updates seem to come less often than they once did. Nevertheless, 2020 highlights like an ETL automation guide and a breakdown of its Bayesian approach to determining what furniture will have the broadest appeal prove there’s still plenty of useful, day-to-day data science content in the pipeline.
Even though the ride-hailing company has discontinued its Uber AI Labs division and offloaded its self-driving-car focus, its ample computer vision and neural network research remains available here, along with regularly published updates under the AI and Uber Data categories. It was here that Uber popularized the concept of feature stores and made waves with its Postgres-to-MySQL migration announcement.
Shopify Data Science & Engineering
Updated monthly, the data science section of Shopify’s engineering blog offers readable, actionable lessons learned from the e-commerce platform’s dramatic, data-driven rise. A walkthrough post about data documentation (a thorny challenge that teams often have to self-solve) and a high-level overview of Shopify’s foundational data and engineering principles were both widely shared and discussed in data engineering and data science circles.
Speaking of Uber and feature stores, three of the engineers who built Michelangelo, Uber’s machine-learning platform, eventually went on to found Tecton. The startup offers an ML platform that transforms raw data into feature values and stores them. Since emerging from stealth in mid-2020, the Andreessen Horowitz- and Sequoia Capital-funded company has also been publishing informative, engaging posts about machine learning and model reproducibility. Missives on MLOps, feature stores and data leakage are all well worth a read.
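For readers new to the idea, here is a minimal sketch of what "transforming raw data into stored feature values" can look like. It is a toy illustration only, not Tecton's or Uber's API: the `RAW_ORDERS` data, the `build_features` function and the `InMemoryFeatureStore` class are all hypothetical names invented for this example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw events: (user_id, order_total). Invented for illustration.
RAW_ORDERS = [
    ("u1", 20.0),
    ("u1", 40.0),
    ("u2", 15.0),
]


def build_features(raw_orders):
    """Transform raw order events into per-user feature values."""
    totals = defaultdict(list)
    for user_id, order_total in raw_orders:
        totals[user_id].append(order_total)
    return {
        user_id: {
            "order_count": len(values),
            "avg_order_total": mean(values),
        }
        for user_id, values in totals.items()
    }


class InMemoryFeatureStore:
    """Toy store (hypothetical): one dict of feature values per entity key."""

    def __init__(self):
        self._rows = {}

    def write(self, features):
        # Insert or overwrite the feature row for each entity.
        self._rows.update(features)

    def read(self, entity_id, names):
        # Return the requested features in a fixed order, as a model would expect.
        row = self._rows[entity_id]
        return [row[name] for name in names]


store = InMemoryFeatureStore()
store.write(build_features(RAW_ORDERS))
print(store.read("u1", ["order_count", "avg_order_total"]))  # [2, 30.0]
```

The point of the pattern is that training and serving code both read the same precomputed values by name, rather than each recomputing them from raw data. Production feature stores add concerns this sketch ignores, such as versioning and point-in-time correctness.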
The Signal by Mixpanel
An early, prominent rebuke of vanity metrics (or, as dubbed here, “bullshit metrics”) came courtesy of analytics platform Mixpanel. Years later, The Signal still regularly dishes up digestible advice on product data, product metrics and long- and short-term growth, often from a business-intelligence perspective.
Not a company blog per se, but the official blog of the non-profit Project Jupyter is something of a focal point for the game-changing, data-access computational notebook. Readers will find community announcements, novel use cases and — most notably — must-know release updates, kernel debuts and assorted new tools.
The regularly updated, reliably informative blog at this deep learning platform and Y Combinator alum has established itself as a trusted resource over the last three years. Penned by the company’s own data scientists and a roster of knowledgeable outside contributors (from organizations such as Microsoft Research, Intercom and Cognizant), posts are visualization-dense, code-snippet-packed deep dives on DL and ML concepts and approaches. Be sure to catch the ongoing “Humans of Machine Learning” series: Q&As with ML notables, by turns philosophical and instructional.
Domino Data Lab
Domino’s stated editorial mission is to remain “hyper focused on learning and understanding how to help data scientists accelerate their work,” and its frequently updated blog offers a deep and wide sampler of professional development-focused content. Topics range from high-level (causal inference, dealing with disappointing model outcomes) to granular (GAN evaluation, data drift detection in image classification), and the tone is a thoughtful blend of technical and approachable.
Approachable and frequently updated, the Tableau blog covers data-viz topics ranging from product updates to data literacy to COVID-19, which cast into stark relief the importance of data-visualization clarity. Regular contributors include CTO Andrew Beers and (noted bar-chart-race skeptic) Andy Cotgreave. The Engineering blog is worth a look too. A standout entry: the companion blog to a paper that Tableau researchers authored on the eternally debated question of whether truncating a graph’s y-axis is ever acceptable.
Chartable by Datawrapper
Chartable (not to be confused with the podcast-analytics startup with which it shares a name) gathers in-house data-visualization service journalism in its How To’s — including a Data Vis Book Club — while also posting and dissecting a new weekly chart. Those range from important, historical-minded voting analysis to fun toy projects, like charting the data set visualized on Joy Division’s Unknown Pleasures album.
This visualization platform, co-founded by former New York Times data journalist Mike Bostock, has become a favorite among data visualization practitioners who want more code-friendly customization than plug-and-play alternatives offer. Meanwhile, Observable’s own profile serves as a repository for staff-pick highlights, creator spotlights, how-to demos, forkable notebooks created by staff and the occasional critical explainer, like this excellent entry on when it might make sense to use the oft-derided radial visualization.
Not surprisingly, several posts on Plotly’s page spotlight updates and thorough pairing tutorials for its flagship offering, Dash (recent walkthroughs include using Dash with HoloViews and SHAP). At the same time, it stays varied with broader-interest (for the ML-practitioner crowd, anyway) posts, such as a history of autonomous-vehicle data sets, word-embedding logic and, per Plotly’s métier, data visualization for AI.