Data science is a booming field that uses machine learning, algorithms and predictive models to draw insights from everyday data and inform business decisions. These concepts can be difficult to understand if you’re new to the industry.
Best Data Science Books
- Everybody Lies
- Naked Statistics
- Data Science from Scratch
- Think Stats
- An Introduction to Statistical Learning
- Pattern Recognition and Machine Learning
Luckily, we compiled this list of data science books to help you further your knowledge base, ranging from introductory overviews to more advanced content on deep learning, bias in algorithms and more. With recommendations from experts and our own personal picks, here are the data science books to pick up to learn more about the subject.
General Interest Data Science Books
Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are by Seth Stephens-Davidowitz
This book is like Freakonomics in the age of data science. It’s 100 percent not a technical book. Every chapter tells some peculiar story illustrating a data science concept — like, there’s one chapter about Google searches, another about news, another about image data, etc. It’s a bunch of stories of people being creative and finding patterns in the most random things, because these random things actually reveal a lot. The book has that name because you can lie about what you eat and read, and you can lie about who you’re going to vote for — but if I have access to your search history, I can figure out the truth. It’s a book for people that are curious about what data science is and what it can do — especially when it comes to social data. The author finishes by saying the next Freud will be a data scientist, the next Foucault will be a data scientist, the next Marx will be a data scientist. I think that’s a bit much perhaps, because data science doesn’t answer every question ever. But it’s a fun book, to be read with a grain of salt.
— Chico Camargo, postdoctoral researcher in data science at the Oxford Internet Institute
Naked Statistics: Stripping the Dread From the Data by Charles Wheelan
This book gives a lot of examples of how statistical concepts apply in the real world. Wheelan does not go into a lot of theory, but he has some pretty interesting examples and a kind of dry sense of humor. This is the only statistics book that’s ever made me laugh, and it’s the book that we recommend our incoming students at the Flatiron School read beforehand. Our students come from a wide variety of statistics backgrounds, but I’ve always gotten really positive feedback on it. It’s ideal for beginners, but I also think that if you’ve never read it and you’re in data science, it’s a great read.
— Jeff Herman, lead data science instructor at the Flatiron School
Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy by Cathy O’Neil
The author of this book, Cathy O’Neil, used to be an academic mathematician. Then she went to Wall Street, then she went to Occupy Wall Street and now she’s an activist raising awareness of how algorithms rule our lives, and how they are not as neutral or unbiased as we like to believe. The book is a collection of stories of algorithms’ real-world applications, and a lot of them are about people who were classified as unworthy by an algorithm. Like, someone purchased an item at a particular shop and automatically got their credit card limit lowered, or a college student couldn’t get a job at a local grocery store because the algorithm said so.
She doesn’t just say “boo hoo, bad algorithm, bad machine!” though — she makes an effort to explain the mechanisms that might make an algorithm racist, for instance. So, why is a policing algorithm sending officers to Black neighborhoods more often? Well, what happened in that case is that the algorithm was fed data on previous police patrols, which were more often in Black neighborhoods. So the algorithm learned that those neighborhoods are the ones that receive more patrols. The algorithm simply reproduced what it was taught. The book makes you think a lot about how you can design algorithms and data science practices to deal with that.
— Camargo
Algorithms of Oppression: How Search Engines Reinforce Racism by Safiya Umoja Noble
This book has a few stories, with very simple “data,” which the author explores in depth. I found it a very interesting read, because the author’s background is almost diametrically opposed to mine. She’s 100 percent qualitative, telling stories based on “small data” with a lot of context.
In one of these stories, the author, Safiya Noble, was organizing a party for her niece and other children, and she searched something like “Black girls” on Google. To her surprise, she didn’t find pictures of children. She found websites like “HOT BLACK SINGLES IN YOUR AREA.” For other search terms, like “Latina girls” and “Asian girls,” she found the same stuff.
The reason this happened, she explained, is Google’s revenue model. The algorithm will serve whatever ad pays the most. And it becomes a troubling situation, because even though Google is an advertising company, we use it like a public library — like some sort of publicly accessible repository of information. I found it a very sobering read.
— Camargo
Data Science Books for Beginners
Data Science from Scratch: First Principles with Python by Joel Grus
This book is about how to write data science algorithms in Python. It’s a mix between a textbook and a normal book — a great entryway book, very appropriate for a layperson. So for instance, if I wanted to learn the machine learning algorithm Naive Bayes, this book says, “We’re going to literally program Naive Bayes as if it doesn’t exist in the world. We’re going to learn the math first and then write the code as part of that. We’ll build this algorithm together with nothing but Python.”
You probably want to know a little bit of Python and a little bit of statistics going in, but this book assumes almost no depth of knowledge. It’s not one of those books that’s like, “This is left to the reader because it’s easy.” And it will teach you all the standard machine learning algorithms, probably 10 or 15 different ones.
— Zach Miller, lead data scientist at CreditNinja
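To give a flavor of the "nothing but Python" approach Miller describes, here is a minimal sketch of a Naive Bayes text classifier that uses only the standard library. It is not taken from the book. The class name, the `tokenize` helper, the smoothing parameter `k` and the toy messages are all assumptions made up for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a message and split it on whitespace."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with add-k smoothing (illustrative sketch)."""

    def __init__(self, k=1.0):
        self.k = k
        self.class_counts = Counter()            # label -> number of documents
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocab.add(word)

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            # Start from the log prior, then add one log likelihood per word.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            denom = total_words + self.k * len(self.vocab)
            for word in tokenize(text):
                if word not in self.vocab:
                    continue  # skip words never seen in training
                count = self.word_counts[label][word]
                score += math.log((count + self.k) / denom)
            scores[label] = score
        # Return the label with the highest log score.
        return max(scores, key=scores.get)


# Toy training data, invented for illustration.
texts = [
    "win money now",
    "free money offer",
    "meeting at noon",
    "lunch meeting tomorrow",
]
labels = ["spam", "spam", "ham", "ham"]

model = NaiveBayes()
model.fit(texts, labels)
print(model.predict("free money"))
print(model.predict("meeting tomorrow"))
```

Working in log probabilities avoids multiplying many small numbers together, and the smoothing term keeps a word that never appeared in one class from zeroing out that class entirely.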
R for Data Science by Hadley Wickham and Garrett Grolemund
This book overviews using the R programming language for data science, with no previous programming experience necessary. Readers are introduced to the basics of R as well as RStudio, an integrated development environment for R, and Tidyverse, a collection of open-source packages for R. Wickham and Grolemund walk through how to use R and its tools to wrangle, program, explore, model and visualize data, and provide an overall understanding of the data science cycle for beginners. Every section of the book also comes with a data exercise, so readers can practice what they learn along the way.
— Built In Staff
The Hundred-Page Machine Learning Book by Andriy Burkov
This book introduces the fundamentals of machine learning and popular algorithms used in the field of data science in a little over 100 pages. Along with covering common machine learning definitions and practices, it illustrates algorithms like linear and logistic regression, support vector machines and random forest using Python. The book is suitable for beginners with no prior programming or statistical experience, as well as experienced data science professionals looking for a reference resource.
— Built In Staff
Hands-on Machine Learning with Scikit-Learn, Keras and TensorFlow by Aurélien Géron
This book will teach you how to run predictive analytics. In the data science world, there are two main programming languages: Python and R. There are pros and cons to both, but this book is specifically for Python. Scikit-Learn, Keras and TensorFlow are all libraries of machine learning and deep learning functions within the Python programming language.
You have to be pretty good at these libraries to be a data scientist. When I was starting out, I would reference this book daily. To this day I probably look at it at least monthly as a reference, because he really goes deep into explaining how each algorithm works. A lot of algorithms have a lot of knobs or levers that you can turn — so depending on what the data is doing, you might change the algorithm a little bit. The author explains what those different knobs and levers are in a way that a beginner can understand, but someone with more experience can appreciate the level of detail that he goes into.
— Herman
Grokking Deep Learning by Andrew W. Trask
This book is an introductory textbook for the beginner who wants to go beyond usage and understand a bit of how deep learning works. People who develop deep learning tools are usually drawing from a lot of mathematics: multivariate calculus, linear algebra, optimization, often some physics too. But you don’t need all these things to understand what deep learning is doing. In the author’s words, “If you’ve passed high school mathematics and hacked around in Python, you’re ready for this book.” It covers some very general and fundamental bits, such as gradient descent, backpropagation and regularization, which are used in so many advanced tools that you cannot progress without a decent understanding of them.
I think books like this are important because thanks to online tutorials, you can get to a point where you’re implementing complex stuff without actually understanding how it works — all you need is Python and an internet connection. And that is troublesome, sometimes. People can waste resources by using deep neural networks where a linear regression would do (using a bazooka to kill a fruit fly, in a sense) or by implementing algorithms that lead to decisions that harm people, without the programmers realizing that’s happening.
— Camargo
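Gradient descent, one of the fundamentals Camargo mentions, can be sketched in a few lines of plain Python. The example below is not from the book. It fits a simple linear regression by repeatedly stepping against the gradient of the mean squared error, and the function name `fit_line`, the learning rate, the step count and the toy data are all assumptions chosen for illustration.

```python
def fit_line(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y ~ w * x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Step against the gradient to reduce the error.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (invented for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = fit_line(xs, ys)
print(round(w, 3), round(b, 3))
```

On this data the fitted slope and intercept should land very close to 2 and 1. The same update rule, applied to many more parameters, is what backpropagation feeds in a deep network, which is part of why a problem this small rarely needs one.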
An Introduction to Statistical Learning: With Applications in R by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani
When I was first learning data science, most statistical textbooks were kind of unreadable. They went in-depth on theory and didn’t really show the application side. This book doesn’t go as deep statistically as a lot of other books, but it gives you enough knowledge to be successful as a data scientist, and it goes over the key machine learning algorithms. One of the issues people have with data science is that algorithms are these black boxes where you put data in and you get data out and you have no idea what happens in the middle. This book gives you enough statistical knowledge to understand what’s going on in that black box.
It’s geared toward people that don’t have any programming or statistics background. That being said, I’ve actually read this book multiple times. Even if you’re an experienced data scientist, a lot of statistical concepts, you kind of forget about them over time. As you work in a job, you’re not going to be using every single algorithm. You get comfortable. This book allows you to say, okay, maybe I should try this other algorithm.
— Herman
Think Stats: Exploratory Data Analysis by Allen B. Downey
Data science is a mix of three different disciplines. One is programming and computer science; one is linear algebra, stats, very math-heavy analytics; and then one is machine learning and algorithms. The ideal data scientist is really good at all of them. But that doesn’t always happen, so this book is about building out that analytics, math and stats side of your data science knowledge. How do you do testing, how do you determine whether your solutions are working and the distributions are right, and how do you use that math stuff to solve business problems?
It’s textbook-y, but it isn’t a hardcore textbook. It also merges the statistical analysis with how you would write it in Python. Early in my career, I found statistics fairly easy, but making statistics into a program was more challenging. I found this very helpful for making that connection.
— Miller
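As one possible illustration of turning a statistical question into a program, the sketch below runs a permutation test for a difference in means using only the standard library. It is not taken from the book. The function name `permutation_test`, the number of resamples, the fixed seed and the two groups of measurements are assumptions invented for the example.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_resamples=5000, seed=0):
    """Estimate a two-sided p-value for the difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_resamples):
        # Shuffle the labels: if the groups do not differ, any split is as likely.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return extreme / n_resamples


# Hypothetical measurements for two versions of a page (invented numbers).
version_a = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3]
version_b = [10.2, 10.8, 10.5, 10.1, 10.9, 10.4]

p_value = permutation_test(version_a, version_b)
print(p_value)
```

A small p-value here means that a gap as large as the observed one rarely shows up when the group labels are shuffled at random, which is the kind of evidence you would want before telling a business that one version beats the other.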
Linear Algebra Done Right by Sheldon Axler
This book is an undergraduate math textbook. It’s designed for a mid-level linear algebra course, which is something every data scientist can use. It’s not machine learning, it’s not flashy programming. But the thing that I use more than anything else is my ability to take a matrix or a high-dimensional space and think about it. This is one of those books that, when you’re done, you will know inside and out how to do matrices and how to handle the vector space and how to do pure math about high-dimensional spaces. I wouldn’t say it’s for everybody, though. If this were your first math book, you would find it daunting. This is for a 200- or 300-level course.
— Miller
Advanced Data Science Books
Pattern Recognition and Machine Learning by Christopher M. Bishop
This book is definitely a textbook. If you take Data Science from Scratch and turn the math level up to 11, that’s what this book is. It bases everything on what is known as a Bayesian viewpoint, and it says that it has an intro for Bayesian learning, which it technically does, but any beginner would be mortified by it about two pages in. When I talk to other data scientists who are as nerdy as me, though, this is the book that we always end up talking about.
As far as what pattern recognition means here, any machine learning is pattern recognition, right? Looking at how the stock market used to perform and then projecting how it should perform next, that’s pattern recognition. But similarly, looking at a bunch of signs and learning that this pattern means “stop,” that’s a similar thing. Machine learning is a big, fancy, shiny term, which basically just means using the old data to think about the data you haven’t seen before. This is probably the best book I’ve read on the subject, in terms of depth and clarity of presentation. He’s not glossing over anything and he’s not making it super beginner-friendly. It’s just, this is how it works, and you can take it or leave it.
— Miller
Deep Learning with Python by François Chollet
The author of this book is the creator of the library called Keras, which makes it a lot easier to build neural networks in Python — and usually, in deep learning, you’re using neural networks on unstructured data. So if you’re trying to predict if there’s a person in an image, or whether a review on Yelp is positive or negative, you would use a deep neural network. I remember when I was reading this, in the second chapter, you build a neural network for the first time. He writes out code in the book, and then you try it out for yourself on your computer, and you get 98 percent accuracy. The data set is a bunch of handwritten numbers and you’re trying to predict what the number is, even though everyone’s handwriting is different. The ones the algorithm gets incorrect are ones that I would probably get incorrect. Being able to do that in the second chapter, I was like, “OK, I’m definitely gonna be finishing this book.”
— Herman
Data Science with Python and Dask by Jesse Daniel
The focus of this book is big data — specifically working on it with Dask.
Dask is a library in Python and it’s this buzzword right now. I see it in pretty much every job description my students apply for, and I’m very fond of it. Most companies that work with big data use a library called Spark, but it has a huge learning curve. You have to learn essentially a new language to use it. Dask allows you to interact with massive data sets in libraries that you’re already comfortable with. In this book, I really liked seeing how concepts were applied. The author introduces a data set at the beginning — it’s 42 million parking tickets around New York City — and he’ll explain a concept and then apply it on that data set.
— Herman
Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann
This book isn’t a standard pick for a data science book because it’s very much in the data engineering, computer science corner of data science’s three pillars. It’s more about designing databases and making sure that your data can flow in and out of your system. If I wanted to build a system to store every Yelp review that’s ever existed, every Yelp user and all of that information, this book is about how you store that. How do you make sure that the data can go in and out? How do you make sure that the data is consistent and reliable? How do you make sure that your system doesn’t break when you get a million users instead of 100,000 users?
It’s not super data science-y, but I think it’s a piece of the puzzle that a lot of data scientists ignore, and it explains why your system should be this way very clearly. It doesn’t assume that you’re a data engineer or an admin. I would say anybody who’s a data scientist owes it to themselves to learn about how the systems they rely on work. But you probably aren’t going to sit down and read this one end to end. It’s more of a reference.
— Miller
Data Science Books for Professionals
Build a Career in Data Science by Emily Robinson and Jacqueline Nolis
This book serves as a guide for landing your first data science role and succeeding as a professional in the field. Rather than focusing on technical knowledge, it outlines how to craft a data science portfolio, ace job interviews, create effective analyses and deploy models at work. Overall, it provides steps for navigating a data science career, from entry-level to managerial roles.
— Built In Staff
The Data Science Handbook by Carl Shan, Henry Wang, William Chen and Max Song
This book features in-depth conversations with data scientists from established companies and growing startups alike, including places like Facebook, LinkedIn and Uber. It covers their careers, perspectives, personal stories and general life advice. The data science professionals interviewed come from various backgrounds and industries, too, so it’s a perfect primer for readers curious about the field.
— Built In Staff
Data Science for Business by Foster Provost and Tom Fawcett
This book provides a look into the principles of data science and how to apply them for practical and business applications. It walks through how to approach data from an analytical perspective and utilize data-mining techniques, as well. Provost and Fawcett emphasize treating data as a business asset, helping readers understand how to fit data science practices into an organization and use it as a competitive advantage.
— Built In Staff
Storytelling with Data by Cole Nussbaumer Knaflic
This book teaches the foundations of data visualization, and what practices business professionals (including data scientists) can use for presenting data effectively. It explains how to go beyond conventional tools to truly understand and communicate your data, and goes over the importance of context, audience and storytelling when visualizing data. The book also provides real-world examples for readers to use in their own presentations.
— Built In Staff
Frequently Asked Questions
What is a good starter book for data science?
Some good starter books for data science include:
- Data Science from Scratch: First Principles with Python
- Hands-on Machine Learning with Scikit-Learn, Keras and TensorFlow
- An Introduction to Statistical Learning: With Applications in R
- Build a Career in Data Science
- The Data Science Handbook
Is data science math-heavy?
Data science can require knowledge of linear algebra, calculus and statistics, though how much math you actually use depends on the role and the task at hand.
Is data science hard for beginners?
Data science can be difficult for beginners due to its technical nature, though the subject can be learned with proper training or education.