My Favorite Data Science Books

We recently moved to a new home in the South Bay Area, which meant I had to reorganize my books. Over the years, I have accumulated more data science books than one might imagine – upwards of a hundred. While learning from books (as opposed to online sources) seems to be going out of style, I automatically gravitate toward my physical bookshelf when confronted with a problem.

So, I thought I would take a break from digging into research papers and instead put together a brief list of some of my favorite books in the world of statistics, econometrics, and programming. I firmly believe they belong on every data scientist’s bookshelf.

Communication

Successfully communicating the findings of our analyses is an essential part of being an effective data scientist. If you are anything like me, you can benefit from some lessons in “clarity and grace.” The following are two classic books that have completely transformed how I think about communicating my thoughts, ideas, and results. If I were to suggest a single book from this entire page, I’d pick either of these two. Look for an older edition that costs much less.

Popular Science

Occasionally, statistics enters the public discourse through the popular science gateway. Typically, this takes the form of showing you how statisticians can manipulate their data to achieve the results they want. These books do much more than that – they walk you through the beautiful world of probability and statistics from all kinds of angles.

Data Visualization
  • “R Graphics Cookbook: Practical Recipes for Visualizing Data” by W. Chang (O’Reilly Media; 2019). You know what type of graph goes with your data. Now, the next step is to produce that graph. Chang’s book shows how to do this beautifully in R, with hundreds of examples. The downside is that it does not cover many advanced topics. But, hey, it’s a terrific book nevertheless.
Programming – R, Python, and SQL

It’s hard to imagine a data science position that does not require coding. These are my favorite SQL, R, and Python books. Older editions are usually a good deal – they cost less and cover much the same material.

Experimentation & A/B testing

Everybody knows running A/B tests (a.k.a. RCTs) is fun. But there are so many ways in which things can get messy – flawed design, wrong KPIs, spillover effects, etc. This book will tell you how to make sure you do it right. It is the de facto bible for online randomized experiments. It’s thin on statistical theory, but it makes up for that with a rich, practical viewpoint.

Statistics

Having a solid foundation in statistics is among the most important assets a data scientist can have. It will give you a deeper understanding of any modeling you might do or encounter. More importantly, it will differentiate you from other data scientists who fear equations and statistical technicalities. What exactly is a p-value? The following books will give you the basics you need to answer questions like that.

(Beginner) Machine Learning

Machine learning is now part of many data scientists’ day-to-day duties. These books will prepare you for any prediction/ML problem you might face.

  • “An Introduction to Statistical Learning: with Applications in R” by G. James, D. Witten, T. Hastie, and R. Tibshirani (Springer; 2021). A great overview of the most popular machine learning methods/algorithms with detailed how-to steps and R code. The more technical version of this book is listed in the next section.
  • “The Hundred-Page Machine Learning Book” by A. Burkov (2019). A beautiful, minimalistic walkthrough of the most important ML concepts and algorithms with just the right level of technical detail. Importantly, the digital version is free on the author’s website.
(Advanced) Machine Learning

Sometimes the fun is in digging deeper. And this is certainly true for machine learning models.

  • “The Elements of Statistical Learning: Data Mining, Inference, and Prediction” by T. Hastie, R. Tibshirani, and J. Friedman (Springer; 2016). I regret over-using the word “bible” because I really should have left it for this one. Hastie, Tibshirani, and Friedman, all at Stanford, have long been an inspiration to me. So, I take everything they write seriously, and for good reason.
  • “Machine Learning Engineering” by A. Burkov (2020). A sequel to Burkov’s earlier book with a focus on the data engineering side of ML.

(Beginner) Causal Inference

These resources will gently guide you through the most used tools for isolating causal effects from non-experimental data.

  • “Causal Inference: The Mixtape” by S. Cunningham (Yale University Press). A relatively new yet well-established book in the world of causal inference. You can find a free version online, but I recommend getting the paper copy immediately. Cunningham successfully balances theoretical details and hands-on applications (it even includes tons of R and Stata code).
  • “Mastering ‘Metrics: The Path from Cause to Effect” by J. Angrist and J.-S. Pischke (Princeton University Press). After their earlier book’s enormous success, Angrist and Pischke published a more accessible version. I, like most grad students in the field, spent numerous days banging my head against each chapter of that earlier book. So, I can’t recommend this less technical one enough. Joshua Angrist recently won the Nobel Prize for his contributions to causal inference, so you are definitely in good hands.
(Advanced) Causal Inference

Sometimes the basics are not enough.

  • “Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction” by G. Imbens and D. Rubin (Cambridge University Press; 2015). Did I already use the word “bible?” This tome feels a bit too dense but is an indispensable desk buddy for anyone who takes causal inference seriously. That said, it is probably too technical for most data scientists in the industry. Guido Imbens also won the Nobel Prize, while Don Rubin is credited with the dominant Potential Outcomes framework.
  • “Impact Evaluation: Treatment Effects and Causal Analysis” by M. Frölich and S. Sperlich (Cambridge University Press; 2019). This lesser-known resource is a great addition to Imbens and Rubin’s work. It contains chapters on topics that you rarely find outside of research papers – e.g., quantile treatment effects in propensity score or regression discontinuity models.
