We recently moved to a new home in the south Bay Area, which meant I had to reorganize my books. Over the years, I have accumulated more data science books than one might imagine – upwards of a hundred. While it seems like learning from books (as opposed to online sources) is going out of style, I automatically gravitate toward my physical bookshelf when confronted with a problem.
So, I thought I would take a break from digging into research papers and instead put together a brief list of some of my favorite books in the world of statistics, econometrics, and programming. I firmly believe they belong on every data scientist’s bookshelf.
Successfully communicating the findings of our analyses is an essential part of being an effective data scientist. If you are anything like me, you can benefit from some lessons in “clarity and grace.” The following are two classic books that have completely transformed how I think about communicating my thoughts, ideas, and results. If I were to suggest a single book from this entire page, I’d pick either of these two. Look for an older edition that costs much less.
- “Style: Lessons in Clarity and Grace” by J. Williams (Pearson)
- “The Elements of Style” by W. Strunk Jr. and E. B. White
Occasionally, statistics enters the public discourse through the popular science gateway. Typically, this takes the form of showing you how statisticians can manipulate their data to achieve the results they want. These books do much more than that – they walk you through the beautiful world of probability and statistics from all kinds of angles.
- “The Art of Statistics: How to Learn from Data” by D. Spiegelhalter (Basic Books; 2021). A veteran statistician brilliantly explains a wide range of concepts with engaging real-world examples.
- “The Data Detective: Ten Easy Rules to Make Sense of Statistics” by T. Harford (Riverhead Books; 2022). I have long been a fan of Tim Harford, and I love listening to him on BBC’s More or Less podcast. His newest book further solidifies him as a data expert and an amazing storyteller.
- “Fundamentals of Data Visualization: A Primer on Making Informative and Compelling Figures” by C. Wilke (O’Reilly Media; 2018). This pick stands out in the crowded market for data viz books as a clear and compelling resource for any data scientist, really. You will learn best practices in working with visualizations, what to avoid doing, and how to choose the right type of chart for your data.
- “R Graphics Cookbook: Practical Recipes for Visualizing Data” by W. Chang (O’Reilly Media; 2019). You know what type of graph goes with your data. Now, the next step is to produce that graph. Chang’s book shows you how to do this beautifully in R, with hundreds of examples. The downside is that it does not cover many advanced topics. But, hey, it’s a terrific book nevertheless.
Programming – R, Python, and SQL
It’s hard to imagine a data science position that does not require coding. These are my favorite SQL, R, and Python books. Older editions are usually a good deal – they would cost less and cover the same material.
- “Learning SQL: Generate, Manipulate, and Retrieve Data” by A. Beaulieu (O’Reilly Media; 2020). I learned SQL with this book, and I can’t recommend it enough. Its main shortcoming is a lack of depth, but it’s great for SQL basics.
- “R Cookbook: Proven Recipes for Data Analysis, Statistics, and Graphics” by J.D. Long and P. Teetor (O’Reilly Media; 2019). I have spent a lot of time with this one. Just like Beaulieu’s SQL book, it’s great for the basics, but I often find myself needing to look up more technical topics elsewhere.
- “Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems” by A. Géron (O’Reilly Media; 2022). An elegant walkthrough of the entire ML landscape, with about half of the book dedicated to neural networks. The author carefully holds your hand through an end-to-end ML project.
Experimentation & A/B testing
Everybody knows running A/B tests (a.k.a. RCTs) is fun. But there are so many ways in which things can get messy – flawed design, wrong KPIs, spillover effects, etc. This book will tell you how to make sure you do it right. It is the de facto bible for online randomized experiments. It’s thin on statistical theory, but it makes up for that with a rich, practical viewpoint.
- “Trustworthy Online Controlled Experiments” by R. Kohavi, D. Tang, and Y. Xu (Cambridge University Press; 2020).
Having a solid foundation in statistics is among the most important skills for data scientists. It will give you a deeper understanding of any modeling you might do or encounter. More importantly, it will differentiate you from other data scientists who fear equations and statistical technicalities. What exactly is a p-value? The following books will give you the basics you need to answer questions like that.
- “Foundations of Agnostic Statistics” by P. Aronow and B. Miller (Cambridge University Press; 2019)
- “All of Statistics: A Concise Course in Statistical Inference” by L. Wasserman (Springer; 2010)
- “Computer Age Statistical Inference, Student Edition: Algorithms, Evidence, and Data Science” by B. Efron and T. Hastie (Cambridge University Press; 2021). This one is the most technical of them all, so you need to gather some patience. The authors paint a detailed picture of statistical inference (estimation, hypothesis testing, and beyond) from a historical perspective.
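To make the p-value question a bit more concrete, here is a minimal Python sketch of a two-sided permutation test for a difference in means. It is only an illustration, not taken from any of the books above: the function name `permutation_p_value`, its parameters, and the toy numbers are all my own assumptions. The idea is that the p-value estimates how often randomly relabeling the data produces a difference at least as large as the one actually observed.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Approximate two-sided permutation p-value for a difference in means.

    Hypothetical helper for illustration only; names and defaults are assumptions.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same answer
    n_a = len(group_a)
    pooled = list(group_a) + list(group_b)

    def abs_mean_diff(values):
        # Absolute difference between the means of the first n_a values and the rest.
        first, second = values[:n_a], values[n_a:]
        return abs(sum(first) / len(first) - sum(second) / len(second))

    observed = abs_mean_diff(pooled)

    # Count relabelings whose difference is at least as large as the observed one.
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if abs_mean_diff(pooled) >= observed:
            extreme += 1

    # The +1 terms count the observed labeling itself, so the estimate is never zero.
    return (extreme + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Made-up numbers, purely for demonstration.
    control = [12, 15, 11, 14, 13]
    treatment = [18, 21, 17, 20, 19]
    print(permutation_p_value(control, treatment))
```

A small value says that a gap this large would rarely show up if the group labels were meaningless. It does not say how likely the hypothesis itself is to be true, which is exactly the kind of subtlety these books spend time on.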
(Beginner) Machine Learning
Machine learning is now part of many scientists’ day-to-day duties. These books will prepare you for any prediction/ML problem you might face.
- “An Introduction to Statistical Learning: with Applications in R” by G. James, D. Witten, T. Hastie, and R. Tibshirani (Springer; 2021). A great overview of the most popular machine learning methods/algorithms with detailed how-to steps and R code. The more technical version of this book is listed in the next section.
- “The Hundred-Page Machine Learning Book” by A. Burkov (2019). A beautiful, minimalistic walkthrough of the most important ML concepts and algorithms with just the right level of technical detail. Importantly, the digital version is free on the author’s website.
(Advanced) Machine Learning
Sometimes the fun is in digging deeper. And this is certainly true for machine learning models.
- “The Elements of Statistical Learning: Data Mining, Inference, and Prediction” by T. Hastie, R. Tibshirani, and J. Friedman (Springer; 2016). I regret over-using the word “bible” because I really should have left it for this one. Hastie, Tibshirani, and Friedman, all at Stanford, have long been an inspiration to me. So, I take everything they write seriously. And for good reason.
- “Machine Learning Engineering” by A. Burkov (2020). A sequel to Burkov’s earlier book with a focus on the data engineering side of ML.
(Beginner) Causal Inference
These resources will gently guide you through the most used tools for isolating causal effects from non-experimental data.
- “Causal Inference: The Mixtape” by S. Cunningham (Yale University Press). A relatively new yet well-established book in the world of causal inference. You can find a free version online, but I recommend getting the paper copy immediately. Cunningham successfully balances theoretical details and hands-on applications (it even includes tons of R and Stata code).
- “Mastering ‘Metrics: The Path from Cause to Effect” by J. Angrist and J.-S. Pischke (Princeton University Press). After their earlier book’s enormous success, Angrist and Pischke published a more accessible version. I, like most grad students in the field, spent numerous days banging my head against each chapter of the earlier one. So, I can’t recommend this less technical book enough. Joshua Angrist recently won the Nobel Prize for his contributions to causal inference, so you are definitely in good hands.
(Advanced) Causal Inference
Sometimes the basics are not enough.
- “Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction” by G. Imbens and D. Rubin (Cambridge University Press; 2015). Did I already use the word “bible”? This tome feels a bit too dense but is an indispensable desk buddy for anyone who takes causal inference seriously. That said, it is probably too technical for most data scientists in the industry. Guido Imbens also won the Nobel Prize, while Don Rubin is credited with the dominant Potential Outcomes framework.
- “Impact Evaluation: Treatment Effects and Causal Analysis” by M. Frölich and S. Sperlich (Cambridge University Press; 2019). This lesser-known resource is a great addition to Imbens and Rubin’s work. It contains chapters that you do not find outside of research papers – e.g., quantile treatment effects in propensity score or regression discontinuity models.