Students in Matt Brems’ data science classroom come from all professional walks of life.
Brems is the global lead data science instructor at bootcamp purveyor General Assembly, where he teaches machine learning concepts to people around the world — some with a decent coding foundation, some with not much at all.
“We see people of all different backgrounds come in and be successful, whether they’re coming from a journalism background, a chemistry background, a marketing background, or anywhere in between,” Brems told Built In.
The same goes for Jeff Herman, lead data science instructor at Flatiron School, who sees a fair number of developers looking to onramp to ML.
“If you do freelance work, for example, you can charge a higher rate if you do machine learning versus just web development,” Herman told Built In. “So a lot of people, especially in software engineering, want to gain those skills.”
Programming chops are of course central to training and running machine learning models — it’s essentially the “hacking skills” circle of Drew Conway’s famous data science Venn diagram. But is familiarity with Python or R really sufficient to get started with machine learning? What would someone with a dev background — but not a data science background — need to learn in order to start working with ML?
We talked with Brems, Herman and Rochelle Terman, an assistant professor at the University of Chicago who previously taught a Machine Learning for Social Science course at Stanford University, to find out.
Taking ‘Baby Stats’
In the aforementioned diagram, Conway warns against the so-called “danger zone” — that is, building one’s programming bedrock and domain knowledge, but not the required statistics knowledge. People in the danger zone are able to analyze data in a way that appears to be sound, but they are unable to explain the process or meaningfully interpret results.
It’s tough to fall too deep into the danger zone, since building up subject matter and programming expertise should also leave one with at least passing stats skills, according to Conway, but “it does not take many [people] to produce a lot of damage.”
“It’s possible to run a line of code without error, and if you understand your data well enough and build a model in Python or in R, maybe you get results out of it,” Brems said. “But if you don’t know enough probability or linear algebra to step back and make sure your results are correct, things can go awry pretty quickly.”
Most machine learning algorithms are designed using probabilities. So broadly speaking, the need-to-know statistical skills include basic distribution concepts, such as whether or not two things are independent of one another and whether points within a data set fall within a normal distribution. Also important to know are fundamentals like confidence intervals, hypothesis testing, line of best fit, and enough linear algebra to understand how computers use matrices “to actually do data science,” Brems said.
Some optimization functions, regression analysis and Bayesian statistics, which are used in many algorithms, are also key to know, Herman added.
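To make one of those fundamentals concrete, here is a minimal sketch of a confidence interval for a sample mean. It uses only Python's standard library and a normal approximation, which is an assumption that is reasonable for large samples. The function name `mean_confidence_interval` and the simulated data are illustrative, not taken from any of the instructors' course materials.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, confidence=0.95):
    """Return (low, high) for the sample mean using a normal approximation."""
    n = len(sample)
    center = mean(sample)
    # Standard error of the mean: sample standard deviation over sqrt(n).
    standard_error = stdev(sample) / math.sqrt(n)
    # Two-sided critical value, roughly 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return center - z * standard_error, center + z * standard_error


random.seed(0)  # fixed seed so the illustration is repeatable
# Hypothetical data: 500 draws from a normal distribution (mean 10, sd 2).
sample = [random.gauss(10, 2) for _ in range(500)]
low, high = mean_confidence_interval(sample)
print(f"sample mean = {mean(sample):.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

Asking for more confidence (say, 0.99) widens the interval, which is a quick sanity check that the calculation behaves as the statistics say it should.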
“It’s not necessary for people to have a Ph.D. in statistics or a master’s degree in a STEM field,” but the above are fundamental in order to leverage machine learning in a professional environment, Brems said.
Put simply, the math bar isn’t prohibitively high.
Indeed, the prerequisites for Terman’s introductory machine learning class were basic data principles — such as defining types of variables — and “baby stats” — core concepts like variance and dimensionality.
To start, prioritize understanding the terminology over mastering technique: concepts like the bias-variance tradeoff, overfitting in regression models and out-of-sample mean squared error.
“The more specific techniques to estimate a model are probably less important than the more foundational knowledge,” Terman said.
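Overfitting and out-of-sample mean squared error are easier to grasp with a small experiment. The sketch below, which uses only the standard library, compares a straight-line fit with a deliberately overfit "memorizer" (a 1-nearest-neighbor model) on simulated data. The data-generating rule, the helper names and the choice of a memorizer as the overfit model are all assumptions made for illustration.

```python
import random


def mse(predict, xs, ys):
    """Mean squared error of a prediction function over paired data."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns a prediction function."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def fit_memorizer(xs, ys):
    """1-nearest-neighbor: predict the y of the closest training x."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def make_data(n, rng):
    """Hypothetical data: y = 2x + 1 plus Gaussian noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]
    return xs, ys


rng = random.Random(42)  # fixed seed so the illustration is repeatable
train_x, train_y = make_data(200, rng)
test_x, test_y = make_data(200, rng)

line = fit_line(train_x, train_y)
memorizer = fit_memorizer(train_x, train_y)

for name, model in [("line", line), ("memorizer", memorizer)]:
    print(name,
          "| in-sample MSE:", round(mse(model, train_x, train_y), 3),
          "| out-of-sample MSE:", round(mse(model, test_x, test_y), 3))
```

The memorizer reproduces its training data perfectly, so its in-sample error is zero, yet it does worse than the simple line on data it has not seen. That gap between in-sample and out-of-sample error is what overfitting looks like in practice.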
Resources to Know
Newcomers looking to shore up that foundational knowledge might encounter some paradox-of-choice paralysis. There are so many quality online resources, many of them free, that it can be difficult to know where to start. But all three experts with whom we spoke pointed to An Introduction to Statistical Learning, an R-focused stats overview, sometimes referred to simply as ISL in data circles, and available to download free as a PDF.
“This is the gold standard for statistics textbooks, and it’s really approachable for people who don’t have a statistics background,” Herman said.
The authors clearly lay out the math behind various modeling techniques, Brems said. “It does presume the reader is comfortable with some mathematical notation, but if you’ve heard of linear regression before, you should be able to go through and understand it,” he added.
That said, if you crack ISL and discover the first chapters to be too dense, check out online resources, like Medium posts, about basics like confidence intervals and hypothesis testing, then come back once your foundation is stronger, Brems suggested.
On the off chance that readers find it too rudimentary, there’s Elements of Statistical Learning. “They’re basically the same book except Elements is written to statisticians and Introduction is written toward people who are not statisticians,” said Terman, who taught Introduction in her Stanford ML class.
For the data side, Terman recommends the O’Reilly book R for Data Science, another free online option. With a focus on Tidyverse and R, it dives into working with data across the full workflow — “how to import data, munge it, restructure it, and also how to actually run, interpret and plot some linear models,” she said.
“So much in machine learning and data science is all the steps you have to do before you ever actually estimate a model, like cleaning and merging a data set from multiple different sources,” Terman added.
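The book Terman recommends works in R, but the cleaning-and-merging step she describes looks much the same in any language. Here is a rough standard-library Python sketch, assuming two small, made-up sources with inconsistent column names and stray whitespace. The `clean` and `inner_join` helpers and all of the records are hypothetical; a real project would more likely lean on a data-frame library.

```python
def clean(rows, required):
    """Normalize keys and string values; drop rows missing a required field."""
    cleaned = []
    for row in rows:
        tidy = {
            key.strip().lower(): (value.strip() if isinstance(value, str) else value)
            for key, value in row.items()
        }
        if all(tidy.get(field) not in (None, "") for field in required):
            cleaned.append(tidy)
    return cleaned


def inner_join(left, right, key):
    """Merge two lists of dicts on a shared key, keeping only matches."""
    lookup = {row[key]: row for row in right}
    return [{**row, **lookup[row[key]]} for row in left if row[key] in lookup]


# Hypothetical raw data from two sources with inconsistent formatting.
customers = [
    {"ID ": " 1", "Name": "Ada "},
    {"ID ": "2", "Name": ""},        # missing name, dropped by clean()
    {"ID ": "3", "Name": "Grace"},
]
orders = [
    {"id": "1", "total": "19.99"},
    {"id": "3", "total": " 5.00"},
    {"id": "4", "total": "7.50"},    # no matching customer, dropped by the join
]

customers_clean = clean(customers, required=["id", "name"])
orders_clean = clean(orders, required=["id", "total"])
merged = inner_join(customers_clean, orders_clean, key="id")

# Convert the numeric column only after the text has been cleaned.
for row in merged:
    row["total"] = float(row["total"])

print(merged)
```

Only the two customers that survive cleaning and have a matching order remain, and no model has been estimated yet.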
Working with Algorithms
Once you’re ready to begin diving into algorithms, Herman recommends starting with the humble linear regression — “the baseline for a lot of other algorithms” — and applying it to some well-known test-case data sets. That could include determining the probability that a person will have survived the Titanic disaster using that ship’s data set, classifying a flower’s species given a number of flower characteristics with the iris data set, and forecasting at what price a house would sell using the Boston housing data set.
All are available on Kaggle, and most are also on scikit-learn, the popular machine learning library. ML libraries like PyTorch and TensorFlow have democratized deep learning algorithms, but Herman recommends sticking to scikit-learn early on. (Scikit-learn boasts several tutorials, but Hands-On Machine Learning With Scikit-Learn, Keras, and TensorFlow, also published by O’Reilly, is well-regarded in data circles.)
Software folks will likely be able to more quickly glean what’s going on under the hood, so they might want to try coding some from scratch once they become familiar with how the algorithms work.
“If you know how to code linear regression, random forests or support vector machine algorithms from scratch, then you’ll really have a good understanding of what’s going on,” Herman said.
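As a taste of what coding an algorithm from scratch involves, here is one possible sketch of single-feature linear regression fit by batch gradient descent, in plain Python with no libraries. The learning rate, epoch count and toy data are assumptions chosen so this small example converges. This is not Herman's curriculum code, and libraries such as scikit-learn solve the same problem by other means.

```python
def fit_linear_regression(xs, ys, learning_rate=0.01, epochs=5000):
    """Fit y = intercept + slope * x by batch gradient descent on MSE."""
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction error for every point under the current parameters.
        errors = [intercept + slope * x - y for x, y in zip(xs, ys)]
        # Gradients of mean squared error with respect to each parameter.
        grad_intercept = 2 * sum(errors) / n
        grad_slope = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        # Step downhill.
        intercept -= learning_rate * grad_intercept
        slope -= learning_rate * grad_slope
    return slope, intercept


# Toy data generated from y = 2x + 1 with no noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_linear_regression(xs, ys)
print(f"slope ~ {slope:.3f}, intercept ~ {intercept:.3f}")
```

On this noiseless data the fit should land very close to a slope of 2 and an intercept of 1. Trying a much larger learning rate and watching the parameters diverge is a useful way to see why that setting matters.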
Common Pitfalls
Thanks to a plethora of free and low-cost online resources, the barrier to entry for machine learning is significantly lower than it was even a few years ago. Still, there are some common stumbling blocks of which newcomers should be wary.
Beginners often have a difficult time explaining their results to a non-technical audience. Herman recalled his time working for the BNSF Railway, where he analyzed data to improve business performance and safety.
“You’ll get past the problem in question, build a machine learning algorithm, if that’s appropriate, and then present that to non-technical stakeholders at your company,” he said. “But if you can’t explain your results to non-technical people, nothing’s ever going to come from your analysis.”
He recommends giving a slide-deck presentation to friends or family who don’t work in tech and noting if there’s any information they don’t understand. “That should be a red flag,” he said.
Another common pitfall, particularly among those with some coding experience, is jumping too quickly into building algorithms. While ML newcomers who happen to have some coding or software development experience may have a leg up in terms of explaining tech to laypeople, that won’t necessarily help them here.
Locate your inner data analyst and get very comfortable with cleaning the data. (Herman points to the oft-cited 80-20 ratio.) “It’s tough because machine learning modeling is the most exciting part,” he said. “Making predictions is really fun, but you get so much better results by having a deep understanding of your data set.”
Similarly, you can often get strong results with a fraction of the processing power and training time by using a humble linear regression over deep learning, he added.
There’s also the tricky business of knowing when you have learned enough to elevate from student to real-world practitioner. “There’s no bright line as to at what point it’s OK to move from playing on your computer and practicing with data to putting models into production or driving business decisions with your analysis,” Brems said. In the end, it’s on the pupil to be honest about their own expertise level and move forward responsibly.
Speaking of fuzziness, newcomers should also be cognizant of the black-box problem, even if they’re not building anything so complex that deep explainability is an immediate issue. “Being able to explain the results of your models really clearly is super important,” Herman said.
It’s difficult to trust a model’s outcomes if we can’t transparently see how those outcomes were determined. It could also be dangerous, namely if it helps mask biases within the data. That might mean defaulting to a less complex, but more interpretable route, like linear regression. “Sometimes you have to sacrifice performance for interpretability,” he added.
All those cautions noted, there’s no substitute for getting your hands dirty. You don’t have to know how to build a car in order to drive, Terman said by way of analogy.
“Be humble with your conclusions, but just get in there,” Terman said. “Start estimating some models. Start working with data. That’s going to be the best learning experience, more than just reading about it.”