How to Onboard an Entry-Level Data Scientist
“The conceptual aspect of data science hasn’t really changed — the decision tree, the random forest model, those are always going to stay the same,” said Danny Malter, a former data science manager at MillerCoors and Hyatt Hotels Corporation who now runs an independent data consultancy.
But one notable shift does stand out: With data science and machine learning classes now widely offered at the undergraduate level, entry-level hires are starting with more expertise than they once did, and expectations have correspondingly climbed.
“The quality of applicants at the junior level has changed.... So the expectations were pretty high for entry-level data science positions” at both MillerCoors and Hyatt, Malter told Built In.
Still, even impressive new hires require a steady mentoring hand in order to succeed and meaningfully contribute.
Data science grads may be more competent than ever, but the divide between academic training and real-world practice will always persist — and that prompts some important questions for leadership. What first project should a manager assign? How can a manager best frame an entry-level hire’s role in terms of larger business goals? What common expertise holes need to be plugged?
We asked Malter, who’s also a former data science instructor at bootcamp provider General Assembly, to give us the rundown on how he helps industry newcomers ramp up.
It All Starts with Cleaning Data
Considering Malter’s background in the beverage industry, a beer metaphor seems apt. The oft-repeated brewer’s adage states that “brewing is 90 percent cleaning and 10 percent paperwork.” That aligns near-perfectly with data science’s so-called 80/20 rule, which states that a data scientist spends some 80 percent of their time sorting and cleaning data and just 20 percent actually analyzing it. The ratio is by no means hard and fast, but, like the brewing maxim, it speaks to the job’s essential nature, especially at the ground-floor level: Be ready to spend a lot of time cleaning up.
Week one should be all about the unglamorous work of learning about, and sorting, the organization’s data sets. “Anybody starting on a data analytics or data science team, it’s really important that they spend probably a good week, if not more, just going through the data,” Malter said.
It’s an ideal starting point for a couple of reasons. First, recent grads often have limited experience working with dirty data sets. Malter recalled the relatively trimmed and manicured data he encountered while studying predictive analytics at DePaul University in Chicago.
“In an academic setting, you’re often getting a clean data set from online, maybe through Kaggle or some other open-source data set,” he said. “There’s not a lot of cleaning of the data, not a lot of variety of data — but those are things everybody comes across in the field and should expect coming into a new company.”
Second, even though learning to clean the dirty data may be monotonous, everything builds from there. “Combing through data can be a little bit tedious,” he said. “But it’s so critical that there’s an understanding of the data and its features, what different data sets the company has, how different tables get merged together.”
It’s not the “fun part” of data science, Malter admitted, but it’s an absolute prerequisite before delving into more advanced analysis. “I don’t think any newcomer on a team should be thrown right into a project,” he said. “Because that practice with the data is really important.”
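As a rough illustration of what that first week of “going through the data” can look like, here is a minimal Python sketch that profiles a small table: it counts missing values per column, compares raw and normalized distinct values to surface inconsistent labels, and flags repeated keys. The rows, column names and helper functions are all hypothetical, invented for this example rather than drawn from any company’s data.

```python
from collections import Counter

# Hypothetical sample rows; column names and values are invented for illustration.
rows = [
    {"customer_id": "1001", "tier": "gold", "state": "IL"},
    {"customer_id": "1002", "tier": "Gold ", "state": "il"},
    {"customer_id": "1003", "tier": "", "state": "WI"},
    {"customer_id": "1003", "tier": "silver", "state": None},
]


def profile(rows):
    """Per column: missing count, distinct raw values, distinct normalized values."""
    columns = sorted({key for row in rows for key in row})
    summary = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        # Treat None and empty strings as missing.
        present = [v for v in values if v not in (None, "")]
        # Trim whitespace and lowercase to reveal labels that differ only in formatting.
        normalized = {str(v).strip().lower() for v in present}
        summary[col] = {
            "missing": len(values) - len(present),
            "distinct_raw": len(set(present)),
            "distinct_normalized": len(normalized),
        }
    return summary


def duplicate_keys(rows, key):
    """Return key values that appear on more than one row."""
    counts = Counter(row.get(key) for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


report = profile(rows)
dupes = duplicate_keys(rows, "customer_id")

for column, stats in report.items():
    print(column, stats)
print("duplicate customer_id values:", dupes)
```

A gap between the raw and normalized distinct counts (as with “gold” and “Gold ” here) is the kind of detail worth raising with whoever owns the data before any modeling starts.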
Master Your Domain
New entry-level data scientists are bound to face a few practical pitfalls as they develop what’s known in the field as domain knowledge (the field- or company-specific details of the data) and gather more experience across a spectrum of data. Both are critical early steps.
“Being comfortable with a wide variety of data sets is very important,” Malter said.
The largest category of commercial data-gathering is the combination of users’ web-tracking and sales data. That intersection is potentially packed with lucrative insights, but it can also be a minefield of confounding naming conventions that new arrivals need help navigating.
“Companies can structure data in very complex ways,” Malter said. “So there could be variable names that don’t make sense, and you’d have to proactively reach out to data engineers on your team for what those feature names actually mean.”
Malter hypothesizes a customer rewards program, where the categorical features are bronze, silver and gold: “Maybe the actual definition of gold isn’t specified anywhere in the data, but you need to understand [the details], because it might have implications on how you interpret or share the data.”
Features are mutable, too, so a date-range analysis would be misleading if, say, the definition of gold status has changed over time. “Those understandings are more difficult at the beginning,” Malter said. “Realistically, a company’s not going to have a definition for every single category, across all the data, but hopefully they have some dictionary of the data column names, if not necessarily the data being fed into them.”
Some industries require working more with third-party data management firms, which means clarifying murky data might be even trickier. In that case, encourage new hires to see what answers can be gleaned online, even if they still end up having to tap a shoulder within the office.
“Other people may have asked your question if it’s a common data set, even if it’s purchased from a company like Nielsen or IRI,” Malter said. “But it’s very likely that you’ll have to reach out to whoever’s in charge of the contract for the company that provided the data to figure out what things mean.”
Just as newcomers have to get acquainted with a diversity of data, managers should also encourage them to develop a decent knowledge of whichever statistical programming language they’re less familiar with, if they haven’t already. “There’s kind of a debate in the data science world: R or Python,” Malter said. “I tell people to get really good at one language but be willing to know a little bit about the other as well.”
“The concepts stay constant, so it doesn’t matter if you’re a Python user and others at the company are R,” he said. “You should be able to easily adapt to R if the company wants you to use R or vice versa.”
Get Them Up to Speed on SQL
Its pronunciation may be debated, but whether you’re Team Sequel or Team S-Q-L, there’s little doubt that SQL is a must-have in the data science toolkit. And yet, according to Malter, entry-level data scientists are often curiously under-trained in the language, which is commonly used to merge data sources.
“SQL isn’t often taught in academic settings when it comes to data science, at least not in depth,” he said. “But if you come into a data science team, you’re very well going to be using SQL to collect your data.”
A candidate’s lack of mastery in SQL shouldn’t necessarily be a dealbreaker. But that means a manager should be prepared to spend some time helping entry-level hires fine-tune the skill — which is easier said than done, of course, given time constraints.
“There’s appropriate times to be helping and tutoring people in the workplace, but you also realistically only have so much time in a day,” he said. “You can’t sit down next to somebody all day helping them.”
Striking that balance is critical, especially given how important — if relatively unglamorous — SQL is in an entry-level role.
“Predictive modeling, R and Python — those are the cool parts, the nice and shiny part of data science. But the real-world truth is that writing SQL queries is a big part of the job as well,” he said. “Because that’s how you collect your data.”
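For a sense of the kind of query a new hire might be asked to write, the sketch below uses Python’s built-in sqlite3 module to merge two small tables with a SQL join. The table names, columns and values are assumptions made up for illustration (loosely echoing the web-tracking and sales data mentioned earlier), not a schema from any company Malter worked with.

```python
import sqlite3

# In-memory database with two hypothetical tables; names and values are invented.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE web_visits (customer_id INTEGER, visits INTEGER);
    CREATE TABLE sales (customer_id INTEGER, amount REAL);
    INSERT INTO web_visits VALUES (1, 5), (2, 3), (3, 8);
    INSERT INTO sales VALUES (1, 20.0), (1, 15.0), (3, 40.0);
    """
)

# LEFT JOIN keeps every visitor, including those with no purchases;
# COALESCE turns the resulting NULL total into 0.0.
query = """
    SELECT v.customer_id,
           v.visits,
           COALESCE(SUM(s.amount), 0.0) AS total_sales
    FROM web_visits AS v
    LEFT JOIN sales AS s ON s.customer_id = v.customer_id
    GROUP BY v.customer_id, v.visits
    ORDER BY v.customer_id
"""
merged = conn.execute(query).fetchall()
conn.close()

for customer_id, visits, total_sales in merged:
    print(customer_id, visits, total_sales)
```

The choice of join matters: an inner join would silently drop customer 2, who visited but never bought, which is exactly the sort of subtlety that takes practice to catch.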
Show New Hires How They Contribute
An entry-level data scientist won’t get far without having the hard-skill finer points covered, but they also need to clearly see how their work fits into big-picture organizational goals.
There’s a rough split in how companies tend to apply their data. Some data science teams essentially work side-by-side with marketing, with the data directly driving sales outreach and strategy. Think recommendation models built for sales teams to use with clients.
Other teams function more similarly to research and development, developing projects independently and generating data that may not immediately inform on-the-ground sales and marketing tactics. “In that case, it’s a little bit more difficult to see how the work impacts the business and harder to keep people engaged in that way,” he said. “And from a leadership perspective, that’s a tough message to convey.”
In those environments, managers may want to keep an eye out for opportunities to backtest, a term for running what would have been a predictive model after the fact in order to compare its output against actual results. That way, even if the data team develops a model that sales or marketing doesn’t run with, the team can bolster its credibility by proving the model’s effectiveness down the road.
“Hopefully you’ll see, for instance, a forecasting model turned out to be really on point with what the actual numbers were. ‘We were predicting we would bring in X dollars in revenue six months ago, and that’s nearly exactly what we have today,’” he said.
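A backtest along those lines can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: the revenue figures are invented, and a simple three-month moving average stands in for whatever forecasting model a team actually built. At each step the forecast uses only the data that would have been available at the time, and the predictions are then scored against what really happened.

```python
def moving_average_forecast(history, window=3):
    """Stand-in model: predict the next value as the mean of the last `window` values."""
    recent = history[-window:]
    return sum(recent) / len(recent)


def backtest(series, start, window=3):
    """Walk forward through the series, forecasting each point from earlier data only."""
    results = []
    for t in range(start, len(series)):
        predicted = moving_average_forecast(series[:t], window)
        results.append((t, predicted, series[t]))
    return results


def mean_absolute_percentage_error(results):
    """Average absolute error as a percentage of the actual values."""
    errors = [abs(actual - predicted) / abs(actual) for _, predicted, actual in results]
    return 100 * sum(errors) / len(errors)


# Hypothetical monthly revenue figures, invented for illustration.
revenue = [100.0, 110.0, 120.0, 110.0, 100.0, 110.0]

results = backtest(revenue, start=3)
mape = mean_absolute_percentage_error(results)

for month, predicted, actual in results:
    print(f"month {month}: predicted {predicted:.1f}, actual {actual:.1f}")
print(f"MAPE: {mape:.2f}%")
```

The important design choice is the slice `series[:t]`: letting the model see data from after the forecast date would make the backtest look better than the model deserves.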
That can help foster a sense of investment on the part of a trainee, even if the nature of the organization creates a delay in seeing how their work is contributing overall. “There’s always tangible ways of seeing how the work is performing over time,” Malter said.