As Alexander Pope said, to err is human. By that metric, who is more human than us data scientists? We devise wrong hypotheses constantly and then spend time working on them just to find out how wrong we were.
When looking at mistakes from an experiment, a data scientist needs to be critical, always on the lookout for something that others may have missed. But sometimes, in our day-to-day routine, we can easily get lost in little details. When this happens, we often fail to look at the overall picture, ultimately failing to deliver what the business wants.
Our business partners have hired us to generate value. We won’t be able to generate that value unless we develop business-oriented critical thinking, including having a more holistic perspective of the business at hand. So here is some practical advice for your day-to-day work as a data scientist. These recommendations will help you to be more diligent and more impactful at the same time.
1. Beware of Clean Data Syndrome
Tell me how many times this has happened to you: You get a data set and start working on it straight away. You create neat visualizations and start building models. Maybe you even present automatically generated descriptive analytics to your business counterparts!
But do you ever ask, “Does this data actually make sense?” Incorrectly assuming that the data is clean could lead you toward very wrong hypotheses. Not only that, but you’re also missing an important analytical opportunity with this assumption.
You can actually discern a lot of important patterns by looking at discrepancies in the data. For example, if you notice that a particular column has more than 50 percent of its values missing, you might think about dropping it. But what if those values are missing because the data collection instrument has an error? By calling attention to this, you could help the business improve its processes.
Or what if you’re given a distribution of customers that shows a ratio of 90 percent men versus 10 percent women, but the business is a cosmetics company that predominantly markets its products to women? You could assume you have clean data and present the results as is, or you could use common sense and ask your business partner whether the labels are switched.
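As a rough illustration, a couple of quick pandas checks like the ones below can surface both kinds of problems before any modeling starts. The file and column names here (customers.csv, gender) are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical customer data set; file and column names are placeholders.
df = pd.read_csv("customers.csv")

# 1. How much of each column is missing? A column that is more than half empty
#    may point to a broken collection process, not just something to drop.
missing_share = df.isna().mean().sort_values(ascending=False)
print(missing_share[missing_share > 0.5])

# 2. Do categorical distributions pass the smell test? A cosmetics retailer
#    whose customers are 90 percent "male" probably has swapped labels.
print(df["gender"].value_counts(normalize=True))
```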
Such errors are widespread. Catching them not only helps improve future data collection processes, but also prevents the company from making wrong decisions by keeping other teams from relying on bad data.
2. Be Aware of the Business
You probably know fab.com. If you don’t, it’s a website that sells selected health and fitness items. But the site’s origins weren’t in e-commerce. Fab.com started as Fabulis.com, a social networking site for gay men. One of the site’s most popular features was called the “Gay Deal of the Day.”
One day, the deal was for hamburgers. Half of the deal’s buyers were women, despite the fact that they weren’t the site’s target users. This fact caused the data team to realize that they had an untapped market for selling goods to women. So Fabulis.com changed its business model to serve this newfound market.
Be on the lookout for something out of the ordinary. Be ready to ask questions. If you see something in the data, you may have hit gold. Data can help a business to optimize revenue, but sometimes it has the power to change the direction of the company as well.
Another famous example of this is Flickr, which started out as a multiplayer game. Only when the founders noticed that people were using it as a photo upload service did the company pivot into the photo-sharing service we know today.
Try to see patterns that others would miss. Do you see a discrepancy in some buying patterns or maybe something you can’t seem to explain? That might be an opportunity in disguise when you look through a wider lens.
3. Focus on the Right Metrics
What do we want to optimize for? Most businesses fail to answer this simple question.
Every business problem is a little different and should, therefore, be optimized differently. For example, a website owner might ask you to optimize for daily active users, a metric defined as the number of people who open the product on a given day. But is that the right metric? Probably not! In reality, it’s just a vanity metric, one that makes you look good but isn’t actionable. This metric will always increase as long as you’re spending marketing dollars across various channels to bring more and more visitors to your site.
Instead, I would recommend optimizing for the percentage of users who are active, which gives a much better picture of how the product is performing. A big marketing campaign might bring a lot of users to the site, but if only a few of them become active, the campaign was a failure and the site’s stickiness is very low. The second metric captures that stickiness; the first one doesn’t. If the percentage of active users is increasing, that’s a real sign people like the product.
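As a rough sketch of the difference, the snippet below computes both the raw daily active users and the share of the registered user base that is active on a typical day. The table and column names (users.csv, events.csv, user_id, date) are hypothetical.

```python
import pandas as pd

# Hypothetical tables: one row per registered user, one row per user action.
users = pd.read_csv("users.csv")
events = pd.read_csv("events.csv", parse_dates=["date"])

daily_active = events.groupby("date")["user_id"].nunique()

# Vanity metric: raw DAU, which rises whenever marketing buys more traffic.
print("Average DAU:", round(daily_active.mean()))

# More telling metric: the share of the user base active on a typical day.
pct_active = daily_active.mean() / users["user_id"].nunique()
print(f"Average share of users active per day: {pct_active:.1%}")
```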
Another example of looking at the wrong metric happens when we create classification models. We often try to increase accuracy for such models. But do we really want accuracy as a metric of our model performance?
Imagine that we’re predicting whether an asteroid will hit the Earth. If we want to optimize for accuracy, we can just predict “no hit” every time, and we will be 99.99 percent accurate. That 0.01 percent error could be hugely impactful, though. What if that 0.01 percent is a planet-killing-sized asteroid? A model can be reasonably accurate but not at all valuable. A better metric would be the F1 score, which would be zero in this case because the recall of such a model is zero: it never predicts an asteroid hitting the Earth.
When it comes to data science, designing the project and the metrics we want to use for evaluation is much more important than the modeling itself. The metrics need to reflect the business goal; aiming at the wrong goal effectively destroys the whole purpose of modeling. For asteroid prediction, for example, the F1 score or the area under the precision-recall curve (PR AUC) is a better metric because both take the model’s precision and recall into account. If we optimize for accuracy instead, our whole modeling effort could be in vain.
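Here is a minimal illustration with scikit-learn on synthetic, heavily imbalanced labels. The numbers are made up, but they show how a model that always predicts “no hit” can post near-perfect accuracy while the F1 score and PR AUC expose how useless it is.

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score, f1_score

# Synthetic, heavily imbalanced labels: 1 = "asteroid hits", 0 = "no hit".
y_true = np.zeros(100_000, dtype=int)
y_true[:10] = 1  # only 10 positives in 100,000 samples

# A "model" that always predicts "no hit", with a constant score of 0.
y_pred = np.zeros_like(y_true)
y_score = np.zeros(y_true.shape, dtype=float)

print("Accuracy:", accuracy_score(y_true, y_pred))             # 0.9999, looks great
print("F1 score:", f1_score(y_true, y_pred, zero_division=0))  # 0.0, tells the truth
print("PR AUC:  ", average_precision_score(y_true, y_score))   # ~0.0001, near zero
```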
4. Statistics Lie Sometimes
Be skeptical of any statistics that get quoted to you. Statistics have been used to lie in advertisements, in workplaces, and in a lot of other arenas in the past. People will do anything to get sales or promotions.
For example, do you remember Colgate’s claim that 80 percent of dentists recommended their brand? That statistic seems pretty convincing at first. If so many dentists recommend Colgate, I should use it too, right? It turns out that in the survey, the dentists could choose multiple brands rather than just one, so other brands could have been just as popular as Colgate.
Marketing departments are myth creation machines, and we see such examples in our daily lives all the time. Take this 1992 ad from Chevrolet. Looking at just the graph and not at the axis labels, it looks like Nissan and Datsun must be dreadful truck manufacturers. In fact, the graph indicates that more than 95 percent of the Nissan and Datsun trucks sold in the previous 10 years were still running, and the small difference might just be due to sample sizes and the types of trucks each company sold. As a general rule, never trust a chart that doesn’t label the Y-axis.
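To see why an unlabeled or truncated axis is so misleading, here is a quick matplotlib sketch with made-up numbers in the spirit of that ad: the same two bars look wildly different depending on whether the Y-axis starts at 95 or at zero.

```python
import matplotlib.pyplot as plt

# Made-up numbers for illustration only: share of trucks still on the road.
brands = ["Brand A", "Brand B"]
still_running = [98.1, 95.7]

fig, (ax_trunc, ax_full) = plt.subplots(1, 2, figsize=(8, 3))

# Truncated Y-axis: a roughly 2-point gap looks like a chasm.
ax_trunc.bar(brands, still_running)
ax_trunc.set_ylim(95, 99)
ax_trunc.set_title("Truncated axis")

# Y-axis starting at zero: the difference nearly disappears.
ax_full.bar(brands, still_running)
ax_full.set_ylim(0, 100)
ax_full.set_title("Full axis")

plt.tight_layout()
plt.show()
```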
During the ongoing pandemic, we’re seeing even more such examples, with plenty of studies promoting supposed cures for COVID-19. This past June in India, a man claimed to have made a medicine for coronavirus that cured 100 percent of patients in seven days. The news predictably caused a big stir, but only after he was asked about the sample size did we understand what was actually happening. With a sample size of just 100, the claim was ridiculous on its face. Worse, the way the sample was selected was hugely flawed: his organization selected asymptomatic and mildly symptomatic patients with a mean age between 35 and 45 and no pre-existing conditions. I was dumbfounded. This was not even a random sample. So not only was the study useless, it was actually unethical.
When you see charts and statistics, remember to evaluate them carefully. Make sure the statistics were sampled correctly and are being used in an ethical, honest way.
5. Don’t Give in to Fallacies
During the summer of 1913, in a casino in Monaco, gamblers watched in amazement as the roulette wheel landed on black an astonishing 26 times in a row. Since red and black come up with essentially equal probability on any given spin, they were confident that red was “due.” It was a field day for the casino and a perfect example of the gambler’s fallacy, a.k.a. the Monte Carlo fallacy.
This happens in everyday life outside of casinos too. People tend to avoid long strings of the same answer. Sometimes they do so while sacrificing accuracy of judgment for the sake of getting a pattern of decisions that look fairer or more probable. For example, an admissions office may reject the next application they see if they have approved three applications in a row, even if the application should have been accepted on merit.
The world works on probabilities. There are more than seven billion of us, each living through events every second of our lives. At that sheer volume, rare events are bound to happen somewhere. But we shouldn’t bet our money on them.
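A quick simulation makes the point. Assuming a simplified 50/50 wheel (ignoring the green zero), the chance of red right after a long run of blacks is the same as the chance of red on any other spin:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate spins of a simplified wheel: 0 = black, 1 = red.
spins = rng.integers(0, 2, size=2_000_000)

# Look at every spin that immediately follows five blacks in a row.
streak = 5
windows = np.lib.stride_tricks.sliding_window_view(spins, streak + 1)
after_black_streak = windows[(windows[:, :streak] == 0).all(axis=1), streak]

print("P(red) overall:             ", spins.mean())
print("P(red) right after 5 blacks:", after_black_streak.mean())
# Both hover around 0.5: the wheel has no memory.
```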
Think also of the spurious correlations we end up seeing regularly. A famous chart, for instance, appears to show that organic food sales cause autism. Or is it the opposite? Just because two variables move in tandem doesn’t necessarily mean that one causes the other. Correlation does not imply causation, and as data scientists, it is our job to be on the lookout for such fallacies, biases, and spurious correlations. We can’t allow oversimplified conclusions to cloud our work.
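To see how easily a shared trend masquerades as a relationship, here is a toy sketch with two synthetic series that simply both grow over time. Neither has anything to do with the other, yet their correlation is near perfect.

```python
import numpy as np

rng = np.random.default_rng(7)
years = np.arange(2000, 2020)

# Two made-up series that share nothing except an upward trend over time.
organic_food_sales = 10 + 2.5 * (years - 2000) + rng.normal(0, 1.5, years.size)
unrelated_metric = 3 + 0.8 * (years - 2000) + rng.normal(0, 0.7, years.size)

corr = np.corrcoef(organic_food_sales, unrelated_metric)[0, 1]
print(f"Pearson correlation: {corr:.2f}")  # typically above 0.95
# A shared time trend alone is enough to produce a near-perfect correlation
# with no causal link whatsoever.
```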
Data scientists have a big role to play in any organization. A good data scientist must be both technically sound and business-driven to do the job well. That means making a conscious effort to understand the business’s needs while continuing to polish our technical skills.