Is Kaggle Worth It for Data Scientists?
Kaggle is a well-known platform that lets users participate in predictive modeling competitions, explore and publish data sets, and access training accelerators. It’s a great ecosystem for engaging, connecting, and collaborating with other data scientists to build amazing machine learning models.
Over the years, Kaggle has gained popularity by running competitions that range from fun brain exercises to commercial contests that award monetary prizes and rank participants. Participating in these competitions can also open the door to recruitment by top firms. A lot of companies that are bogged down by tough data science problems or lack an in-house team look to Kaggle contests to fill that void.
Without a doubt, Kaggle is the largest online community for data scientists. For beginners looking to embark on their journey in the field, Kaggle is a valuable platform to get started and build a shining portfolio.
But should an aspiring data scientist rely solely on Kaggle to get a foot in the door of the industry? Do data scientists need to keep Kaggling?
Personally, I believe that data scientists shouldn’t use Kaggle as a yardstick. In fact, aside from educational purposes and its usefulness in discovering data sets, I prefer to stay away from Kaggle contests completely. There are two major reasons why I feel it’s not worth the time for aspiring data scientists in the long run.
Kaggle Contests Can Never Simulate Real-World Problems
For the uninitiated, Kaggle competitions can either be posted publicly for all interested contenders or run privately for a select few top-rated participants. The host of the contest has to prepare the data and provide detailed descriptions of the problem at hand.
Now, the main concern with Kaggle is that users are spoon-fed the data they must use. In other words, data scientists in a competition get to work right away on data that has already been cleaned. Real-world problems, on the other hand, are a different animal altogether, and Kaggle competitions never represent them.
Forget data cleaning; the business problems that you’d receive in the real world aren’t anywhere near as straightforward as the ones on Kaggle. A data scientist’s role outside of contests involves way more than just crunching the numbers. It requires having domain expertise, finding, extracting, and cleaning the relevant data, running code, deploying models on live data, analyzing trade-offs such as accuracy, speed, size, and portability, and ultimately determining whether a solution is feasible. Unlike Kaggle contests, in which you’re limited to one data set with the sole objective of achieving the best accuracy, the real practice of data science is much more complex. In reality, it’s mining the data, not just analyzing it, that makes the difference between an okayish model and a great one.
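To make the trade-off point concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `Candidate` class, the `pick_model` helper, the three model names, and all of the numbers are hypothetical and do not come from any real contest or deployment. It only shows how the "best" model can change once speed and size budgets enter the picture.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A hypothetical trained model with made-up measurements."""
    name: str
    accuracy: float    # held-out accuracy, between 0 and 1
    latency_ms: float  # time per prediction, in milliseconds
    size_mb: float     # size of the serialized model, in megabytes


def pick_model(candidates, max_latency_ms, max_size_mb):
    """Return the most accurate candidate that fits the budget, or None."""
    feasible = [
        c for c in candidates
        if c.latency_ms <= max_latency_ms and c.size_mb <= max_size_mb
    ]
    if not feasible:
        return None  # no candidate is deployable under these constraints
    return max(feasible, key=lambda c: c.accuracy)


# Illustrative numbers only (assumptions, not measurements).
candidates = [
    Candidate("stacked_ensemble", 0.912, 180.0, 950.0),
    Candidate("gradient_boosting", 0.905, 12.0, 40.0),
    Candidate("logistic_regression", 0.880, 0.5, 0.1),
]

# A leaderboard only looks at accuracy.
leaderboard_choice = max(candidates, key=lambda c: c.accuracy)

# A deployment also has to respect latency and size limits.
production_choice = pick_model(candidates, max_latency_ms=50.0, max_size_mb=100.0)

print("leaderboard pick:", leaderboard_choice.name)
print("production pick:", production_choice.name)
```

With these made-up figures, the accuracy-only pick and the budget-aware pick are different models, which is the kind of judgment call a contest never asks you to make.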
So, although Kaggle contests are good practice for beginners, they only require dealing with the tip of the iceberg. Participating in one is miles away from doing real-life projects. Worse, Kaggle literally solves most of the problems for you. This structure ends up giving aspiring data scientists the wrong expectations and a false view of the industry.
Kaggle Contests Can Be Overwhelming for Newcomers
As if their limited-to-nonexistent real-world relevance weren’t enough, Kaggle contests also put data scientists in a rat race. Now, I know Kaggle competitions are fun, and in no way am I trying to discourage you from participating in them. But, in the end, the highly competitive, rewards-based structure of Kaggle can be intimidating for many people, especially when they’re just starting out.
My primary concern with Kaggle contests is that they put you in a competitive mindset wherein the goal of data science shifts from creating the best algorithm to gaining those extra 0.001 points with hopes of getting into the top few spots. The truth is, making the top 0.1 percent on Kaggle’s leaderboard isn’t a cakewalk, no matter how good you are. Rewarding marginal accuracy gains with a better ranking might be a good move from Kaggle to gamify data science and keep people coming back, but it’s freakishly addictive and not that beneficial for data scientists themselves.
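For a rough sense of scale, the sketch below compares a 0.001 gain with the sampling noise of an accuracy estimate. It rests on assumptions that are mine, not Kaggle’s: the score is plain accuracy, the test set holds 10,000 independent examples, the baseline accuracy is 0.90, and the usual binomial approximation applies. The `accuracy_standard_error` helper is a hypothetical name used only here.

```python
import math


def accuracy_standard_error(accuracy, n_samples):
    """Binomial standard error of an accuracy measured on n_samples examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


# Illustrative assumptions, not figures from any real leaderboard.
accuracy = 0.90      # baseline accuracy
n_samples = 10_000   # size of the hidden test set
gain = 0.001         # the improvement being chased

se = accuracy_standard_error(accuracy, n_samples)

print(f"standard error of the score: {se:.4f}")
print(f"gain as a fraction of one standard error: {gain / se:.2f}")
```

Under these assumptions the standard error is about 0.003, so a 0.001 improvement is roughly a third of one standard error. On a test set of that size, such a gain is hard to tell apart from luck.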
Points and ranks only make people more and more obsessed with the competition itself, and they end up having less fun with the data, limiting themselves to pleasing Kaggle’s algorithm. Although it might seem strange, creative thinking plays a huge role in data science. There are countless ways to interpret data and extract something meaningful from it. The tailor-made data sets that Kaggle provides limit creativity since the data scientist doesn’t need to explore the art of combining different data. For newcomers, this attitude and the focus on competition can easily become a vicious cycle. Being bogged down in a numbers game can take a toll on anyone, causing stress and anxiety.
Another downside of Kaggle contests is the misconception they give aspiring data scientists about possible rewards. Many of them initially perceive these contests as a way to earn serious money. In actuality, this rarely happens. The pursuit of a Kaggle prize only causes budding data scientists to shift their attention to the wrong things. Instead of looking to get better at machine learning, they can easily start getting greedy. Given how rarely anyone hits the jackpot, that shift in focus isn’t worth the amount of time people invest in it.
Besides, it’s too easy to start gauging your progress solely through Kaggle leaderboards. A low rank can make anyone feel like they aren’t that good at working in machine learning. The fact of the matter, though, is that Kaggle doesn’t really reflect actual machine learning work. Obsession with Kaggle rankings may drive otherwise skilled data scientists out of the field.
Verdict: Go Your Own Way
In the end, Kaggling can be fun for a short while, especially as a side project. There’s no doubt that it’s a great platform to hone data science skills. But despite its benefits, it’ll never come close to real-world applications and will always cover just a small aspect of the whole job. You can easily make a lot more progress, gain experience, and tap into a wider range of data science problems by working on your own projects.