The term “shadow IT,” which refers to technologies that are adopted without the involvement of IT, evokes a different reaction depending on who you ask. The IT department might shudder, while developers may shrug it off as a necessary shortcut to achieving agile workflows.
To close this gap in sentiment, companies have worked hard in recent years to increase awareness on both sides — developers have learned more about the security risks of going outside the IT lines, while IT has learned more about how to deliver tools and processes that match developers’ needs.
Today, we must have a similar conversation on a related topic: how to handle shadow data management. Left unchecked, this is an issue that will create many headaches, ranging from security vulnerabilities to inadequate return on investment (ROI) for AI projects.
What Is Shadow Data Management?
Shadow data management happens whenever data scientists or analysts copy data from the primary, IT-approved database to run analysis with their own tools in their own environments, many of which have not gone through the rigor of IT approvals. Shadow data management carries many of the same risks as shadow IT (like increased attack vectors or disrupted workflows), and also amplifies a company’s liability if sensitive user information is included in the “shadow” data set.
The Root Causes of Shadow Data Management
To understand how to address the challenge of shadow data management, it’s important to first explore the underlying root causes. After all, data scientists and machine learning (ML) engineers aren’t devious rule-breakers devoted to causing chaos. Rather, they’re driven to shadow data management practices because their technology needs aren’t being met by their IT department’s approaches to data governance, management and risk.
For example, I’ve heard of large organizations spending lots of money to stand up massive databases that were actually too slow to use for the models their teams needed. Frustrated by the primary database, the data scientists found a simple workaround: running a query to dump the full table into a CSV file and then working on that copy with their own scripts.
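To make that workaround concrete, here is a minimal sketch of what a full-table dump can look like. It is an illustration under stated assumptions, not a description of any particular team's setup: the in-memory SQLite database stands in for the primary database, and the `customers` table, its columns, the `dump_table_to_csv` helper and the output file name are all hypothetical.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def dump_table_to_csv(conn, table, out_path):
    """Copy every row and column of `table` into a CSV file; return the row count."""
    # The table name is interpolated directly, which is only acceptable here
    # because it is a hard-coded illustrative value, not user input.
    cursor = conn.execute(f"SELECT * FROM {table}")
    header = [col[0] for col in cursor.description]
    rows = cursor.fetchall()
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return len(rows)


# Hypothetical stand-in for the primary, IT-approved database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT, spend REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "a@example.com", 120.5), (2, "b@example.com", 80.0)],
)

# The "shadow" copy: a plain file that lives outside the governed database.
out_path = Path(tempfile.mkdtemp()) / "customers_dump.csv"
row_count = dump_table_to_csv(conn, "customers", out_path)
```

Note that `SELECT *` carries the `email` column into the file along with everything else. Once the CSV exists, none of the database's access controls or audit logging apply to it, which is how a convenient shortcut becomes a liability.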
Speed isn’t the only pain point that drives data scientists to shadow data management. Another challenge is a lack of flexibility in data tooling and processes, which have traditionally centered around SQL databases. Additionally, data management is often treated as artifact management, an approach that has worked fine for many traditional business analytics workflows. However, what differentiates data science from business intelligence is the ability to experiment: simply running queries is not the same as experimentation and ML modeling.
What’s more, the types of data that businesses find valuable today are more varied than ever before, which means that one-size-fits-all data management solutions and tools are often insufficient. Arming data scientists with only spreadsheets and SQL and asking them to do their work is like asking them to build a jet engine with a chisel and stone. Shadow data management is invariably what results when data scientists break out of the restrictive silos of traditional data platforms.
The shadow data problem, then, is a view into the tension between the unmet needs of data scientists and the way traditional IT approaches data management. To solve this challenge, both sides need to be willing to work together on new solutions that can meet each team’s key needs.
Here are a few action steps that IT departments and data scientists can collectively take to help address the challenge of shadow data management.
Be Willing to Build Net-New Risk and Governance Models
Every organization needs data risk and governance policies in place in order to practice responsible data management, but leaders need to recognize that trying to force-fit existing models into the new AI era isn’t going to work. IT teams should remember that the data management infrastructure isn’t the end goal, but rather a means by which to serve the needs of both data scientists and the business.
To enact AI initiatives, companies have to bring together more data than before, and doing that is inherently risky. IT teams should work with their data science counterparts to build new infrastructure that appropriately manages this risk. An example I like to use is imagining the difference between a car and an airplane. In a car, you’re likely not going to travel much faster than 80 miles per hour (if you want to be safe), but in a plane, you must travel significantly faster to achieve flight.
To have successful AI projects, teams need to move away from the “speed limits” they traditionally had and create new guidelines that support a fundamentally new paradigm.
Don’t Try to Buy Your Way Out of the Problem With New Tooling
IT and data science teams must appreciate that the challenge of creating functional data management policies isn’t one that can be solved simply by buying a shiny new tool or software package. While some tooling updates will likely be necessary as part of the process — remember, spreadsheets are no longer sufficient — the best tooling in the world won’t solve for the governance and other process-related elements that also need to adapt. To continue our analogy: You can’t just stick wings on a car and expect it to suddenly fly like a plane.
It may be helpful to listen to technical talks from others in your industry about how they’re approaching data engineering or machine learning operations (MLOps) challenges. When doing so, remember that you don’t necessarily need to adopt the same solution because each company has its own unique challenges. Instead, concentrate on the problems that those teams identified, and how they approached solving them. The actual implementation may differ for you, but many of the underlying principles will be the same.
Recognize That Shadow Data Management Creates Technical Debt
This action step is especially directed toward data scientists. Our State of Data Science 2020 report found that data scientists spend a lot of their time on data preparation tasks like data cleansing and loading, often considered “drudgery.” Given that, it’s tempting to save as much time as possible by getting the efficiency gains that can come with shadow data management.
But data scientists must remember that, as with many short-term wins in software development, cutting corners leads to the accrual of technical debt over time. And one day, someone will have to pay that debt down. Even though it will be a significant investment of time, it’s very much worthwhile to go ahead and open discussions with the IT team about better data management practices and new solutions that can be implemented.
The End Result: Better Return on AI Investments
Though it may be tempting to see the solution to shadow data management as harsher enforcement of existing IT security protocols, that’s really just treating the symptom, instead of the underlying disease. The reality is that, even as we have raced toward a data-driven future, many of our practices and tooling have not yet caught up to the way data science is done today.
Companies must bring together their IT and data science teams to rethink data management strategies to fully support the needs of each side. Only then will organizations see the full benefits of their AI investments. When data science teams are better equipped to do their jobs with the tools and infrastructure they need, they can more effectively produce results. When IT teams build in protections and governance structures that are well-suited to today’s data science work, they can ensure that their companies are practicing responsible AI. It’s a win-win situation for everyone.