The Challenges (and Opportunities) of Data Science in Finance

Data is everywhere, which means data science professionals are also everywhere, in every industry. While there is a lot of commonality in what data science professionals do, each industry offers a unique experience — and unique opportunities.

The financial industry, which includes both traditional financial institutions and fintech companies, handles large volumes of distinctive data and has peculiarities that other industries don’t share. These shape how data science is applied within the industry and what data science professionals get to do.

Data Science in the Financial Industry

The financial industry deals with large volumes of very sensitive data. The industry itself is large, wide-reaching and heavily regulated. It also faces specific challenges, like fraud risks. These characteristics create specific demands and opportunities, including the need for highly accurate and explainable models, the need for low-latency data processing and the chance to deploy and test experimental models in perhaps the shortest production cycle of any industry.

Working With Shorter Feedback Loops

While specific data science applications are almost never unique to a single industry, each industry tends to emphasize a different combination of use cases. The financial industry, for example, invests in natural language processing to assess customer sentiment or to gauge market confidence from the language used in news reports.

But some use cases are fairly unique to the financial industry. Take hedge funds, for example. Hedge funds present unique application opportunities for data scientists in that they are “increasingly relying on data scientists who want to use the very latest and greatest algorithms that are coming from research,” according to Kjell Carlsson, head of data science strategy and evangelism at enterprise MLOps platform provider Domino Data Lab. He called this a fun challenge of combining very high performance requirements and the need to achieve them very quickly.

Yohann Smadja, VP of data science at Cedar, a fintech platform for healthcare providers, and former foreign exchange trader with a master’s degree in statistics, also pointed to the trading and investment decisions required for hedge funds as a unique opportunity for data science professionals, as these areas provide for one of the shortest production, testing and deployment cycles for models.

“Depending on what you’re working on, trading decisions or investment decisions, you are probably looking at making decisions on a daily basis, and you are going to get the feedback that your trades were right,” he said. “You can compare your benchmarks with the S&P 500 or other hedge funds’ performance … [and] that feedback loop is almost constant.”

For Smadja, that is one of the most attractive elements of the financial industry for a data scientist.

In addition to these unique applications of data science in the financial industry, there are other use cases that correspond with the peculiarities of this industry.

A Mature Industry Looking for New Types of Data

The financial industry embraced data early on in ways many other industries have not, according to Navin Budhiraja, CTO at Vianai Systems. He described the financial industry — along with the retail and high-tech industries — as having invested a lot of time and resources into making sure it has the right data in the right form to leverage for new insights. This means the financial industry is one where “building and operationalizing the AI models is already a big focus,” he said, while other industries like manufacturing and transportation are still adapting to the collecting, cleaning and organizing of data.

This familiarity with using data means that the financial industry is also expanding the kind of data it is interested in, according to Smadja. Over the past 10 years, he has observed a trend, particularly in the investing sphere of the financial industry, of expanding into new kinds of data that might serve as economic indicators.

“One such project is that people have been scraping Amazon’s prices and are trying to detect inflation ahead of everybody else,” he said. Another example is using satellite images of oil fields in an effort to anticipate how many barrels are in production, or using governments’ agricultural reports (usually in PDF format) to anticipate food supplies.

“Everybody has access to the same data, but the key is you are constantly trying to find an edge,” Smadja said. “That’s the kind of work that data scientists could do — working on new formats, images, texts. How do you parse that? How do you leverage the information that is in those reports?”
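
The price-scraping idea can be sketched in a few lines. The example below is an illustration only, not the method any fund actually uses: it assumes two hypothetical snapshots of scraped prices keyed by product name and computes a Jevons-style index, the geometric mean of price ratios for products that appear in both snapshots. The function name and the sample products are made up for the sketch.

```python
import math


def jevons_index(base_prices, current_prices):
    """Return 100 x the geometric mean of current/base price ratios.

    Only products present in both snapshots with positive prices are used,
    so newly listed or delisted items do not distort the comparison.
    """
    shared = [
        product
        for product in base_prices
        if product in current_prices
        and base_prices[product] > 0
        and current_prices[product] > 0
    ]
    if not shared:
        raise ValueError("no overlapping products with positive prices")
    log_sum = sum(
        math.log(current_prices[product] / base_prices[product])
        for product in shared
    )
    return 100.0 * math.exp(log_sum / len(shared))


# Hypothetical scraped snapshots (prices in dollars).
last_month = {"coffee": 10.0, "detergent": 8.0, "headphones": 50.0}
this_month = {"coffee": 11.0, "detergent": 8.0, "headphones": 50.0, "new item": 5.0}

index = jevons_index(last_month, this_month)
print(f"Price index vs. last month: {index:.2f}")
```

A reading above 100 suggests prices in the scraped basket rose between the two snapshots. A real pipeline would also have to handle product matching, stock-outs and weighting, which this sketch ignores.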

The environmental, social and governance commitments of companies are another type of data that the financial industry is increasingly interested in, according to Paul Fahey, head of investment data science at asset servicing provider Northern Trust.

“If you’d have told me 16, 18 months ago that I was going to be spending so much of my time on sustainability and ESG-related topics, I’d have said you were crazy,” he said. But investors are increasingly using ESG information when making their investment decisions, he said, because they see a firm’s long-term viability as tied to its commitments (or lack of them) on sustainability, social and governance issues. Fahey expects that in the coming years, investors and investing institutions alike will consume ESG data much as they consume more traditional financial data today.

This presents an opportunity for data science professionals in the financial industry.

“People are taking their commitments to ESG seriously, but it’s early stages and there’s a lot of missing data,” he said. “The ability to gather that data efficiently, effectively and then glean the right signals from that is going to be very important in the coming months and years.”

Fraud Risks Require Fast, Accurate Data Analysis

While other industries — most notably healthcare — also deal with large volumes of sensitive data and operate under heavy regulation, finance is set apart by the issue of fraud, said Laura Guilbert, head of analytics at consumer financing service Wisetack.

“There’s a high chance that someone’s going to try and take advantage of whatever finance system that you’re running,” she said of the financial industry, including both traditional financial institutions and fintech companies. Such attempts can range from creating a fake bank account or applying for a loan under a false name to identity theft and direct theft of funds, she said.

Since the financial world — and the efforts to take advantage of it — move in real time, so too must fraud detection. Sanjay Rajagopalan, chief design and strategy officer at Vianai Systems, put it in the context of the swipe of a credit card.

“Imagine an AI system automatically being able to tell that it is maybe suspect and needs further scrutiny, or being able to say, ‘No, this is fine. We’ll pass it through,’” he said.

Such determinations must happen almost immediately, and the need for speed applies to more than just transactional fraud detection, Rajagopalan said. It also applies to more systemic fraud issues like money laundering or attempted tax evasion.

“There are machine learning systems that might look at vast amounts of data and extract signatures of fraudulent activity, which could then be given to experts who are trying to limit those types of things,” he said.

Building systems that can deliver real-time, accurate assessments of fraudulent activity is very challenging for data science professionals. As Carlsson put it, the financial industry is an ecosystem with incredibly large volumes of data coming in almost constantly, from which inferences must be made with very low latency. All of this requires intense, skillful management on the part of data science professionals.
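
To make the "flag it or pass it through" decision concrete, here is a deliberately simple sketch. It is not how any production fraud system works and is not drawn from the interviews. It assumes a hypothetical screener that keeps running statistics for each card and sends a swipe to review when the amount is far from that card's usual spending. The class names, the z-score threshold and the minimum history are all illustrative assumptions. Each check is a constant-time update, which is the property that matters for low latency.

```python
import math
from collections import defaultdict


class RunningStats:
    """Welford's online algorithm: mean and variance updated in O(1) per value."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    def std(self):
        if self.count < 2:
            return 0.0
        return math.sqrt(self.m2 / (self.count - 1))


class SwipeScreener:
    """Hypothetical screener: returns 'review' for unusual amounts, else 'pass'."""

    def __init__(self, z_threshold=3.0, min_history=5):
        self.z_threshold = z_threshold  # assumed cutoff, not an industry standard
        self.min_history = min_history  # swipes needed before any flagging
        self.history = defaultdict(RunningStats)

    def screen(self, card_id, amount):
        stats = self.history[card_id]
        decision = "pass"
        if stats.count >= self.min_history:
            spread = stats.std()
            if spread > 0 and abs(amount - stats.mean) / spread > self.z_threshold:
                decision = "review"
        if decision == "pass":
            # Only accepted swipes update the card's baseline.
            stats.update(amount)
        return decision


screener = SwipeScreener()
usual = [screener.screen("card-1", amount) for amount in [20, 25, 22, 19, 24, 21]]
unusual = screener.screen("card-1", 900)
print(usual, unusual)
```

Real systems combine many more signals (merchant, location, device, timing) and usually a trained model rather than a single threshold. The sketch only shows the shape of a per-transaction decision made from incrementally maintained state.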

A Heavily Regulated Industry Calls for Explainable Models

One of the key traits of the financial industry is its heavy regulation. Finance is arguably one of the most heavily regulated industries in the United States and the world, possibly second only to healthcare. While many of the regulations are aimed at preventing illegal activities and safeguarding the privacy of people’s sensitive data, a notable subset of regulations is focused on fairness, like the Right to Financial Privacy Act, the Home Mortgage Disclosure Act and the Fair Credit Reporting Act.

When it comes to data science, fairness-focused regulations mean there is a strong need for explainability, particularly when it comes to models. And that can cause frustrations for data scientists working in finance, according to Carlsson, because the most explainable models are so-called white box models like decision trees.

“Unfortunately, they’re not terribly accurate,” he said. “There’s almost an inverse correlation between just how difficult it is to interpret a model and the kind of accuracy that you can get.”

Models that can give more accurate predictions tend to be the more complex black box models, like a deep learning neural network model. While these models do theoretically outline how they work, “it just doesn’t mean anything to us as human beings,” Carlsson said. And that makes regulators nervous.

Carlsson said he has heard complaints that regulators and compliance officers working in the financial industry kind of want to have their cake and eat it too when it comes to data models.

“They want it to be extremely accurate, but they also want it to be very — not just explainable but intuitive,” he said. “And at the end of the day, those things are at odds with each other.”

There are, however, ways that data science professionals in the financial industry can deal with this dynamic, and many already do. One is to use the more complex models but apply techniques such as leave one feature out (LOFO), Shapley values and local interpretable model-agnostic explanations (LIME) to check the models at the level of individual records, according to Guilbert.

She gave the hypothetical example of a model that predicts the risk associated with opening an account at a bank based on 10 different variables, including income and number of jobs over the past year. If an account applicant in this example had a constant income but a high number of jobs in the past year, the model might flag them as high risk and decline them. Applying these techniques to check the model could allow those working with it to explain exactly why the model flagged that individual as potentially high risk.

“Once you build your model, you then just apply this overlay algorithm, it will give you, for each individual record, coefficients — using my example — for each of the 10 features to say ‘for each record, this is how important each individual feature was to this particular score,’” she said. “It’s a secondary step, so you would need to not only build your model, but then build this secondary explainability model to get coefficients record by record.”
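
The record-by-record overlay Guilbert describes can be illustrated with Shapley values, one of the techniques named above. The sketch below is an assumption-laden toy, not her team's implementation. It uses a made-up three-feature risk model instead of her 10-feature example, a hypothetical "typical" applicant as the reference point, and exact enumeration of feature coalitions. Enumeration is only feasible for a handful of features, so practical tools rely on approximations.

```python
from itertools import combinations
from math import factorial


def risk_score(applicant):
    """Hypothetical risk model (illustrative weights, not from any real lender)."""
    return (
        0.2
        + 0.1 * applicant["jobs_last_year"]
        - 0.002 * applicant["income_k"]
        + 0.01 * applicant["jobs_last_year"] * applicant["late_payments"]
    )


def shapley_values(model, record, baseline):
    """Exact Shapley values for one record.

    A feature that is absent from a coalition takes its baseline value.
    Every coalition is enumerated, so this is practical only for a few features.
    """
    names = list(record)
    n = len(names)
    contributions = {}
    for target in names:
        others = [name for name in names if name != target]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                without = {
                    name: record[name] if name in coalition else baseline[name]
                    for name in names
                }
                with_target = {**without, target: record[target]}
                total += weight * (model(with_target) - model(without))
        contributions[target] = total
    return contributions


# Hypothetical applicant: steady income, many jobs in the past year.
applicant = {"income_k": 60, "jobs_last_year": 5, "late_payments": 1}
# Hypothetical "typical" applicant used as the reference point.
typical = {"income_k": 60, "jobs_last_year": 1, "late_payments": 0}

contributions = shapley_values(risk_score, applicant, typical)
for feature, value in sorted(contributions.items(), key=lambda kv: -abs(kv[1])):
    print(f"{feature:>15}: {value:+.3f}")
```

In this toy case, the applicant's score is 0.45 above the typical applicant's. The number of jobs accounts for roughly 0.42 of that gap, late payments for about 0.03, and income, which matches the reference value, for essentially none. The per-feature values always sum to the gap between the record's score and the baseline's, which is what makes them usable as a per-record explanation.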

In addition to efforts at making black-box models more explainable, Carlsson said he has heard about efforts to create what might be called gray-box models.

“There has been a cottage industry of coming up with particular methods — like a proprietary version of a so-called K-nearest neighbor algorithm — that are trying to get you the best of both worlds,” he said. These efforts are trying to both embrace the complexity of black-box models while still being “white-box enough” to satisfy regulators who demand explainability.

Digitization Means More Opportunity

While some parts of the financial industry are at the cutting edge of data and have fully embraced the world of digitized data, not all of the industry has gotten with the times. Some sectors, like investing, continue to rely on legacy systems. This presents a unique dynamic for any data science professional working in finance, according to Fahey.

He pointed to the example of the portfolio manager. “Historically, they’ve been in a pretty much analog environment, living their lives in spreadsheets, Word documents for research notes, email exchanges for information sharing,” Fahey said.

But changes in the financial industry and in business at large, from COVID normalizing remote workforces to technological advances that allow faster ingestion and analysis of big data, have accelerated the digital transformation within some of the more intransigent spaces.

Plus, the people working in the financial industry are changing, Fahey said.

“If you think about somebody who’s been in the portfolio manager space for the last 25 to 30 years, they are of a certain vintage,” he said. “The up-and-coming portfolio managers and researchers are all of a different demographic.” They’re people who grew up using technology and having data readily accessible.

“They don’t want to live in a world where they have to deal with offline spreadsheets or this more analog world,” he said. “Not only are they comfortable with the technology, they are demanding that that is the world in which they want to work, which then just leads to greater access to data and the ability to do heightened levels of analytics.”

But even with these pressures, there is still opportunity for increased digitization in many areas of the financial industry, even in subsets of the industry that have embraced increased digitization of data. The insurance industry, for example, has utilized digitization and big data to great success on the consumer experience side, Fahey said. With just a snap of some pictures of damage to a car after a collision, run against the large data sets the insurance industry has, claim payments can take a matter of hours or days rather than weeks, he said.

“On the backside, all of the asset servicing of the insurance industry is still using some data technology and a lot of analog capabilities, a lot of spreadsheets, to support the asset side,” he said. “But I think, broadly, that the financial services still have a lot of room for improvement in the data science race.”