The Rise of Alternative Data

Sometimes messy, sometimes over-aggregated, alt data is still taking over.
Stephen Gossett
May 11, 2021
Updated: May 12, 2021

With apologies to Fiddler on the Roof’s Tevye, the traditional ways can only deliver so much.

That’s essentially the impulse that, for the last several years, has driven hedge funds and other investment firms to augment conventional data sources like SEC filings and quarterly financial statements with newer, sometimes wildly outside-the-box data. Those streams now include everything from credit card transaction data and web-scraped social media to satellite imagery and IoT sensors. In the scramble for alpha (the financial industry’s term for returns in excess of a benchmark), no data set is too obscure, as long as some actionable signal can be gleaned.

What Is Alternative Data?

Alternative data refers to non-traditional data sets that investors use to guide investment strategy. Examples of alternative data sets include credit card transaction data, mobile device data, IoT sensor data, satellite imagery, social media sentiment, weather data and ESG (environmental, social and corporate governance) data.

The figures tell the story of alt data’s fast rise. The number of alternative-data providers is more than 20 times larger now than it was 30 years ago — with more than 400 currently active providers, compared to only 20 in 1990, according to a report from last spring by the Alternative Investment Management Association, in collaboration with fintech company SS&C.

Today, roughly half of all investment firms use alternative data, according to both the AIMA report and another recent survey by Bank of America. And that number will likely continue to grow, as more firms have invested in new technology during the pandemic. A recent survey by AIMA, in conjunction with Simmons & Simmons and Seward & Kissel, found that 34 percent of hedge fund managers surveyed said their firms are newly investing in alternative data.


Making Alternative Data Useful

One of the precipitating factors behind the rise of alternative data was the “quant quake” of 2007, Yin Luo, vice chairman of quantitative research at data firm Wolfe Research, told MarketWatch. Quantitative hedge funds (“quants”) had herded around the same stocks, then moved to sell all at the same time, resulting in heavy losses. New data sources promised unique advantages and a way to break the pack mentality.

A year after the quake, now-shuttered MarketPsy Long-Short Fund began incorporating social-media sentiment into its models. A few years later, a leading London hedge fund kickstarted investments based on a 2010 study that showed a probable relationship between Twitter mood and the Dow Jones index, Deloitte reported. Alt-data vendors proliferated in the years to follow, and fundamental hedge funds soon began to follow the path paved by the quants.

The industry has blossomed, but access doesn’t inherently mean advantage.


Raw vs. Aggregated

Alternative data often comes either as aggregated data sets or as a straight data feed, through APIs. Aggregated data, the less expensive option, is structured, and therefore easier to work with and slot directly into an investment model. But those sets are more widespread and, because of that, they have less alpha potential.

They also lack depth. “You lose that ability to really dig and mine the data in unique ways,” said Gene Ekster, CEO of Alternative Data Group and an alternative-data professor at New York University.

They could also suffer from selection bias, which means they’re not truly representative. And good luck untangling that — or any other significant error. “Most [data] intermediaries’ techniques and methodologies are black-box systems, not available for audits by customers, thus exacerbating aggregation errors because of a lack of transparency,” Ekster wrote last year in an alt-data report.

How unforgiving can that black box be? Consider the Lululemon episode.

A few years ago, a number of the athletic-apparel retailer’s stores had inserted an asterisk between the two Lus in reports: Lu*lulemon, instead of Lululemon. The aggregators didn’t have the keyword for Lu*lu, which made it appear as if sales volumes had dropped dramatically, Ekster told Built In. That led to a number of short bets, which proved disastrous when Lululemon, in fact, reported a great quarter.

“If you had the raw data, you were able to see past that, not make that error and trade against that,” Ekster said.
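
The failure mode in that episode can be sketched in a few lines. This is only an illustration, assuming a hypothetical keyword table and made-up descriptor strings; real aggregators' matching rules are not public. It contrasts a verbatim keyword match with one that strips punctuation first.

```python
import re
from typing import Optional

# Hypothetical keyword table: brand keyword -> ticker.
BRAND_KEYWORDS = {"lululemon": "LULU"}

def normalize(descriptor: str) -> str:
    """Lowercase the descriptor and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", descriptor.lower())

def match_verbatim(descriptor: str) -> Optional[str]:
    """Naive matching: the keyword must appear unbroken in the raw text."""
    text = descriptor.lower()
    for keyword, ticker in BRAND_KEYWORDS.items():
        if keyword in text:
            return ticker
    return None

def match_normalized(descriptor: str) -> Optional[str]:
    """Match after stripping punctuation, so 'LU*LULEMON' still resolves."""
    text = normalize(descriptor)
    for keyword, ticker in BRAND_KEYWORDS.items():
        if keyword in text:
            return ticker
    return None

# Made-up descriptor in the style described above.
raw = "LU*LULEMON #1234 CHICAGO"
print(match_verbatim(raw), match_normalized(raw))
```

With only the aggregated output, a buyer cannot tell which of these two behaviors produced the numbers; with the raw strings, the dropped transactions are visible.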

For reasons like these, a raw feed is considered much more valuable than aggregated data. But a purely unaltered data set, with no transformation applied, is essentially just data exhaust. Any value it might provide has to be weighed against the heavy clean-up work it demands.


Tackling Ticker Tagging

The best solution is a direct API data feed with as much automated transformation and structuring as possible. But entity mapping and ticker tagging are major challenges. Ticker tagging means mapping a company reference or brand alias back to its unique stock symbol and proper name. For example, “Verizon” needs to map back to VZ and Verizon Communications Inc. And not all references are so direct. Maybe a Twitter user sarcastically references Verizon’s slogan while including a typo: “that’s powerfull.” A hedge fund might want that sentiment included in its investment analysis, but it would need sophisticated AI to even detect the reference.
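
A minimal sketch of the simplest version of ticker tagging, assuming a hypothetical alias table and using fuzzy string matching from Python's standard library. It can absorb a misspelled company name, but it cannot detect an indirect reference like a mangled slogan; that is where the more sophisticated models come in.

```python
from difflib import get_close_matches
from typing import Optional, Tuple

# Hypothetical alias table: lowercase alias -> (ticker, proper name).
ALIASES = {
    "verizon": ("VZ", "Verizon Communications Inc."),
    "verizon communications": ("VZ", "Verizon Communications Inc."),
    "vz": ("VZ", "Verizon Communications Inc."),
}

def tag_ticker(mention: str, cutoff: float = 0.85) -> Optional[Tuple[str, str]]:
    """Map a company mention to (ticker, proper name), tolerating small typos."""
    key = mention.strip().lower()
    if key in ALIASES:
        return ALIASES[key]
    # Fall back to the closest alias by similarity ratio, if any clears the cutoff.
    close = get_close_matches(key, list(ALIASES), n=1, cutoff=cutoff)
    return ALIASES[close[0]] if close else None

print(tag_ticker("Verizon"))
print(tag_ticker("verizn"))   # typo still resolves
print(tag_ticker("banana"))   # no plausible alias
```

The cutoff is a judgment call: set it too low and unrelated words get tagged, too high and ordinary typos are missed.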

And it doesn’t stop at ticker symbols. Some fund managers also want data mapped to CUSIPs, alphanumeric codes for North American securities, or ISINs, international identifier codes.
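
Identifier mapping usually starts with checking that a code is well formed. As a rough sketch, an ISIN is 12 characters ending in a check digit, which is verified by expanding letters to numbers (A=10 through Z=35) and applying the Luhn algorithm. The function name and the sample value below are illustrative choices, not part of any vendor's pipeline.

```python
import re

# Two-letter country prefix, nine alphanumeric characters, one check digit.
ISIN_PATTERN = re.compile(r"[A-Z]{2}[A-Z0-9]{9}[0-9]")

def is_valid_isin(isin: str) -> bool:
    """Check ISIN format and its Luhn check digit."""
    if not ISIN_PATTERN.fullmatch(isin):
        return False
    # Expand each character to its base-36 value: digits stay, A=10 ... Z=35.
    digits = "".join(str(int(ch, 36)) for ch in isin)
    total = 0
    for position, ch in enumerate(reversed(digits)):
        d = int(ch)
        if position % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(is_valid_isin("US0378331005"))  # a commonly cited sample ISIN
print(is_valid_isin("US0378331006"))  # same code with a wrong check digit
```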

One of the leading alternative-data providers — and one of the standouts in handling the tagging and mapping challenge, according to Ekster — is Thinknum.

“There’s an opportunity in the market to have what they call referential data — having all these different ways of referencing a given entity, company or security, mapped back in a way that facilitates the data analysis,” said Boris Spiwak, director of marketing at Thinknum. “And I think we’re all sort of trying to figure out the best way to do that.”

Thinknum sells more than 35 data sets. Those include social media and job listing data sets, but also more niche information like car inventory, retail store growth, hotel web traffic data and vendor-specific product pricing by location. The information is publicly available; anyone with the know-how could, say, scrape Glassdoor in hopes of detecting hiring patterns. But that ability to map and tag referential data as a direct feed has major value. Thinknum’s API data feeds cost between $25,000 and $50,000 per data set, per year, Spiwak said.



Making Sure the Data Is Actually Worth It

Any ability to cut down turnaround time between acquisition and analysis is valuable, especially because many data intermediaries go for a quantity-over-quality approach: aggregated data sets with high ticker coverage, but not necessarily insightful ticker coverage.

“The problem today is ... how do we know if a data set is going to be valuable? It could take six months of R&D, [and] you have to buy it first. You don’t know how much alpha it’s going to generate until much later,” Ekster said.

Neuravest, formerly known as Lucena Research, is one of the companies focused on cracking that conundrum. Neuravest is something of an intermediary after the intermediaries. It partners with 42 select alternative-data providers and works to validate data sets before passing them along and incorporating them into machine-learning investment models for fund managers.

Raw data is piped into the system, which generates what the company calls a data qualification report. The platform measures the data along 12 checkpoints before it’s allowed to be incorporated into a model. Checkpoints include an indicator of the length of time before a signal loses value, plus a distribution of price action following a given event, such as a news announcement that generates social-media chatter.
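
Two of those checkpoints, how long a signal keeps its value and how prices move after an event, can be illustrated with a toy event study. This is a sketch on made-up prices and event days, not Neuravest's actual method; the function names are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Optional

def forward_returns(prices: List[float], event_days: List[int], horizon: int) -> List[float]:
    """Simple return from each event day to `horizon` days later."""
    return [
        prices[day + horizon] / prices[day] - 1.0
        for day in event_days
        if day + horizon < len(prices)  # skip events too close to the end
    ]

def decay_profile(prices: List[float], event_days: List[int],
                  horizons: List[int]) -> Dict[int, Optional[float]]:
    """Average post-event return at each holding horizon (None if no data)."""
    profile = {}
    for h in horizons:
        returns = forward_returns(prices, event_days, h)
        profile[h] = mean(returns) if returns else None
    return profile

# Made-up daily closes, with "events" (say, bursts of chatter) on days 0 and 5.
prices = [100, 101, 102, 102, 101, 100, 103, 104, 104, 103]
profile = decay_profile(prices, [0, 5], horizons=[1, 2, 4])
for horizon, avg in profile.items():
    print(horizon, round(avg, 4))
```

In this toy series the average post-event return peaks at the two-day horizon and fades by day four, which is the kind of shape a decay checkpoint is meant to surface.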

After validation, the data is scrubbed, ticker-tagged and normalized before a model is built to generate back-testable investment theses. By bringing together uncorrelated data sets, the models aim to identify constituent stocks and assets that are about to move abnormally compared to similar stocks.

But it begins with that first step — proving a data set is even worth the time. It’s about “identifying which ones are good for certain scenarios, and really providing them on a silver platter to customers, so they don’t have to deal with all these other purchases and evaluations and hiring quants and infrastructure,” said Erez Katz, co-founder and CEO of Neuravest.


The Future of Alternative Data

Even with well-structured feeds and benchmarked data sets, the need for skilled data analysts in finance isn’t going anywhere. Fundamental firms incorporate alt data to help interrogate their existing investment hypotheses, while quants input the alternative stuff into models alongside reams of traditional data. That is, alternative data will always be an ingredient, not the whole stew.

That’s also why experts sometimes push back on the idea that a widely distributed data set necessarily means diminishing alpha, particularly if it’s non-aggregated. “If you give the same raw data set to 20 different funds and analysts, they’ll come up with 20 different ways to make money on it,” Ekster said. “So in that sense, there will be no alpha decay.”

Katz struck a similar note, emphasizing the need for subject matter expertise and innovative thinking. “You need people who have very strong analytical skills, but also people who understand Wall Street, what it takes to move markets and how to circumvent what the common crowd knowledge presents.”


Beyond Alpha...

It’s also important to note that firms are no longer looking at alternative data strictly as an alpha generator. Data sets can also be used more like insurance — information to help limit loss in the face of potential upheaval. For instance, Spiwak said Thinknum saw “unprecedented” inbound demand when, at the height of the GameStop saga, it released its Reddit Mentions data set — which tracks, in real time, how often ticker symbols are mentioned in the top 100 posts on r/WallStreetBets and r/Stocks.

It was alternative data as risk management. If a hedge fund was shorting a stock, here was a way to maybe know if a short squeeze was imminent.
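
The core of a mentions tracker is simple to sketch. The example below assumes a hypothetical ticker universe and made-up post text, and it is not a description of Thinknum's product: it counts how often known symbols appear across a batch of posts.

```python
import re
from collections import Counter
from typing import Iterable

# Hypothetical universe of symbols to track.
KNOWN_TICKERS = {"GME", "AMC", "VZ"}

# Standalone runs of one to five capital letters; "$GME" matches as "GME".
TOKEN = re.compile(r"\b[A-Z]{1,5}\b")

def count_mentions(posts: Iterable[str]) -> Counter:
    """Count occurrences of known tickers across all posts."""
    counts = Counter()
    for post in posts:
        for token in TOKEN.findall(post):
            if token in KNOWN_TICKERS:
                counts[token] += 1
    return counts

# Made-up posts for illustration.
posts = [
    "GME to the moon, $GME!",
    "Holding AMC and GME",
    "I like VZ dividends",
]
mentions = count_mentions(posts)
print(mentions.most_common())
```

Filtering against a known universe matters: without it, ordinary capitalized words such as "I" would be counted as symbols. A sudden jump in a shorted stock's count is the kind of warning sign described above.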

The Greensill episode offered a similar lesson. Sentiment analysis of Greensill employee reviews on job sites revealed turmoil prior to the finance company’s eventual collapse.

“There were some pretty clear signals from people working there that something wasn’t kosher,” Spiwak said.

Progress in the industry also means that sectors beyond finance are paying attention to the value of alternative data. Thinknum offers a more user-friendly, web-based interface for its data sets that’s less expensive than the API feed. The bulk of customers who use it come from companies outside finance, according to Spiwak.


...and Beyond Finance

Once a data set has enough historical data and true representativeness, it becomes attractive to enterprises too, and sometimes even governments. “You see a lot of non-institutional-investor interest in data sets that are mature enough and developed enough,” Ekster said. “And they’re using the fact that the institutional investment community uses them as a validation point.”

So far, that’s perhaps most evident in the fast-growing people analytics industry. Companies want up-to-the-minute data on employee sentiment, both at their own firms and at their competitors. And real-time tracking of competitors’ job listings can give a company a better picture of competitors’ growth strategies. While finance remains Thinknum’s beachhead, this kind of broader adoption of alternative data represents the future of the industry, Spiwak said.

Plus, there are always new kinds of data sets emerging. For example, ESG (environmental, social and governance) data has been the subject of much activity and chatter lately. It’s essentially a way of quantifying, through three main criteria, how sustainable a given enterprise is. That has broad appeal for governments tracking climate-related information, for companies looking to prove their green bona fides and for investors who’ve noticed the studies that indicate sustainable funds have performed as well as or better than conventional funds.

ESG data isn’t perfect. The Organization for Economic Cooperation and Development recently called for more consistent standards to ensure across-the-board verifiability. But it’s clear nonetheless that — whether incorporating satellite data of construction practices or flood risk analysis or some other telling metric — alternative inputs will be key.

“To achieve that ESG goal, for the most part, alternative data is the only source of information that you have,” Ekster said. “You won’t get that from stock prices or company filings. You get that from alternative sources.”
