Johns Hopkins University has built an incredibly robust dashboard that tracks the impact of the coronavirus worldwide and in the United States, with infographics, interactive charts, timelapse videos, and other visualizations bringing the raw data to life and providing much-needed context. The dashboard is a powerful tool for government officials who need answers, journalists searching for statistics and citizens who just want to know what the heck is going on.
But governors, reporters and concerned citizens aren’t the only ones asking questions right now. Business leaders around the world are scrambling to assess the impact of the coronavirus on their industries, and few if any of the many coronavirus dashboards out there were built with CEOs in mind.
The dataset powering the Johns Hopkins dashboard is available on GitHub, and Dow Jones has leveraged it, along with a dataset from the Harvard Global Health Institute, to build its own COVID-19 timelapse dashboards, which are also free to download on GitHub. The dashboards mix structured healthcare data with unstructured, industry-specific news data stored in the company’s Factiva database. By combining structured healthcare data with quotes, statistics and figures reported in articles and press releases, Dow Jones is able to track the impact of the coronavirus at an industry level.
Through sentiment analysis, an application of natural language processing that can be trained to identify specific sentiments in messages or documents, Dow Jones is able to turn thousands of news articles into a dataset that shows how an industry has been impacted by the coronavirus over time. When the data indicates a change, the dashboard lights up, enabling users to see when shifts in their industry occurred — and what drove those changes.
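The pipeline described here — scoring article sentiment, then aggregating it by industry over time — can be sketched in a few lines. This is a toy illustration with made-up data and a word-list scorer standing in for a trained model; the function names and article fields are hypothetical, not Dow Jones’ actual implementation.

```python
from collections import defaultdict
from statistics import mean

# Toy lexicons: a production system would use a trained model, not word lists.
NEGATIVE = {"shutdown", "losses", "layoffs", "decline", "disruption"}
POSITIVE = {"recovery", "rebound", "growth", "demand", "reopening"}

def score(text: str) -> int:
    """Crude polarity: +1 per positive word, -1 per negative word."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def sentiment_timeline(articles):
    """Average sentiment per (industry, month): the series a dashboard would plot."""
    buckets = defaultdict(list)
    for a in articles:
        buckets[(a["industry"], a["date"][:7])].append(score(a["body"]))
    return {k: mean(v) for k, v in buckets.items()}

# Hypothetical sample articles.
articles = [
    {"industry": "airlines", "date": "2020-03-15",
     "body": "Travel shutdown drives heavy losses"},
    {"industry": "airlines", "date": "2020-06-10",
     "body": "Signs of recovery as demand returns"},
]
timeline = sentiment_timeline(articles)
```

A sharp drop in a month’s average score is the kind of “change” that would light up the dashboard and point a user to the underlying articles.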
Built In spoke with Niranjan Thomas, a GM on the Dow Jones developer platform and solutions engineering team, about the dashboard and the tech required to extract insights from news articles at scale.
Dow Jones is best known for the stock market index that bears its name, not for maintaining a vast store of unstructured data. Can you talk a bit about Factiva and its use in this specific application?
We saw an opportunity to use the Factiva Snapshots API to help customers better understand and respond to the impact of the COVID-19 crisis. The Snapshot API is designed to give developers a moment-in-time “snapshot” of content from over 8,500 trusted, high-quality sources available within our Factiva database. This allows us to provide partners and customers with a high-volume extract of Factiva news for text mining, analytics and, in advanced use cases, building predictive models.
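A Snapshot extraction is driven by an SQL-like query selecting the slice of Factiva content to extract. The sketch below only assembles such a request body; the endpoint URL, field names and industry code are illustrative assumptions based on the public Snapshots documentation, not verified against the live API.

```python
import json

# Assumed endpoint -- illustrative only, check the official Snapshots docs.
SNAPSHOT_ENDPOINT = "https://api.dowjones.com/alpha/extractions/documents"

def build_snapshot_query(industry_code: str, start_date: str) -> dict:
    """Assemble a Snapshot extraction request: an SQL-like WHERE clause
    selecting English-language coverage of one industry since start_date."""
    where = (
        f"language_code = 'en' "
        f"AND publication_datetime >= '{start_date}' "
        f"AND industry_codes LIKE '%{industry_code}%'"
    )
    return {"query": {"where": where, "limit": 10000}}

# Hypothetical industry code for airlines.
payload = build_snapshot_query("i75", "2020-01-01")
body = json.dumps(payload)  # POSTed to SNAPSHOT_ENDPOINT with an API key
```

The API then runs the extraction asynchronously and returns the matching articles as downloadable files for text mining.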
The API is built on Google Cloud, so a Snapshot extraction with hundreds of millions of articles can be executed in just minutes. The timelapse dashboard is based on a developer blueprint — or what we call a solution pattern — made up of Heroku, Flask, Dash and the Factiva Snapshot API, and it comes complete with sample code that developers can access and use quickly and easily.
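Behind the Heroku/Flask/Dash front end, a timelapse view needs one data frame per date — a running total the slider or animation steps through. A framework-agnostic sketch of that frame-building step, with hypothetical daily counts:

```python
from datetime import date, timedelta

def timelapse_frames(daily_counts, start, end):
    """Cumulative total per day: one (date, total) frame per slider position.
    Days with no data carry the previous running total forward."""
    frames, running, d = [], 0, start
    while d <= end:
        running += daily_counts.get(d, 0)
        frames.append((d.isoformat(), running))
        d += timedelta(days=1)
    return frames

# Hypothetical daily counts for a short window.
frames = timelapse_frames(
    {date(2020, 3, 1): 5, date(2020, 3, 3): 7},
    start=date(2020, 3, 1),
    end=date(2020, 3, 4),
)
```

In the actual blueprint, a series like this would feed a Dash graph component whose animation frame is selected by date.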
“When used in addition to healthcare data, news data is incredibly powerful for understanding the social and economic impact of epidemiological events over time.”
What specific data is Dow Jones looking for in news articles, and what are the benefits to combining these insights with structured healthcare data?
Every news article ever produced contains a vast number of data points about people, companies, entities and events. In aggregate, this contributes to a huge pool of news data that — when properly normalized, labelled and structured — contains valuable insights about the world around us. When used in addition to healthcare data, news data is incredibly powerful for understanding the social and economic impact of epidemiological events over time. For instance, we can use news data to apply sentiment analytics, drill down into regional and industry level variances, and identify the actions and reactions that caused an event to unfold in a particular way.
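Combining the two sources amounts to joining structured healthcare figures with derived news metrics on a shared key such as date. A minimal sketch with invented numbers — the field names and values are illustrative, not Dow Jones’ schema:

```python
def combine(cases_by_date, sentiment_by_date):
    """Inner-join case counts with news sentiment on date: the kind of
    merged record a regional or industry drill-down would chart."""
    return {
        d: {"new_cases": cases_by_date[d], "news_sentiment": s}
        for d, s in sentiment_by_date.items()
        if d in cases_by_date
    }

# Hypothetical inputs: health data and a news-derived sentiment series.
merged = combine(
    {"2020-03-01": 120, "2020-03-02": 210},
    {"2020-03-01": -0.4, "2020-03-03": -0.9},
)
```

Plotting both series together is what lets an analyst line up an epidemiological turning point with the news events that preceded or followed it.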
What technology does Dow Jones use to extract data from news articles at scale?
We use a combination of machine learning and rule-based techniques to label or tag our datasets, and we leverage the scale of the cloud. But our capability extends much further. We need to ask the right questions so that AI can provide the appropriate answers. And for that, you need human expertise. Our team of multilingual researchers is critical to this process, helping us distinguish between the real and the fraudulent, interpret the context of each article and decide whether the information within it is relevant.
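The rule-based half of that hybrid approach can be illustrated with a keyword tagger that attaches every matching label and routes unmatched articles to human reviewers — the escalation step the researchers handle. The rules and labels below are invented for illustration:

```python
# Hypothetical tagging rules: label -> trigger keywords.
RULES = {
    "supply_chain": ("factory", "shipping", "logistics"),
    "public_health": ("vaccine", "hospital", "quarantine"),
}

def tag(text):
    """Attach every label whose keywords appear in the text; articles
    matching no rule are flagged for human review instead."""
    lowered = text.lower()
    found = [label for label, keywords in RULES.items()
             if any(k in lowered for k in keywords)]
    return found or ["needs_review"]

tags = tag("Shipping delays hit hospital supplies")
```

In practice the machine-learned models handle the ambiguous middle ground, and the rule layer plus human review keep the tags precise and trustworthy.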