How Parallel Processing Helps ClimaCell Forecast Weather by the Minute
Your phone is a means for you to tell others how bummed you are about the rain delay on your baseball game. It’s also a means to tell whether, um, weather will cancel your make-up game tomorrow.
ClimaCell, a Boston weather tech firm, analyzes the signal quality from your cell phone and other connected devices to produce hyperlocal weather forecasts. Its alerts are so localized, in fact, it can even tell you that, when you get off that call in 12 minutes, you’re going to need an umbrella — rain is coming, at least on your block.
Knowing the weather is important, not just for comfort but for safety and business. For example: In early February, Uber announced it planned to integrate ClimaCell’s insights into its app to produce more accurate driver ETAs. During the 2018 U.S. Open, operators at the Billie Jean King National Tennis Center in New York knew to close the stadium roof while women competed, avoiding rain and maintaining a safe environment for athletes to play. And in February of 2018, an accurate read on when a snowstorm would end saved JetBlue “tens of thousands of dollars” in individual cancellations.
Yuval Gonczarowski, chief technology officer at ClimaCell, estimated a third of the global economy is sensitive to weather conditions. But managing the millions of weather data points quickly enough to provide minute-by-minute forecasts is far from breezy.
“Latency is the name of the game when you’re dealing with weather,” Gonczarowski said. “Once you have all these millions of data points, we have to work very hard to put them into our systems in real-time, sort of high-performance, low latency.”
Built In talked to Gonczarowski about how the weather tech firm updates and analyzes such a high level of weather data so quickly.
Parallel computing completes multiple processes at once. ClimaCell is receiving millions of fresh data points every two minutes. The company relies on parallel processing to clean, condense and feed them into a machine-learning model simultaneously.
Sensor fusion algorithms combine data confidently. ClimaCell receives images, microwave frequency and temperature data, and other weather information from millions of sources. Its sensor fusion algorithms merge all this data into a single, cohesive picture to produce a confident read on the current weather where you are.
GPUs can speed up data processing. ClimaCell powers its parallel computing system with GPUs, which have hundreds of cores and run thousands of concurrent threads. ClimaCell said GPUs are key to feeding millions of data points into its algorithmic system quickly.
Using software to solve a hardware problem
Like the National Oceanic and Atmospheric Administration and other traditional weather firms, ClimaCell forecasts rain using satellite, radar and data from “ground truth” weather stations, facilities operated by government agencies that measure temperature, atmospheric pressure, humidity, wind speed and direction, and precipitation. But these methods don’t paint a complete picture.
Satellites have amazing coverage but don’t provide great street-level, or even city-level, views, Gonczarowski said. Ground truth weather stations are expensive to operate and not distributed equally across the globe. Mumbai, for example, a city of 13 million people prone to monsoons, has just two. The term “under the radar” should be taken literally, Gonczarowski added: Radars are good at scanning weather conditions at a high level, but not close to the ground.
“ClimaCell writes code, we do analysis, we write algorithms. We don’t deploy sensors,” Gonczarowski said. “We just believe the signals are already out there to tell us the story of weather.”
ClimaCell has partnered with wireless networks like Vodafone and National Grid to receive tower-to-tower wireless signal data. In fact, the “cell” in ClimaCell comes from the term “cellular networks.” Wireless network providers give ClimaCell access to microwave signals, many of them supplying cell phone diagnostics information they would typically trash, Gonczarowski said, and ClimaCell’s algorithms analyze how these signals are affected by ground-level weather conditions.
“When I drop a rock it falls to the ground, right? When I drop a rock in water, it drops to the ground a little more slowly. It’s the same thing,” Gonczarowski said. “We look at the signal strength being sent and we look at the strength of the signal being received.”
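The comparison Gonczarowski describes, signal sent versus signal received, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not ClimaCell’s algorithm: it assumes the widely used power-law relation between rain rate R and specific attenuation, gamma = k * R**alpha, and the function name, the dry-weather baseline parameter, and the k and alpha values are hypothetical placeholders (real coefficients depend on link frequency and polarization).

```python
def rain_rate_mm_per_hr(tx_dbm, rx_dbm, dry_loss_db, link_km,
                        k=0.12, alpha=1.06):
    """Estimate path-averaged rain rate from the extra loss on one link.

    k and alpha are illustrative placeholders, not calibrated values.
    """
    # Total loss between the transmitting and receiving towers, in dB.
    total_loss_db = tx_dbm - rx_dbm
    # Loss beyond the dry-weather baseline is attributed to rain.
    rain_loss_db = max(total_loss_db - dry_loss_db, 0.0)
    # Specific attenuation: rain-induced loss per kilometer of link.
    gamma_db_per_km = rain_loss_db / link_km
    # Invert the assumed power law gamma = k * R**alpha to recover R.
    return (gamma_db_per_km / k) ** (1.0 / alpha)


# A link that loses no more than its dry-weather baseline reads as no rain.
print(rain_rate_mm_per_hr(tx_dbm=20.0, rx_dbm=-40.0, dry_loss_db=60.0, link_km=5.0))
# Extra loss on the same link reads as a positive rain rate.
print(rain_rate_mm_per_hr(tx_dbm=20.0, rx_dbm=-46.0, dry_loss_db=60.0, link_km=5.0))
```

The weaker the received signal relative to the dry baseline, the higher the estimated rain rate along that link.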
The firm also gathers signals from Internet of Things devices like smart city sensors, connected cars and Uber drivers, images from city cameras, and more to tell what the weather’s like within 500 meters of each source. These new data points add an estimated 500 million bits of information to the satellite, radar and ground truth weather station data ClimaCell uses to predict the weather conditions in more than 50 countries.
Processing millions of fresh data points every two minutes
Every two minutes, ClimaCell receives a fresh data set, composed of images of foggy street lamps, temperature readings from connected cars and intercepted cell phone signals.
“Every two minutes it escalates the problem. It grows it and makes it a lot more challenging,” Gonczarowski said. “One of the team leads told me, ‘This is the first job where Stack Overflow is just not helping me.’”
The firm uses graphics processing units (GPUs), with hundreds of cores, thousands of concurrent threads and high floating-point throughput, to power a massively parallel computing system on the Google Cloud Platform that combines, cleans and feeds new data into a machine-learning model. Google’s Cloud Dataflow processes the weather data by grabbing windows of data for batch processing.
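To make “grabbing windows of data” concrete, here is a minimal sketch in plain Python. It is a stand-in, not ClimaCell’s pipeline: it uses the standard library rather than Cloud Dataflow or GPUs, the function names are hypothetical, and a thread pool merely illustrates handing independent windows to separate workers.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

WINDOW_SECONDS = 120  # the two-minute cadence described in the article


def assign_windows(readings):
    """Group (timestamp_seconds, value) pairs into fixed two-minute windows."""
    windows = defaultdict(list)
    for ts, value in readings:
        window_start = int(ts // WINDOW_SECONDS) * WINDOW_SECONDS
        windows[window_start].append(value)
    return dict(windows)


def summarize(values):
    """Condense one window of raw values into a single record."""
    return {"count": len(values), "mean": mean(values)}


def process_in_parallel(readings, max_workers=4):
    """Summarize every window; windows are independent, so they can run concurrently."""
    windows = assign_windows(readings)
    starts = sorted(windows)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        summaries = list(pool.map(summarize, (windows[s] for s in starts)))
    return dict(zip(starts, summaries))


readings = [(0, 10.0), (60, 12.0), (130, 20.0)]
print(process_in_parallel(readings))
```

Because no window depends on another, the same structure scales from a few threads to many workers, which is the property that makes this kind of workload a good fit for parallel hardware.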
After capturing the data, ClimaCell’s new technologies team sanitizes the information it receives, cleaning it to make sure it only includes the most accurate weather data in its model. If your car sends over a reading of 100 degrees in January, ClimaCell needs to know that Boston isn’t in the middle of a heat wave — your car heater is just on.
ClimaCell validates data through a number of methods. If the data was collected near a ground truth weather station, Gonczarowski said, ClimaCell treats the station reading as reliable and uses it to validate the new data point. Other times, it compares fresh data to historical temperatures, like seasonal averages. Gonczarowski said individuals on ClimaCell’s atmospheric data science team also help prioritize the information it takes in.
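Those two checks, a nearby station first and a seasonal average as the fallback, might look something like the sketch below. The function name, the tolerance values and the rule for readings with no reference are all assumptions made for illustration; the article does not say what thresholds ClimaCell uses.

```python
def is_plausible(reading_f, station_f=None, seasonal_mean_f=None,
                 station_tol_f=5.0, seasonal_tol_f=30.0):
    """Return True if a temperature reading (Fahrenheit) passes the best available check.

    Tolerances are hypothetical placeholders, not ClimaCell's values.
    """
    # Prefer a nearby ground truth station when one exists.
    if station_f is not None:
        return abs(reading_f - station_f) <= station_tol_f
    # Otherwise fall back to the historical seasonal average.
    if seasonal_mean_f is not None:
        return abs(reading_f - seasonal_mean_f) <= seasonal_tol_f
    # With nothing to compare against, hold the reading back.
    return False


# The car-heater case: 100 degrees against an illustrative January average of 30.
print(is_plausible(100.0, seasonal_mean_f=30.0))
# A reading close to a nearby station passes.
print(is_plausible(33.0, station_f=31.0))
```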
“They can help us really prioritize the data sources and to create what we call a current conditions layer that takes in all these different sources and tells me, you know, with high confidence, ‘I believe that here the humidity level is X,’” Gonczarowski said. “I think we’re the only weather company with an occupational biologist on board.”
A sensor fusion algorithm combines the data from all these sources, Gonczarowski said, meaning the combined estimate is more accurate than any single source read individually and separately. Sensor fusion produces a reading of what the weather is like on a specific block.
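One textbook way to see why a fused estimate beats its inputs is inverse-variance weighting, sketched below. This is a generic technique chosen for illustration, not a description of ClimaCell’s algorithm, and the function name and sample numbers are hypothetical. Each source reports a value and a variance (how noisy it is), and noisier sources get less weight.

```python
def fuse(estimates):
    """Combine (value, variance) pairs using inverse-variance weighting."""
    # A source with lower variance (more trustworthy) gets a larger weight.
    weights = [1.0 / variance for _, variance in estimates]
    total_weight = sum(weights)
    fused_value = sum(w * value for w, (value, _) in zip(weights, estimates)) / total_weight
    # The fused variance is smaller than the variance of any single input.
    fused_variance = 1.0 / total_weight
    return fused_value, fused_variance


# Two equally noisy humidity readings of 60 and 70, each with variance 4.
print(fuse([(60.0, 4.0), (70.0, 4.0)]))  # (65.0, 2.0)
```

The fused variance of 2 is half that of either input, which is the sense in which the combined reading is more confident than any reading taken alone.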
This layer is then fed into ClimaCell’s machine-learning system, which combines the new data with ground truth, satellite, radar and historical government data to forecast what the weather will be like up to six days from now.
The run-time for this model is less than five minutes.
“The idea here is, first rain causes some sort of behavior. Why don’t I just look at that behavior?” Gonczarowski said.
How ClimaCell stores data
Atmospheric scientists, biologists, research engineers and more built a custom system configured for “large amounts of RAM” to help ClimaCell cache weather data in its memory.
The firm stores these massive data amounts in the Google Cloud Storage system. Gonczarowski said engineers built ClimaCell’s storage system with Facebook’s problem in mind, avoiding the issue that users of the social media site have when they’re looking for an old post: When you open Facebook, you have to scroll to a specific year — you can’t simply request what you need and immediately get where you want to go.
“The two main use cases of accessing weather data are over space and over time. So one way I’ll access things very quickly is I’ll say, ‘I want to get from this morning at 9 a.m. everything that has happened in the greater Boston area,’” Gonczarowski said. “I can jump straight in a specific year and I will get that data from that specific year.”
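A query like “this morning at 9 a.m., greater Boston” is fast when records are keyed by both time and place. The sketch below shows one simple way to do that, bucketing readings by hour and by a coarse latitude/longitude grid. The class name, bucket sizes and coordinates are hypothetical, and the article does not describe ClimaCell’s actual storage layout.

```python
import math
from collections import defaultdict

BUCKET_SECONDS = 3600   # hypothetical one-hour time buckets
CELLS_PER_DEGREE = 10   # hypothetical grid: cells of 0.1 degree


class SpaceTimeIndex:
    """Toy index that jumps straight to the buckets covering a time range and area."""

    def __init__(self):
        self._buckets = defaultdict(list)

    @staticmethod
    def _key(ts, lat, lon):
        return (int(ts // BUCKET_SECONDS),
                math.floor(lat * CELLS_PER_DEGREE),
                math.floor(lon * CELLS_PER_DEGREE))

    def add(self, ts, lat, lon, value):
        self._buckets[self._key(ts, lat, lon)].append((ts, lat, lon, value))

    def query(self, t0, t1, lat0, lat1, lon0, lon1):
        """Return readings with t0 <= ts < t1 inside the box (lat0 <= lat1, lon0 <= lon1)."""
        h0, a0, o0 = self._key(t0, lat0, lon0)
        h1, a1, o1 = self._key(t1, lat1, lon1)
        hits = []
        # Visit only the buckets that overlap the request, then filter exactly.
        for h in range(h0, h1 + 1):
            for a in range(a0, a1 + 1):
                for o in range(o0, o1 + 1):
                    for ts, lat, lon, value in self._buckets.get((h, a, o), []):
                        if t0 <= ts < t1 and lat0 <= lat <= lat1 and lon0 <= lon <= lon1:
                            hits.append((ts, lat, lon, value))
        return hits


index = SpaceTimeIndex()
index.add(9 * 3600 + 10, 42.36, -71.06, "rain")    # Boston, just after 9 a.m.
index.add(9 * 3600 + 20, 40.71, -74.01, "clear")   # New York, same hour
index.add(20 * 3600, 42.36, -71.06, "fog")         # Boston, that evening
print(index.query(9 * 3600, 10 * 3600, 42.2, 42.5, -71.3, -70.9))
```

Only the buckets overlapping the requested hour and area are touched, so the cost of a lookup depends on the size of the request rather than the size of the archive.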
Some of ClimaCell’s databases are built on altered geospatial or GIS-type databases, Gonczarowski said. His team mostly works in Python, using scientific libraries including SciPy and pandas, along with MPI, the message-passing standard for parallel programs.
He said ClimaCell is currently doubling down on compiling historical archives of weather information. The resulting “Weather for AI” platform is intended for other engineers to use as the basis for their own artificial intelligence systems.
Then, he added, these developers and scientists could “take the historical weather data and answer for themselves, ‘What does the weather mean for me?’”