Data analytics used to be a lot like playing “Monday morning quarterback,” according to Venkat Venkataramani, CEO of Rockset. After the game — win or lose — fans of the team would get together to talk over what went wrong, what went right and what the team needed to do in the next game to win.
That’s basically batch analytics. All the analysis — sometimes done by outsiders not intimately familiar with the conditions — happens on a collected volume of data, after the fact. The data collected does nothing to help those in the game while it’s happening. It can only be used to help future games.
But as technology has advanced with the growing ubiquity of the internet of things, and with platforms like Apache Kafka and Confluent going mainstream, companies increasingly want to analyze and derive useful insights from that data right away. And they can, with real-time analytics. Instead of a game of Monday morning quarterback, the team can assess what plays are working in the middle of the third quarter to turn the game around.
What Is Real-Time Analytics?
It is increasingly important for businesses to have real-time analytics. Consumers wouldn’t be OK if a bank alerted them to suspicious transactions in their accounts a few days after the fact, said Alex Gallego, founder and CEO of Redpanda. They want to know as soon as possible. Similarly, the world of food delivery services that exploded during the pandemic couldn’t get away with alerting customers that their food was delivered several hours before. Consumers want — and need — to know immediately.
It takes some strategy to achieve real-time analytics on streaming data with low latency, however, particularly if a company is transitioning from batch analytics or is adapting traditional databases to the task.
Strategies to increase analytics speed can include throwing more hardware at the issue — more storage and processing can speed up the query response time — but that is not the most accessible solution. The following four companies have taken unique approaches to achieving real-time analytics. They’ve optimized other parts of the process, from the data format to the indexing approach to the querying, and even reworked the software from the ground up.
Molecula Makes Data More Machine-Readable
Molecula, an operational AI company, and Umbel, the previous company of its CEO Higinio Maycotte, took the strategy of making data more machine-readable and optimizing bitmap indexing to better deal with high-cardinality data.
It all started with Umbel (now MVP Index), a customer data platform that sought to enable sports teams and entertainment companies to segment audiences in real time and serve hyper-targeted ads to different customer segments on Facebook. The goal was to target ads based on Facebook likes, demographics and geographic information to achieve a greater return on investment on Facebook ad spending. The Facebook likes were the biggest component, meaning data was coming in all the time.
“They got to a point where they were trying to essentially ingest customer data while allowing for real-time querying of data on the other end,” Laura Komkov, director of corporate marketing at Molecula, said. The need to keep the data flow running and allow for real-time analysis for ad placement basically broke their stack. Either ingests or queries would stall out; the two couldn’t run at once.
Some Umbel engineers began toying with optimizing bitmap indexes to be able to deal with large volumes of data while maintaining low-latency querying. Bitmap indexing is a form of database indexing that stores a bitmap for each key value, with one yes-or-no bit per row, instead of the list of row IDs that a regular index keeps. This binary nature gives bitmap indexes considerable potential for compression as well as for query speed.
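The basic idea of a bitmap index can be sketched in a few lines of Python. This is a toy illustration only, not Pilosa's or Molecula's implementation: it uses plain Python integers as uncompressed bitmaps, and the function names and sample rows are hypothetical.

```python
from collections import defaultdict


def build_bitmap_index(rows, column):
    """Map each distinct value in `column` to an int used as a bitmap.

    Bit i of a value's bitmap is 1 when row i holds that value.
    """
    index = defaultdict(int)
    for row_id, row in enumerate(rows):
        index[row[column]] |= 1 << row_id
    return dict(index)


def matching_rows(bitmap):
    """Return the row ids whose bits are set in `bitmap`."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical sample data.
rows = [
    {"state": "Texas", "plan": "free"},
    {"state": "Ohio", "plan": "paid"},
    {"state": "Texas", "plan": "paid"},
    {"state": "California", "plan": "paid"},
]

state_index = build_bitmap_index(rows, "state")
plan_index = build_bitmap_index(rows, "plan")

# "state is Texas AND plan is paid" becomes a single bitwise AND.
texas_paid = state_index["Texas"] & plan_index["paid"]
print(matching_rows(texas_paid))  # [2]
```

Because each filter reduces to bitwise operations on whole bitmaps, combining conditions does not require scanning the rows themselves. The sketch also hints at the cardinality problem: every distinct value needs its own bitmap.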
Bitmap indexes are not traditionally known for handling high-cardinality data, meaning columns with many distinct values. When it comes to data, the “more options that are available, the more difficult it gets to process with speed because a database is having to read either row by row or column by column to find if a specific point exists or not,” Komkov said.
But the team re-engineered the traditional bitmap index to handle a wider variety of use cases. This included applying different types of compression and coding optimizations on specific tasks. This effort spun out into the open-source Pilosa project, and Maycotte created Molecula, which currently holds several patents on this feature-oriented format, according to Komkov.
“Once they had created that, they found it was able to power this customer segmentation platform, and it could deal with all of the ingest volume while still allowing for the low latency querying,” said Komkov.
“The angle that we’ve taken is, essentially, let’s shift data to be in its most performant format possible,” Komkov said.
“We essentially one-hot encode data,” Erica Fowler, senior product marketing manager at Molecula, said.
“We take any kind of data — typically you’d have something like states, say Texas, California, Ohio and Washington — and it would be in that human-readable word format within the database itself,” she said. “We one-hot encode that, where essentially we’re taking all of the columns, splitting the columns out into values, and then asking a simple yes or no for whether our record has that feature.”
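One-hot encoding as Fowler describes it can be illustrated with a short Python sketch. This is a generic version of the technique under simple assumptions, not Molecula's ingest code; the `one_hot_encode` helper and the sample records are hypothetical.

```python
def one_hot_encode(records, column):
    """Split `column` into one yes/no (1/0) feature per distinct value."""
    values = sorted({record[column] for record in records})
    encoded = [
        {f"{column}={value}": int(record[column] == value) for value in values}
        for record in records
    ]
    return values, encoded


# Hypothetical records using the states from the example above.
records = [
    {"state": "Texas"},
    {"state": "California"},
    {"state": "Ohio"},
    {"state": "Washington"},
    {"state": "Texas"},
]

values, encoded = one_hot_encode(records, "state")
for row in encoded:
    print(row)
```

Each record ends up with exactly one 1 among the state features, which is the "simple yes or no" per feature that the quote refers to.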
While the binarization of data for machine learning systems is standard, it’s usually the last step and not automated, according to Komkov. Molecula, however, binarizes and compresses the data automatically as the first step of ingest. Putting the data into the machine-readable binary format makes real-time analytics possible.
“That’s where our supposition is really — change the format of the data and put it all into one binarized format,” Komkov said. “That makes it super performant.”
Rockset Indexes Everything With Converged Indexing
Molecula wasn’t the only company that turned to indexing optimization as a core part of its real-time analytics strategy. Rockset, a cloud-based real-time analytics platform, uses converged indexing. Basically, instead of indexing only some elements of a database, as is standard, the idea behind a converged index is to index everything.
Venkataramani gave the analogy of a book. Books, like databases, contain a lot of information. Readers can use the index to identify where a specific piece of information is in the book quickly. Similarly, database indexes allow specific data to be found rapidly. If an index covers all data, that potential for query speed applies across the database.
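The "index everything" idea can be sketched as a toy inverted index over every field of every record. This is only an illustration of the principle under simplifying assumptions, not Rockset's converged index; the function names and sample documents are hypothetical.

```python
from collections import defaultdict


def index_everything(documents):
    """Build an inverted index over every (field, value) pair."""
    index = defaultdict(set)
    for doc_id, doc in enumerate(documents):
        for field, value in doc.items():
            index[(field, value)].add(doc_id)
    return index


def lookup(index, **conditions):
    """Return sorted ids of documents matching all field=value conditions."""
    id_sets = [index.get((field, value), set()) for field, value in conditions.items()]
    return sorted(set.intersection(*id_sets)) if id_sets else []


# Hypothetical sample documents.
documents = [
    {"user": "ana", "city": "Austin", "device": "ios"},
    {"user": "ben", "city": "Austin", "device": "android"},
    {"user": "cy", "city": "Dallas", "device": "ios"},
]

index = index_everything(documents)
print(lookup(index, city="Austin"))                # [0, 1]
print(lookup(index, city="Austin", device="ios"))  # [0]
```

Since no field is left unindexed, any combination of filters can be answered from the index without scanning the documents. The tradeoff is extra storage for the index entries.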
Normally a database’s index will only cover part of the data because it’s not affordable to index all of it, according to Venkataramani. But in specific circumstances, a converged index that covers all data actually brings the cost down, because extensive indexing reduces the processing needed to run queries.
The traditional approach many companies take to implementing a real-time analytics application is to start with either databases that were really built to be a system of record, or data lakes or data warehouses set up for batch analytics, Venkataramani said. They then take those databases to the cloud to allow for data streaming and real-time analytics. This situation requires a specific strategy to make real-time analytics cost effective.
In the cloud, you have to pay for storage and computing power, Venkataramani said. When it comes to running real-time analytics, however, the cost of computing power far exceeds the cost of storage. That means anything that speeds up the process, and so decreases the computing power needed, will reduce the costs involved.
“Rockset’s indexing approach might increase your storage costs a little bit, but it will save you your compute costs by orders of magnitude,” Venkataramani said.
Medisafe Saw Patterns to Optimize Querying
Unlike Rockset, which has always been a real-time analytics company, Medisafe and its digital drug companion app started life using batch analysis. After a user had begun a medication program in the app, Medisafe could begin to capture data and insights about that specific patient.
But the batch analysis approach gave a very limited view into individual patient behaviors, according to Rotem Shor, Medisafe’s chief technology officer. The app couldn’t offer patients as personalized an experience as needed. To really help users keep up with their medications, the app needed to be more responsive to individual behavior and needs, and to do that, insights on individual patient behavior needed to be real time.
While they were still operating via batch analysis, however, Medisafe saw some patterns across users.
“We saw that there were times in a patient’s journey where more timely interaction could lead to better outcomes,” Shor said. “This was the impetus for a shift to real-time analytics — the ability to view insights in real-time and provide guidance to patients at moments that matter most.”
While each patient’s journey is unique and different in the fine details, there are some overarching similarities. For example, patients taking the same medication or experiencing the same condition are going to have some structural similarities in their experiences and needs. These across-user patterns allowed for the creation of pre-determined patient journeys that could then be customized based on individual user interaction data.
While the process is proprietary, identifying these patterns across user experiences allowed Medisafe to optimize their queries, reducing latency of analysis and ultimately allowing for real-time analytics.
Redpanda Rebuilt Old Systems for Simplicity
While many companies like Medisafe have been successful at transitioning from traditional databases with batch processing to real-time analytics, Gallego described trying to force old systems into real time as a game of whack-a-mole. Those systems — both the hardware and the software — were just not designed for real-time processing, he said.
So Redpanda, a Kafka-compatible streaming platform, took the strategy of re-engineering the process — specifically the Apache Kafka API — from the ground up with a focus on modern hardware and simplicity.
Hardware is fundamental to the strategy. Software is designed for hardware, so when hardware changes, the software must, too.
“All of the streaming technologies were built for a decade-old hardware platform,” Gallego said, referencing Kafka, Pulsar and RabbitMQ. When he wrote the code for the first version of Redpanda, he wrote it with modern hardware in mind.
“Hardware itself is fundamentally different today than it was a decade ago,” he said, highlighting the physical differences in modern storage hardware, like solid-state drives compared to old spinning disks and the proliferation of CPUs in modern servers.
In addition to writing the program for this modern hardware, Gallego said that a goal was to make the system simple. To run Apache Kafka, you need five different systems, he said: Kafka brokers, Zookeeper, an HTTP proxy, a schema registry and a Kafka stream. Redpanda has instead embedded those five processes into one.
“We couldn’t be here if Kafka and Pulsar and RabbitMQ hadn’t come before us, and we got to learn from the architecture and improve on some of the design shortcomings,” Gallego said. “By redesigning for the modern hardware, we can onboard the complexity onto the platform. And I think this is the right tradeoff … we should onboard the complexity so that you don’t have to.”
Moving From Batch to Real Time
Companies continue to do batch processing today rather than moving to real-time analytics because it’s easy to stick with the old way of doing things, according to Gallego. But there’s a transition happening.
“The world is moving into real-time interactions,” he said. “We’re going through that pièce de résistance movement where you stop doing things nightly.”
Venkataramani said the move from batch processing to real-time processing is more realistic for many companies as the ability to generate and capture data streams is becoming more ubiquitous.
“People ask me, ‘Why now? Why hasn’t it happened like 10 years ago?’” he said. “Now is when businesses are actually accumulating their data in real time. And now is when they can even think about, ‘Well, what if I can also get analytics out of that?’”