Synthetic data is information that has been created algorithmically or via computer simulations. It’s essentially a product of generative AI, consisting of content that has been artificially manufactured as opposed to gathered in real life. And it is usually generated in order to increase the size of an AI model’s training data, or introduce new scenarios or edge cases that have not been covered with existing data.
“At its highest level, synthetic data is just data that hasn’t been collected by a sensor in the real world,” Lina Colucci, a co-founder of synthetic data-as-a-service startup Infinity AI, told Built In. “Then you use that to train your AI models.”
What Is Synthetic Data?
Synthetic data is created algorithmically or via computer simulations, and is used as training data to develop more robust AI models. It has the same mathematical patterns and properties as its real-world counterpart, but doesn’t contain any of the original information and is not a product of actual events.
In many ways, data is the fuel that powers artificial intelligence, teaching computers how to recognize and automatically respond to objects, actions and commands. Some AI developers mine the internet and elsewhere for their data; others artificially manufacture their own, which has proven to be an efficient and relatively inexpensive alternative to the real stuff. While it is technically fabricated, synthetic data reflects its real-world counterpart both mathematically and statistically. Recent research has shown that it can be as effective as, or even better than, data gathered from actual objects, events or people for training machine learning models.
Like real data, synthetic data comes in many shapes and sizes. It can be generated text in natural language processing applications, manufactured tabular data for classification and regression analyses, or generated media like videos and images for computer vision applications. Limited access to real-world data, data privacy concerns, and the time or financial burden of data collection and annotation all make synthetic data an attractive resource when building and training AI models.
Today, synthetic data is a popular resource for a variety of industries — including automotive, healthcare, banking and more — providing anonymized, pliable, high-quality information on which they can train their AI models.
“The most innovative companies, the most forward-thinking companies, are starting to work with synthetic data,” Tobias Hann, the CEO of synthetic data company Mostly AI, told Built In. “It’s a massive opportunity, it’s a growing space, and we’re still in the early days.”
How Does Synthetic Data Work?
In general, synthetic data is generated by AI trained on real-world data samples. The algorithm first learns the patterns, correlations and statistical properties of the sample data, and then it creates statistically identical synthetic data. This new data looks, feels and means the same as the real-world data it was based on.
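The pipeline described above can be sketched in a few lines. The example below is a deliberately simple illustration, not any vendor’s actual product: it fits the mean and covariance of a made-up two-column table, then samples brand-new rows that preserve the aggregate pattern.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up "real" sample: two correlated columns standing in for
# sensor- or survey-collected data (e.g. age and income).
age = rng.normal(40, 10, 5000)
income = 1000 * age + rng.normal(0, 5000, 5000)
real = np.column_stack([age, income])

# Step 1: learn the sample's statistical properties
# (here, just the mean vector and covariance matrix).
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Step 2: generate brand-new rows from the learned distribution.
synthetic = rng.multivariate_normal(mean, cov, size=5000)

# The synthetic set preserves the aggregate pattern (the age/income
# correlation) without repeating any original row.
real_corr = np.corrcoef(real, rowvar=False)[0, 1]
synth_corr = np.corrcoef(synthetic, rowvar=False)[0, 1]
print(round(real_corr, 2), round(synth_corr, 2))
```

A real generator would model far richer structure than a single covariance matrix, but the two-step shape — learn the distribution, then sample from it — is the same.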
And like real-world data, synthetic data can be organized into two broad categories — structured and unstructured — depending on the data values, the operations that can be applied to the data, and the relationships between each piece of data. Structured data is quantitative. It is usually organized in spreadsheets, CSV files or data warehouses, and can be searched and analyzed immediately, as is. Unstructured data is more qualitative, and typically comes in the form of images or videos. It is often stored in data lakes, and it cannot be analyzed or searched in its native form without some processing and transformation.
Exactly how synthetic data is generated largely depends on its structure. Creating a spreadsheet of fabricated numbers is very different from creating a 3D image. And both have distinct use cases.
Synthetic Structured Data
Synthetic structured data is made by taking a relational database of real data, creating a generative machine learning model for it, and then generating a second set of data that mathematically or statistically mirrors the original. The synthetic data set contains the general patterns and properties of its real-world counterpart, without any of the actual information.
So, let’s say you have a spreadsheet, with thousands of data points across hundreds of rows and columns. If you want to make a synthetic version of that data, you have to first “understand the relationship” between those columns and rows, as well as all the different data points in the data set, according to Adam Kamor, the head of engineering at Tonic.ai, a company that specializes in generating synthetic structured data.
“From that understanding, you can then generate synthetic rows that kind of match that understanding,” he told Built In.
Every row of synthetic data generated is not intended to be tied back to any individual row of real data. “I shouldn’t be able to say ‘OK, this is clearly that row but you just changed that value,’” Kamor said. “As a generative process, you’re not looking at your individual rows anymore. You’re instead looking at the aggregate model. The model — that’s your understanding of how it all ties together.”
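One common way to verify Kamor’s point — that no synthetic row should be traceable to a real one — is a distance-to-closest-record check. The sketch below is an illustration over assumed toy data, not Tonic.ai’s actual tooling: for each synthetic row it measures the distance to the nearest real row, where an exact copy would show up as a zero distance.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "real" table and a synthetic set drawn from its fitted distribution.
real = rng.normal([50, 100], [5, 20], size=(200, 2))
synthetic = rng.multivariate_normal(real.mean(axis=0),
                                    np.cov(real, rowvar=False), size=200)

# Distance to closest record (DCR): for each synthetic row, how far away
# is the nearest real row? A verbatim copy would have distance zero.
dists = np.linalg.norm(real[None, :, :] - synthetic[:, None, :], axis=2)
dcr = dists.min(axis=1)

# No synthetic row is an exact copy of a real one.
print((dcr > 0).all())
```

Production tools apply much more rigorous privacy metrics, but the intuition is the same: evaluate synthetic rows against the aggregate model, not against individual source rows.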
The main appeal of synthetic structured data is that it offers an easy way to train AI models with as much data as a user needs at scale, without compromising the integrity or security of private or sensitive information. Many of the companies using synthetic structured data are in highly regulated spaces, including healthcare, finance and education — industries that frequently have confidential information that cannot be shared publicly.
Synthetic Unstructured Data
Unstructured synthetic data usually comes in the form of 3D models, like the ones seen in video games, flight simulations and animated movies, which can then be used to train models. The world is generated using CGI tools and can be customized with physics-based parameters, such as lighting and camera angle. They can also feature avatars with specific body proportions, clothing and skin tones to represent humans. This kind of data is ideal in applications of computer vision, such as self-driving cars, medical imaging and warehouse safety.
The worlds and situations depicted by these models are created through a combination of 3D scans of real people and objects, plus 3D assets modeled by specialized artists, all of which can be expanded upon algorithmically using deep learning methods. Similar to AI art generators like Stable Diffusion and DALL-E, synthetic unstructured data generators take some samples of real-world data and then use generative techniques to “augment” them, as Colucci put it, turning one real data sample into hundreds or even thousands of synthetic ones, which are then automatically annotated.
At this point, potentially millions of annotated synthetic images have been produced, some of which may not be realistic-looking enough to train a model. This is where GANs, or generative adversarial networks, come into play. A GAN is a deep learning framework composed of two battling neural networks: a generator and a discriminator. The generator produces fake images, while the discriminator works to distinguish between real images provided by a training set and those created by the generator. Over time, both networks improve, meaning the generator makes more realistic images and the discriminator gets better at detecting fakes, thus leading to more (and better) synthetic images.
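The adversarial loop can be made concrete with a deliberately tiny sketch. Instead of images, the toy example below (my illustration, not a production GAN) uses one-dimensional data: the generator is a single learnable shift applied to noise, and the discriminator is a logistic classifier. The same push-and-pull dynamic drives the generator’s output distribution toward the real one.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy setup: "real" data is drawn from N(3, 1). The generator is a single
# learnable shift on noise; the discriminator is logistic regression.
theta = 0.0          # generator parameter: G(z) = z + theta
w, b = 0.0, 0.0      # discriminator parameters: D(x) = sigmoid(w*x + b)
lr_d, lr_g = 0.1, 0.02

for step in range(5000):
    real = rng.normal(3.0, 1.0, 64)
    fake = rng.normal(0.0, 1.0, 64) + theta

    # Discriminator step: push D(real) toward 1 and D(fake) toward 0.
    s_r, s_f = sigmoid(w * real + b), sigmoid(w * fake + b)
    w += lr_d * ((1 - s_r) * real - s_f * fake).mean()
    b += lr_d * ((1 - s_r) - s_f).mean()

    # Generator step: shift fakes so the discriminator scores them as real.
    s_f = sigmoid(w * fake + b)
    theta += lr_g * ((1 - s_f) * w).mean()

# The learned shift should drift toward the real data's mean of 3.
print(round(theta, 1))
```

Real image GANs replace the one-parameter shift and logistic classifier with deep convolutional networks, but the alternating generator/discriminator updates are the same mechanism.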
For example, Datagen, a company that specializes in synthetic unstructured data, claims to have generated more than 100,000 unique 3D humans, as well as various assets, environments and motion sequences. The company’s customers can then integrate these assets to generate the exact data that they need, whether that be up-close faces or humans in action.
The result is “high quality,” “high fidelity” visual outputs that are rich in data, Gil Elbaz, Datagen’s co-founder and CTO, told Built In. “It really includes a lot of information that you would never be able to manually annotate — it would be too much work or too expensive.”
Synthetic Data In Machine Learning
Synthetic data has been around, in one form or another, for decades. Its concept is largely credited to Donald B. Rubin, a Harvard statistics professor, who discussed it in a 1993 paper as a way to preserve data privacy in statistical analyses. In the years since then, major companies including Tesla and Microsoft have pushed the boundaries of what was previously thought possible with synthetic data and machine learning, prompting a new wave of industry adoption.
“It’s become best practice,” Colucci said. And there is big demand for it, particularly in the realm of AI and machine learning development, which requires immense amounts of data.
“You need lots of data to train models. And sometimes it’s just not feasible to collect that data in the real world,” Hann said. For instance, autonomous vehicles need to be trained on lots of visual data of different driving conditions and scenarios, all of which can be quite costly and time consuming to gather in the real world. Instead, companies can train the models on 3D renderings of those conditions and scenarios. “It helps to speed up the development of machine learning algorithms, because it allows you to create data that you otherwise didn’t have,” he added.
As synthetic data has evolved over the years, it has become an increasingly integral part of not just image-dependent spaces like self-driving cars and other computer vision applications, but also those that are dependent on more ordered, tabular data like banking and sales.
Synthetic Data Examples
Copious amounts of high-quality data remain a prerequisite for machine learning. But access to real data that meets these standards is often limited, or blocked entirely, due to privacy concerns or time and financial constraints. Synthetic data’s capacity to offer data sets where existing ones are either too sensitive or small to be useful has attracted the attention of several industries. Here are some ways it’s being used.
How Is Synthetic Data Used?
- It’s used to train autonomous vehicles for various scenarios.
- It’s used in banking without compromising client information.
- It’s used in smart retail store operations.
- It’s used in medical imaging.
- It’s used in machine vision.
Training Autonomous Vehicles
Three-dimensional simulations of driving footage can be an effective way to train autonomous cars, trucks and robots. They can use procedurally generated streets full of cars, animals, pedestrians and more to better prepare for both common and rare scenarios while driving. Synthetic data can also be leveraged to improve safety within the vehicles themselves, providing training data to detect driver fatigue, distraction and other passenger behaviors.
Banking and Client Information
Companies in the banking and finance industry gather a lot of data on their customers, and much of it is personally identifiable information, or PII. Synthetic data makes it possible for them to still use their proprietary data in model training without compromising customers’ confidentiality. This can come in handy for everything from mortgage analytics to fraud detection.
Smart Retail Store Operations
In smart retail stores like Amazon Go and Nike Speed Shop, almost every experience is facilitated by an array of complex computer vision modeling. To successfully allow for conveniences like self-checkout and automatic payments, these models have to be sophisticated enough to handle all manner of image challenges, including object recognition and pose estimation. And yet, when you think about an in-person shopping environment — different people moving in all different directions, items that look similar to each other, varying light and shadows — it can be hard to account for all the variables in the training data. Synthetic data can be used to cover even the most rare cases, ensuring that a store’s AI models are operating at their best.
Medical Imaging
Health data sets can contain very personal information, and are often small when they pertain to things like rare diseases. Plus, medical imaging data from real life is hard to come by given the high cost of obtaining clinical annotations. But methods like GANs and variational autoencoders, another deep learning architecture commonly associated with synthetic data, can be a useful and affordable means of both protecting patient confidentiality and beefing up training data with more examples.
Machine Vision
In warehouses, synthetic data can be used in what is called machine vision, which helps industrial robots see and recognize their surroundings so they can perform more complex tasks. Whether it’s identifying packages of varying shapes and sizes or monitoring for spills and other unsafe conditions, synthetic data ensures that AI models in the manufacturing and logistics space are ready for anything.
Synthetic Data Advantages and Disadvantages
There are many advantages to making and using synthetic data. For one, it is extremely flexible. “Because it is artificially made up data, it can be modified. It can be shaped and formed in specific ways that are most relevant to your use cases,” Hann, of Mostly AI, said. “You can create exactly what you actually want.”
Because of this, synthetic data is a really handy tool for testing edge cases, or unique situations that are rarely captured by real data. It can be difficult to account for every possible scenario with real data, which means that, when rare ones happen, they can trip up an AI model that hasn’t been trained to handle them.
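As a rough illustration of manufacturing edge cases, the sketch below uses SMOTE-style interpolation — an assumption on my part, as real synthetic-data pipelines are far more sophisticated. Given a handful of real examples of a rare case, it fabricates hundreds of new points along the segments between them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Suppose only 5 real examples of a rare edge case exist (e.g. an unusual
# sensor-reading pattern), versus thousands of rows of normal behavior.
rare = rng.normal([10.0, -3.0], 0.5, size=(5, 2))

def interpolate_edge_cases(samples, n, rng):
    """SMOTE-style augmentation: new points on segments between real pairs."""
    i = rng.integers(0, len(samples), n)
    j = rng.integers(0, len(samples), n)
    t = rng.random((n, 1))
    return samples[i] + t * (samples[j] - samples[i])

synthetic_rare = interpolate_edge_cases(rare, 500, rng)
print(synthetic_rare.shape)  # (500, 2)
```

The 500 fabricated rows stay inside the region spanned by the real examples, giving a model many more chances to learn the rare pattern than the 5 originals alone.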
Synthetic Data Advantages
- It can be easily modified according to a specific need or use case.
- It is a quick and easy way to create data for testing edge cases, or unique situations that are rarely captured by real-world data.
- It automates the data annotation process, making it a lot quicker.
- It contains no identifiable information about real people, making it ideal for maintaining data privacy.
- It can be a quick and convenient way to make training data sets more diverse.
“You can, with code, specify rare examples — these long-tail failure cases. And then you can put that back into your training data, train your models, and fix those failures,” Infinity AI’s Colucci said. “In the real world, that could take six months to a year to find enough real-world examples of these rare edge cases.”
But, with synthetic data, it can take as little as a few minutes, depending on the project. This is because creating synthetic data is a lot less tedious and time consuming than the regular collection and annotation of real-world data. In computer vision, for example, image data needs to be annotated, or labeled, with different metadata. This includes things like depth maps, segmentation masks and bounding boxes — the squares or rectangles that surround identifiable objects like cars or animals.
With real data, that annotation work can take months and is often contracted to third-party labeling services where humans manually go through every single image frame and identify the things within them. “The whole process for collecting data, labeling it and curating [data] is just a massive operational overhead,” Colucci said. “It’s literally the biggest blocker to progress in machine learning.”
With synthetic data, annotations are self-generated as the data is made. In other words: Potentially, hundreds of generated videos with exact specification requirements can be made in a matter of minutes, all labeled and ready to go.
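The “annotations for free” idea can be sketched as follows. The toy renderer below is a stand-in for a real CGI pipeline: because the code places each object itself, the bounding boxes and segmentation mask exist the moment the image does, with no human labeler involved.

```python
import numpy as np

rng = np.random.default_rng(0)

def render_scene(n_objects, size=64):
    """Place random rectangles in a blank 'image'. Since we placed them
    ourselves, the bounding-box labels come for free."""
    image = np.zeros((size, size), dtype=np.uint8)
    boxes = []
    for obj_id in range(1, n_objects + 1):
        h, w = rng.integers(5, 15, 2)
        y = int(rng.integers(0, size - h))
        x = int(rng.integers(0, size - w))
        image[y:y + h, x:x + w] = obj_id  # doubles as a segmentation mask
        boxes.append((x, y, int(w), int(h)))
    return image, boxes

image, boxes = render_scene(3)
print(len(boxes))  # 3 labeled objects, annotated at creation time
```

A production simulator emits photorealistic frames plus depth maps and per-pixel labels the same way: the renderer already knows where everything is, so the labels are a byproduct of generation rather than a separate manual step.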
“All of a sudden, you take data collection and data labeling — these two very manually intensive processes — and you replace them with code,” she continued. “It’s a huge timesaver.”
Maintaining Data Privacy
Privacy remains the key advantage of synthetic data. While it resembles real data, synthetic data ideally does not contain any traceable information about the actual data itself. It contains no personal or identifiable information about real people, meaning it can be used, transferred and manipulated within the bounds of existing laws and regulations in ways that are not possible with real data.
And as AI regulation continues to tighten up across the United States, the European Union and the rest of the world, it is likely going to play an even more “significant” role, according to Tonic’s Kamor.
“It’s going to get harder and harder to use real data. And there will need to be frameworks put in place for using synthetic data,” he said. “There’s going to have to be definitions and standards for how you synthesize data so that you meet some privacy threshold. And people will use that for most day-to-day work.”
“The biggest misconception is that synthetic equals private.”
Privacy isn’t always a given when it comes to synthetic data, though. “The biggest misconception is that synthetic equals private,” Kamor said. “Depending on the algorithm that you use for generating this synthetic data, your synthetic result set could be super private. Or it could be not private at all.”
For example, if a user has 100 rows of real data but they want 1,000 more rows of synthetic data to beef up the data set, an algorithm could just regenerate those same 100 rows in their original states 10 times. “That’s a synthetic data algorithm in a way, but it’s super not private. So the privacy guarantees you get are going to really be a function of your algorithm and how you’re doing these things,” he continued.
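Kamor’s duplication example is easy to make concrete. The sketch below (variable names and data are made up for illustration) contrasts a “generator” that merely tiles the real rows, leaking every one of them, with one that samples fresh rows from a fitted distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

real = rng.normal(50, 10, size=(100, 2)).round(2)  # 100 real rows

# "Synthetic" generator 1: just tile the real rows ten times.
# It technically produces 1,000 rows -- and leaks every real record.
naive = np.tile(real, (10, 1))

# Generator 2: sample fresh rows from the fitted distribution instead.
private = rng.multivariate_normal(real.mean(axis=0),
                                  np.cov(real, rowvar=False), size=1000)

# Count how many output rows appear verbatim in the real data.
leaked = sum((row == real).all(axis=1).any() for row in naive)
fresh_leaked = sum((row == real).all(axis=1).any() for row in private)
print(leaked, fresh_leaked)
```

Both approaches are "synthetic data algorithms" by a loose definition, but their privacy properties are opposites — which is exactly Kamor's point that the guarantees depend on the algorithm, not the label.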
Making Data Sets More Diverse
Another big selling point of synthetic data is that it stands to rectify the long-standing diversity problem in AI training data sets. After all, AI systems (particularly those related to images) are riddled with defects. For example, iPhones have been known to unlock for the wrong person because they can’t discern between different Asian faces, and crime prevention tools have sent innocent people to jail. Even Google’s image-labeling tool has gotten in hot water for mistaking darker-skinned people for gorillas.
These errors aren’t uncommon. Rather, they are an unfortunate, yet inevitable, consequence of the homogenous data so many AI systems are trained on, which mostly skews white and male — making these tools imprecise or even useless for anyone who isn’t taken into account in the data. Synthetic data offers a convenient solution by allowing teams to generate image data that is more diverse.
“[Synthetic data has] the ability to generate diverse and balanced data sets with precise control over specific parameters. This can lead to more robust and accurate computer vision models, especially when used in conjunction with real-world data,” Elbaz, of Datagen, said.
Synthetic Data Won’t Necessarily Fix the AI Bias Problem
However, this method is far from perfect. Synthetic image generators require large data sets of tens of thousands of real images to train themselves to create photorealistic outputs. Those initial inputs serve as a foundation on which to build. If that foundation is incomplete or biased, the images generated could be just as ineffective as an AI system trained on biased real data.
Synthetic Data Disadvantages
- Synthetic data can still lead to biased outputs.
- The quality of synthetic data is not always reliable.
- Privacy is not always guaranteed.
- It requires specific expertise, as well as time and effort, to do it right.
- Synthetic data can only resemble real-world data, not replicate it completely.
Or worse, they could reinforce the biases they’re trying to do away with. One 2020 study found that some “racial transformations” create outputs that are unsettlingly evocative of blackface or yellowface. A different study at Arizona State University found that GANs tasked with generating faces of engineering professors both lightened the “skin color of non-white faces” and transformed “female facial features to be masculine.”
As the popular saying in AI goes: “garbage in, garbage out.” That goes for synthetic data as well.
‘It’s Very, Very Hard to Develop This Technology’
Indeed, synthetic data is by no means a silver bullet for the challenges and inefficiencies of AI development. Models trained on synthetic data sometimes produce different outputs than those trained on real-world data — usually an indicator that the synthetic data is lacking in some way.
“It can be hard to know ahead of time, and it can require iteration. It’s challenging. And it gets more challenging the more complex the data set is,” Kamor said.
In fact, the inconsistencies that arise when trying to replicate the complexity of an original data set, and the inability to replicate authentic data completely, are perhaps this technology’s biggest drawbacks. “It’s very, very hard to develop this technology. It requires an enormous amount of effort,” Elbaz said.
“If you don’t have an easy to implement solution for synthetic data, it becomes very expensive and costly to develop synthetic data yourself,” he continued. “When people don’t have the resources needed to create high quality synthetic data, they hack together things that are half-ass or not fully fleshed out. And then there are different issues that arise with that.”
Will Synthetic Data Ever Replace Real Data?
It is for these reasons that synthetic data will likely never replace real-world data completely, although it will almost certainly become more of a fixture in AI and machine learning development in the coming years.
“The future of synthetic data is incredibly promising. As the technology becomes more sophisticated, [I] expect to see a wider adoption across industries and applications,” Elbaz said, adding that it has the potential to “democratize AI development” going forward by making “high-quality training data more accessible and affordable for everyone.”
That being said, not all kinds of data can be synthesized. And the ones that can still need to be validated and controlled to ensure outputs are suited for the real world, not just the synthetic one in which the generated data exists.
“For many, many instances, it will just make sense to work with synthetic data in the future.”
Of course, this validation requires real data. “You obviously want your models to perform well in the real world,” Colucci said. Looking ahead, she believes every machine learning model should be trained on 90 percent synthetic data and 10 percent real data. “I think that’s going to happen very soon.”
In fact, research and consulting firm Gartner predicts that, by 2024, 60 percent of the data used for the development of AI and analytics projects will be synthetically generated.
“Will it be all the data? Probably not,” Hann said. “But, for many, many instances, it will just make sense to work with synthetic data in the future.”