Synthetic data is information that has been created algorithmically or via computer simulations. It’s essentially a product of generative AI, consisting of content that has been artificially manufactured as opposed to gathered in real life.
“At its highest level, synthetic data is just data that hasn’t been collected by a sensor in the real world,” Lina Colucci, a co-founder of synthetic data-as-a-service startup Infinity AI, told Built In. “Then you use that to train your AI models.”
What Is Synthetic Data?
Synthetic data is created algorithmically or via computer simulations, and is used as training data to develop more robust AI models. It has the same mathematical patterns and properties as its real-world counterpart, but doesn’t contain any of the original information and is not a product of actual events.
Synthetic data is usually generated to increase the size of an AI model’s training data, introduce new scenarios or edge cases that existing data does not cover, and save analysts the time of mining the internet and other sources for data. As a result, synthetic data is poised to play a bigger role in driving more informed research and innovation.
Why Is Synthetic Data Important?
While it is technically fabricated, synthetic data reflects its real-world counterpart both mathematically and statistically. Research has shown that it can be just as effective as, or even better than, data collected from actual objects, events or people at training machine learning models.
Like real data, synthetic data comes in many shapes and sizes. It can be generated text in natural language processing applications, manufactured tabular data for classification and regression analyses, or generated media like videos and images for computer vision applications. Limited access to real-world data, data privacy concerns, and the time or financial burden of data collection and annotation all make synthetic data an attractive resource when building and training AI models.
As a result, synthetic data has become a popular resource for a variety of industries — including automotive, healthcare, banking and more — providing anonymized, pliable, high-quality information on which organizations can train their AI models.
“The most innovative companies, the most forward-thinking companies, are starting to work with synthetic data,” Tobias Hann, the CEO of synthetic data company Mostly AI, told Built In. “It’s a massive opportunity, it’s a growing space, and we’re still in the early days.”
How Does Synthetic Data Work?
Synthetic data is generated by AI trained on real-world data samples. The algorithm first learns the patterns, correlations and statistical properties of the sample data, and then it creates new data that closely matches those statistics. This new data looks, feels and means the same as the real-world data it was based on.
And like real-world data, synthetic data can be organized into two broad categories — structured and unstructured. When data is structured, that means it is quantitative. It is usually organized in spreadsheets, CSV files or data warehouses, and can immediately be searched and analyzed in the state that it is in. Unstructured data is more qualitative, and typically comes in the form of images or videos. It is often stored in data lakes and cannot be analyzed or searched in its native form without some processing and transformation.
How synthetic data is generated largely depends on its structure. Creating a spreadsheet of fabricated numbers is very different from creating a 3D image. And both have distinct use cases.
Synthetic Structured Data
Synthetic structured data is made by taking a relational database of real data, creating a generative machine learning model for it, and then generating a second set of data that mathematically or statistically mirrors the original. The synthetic data set contains the general patterns and properties of its real-world counterpart, without any of the actual information.
Let’s say you have a spreadsheet, with thousands of data points across hundreds of rows and columns. If you want to make a synthetic version of that data, you have to first “understand the relationship” between those columns and rows, as well as all the different data points in the data set, according to Adam Kamor, the head of engineering at Tonic.ai.
“From that understanding, you can then generate synthetic rows that kind of match that understanding,” he told Built In.
Every row of synthetic data generated is not intended to be tied back to any individual row of real data. “I shouldn’t be able to say ‘OK, this is clearly that row but you just changed that value,’” Kamor said. “As a generative process, you’re not looking at your individual rows anymore. You’re instead looking at the aggregate model. The model — that’s your understanding of how it all ties together.”
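As a rough illustration of what Kamor describes, the sketch below fits an aggregate model to a small table and then samples new rows from that model instead of editing real rows. It is a minimal sketch, not how Tonic.ai or any other vendor does it: the two columns (age and income), the invented "real" rows and the choice of a simple two-variable Gaussian model are all assumptions made for illustration.

```python
import math
import random
import statistics

def fit_model(rows):
    """Learn aggregate statistics of a two-column table: means, spreads, correlation."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (len(rows) - 1)
    return {"mean": (mean_x, mean_y), "sd": (sd_x, sd_y), "corr": cov / (sd_x * sd_y)}

def sample_rows(model, n, rng):
    """Draw brand-new rows from the fitted model (assumed here to be a bivariate Gaussian)."""
    (mean_x, mean_y), (sd_x, sd_y), rho = model["mean"], model["sd"], model["corr"]
    leftover = math.sqrt(max(0.0, 1.0 - rho * rho))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        rows.append((mean_x + sd_x * z1, mean_y + sd_y * (rho * z1 + leftover * z2)))
    return rows

rng = random.Random(42)

# Stand-in for a real table of (age, income) pairs. These values are invented.
real_rows = []
for _ in range(200):
    age = rng.uniform(20, 65)
    real_rows.append((age, 800 * age + rng.gauss(0, 8000)))

model = fit_model(real_rows)                     # the "understanding" of the table
synthetic_rows = sample_rows(model, 5000, rng)   # rows generated from that understanding
synthetic_model = fit_model(synthetic_rows)

# The correlation between the columns should be similar in both tables.
print(round(model["corr"], 2), round(synthetic_model["corr"], 2))
```

The synthetic rows are drawn only from the fitted statistics, so none of them is a lightly edited copy of a particular real row. Real tables with many columns, categories and non-Gaussian shapes need far richer models than this one.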
The main appeal of synthetic structured data is that it offers an easy way to train AI models with as much data as a user needs at scale, without compromising the integrity or security of private or sensitive information. Many of the companies using synthetic structured data are in highly regulated spaces, including healthcare, finance and education — industries that frequently have confidential information that cannot be shared publicly.
Unstructured Synthetic Data
Unstructured synthetic data usually comes in the form of 3D models, which are generated using CGI tools and can be customized with physics-based parameters, such as lighting and camera angle. They can also feature avatars with specific body proportions, clothing and skin tones to represent humans. This kind of data is ideal in applications of computer vision, such as self-driving cars, medical imaging and warehouse safety.
The worlds and situations depicted by these models are created through a combination of 3D scans of real people and objects, plus 3D assets modeled by specialized artists, all of which can be expanded upon algorithmically using deep learning methods. Similar to AI art generators, synthetic unstructured data generators take some samples of real-world data and then use generative techniques to “augment” them, as Colucci put it, turning one real data sample into hundreds or even thousands of synthetic ones, which are then automatically annotated.
Among the potentially millions of annotated synthetic images produced, some may not be realistic-looking enough to train a model. This is where generative adversarial networks (GANs) come into play. GANs are a deep learning framework composed of two neural networks: a generator and a discriminator. The generator produces fake images, while the discriminator works to distinguish between real images provided by a training set and those created by the generator. Over time, the generator makes more realistic images and the discriminator gets better at detecting them, leading to more (and better) synthetic images.
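The generator-versus-discriminator loop can be sketched in a few lines on a toy problem. This is a deliberately tiny, one-dimensional illustration and not an image GAN: the "real" data is assumed to be numbers centered on 0, the generator has a single parameter `b` (the center of the fakes it produces), the discriminator is a simple logistic classifier, and the learning rate, batch size and step count are arbitrary choices.

```python
import math
import random

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

REAL_MEAN = 0.0   # assumption: real data is drawn from a bell curve centered here
BATCH = 64

def train_gan(steps, lr=0.1, seed=0):
    rng = random.Random(seed)
    b = 3.0          # generator parameter: a fake sample is b + noise (starts far off)
    w, c = 0.0, 0.0  # discriminator parameters: D(x) = sigmoid(w * x + c)
    for _ in range(steps):
        real = [rng.gauss(REAL_MEAN, 1) for _ in range(BATCH)]
        fake = [b + rng.gauss(0, 1) for _ in range(BATCH)]

        # Discriminator step: minimize -log D(real) - log(1 - D(fake)).
        grad_w = grad_c = 0.0
        for x in real:
            err = sigmoid(w * x + c) - 1.0   # derivative of -log D w.r.t. the logit
            grad_w += err * x
            grad_c += err
        for x in fake:
            err = sigmoid(w * x + c)         # derivative of -log(1 - D) w.r.t. the logit
            grad_w += err * x
            grad_c += err
        w -= lr * grad_w / BATCH
        c -= lr * grad_c / BATCH

        # Generator step: minimize -log D(fake), i.e. try to fool the discriminator.
        grad_b = sum((sigmoid(w * x + c) - 1.0) * w for x in fake) / BATCH
        b -= lr * grad_b
    return b, w, c

b, w, c = train_gan(steps=500)
print(f"center of generated data after training: {b:.2f} (real center is {REAL_MEAN})")
```

Each round, the discriminator is nudged to score real samples higher than fakes, and the generator is nudged in whatever direction makes its fakes score higher. In this toy setting that pulls the generated numbers toward the real ones. Image GANs apply the same loop to neural networks with millions of parameters.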
For example, Datagen, a company that specializes in synthetic unstructured data, claims to have generated more than 100,000 unique 3D humans, as well as various assets, environments and motion sequences. The company’s customers can then integrate these assets to generate the exact data that they need, whether that be up-close faces or humans in action.
The result is “high quality,” “high fidelity” visual outputs that are rich in data, Gil Elbaz, Datagen’s co-founder and CTO, told Built In. “It really includes a lot of information that you would never be able to manually annotate — it would be too much work or too expensive.”
Synthetic Data vs. Real Data
Synthetic data is generated by algorithms that model real data. From there, synthetic data can be used to train AI and machine learning models, even simulating unprecedented situations to create a wider variety of training experiences. Real data only covers events that have already happened, offering a more limited perspective compared to synthetic data.
Synthetic data also enables teams to build large volumes of data in a short amount of time, with the ability to create more data whenever needed. Depending on the industry or topic, real data may be hard to find and collect at times. This makes it difficult for analysts to carry out research studies since they may not have enough data to verify their results.
At the same time, synthetic data may miss some of the outlier cases that occur in real data sets. It may help to combine a small amount of real data with synthetic data to strengthen the accuracy and reliability of a study.
Synthetic Data in Machine Learning
The concept of synthetic data is largely credited to Donald B. Rubin, a Harvard statistics professor, who discussed it in a 1993 paper as a way to preserve data privacy in statistical analyses. In the years since, major companies including Tesla and Microsoft have pushed the boundaries of what was previously thought possible with synthetic data and machine learning, prompting a new wave of industry adoption.
“It’s become best practice,” Colucci said. And there is big demand for it, particularly in the realm of AI and machine learning development, which requires immense amounts of data.
“You need lots of data to train models. And sometimes it’s just not feasible to collect that data in the real world,” Hann said. For instance, autonomous vehicles need to be trained on lots of visual data of different driving conditions and scenarios, all of which can be quite costly and time-consuming to gather in the real world.
Instead, companies can train the models on 3D renderings of those conditions and scenarios. “It helps to speed up the development of machine learning algorithms, because it allows you to create data that you otherwise didn’t have,” Hann added.
As synthetic data has evolved over the years, it has become an increasingly integral part of not just image-dependent spaces like self-driving cars and other computer vision applications, but also those that are dependent on more ordered, tabular data like banking and sales.
Synthetic Data Examples
Synthetic data’s capacity to offer data sets where existing ones are either too sensitive or small to be useful has attracted the attention of several industries. Here are some ways it’s being used.
How Is Synthetic Data Used?
- It’s used to train autonomous vehicles for various scenarios.
- It’s used in banking without compromising client information.
- It’s used in smart retail store operations.
- It’s used in medical imaging.
- It’s used in machine vision.
Training Autonomous Vehicles
Three-dimensional simulations of driving footage can be an effective way to train autonomous cars, trucks and robots. They can use procedurally generated streets full of cars, animals, pedestrians and more to better prepare for both common and rare scenarios while driving. Synthetic data can also be leveraged to improve safety within the vehicles themselves, providing training data to detect driver fatigue, distraction and other passenger behaviors.
Banking and Client Information
Companies in the banking and finance industry gather a lot of data on their customers, and much of it is personally identifiable information, or PII. Synthetic data makes it possible for them to still use their proprietary data in model training without compromising customers’ confidentiality. This can come in handy for everything from mortgage analytics to fraud detection.
Smart Retail Store Operations
In smart retail stores like Amazon Go and Nike Speed Shop, almost every experience is facilitated by an array of complex computer vision modeling. To successfully allow for conveniences like self-checkout and automatic payments, these models have to be sophisticated enough to handle all manner of image challenges, including object recognition and pose estimation. Synthetic data can cover even the rarest cases, ensuring that a store’s AI models operate at their best.
Medical Imaging
Health data sets can contain very personal information, and are often small when they pertain to things like rare diseases. Plus, medical imaging data from real life is hard to come by given the high cost of obtaining clinical annotations. But methods like GANs and variational autoencoders, another deep learning architecture commonly associated with synthetic data, can be a useful and affordable means of both protecting patient confidentiality and beefing up training data with more examples.
Machine Vision
In warehouses, synthetic data can be used in what is called machine vision, which helps industrial robots see and recognize their surroundings so they can perform more complex tasks. Whether it’s identifying packages of varying shapes and sizes or monitoring for spills and other unsafe conditions, synthetic data ensures that AI models in the manufacturing and logistics space are ready for anything.
Synthetic Data Advantages
Because of its versatility, there are many advantages to making and using synthetic data.
Synthetic Data Is Flexible and Able to Account for Rare Scenarios
Synthetic data is extremely flexible. “Because it is artificially made up data, it can be modified. It can be shaped and formed in specific ways that are most relevant to your use cases,” Hann, of Mostly AI, said. “You can create exactly what you actually want.”
Because of this, synthetic data is a handy tool for testing edge cases, or unique situations that are rarely captured by real data. It can be difficult to account for every possible use case with real data, which means that, when edge cases do occur, they can trip up an AI model that hasn’t been trained to handle them.
“You can, with code, specify rare examples — these long-tail failure cases. And then you can put that back into your training data, train your models, and fix those failures,” Infinity AI’s Colucci said. “In the real world, that could take six months to a year to find enough real-world examples of these rare edge cases.”
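To make the idea of specifying rare examples "with code" concrete, here is a minimal sketch. The scenario names and all of the frequencies are hypothetical, and the output is only a list of scenario descriptions that a simulator or renderer would still have to turn into actual training data.

```python
import random
from collections import Counter

CONDITIONS = ["clear day", "rain", "night", "pedestrian in fog"]

# Hypothetical real-world frequencies: the last scenario is a rare edge case.
NATURAL_WEIGHTS = [0.70, 0.19, 0.10, 0.01]

# Weights an engineer might choose to over-represent the failure case.
TARGETED_WEIGHTS = [0.30, 0.20, 0.20, 0.30]

def sample_scenarios(weights, n, rng):
    """Return n scenario descriptions drawn according to the given weights."""
    return rng.choices(CONDITIONS, weights=weights, k=n)

rng = random.Random(7)
natural = Counter(sample_scenarios(NATURAL_WEIGHTS, 1000, rng))
targeted = Counter(sample_scenarios(TARGETED_WEIGHTS, 1000, rng))

# How many foggy-pedestrian examples land in each 1,000-sample data set.
print(natural["pedestrian in fog"], targeted["pedestrian in fog"])
```

Changing one list of weights is enough to fill a data set with the long-tail case, which is the shortcut Colucci contrasts with waiting months for real-world examples.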
Synthetic Data Is Efficient
The creation of synthetic data is a lot less tedious and time-consuming than the regular collection and annotation of real-world data. In computer vision, for example, image data needs to be annotated, or labeled, with different metadata information. This includes things like depth maps, segmentation masks and bounding boxes — the squares or rectangles that surround identifiable objects like cars or animals.
With real data, that annotation work can take months and is often contracted to third-party labeling services where humans manually go through every single image frame and identify the things within them. “The whole process for collecting data, labeling it and curating [data] is just a massive operational overhead,” Colucci said. “It’s literally the biggest blocker to progress in machine learning.”
With synthetic data, annotations are generated automatically as the data is made. In other words, hundreds of generated videos that meet exact specifications can potentially be made in a matter of minutes, all labeled and ready to go.
“All of a sudden, you take data collection and data labeling — these two very manually intensive processes — and you replace them with code,” she continued. “It’s a huge timesaver.”
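A small sketch shows why the labels come for free. The generator below draws a random rectangle on a blank grid, and because the code chose where to put the rectangle, it can write out the bounding box at the same time. The grid size, the single "box" class and the label format are assumptions for illustration. Real pipelines render 3D scenes instead of grids of zeros and ones.

```python
import random

def make_labeled_image(width, height, rng):
    """Draw one filled rectangle on a blank grid and return (image, bounding-box label)."""
    box_w = rng.randint(2, width // 2)
    box_h = rng.randint(2, height // 2)
    left = rng.randint(0, width - box_w)
    top = rng.randint(0, height - box_h)

    image = [[0] * width for _ in range(height)]
    for y in range(top, top + box_h):
        for x in range(left, left + box_w):
            image[y][x] = 1

    # The annotation is known exactly because the generator placed the object itself.
    label = {
        "class": "box",
        "x_min": left,
        "y_min": top,
        "x_max": left + box_w - 1,
        "y_max": top + box_h - 1,
    }
    return image, label

rng = random.Random(0)
dataset = [make_labeled_image(16, 12, rng) for _ in range(100)]

image, label = dataset[0]
print(label)
for row in image:
    print("".join("#" if pixel else "." for pixel in row))
```

No person has to look at any of the 100 images to label them, and the labels cannot disagree with the pixels.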
Synthetic Data Is Privacy-Friendly
While it resembles real data, synthetic data ideally does not contain any traceable information about the actual data itself. It contains no personal or identifiable information about real people, meaning it can be used, transferred and manipulated within the bounds of existing laws and regulations in ways that are not possible with real data.
And as AI regulation continues to tighten across the United States, the European Union and the rest of the world, synthetic data is likely to play an even more “significant” role, according to Tonic’s Kamor.
“It’s going to get harder and harder to use real data. And there will need to be frameworks put in place for using synthetic data,” Kamor said. “There’s going to have to be definitions and standards for how you synthesize data so that you meet some privacy threshold. And people will use that for most day-to-day work.”
Synthetic Data Is Key for Diverse Training Data
Another big selling point of synthetic data is that it stands to rectify the long-standing diversity problem in AI training data sets. After all, AI systems are riddled with defects, from iPhones failing to discern between different Asian faces to AI crime prevention tools sending innocent people to jail.
These errors aren’t uncommon. They are a consequence of the homogeneous data so many AI systems are trained on, which mostly skews white and male, making these tools imprecise or even useless for anyone who isn’t taken into account in the data. Synthetic data offers a convenient solution by allowing teams to generate image data that is more diverse.
“[Synthetic data has] the ability to generate diverse and balanced data sets with precise control over specific parameters. This can lead to more robust and accurate computer vision models, especially when used in conjunction with real-world data,” Elbaz, of Datagen, said.
Synthetic Data Disadvantages
While synthetic data offers many upsides for data science professionals, this method is far from perfect and creates another set of challenges.
Synthetic Data Is Prone to Biases
Synthetic image generators require large data sets of tens of thousands of real images to train themselves to create photorealistic outputs. Those initial inputs serve as a foundation on which to build. If that foundation is incomplete or biased, the images generated could be just as ineffective as an AI system trained on biased real data.
Or worse, they could reinforce the biases they’re trying to do away with. A study at Arizona State University found that GANs tasked with generating faces of engineering professors both lightened the “skin color of non-white faces” and transformed “female facial features to be masculine.”
As the popular saying in AI goes: “garbage in, garbage out.” That goes for synthetic data as well.
Synthetic Data Is Not Always Accurate
Synthetic data is by no means a silver bullet to the challenges and inefficiencies of AI development. It’s not unheard of for the outputs of models trained on synthetic data to differ from those trained on real-world data, which is usually an indicator that the synthetic data is lacking in some way.
“It can be hard to know ahead of time, and it can require iteration. It’s challenging. And it gets more challenging the more complex the data set is,” Kamor said.
In fact, the inconsistencies that arise when trying to replicate the complexity of an original data set, and the resulting inability to fully replicate authentic data with synthetic data, are perhaps among this technology’s biggest drawbacks. “It’s very, very hard to develop this technology. It requires an enormous amount of effort,” Elbaz said.
Synthetic Data Can Be Time-Consuming and Expensive
When errors or inconsistent results arise in synthetic data, they can make the synthetic data development process just as tedious as analyzing real data. If the results from the two differ, teams may need to spend time reviewing both the synthetic data and the real data sets on which it was based. And if a synthetic data set is inherently flawed, teams may need to start over and find more reliable real data to work with.
Mistakes can pile up, leading companies to waste even more time and resources trying to create and refine synthetic data.
“If you don’t have an easy to implement solution for synthetic data, it becomes very expensive and costly to develop synthetic data yourself,” Elbaz said. “When people don’t have the resources needed to create high quality synthetic data, they hack together things that are half-ass or not fully fleshed out. And then there are different issues that arise with that.”
Synthetic Data Is Not Entirely Private
Privacy isn’t always a given when it comes to synthetic data, either.
“The biggest misconception is that synthetic equals private,” Kamor said. “Depending on the algorithm that you use for generating this synthetic data, your synthetic result set could be super private. Or it could be not private at all.”
For example, if a user has 100 rows of real data but they want 1,000 more rows of synthetic data to beef up the data set, an algorithm could just regenerate those same 100 rows in their original states 10 times. “That’s a synthetic data algorithm in a way, but it’s super not private. So the privacy guarantees you get are going to really be a function of your algorithm and how you’re doing these things,” Kamor continued.
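Kamor’s example is easy to reproduce. The sketch below measures what share of a synthetic table consists of verbatim copies of real rows. The function name `exact_copy_rate`, the two columns and every value in them are invented for illustration.

```python
import random

def exact_copy_rate(real_rows, synthetic_rows):
    """Fraction of synthetic rows that are verbatim copies of some real row."""
    real = set(real_rows)
    return sum(row in real for row in synthetic_rows) / len(synthetic_rows)

rng = random.Random(1)

# Invented stand-in for 100 real customer rows: (age, account balance).
real_rows = [(rng.randint(18, 90), rng.uniform(0, 50_000)) for _ in range(100)]

# The "algorithm" from the example above: repeat the 100 real rows 10 times.
duplicated = real_rows * 10

# Alternative: draw 1,000 fresh rows from the same ranges without reading any real row.
fresh = [(rng.randint(18, 90), rng.uniform(0, 50_000)) for _ in range(1000)]

print(exact_copy_rate(real_rows, duplicated))
print(exact_copy_rate(real_rows, fresh))
```

The duplicated table is made up entirely of real rows, while the freshly drawn table should contain none. An exact-match check like this is only a crude first test: a data set can pass it and still leak information, for instance through rows that are nearly identical to real ones.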
Will Synthetic Data Ever Replace Real Data?
Synthetic data will likely never replace real-world data completely, although it will almost certainly become more of a fixture in AI and machine learning development in the coming years.
“The future of synthetic data is incredibly promising. As the technology becomes more sophisticated, [I] expect to see a wider adoption across industries and applications,” Elbaz said, adding that it has the potential to “democratize AI development” going forward by making “high-quality training data more accessible and affordable for everyone.”
That being said, not all kinds of data can be synthesized. And the ones that can still need to be validated and controlled to ensure outputs are suited for the real world, not just the synthetic one in which the generated data exists.
Of course, this requires the implementation of real data. “You obviously want your models to perform well in the real world,” Colucci said. Looking ahead, she believes every machine learning model should be trained on 90 percent synthetic data, 10 percent real. “I think that’s going to happen very soon.”
In fact, research and consulting firm Gartner predicts that, by 2024, 60 percent of the data used for the development of AI and analytics projects will be synthetically generated.
“Will it be all the data? Probably not,” Hann said. “But, for many, many instances, it will just make sense to work with synthetic data in the future.”
Frequently Asked Questions
What is synthetic data vs. real data?
While real data is compiled from external environments and systems, synthetic data is generated by algorithms. Synthetic data can be molded to fit different situations and can be used to train AI models and produce more data, making it a more versatile alternative to real data.
What is a real-life example of synthetic data?
Synthetic data can be used to train autonomous vehicles to adapt to a variety of driving conditions. It can also generate rare or unforeseen scenarios to better prepare self-driving cars for reacting to more extreme circumstances and unique scenarios.
How is synthetic data made?
Synthetic data is generated by computer algorithms, which learn from real data to build synthetic data sets. As a result, synthetic data is not an exact replica of real data, but closely mimics the behavior of the original data. Synthetic data can then be used to train AI and machine learning models and create more synthetic data.