Can AI Teach Itself? Synthetic Data To Revolutionize AI
Synthetic data is on the rise, but not without its challenges. Learn about what generative AI needs to improve before synthetic data can go mainstream.

Back in January, we shared a piece about how synthetic data is on the rise. Six months later, it’s time for an update!
As a quick refresher: as AI models continue to learn and grow, they will eventually exhaust the supply of data that humans can produce. To keep getting smarter, AI models are going to need more data. At the same time, growing demands for security and privacy are driving demand for training and testing data that doesn't expose real people's information. That's where synthetic data comes in.
Synthetic data can be created by generative models that find underlying patterns in real-world data, then replicate those patterns in similar scenarios. This new data can then be used to train a different model. In theory, this means that AI could learn without human assistance. But, as always, the situation is a bit more complex than that. Let's dive into where synthetic data could shine, as well as where it might fall short.
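To make the "find patterns, then replicate them" idea concrete, here is a deliberately minimal sketch: fit a simple Gaussian to real tabular data and sample new rows from it. Real generative pipelines use far richer models (GANs, diffusion models, LLMs), so treat this as an illustration of the principle, not a production technique.

```python
import numpy as np

def generate_synthetic(real_data: np.ndarray, n_samples: int, seed: int = 0) -> np.ndarray:
    """Fit a multivariate Gaussian to real tabular data, then sample new rows.

    A toy stand-in for the generative models described in the article:
    it learns the columns' means and covariance, then replicates that
    pattern in fresh, never-before-seen records.
    """
    rng = np.random.default_rng(seed)
    mean = real_data.mean(axis=0)          # per-column means
    cov = np.cov(real_data, rowvar=False)  # covariance between columns
    return rng.multivariate_normal(mean, cov, size=n_samples)

# "Real" data: 500 rows of two correlated features.
rng = np.random.default_rng(42)
real = rng.multivariate_normal([10.0, 5.0], [[2.0, 1.2], [1.2, 2.0]], size=500)

# Generate twice as many synthetic rows as we had real ones.
synthetic = generate_synthetic(real, n_samples=1000)
print(synthetic.shape)  # (1000, 2)
```

The synthetic rows preserve the statistical shape of the original table (means, spread, correlations) without containing any actual record from it, which is exactly the privacy-preserving property discussed later in this piece.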
In 2022, even before ChatGPT was released, Forbes outlined what synthetic data is and where it might be beneficially used. They stated that knowledge is power and value, but collecting quality data is hard and expensive, a fact that still holds true today.
Although we now tend to associate synthetic data with generative AI, it was actually first widely used for autonomous vehicles, to generate data for rare cases that the vehicles might not encounter during their test runs but still need to be trained for. While autonomous vehicles have faded from the spotlight, plenty of AI models still need to be trained for long-tail cases, so this use of synthetic data remains relevant today.
Synthetic data startups were already beginning to pop up in 2022, and a Gartner report claimed that 60 percent of all data used for AI development would be synthetic by 2024. The future of synthetic data was looking bright! But where are we really today?
In May, Forbes released another, briefer article on synthetic data, addressing its relationship to generative AI. Their report is pretty optimistic, stating that while generative AI struggles with privacy risks, hallucinations and non-determinism, synthetic data protects privacy, reduces hallucinations via more diverse input data, and allows for greater refining of input data, thus improving accuracy. They claim that where the data comes from isn’t important to the model, but rather the quality and diversity of the data are what matter.
This perspective has a lot of truth to it. One of the greatest draws of synthetic data is the potential to train models on replicas of sensitive data without the ethical risks of accidental disclosure or misuse of private information, a huge concern for fields such as medicine and government. Additionally, the ability to generate mass amounts of data almost instantaneously could dramatically reduce the resources required to train models, thus accelerating the development and deployment of AI innovations.
Synthetic data, which can be in language, media or tabular format, fights the data scarcity issue and simultaneously makes mass data available to smaller companies that don’t have access to the same data collection tools that giants such as Facebook and Google do. Synthetic data can be cost-effective, easily scalable and used to fill in data gaps, as well as demonstrate diverse and rare cases. It is possible that synthetic data might even be able to extrapolate from real-world data to detect potential future patterns in areas such as fraud or electrical grid outages.
One budding application of synthetic data is Tencent AI Lab's Persona Hub, a collection of 1 billion personas curated from web data that represent a variety of perspectives. These personas can be used to create diverse synthetic data for various scenarios.
Right now, Persona Hub is demonstrating its ability to generate specific scenarios, using the formula "create [data] with [persona]." For example, "create [a math problem] with [a delivery truck driver]." However, if we think about the potential here, synthetic personas could one day augment polling, another form of data collection that is resource intensive and riddled with gaps.
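The "create [data] with [persona]" formula is simple enough to sketch as a prompt template. Persona Hub's actual interface isn't specified here, and the personas below are hypothetical examples, but the pattern might look like:

```python
def persona_prompt(data_type: str, persona: str) -> str:
    """Fill in the 'create [data] with [persona]' template described above."""
    return f"Create {data_type} from the perspective of {persona}."

# Hypothetical personas; Persona Hub itself curates roughly 1 billion from web data.
personas = ["a delivery truck driver", "a pediatric nurse", "a retired electrician"]
data_types = ["a math word problem", "a customer-support dialogue"]

# One prompt per (data type, persona) pair, to be sent to a generative model.
prompts = [persona_prompt(d, p) for d in data_types for p in personas]
print(len(prompts))  # 6
```

Crossing even a handful of data types with a billion personas is what gives this approach its scale: the diversity comes from the persona list, not from hand-written prompts.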
Of course, every new development must be looked at from all angles. As mentioned earlier, generative AI can tend to be inaccurate, make things up and replicate any biases in its training data. Since synthetic data is both generated by and then later used by AI models, it is possible that this process might amplify, rather than improve upon, these flaws.
According to The New York Times, OpenAI is investigating using two AI models to generate synthetic data, one to create it and one to judge it based on preset qualities. This tactic, however, brings us to synthetic data’s next potential drawback, which is the level of human involvement that will still be necessary.
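The Times doesn't detail OpenAI's setup, but the general generator-plus-judge pattern can be sketched generically. In this toy version, both models are stand-in functions, and the "preset quality" criterion is a placeholder score threshold:

```python
from typing import Callable

def filtered_synthetic_data(
    generate: Callable[[], str],
    judge: Callable[[str], float],
    n_wanted: int,
    threshold: float = 0.8,
    max_attempts: int = 10_000,
) -> list[str]:
    """Generate candidate samples and keep only those the judge scores highly."""
    kept: list[str] = []
    for _ in range(max_attempts):
        if len(kept) >= n_wanted:
            break
        candidate = generate()           # model 1: create a sample
        if judge(candidate) >= threshold:  # model 2: score it against preset qualities
            kept.append(candidate)
    return kept

# Toy stand-ins: the generator emits numbered samples, and the judge
# approves only even-numbered ones (a placeholder for real quality criteria).
counter = iter(range(1_000_000))
generate = lambda: f"sample-{next(counter)}"
judge = lambda s: 1.0 if int(s.split("-")[1]) % 2 == 0 else 0.0

batch = filtered_synthetic_data(generate, judge, n_wanted=5)
print(batch)  # ['sample-0', 'sample-2', 'sample-4', 'sample-6', 'sample-8']
```

The `max_attempts` cap matters in practice: if the judge rejects nearly everything, the loop terminates rather than running forever, which is itself a signal that the generator or the quality criteria need human attention.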
In an ideal world, we would be so confident in the accuracy and fairness of generative AI that no human oversight would be necessary, and AI could continue to grow smarter just by training itself. This increase in productivity would free up a lot of resources, which could then be devoted to other areas. A cool thought, but unfortunately not yet our reality.
Throughout the process of generating and using synthetic data, some human oversight will still be necessary for the time being, even if only to verify the final output. And while synthetic data eliminates the need for manual labeling, even spot-checking outputs by hand drastically limits how much synthetic data can feasibly be used.
According to a 2024 Gartner report, synthetic data is not yet being used to its fullest potential, and multiple testing methods are necessary to ensure the processes are effective. The report also notes that generating and managing synthetic data may currently consume more resources than we realize, potentially more than it saves, so each company should calculate its own ROI for using synthetic data. This aligns with the previously mentioned drawback: a lack of complete trust in the AI systems generating the data.
In order for synthetic data to become a ubiquitous feature of software training and testing, a greater level of confidence in the systems producing the data is necessary. However, with the rate at which we have seen AI develop and improve over the past two years, that point in time may be soon. One of the biggest issues in software development right now is obtaining and managing data, and while synthetic data has been in the works for some time, maybe the exponential growth of generative AI will be its ticket to the mainstream. It is likely that in the near future, the data market will expand to include synthetic data, and data management procedures will be shaken up completely.
Synthetic data holds immense potential to revolutionize AI by addressing data scarcity issues, reducing ethical concerns and accelerating innovation. However, achieving widespread adoption requires overcoming current challenges, such as ensuring data quality, mitigating biases and maintaining human oversight. As we continue to monitor the evolution of synthetic data, we remain optimistic that these hurdles will be addressed, paving the way for synthetic data to become an integral part of AI development.
And just a final note: the sudden availability of generative AI made many concerned about a potential future where AI is able to outsmart humans, and AI systems that train on their own without human assistance may exacerbate that fear. The question of whether generative AI can come up with truly novel ideas without any human-like real-world experiences will be something else to watch for as this journey continues.
Best,
Nina for the Don’t Count Us Out Yet Team