Large language models (LLMs) depend on a steady stream of good data to learn and improve. And by many accounts, they are about to hit a wall, in part because they have already exhausted the supply of good data available on the internet (operative word: good). OpenAI is entering into deals with publishers, such as today's deal with The Financial Times, for precisely this reason.
But such deals can be costly. Speaking to the Financial Times before today's announcement, Aidan Gomez, CEO of the two-billion-dollar LLM developer Cohere, explained that data created by humans is getting expensive.
“Human-created data,” he said, “is extremely expensive.” He added: “If you could get all the data that you needed off the web, that would be fantastic … In reality, the web is so noisy and messy that it’s not really representative of the data that you want. The web just doesn’t do everything we need.”
Defining Synthetic Data
One solution to the messy, expensive, human-generated data quandary is synthetic data. While synthetic data isn't a new phenomenon — it has roots in statistical modeling and simulation techniques dating back to the late 19th and early 20th centuries — recent developments have pushed it to the fore.
With the rise of artificial intelligence and machine learning in the latter half of the 20th century, the need for large and diverse datasets became apparent. However, acquiring real-world data for training and testing machine learning models has become increasingly difficult as privacy concerns, data scarcity and data quality issues have grown more pronounced.
So, what is it exactly? AWS defines synthetic data as non-human-created data that mimics real-world data. Created using computing algorithms and simulations based on generative artificial intelligence technologies, it has the same mathematical properties as the actual data it is based on, but it does not contain any of the same information.
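To make that "same properties, none of the same information" idea concrete, here is a minimal sketch (an illustration, not an AWS product or method): it summarizes a sensitive numeric column and then samples entirely new values from that summary, so the synthetic values share the column's statistics without reproducing any original record.

```python
# A minimal sketch (not AWS's actual method) of generating synthetic values
# that preserve simple statistical properties of a real numeric column
# without copying any individual value.
import numpy as np

rng = np.random.default_rng(seed=42)

# Stand-in for a sensitive real-world column, e.g. transaction amounts.
real_amounts = np.array([12.5, 48.0, 33.7, 95.2, 27.4, 61.9, 18.3, 70.1])

# Summarize the real data, then sample fresh values from that summary.
mean, std = real_amounts.mean(), real_amounts.std()
synthetic_amounts = rng.normal(loc=mean, scale=std, size=1000)

print(f"real mean={mean:.1f}, synthetic mean={synthetic_amounts.mean():.1f}")
```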
While Amazon, like other vendors in the field, has not said where or in which products it uses synthetic data, it is reasonable to assume it has been exploring its use in model development.
The AWS definition lists several benefits of synthetic data, including:
- Data generation — Synthetic data can be created relatively cheaply and at virtually unlimited scale, then added to existing data lakes to provide more training data and better insights and outcomes.
- Privacy protection — In highly regulated industries such as finance, healthcare and legal, synthetic data can support research and analytics without exposing the actual customer or patient data those analyses would otherwise rely on.
- Bias reduction — Organizations can use it to reduce bias in AI model training. Many AI models are built on human-generated data, with synthetic data added to fill the gaps.
Synthetic Data Still Needs Human Oversight
Make no mistake, though: using synthetic data doesn't wipe out the potential for bias.
While synthetic data can be used in the creation of LLMs, human beings need to remain in the loop, said Jignesh M. Patel, Carnegie Mellon professor and co-founder of DataChat. If an LLM trains solely on its own results — a technique known as “self-improvement” — it is likely to reinforce its own biases.
Workaround techniques can help. The best known is the Jeopardy method, which involves giving the LLM an answer and then asking it to generate the corresponding question. “This twist on self-improvement reduces susceptibility to bias, but it does require a human trainer to provide the answers and validate the output question,” Patel said.
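As an illustration of how such a loop might look in practice, the rough sketch below generates question-answer pairs Jeopardy-style: a human supplies trusted answers, the model proposes the matching questions, and a reviewer approves each pair before it becomes training data. The ask_llm function is a hypothetical placeholder for whatever model API is actually in use.

```python
# A rough sketch of the Jeopardy-style workaround Patel describes. ask_llm()
# is a hypothetical stand-in for a real model call; the input() step is the
# human trainer validating each generated pair.
def ask_llm(prompt: str) -> str:
    raise NotImplementedError("wire this to your LLM of choice")

def jeopardy_pairs(trusted_answers: list[str]) -> list[dict]:
    pairs = []
    for answer in trusted_answers:
        question = ask_llm(
            "Given the following answer, write the question it responds to.\n"
            f"Answer: {answer}\nQuestion:"
        )
        # Human-in-the-loop: keep the pair only if a reviewer approves it.
        if input(f"Keep?  Q: {question}  A: {answer}  [y/n] ").lower() == "y":
            pairs.append({"question": question, "answer": answer})
    return pairs
```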
He pointed to another, more automated approach in which organizations use a mature, high-parameter LLM to effectively mentor a young LLM-in-training, reviewing and ranking its output the way a person would.
This is called Reinforcement Learning from AI Feedback (RLAIF). Alternatively, the mature LLM could summarize documents, books, films and other content before feeding them to the young LLM, saving computing power and human labor hours.
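In code, the mentor-and-student idea might look roughly like the simplified sketch below: a mature model scores several candidate responses from the model in training, and the resulting rankings become preference pairs for further training. The student_generate and mentor_score functions are hypothetical placeholders, not any vendor's API, and real RLAIF pipelines involve considerably more machinery.

```python
# A simplified sketch of the mentor/student ranking loop behind RLAIF.
# student_generate and mentor_score are hypothetical placeholders.
from itertools import combinations

def student_generate(prompt: str, n: int = 4) -> list[str]:
    raise NotImplementedError("sample n candidate responses from the student LLM")

def mentor_score(prompt: str, response: str) -> float:
    raise NotImplementedError("have the mature LLM rate the response, e.g. 0-10")

def build_preference_pairs(prompt: str) -> list[tuple[str, str]]:
    candidates = student_generate(prompt)
    ranked = sorted(candidates, key=lambda r: mentor_score(prompt, r), reverse=True)
    # Every (better, worse) pairing becomes a preference example for training.
    return [(a, b) for a, b in combinations(ranked, 2)]
```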
“The techniques I’ve described work best with specialized LLMs — the kind that might be used in a business to carry out a discrete set of workflows,” he said. “With higher-parameter, generalist LLMs like Gemini and GPT-4, the impacts of using synthetic data are harder to predict. They amaze us, in part, because they were trained with high-quality information, all vetted by thoughtful human beings.”
The Synthetic Data Benefit
All that said, synthetic data is proving to be a crucial part of LLM development, offering a viable alternative to real data without the privacy concerns, said Bob Brauer, founder and CEO of Interzoid.
The business value of synthetic data is clear, he said. Once generated, it enables limitless innovation while safeguarding customer privacy and meeting compliance requirements. For instance, it can train AI models without risking sensitive data exposure.
Additionally, synthetic data enhances efficiency and speeds up time-to-value when putting AI models into production. It overcomes common real-data challenges such as data inconsistencies, formatting issues and the time-intensive ETL/ELT processes typically required to harvest real-world data. It does this by embedding specific rules in the data generation algorithms so the output precisely matches the desired formats.
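A toy example of that rule-embedding idea: the generator below emits records that already conform to an invented target schema (fixed-width IDs, ISO dates and country codes), so no downstream cleanup is needed. The schema and field names are illustrative, not drawn from Interzoid or any particular product.

```python
# A toy illustration of embedding format rules in the generator itself:
# every synthetic record is emitted already conforming to the target schema,
# so no downstream ETL cleanup is required. Field names are invented.
import random
from datetime import date, timedelta

def synthetic_customer(i: int) -> dict:
    return {
        "customer_id": f"CUST-{i:06d}",                      # fixed-width ID
        "signup_date": (date(2023, 1, 1)
                        + timedelta(days=random.randint(0, 364))).isoformat(),
        "country": random.choice(["US", "GB", "DE", "JP"]),  # ISO country codes
        "monthly_spend": round(random.uniform(5.0, 500.0), 2),
    }

records = [synthetic_customer(i) for i in range(3)]
print(records)
```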
Synthetic data also facilitates the development of advanced, forward-looking models, he added. However, he too sounded a note of caution about its limitations.
Model performance depends on accurately replicating real data's complexity, he continued. Inaccuracies in synthetic data can therefore lead to underperformance or unexpected behavior when models are applied to real-world scenarios.
Synthetic data can also introduce unrealistic biases, overlook nuances and fail to reflect sudden, significant real-world events such as pandemics or natural disasters. These limitations can render models, and the investments behind them, obsolete.
“Thoughtfully generated synthetic data offers considerable advantages for training future AI models, provided its limitations and potential inaccuracies are carefully considered and managed in comparison to the use of real-world data,” he said.