Artificial Intelligence is Running Out of Data; Synthetic Data is on the Rise
From your data to synthetic data, how exactly does AI regurgitate so much information?
Photo Source: VentureBeat
With the introduction of new artificial intelligence tools, notably ChatGPT and Bard, people quickly adopted them into their daily lives with little to no questioning of how their personal data is consumed. Accustomed to social media platforms, most treat data privacy as an afterthought, one that rarely takes priority over convenience or entertainment.
Most people have a general sense of where AI gets its information: the internet. We won't get into specifics here; that's an article for another time. In short, chatbots draw on online sources that real people have created and contributed to. For this reason, you can never rely solely on what AI regurgitates, but that's not the only problem at hand.
At some point, AI tools will have consumed the maximum amount of data that human knowledge has been able to produce. In order to get smarter, AI will need to find new sources of data, and that begins with our data.
There are a multitude of reasons to use chatbots: asking questions, requesting an essay outline or brainstorming project ideas. Sometimes chatbots are used to summarize emails, documents and contracts. Providing information to a chatbot feeds it new data to learn from, and the follow-up questions you ask about that data give it even richer material to analyze and train on.
If people are already reliant on AI to complete tasks for work, what's stopping them from using AI as a personal assistant? Soon enough, AI tools may request access to personal emails, notes and files. Beyond that, people may grant AI-powered browser extensions the ability to track their search patterns.
As chatbots rapidly advance, Google has just released Gemini, a multimodal model that now powers its Bard chatbot. With multimodal features, AI can consume data from text, image, video and audio content. This takes data consumption to a new level, allowing AI to learn without us having to prompt it in writing.
Although AI tools already have access to our data for building new stores of knowledge, it doesn't end there. Eventually, AI will be able to self-learn with synthetic data: data that is artificially generated based on real-world samples. So far, AI has been trained to generate synthetic data in the form of images. These images are not real, but they depict real-world scenarios.
“Sam Altman, the C.E.O. of OpenAI, has said that synthetic data might also soon overtake the real thing in training runs for L.L.M.s,” according to The New Yorker. “The idea would be to have a GPT-esque model generate documents, conversations, and evaluations of how those conversations went, and then for another model—perhaps just a copy of the first—to ingest them.”
Large language models (LLMs) are deep learning models trained on enormous amounts of data. If these models start training on synthetic data, AI will be able to learn on its own, without the help of humans. But AI doesn't participate in real-world experiences, so it remains an open question how such models would generate completely new ideas.
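To make the loop Altman describes more concrete, here is a deliberately tiny sketch. A toy bigram "model" stands in for a GPT-esque LLM (real LLMs are vastly more complex): a "teacher" trains on human-written text, generates synthetic documents, and a fresh copy of the model then trains only on those synthetic documents. Everything here (the `TinyModel` class, the sample corpus) is hypothetical, invented purely for illustration.

```python
import random
from collections import defaultdict

random.seed(0)  # make the toy run reproducible

class TinyModel:
    """A toy bigram model standing in for a GPT-esque LLM."""
    def __init__(self):
        # maps each word to the list of words observed after it
        self.next_words = defaultdict(list)

    def train(self, documents):
        for doc in documents:
            words = doc.split()
            for a, b in zip(words, words[1:]):
                self.next_words[a].append(b)

    def generate(self, start, length=8):
        word, out = start, [start]
        for _ in range(length):
            options = self.next_words.get(word)
            if not options:
                break
            word = random.choice(options)
            out.append(word)
        return " ".join(out)

# Step 1: a "teacher" model trains on real, human-written text.
human_corpus = [
    "synthetic data is generated from real world samples",
    "real world data is what models learn from today",
]
teacher = TinyModel()
teacher.train(human_corpus)

# Step 2: the teacher generates synthetic documents.
synthetic_corpus = [teacher.generate("synthetic") for _ in range(5)]

# Step 3: a second model, a copy of the first, ingests only synthetic text.
student = TinyModel()
student.train(synthetic_corpus)
```

The caveat in the paragraph above shows up even in this toy: the student can only ever recombine patterns the teacher already knew, which is exactly why it is unclear where genuinely new ideas would come from.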
At Don’t Count Us Out Yet, our goal is to keep you informed on developments in artificial intelligence, and we believe the pool of human-generated data that AI tools currently draw on will reach its limit soon. AI models are already being trained on synthetic data and will keep finding new ways to gather and learn from our data. If you have information, private or professional, that you want to keep to yourself or research before making public, we recommend keeping it at arm's length from chatbots.
Best,
Ariana for the Don’t Count Us Out Yet Team