The Risk of a Lack of Data in the AI Industry

Thursday - November 16 2023, 12:02 UTC

tldr #

The AI industry risks running out of high-quality training data in the near future, which could slow the development of powerful AI models and even alter the trajectory of the AI revolution. Although this is cause for concern, there are several ways to address the issue, including making algorithms more data-efficient and using AI to create synthetic data.

content #

As artificial intelligence reaches the peak of its popularity, researchers have warned the industry might be running out of training data—the fuel that runs powerful AI systems. This could slow down the growth of AI models, especially large language models, and may even alter the trajectory of the AI revolution.

But why is a potential lack of data an issue, considering how much there is on the web? And is there a way to address the risk?

Why High-Quality Data Is Important for AI

We need a lot of data to train powerful, accurate, and high-quality AI algorithms. For instance, the algorithm powering ChatGPT was originally trained on 570 gigabytes of text data, or about 300 billion words. Similarly, the Stable Diffusion algorithm (which is behind many AI image-generating apps) was trained on the LAION-5B dataset, which comprises 5.8 billion image-text pairs. If an algorithm is trained on too little data, it will produce inaccurate or low-quality outputs.

The quality of the training data is also important. Low-quality data such as social media posts or blurry photographs are easy to source but aren’t sufficient to train high-performing AI models. Text taken from social media platforms might be biased or prejudiced, or may include disinformation or illegal content which could be replicated by the model. For example, when Microsoft tried to train its AI bot using Twitter content, it learned to produce racist and misogynistic outputs.

This is why AI developers seek out high-quality content such as text from books, online articles, scientific papers, Wikipedia, and certain filtered web content. The Google Assistant was trained on 11,000 romance novels taken from self-publishing site Smashwords to make it more conversational.

Do We Have Enough Data?

The AI industry has been training AI systems on ever-larger datasets, which is why we now have high-performing models such as ChatGPT or DALL-E 3. At the same time, research shows online data stocks are growing much more slowly than the datasets used to train AI. In a paper published in 2022, a group of researchers predicted we will run out of high-quality text data before 2026 if current AI training trends continue. They also estimated that low-quality language data will be exhausted sometime between 2030 and 2050, and low-quality image data between 2030 and 2060.
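
To make the logic of such forecasts concrete, here is a minimal sketch of how a slowly growing data stock can be overtaken by fast-growing training sets. It assumes simple compound annual growth on both sides. The function name and every number in it are hypothetical illustrations, not figures from the researchers' paper, and real projections are considerably more involved.

```python
import math


def years_until_exhaustion(stock, demand, stock_growth, demand_growth):
    """Years until training-data demand catches up with the available stock.

    Assumes both quantities grow at a constant compound annual rate
    (an illustrative simplification, not the published methodology).
    Returns 0.0 if demand already meets the stock, and None if demand
    grows no faster than the stock and so never catches up.
    """
    if demand >= stock:
        return 0.0
    if demand_growth <= stock_growth:
        return None
    # Solve demand * (1 + g_d)^t = stock * (1 + g_s)^t for t.
    ratio = (1 + demand_growth) / (1 + stock_growth)
    return math.log(stock / demand) / math.log(ratio)


# Hypothetical inputs: the stock is 100x today's largest training set,
# the stock grows 7% a year, and training sets grow 50% a year.
years = years_until_exhaustion(
    stock=100.0, demand=1.0, stock_growth=0.07, demand_growth=0.50
)
print(f"Demand reaches the stock after about {years:.1f} years")
```

The point of the sketch is that the crossover time depends mainly on the gap between the two growth rates: even a very large stock is used up quickly when demand compounds much faster.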

AI could contribute up to $15.7 trillion to the world economy by 2030, according to accounting and consulting group PwC. But running out of usable data could slow down its development.

Should We Be Worried?

While the above points might alarm some AI fans, the situation may not be as bad as it seems. There are many unknowns about how AI models will develop in the future, as well as a few ways to address the risk of data shortages. One opportunity is for AI developers to improve algorithms so they use the data they already have more efficiently. It’s likely that in the coming years they will be able to train high-performing AI systems using less data, and possibly less computational power. This would also help reduce AI’s carbon footprint. Another option is to use AI to create synthetic data to train systems. This could also help with the issue of biased and prejudiced data, as synthetic data isn’t generated from real-world sources.

hashtags #

ai data aithreats syntheticdata
