Online data has long been a valuable commodity. For years, Meta and Google have used data to target their online advertising. Netflix and Spotify have used it to recommend more movies and music. Political candidates have turned to data to figure out which groups of voters to train their sights on.
Over the past 18 months, it has become increasingly clear that digital data is also essential to the development of artificial intelligence. Here’s what to know.
The more data, the better.
The success of AI depends on data. That’s because AI models become more accurate and more humanlike with more data.
In the same way that a student learns by reading more books, essays and other information, large language models, the systems that underlie chatbots, become more accurate and powerful as they are fed more data.
Some large language models, such as OpenAI’s GPT-3, released in 2020, were trained on hundreds of billions of “tokens,” which are essentially words or pieces of words. More recent large language models have been trained on more than three trillion tokens.
Online data is a valuable and finite resource.
Tech companies are using up publicly available online data to build their AI models faster than new data is being produced. According to one prediction, high-quality digital data will be exhausted by 2026.
Tech companies will strive to get more data.
In the race for more data, OpenAI, Google and Meta are turning to new tools, changing their terms of service and engaging in internal debates.
At OpenAI, researchers created a program in 2021 that converted the audio of YouTube videos to text and then fed the transcripts into one of its AI models, in violation of YouTube’s terms of service, people with knowledge of the matter said.
(The New York Times has sued OpenAI and Microsoft for using copyrighted news articles without permission for AI development. OpenAI and Microsoft have said they used the news articles in innovative ways that do not violate copyright law.)
Google, which owns YouTube, also used YouTube data to build its AI models, a practice that fell into a legal gray area of copyright, people with knowledge of the action said. And Google changed its privacy policy last year so it could use publicly available material to build more of its AI products.
At Meta, executives and lawyers last year debated how to get more data for AI development and discussed buying a major publisher like Simon & Schuster. In private meetings, they weighed the possibility of putting copyrighted works into their AI model, even if it meant they would be sued later, according to recordings of the meetings obtained by The Times.
One solution could be ‘synthetic’ data.
OpenAI, Google and other companies are exploring using their AI to create more data. The result is so-called “synthetic” data. The idea is that AI models generate new text that can then be used to build better AI.
Synthetic data is risky because AI models can make mistakes. Relying on such data can compound those errors.