The race to lead AI has become a desperate search for the digital data needed to advance the technology. To get that data, tech companies including OpenAI, Google and Meta cut corners, flouting corporate rules and arguing for bending the law, according to an analysis by The New York Times.
At Meta, which owns Facebook and Instagram, managers, lawyers and engineers last year discussed buying the publishing house Simon & Schuster to acquire long-form works, according to recordings of internal meetings obtained by The Times. They also conferred on gathering copyrighted data from around the internet, even if that meant facing lawsuits. Negotiating licenses with publishers, artists, musicians and the news industry would take too long, they said.
Like OpenAI, Google transcribes YouTube videos to harvest text for its AI models, five people with knowledge of the company’s practices said. That practice potentially violates the copyrights to the videos, which belong to their creators.
Last year, Google also expanded its terms of service. One motivation for the change, according to members of the company’s privacy team and an internal message seen by The Times, was to allow Google to tap publicly available Google Docs, restaurant reviews on Google Maps and other online material for more of its AI products.
The companies’ actions illustrate how online information — news stories, works of fiction, message board posts, Wikipedia articles, computer programs, images, podcasts and movie clips — has increasingly become the lifeblood of the burgeoning AI industry. The creation of innovative systems depends on having enough data to teach the technologies to instantly produce text, images, sounds and videos similar to those created by a human.