These include MrBeast, MKBHD, and other top YouTube creators.
A recent investigation by Proof News has revealed that some of the world’s leading technology companies, including Apple, Nvidia, and Anthropic, have trained their AI models using transcripts from over 170,000 YouTube videos without obtaining consent from the creators.
The dataset, created by the nonprofit EleutherAI as part of a larger compilation known as “The Pile,” contains transcripts of YouTube videos from more than 48,000 channels and was used by Apple, Nvidia, and Anthropic, among other companies.
Proof News found that the dataset contains video transcripts from some of YouTube’s biggest creators, including MrBeast, Marques Brownlee, PewDiePie, Stephen Colbert, and Jimmy Kimmel, as well as from news publishers including The New York Times, the BBC, ABC News, and Engadget.
“Apple has sourced data for their AI from several companies,” Marques Brownlee (@MKBHD) posted on X on July 16, 2024, commenting on the discovery. “One of them scraped tons of data/transcripts from YouTube videos, including mine. Apple technically avoids ‘fault’ here because they’re not the ones scraping. But this is going to be an evolving problem for a long time.”
Companies that use YouTube’s data without permission are in direct violation of the platform’s terms of service. According to the report, Apple, Nvidia, Anthropic, and EleutherAI did not respond to a request for comment from Engadget.
These findings shine a spotlight on a glaring concern with AI: the technology is largely built on data taken from creators without their consent or compensation.
OpenAI CTO Mira Murati and Alphabet CEO Sundar Pichai have both avoided detailing whether their companies used YouTube videos to train their AI tools. Earlier this month, artists and photographers criticized Apple for failing to reveal the source of training data for Apple Intelligence.