This is Atlantic Intelligence, a newsletter in which our writers help you wrap your mind around artificial intelligence and a new machine age.
Earlier this week, The Atlantic published a new investigation by Alex Reisner into the data that are being used without permission to train generative-AI programs. In this case, dialogue from tens of thousands of movies and TV shows has been harvested by companies such as Apple, Anthropic, Meta, and Nvidia to develop large language models (or LLMs).
The data have a strange provenance: Rather than being pulled from scripts or books, the dialogue is taken from subtitle files that have been extracted from DVDs, Blu-ray discs, and internet streams. "Though this may seem like a strange source for AI-training data, subtitles are valuable because they're a raw form of written dialogue," Reisner writes. "They contain the rhythms and styles of spoken conversation and allow tech companies to expand generative AI's repertoire beyond academic texts, journalism, and novels, all of which have also been used to train these programs."
Perhaps it no longer comes as a major shock that creative humans are having their work ripped off to train machines that threaten to replace them. But evidence demonstrating exactly what data have been used, and for what purposes, is hard to come by, thanks to the secretive nature of these tech companies. "Now, at least, we know a bit more about who is caught in the machinery," Reisner writes. "What will the world decide they are owed?"
There's No Longer Any Doubt That Hollywood Writing Is Powering AI