OpenAI’s ChatGPT and Sam Altman are in massive trouble. OpenAI is being sued in the US for allegedly using content from the internet illegally to train its large language models (LLMs).
A class action lawsuit has been filed against OpenAI, the creator of ChatGPT, claiming that the company’s AI training methods violated the privacy and copyright of practically everyone who has ever shared content online.
OpenAI gathered an enormous amount of data from various sources on the internet to train its advanced AI language models. These datasets consist of a wide range of materials, such as Wikipedia articles, popular books, social media posts, and even explicit content from niche genres. More importantly, the lawsuit alleges, OpenAI acquired all this data without seeking permission from the content creators.
OpenAI in big trouble
The class action lawsuit, which has been filed in California, argues that OpenAI’s failure to adhere to proper protocols, including obtaining consent from content creators, amounts to outright data theft.
The lawsuit filing states, “Instead of following established procedures for the acquisition and usage of personal information, the Defendants resorted to theft. They systematically scraped 300 billion words from the internet, including ‘books, articles, websites, and posts,’ which also included personal information obtained without consent.”
How OpenAI stole your idea, your work, your creation
If you have been active online in recent decades, your digital contributions have likely been incorporated into OpenAI’s datasets. Consequently, any output that OpenAI’s language models generate for profit may contain fragments of your data obtained through silent scraping.
Ryan Clarkson, the managing partner at the law firm suing OpenAI, told The Washington Post that “all of that information is being taken at scale” even though it was never intended to be used by a large language model.
Is the class action lawsuit really a concern for OpenAI?
The outcome of the case in court remains uncertain, however. The internet’s infrastructure is complex, and the notion of a free and open web is often not entirely accurate. Online platforms have their own terms and agreements with users, and even when users contribute content to these platforms, the rights to use that content typically rest largely with the platform itself.
Katherine Gardner, an intellectual-property lawyer, noted that when users upload content to social media or any other site, they usually grant the platform a broad license to use their content in various ways. As a result, it would be challenging for ordinary users to claim entitlement to payment or compensation for the use of their data in training models.