The Fight Over AI Training Data: Scraping the Internet or Respecting Intellectual Property?

Category Artificial Intelligence

tldr #

Adobe's AI model, Firefly, is integrated into Photoshop and trained on explicitly licensed data, primarily from the company's stock photo library. This contrasts with many tech companies' practice of scraping the internet for training data, which has sparked controversy and legal battles over intellectual property and the potential inclusion of copyrighted and toxic content. Adobe believes the benefits of respecting intellectual property and developing AI responsibly outweigh the risks and challenges of this approach.


content #

Since the beginning of the generative AI boom, there has been a fight over how large AI models are trained. In one camp sit tech companies such as OpenAI that have claimed it is "impossible" to train AI without hoovering up copyrighted data from the internet. In the other camp are artists who argue that AI companies have taken their intellectual property without consent or compensation.

Adobe is unusual in siding with the latter group, and its approach stands out as an example of how generative AI products can be built without scraping copyrighted data from the internet. One year ago, Adobe released its image-generating model Firefly, which is integrated into its popular photo editing tool Photoshop.

In an exclusive interview with MIT Technology Review, Adobe’s AI leaders are adamant this is the only way forward. At stake is not just the livelihood of creators, they say, but our whole information ecosystem. What they have learned shows that building responsible tech doesn’t have to come at the cost of doing business.

"We worry that the industry, Silicon Valley in particular, does not pause to ask the ‘how’ or the ‘why.’ Just because you can build something doesn’t mean you should build it without consideration of the impact that you’re creating," says David Wadhwani, president of Adobe’s digital media business.

Adobe stands with artists in advocating for a license-based model for AI training data, where creators are compensated for their work.

It soon became clear that to offer creators proper credit and businesses legal certainty, the company could not build its models by scraping data from the web, Wadhwani says.

Adobe wants to reap the benefits of generative AI while still "recognizing that these are built on the back of human labor. And we have to figure out how to fairly compensate people for that labor now and in the future," says Ely Greenfield, Adobe’s chief technology officer for digital media.

Tech companies like OpenAI and Google are facing lawsuits over their use of scraped training data, with artists pushing for proper compensation.

To scrape or not to scrape

The scraping of online data, commonplace in AI, has recently become highly controversial. AI companies such as OpenAI, Stability AI, Meta, and Google are facing numerous lawsuits over AI training data. Tech companies argue that publicly available data is fair game. Writers and artists disagree and are pushing for a license-based model, where creators would get compensated for having their work included in training datasets.

Adobe trains Firefly using explicitly licensed data, primarily from its own stock photo library.

Adobe trained Firefly on content that had an explicit license allowing AI training, which means the bulk of the training data comes from Adobe’s library of stock photos, says Greenfield. The company offers creators extra compensation when material is used to train AI models, he adds.

This is in contrast to the status quo in AI today, where tech companies scrape the web indiscriminately and have a limited understanding of what the training data includes. Because of these practices, AI datasets inevitably include copyrighted content and personal data, and research has uncovered toxic content in them, such as child sexual abuse material.

By limiting its AI training data to licensed content, Adobe aims to avoid including unlicensed copyrighted material and personal data, as well as potentially toxic content.

Scraping the internet gives tech companies a cheap way to get lots of AI training data, and traditionally, having more data has allowed developers to build more powerful models. Limiting Firefly to licensed data for training was a risky bet, Adobe admits. But the company says it believes that the benefits outweigh the costs.

