The foundational years of large language model (LLM) training have been rife with unauthorized data scraping, which some would call infringement and others would call theft. Books, music, articles, and images have all been used for training, whether their copyright owners wanted that or not. But there are signs the momentum is turning in favor of copyright licensing. John Wiley & Sons, a forward-thinking publisher, recently announced a $23 million deal licensing books to an undisclosed technology company for artificial intelligence (AI) training. And Google announced a deal to use Reddit posts as training data for a whopping $60 million per year.
Unfortunately, deals like these are still the exception, not the rule. Lawsuits from musicians, authors, and artists’ groups are now in full swing, while some organizations quietly negotiate license agreements in the background.
Silicon Valley stars like OpenAI are prepared to take big risks. Remember Uber, which jumped into markets without regard for quaint taxi-related laws and regulations? OpenAI has taken a similarly aggressive path, training its LLMs fast on anything and everything it can get its hands on, without paying much heed to copyright owners. As a result, OpenAI has been the target of many of the most high-profile lawsuits.
Against that tide is a new breed of generative AI companies seeking to use training data licensed from authorized sources. Supporting this momentum is a certification nonprofit, Fairly Trained, which has already certified at least nine companies for using only properly authorized training data. (Those companies are Beatoven.AI, Boomy, BRIA AI, Endel, LifeScore, Rightsify, Somms.ai, Soundful, and Tuney.) Another startup, Calliope Networks, aims to facilitate large-scale licensing of books and podcast transcripts from authors and publishers to generative AI platforms.
What might motivate a generative AI company to use ethically licensed content when some of the industry’s leaders are getting it for free? According to Jim Golden, chief technology officer of Calliope Networks, there are at least two motivations at play. “First, many of these companies have downstream corporate customers who don’t want headaches with infringement suits and may not like the optics of being seen to be stealing from authors and artists. Second, licensed data makes better training data than scraped data with whatever irregularities and artifacts,” he said.
According to research from Microsoft, reliable content like textbooks enables faster training (less compute) with better (smarter) outputs. Responsible training will also likely reduce so-called “hallucinations,” in which “facts” are fabricated outright by an LLM. In at least one well-reported example, a lawyer was censured for including, in a legal brief, fictional cases that an LLM had hallucinated into existence.
Following a professional licensing process also offers an opportunity to separate the wheat from the chaff, rather than just gobbling up every bit and byte in sight. (It has been suggested that OpenAI’s training set includes thousands of retracted scientific articles!) It is the opposite of “garbage in, garbage out.”
It may take courts years to adjudicate the critical issues. And Silicon Valley’s muscle in Washington is strong. But the early licensing deals show that some major stakeholders believe licensing is required in many cases. Ethically sourced and licensed copyrighted content is here and growing. That will benefit artists, authors, and society as a whole. And, perversely, it will benefit AI companies, because their training data and outputs will be the richer for it.
Dave Davis is CEO and co-founder of Calliope Networks, which facilitates licenses from publishers and authors to AI companies.