How are AI companies surviving the “oil crisis”? How do media platforms make money doing almost nothing?

Imagine a world where artificial intelligence needs raw materials to exist and thrive, just as oil was a key resource for the Industrial Revolution. In this world, information becomes the new gold, and companies that own data and content become the new oil tycoons. This is the story of how AI companies are running into a data crisis while content distribution platforms make billions of dollars without doing anything.

Before we dive into the article, let me invite you to my Telegram channel “Hunt for technologies”. There, among exciting news and insights, I talk about technologies that are not only changing the world but also transforming business. Join our community to read, discuss, and discover the secrets of modern technology and its endless possibilities with us.

The emergence of ChatGPT and the rapid spread of Midjourney were real turning points in the development of artificial intelligence, marking the beginning of the era of large AI models.

Large models are complex machine learning systems with billions of parameters and multi-layered architectures, able to analyze and process gigantic amounts of data. Think of them as highly skilled analysts that can study millions of texts, images, and videos to produce results that come close to human understanding. For example, ChatGPT can generate text, answer questions, and even write poetry by drawing on the vast body of text it was trained on. Meanwhile, Midjourney creates stunning visuals from text descriptions, turning ideas into realistic images.

These models do what was previously thought impossible: they learn from huge and diverse data, allowing them to achieve outstanding results in pattern recognition, trend prediction, and automation of complex processes in real time and under constantly changing conditions. They are able to process data in the same way that humans process information, but with incredible speed and accuracy.

The first legal disputes over the use of data for AI

If you think of large AI models as cars, you can compare the raw data to crude oil. Just as cars require fuel, AI models require huge amounts of data to train and operate. This data serves as the primary “fuel” that allows the models to evolve, learn, and perform complex tasks.

The main sources of “crude oil” for AI companies include:

  • Open free data sources on the Internet: Wikipedia, blogs, forums, and news sites provide vast amounts of information that AI can use to learn.

  • Old news media and publishers: Historical data from newspapers and magazines helps models understand the context and evolution of language.

  • Universities and research institutions.

  • End users: Data collected from users, such as on social media or when using various applications, is also an important source of information for AI.

Unlike the oil market, where extraction and use rights are governed by strict legal norms, in the AI field, legal norms regarding the use of data are not yet fully formulated. This leads to numerous legal disputes and uncertainties.

Recent lawsuits illustrate such disputes. Major music labels have sued the AI-powered music companies Suno and Udio for copyright infringement. These allegations are similar to those in the lawsuit filed by The New York Times in December 2023 against Microsoft and OpenAI, which alleges that the paper’s materials were used to train AI without permission.

Additionally, in July 2023, a group of writers accused OpenAI of infringement, alleging that ChatGPT created summaries of their works based on copyrighted content. A class action lawsuit was also filed in California accusing OpenAI of collecting users' personal information without their consent to train ChatGPT.

OpenAI, for its part, denied the allegations, saying it did not consider the New York Times data relevant to its models and that it had been unable to reproduce the problems cited in the lawsuit. Still, it was an important lesson for OpenAI in properly managing relationships with data providers and defining the rights and responsibilities of the parties.

Over the past year, OpenAI has been actively pursuing partnerships with numerous data providers, including The Atlantic, Vox Media, News Corp, Reddit, the Financial Times, Le Monde, Prisa Media, Axel Springer, and the American Journalism Project. These collaborations allow OpenAI to legitimately use these outlets’ data and integrate their technology into its products, further developing and improving its AI models.

Profitable partnership

In addition to avoiding lawsuits, OpenAI and other AI companies are aggressively partnering with data providers for another important reason: a looming shortage of quality machine learning data. Research from MIT and other institutions predicts that the market could run out of available “quality language data” by 2026.

Quality data has become a key asset for giants like OpenAI and Google. It is necessary to train AI models that can perform complex tasks and achieve high levels of accuracy. Realizing this, content companies have begun to shift to a new strategy involving passive income from licensing their data.

Traditional media platforms like Shutterstock are increasingly striking deals with AI companies like Meta, Alphabet, Amazon, and Apple. In 2023, revenue from licensing content for AI models reached $104 million, and this figure is projected to grow to $250 million by 2027. Reddit already makes up to $60 million a year from licensing content to Google, while Apple offers news media at least $50 million a year in licensing fees. The growth rate of revenue from such deals is impressive, with an annual increase of 450%.
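To put the projection above in perspective, here is a rough back-of-the-envelope sketch of the compound annual growth implied by going from $104 million in 2023 to $250 million in 2027. It assumes both figures describe the same revenue stream and that growth compounds smoothly from year to year; the function name is mine, chosen purely for illustration. Under those assumptions the implied rate comes out at roughly 24.5% per year, so the 450% figure presumably measures something else (for example, a single year-over-year jump or a different set of deals), and this sketch does not try to reproduce it.

```python
def implied_annual_growth(start_value: float, end_value: float, years: int) -> float:
    """Return the compound annual growth rate that turns start_value
    into end_value over the given number of years.

    Illustrative helper only: it assumes smooth year-over-year compounding.
    """
    if start_value <= 0 or end_value <= 0 or years <= 0:
        raise ValueError("values and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


# Figures quoted in the article: $104M in 2023, projected $250M by 2027.
rate = implied_annual_growth(104, 250, 2027 - 2023)
print(f"Implied compound annual growth: {rate:.1%}")
```

The same helper can be pointed at any other pair of figures to sanity-check a growth claim before repeating it.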

In recent years, content outside of streaming has proven difficult to monetize, posing a significant challenge to the industry. The advent of AI, however, has given content companies a new, platform-driven source of revenue.

How OpenAI Solved the Problem of Low Data Quality

Not all content is suitable for modern AI systems, and data quality plays a key role in the successful training of models. Just as a refinery needs good crude to produce good fuel, AI requires high-quality data to operate.

The dispute between OpenAI and The New York Times turns on this question of quality. OpenAI argues that The New York Times content was not essential to its models, unlike content from Shutterstock, which is heavily used and generates significant licensing revenue. Text-based media like The New York Times rely on topicality, so their content may be less suitable for long-term AI training.

The lack of quality data has AI companies increasingly focusing on “cleaning technologies” and “general applications.” For example, OpenAI’s acquisition of Rockset on June 25 will improve real-time data processing. Rockset provides tools for analyzing and indexing data, which will enhance the capabilities of AI products such as recommendation engines and chatbots.

Rockset can thus be thought of as the “petrochemical department” of OpenAI, turning ordinary data into the high-quality data needed for AI systems to function effectively.

The problem of creators' rights to content: fantasy or reality?

Data from online platforms like Facebook and Reddit is largely generated by users. These platforms, which charge high fees to AI companies for access to the data, often include clauses in their terms of service that allow the data to be used to train AI models. However, many content creators are unaware of how their content is being used, are not compensated for it, and are unable to protect their rights.

At the Meta conference in February this year, Mark Zuckerberg confirmed that photos from Facebook and Instagram would be used to train AI tools. Meanwhile, Tumblr has reportedly already entered into a secret content licensing agreement with OpenAI and Midjourney, but the details of the deal have not been disclosed.

On the EyeEm platform, creators were recently notified that their photos would be used to train AI models, but there was no mention of compensation. EyeEm's parent company, Freepik, confirmed that they had struck deals to license the images, but the details remain confidential.

Similar problems are seen at other platforms, such as Getty Images, Adobe, Photobucket, Flickr, and Reddit. These platforms often ignore user rights, selling data to AI companies for significant sums. The entire process happens behind the scenes, and content creators rarely know how their content is being used or who is making money from it.

Web3 may offer a possible solution to these problems. Blockchain, thanks to its decentralized and immutable nature, can help protect creators' rights. Media content began migrating to the blockchain as early as 2021, and the transition of UGC (user-generated content) to Web3 platforms is just beginning. Many Web3 AI model platforms already reward users for contributing to model training, which could be a step towards a more equitable distribution of revenue.

An example of a Web3 project that rewards users for contributing to training AI models is Ocean Protocol.

Ocean Protocol

Ocean Protocol is a decentralized data exchange platform. Users upload their data to the Ocean Marketplace and set the terms of use, and they receive tokens when AI companies use that data; the tokens can then be exchanged for cryptocurrency. Ocean Protocol aims to ensure fair compensation for data and transparency in how it is used.


With the exponential development of AI models, the need to define data rights increases. Content creators should ask themselves: why are their works sold to AI companies for pennies without their consent and without receiving any income? Media platforms must find a balance between the interests of creators, platforms, and AI companies to ensure a fair distribution of data and income.

I hope you found this article fascinating and useful! There is still a lot of new material ahead about technologies that have changed the world and business. If you don’t want to miss it, I invite you to my channel “Hunting for Technologies”. I wish everyone well, and see you soon!
