Why artists don’t like neuroart and how to solve it

Introduction

The explosion in the popularity of neural networks triggered a counter-wave of hostility from artists. Some time has passed, and now we can see that neural networks are not magic at all, and that they do not replace artists so much as complement them – artists are still in demand. We can see that this "magic button" is not entirely magical: it draws something unusual, sometimes beautiful, but using it to reproduce the image in your head is not so easy.

However, the hostility of artists remains significant, and it would be wrong to dismiss it as mere fear of competition, neo-Luddism, or a reluctance to learn.

What is the cause of this conflict, should it be resolved, and how? That is what this article is about.

What are the causes of the conflict

The basis of the conflict is in the txt2img paradigm

The usual scheme of interaction between a customer and an artist is as follows: the customer formulates the task in text or by voice, sometimes shows a few pictures as examples, and then receives the result in the form of an image.

txt2img models implement the same principle, but instead of a human artist, there is an AI artist.

But there is absolutely no place for a human artist here. By definition. Such a tool is, a priori, a tool AGAINST the human artist, not FOR him.

And to solve the problem, we must at least set a different goal – to create tools FOR artists, not AGAINST them.

But what does it mean?

Artists think in images

There are many different people in the world, and they are built a little differently inside. The heads of artists seem to be built differently from the heads of txt2img model creators. The artist is not just repelled by the idea that the drawing process can be replaced by writing text and adjusting dozens of sliders – that process is genuinely incomprehensible to him.

Really, why write all these words if the end result is an image? Isn't it better to imagine that image directly and create it without any words?

txt2img models always draw something of their own

txt2img models also have a specific problem that makes them not very useful for the artist. These models create something very cool, sometimes surprising – but is it really what the person wanted?

No – creating the image from your head is incredibly difficult. After generating many iterations and moving many sliders along the way, the final result is essentially always a compromise: "Yes, this is not exactly what I imagined, but if I had painted it with a brush, I would have done no better (or even much worse), and I would have spent more time." That is roughly how the txt2img user reasons.

It is important for a real artist to create exactly what he imagines – the image from his head, and quite accurately. For this, he needs significant control over generation: something that automates the routine while still realizing exactly his intent.

But why are tools created against artists?

The dominant approach now is txt2img: the text is primary.

But we know that over thousands of years a different approach to creating art has developed – drawing with a finger, a stick, a pencil, a brush. This traditional approach has proven its effectiveness, having been shaped and refined over those millennia. One may naively believe that all this accumulated experience will simply be erased by textual models, but there is no rigorous evidence for that. It is just faith.

If it were really natural for us to reduce every problem to working with text, none of us would have a mouse on the desk – a keyboard would suffice; and instead of touch screens, phones would have a huge keyboard, like the old BlackBerry smartphones.

We know we won this fight

Indeed, each of us has faced many situations in life when it is far easier to point a finger than to explain something in words.

So why is the text primary? Why is this idea being imposed on us so much?

In my opinion, it is simply the spirit of the times. Text models developed very rapidly and achieved significant results for a number of reasons, mostly technological – models for working with text are less resource-intensive. The people building text models then moved on to generating everything from text, including images, and this approach has now achieved significant results. As a consequence, these people are quite influential and widely heard.

But it seems that AI developers have become too carried away by the idea of creating anything from text, now that there are no longer strict technological constraints – hardware has become more powerful, algorithms more advanced, and a lot of money is allocated to such projects. It is time to move to some other paradigm.

And the point here is not only the artists' dissatisfaction, but the fact that practical use of txt2img models reveals their serious shortcomings.

The big practical problem of the txt2img paradigm

A hypothetical apologist for the txt2img approach might argue like this: "Well, OK, the artists suffered – let them die out as a class. Then there will be no problem!"

But! For all the hypothetical wishes of this speculative artist-hater, in practice we are as far from completely replacing artists as we are from the Moon.

First, note that for now you can only generate 2D images, but not video and not 3D (neither meshes, nor UVs, nor textures). There is no neuro-replacement for professional 3D artists and video makers at all.

Will effective txt2video and txt2-3d models appear? Is there any reason to expect them?

I already mentioned above that in practice the txt2img approach requires a huge number of iterations. Formally, it is enough to write a single prompt, but neuroartists – we know this – act differently: they generate a huge number of iterations, hundreds and thousands, to get just one final picture! That is practice, not theory.

Even the illustrations for this article took me hundreds of iterations – and this is just a small article, where the pictures are merely a background to dilute the text; and, of course, they still came out as compromises.

Only some of the iterations of the drawings for this article

Many iterations are a forced measure – a way to compensate for poor control with simple brute force. That worked for 2D images, but video and 3D generation add another dimension, and it will take an order of magnitude more iterations.

Brute force will not help in generating video and 3D! And it seems that this is a colossal problem.

To generate video and 3D, more advanced generation control tools are needed, and it is simply foolish to try to squeeze them into the Procrustean bed of the txt2img paradigm – that is an artificial limitation pulled out of thin air.

What about the new Sora from OpenAI? That is real video generation! Yes, it looks cool and seems like a big breakthrough. But in a practical sense, what has changed? To what extent do the presented videos realize the intent of their creator? Are they really the images from his head? Sora is closed and the details are not known, but it seems that the control here is minimal – just the prompt. For a professional director it is still useless, and poses no threat to his employment.

What to do?

First, let us state that the task of reducing the cost of content production is genuinely relevant. Electronic devices are spreading around the world, there are more and more services, more and more content needs to be produced, and its required quality keeps growing because hardware capabilities are increasing – yet the number of artists will not grow dramatically.

Second, to solve this problem it is important to think not about how to replace artists, but about how to make their work more efficient – by creating advanced tools that stay close to their traditional processes: drawing with a pen rather than writing text.

The basic principle of automation (as any automation specialist will tell you) is that you should automate routine, not decision-making. And there is plenty of routine in artists' work. For example, to create a 2D drawing you have to perform a huge number of similar pen strokes, and no single stroke is unique. To create a 3D model, you need to correctly lay out many polygons using a set of technically complex techniques. And so on. These are demanding tasks that require fine-tuned skills and brainpower, but the repetitive actions themselves are routine. That is what should be automated – not decision-making.

Possible Solution

General principles

Solutions to complex problems do not appear overnight. This is a complex evolutionary process. There is a lot of room for thought and discussion, and probably even for basic science. And philosophical essays have probably already been written on this topic.

But reasoning will not lead to results if you do not take action. It is important to identify a goal and move in its direction, starting from the opportunities that exist today.

So, the goal: create neuro-tools FOR artists, not AGAINST them, that will make their work more efficient.

When creating such tools, I believe it is important to be guided by the following principles:

  • Use the tools familiar to artists to control generation – working not with text, but with images, using a pen and mouse, not a keyboard.

  • Try to keep decision-making in human hands, but automate the routine.

  • Use a familiar working environment (Photoshop/Blender/Maya, etc.) rather than instant messengers and web applications. This simplifies working with the tool and lets it be integrated into existing production pipelines.

In my opinion, it is possible to act today, because some basis already exists.

Now I will move on to practice based on modern, available technologies. This transition may seem like too abrupt a descent from heaven to earth. But, once again, I believe it is important to act on the opportunities that exist, not just speculate. Only action, not reasoning, leads to results.

Basic technologies

Today, most txt2img solutions offer fairly modest capabilities for controlling generation through a process similar to drawing, namely (a minimal code sketch follows the list):

  • img2img mode, which generates an image based on some input image.

  • Inpaint mode, in which you can select an area for redrawing.

  • Sketching mode, where you can use a brush to roughly show what you want to draw.
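To make these modes concrete, here is a minimal sketch of img2img and inpainting via the Hugging Face diffusers library. The checkpoint names, file names, and parameter values are illustrative assumptions of this example (a CUDA GPU is assumed), not part of any particular product.

```python
# Sketch of img2img and inpaint modes with diffusers (illustrative values).
import torch
from diffusers import StableDiffusionImg2ImgPipeline, StableDiffusionInpaintPipeline
from PIL import Image

# img2img: a rough sketch constrains the composition, the prompt fills in detail.
img2img = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
sketch = Image.open("rough_sketch.png").convert("RGB").resize((512, 512))
result = img2img(
    prompt="a stone tower on a cliff, concept art",
    image=sketch,
    strength=0.6,        # how far the model may drift from the input image
    guidance_scale=7.5,
).images[0]

# inpaint: only the white area of the mask is redrawn, the rest is preserved.
inpaint = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")
mask = Image.open("mask.png").convert("RGB").resize((512, 512))
patched = inpaint(
    prompt="a wooden door with iron hinges",
    image=result,
    mask_image=mask,
).images[0]
patched.save("patched.png")
```

Even in this sketch the pattern is visible: the input image does some of the steering, but the prompt still carries most of the intent.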

However, there is a technology that provides much more possibilities – this is Stable Diffusion + ControlNet.

Yes, it is the free Stable Diffusion with the ControlNet extension – unlike all the paid solutions (even those that produce higher-quality images) – that lets you control generation with images, so-called masks, that is, through a process traditional for artists. It is exactly this approach that is critical as a basis for creating tools that help the artist rather than work against him.

Thus, ControlNet is not just another extension to a txt2img neural network; it marks the emergence of a content generation paradigm based on more than text alone.
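As an illustration of what "controlling generation with images" means in practice, here is a hedged sketch of Stable Diffusion + ControlNet through diffusers, using a Canny edge map extracted from the artist's line drawing as the control mask. The model identifiers, file names, and thresholds are assumptions for the example.

```python
# Sketch of image-driven control with Stable Diffusion + ControlNet (Canny variant).
# The line drawing, not the prompt, fixes the composition.
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Turn the artist's line drawing into a Canny edge map - the control mask.
drawing = np.array(Image.open("line_drawing.png").convert("RGB"))
edges = cv2.Canny(drawing, 100, 200)
control = Image.fromarray(np.stack([edges] * 3, axis=-1))

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# The prompt still exists, but the edge map dictates where everything goes.
image = pipe(
    prompt="an old lighthouse at sunset, oil painting",
    image=control,
    controlnet_conditioning_scale=1.0,
    num_inference_steps=30,
).images[0]
image.save("lighthouse.png")
```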

This approach is not yet perfect, and for some artists it still feels unnatural: when working with Stable Diffusion and ControlNet you still need a prompt, and you still have to move various sliders, because Stable Diffusion is at its core a txt2img model. Even so, the role of text and sliders here is greatly reduced, because to a large extent the result is determined by the input images.

Since its inception, ControlNet has already evolved significantly, gradually shifting its focus from text input to working with images. For example, the ControlNet IP-Adapter appeared not so long ago; it lets you work with reference images far more efficiently than before. Already, in its current implementation, the IP-Adapter is much more convenient than a text prompt – it lets you indicate much more precisely what result you want. Its main current drawback is that it does not replicate the reference exactly, but it is still far better than a prompt!
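Below is a hedged sketch of how an IP-Adapter reference image can stand in for most of a text prompt in diffusers. The repository and weight names follow the publicly available "h94/IP-Adapter" release, but the exact file names and scale value should be treated as assumptions of this example.

```python
# Sketch: steer generation with an IP-Adapter reference image instead of a long prompt.
import torch
from PIL import Image
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Load the IP-Adapter weights and set how strongly the reference drives the result.
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models",
                     weight_name="ip-adapter_sd15.bin")
pipe.set_ip_adapter_scale(0.7)

reference = Image.open("style_reference.png").convert("RGB")

# A very short prompt: the reference image carries the style and content.
image = pipe(
    prompt="a portrait",
    ip_adapter_image=reference,
    num_inference_steps=30,
).images[0]
image.save("portrait_from_reference.png")
```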

This is a very important trend, which I hope will, over time, lead to the emergence of very convenient tools for artists, if, of course, we purposefully strive to create them.

On the importance of Stable Diffusion for technology development

Here I cannot help but note what a right step for the development of the technology Stability AI's decision to make Stable Diffusion publicly available was. After all, ControlNet was not originally envisioned by the creators of Stable Diffusion; it appeared because many people gained access to the technology, took it apart into separate building blocks, climbed inside them, and learned to use them in rather unexpected ways.

As for proprietary solutions such as DALL·E and Midjourney: despite the improvements in image quality and resolution, they still offer nothing close to the same level of control. They are still making a tool not for artists, but against them.

And this fact – the active evolution of open Stable Diffusion towards controllability while large corporations ignore the trend – is, in my opinion, an important confirmation of the thesis that the txt2img approach is by no means the only correct one. That approach was born inside large corporations and academia, that is, among a relatively small group of people somewhat isolated from practical problems but equipped with significant financial resources. Once the technology moved beyond the corporations and into the hands of people with practical problems, those people began actively developing a different approach, one better suited to solving their problems.

User Experience is important

Having core technology such as Stable Diffusion+ControlNet is necessary, but not sufficient.

Artists need a tool, not a technology, and that is where User Experience is key. Creating control masks should look like the familiar process of working in a 2D/3D editor, in the environment where the user normally works. If an artist works in Photoshop/Maya/Blender, then the familiar environment is Photoshop/Maya/Blender respectively, and the tools should resemble the familiar tools of those editors. Such a process is easy to integrate into a traditional production pipeline. Constantly importing and exporting files between a 2D/3D editor and some website is a bad idea in terms of UX.

You don't have to look far for examples – take a look at Adobe Photoshop.

It would seem that Adobe's AI appeared much later than its competitors, when the market was already saturated; in generation quality Adobe is inferior to OpenAI and Midjourney, and Stable Diffusion is much stronger in customization, training, and control. Moreover, the Adobe product is not cheap. Yet its appearance shook up the market, caused a new wave of popularity for Photoshop, and a huge increase in Adobe's capitalization. The reason is excellent User Experience: you don't need to register anywhere, you don't need to open any websites, you don't need to import or export anything. Everything is right there, in a familiar environment.

Conclusion

So, to create a tool to help artists, two things are needed:

  • A basic technology that lets you control generation not with the keyboard, but with the mouse and pen. Such a technology already exists and is developing quite actively: Stable Diffusion + ControlNet.

  • A user-friendly tool that takes advantage of the underlying technology and is as close as possible to the tools and processes that artists and modelers use all the time.

I decided not just to reason but to act, and created my own neuro-texturing tool NeuralMaster – a free add-on for Blender. In developing it, I try to adhere as closely as possible to the approach proposed above: to create a tool for artists, not against them, as far as today's open base technologies allow.

Here is a small example of using the tool. Traditional methods of working in a 3D editor are used here – selecting polygons, moving the camera, drawing masks.
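To show how such a tool can reuse the editor's own workflow, here is a hypothetical Blender (bpy) snippet – not the actual NeuralMaster code – that renders a depth map of the current camera view, exactly the kind of mask that could then be fed to a depth ControlNet for texturing. Paths and node names are illustrative assumptions.

```python
# Hypothetical Blender helper: render the active camera view and save a
# normalized depth map to use as a ControlNet mask. Illustration only,
# not the NeuralMaster source code.
import bpy

scene = bpy.context.scene
view_layer = scene.view_layers[0]
view_layer.use_pass_z = True             # enable the depth (Z) pass

scene.use_nodes = True
tree = scene.node_tree
tree.nodes.clear()

render_layers = tree.nodes.new("CompositorNodeRLayers")
normalize = tree.nodes.new("CompositorNodeNormalize")    # map depth to 0..1
file_out = tree.nodes.new("CompositorNodeOutputFile")
file_out.base_path = "//controlnet_masks"
file_out.file_slots[0].path = "depth_"

tree.links.new(render_layers.outputs["Depth"], normalize.inputs[0])
tree.links.new(normalize.outputs[0], file_out.inputs[0])

bpy.ops.render.render(write_still=True)  # the depth map lands next to the .blend file
```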

In the next article I plan to dwell in more detail on the technical problems of neuro-texturing and possible ways to solve them.

The article uses neuroart created using SDXL and SD 1.5.
