How we created Visionatrix: simplifying ComfyUI

What this article includes:

  1. Description of ComfyUI problems and why we decided to create Visionatrix

  2. How we approached development and what we did first

  3. What happened in the end

I hope you find it interesting.

Here is a link to the project on GitHub.

ComfyUI problems

The ComfyUI interface is intended for creating workflows, not for using them

It is quite difficult for the average person who is not a programmer to understand what KSampler is, how the CFG parameters and latent noise value influence the result, and other nuances of diffusion.

We wanted to create something that the average person, familiar with a computer but not familiar with the intricacies of diffusion, could use.

Some might say there is A1111, but we wanted an even simpler alternative, and for some reason we decided to base it on ComfyUI, since it has a large community of workflow developers.

We also did not want the user to be greeted by a wall of input blocks that require detailed study when opening the program. Understanding that complex settings and blocks are difficult for the average user, we decided to provide flexible control over which parameters are displayed in the interface and which are hidden.

Since the ComfyUI backend can only execute workflows in API format, which does not preserve much of the original workflow's metadata, we had to write our own small optional nodes. Our Python backend parses them and generates a dynamic UI based on them.

We also left the option of not using special nodes and instead putting all the information into the node “names”, so that the backend can understand what to expose in the UI. The format looks something like this:

input;display name;flag1;flag2;...

For example:

input;Photo of a person;optional;advanced;order=2

Based on this, the backend generates a list of workflow input parameters that the UI should display. Of course, you can connect any other UI; we understand that ours may not suit everyone, and we try to keep everything modular.

Our interface relies on these flags: optional means that the parameter may or may not be set for the workflow, and advanced indicates whether the option is hidden by default, so that workflows do not overload the interface.

The average user just wants to generate an image based on a prompt or input and doesn't want to adjust the number of steps in the diffusion, right?

The parser was written quickly and kept quite simple; we tried not to overcomplicate the project or add unnecessary things. We ended up with the following options for the input parameters (a small parsing sketch follows the list):

  • input – a keyword by which the backend understands that this is an input parameter

  • display name – name of the parameter in the UI

  • optional – whether the parameter is optional; used for UI-side validation that all required parameters are filled in

  • advanced – flag determining whether to hide the parameter by default

  • order=1 – display order in the UI

  • custom_id=custom_name – the custom id of the input parameter, under which it will be sent to the backend
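
To make the format above more concrete, here is a minimal sketch of how such a node name could be parsed; the dataclass and function names are illustrative and not the actual Visionatrix parser.

from dataclasses import dataclass

@dataclass
class InputParam:
    display_name: str
    optional: bool = False
    advanced: bool = False
    order: int = 0
    custom_id: str | None = None

def parse_node_title(title: str) -> InputParam | None:
    parts = [p.strip() for p in title.split(";")]
    if len(parts) < 2 or parts[0] != "input":
        return None  # not an input parameter node
    param = InputParam(display_name=parts[1])
    for flag in parts[2:]:
        if flag == "optional":
            param.optional = True
        elif flag == "advanced":
            param.advanced = True
        elif flag.startswith("order="):
            param.order = int(flag.split("=", 1)[1])
        elif flag.startswith("custom_id="):
            param.custom_id = flag.split("=", 1)[1]
    return param

print(parse_node_title("input;Photo of a person;optional;advanced;order=2"))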

Results of workflows in ComfyUI

Here we encountered another problem: most nodes have outputs, but they do not need to be shown to the user, since they are more often used to debug the workflow.

Example: a workflow internally calls a VLM to obtain a description of the input image, then you do something with this text in the workflow and use the “ShowText” node to display the result.

We have not completely solved this problem, but it is on the TODO list for the near future.

To solve the problem of unnecessary outputs and make the interface more user-friendly, we decided that the results of SaveImage nodes, as well as VHS_VideoCombine nodes for video support, will always be displayed in the UI. This allows the user to focus only on the key results and reduces information overload.
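
As an illustration of this rule, here is a rough sketch (not the actual Visionatrix code) of how result nodes could be picked out of a workflow in ComfyUI API format, where the JSON maps node ids to objects with a "class_type" field:

RESULT_NODE_CLASSES = {"SaveImage", "VHS_VideoCombine"}

def result_node_ids(workflow: dict) -> list[str]:
    # Keep only the nodes whose outputs should be shown to the user.
    return [
        node_id
        for node_id, node in workflow.items()
        if node.get("class_type") in RESULT_NODE_CLASSES
    ]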

When we have time, we plan to add a debug flag for workflows. If it is passed when the task is created, we will collect all the outputs and display them in the UI as a separate drop-down list – this will be useful for debugging.

At the moment, we have simply added the ability to download the generated ComfyUI workflow from the UI; you can always load it into ComfyUI and see what is going wrong.

Generation history and multi-user support

ComfyUI does not properly support generation history or multiple users, but this was very important for us.

It’s convenient when a family has one powerful computer, you run the software on it, and each family member uses their own account without interfering with each other.

Installing workflows

ComfyUI has big problems with this. Most often you see a great workflow on openart.ai, but even after spending 30 minutes installing everything you need, you still can't run it, because you need to install 20 nodes and get 5 new models from somewhere.

In addition, a workflow installation may work today, but stop tomorrow because, for example, the model used in it has been deleted.

We partially solved this problem by creating a temporary model catalog in JSON format and writing a GitHub CI job that checks the availability of the models (they disappear quite often).
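
The idea behind the check is simple. Below is a hedged sketch, assuming a catalog that maps model names to objects with a download URL; the real catalog format and CI script differ.

import json
import httpx

with open("models_catalog.json", "r", encoding="utf-8") as fp:
    catalog = json.load(fp)

broken = []
for name, info in catalog.items():
    # A HEAD request is enough to see whether the file is still downloadable.
    response = httpx.head(info["url"], follow_redirects=True, timeout=30)
    if response.status_code != 200:
        broken.append(f"{name}: HTTP {response.status_code}")

if broken:
    print("\n".join(broken))
    raise SystemExit(1)  # fail the CI job if any model has disappeared
print("All models are reachable.")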

It looks like the ComfyUI frontend will soon have support for specifying hashes in model-loading nodes, and then we will get rid of our catalog, as we will be able to find a model on CivitAI by hash and write a parser for HuggingFace based on name and hash comparison.

Also, a new version of ComfyUI with support for the ComfyUI node registry will be released soon. True, installing nodes will not become much easier, and it will not solve the problem of Python packages with incompatible dependencies, but overall the situation will improve a little.

At Visionatrix, we created a “small workflow store” using GitHub Pages and branch versioning. Workflows also have a version and extended metadata.

It turned out simple and effective: you don't need to rent anything, everything is stored in the repository and built by GitHub CI.

Ease of installation and scalability

Our main requirement was that everything be easy to install and use.

The solution also had to be easy to scale.

Therefore, it was decided to do everything comprehensively: a one-click installation for home use uses SQLite, but if necessary, PostgreSQL is supported for scaling.

A single installation can operate in the standard mode, containing both the server part and the part that executes workflows (in ComfyUI these parts are inseparable). This can easily be changed by connecting one instance to another, in which case the connected instance becomes an external task executor working in worker mode.

We'll probably talk about how we implemented all this later, if anyone is interested. For now, perhaps it will be enough that in our free time we write and “generate” some semblance of documentation: https://visionatrix.github.io/VixFlowsDocs/

Why only a semblance? Because the project is developing rapidly, ComfyUI is developing in parallel, and the ComfyUI developers do not announce future changes much, so you need to be prepared for sudden, unexpected changes.

We're also seeing how quickly and dynamically the field of generative AI as a whole is evolving, and if you're pressed for time, you're better off focusing on the parts that won't change in the near future.

How we approached development and what we did first

Using LLM to Help with Development

Let me say up front that the author of this article has been programming for more than 10 years; he started with assembly language and Delphi at school, and then wrote in C/C++ for a long time.

The idea for Visionatrix was born in early 2024, when ChatGPT-4 was already available.

My colleague and I both work full-time in open-source and AI integrations, outside of diffusion and ComfyUI. After an 8-hour working day, there is little energy left, and I manage to devote 2-4 hours a day to a hobby project. Sometimes you can find some time on the weekend, but it’s still not enough for large projects.

Given the limited time and desire to speed up development, we decided to use generative AI to write the backend code. The main goal was for ChatGPT/Claude to work effectively with the most popular Python libraries such as httpx, FastAPI, Pydantic and SQLAlchemy.

It was also important that the project support both synchronous and asynchronous mode. Those familiar with Python will realize that this poses a significant challenge, as the need to support both modes greatly increases the complexity and size of the code.

Why is synchronous mode needed in addition to asynchronous? Because in the future, perhaps with the popularization of Python 3.13+, synchronous projects will get a second life; after all, real multithreading is real multithreading.

One of us wrote the backend, while the second mainly worked on the UI and did the testing. So far, AI has been used mainly for the backend.

The entire project was divided into files that are convenient to “feed” to the AI, namely:

Database:

  • database.py

  • db_queries.py

  • db_queries_async.py

Pydantic models are placed in a separate file, pydantic_models.py; ChatGPT really likes getting them as context.

All the logic for working with the database, the relationships between tasks, child tasks (yes, we did that too; it's convenient when the result of a task created from the current one can be seen without leaving the current workflow's page) and everything else is also placed in separate files:

  • tasks_engine.py

  • tasks_engine_async.py

This made it possible not to write the database functions ourselves; all we had to do was write the first 4-5 pieces, and then, when asking for the functionality to be extended, write something like:

“Please, based on the files database.py, pydantic_models.py, db_queries.py and tasks_engine.py, add the ability to give tasks a priority; add new or extend existing synchronous functions for this. Keep the writing style as in the original files.

pydantic_models.py:

file contents

db_queries.py:

file contents

And then we send the next request separately, but without pydantic_models.py and database.py; instead we just dump tasks_engine_async.py and db_queries_async.py into the same chat and write: “now let's do the same, but for the asynchronous implementations.”
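
To give an idea of what such a request produces, here is a hedged sketch of the paired synchronous/asynchronous style split across db_queries.py and db_queries_async.py; the Task model and the priority field here are hypothetical, not the actual Visionatrix schema.

from sqlalchemy import update
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy.orm import DeclarativeBase, Mapped, Session, mapped_column

class Base(DeclarativeBase):
    pass

class Task(Base):
    __tablename__ = "tasks"
    id: Mapped[int] = mapped_column(primary_key=True)
    priority: Mapped[int] = mapped_column(default=0)

def set_task_priority(session: Session, task_id: int, priority: int) -> None:
    # Synchronous variant (db_queries.py style).
    session.execute(update(Task).where(Task.id == task_id).values(priority=priority))
    session.commit()

async def set_task_priority_async(session: AsyncSession, task_id: int, priority: int) -> None:
    # Asynchronous twin (db_queries_async.py style): same logic, awaited calls.
    await session.execute(update(Task).where(Task.id == task_id).values(priority=priority))
    await session.commit()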

Although ChatGPT-4 initially required a lot of manual editing, since the beginning of the summer the models have become smart enough to produce much higher quality code that requires fewer edits.

This fall, this approach made it possible to add new functionality with much less effort, even after work, tired, lying on the couch: you simply ask for what you need and then, if necessary, ask for it to be redone.

Now that o1-preview is available, the situation has become both better and worse at the same time. How so?

Prompts for o1-preview are much more difficult to compose, but with a good prompt it produces much better code than ChatGPT-4o. The results, however, are much harder to evaluate, since the code generated by o1-preview is more complicated: it resembles code written quickly by a person and then immediately rewritten with optimizations.

But in general, the approach remained the same: a clear separation of functionality into files, so that it is as convenient as possible to feed code to the AI and then evaluate the result.

LLMs also really like it when the code is fully typed, and this allows you to quickly catch errors with utilities such as ruff or pylint.

Add pre-commit on top of that, and this is why we love Python: many errors can be caught without even running the code.

Using LLM to Write Tests

Initially, we were not going to write tests at all, since this was a hobby project and there was no time for them.

But then the situation changed a little, because we wanted to measure performance across versions of PyTorch and of ComfyUI itself, across the parameters with which all this can be run, and to test it all on different hardware.

Initially, we did this manually a couple of times, but after several attempts it became clear that this could not continue.

This happened already this fall, during the time of ChatGPT-4o and o1-preview.

We had files with the backend sources, openapi.json, access to an LLM and an incredible desire to have tests…

We gave all of this to the LLM and, having explained the situation, received what we expected: almost working benchmarking code on the first try.

For those interested, almost the entire file was written by ChatGPT:

https://github.com/Visionatrix/VixFlowsDocs/blob/main/scripts/benchmarks/benchmark.py

What interesting things did we discover along the way?

We realized that it took more than 5 requests to the LLM to get working code.

Maybe this is due to the shortcomings of LLMs? Perhaps, but perhaps the reason is that the backend documentation (openapi.json) and the backend algorithms themselves are not good enough.

An LLM can be thought of as a developer who went to the repository and tried to use the project's API. If they don't succeed on the first, second or third try, they move on; it's easier to look for other software to write an integration for.

And this is an indicator of how easy it will be for other developers to use your product.

Although the tests were eventually written properly with the help of AI, we rented a couple of instances with different video cards, such as a 4090, an A100 and an H100, and ran the draft on them. To write the tests, I had to explain to the LLM that, to create a task from a ComfyUI workflow, you need to pass all the arguments and all the files one by one as form parameters, which endpoint to call to get the task status, and how to get the results.

All of this turned out to be quite unobvious to the LLM on the first try (and, apparently, would not be obvious to a person either) and requires further improvements and refinements.
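
For illustration, here is a rough sketch of the flow described above, written against a local instance; the base URL, endpoint paths, parameter names and response fields are hypothetical, and the real ones live in the benchmark.py linked above and in openapi.json.

import time
import httpx

BASE_URL = "http://127.0.0.1:8288"  # assumed address of a local instance

def create_task(workflow_name: str, params: dict, files: dict) -> int:
    # Every workflow input goes one by one as a form field, files as multipart uploads.
    form = {"name": workflow_name, **{k: str(v) for k, v in params.items()}}
    uploads = {k: open(path, "rb") for k, path in files.items()}
    r = httpx.post(f"{BASE_URL}/api/tasks/create", data=form, files=uploads)
    r.raise_for_status()
    return r.json()["task_id"]  # hypothetical response field

def wait_for_task(task_id: int, poll_interval: float = 2.0) -> dict:
    # Poll the status endpoint until the task reports completion or an error.
    while True:
        r = httpx.get(f"{BASE_URL}/api/tasks/progress", params={"task_id": task_id})
        r.raise_for_status()
        task = r.json()
        if task.get("error"):
            raise RuntimeError(task["error"])
        if task.get("progress", 0) >= 100:
            return task
        time.sleep(poll_interval)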

The main thing we found out is that you can check how clearly the project and its OpenAPI documentation are put together by using an LLM: ask it to write a test script against the API, or just a UI on Gradio.

The fewer questions you have to answer for the AI, the better you did, and this will also reduce the time needed to support the project in the future, since you can outsource more tasks to your AI “juniors” and speed up the development process, freeing yourself from routine.

What happened in the end

We have created a solution that is easy to install on both Linux and macOS, and also works on Windows.

It uses a database to save the history of tasks that the user creates from workflows.

It supports both SQLite and PostgreSQL databases.

Connecting one instance to another, whether installed locally or remotely, is also supported.

Since it's all built on ComfyUI, most ComfyUI features are supported.

Implemented support for translation of prompts using Gemini and Ollama.

A small script has been written for benchmarking, which is also used for tests.

A small script has been written for GitHub, which checks whether everything is in order with the models used in workflows.

What is the project missing?

  • Prompt extension: for example, the drawing model does not know who the Witcher Geralt of Rivia is; an LLM should transform and simplify the prompt to get a better result.

  • A more convenient API for creating tasks, as well as OpenAPI specifications for each flow – this will make it much easier to write integrations for the project, as well as to create Gradio-based demos with a single request to an LLM.

  • Automatic generation of documentation and examples for workflows, with the help of the same LLM.

  • Development of a universal input method in chat format, where, based on some result, you ask to redo something you don't like, perhaps with words, perhaps by selecting objects, after which the LLM itself must choose which workflows to use to achieve the result (maximum ease of use, for when you don't initially know what you want yourself).

Conclusion

The development of Visionatrix was an interesting experience for us in creating an add-on for ComfyUI, aimed at simplifying use and expanding capabilities for ordinary users. We continue to actively develop the project and will be glad if the community is interested in it.

If you have ideas, suggestions or would like to contribute, we welcome your participation. The project is open to contributors and any help is welcome. Also, if you liked the project, don't forget to put ⭐️ on GitHub – we will be very pleased.

If you are interested in any of the topics covered in the article, let me know in the comments.
