We analyze how to use the open-source Wunjo AI in your artificial intelligence projects, and take a look at the neural networks it ships with.

At the time of writing, Wunjo AI is at version 1.6.

How to use Wunjo AI in your projects?

In this article and the accompanying video, we'll look at how to use Wunjo AI through Postman, as well as how to create a complete build, including installing drivers and the necessary tools. Detailed instructions for setting up the application, and the problems specific to Windows, are covered in the video; on Linux everything is much simpler and detailed instructions are available. To record the video I rented a Windows 10 machine, since I mostly work on Linux. The basic setup starts at 1:45 if you plan to run the project on your own machine. Renting a server or a gaming computer can still be useful if, like me, you have little video memory, for example for generating videos or changing videos with text.

So, let's start with the code. The first step after downloading the code from GitHub is to replace line 1348 with app.run(port=8000, host="0.0.0.0") so that the application starts as a regular Flask application on the selected port rather than via FlaskUI. To output information to the console, change line 1225 to if app.config['DEBUG']: (otherwise all information will only be available via GET requests to 127.0.0.1:8000/console_log). Then, to run in developer mode, install the dependencies and run briefcase dev.
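The console_log route mentioned above can also be polled from a script rather than a browser. A minimal sketch, assuming the server is already running locally on port 8000; the response format is not documented here, so we simply print the raw text:

import requests

# Read the application's processing log over HTTP.
response = requests.get("http://127.0.0.1:8000/console_log")
print(response.status_code)
print(response.text)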

The frontend part will be available at 127.0.0.1:8000 or, if you launched it on 0.0.0.0, at STATIC_IP:PORT from the Internet or the local network. The article and video show how to work with the application via Postman, create builds, and install drivers and programs.

If you have questions or need information about a specific neural network in the application, use the wladblog Telegram bot for easy communication and getting answers about Wunjo.

Working with Wunjo AI modules via Postman

Open the developer console via F12 and select a module, for example, mouth animation.

Mouth animation module

Made a request and received the address and body of the request

In Postman, specify the address 127.0.0.1:8000/synthesize_mouth_talk/, select POST with a JSON body, and use the request body:

{
    "face_fields": {
        "x": 372,
        "y": 77,
        "canvasWidth": 588,
        "canvasHeight": 330
    },
    "source_media": "video_1710318305261_eJstB",
    "driven_audio": "audio_1710318305262",
    "type_file": "video",
    "media_start": "0",
    "media_end": "10.262993",
    "emotion_label": null,
    "similar_coeff": "1.2"
}

Interesting parameters that I would highlight:

  1. media_start and media_end: Define the video time frame for trimming, for example, up to 1 minute if you use it in your web application.

  2. face_fields: Represents the point of the selected face together with the dimensions of the preview canvas on which the point was picked. This is needed (a) to know which face's mouth to animate and (b) to recover the actual coordinates of the point relative to the size of the video/image. You can specify the actual size of the media with the actual position of the point, or use canvas-relative coordinates, as in the example (see the conversion sketch after this list).

  3. similar_coeff: Determines the degree of similarity of the face in the previous frame with the face in the next one. This ratio is adjusted depending on the quality of the content and the size of the face. With high quality and a large face area, the coefficient should be increased in order to correctly detect the face in each frame. In case of low quality and small faces, the coefficient should be reduced to improve accuracy.

  4. type_file: Determines whether the file is a video or an image. The source_media and driven_audio parameters represent our video/images and audio, respectively.
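On point (2): the backend presumably rescales the point from the preview canvas to the real media resolution, so the equivalent conversion (my own helper, not part of Wunjo, shown for an illustrative 1280x720 source) looks roughly like this:

def canvas_to_media(x, y, canvas_w, canvas_h, media_w, media_h):
    # Convert a point picked on the preview canvas to actual media coordinates
    # by scaling with the ratio between canvas and media size (assumption based
    # on the face_fields description above).
    return x * media_w / canvas_w, y * media_h / canvas_h

# The example point (372, 77) on a 588x330 canvas, for a 1280x720 source video:
print(canvas_to_media(372, 77, 588, 330, 1280, 720))  # roughly (809.8, 168.0)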

Let's talk about source_media and driven_audio. Note that they do not contain an absolute path, only the file name: the files are placed in .wunjo/tmp before launch. If you need to use an absolute path or a different media folder, find the line os.path.join(TMP_FOLDER, something_name) in the code and replace it with the desired path; it is also possible to leave something_name as is if an absolute path is used.

In general, we place our sources in the directory .wunjo/tmp before launch.
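A small sketch of preparing the files from a script. I assume .wunjo lives in the user's home directory; the destination names below simply match the example request, and my_video.mp4 / my_audio.wav are illustrative local files:

import shutil
from pathlib import Path

# .wunjo is normally created in the user's home directory (assumption);
# adjust the path if your installation keeps it elsewhere.
tmp_dir = Path.home() / ".wunjo" / "tmp"
tmp_dir.mkdir(parents=True, exist_ok=True)

# Place the media under the names referenced in the request body.
shutil.copy("my_video.mp4", tmp_dir / "video_1710318305261_eJstB")
shutil.copy("my_audio.wav", tmp_dir / "audio_1710318305262")

# List what is already there, e.g. files saved by the frontend.
for item in sorted(tmp_dir.iterdir()):
    print(item.name)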

Sources in .wunjo/tmp

Open Postman, select the POST method, and paste the copied request body in JSON format.

Postman

A 200 status indicates that the request was executed successfully, while a different status may indicate that the processor is busy with other tasks.
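If you prefer a script to Postman, the same request can be sent with the Python requests library. A minimal sketch that reuses the example body above; the file names and coordinates are the ones from my screenshots and will differ in your setup:

import requests

BASE_URL = "http://127.0.0.1:8000"  # adjust if you changed the port in app.run()

payload = {
    "face_fields": {"x": 372, "y": 77, "canvasWidth": 588, "canvasHeight": 330},
    "source_media": "video_1710318305261_eJstB",  # already placed in .wunjo/tmp
    "driven_audio": "audio_1710318305262",        # already placed in .wunjo/tmp
    "type_file": "video",
    "media_start": "0",
    "media_end": "10.262993",
    "emotion_label": None,
    "similar_coeff": "1.2",
}

response = requests.post(f"{BASE_URL}/synthesize_mouth_talk/", json=payload)
# 200 means the request went through; another status usually means the
# processor is still busy with a previous task.
print(response.status_code)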

The same applies to any other module. Let's look at the requests:

Face deepfakes at 127.0.0.1:8000/synthesize_face_swap/ have the request body:

{
    "face_target_fields": {
        "x": 372,
        "y": 77,
        "canvasWidth": 588,
        "canvasHeight": 330
    },
    "target_content": "video_1710319409907_mzk96",
    "video_start_target": "1.0105791794310721",
    "video_end_target": "9.544358916849015",
    "type_file_target": "video",
    "face_source_fields": {
        "x": 111,
        "y": 103,
        "canvasWidth": 252,
        "canvasHeight": 397
    },
    "source_content": "image_1710319409908_Dxbmj",
    "video_current_time_source": 0,
    "video_end_source": 0,
    "type_file_source": "img",
    "multiface": true,
    "similarface": false,
    "similar_coeff": "2"
}

The parameters face_target_fields, target_content, video_start_target, video_end_target and type_file_target describe the file in which the face will be replaced, while face_source_fields, source_content, video_current_time_source, video_end_source and type_file_source describe the file the face is taken from. Here multiface means replacing all faces, and similarface indicates that very similar faces, for example twins, may appear in one frame.
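The call itself follows exactly the same pattern as before, so a small helper (the function name is my own, not part of Wunjo) can be reused for this and every other module:

import requests

def call_wunjo(module_path, payload, base_url="http://127.0.0.1:8000"):
    # POST a request body to one of the Wunjo AI module endpoints and
    # return the raw response for inspection.
    response = requests.post(f"{base_url}{module_path}", json=payload)
    print(module_path, "->", response.status_code)
    return response

# Example: send the face-swap body shown above, built as a Python dict
# or loaded from a JSON file.
# call_wunjo("/synthesize_face_swap/", face_swap_payload)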

Let's look at removing objects or text and extracting objects from a video onto a transparent background. The request is sent to 127.0.0.1:8000/synthesize_retouch/ with the request body:

{
    "source": "video_1710320112143_0tmqx",
    "source_start": "0.00",
    "source_end": "2.23",
    "source_type": "video",
    "model_type": "retouch_object",
    "mask_text": true,
    "mask_color": "transparent",
    "masks": {
        "1": {
            "start_time": 0,
            "end_time": 2.234,
            "point_list": [
                {
                    "canvasHeight": 330,
                    "canvasWidth": 588,
                    "color": "lightblue",
                    "x": 373,
                    "y": 37
                }
            ]
        },
        "2": {
            "start_time": 1,
            "end_time": 

2.234,
            "point_list": [
                {
                    "canvasHeight": 330,
                    "canvasWidth": 588,
                    "color": "lightblue",
                    "x": 257,
                    "y": 35
                },
                {
                    "canvasHeight": 330,
                    "canvasWidth": 588,
                    "color": "red",
                    "x": 255,
                    "y": 81
                }
            ]
        }
    },
    "blur": "10",
    "upscale": false,
    "segment_percentage": 25,
    "delay_mask": 0
}

Among the interesting things, pay attention to point_list, where color is lightblue or red. Lightblue marks the coordinates of a point that should be included in the feature for SAM, and red marks the coordinates of a point whose area should be excluded from the feature mask. Note also that each object can be given its own time frame for interaction. Here mask_text means that we are also working with text, which is detected automatically by a separate neural network, and mask_color set to transparent means that the selected objects and text are saved as separate images with a transparent background. model_type is the model we use: regular object removal or the improved one, which requires a larger amount of VRAM. The remaining parameters can be understood intuitively by running the application with different settings.
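To make the include/exclude convention explicit, here is a small sketch (the helper name is my own) that rebuilds the masks block from the example above:

def make_point(x, y, include, canvas_w=588, canvas_h=330):
    # "lightblue" points are fed to SAM as part of the object,
    # "red" points exclude an area from the resulting mask.
    return {
        "canvasWidth": canvas_w,
        "canvasHeight": canvas_h,
        "color": "lightblue" if include else "red",
        "x": x,
        "y": y,
    }

masks = {
    "1": {
        "start_time": 0,
        "end_time": 2.234,
        "point_list": [make_point(373, 37, include=True)],
    },
    "2": {
        "start_time": 1,   # each object can have its own time frame
        "end_time": 2.234,
        "point_list": [
            make_point(257, 35, include=True),
            make_point(255, 81, include=False),  # excluded area
        ],
    },
}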

Let's look at changing a video with text prompts. Requests are sent to 127.0.0.1:8000/synthesize_diffuser/ with the request body:

{
    "source": "video_1710320796223_0MOh9",
    "source_start": "0.00",
    "source_end": "2.06",
    "source_type": "video",
    "masks": {
        "1": {
            "start_time": 0,
            "end_time": 2.059,
            "point_list": [
                {
                    "canvasHeight": 330,
                    "canvasWidth": 588,
                    "color": "lightblue",
                    "x": 373,
                    "y": 39
                }
            ],
            "input_strength": "0.95",
            "input_seed": "0",
            "input_scale": "7.5",
            "prompt": "a superman",
            "n_prompt": "deformation"
        },
        "background": {
            "start_time": "0.00",
            "end_time": "2.06",
            "point_list": null,
            "input_strength": "0.7",
            "input_seed": "0",
            "input_scale": "7.5",
            "prompt": "a dog room",
            "n_prompt": ""
        }
    },
    "interval_generation": "20",
    "controlnet": "canny",
    "preprocessor_loose_cfattn": "loose_cfattn",
    "preprocessor_freeu": "freeu",
    "segment_percentage": 25,
    "thickness_mask": 10,
    "sd_model_name": "null"
}

Of interest, prompt and n_prompt (negative prompt) are tied to the Stable Diffusion 1.5 model and allow you to set prompts for each object or for the background. The interval_generation parameter determines the interval at which a new image is generated based on the current and previous ones, and sd_model_name is the name of the model; if null, the default is used. The remaining parameters can be explored on your own.

Let's look at the additional features, which you can explore on your own: video enhancement, face enhancement, audio separation, and speech enhancement. Requests are sent to 127.0.0.1:8000/synthesize_media_editor/ with the request body:

{
    "source": "image_1710321056462_fws2c",
    "gfpgan": "gfpgan",
    "animesgan": false,
    "realesrgan": false,
    "get_frames": false,
    "vocals": false,
    "residual": false,
    "voicefixer": false,
    "media_start": 0,
    "media_end": 0,
    "media_type": "img"
}

Here gfpgan, animesgan, realesrgan, get_frames, vocals, residual, and voicefixer are the processing methods to run.
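Judging by the example above, the selected method is passed as its own name while the rest stay false. A hedged sketch of an upscaling request built on that assumption; the file name and the media_type value are illustrative, so check the request your own UI produces in the console:

# Assumption: the enabled method is passed as its name, the others stay false,
# mirroring the gfpgan example above. File name and media_type are illustrative.
upscale_payload = {
    "source": "video_1710321056462_abc12",  # hypothetical file in .wunjo/tmp
    "gfpgan": False,
    "animesgan": False,
    "realesrgan": "realesrgan",
    "get_frames": False,
    "vocals": False,
    "residual": False,
    "voicefixer": False,
    "media_start": 0,
    "media_end": 0,
    "media_type": "video",
}

# Reusing the helper defined earlier:
# call_wunjo("/synthesize_media_editor/", upscale_payload)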

And we'll finish it all with speech synthesis and voice cloning. Requests are sent to 127.0.0.1:8000/synthesize_speech/ with the request body:

[
    {
        "text": "И закончим всё синтезом речи и клонированием голоса",
        "voice": [
            "Russian man"
        ],
        "rate": "1",
        "pitch": "1",
        "volume": "0",
        "auto_translation": false,
        "lang_translation": "ru",
        "use_voice_clone_on_audio": true,
        "rtvc_audio_clone_voice": "rtvc_audio_1710321345864_uMSb0"
    }
]

It is important to know that the voice parameter is responsible for selecting the voice for speech synthesis, and rtvc_audio_clone_voice is an audio file in the .wunjo/tmp folder that will be used for voice cloning. Voice cloning works best with audio files of 30 to 60 seconds; it is not recommended to exceed this range.

The use_voice_clone_on_audio parameter determines whether to apply voice cloning from the supplied audio file. To set the language of the text that you want to voice with the cloned speech, use the lang_translation parameter.
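Note that the body here is a JSON array rather than a single object, which presumably allows several fragments to be synthesized in one call. A minimal sketch of sending the example above from Python:

import requests

speech_payload = [
    {
        # "And we'll finish it all with speech synthesis and voice cloning"
        "text": "И закончим всё синтезом речи и клонированием голоса",
        "voice": ["Russian man"],
        "rate": "1",
        "pitch": "1",
        "volume": "0",
        "auto_translation": False,
        "lang_translation": "ru",
        "use_voice_clone_on_audio": True,
        # a 30-60 second reference recording already placed in .wunjo/tmp
        "rtvc_audio_clone_voice": "rtvc_audio_1710321345864_uMSb0",
    }
]

response = requests.post("http://127.0.0.1:8000/synthesize_speech/", json=speech_payload)
print(response.status_code)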

And finally, a few words about the general approach. I have described how to use POST/GET requests to the various Wunjo AI modules in your projects. It's important to note that you can create your own combinations of modules, for example combining face replacement with lip animation.

When new modules are added or existing ones change, the basic logic of obtaining the request body and address stays the same. By working with the parameters in the UI and tracking the requests in the developer console, you can easily integrate them into your project without having to interact with the frontend. This also opens up the possibility of creating mobile applications or Telegram bots.

Just neural networks

The application, in addition to face deepfakes and lip animation, features about 40 other neural networks for working with videos and images at the time of writing. A complete list can be found in the deepfake.json file on GitHub.

In addition, the application contains about 20 neural networks for working with sound. The full lists are available in the files rtvc.json and voice.json on GitHub.

I have described how the different neural networks work in my blog. These are a kind of notes, so as not to forget the experience and lose my own knowledge.

And one question remains: is my real voice in any of these videos? Anyway, save the project to your GitHub so you don't lose it, follow its development and the upcoming v2, and create forks to preserve the current changes. That's all for now! Until the next updates.
