How video generation works in the open source project Wunjo CE

Getting to know the functionality

If you want to see the full functionality, an explanation of the parameters, and instructions on how to use the application, there is a video at the end of the article. If you do not want to delve into technical details, the same video shows how to use video generation and compares the results with Pika and Gen-2.

Anatole France wrote: “You can only learn joyfully… To digest knowledge, you must absorb it with appetite.”

These words capture the idea that everyone will find something valuable in this material, and some details can safely be skipped.

Specifications for video generation

Video can be generated from text and from images using Stable Diffusion models. You can add custom 1.5 models and also extend the code to XL models. However, in my opinion, this does not make much sense: the video generation model will not preserve the quality that Stable Diffusion XL models can achieve, while requiring more power.

Generation parameters

  • FPS: The maximum value is 24 FPS, which gives up to 4 seconds of video. Decreasing the FPS increases the length of the video (see the arithmetic sketch after this list).

  • VRAM: Video generation is possible on 8 GB of VRAM.

  • Formats: Popular aspect ratios are available for generation from text: 16:9 (YouTube), 9:16 (Shorts), 1:1, 5:2, 4:5, 4:3. For generation from an image, the original aspect ratio is preserved.
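Why lowering the FPS lengthens the clip: the pipeline produces a roughly fixed budget of frames, so the same frames simply play back for longer. A minimal sketch of the arithmetic, where the total frame count of 96 is my assumption based on the 4-second limit at 24 FPS:

# Rough arithmetic only; TOTAL_FRAMES = 96 is an assumption (4 s at 24 FPS),
# not a value taken from the Wunjo CE code.
TOTAL_FRAMES = 96

for fps in (24, 12, 8):
    print(f"{fps} FPS -> {TOTAL_FRAMES / fps:.1f} seconds of video")
# 24 FPS -> 4.0 s, 12 FPS -> 8.0 s, 8 FPS -> 12.0 s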

Generate video from text

First, an image is generated in the format you choose. You can see the image, regenerate it, or change details before creating the video.

Generate video from image

The aspect ratio of the original image is used, but you can fine-tune its elements before generating the video.

This reminds me of something

Manual setup

If the automatic download of the necessary repositories fails, you can set everything up manually. You will need about 85-90 GB of space on your hard drive; you can choose where the models are stored yourself, with more details in the documentation.

The program downloads all repositories and models automatically, but downloads from Hugging Face can fail without a VPN. In that case, go to the .wunjo/all_models directory.

Download manually

  1. Runwayml: This repository includes the necessary models such as vae, safety_checker, text_encoder, unet and others. Create a runwayml directory and download the models into the appropriate folders.

You can also use console commands to automate this process. Before cloning, make sure you can download large files by installing Git LFS:

git lfs install

git clone https://huggingface.co/runwayml/stable-diffusion-v1-5
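If git is inconvenient, the same download can be scripted from Python with huggingface_hub. This is only a sketch under assumptions: that huggingface_hub is installed and that ~/.wunjo/all_models/runwayml is where your installation keeps the models.

# Sketch: download the runwayml repository with huggingface_hub.
# The target path is an assumption about the .wunjo layout; adjust it to
# wherever your installation stores models.
from pathlib import Path
from huggingface_hub import snapshot_download

target = Path.home() / ".wunjo" / "all_models" / "runwayml"
snapshot_download(
    repo_id="runwayml/stable-diffusion-v1-5",
    local_dir=target,
)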

Downloading Custom Stable Diffusion Models

You can download various Stable Diffusion 1.5 models into the .wunjo/all_models/diffusion directory; they will be used to generate images. Such models are publicly available, for example on Hugging Face or Civitai.

Setting up custom_diffusion.json

In the file .wunjo/all_models/custom_diffusion.json, specify the paths to your models. Example configuration:

[
    {
        "name": "Fantasy World",
        "model": "fantasyWorld_v10.safetensors",
        "img": "https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/2e321e71-c144-4ba6-8b55-3f62756fc0a1/width=1024,quality=90/01828-3669354034-giant%20pink%20balloon,cities%20floating%20on%20balloons,magnificent%20clouds,pink%20theme,.jpeg",
        "url": ""
    }
]

If you specify the url, then the model is downloaded automatically.

This step can be skipped: two Stable Diffusion 1.5 models are included by default, which is enough for creating both realistic and hand-drawn content.
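For illustration, here is a rough sketch of how such a config could be consumed. The field handling and paths are my assumptions, not the project's actual loader:

# Hypothetical reader for custom_diffusion.json; a sketch of the idea only.
import json
from pathlib import Path
from urllib.request import urlretrieve

models_dir = Path.home() / ".wunjo" / "all_models"  # assumed location
entries = json.loads((models_dir / "custom_diffusion.json").read_text())

for entry in entries:
    weights = models_dir / "diffusion" / entry["model"]
    if not weights.exists() and entry.get("url"):
        urlretrieve(entry["url"], weights)  # "url" set: fetch the weights automatically
    print(entry["name"], "->", weights)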

Switching to the diffusers library

Previously, the code worked only with the necessary parts of the generative models, which saved time and reduced the amount of unnecessary code. However, to expand the functionality, I decided to switch to the diffusers library. Specifically, the runwayml repository is used to generate images.

Example code from the project:

# Imports needed for the snippet below
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel
from diffusers.pipelines.stable_diffusion.safety_checker import StableDiffusionSafetyChecker
from transformers import CLIPImageProcessor, CLIPTextModel, CLIPTokenizer
from safetensors.torch import load_file

# Define the components of the runwayml model (sd_path is the path to the downloaded repository)
vae = AutoencoderKL.from_pretrained(sd_path, subfolder="vae", torch_dtype=torch.float16)
text_encoder = CLIPTextModel.from_pretrained(sd_path, subfolder="text_encoder", torch_dtype=torch.float16)
tokenizer = CLIPTokenizer.from_pretrained(sd_path, subfolder="tokenizer")
unet = UNet2DConditionModel.from_pretrained(sd_path, subfolder="unet", torch_dtype=torch.float16)
safety_checker = StableDiffusionSafetyChecker.from_pretrained(sd_path, subfolder="safety_checker", torch_dtype=torch.float16)  # filters NSFW content
feature_extractor = CLIPImageProcessor.from_pretrained(sd_path, subfolder="feature_extractor")

# If a custom 1.5 model is specified, load its weights
if weights_path:
    weights = load_file(weights_path)  # weights_path points to a .safetensors checkpoint
    unet.load_state_dict(weights, strict=False)

You can swap Stable Diffusion 1.5 for a more powerful model, since the code uses pipelines from the diffusers library. To do this, it is enough to change sd_path and StableDiffusionPipeline.

from diffusers import DDIMScheduler, StableDiffusionPipeline

pipe = StableDiffusionPipeline(
    vae=vae,
    text_encoder=text_encoder,
    tokenizer=tokenizer,
    unet=unet,
    scheduler=DDIMScheduler.from_pretrained(sd_path, subfolder="scheduler"),
    safety_checker=safety_checker,  # passing None here disables the SD model's censorship
    requires_safety_checker=False,
    feature_extractor=feature_extractor,
).to(device)  # device = "cuda" or "cpu"

See also the diffusers documentation on Stable Diffusion XL and Turbo.
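As a sketch of what the swap to an XL pipeline could look like with diffusers; the model id and parameters below are my assumptions rather than something bundled with Wunjo CE:

# Sketch: loading the public SDXL base checkpoint instead of SD 1.5.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

image = pipe("a cat sitting on a windowsill, photorealistic").images[0]
image.save("cat.png")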

Using ControlNet

ControlNet and the corresponding pipelines are used to change image elements in the application. Example setup:

# Imports for the canny preprocessing and ControlNet loading
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel

def make_canny_condition(image):
    # Build a canny edge map that ControlNet will use as the conditioning image
    image = np.array(image)
    image = cv2.Canny(image, 100, 200)
    image = image[:, :, None]
    image = np.concatenate([image, image, image], axis=2)  # replicate edges to 3 channels
    image = Image.fromarray(image)
    return image

if controlnet_type == "canny":
    control_image = make_canny_condition(init_image)
else:
    control_image = init_image

controlnet = ControlNetModel.from_pretrained(controlnet_path, torch_dtype=torch.float16).to(device)

By default the canny method is available, but you can extend the application by adding other ControlNet methods. For example, for XL models you can replace ControlNet with an IP-Adapter or T2I-Adapter.
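To show how the pieces fit together, here is a hedged sketch of wiring the canny ControlNet into a diffusers inpaint pipeline. It reuses sd_path, controlnet, init_image, control_image and device from the snippets above; mask_image is a hypothetical mask marking the elements to change, and the pipeline choice and parameters are illustrative rather than the exact code in Wunjo CE.

# Sketch: ControlNet-guided inpainting with diffusers (illustrative only).
import torch
from diffusers import StableDiffusionControlNetInpaintPipeline

pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(
    sd_path, controlnet=controlnet, torch_dtype=torch.float16
).to(device)

result = pipe(
    prompt="replace the sky with a sunset",
    image=init_image,             # original frame
    mask_image=mask_image,        # white where elements should change (hypothetical mask)
    control_image=control_image,  # canny edges computed above
    num_inference_steps=30,
).images[0]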

Downloading ControlNet Models

In the .wunjo/all_models directory, create a controlnet_canny directory and download the models from the sd-controlnet-canny repository.

git clone https://huggingface.co/lllyasviel/sd-controlnet-canny

Also create a controlnet_tile directory and download the models from the control_v11f1e_sd15_tile repository.

git clone https://huggingface.co/lllyasviel/control_v11f1e_sd15_tile

Why ControlNet Tile? More on that later.

Video generation

To generate videos I use the stabilityai repository and the corresponding pipeline. The question arises: if the model is limited to images in the 576×1024 format and generates no more than 25 frames, how can we use any format and still get 4 seconds of video at 24 FPS?

# Imports for the Stable Video Diffusion components
import torch
from diffusers import (
    AutoencoderKLTemporalDecoder,
    EulerDiscreteScheduler,
    StableVideoDiffusionPipeline,
    UNetSpatioTemporalConditionModel,
)
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

# Define the components (sd_path is the path to the stabilityai repository)
vae = AutoencoderKLTemporalDecoder.from_pretrained(sd_path, subfolder="vae", torch_dtype=torch.float16)
image_encoder = CLIPVisionModelWithProjection.from_pretrained(sd_path, subfolder="image_encoder", torch_dtype=torch.float16)
scheduler = EulerDiscreteScheduler.from_pretrained(sd_path, subfolder="scheduler")
unet = UNetSpatioTemporalConditionModel.from_pretrained(sd_path, subfolder="unet", torch_dtype=torch.float16)
feature_extractor = CLIPImageProcessor.from_pretrained(sd_path, subfolder="feature_extractor")
# Initialize the pipeline
pipe = StableVideoDiffusionPipeline(
    vae=vae,
    image_encoder=image_encoder,
    scheduler=scheduler,
    unet=unet,
    feature_extractor=feature_extractor
)
pipe.enable_model_cpu_offload()                    # offload idle components to CPU to save VRAM
pipe.unet.enable_forward_chunking()                # chunk the temporal forward pass to reduce memory
pipe.enable_xformers_memory_efficient_attention()
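The assembled pipeline is then called on a prepared image, roughly like this; the parameter values are illustrative assumptions, not Wunjo CE defaults:

# Illustrative call of the pipeline defined above.
from diffusers.utils import load_image, export_to_video

image = load_image("input_576x1024.png")  # image already outpainted to the 576×1024 format
frames = pipe(
    image,
    num_frames=25,         # the per-run limit mentioned above
    decode_chunk_size=4,   # decode a few frames at a time to save VRAM
    motion_bucket_id=127,  # higher values add more motion
).frames[0]
export_to_video(frames, "clip.mp4", fps=24)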

Preparing images

Before feeding the image to the video generation model, I outpaint it to the 576×1024 format without changing the content of the user's frame. After outpainting, I use controlnet_tile with the appropriate mask to improve the quality of the added zones. The better the quality of the filled-in zones, the better the animation: generation then focuses on the movement of objects rather than on the newly created zones.
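A rough sketch of the canvas expansion step: it only pads the frame and builds a mask of the added zones (the actual outpainting and the controlnet_tile pass in the project then fill and refine those zones). Sizes, centering and variable names are assumptions for illustration.

# Sketch: place the user's frame on a 1024x576 canvas (width x height) and
# build a mask that is white only in the zones that still need to be filled in.
from PIL import Image

def pad_to_canvas(frame, size=(1024, 576)):
    frame = frame.copy()
    frame.thumbnail(size)             # fit inside the target without cropping
    canvas = Image.new("RGB", size, "black")
    mask = Image.new("L", size, 255)  # 255 = zone to outpaint
    offset = ((size[0] - frame.width) // 2, (size[1] - frame.height) // 2)
    canvas.paste(frame, offset)
    mask.paste(0, offset + (offset[0] + frame.width, offset[1] + frame.height))
    return canvas, mask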

Generation iterations

The video cannot be extended indefinitely, because the model adds noise to each final frame; after the third iteration you get a mess of pixels. So I generate the first two iterations, reverse them, generate the iterations again, and combine everything, cropping to the user's aspect ratio. These tricks extend what the stabilityai model can do.
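My reading of that trick as a simplified sketch; generate_clip() stands in for a StableVideoDiffusionPipeline call, and the helper names are hypothetical:

# Sketch of chaining runs: go forward for a couple of iterations, play the
# frames back in reverse to return to the start, then go forward again.
def extend_video(start_image, generate_clip, iterations=2):
    frames = []
    current = start_image
    for _ in range(iterations):
        clip = generate_clip(current)   # at most ~25 frames per run
        frames.extend(clip)
        current = clip[-1]              # chain the next run from the last frame
    return frames

forward = extend_video(first_frame, generate_clip)
backward = list(reversed(forward))               # walk back to the original frame
forward_again = extend_video(first_frame, generate_clip)
all_frames = forward + backward + forward_again  # then crop to the user's aspect ratio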

Improving the quality of video generation

You can replace the stable-video-diffusion-img2vid-xt repository with stable-video-diffusion-img2vid-xt-1-1 to get better video quality than the model I use.
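For reference, a sketch of loading the 1.1 checkpoint directly through the pipeline; note that this repository is gated on Hugging Face, so you may need to accept its license first:

# Sketch: loading the 1.1 checkpoint instead of assembling components by hand.
import torch
from diffusers import StableVideoDiffusionPipeline

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt-1-1",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.enable_model_cpu_offload()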

Results and comparison

For you, I have collected in one video various examples of generation in comparison with Pika and Gen-2. I have not included Gen-3, Luma, Sora models, since open source models cannot compete with them. From the entire list, I was able to use only Luma for free, and even then with restrictions – for example, no more than 5 generations per day. In this comparison, I focused on approaches that give approximately the same generation time. It is important to understand that Wunjo CE is a completely free and unlimited open source solution.

Features of the models

The models are listed in the order in which their results appear in the video.

  • Pika: The model does not add much movement, but it smooths the result, sometimes too much. It can also add sound to the video.

  • Wunjo CE: Preserves the original image quality and adds interesting movement to some objects. However, the direction of that movement can change randomly within the frame, and generating one video takes about 15 minutes on an NVIDIA GeForce RTX 3070 with 8 GB of VRAM.

  • Gen-2: Adds more realism, but can create distortions on unusual objects. You can extend the duration of the result, but the quality decreases with each iteration.

Examples of generation

The prompts were simple, like “cat sitting” or “dogs riding in a car”. I used only one video generation for each approach, without specially selecting or improving the results, to demonstrate the real potential of the models.

The cat is sitting

Dogs are riding in the car

This generation made me laugh: the model generated the frames so interestingly that it seems as if the girl is cursing at life for some of her own reasons.

Lofi girl

To explain in detail the parameters of video generation and the use of the application, I made a tutorial in which we get the following results:

Video example

Watch the video itself to see all the generation results.

Additional functionality

You can learn about the rest of Wunjo CE's functionality from the videos in the playlist. There are overviews of the main features, installation instructions for both the source code and the installer, and a guide to using the API in your own projects.

Support the project

If you want to support the project, go to GitHub and bookmark it so you do not lose it or miss updates. Plans include adding audio generation for video and animation of a talking head from an image. You can download installers from the official website wunjo.online and from Boosty. On Boosty you can vote on which functionality's code will be open sourced and published on GitHub. It all depends on your interest.

Alternatives

As with any article about video generation, it is impossible not to mention alternatives. One interesting open source project is ToonCrafter, which generates hand-drawn animations. It only creates motion between the first and last frame, not from text or a single image. The resolution is quite low, 320×512, and I have not tested how much video memory is required to run it. Still, it is a good alternative with potential for improvement. The way the ToonCrafter model adds motion to animation is something I really like. I collect all interesting solutions for video, voice cloning, and more here.

Your suggestions

Be sure to share your open source alternatives for video generation in the comments. This will help improve the current approach and build a knowledge base on this new and exciting direction.

A bit of philosophy

Generating video from text and images is not just a technological achievement, but a new form of creativity and self-expression. Thanks to open source and commercial projects, a new way of expressing ideas is emerging, where text alone can produce a video of something that would be difficult and expensive to shoot.

It is also interesting that a single request could adapt one video for different countries and regions, changing skin color, faces, objects and captions, or adding new elements and footage. The future promises even more realistic and detailed models. Video generation is becoming not just a tool, but a new philosophy of creativity, combining technology, simplicity and art.
