How video generation works in the open source project Wunjo CE
Getting to know the functionality
If you want to see the full functionality, an explanation of the parameters, and instructions on how to use the application, there is a video at the end of the article. If you prefer not to dive into the technical details, that same video also shows how to use video generation and compares the results with Pika and Gen-2.
Anatole France wrote: “You can only learn joyfully… To digest knowledge, you must absorb it with appetite.”
These words capture the idea that everyone will find something valuable in this material, and that some details can safely be skipped.
Specifications for video generation
Video can be generated from text and from images using Stable Diffusion models. You can add custom 1.5 models and even extend the approach to XL. However, in my opinion this does not make much sense: the video generation model will not deliver the quality that can be achieved with Stable Diffusion XL for images, and it will require more compute.
Generation parameters
FPS: The maximum value is 24 FPS, which gives 4 seconds of video. By decreasing the FPS, you increase the length of the video (see the short calculation after this list).
VRAM: Video generation is possible on 8 GB of VRAM.
Formats: Popular aspect ratios are available for generation from text: 16:9 (YouTube), 9:16 (Shorts), 1:1, 5:2, 4:5, 4:3. When generating from an image, its aspect ratio is preserved.
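The frame budget is fixed, so the clip length follows directly from the FPS. A tiny illustration (the 96-frame total is inferred from "24 FPS gives 4 seconds" and is not a documented constant):
# Duration is simply the fixed number of frames divided by the chosen FPS
TOTAL_FRAMES = 96  # inferred from "24 FPS -> 4 seconds", not a documented value
for fps in (24, 12, 8):
    print(f"{fps} FPS -> {TOTAL_FRAMES / fps:.0f} seconds of video")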
Generate video from text
First, an image is generated in the format you choose. You can see the image, regenerate it, or change details before creating the video.
Generate video from image
The aspect ratio of the original image is used, but you can fine-tune its elements before generating the video.
Manual setup
If the automatic download of the necessary repositories fails, you will need about 85-90 GB of space on your hard drive to set everything up manually. You can choose where the models are stored; more details are in the documentation.
The program downloads all repositories and models automatically, but downloading from Hugging Face can fail without a VPN. In that case, go to the .wunjo/all_models directory.
Download manually
Runwayml: This repository includes the necessary models such as vae, safety_checker, text_encoder, unet, and others. Create a runwayml directory and download the models into the appropriate folders.
You can also automate this from the console. Before downloading, make sure Git LFS is installed so large files can be fetched:
git lfs install
Then clone the repository:
git clone https://huggingface.co/runwayml/stable-diffusion-v1-5
Downloading Custom Stable Diffusion Models
You can download various Stable Diffusion 1.5 models into the .wunjo/all_models/diffusion directory; they will be used to generate images. These models are publicly available, for example on Hugging Face or Civitai.
Setting up custom_diffusion.json
In the file .wunjo/all_models/custom_diffusion.json you specify the paths to your models. Example configuration:
[
  {
    "name": "Fantasy World",
    "model": "fantasyWorld_v10.safetensors",
    "img": "https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/2e321e71-c144-4ba6-8b55-3f62756fc0a1/width=1024,quality=90/01828-3669354034-giant%20pink%20balloon,cities%20floating%20on%20balloons,magnificent%20clouds,pink%20theme,.jpeg",
    "url": ""
  }
]
If you specify the url, then the model is downloaded automatically.
This step can be skipped: two Stable Diffusion 1.5 models are included by default, which is enough for creating both realistic and hand-drawn content.
Switching to the diffusers library
Previously, the code worked only with the necessary parts of the generative models, which saved time and reduced the amount of unnecessary code. However, to expand the functionality, I decided to switch to the diffusers library. Specifically, the runwayml repository is used to generate images.
Example code from the project:
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel
from diffusers.pipelines.stable_diffusion.safety_checker import StableDiffusionSafetyChecker
from transformers import CLIPTextModel, CLIPTokenizer, CLIPImageProcessor
from safetensors.torch import load_file

# Define the runwayml model components
vae = AutoencoderKL.from_pretrained(sd_path, subfolder="vae", torch_dtype=torch.float16)
text_encoder = CLIPTextModel.from_pretrained(sd_path, subfolder="text_encoder", torch_dtype=torch.float16)
tokenizer = CLIPTokenizer.from_pretrained(sd_path, subfolder="tokenizer")
unet = UNet2DConditionModel.from_pretrained(sd_path, subfolder="unet", torch_dtype=torch.float16)
safety_checker = StableDiffusionSafetyChecker.from_pretrained(sd_path, subfolder="safety_checker", torch_dtype=torch.float16)  # used to filter NSFW content
feature_extractor = CLIPImageProcessor.from_pretrained(sd_path, subfolder="feature_extractor")

# If a custom 1.5 model is specified, load its weights into the UNet
if weights_path:
    weights = load_file(weights_path)  # path to the custom .safetensors checkpoint
    unet.load_state_dict(weights, strict=False)
You can swap Stable Diffusion 1.5 for a more powerful model, since the code uses pipelines from the diffusers library. To do this, it is enough to change sd_path and the StableDiffusionPipeline class.
from diffusers import StableDiffusionPipeline, DDIMScheduler

pipe = StableDiffusionPipeline(
    vae=vae,
    text_encoder=text_encoder,
    tokenizer=tokenizer,
    unet=unet,
    scheduler=DDIMScheduler.from_pretrained(sd_path, subfolder="scheduler"),
    safety_checker=safety_checker,  # passing None here disables the SD model's censorship
    requires_safety_checker=False,
    feature_extractor=feature_extractor,
).to(device)
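For illustration, here is a minimal usage sketch of generating the start image; the prompt, resolution, and sampling settings below are hypothetical, not the values Wunjo CE actually uses:
# Illustrative call only; the prompt and settings are made up for this example
image = pipe(
    prompt="a cat sitting on a windowsill, cinematic light",
    negative_prompt="blurry, low quality",
    width=768, height=432,  # a 16:9 frame as an example
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
image.save("start_frame.png")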
See also the diffusers documentation on Stable Diffusion XL and Turbo.
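As a hedged sketch of that swap (assuming the public stabilityai/stable-diffusion-xl-base-1.0 checkpoint, which is not bundled with Wunjo CE), an SDXL model can be loaded through its dedicated pipeline instead of assembling the 1.5 components by hand:
from diffusers import StableDiffusionXLPipeline

# Load an SDXL checkpoint directly; requires noticeably more VRAM than SD 1.5
pipe_xl = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to(device)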
Using ControlNet
ControlNet and the corresponding pipelines are used to change image elements in the application. Example setup:
import cv2
import numpy as np
from PIL import Image
from diffusers import ControlNetModel

def make_canny_condition(image):
    # Build an RGB Canny edge map to condition the ControlNet on
    image = np.array(image)
    image = cv2.Canny(image, 100, 200)
    image = image[:, :, None]
    image = np.concatenate([image, image, image], axis=2)
    image = Image.fromarray(image)
    return image

if controlnet_type == "canny":
    control_image = make_canny_condition(init_image)
else:
    control_image = init_image

controlnet = ControlNetModel.from_pretrained(controlnet_path, torch_dtype=torch.float16).to(device)
By default the canny method is available, but you can extend the application by adding other ControlNet methods. For example, for XL models you can replace ControlNet with IP-Adapter or T2I-Adapter.
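For illustration, a hedged sketch (not the exact project code) of how such a control image can be fed into a diffusers ControlNet img2img pipeline assembled from the same components as above:
from diffusers import StableDiffusionControlNetImg2ImgPipeline

pipe_cn = StableDiffusionControlNetImg2ImgPipeline(
    vae=vae,
    text_encoder=text_encoder,
    tokenizer=tokenizer,
    unet=unet,
    controlnet=controlnet,
    scheduler=DDIMScheduler.from_pretrained(sd_path, subfolder="scheduler"),
    safety_checker=None,  # None disables the safety filter
    requires_safety_checker=False,
    feature_extractor=feature_extractor,
).to(device)

edited = pipe_cn(
    prompt="the same scene at sunset",  # hypothetical edit prompt
    image=init_image,
    control_image=control_image,
    strength=0.75,
    num_inference_steps=30,
).images[0]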
Downloading ControlNet Models
In the .wunjo/all_models directory, create a controlnet_canny directory and download the models from the sd-controlnet-canny repository.
git clone https://huggingface.co/lllyasviel/sd-controlnet-canny
Also create a controlnet_tile directory and download the models from the control_v11f1e_sd15_tile repository.
git clone https://huggingface.co/lllyasviel/control_v11f1e_sd15_tile
Why ControlNet Tile? More on that later.
Video generation
To generate videos I use the stabilityai repository and the corresponding pipeline. The question arises: if the model is limited to 576×1024 images and generates no more than 25 frames, how can we support any format and get 4 seconds of video at 24 FPS?
from diffusers import (
    AutoencoderKLTemporalDecoder,
    EulerDiscreteScheduler,
    StableVideoDiffusionPipeline,
    UNetSpatioTemporalConditionModel,
)
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

# Define the components
vae = AutoencoderKLTemporalDecoder.from_pretrained(sd_path, subfolder="vae", torch_dtype=torch.float16)
image_encoder = CLIPVisionModelWithProjection.from_pretrained(sd_path, subfolder="image_encoder", torch_dtype=torch.float16)
scheduler = EulerDiscreteScheduler.from_pretrained(sd_path, subfolder="scheduler")
unet = UNetSpatioTemporalConditionModel.from_pretrained(sd_path, subfolder="unet", torch_dtype=torch.float16)
feature_extractor = CLIPImageProcessor.from_pretrained(sd_path, subfolder="feature_extractor")

# Initialize the pipeline with memory-saving options for 8 GB of VRAM
pipe = StableVideoDiffusionPipeline(
    vae=vae,
    image_encoder=image_encoder,
    scheduler=scheduler,
    unet=unet,
    feature_extractor=feature_extractor,
)
pipe.enable_model_cpu_offload()
pipe.unet.enable_forward_chunking()
pipe.enable_xformers_memory_efficient_attention()
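For context, a minimal sketch of how such a pipeline is typically called in diffusers; the parameter values below are illustrative, not the exact ones Wunjo CE uses:
from diffusers.utils import export_to_video

# One pass produces up to 25 frames from a single conditioning image
frames = pipe(
    image=init_image,  # a PIL image already brought to 576x1024
    num_frames=25,
    decode_chunk_size=4,  # decode a few frames at a time to save VRAM
    motion_bucket_id=127,  # higher values produce more motion
    noise_aug_strength=0.02,
).frames[0]
export_to_video(frames, "result.mp4", fps=24)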
Preparing images
Before feeding the image to the video generation model, I pad (outpaint) it to the 576×1024 format without changing the content of the user's frame. After the outpaint I apply controlnet_tile with an appropriate mask to improve the quality of the added zones. The better the quality of the filled-in zones, the better the animation: generation then focuses on the movement of objects rather than on the newly created areas.
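As an illustration of the idea (my own sketch, not the project code; the helper name, padding color, and centering are arbitrary choices):
from PIL import Image

def pad_to_canvas(image, target_w=1024, target_h=576):
    # Fit the user frame inside the target canvas without cropping or distorting it
    scale = min(target_w / image.width, target_h / image.height)
    new_w, new_h = int(image.width * scale), int(image.height * scale)
    resized = image.resize((new_w, new_h), Image.LANCZOS)
    canvas = Image.new("RGB", (target_w, target_h), (127, 127, 127))
    left, top = (target_w - new_w) // 2, (target_h - new_h) // 2
    canvas.paste(resized, (left, top))
    # Mask of the added zones: white where new content must be generated
    mask = Image.new("L", (target_w, target_h), 255)
    mask.paste(Image.new("L", (new_w, new_h), 0), (left, top))
    return canvas, mask

The padded canvas then goes through the tile ControlNet with this mask, so only the added areas are refined.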
Generation iterations
The video cannot be extended indefinitely, because the model adds noise to every final frame, and after the third iteration you get a mess of pixels. So I generate the first two iterations, reverse them, generate further iterations, and combine everything, cropping to the user's aspect ratio. These tricks push the limits of what the stabilityai model can do.
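Roughly, the chaining looks like this (a simplified sketch of the idea with hypothetical helper names, not the exact combination order used in the project):
def svd_pass(pipe, start_image, num_frames=25):
    # One Stable Video Diffusion pass: up to 25 frames from a single start image
    return pipe(image=start_image, num_frames=num_frames, decode_chunk_size=4).frames[0]

segment_1 = svd_pass(pipe, init_image)
segment_2 = svd_pass(pipe, segment_1[-1])  # continue from the last frame
clip = segment_1 + segment_2[1:]           # two forward iterations
clip = clip[::-1]                          # reverse; the clip now ends on the (near-)original frame
segment_3 = svd_pass(pipe, clip[-1])       # continue again from that clean frame
clip = clip + segment_3[1:]
# Finally the frames are cropped back to the user's aspect ratio before export.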
Improving the quality of video generation
You can replace the stable-video-diffusion-img2vid-xt repository with stable-video-diffusion-img2vid-xt-1-1 to get better video quality than the model I use.
Results and comparison
I have collected various generation examples in one video, compared against Pika and Gen-2. I did not include Gen-3, Luma, or Sora, since open source models cannot compete with them. Of that list, I was only able to use Luma for free, and even then with restrictions, for example no more than 5 generations per day. In this comparison I focused on approaches with roughly the same generation time. It is important to keep in mind that Wunjo CE is a completely free and unlimited open source solution.
Features of the models
In the order the results appear in the video:
Pika: The model does not add much movement, but it smooths the result, sometimes too much. It can also add sound to the video.
Wunjo CE: Preserves the original image quality and adds interesting movements to some objects. However, the directions of these movements can change randomly within the frame, and generating one video takes 15 minutes on an NVIDIA GeForce RTX 3070 with 8 GB of VRAM.
Gen-2: Adds more realism, but can create distortions for unusual objects. It is possible to extend the duration of the result, but the quality decreases with each iteration.
Examples of generation
The prompts were simple, like “cat sitting” or “dogs riding in a car”. I used only one video generation for each approach, without specially selecting or improving the results, to demonstrate the real potential of the models.
This generation made me laugh: the model generated the frames so interestingly that it seems as if the girl is cursing at life for some of her own reasons.
To explain in detail the parameters of video generation and the use of the application, I made a tutorial in which we get the following results:
Video example
Watch the video itself to see all the generation results.
Additional functionality
You can learn about the rest of Wunjo CE's functionality from the videos in the playlist. It includes overviews of the main features, installation instructions (from source and via the installer), and how to use the API in your own projects.
Support the project
If you want to support the project, go to GitHub and bookmark it so you do not lose it or miss updates. Plans include adding audio generation for video and animation of a talking head from an image. You can download installers from the official website wunjo.online and from Boosty. On Boosty, you can vote for which functionality's code will be open sourced and made available on GitHub. It all depends on your interest.
Alternatives
As with any article about video generation, it is impossible not to mention alternatives. One interesting open source project is ToonCrafter, which generates hand-drawn animations. It only creates motion between the first and last frame, not from text or a single image. The resolution is quite low, 320×512, and I have not tested how much video memory is required to run this solution. However, it is a good alternative with potential for improvement. The way the ToonCrafter model adds motion to the animation is something I really like. I collect all interesting solutions for video, voice cloning, and more here.
Your suggestions
Be sure to share your open source alternatives for video generation in the comments. This will help improve the current approach and build up a knowledge base on this new and exciting direction.
A bit of philosophy
Generating video from text and images is not just a technological achievement, but a new form of creativity and self-expression. Thanks to open source and commercial projects, a new way of expressing ideas is emerging, where text alone can produce a video of something that would be difficult and expensive to shoot.
It is also intriguing that, with a single request, it will become possible to adapt one video for different countries and regions: changing skin color, faces, objects, and captions, and adding new elements and footage. The future promises even more realistic and detailed models. Video generation is becoming not just a tool, but a new philosophy of creativity, combining technology, simplicity, and art.