Multimodal applications on Llama 3.2 and Llama Stack

The recent release of Llama 3.2, which includes the multimodal 11B and 90B versions, opens up the possibility of building AI applications that analyze visual input.

There have been multimodal models before, but this is the first official Llama release with such capabilities. The model can recognize objects and text in an image, much like GPT-4o. The technical recipe behind the multimodal Llama 3.2 is quite interesting: the previous version, Llama 3.1, a regular text-only LLM, was taken as the basis. That makes sense when you consider that the ultimate goal is to extract image features and "translate" them into text tokens.

An image encoder was added to the LLM: a module that embeds the representation of an input image into a vector space. Image adapter layers were added as well, to pass the resulting visual features into the language model. You can read more about encoders and image adapters in, for example, Bordes et al. 2024, an introduction to vision-language models. A VLM is trained on image-text pairs, and that is exactly how Llama 3.2 was trained, in several stages: first on a large data corpus, and then fine-tuned on a smaller but higher-quality sample. Past experience with the Llama 3 models shows that this approach gives good results: a base model trained on a large corpus (for example, the 15 trillion tokens of Llama 3) generalizes well during fine-tuning and is less prone to overfitting. One example is my model ruslandev/llama-3-8b-gpt-4o-ru1.0, which, after training on a small but high-quality dataset, surpassed GPT-3.5 on a Russian-language benchmark.
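To make the encoder-plus-adapter idea concrete, here is a conceptual sketch of that pipeline. This is not the actual Llama 3.2 code (in the real model the visual features are fed into the language model through cross-attention adapter layers); the class names and dimensions are made up for illustration.

```python
# Conceptual sketch of a VLM pipeline: image encoder -> adapter -> language model.
# This is NOT the real Llama 3.2 implementation; names and shapes are illustrative.
import torch
import torch.nn as nn


class ImageAdapter(nn.Module):
    """Projects image-encoder features into the language model's hidden space."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # vision_features: (batch, num_patches, vision_dim)
        return self.proj(vision_features)  # -> (batch, num_patches, llm_dim)


# Overall flow (pseudocode):
#   vision_features = image_encoder(pixels)          # e.g. a ViT-style encoder
#   visual_tokens   = adapter(vision_features)       # projection into LLM space
#   logits          = llm(text_tokens, visual_tokens)  # the LLM attends to the
#                                                       # visual tokens via cross-attention
```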

The architecture of the Llama 3.2 multimodal models is an interesting topic in its own right, but in this article I want to focus on the application side, that is, building AI applications with these models.

The creators of Llama have introduced Llama Stack, a framework for working with their models that lets you deploy multi-purpose APIs (for inference, agent systems, generating your own training data, and other tasks). Llama Stack has several client SDKs, including one for Python. Recently, support for the iOS mobile platform was added, since the Llama 3.2 1B and 3B models can run on a mobile device. These are ordinary text models, just very lightweight, comparable in quality to Gemma 2 and Phi-3.

But if you are specifically interested in multimodal Llama 3.2, deploying it on Llama Stack will require a GPU, especially for the 90B version. I deployed Llama Stack with the multimodal 11B model on immers.cloud, on an RTX 4090, and tested it through the Inference API and the Python client. In my experience, both the model and the API are quite ready for production. Llama Stack supports a variety of API backends, both self-hosted (for example, TGI) and cloud-hosted (AWS Bedrock, Together, and others).
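For orientation, here is a hedged sketch of what a multimodal request through the Python client looks like. The server address, model identifier, and especially the image payload schema (base64-encoded data inside the message content) are my assumptions; the exact format depends on the llama-stack-client version, so consult the SDK examples for the release you install.

```python
# Sketch of a multimodal chat completion against a Llama Stack server running
# the 11B vision model. The image content format below is an assumption; check
# the llama-stack-client docs/examples for the exact schema in your version.
import base64

from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:5000")

# Read a local image and encode it as base64 for the request payload.
with open("room.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.inference.chat_completion(
    model="Llama3.2-11B-Vision-Instruct",  # assumed model identifier
    messages=[
        {
            "role": "user",
            "content": [
                {"image": {"data": image_b64}},  # assumed image payload format
                "List the objects you can see in this photo.",
            ],
        }
    ],
)
print(response.completion_message.content)
```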

If you need to deploy Llama Stack on your own virtual machine, here is how I did it in the cloud. By the way, an RTX 4090 is more than enough to run the multimodal 11B model. If you want the 90B version, you can pick a different GPU, or several, since Llama Stack supports multi-GPU setups and quantization.

I installed the framework using Anaconda, but there is also an option for those who prefer Docker. You can see my test of the model and the framework in this video:

One drawback of the framework is its documentation, which could be more detailed. There is a demo example in the Llama Stack repository: an interior-designer assistant application. It demonstrates several concepts of the framework, including agent creation and configuration, multimodal inference, memory management, and RAG.
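To give a feel for the agent side of the framework, here is a rough sketch of creating and querying an agent with the Python client. The class and method names (Agent, AgentConfig, EventLogger, create_session, create_turn) follow the examples shipped with llama-stack-client at the time of writing and may have changed since; the instructions text and session name are made up.

```python
# Rough sketch of creating and querying a Llama Stack agent from Python.
# Class/method names follow the llama-stack-client examples at the time of
# writing and may have changed; treat this as an outline, not a reference.
from llama_stack_client import LlamaStackClient
from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.lib.agents.event_logger import EventLogger
from llama_stack_client.types.agent_create_params import AgentConfig

client = LlamaStackClient(base_url="http://localhost:5000")

agent_config = AgentConfig(
    model="Llama3.2-11B-Vision-Instruct",
    instructions="You are a helpful interior design assistant.",  # made-up prompt
    enable_session_persistence=False,
)

agent = Agent(client, agent_config)
session_id = agent.create_session("design-session")  # made-up session name

response = agent.create_turn(
    messages=[{"role": "user", "content": "Suggest a color palette for a small living room."}],
    session_id=session_id,
)

# Stream the agent's events (inference steps, tool calls, final answer) to stdout.
for log in EventLogger().log(response):
    log.print()
```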

What is good about yet another framework for generative AI? At first glance, the functionality of Llama Stack resembles what came before it: LangChain, LlamaIndex, and other similar frameworks. The good thing is that this tool is part of the Llama ecosystem and will likely become the official public API of future Llama versions; the focus on cross-platform support and on multiple tasks points in the same direction. Early homegrown solutions will most likely cease to be relevant, which also applies to my own framework, gptchain. Still, this is a clear sign that the generative AI application industry is maturing.
