How to run part of a big AI on weak hardware

This notebook will introduce you to the basics of Petals, a system for inference and fine-tuning of language models with hundreds of billions of parameters without the need for high-end GPUs. With Petals, you pool computing resources with other people and can run large language models with 176 billion parameters, such as BLOOM-176B or BLOOMZ, which are roughly the same size as GPT-3.

If you have any problems launching this notebook, please let us know in the #running-a-client channel of our Discord!

Let’s install Petals:

# IPython magic is commented out for plain-Python compatibility.
# %pip install -q petals

Step 1. The easiest way to generate text

Let’s start with the DistributedBloom model and apply it to text generation.

This machine will download a small fraction of the model’s weights (~8 GB out of 352 GB) and will rely on other computers in the network to run the rest of the model. Loading the local part of the weights usually takes ~3 minutes.

We suggest starting with the regular BLOOM, but you can also use BLOOMZ, a version of BLOOM fine-tuned to follow human instructions in the zero-shot setting. To load that model, set the model name: MODEL_NAME = "bigscience/bloomz-petals".

import torch
from transformers import BloomTokenizerFast
from petals import DistributedBloomForCausalLM

MODEL_NAME = "bigscience/bloom-petals"
tokenizer = BloomTokenizerFast.from_pretrained(MODEL_NAME)
model = DistributedBloomForCausalLM.from_pretrained(MODEL_NAME)
model = model.cuda()

Now let’s try to generate something using the model.generate() method.

The first call takes about 5 seconds to connect to the Petals swarm. After connecting, you can expect a generation speed of 1–1.5 seconds per token. If you don’t have enough GPUs to host the entire model, this is much faster than alternatives such as offloading, which takes at least 10–20 seconds per token.

inputs = tokenizer('A cat in French is "', return_tensors="pt")["input_ids"].cuda()
outputs = model.generate(inputs, max_new_tokens=3)
print(tokenizer.decode(outputs[0]))
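
If you want to check the speed numbers above on your own connection, here is a rough timing sketch (the exact figures depend on the swarm’s current load and your network):

import time

# Rough speed check: average seconds per generated token.
# Remember that the very first call also includes ~5 s to join the swarm.
n_new_tokens = 16
start = time.time()
outputs = model.generate(inputs, max_new_tokens=n_new_tokens)
print(f"{(time.time() - start) / n_new_tokens:.2f} sec/token")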

By default, model.generate() runs greedy generation. You can choose other methods, such as top-p/top-k sampling or beam search, by passing the appropriate parameters. You can even write custom generation methods, more on that below.
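
For example, here is how the same prompt can be generated with nucleus sampling instead of greedy decoding (the sampling parameters follow the standard Hugging Face generate() interface and are the same ones used in the chatbot example later in this notebook; the values are just illustrative):

# Top-p (nucleus) sampling with a temperature
outputs = model.generate(
    inputs, max_new_tokens=3, do_sample=True, top_p=0.9, temperature=0.75
)
print(tokenizer.decode(outputs[0]))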

Note that your data is processed by other people in the public swarm. Learn more about privacy here. For sensitive data, you can set up a private swarm among people you trust.
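
A private swarm is started the same way as the public one, except that your servers and clients point at your own bootstrap peers instead of the public ones. The sketch below is only an illustration: the initial_peers argument and the multiaddress are placeholders whose exact form may differ between Petals versions, so check the README for the current instructions.

# Hypothetical sketch of connecting a client to a private swarm.
# PRIVATE_INITIAL_PEERS is a placeholder: put the multiaddresses
# reported by your own bootstrap server here.
PRIVATE_INITIAL_PEERS = ["/ip4/10.0.0.1/tcp/31337/p2p/Qm..."]  # placeholder
model = DistributedBloomForCausalLM.from_pretrained(
    MODEL_NAME, initial_peers=PRIVATE_INITIAL_PEERS
)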

Step 2. Chatbots and Interactive Generation

If you want to communicate with the model interactively, you can use the inference session interface, which provides an easy way to print generated tokens on the fly or build a chatbot that responds to human phrases.

An inference session looks for a chain of servers that will run successive inference steps and store the past attention caches. That way, you do not need to re-run previous tokens through the transformer to generate each new phrase. If one of the remote servers fails, Petals will automatically find a replacement and regenerate only a small part of the caches.

Let’s see how to use Petals to write a simple chatbot that shows tokens as soon as they are created:

with model.inference_session(max_length=512) as sess:
    while True:
        prompt = input('Human: ')
        if prompt == "":
            break
        prefix = f"Human: {prompt}\nFriendly AI:"
        prefix = tokenizer(prefix, return_tensors="pt")["input_ids"].cuda()
        print("Friendly AI:", end="", flush=True)

        while True:
            outputs = model.generate(
                prefix, max_new_tokens=1, do_sample=True, top_p=0.9, temperature=0.75, session=sess
            )
            outputs = tokenizer.decode(outputs[0, -1:])
            print(outputs, end="", flush=True)
            if "\n" in outputs:
                break
            prefix = None  # Prefix is passed only for the 1st token of the bot's response

Building apps with Petals

If you are developing a tool for other people, Petals lets you turn this code into a user-friendly web application, such as chat.petals.ml. Under the hood, such an application can connect to a lightweight HTTP endpoint for inference that forwards all requests to the Petals swarm.
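
As a rough illustration, such an endpoint can be as simple as a small web service that wraps model.generate(). The Flask-based sketch below is only an assumption about how you might do this yourself; it is not how chat.petals.ml is implemented.

# A minimal HTTP endpoint sketch (Flask is an illustrative choice).
# It reuses the `tokenizer` and `model` objects loaded in Step 1.
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/generate", methods=["POST"])
def generate():
    prompt = request.json["prompt"]
    inputs = tokenizer(prompt, return_tensors="pt")["input_ids"].cuda()
    outputs = model.generate(inputs, max_new_tokens=32)
    return jsonify({"output": tokenizer.decode(outputs[0])})

# app.run(host="0.0.0.0", port=5000)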

If you are building a BLOOM-based app with Petals, make sure it complies with the BLOOM terms of use.

Step 3. How does it work?

Your model is the real BLOOM-176B, but only part of it is loaded onto your machine’s GPU. Let’s look under the hood:

model.transformer
DistributedBloomModel(
  (word_embeddings): Embedding(250880, 14336)
  (word_embeddings_layernorm): LayerNorm((14336,), eps=1e-05, elementwise_affine=True)
  (h): RemoteSequential(modules=bigscience/bloom-petals.0..bigscience/bloom-petals.69)
  (ln_f): LayerNorm((14336,), eps=1e-05, elementwise_affine=True)
)

The word embeddings and some other layers are regular PyTorch modules hosted on your machine, while the rest of the model (namely, the transformer blocks) is wrapped in the RemoteSequential class. This is an extended PyTorch module that runs on a distributed swarm of other machines.

However, you can access individual layers and their outputs, and run forward and backward passes through them:

first_five_layers = model.transformer.h[0:5]
first_five_layers

dummy_inputs = torch.randn(1, 3, 14336, dtype=torch.bfloat16, requires_grad=True)
outputs = first_five_layers(dummy_inputs)
outputs

loss = torch.mean((outputs - torch.ones_like(outputs)) ** 2)
loss.backward() # backpropagate through the internet
print("Grad w.r.t. inputs:", dummy_inputs.grad.flatten())
Grad w.r.t. inputs: tensor([ 0.0265, -0.0212,  0.0121,  ...,  0.0019, -0.0002,  0.0012],
       dtype=torch.bfloat16)

In general, you can mix and match distributed layers like in normal PyTorch, and even insert and train custom layers (like adapters) between pre-trained layers.

Read the details in our article

Step 4. Adding a Trainable Adapter

Although the remotely hosted transformer blocks are frozen to keep the pretrained model identical for all users, BLOOM can still be adapted to a variety of downstream tasks with parameter-efficient adapters (small trainable layers inserted between the blocks of the pretrained model, as in LoRA) or with trainable prompts prepended to the model inputs, as in P-Tuning v2.

Below is an example of adding a simple trainable linear layer between the 5th and 6th transformer blocks of the pretrained model. The layer’s weights and the corresponding optimizer statistics will be stored locally:

import torch.nn as nn
import torch.nn.functional as F

class BloomBasedClassifier(nn.Module):
    def __init__(self, model):
        super().__init__()
        self.distributed_layers = model.transformer.h
        self.adapter = nn.Sequential(nn.Linear(14336, 32), nn.Linear(32, 14336))
        self.head = nn.Sequential(nn.LayerNorm(14336), nn.Linear(14336, 2))

    def forward(self, embeddings):
        hidden_states = self.distributed_layers[0:6](embeddings)
        hidden_states = self.adapter(hidden_states)
        hidden_states = self.distributed_layers[6:10](hidden_states)
        pooled_states = torch.mean(hidden_states, dim=1)
        return self.head(pooled_states)

classifier = BloomBasedClassifier(model).cuda()
opt = torch.optim.Adam(classifier.parameters(), 3e-5)
inputs = torch.randn(3, 2, 14336, device="cuda")
labels = torch.tensor([1, 0, 1], device="cuda")

for i in range(5):
    loss = F.cross_entropy(classifier(inputs), labels)
    print(f"loss[{i}] = {loss.item():.3f}")
    opt.zero_grad()
    loss.backward()
    opt.step()

print('predicted:', classifier(inputs).argmax(-1))  # 1, 0, 1
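
The same building blocks can be used for the prompt-based approach mentioned above. Here is a minimal sketch of prompt tuning (not the full P-Tuning v2 recipe): a few trainable "soft prompt" vectors are prepended to the input embeddings, while all remote transformer blocks stay frozen.

# Minimal soft-prompt sketch: only `soft_prompt` (and any local head you
# attach) would be trained; the remote transformer blocks stay frozen.
num_prompt_tokens = 16
soft_prompt = nn.Parameter(torch.randn(1, num_prompt_tokens, 14336, device="cuda") * 0.02)

token_ids = tokenizer("This movie was great!", return_tensors="pt")["input_ids"].cuda()
embs = model.transformer.word_embeddings(token_ids)
embs = model.transformer.word_embeddings_layernorm(embs)
embs = torch.cat([soft_prompt.to(embs.dtype), embs], dim=1)  # prepend the prompt

hidden = model.transformer.h(embs)   # forward pass through the remote blocks
hidden = model.transformer.ln_f(hidden)
# ...pool `hidden`, attach a local head, and optimize soft_prompt together
# with the head, exactly as in the adapter example above.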

Step 5. Custom Sampling Methods

The model.inference_session() interface in Petals allows you to write custom inference code. You can use it to implement any sampling algorithm you need, or even a beam search that forbids swearing.

Here’s how you can re-implement the standard model.generate() interface by manually running forward passes through all the layers:

text = "What is a good chatbot? Answer:"
token_ids = tokenizer(text, return_tensors="pt")["input_ids"].cuda()
max_length = 100
with torch.inference_mode():
with model.inference_session(max_length=max_length) as sess:
while len(text) < max_length:
embs = model.transformer.word_embeddings(token_ids)
embs = model.transformer.word_embeddings_layernorm(embs)

 h = sess.step(embs)
 h_last = model.transformer.ln_f(h[:, -1])
 logits = model.lm_head(h_last)

 next_token = logits.argmax(dim=-1)
 text += tokenizer.decode(next_token)
 token_ids = next_token.reshape(1, 1)
 print(text)
What is a good chatbot? Answer: A
What is a good chatbot? Answer: A chat
What is a good chatbot? Answer: A chatbot
What is a good chatbot? Answer: A chatbot that
What is a good chatbot? Answer: A chatbot that is
What is a good chatbot? Answer: A chatbot that is able
What is a good chatbot? Answer: A chatbot that is able to
What is a good chatbot? Answer: A chatbot that is able to answer
What is a good chatbot? Answer: A chatbot that is able to answer the
What is a good chatbot? Answer: A chatbot that is able to answer the most
What is a good chatbot? Answer: A chatbot that is able to answer the most common
What is a good chatbot? Answer: A chatbot that is able to answer the most common questions
What is a good chatbot? Answer: A chatbot that is able to answer the most common questions of
What is a good chatbot? Answer: A chatbot that is able to answer the most common questions of your
What is a good chatbot? Answer: A chatbot that is able to answer the most common questions of your customers
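
To turn the greedy loop above into a sampler, you could, for example, replace the argmax line with temperature sampling. This is a sketch of one possible choice, not the built-in logic of model.generate():

# Instead of: next_token = logits.argmax(dim=-1)
probs = torch.softmax(logits.float() / 0.75, dim=-1)              # temperature 0.75
next_token = torch.multinomial(probs, num_samples=1).squeeze(-1)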

Step 6. Sharing Is Caring

We designed Petals as a community-driven system, so we rely on people contributing their GPUs to increase the swarm’s capacity. If you have GPUs that are not always busy, please consider running a Petals server. You can pause it at any time if you want to use the GPUs for something else. Those who run a server get a certain speedup when using Petals, because a larger part of the model is hosted locally.

If you have a GPU machine with a static public IP, you can start the server in an Anaconda environment:

conda install pytorch pytorch-cuda=11.7 -c pytorch -c nvidia
pip install -U petals
python -m petals.cli.run_server bigscience/bloom-petals

Or via our GPU-enabled Docker image:

sudo docker run --net host --ipc host --gpus all --volume petals-cache:/cache --rm \
 learningathome/petals:main python -m petals.cli.run_server bigscience/bloom-petals

Hosting a server does not allow other people to run custom code on your computer. Learn more about security here.

If your computer is behind NAT or a firewall, setting up a public server can be more difficult, but it is still possible. Describe your setup in the #running-a-server channel of our Discord, and we will help you.

Step 7. Other Efficient Fine-Tuning Methods

While you can write your own adapters from scratch, Petals also implements several standard parameter-efficient fine-tuning methods. Our GitHub repository provides more complete examples:

  • Personalized chatbot training: notebook
  • Tweaking BLOOM for semantic text classification: notebook

What else?

Now you know how to use Petals for different tasks, how it works under the hood, and how to increase the swarm’s capacity.

Here are some helpful resources:

  • Learn more about Petals. The README file in our GitHub repo has links to additional Petals-related materials, including instructions on how to start your own swarm (possibly with a model other than BLOOM).
  • Discord server. If you have feedback, questions or technical issues, join the Discord server and let us know. If you’d like to create something based on Petals, we’d love to hear what you have in mind.
  • Research paper. We released a paper that details our research and what goes on under the hood of Petals.
