CodeLlama in your keyboard | A local Copilot for any input field

So there I am in the morning (around 2 o'clock in the afternoon), standing by the coffee maker and leafing through the Habr feed, when CodeLlama drops. Copilot for the poor, or a panacea in the world of local text models? I will try not to answer that question, or else your downstairs neighbors will drown in the water now pouring off the screen.
Read on at your own risk. The article was written on autopilot and late at night, so what came out is an entity stretched over a globe and wrapped in technotext so as to arouse less suspicion in a casual reader. You get the idea of the level, right?

I will say right away that there is nothing fundamentally new in this article – just a guide to a couple of ready-made libraries and a very small MVP (60+60 lines of code) at the end.

Who might find this worth reading? Those who wrote to me in the Habr chat and on Telegram with questions like "How do I run a neural network?" or "How do I save a Colab notebook?". I don't mind answering at all, so here is a little tutorial worth about 10 minutes of your screen time.

I propose we make do with this brief an introduction and move straight on:


We are all used to neural networks being integrated into commercial products and running in the cloud. In this article I am going to run a model locally and hook it up not to one highly specialized application, but to keyboard input itself. For example, I am writing this text right now and can press a keyboard shortcut that will continue it for me.

It's funny that the idea above is a throwback to the dark past of GPT models as phone keyboard suggestions, which, I suspect, they would prefer to forget about)

If I compile the ideas I managed to generate while mulling over the introduction, it comes out to something like this:

  1. Standard user interfaces have such an entity as editable text fields, or input fields: the Wi-Fi password prompt, an open file in VS Code, the browser address bar, and so on.
    What unites them is that you can select text in them, cut it (Ctrl+X), copy it (Ctrl+C) and paste it (Ctrl+V). Let's ignore special cases like the Windows cmd console, where these combinations send service commands instead of performing their basic functions.

  2. Software can emulate user input from the keyboard. If you were born with the stigma of a pythonist, then your libraries for the rest of your life are pyautogui and PyDirectInput.
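
To make the second point concrete, here is a minimal pyautogui sketch (the text and the shortcuts are arbitrary examples of mine, not something the rest of the article depends on); whatever input field currently has focus receives the keystrokes:

import pyautogui

pyautogui.write("hello from python", interval=0.02)  # type the text character by character
pyautogui.hotkey("ctrl", "a")                        # select what was just typed
pyautogui.hotkey("ctrl", "c")                        # copy the selection to the clipboard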

Thinking about all this brings to mind such bearded utilities as PuntoSwitcher and Caramba. If you have never heard of them, here is how they work:

  1. Read the entire stream of the user's keyboard input

  2. Track the last word typed, checking it against Russian and English dictionaries

  3. If a word looks like gibberish in one language but, re-typed in the other layout, becomes a real word of the other (ghbdtn = привет, i.e. hello), the user forgot to switch the layout (a toy sketch of this check follows the list).

    1. In that case we switch the layout ourselves and erase the garbage the user has typed

    2. We type it again, but in the correct layout

    3. We do the previous two steps very quickly and imperceptibly for a user who is staring at the keyboard (unless, of course, they touch-type)
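
Purely for illustration, a toy sketch of that layout check with a hand-rolled key map and a two-word "dictionary" (nothing like the real utilities, just the idea):

# Map each key typed on the English layout to the letter the same key gives on the
# Russian ЙЦУКЕН layout, then check whether the result is a known Russian word.
EN_TO_RU = dict(zip(
    "qwertyuiop[]asdfghjkl;'zxcvbnm,.",
    "йцукенгшщзхъфывапролджэячсмитьбю",
))

RU_WORDS = {"привет", "пока"}  # stand-in for a real dictionary

def retyped_in_russian(word):
    return "".join(EN_TO_RU.get(ch, ch) for ch in word.lower())

def forgot_to_switch(word):
    return retyped_in_russian(word) in RU_WORDS

print(retyped_in_russian("ghbdtn"))  # привет
print(forgot_to_switch("ghbdtn"))    # True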

Now let's repeat the same exercise, but in the context of working with neural networks.

In a previous article on the topic (see "Really Endless (Summer) RuGPT3.5: Generation of a Novel on the Go by a Neural Network") it might have seemed that I was very critical of the whole llama family, but that is not so. It is just that a model which natively writes Russian "primyerno tak" (roughly like that) did not suit that particular case at all.

The LLaMA models are outstanding, if only because they are as scalable as it gets. The developers pretrained all the standard sizes in advance, with local launch on a GPU of any price segment in mind.

And since we are talking about model size, we need to decide on it right away.

Cheat sheet: model sizes depending on the version

Model | Original size | Quantized size
7B    | 13 GB         | 3.9 GB
13B   | 24 GB         | 7.8 GB
30B   | 60 GB         | 19.5 GB
65B   | 120 GB        | 38.5 GB

If you want a quick estimate, just multiply the number of billions of parameters by 2 to get the size in gigabytes (for the original fp16 weights).
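
The same back-of-the-envelope rule in code (my rough estimate of the weights alone, ignoring activations and other overhead):

def vram_gb(params_billions, bytes_per_param=2.0):
    # fp16 is ~2 bytes per parameter, 4-bit quantization is ~0.5 bytes per parameter
    return params_billions * 1e9 * bytes_per_param / 1024**3

print(round(vram_gb(7), 1))       # ~13.0 GB in fp16
print(round(vram_gb(7, 0.5), 1))  # ~3.3 GB in 4-bit, the same ballpark as the 3.9 GB above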

The choice of coffee variety does not matter much to me personally, but since I am used to a 50/50 mix of Arabica and Robusta, and my video card has less than 15 gigabytes of video memory, I will settle on the first row. You, in turn, are free to experiment with any of the other versions.
The next article in this series will use the 13B version of the classic llama, so don't relax just yet. There will also be thorough tests in interesting contexts.

Running the model

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "codellama/CodeLlama-7b-Python-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    use_safetensors=True,
    low_cpu_mem_usage=True,
    device_map="auto",
)

13 of the available 15 gigabytes of VRAM are now filled – sort of fits? Not really…
We need decent generation speed, and speed is born out of free space for computation. Nobody stops you from leaving everything as it is, but I will add load_in_4bit and load the model at reduced precision – good enough for a PoC.

from transformers import BitsAndBytesConfig
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "codellama/CodeLlama-7b-Python-hf"
quantization_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_compute_dtype=torch.float16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    use_safetensors=True,
    low_cpu_mem_usage=True,
    device_map="auto",
)
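
If you want to see how much VRAM the load actually took on your setup, a couple of torch calls are enough (the exact numbers will differ between runs and drivers):

import torch

print(f"{torch.cuda.memory_allocated() / 1024**3:.1f} GB allocated by tensors")
print(f"{torch.cuda.memory_reserved() / 1024**3:.1f} GB reserved by the CUDA allocator")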

Fortunately, the creators of Llama took care of us, and this time dealing with quantized models came down to a couple of lines instead of bone-cutting torment.

For ease of interaction, let's abstract away the tokenizer and throw together something pipeline-like in a couple of lines (yes, I'm a casual)

def txt2txt(text,**kwargs):
  inputs = tokenizer(text, return_tensors="pt").to("cuda")  # tokenize the prompt and move it to the GPU
  output = model.generate(
      inputs["input_ids"],
      **kwargs                                              # max_new_tokens, temperature, etc.
  )
  output = output[0].to("cpu")                              # take the first (and only) sequence back to the CPU
  return tokenizer.decode(output)
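
A quick smoke test of the wrapper (the prompt here is just an arbitrary example of mine):

print(txt2txt("def fibonacci(n):", max_new_tokens=48, do_sample=True, top_p=0.9, temperature=0.1))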

At this point I suddenly wanted to split the functionality into a server (the host with the computing power) and a client (or clients) as separate programs. While the coffee maker is trying to push steam through coffee instead of water (because I forgot to turn off cappuccino mode), let's write a simple interface for talking to the server on FastAPI.

from fastapi import FastAPI

app = FastAPI()
@app.get("/")
def read_root():
    return {"I am": "alive"}

from pydantic import BaseModel

class Item(BaseModel):
    prompt: str
    add: int
    temperature: float

@app.post("/")
def root(data: Item):
    return {"return": txt2txt(data.prompt,max_new_tokens=data.add,do_sample=True,top_p=0.9,temperature=data.temperature)}

The server accepts POST requests with a prompt, the number of tokens to generate, and the temperature hyperparameter.

The core of the server is done; all that is left is to brew our espresso. Throw the code above into a file called server.py and run it with this script:

#with open("server.py","w") as s: s.write(text)
from pycloudflared import try_cloudflare
try_cloudflare(port=8000)
try_cloudflare.terminate(port=8000)
try_cloudflare(port=8000)
!uvicorn server:app --reload
Try cloud…what?

PyCloudFlared is a Python wrapper around a utility from Cloudflare that lets you create a temporary HTTP tunnel, like ngrok, only without registration or an API key.
Each time the tunnel is brought up, a new URL of this form is generated:
https://XXXX.trycloudflare.com/
However, if you reuse a port on which a tunnel has already been raised, the wrapper returns the already existing URL, which is a real flaw: the tunnel may be down by then, yet the old URL is still returned. To be safe, I immediately restart the tunnel and get a new, definitely up-to-date URL.

All of the previous code is vanilla Python, but I planned to run it in Google Colab all along. This part of our beast (the one I called the server) can work either remotely (on another machine with a better GPU) or locally (on the same machine as the client). As you might guess, Colab's Tesla T4 is better than my RX 580, even without counting the bloody sacrificial rituals required to run ML projects on a GPU from the red vendor.
So I settled on Colab.

Install requirements post factum

!pip install transformers==4.32.1
!pip install fastapi==0.103.0
!pip install bitsandbytes==0.41.1
!pip install accelerate==0.22.0
!pip install "uvicorn[standard]"
!pip install pycloudflared==0.2.0
All server code in ipython
!pip install transformers==4.32.1
!pip install fastapi==0.103.0
!pip install bitsandbytes==0.41.1
!pip install accelerate==0.22.0
!pip install "uvicorn[standard]"
!pip install pycloudflared==0.2.0

api="""
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

model_id = "codellama/CodeLlama-7b-Python-hf"
quantization_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_compute_dtype=torch.float16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    use_safetensors=True,
    low_cpu_mem_usage=True,
    device_map="auto",
)


def txt2txt(text,**kwargs):
  inputs = tokenizer(text, return_tensors="pt").to("cuda")
  output = model.generate(
      inputs["input_ids"],
      **kwargs
  )
  output = output[0].to("cpu")
  return tokenizer.decode(output)


from fastapi import FastAPI

app = FastAPI()
@app.get("/")
def read_root():
    return {"I am": "alive"}

from pydantic import BaseModel

class Item(BaseModel):
    prompt: str
    add: int
    temperature: float

@app.post("/")
def root(data: Item):
    return {"return": txt2txt(data.prompt,max_new_tokens=data.add,do_sample=True,top_p=0.9,temperature=data.temperature)}
"""
with open("server.py","w") as s: s.write(api)
from pycloudflared import try_cloudflare
try_cloudflare(port=8000)
try_cloudflare.terminate(port=8000)
try_cloudflare(port=8000)
!uvicorn server:app --reload

Testing

Follow the latest tunnel link and get the JSON response {"I am": "alive"}.

Add /docs to the URL to land in the automatically generated web UI.

We can play around and test our backend without a client.

Let's throw a simple prompt at it:
prompt = 12345

The answer:
return = 123456789012345

Seems to be working…
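
The same check can also be reproduced from Python without the web UI; a minimal sketch, assuming the server is reachable at the address below (substitute your tunnel URL, or 127.0.0.1:8000 if it runs locally):

import requests

resp = requests.post("http://127.0.0.1:8000/",
                     json={"prompt": "12345", "add": 10, "temperature": 0.1})
print(resp.json()["return"])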

Client

Here everything is as simple as it gets. We steam the milk into foam and, in parallel, receive responses from the remote server and use them as if GPT were right at our fingertips.

import urllib.parse
import requests
import json
import pyautogui as gui
import clipboard
import time

server="https://donations-tunes-institutions-fed.trycloudflare.com/" # Актуальная ссылка

# Или 127.0.0.1 если сервер запущен на том же компьютере что и клиент

def generate(text):
    url = server
    myobj = {'prompt': text,"add":10,"temperature":0.1}
    x = requests.post(url, json = myobj)
    print(x.text)
    return json.loads(x.text)["return"][4:]

Context parser

We use pyautogui to emulate key presses. First, take a look at the piece of code, and after that you will get my excuses, I mean explanations, for it.

def scan():
    back = clipboard.paste()   # remember whatever was in the user's clipboard
    gui.keyDown('shift')
    gui.press('pgup')          # Shift+PageUp: select all the preceding context
    gui.keyUp('shift')
    gui.keyDown('ctrl')
    gui.press('x')             # Ctrl+X: cut the selection into the clipboard
    gui.keyUp('ctrl')
    gui.keyDown('ctrl')
    gui.press('v')             # Ctrl+V: paste it right back, so the text and caret are untouched
    gui.keyUp('ctrl')
    pasted = clipboard.paste() # the captured context becomes our prompt
    clipboard.copy(back)       # restore the user's clipboard
    return pasted

Yes, it is terrible boilerplate. The actual plan looks like this:

  1. Shift+PageUp (selects all of the preceding context)

  2. Ctrl+X, Ctrl+V (cut this context and paste it right back, so the caret returns to its place and the selection goes away)

  3. Take the resulting prompt from the clipboard and restore the old clipboard contents to what they were before all our manipulations

That is, of course, only if you are a latte fan. You could make a raf or a cappuccino instead, jumping back not with PageUp but with Home, or by spamming the arrow key. That would add compatibility, because far more interfaces behave sensibly on the arrow keys than on PageUp.
Bottom line: to change the way context is captured, it is enough to swap the key-press script inside the scan function for any other one; one such variant is sketched right below.
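
For instance, a drop-in replacement for scan() that grabs only the current line via Shift+Home (same pyautogui and clipboard imports as in the client; an illustrative variant, untested in the wild):

def scan_line():
    back = clipboard.paste()   # remember the user's clipboard
    gui.keyDown('shift')
    gui.press('home')          # Shift+Home: select from the caret to the start of the line
    gui.keyUp('shift')
    gui.hotkey('ctrl', 'x')    # cut the selection...
    gui.hotkey('ctrl', 'v')    # ...and paste it straight back, so nothing changes for the user
    prompt = clipboard.paste()
    clipboard.copy(back)       # restore the clipboard
    return prompt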

Now we trigger all of this only when a certain key is pressed (in my case, just F7):

import keyboard
while True:
    time.sleep(0.01)
    try:
        if keyboard.is_pressed("f7"):
            pr=True
        else:
            pr=False
    except:
        pr=False
    if pr:
        old=scan().replace("\r","")
        print([old])
        new=generate(old)
        back = clipboard.paste()
        clipboard.copy(new[len(old):])
        gui.keyDown('ctrl')
        gui.press('v')
        gui.keyUp('ctrl')
        clipboard.copy(back)
        print(new[len(old):])
All client code in python

import urllib.parse
import requests
import json
import pyautogui as gui
import clipboard
import time
server="https://ссылку сюда вставить надо, да/" # Актуальная ссылка

# Или 127.0.0.1 если сервер запущен на том же компьютере что и клиент

def generate(text):
    url = server
    myobj = {'prompt': text,"add":10,"temperature":0.1}
    x = requests.post(url, json = myobj)
    print(x.text)
    return json.loads(x.text)["return"][4:]

def scan():
    back = clipboard.paste()
    gui.keyDown('shift')
    gui.press('pgup')
    gui.keyUp('shift')
    gui.keyDown('ctrl')
    gui.press('x')
    gui.keyUp('ctrl')
    gui.keyDown('ctrl')
    gui.press('v')
    gui.keyUp('ctrl')
    pasted = clipboard.paste()
    clipboard.copy(back)
    return pasted


import keyboard  # using module keyboard
while True:  # making a loop
    time.sleep(0.01)
    try:  # try/except so that an unexpected error from the keyboard check does not crash the loop
        if keyboard.is_pressed("f7"):
            pr=True
        else:
            pr=False
    except:
        pr=False
    if pr:  # the F7 key was pressed
        old=scan().replace("\r","")
        print([old])
        new=generate(old)
        back = clipboard.paste()
        clipboard.copy(new[len(old):])
        gui.keyDown('ctrl')
        gui.press('v')
        gui.keyUp('ctrl')
        clipboard.copy(back)
        print(new[len(old):])

Field tests

One
Two
Three

I added generation in a loop later, so the screenshot is a bit outdated.


Conclusion

I am by no means going to squeeze out analytics here and evaluate the quality of CodeLlama itself, as was postulated back in the second sentence of the first paragraph of this text.
And there is a good explanation for that. I have no doubts about the quality of CodeLlama itself, but I strongly doubt I would see its full potential by running a stripped-down ultra-lite version)
If you need genuinely effective assistance, you can install TabNine in VS Code. I will not even leave a link: paid subscriptions and advertising them are corporate evil.
You can always run this small piece of code yourself and test it on your specific problem. And of course, I would be interested in feedback in the comments.

I hope at least someone has made it this far. If that was you, then thank you, dear reader, for working through this small tutorial.
