Description of the internal git protocol
One of the most important developer tools, regardless of language (and religious beliefs), is a version control system (VCS), and a distributed system like Git has become almost an industry standard. In our daily work, we (developers, DevOps engineers, technical writers and everyone else involved) use it to bring goodness and light to people, that is, to unite the efforts of teams working on our projects. Everyone has long since learned the basic commands by heart (if you haven’t, learn them quickly; there’s a great book) and turned into routine something that not long ago (any old-timers here?) seemed ingenious, complex, and to some even magical. Modern IDEs have simplified our lives further by hiding the command line and Git commands behind mouse clicks.
But wait: weren’t you curious as a child about how this or that toy works inside, or the refrigerator, or the engine in your dad’s Lada (are the old-timers still here?)? That is how I became interested in looking under the hood of Git. As with any complex mechanism, the depth of this “looking under the hood” can differ: for some it is enough to see the engine cover and the hole where the antifreeze goes, while for others it means getting down to the grade of steel used for this or that Lada part.
So let's state the depth of our dive right away. In this article we will look in detail at what happens when we run the usual “git clone” or “git push”: how the process works and what capabilities it has. Entities and processes that remain outside the scope of this story can be explored independently (I provided the link above), because a topic as broad as Git, and especially its engine compartment, cannot be covered in one go. If you are interested, read on.
We will look under the hood according to the following plan:
What types of protocols Git supports.
How communication happens over the SSH protocol.
What capabilities the Git protocol offers.
Types of GIT protocols
Git can work with four network protocols for data transfer: local, HTTP, Secure Shell (SSH) and Git. In this part we will discuss only the SSH and HTTP protocols; the rest are “niche” and are not widely used, but we’ll say a few words for a general idea.
Local protocol
The simplest one is the local protocol, in which the remote repository is just another directory on disk. That is all we will say about it, because in practice this protocol is rather archaic.
HTTP protocols
Smart HTTP
Smart HTTP runs on top of the standard HTTP/S ports and can use the various HTTP authentication mechanisms. This is often easier for the user than something like SSH, since you can use login/password authentication instead of setting up SSH keys. It has probably become the most popular way to use Git, as it can be used both for anonymous access (if allowed, of course) and for pushing changes with authentication and encryption, like the SSH protocol. Instead of using different URLs for these purposes, you can use one URL for everything. If you try to push changes and the repository requires authentication (it usually does), the server may ask for a username and password. The same applies to read access.
“Dumb” HTTP
If the server does not respond to a smart Git request over HTTP, the Git client will try to fall back to the simpler dumb HTTP protocol. The dumb protocol expects a bare Git repository to be served as a collection of plain files by the web server. The beauty of the dumb HTTP protocol is how easy it is to set up. Essentially, all you need to do is place the bare repository under the document root of any web server that can serve static content over HTTP/S. After that, anyone who has access to the web server hosting the repository can clone it. In essence, this protocol is read-only, but in theory, for this one task, it can be faster for heavily loaded Git hosting.
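The difference between the two HTTP flavours is visible already in the first request the client makes. The sketch below is only an illustration: the function name `info_refs_url` and the example URL are made up, while the `info/refs` path and the `service` query parameter are what Git clients request when discovering refs over HTTP.

```python
from typing import Optional
from urllib.parse import urlencode


def info_refs_url(repo_url: str, service: Optional[str] = None) -> str:
    """Build the URL of the first request a client makes over HTTP.

    With a service name this is a smart request; without one the client
    simply asks for the static info/refs file (the dumb protocol).
    """
    base = repo_url.rstrip("/") + "/info/refs"
    if service is None:
        return base
    return base + "?" + urlencode({"service": service})


# Hypothetical repository URL, used only for illustration.
smart = info_refs_url("https://example.com/project.git", "git-upload-pack")
dumb = info_refs_url("https://example.com/project.git")
print(smart)
print(dumb)
```

For the dumb variant to work, the static `info/refs` file has to exist on the server; it is normally generated by running `git update-server-info` in the bare repository.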
Git protocol
The next protocol is the Git protocol. Git comes with a special daemon that listens on a dedicated port (9418) and provides a service similar to the SSH protocol, but with absolutely no authentication. To serve a repository over the Git protocol, you must create a file named git-daemon-export-ok in that repository, otherwise the daemon will not work with it. Accordingly, a repository is either available for cloning by everyone or not available at all. For the same reason, changes typically cannot be pushed over this protocol. You can enable write access, but because there is no authentication, anyone who knows the URL of your project could then change it. Overall, this is a rarely used feature.
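As a small illustration of the marker-file rule, here is a sketch that checks whether a repository directory would be served by the daemon. The marker is called `git-daemon-export-ok` in Git's documentation; the helper name `is_exported` and the temporary directory are assumptions made for the example.

```python
import tempfile
from pathlib import Path


def is_exported(repo_dir: Path) -> bool:
    """Return True if the repository carries the marker file the daemon looks for."""
    return (repo_dir / "git-daemon-export-ok").is_file()


# Demonstration on a throwaway directory standing in for a bare repository.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    before = is_exported(repo)                  # no marker yet
    (repo / "git-daemon-export-ok").touch()     # what an administrator would create
    after = is_exported(repo)

print(before, after)
```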
SSH protocol
A commonly used transport protocol for self-hosted Git is SSH. The reason is that many servers already have SSH access, and if they don't, it is very easy to set up. Additionally, SSH is an authenticated protocol, and thanks to its popularity it is usually easy to configure and use. Data is transmitted encrypted over authenticated channels. Finally, like HTTP/S, Git, and the local protocol, SSH is efficient: data is compressed as much as possible before transmission.
The disadvantage of SSH is that it cannot provide anonymous access to the repository (if that can be considered a disadvantage). Clients must be able to access the machine via SSH even to work in read-only mode, which makes SSH unsuitable for open source projects. In a corporate setting, however, this is hardly a drawback.
Next, let's look at how Git processes communicate over the SSH protocol; the exchange is essentially the same as in the case of smart HTTP.
Communication via SSH protocol
For communication over SSH, the “smart” protocol is used almost always (a “notarized” 99.9% of the time). It is “smart” because it requires a special process on the server that knows the structure of a Git repository, can figure out what data the client is missing, and generates a separate pack file with just those changes (thereby optimizing data transfer). The smart protocol is implemented by two pairs of processes: one pair for sending data to the server and one pair for downloading data from it.
Uploading data to the server
Two processes are used to upload data to a remote server: send-pack and receive-pack (these are actually subcommands of the git utility, but if you rummage around on a system where Git is installed, you will find a separate executable file, git-receive-pack). The send-pack process runs on the client and connects to receive-pack on the server. There is no magic here, only the ability of SSH to run a command on a remote server. For example, when you push local changes to a remote repository, your “git push origin” command makes the following call under the hood: “$ ssh -x git@server "git-receive-pack 'simplegit-progit.git'"”. Thus the git-receive-pack executable is launched on the server side, and the two processes begin communicating according to a specific protocol, which we will talk about next.
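To make the "no magic" point concrete, here is a sketch of how such a command line could be assembled. The helper names `single_quote` and `ssh_command` are hypothetical; the host, service and repository path are the ones from the example above. The real client builds this invocation internally, so treat this as an illustration rather than Git's actual code.

```python
from typing import List


def single_quote(text: str) -> str:
    """Wrap text in single quotes for a POSIX shell, escaping embedded quotes."""
    return "'" + text.replace("'", "'\\''") + "'"


def ssh_command(host: str, service: str, repo_path: str) -> List[str]:
    """Return the argv that runs a Git service on the remote host over SSH."""
    remote_command = f"{service} {single_quote(repo_path)}"
    return ["ssh", "-x", host, remote_command]


push_argv = ssh_command("git@server", "git-receive-pack", "simplegit-progit.git")
print(push_argv)
```

Passing such a list to `subprocess.Popen` with pipes for stdin and stdout would give a channel over which the packets described below travel.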
Communication between the Git client and the Git server is built on a rather specific protocol. Data is transmitted in packets. Each packet starts with four hexadecimal digits (4 bytes) that specify its size, including those 4 bytes. A packet typically contains one line of data followed by a terminating line break. Let's look at an example (all names are fictitious and any matches are coincidental):
[Listing: the ref advertisement that receive-pack sends to the client]
The first packet starts with 00a5, which is 165 in decimal and means the packet is 165 bytes long. The next value in the example, “ca82a6dff817ec66f4437202690a93763949”, is the SHA-1 of a commit object, followed by the name of the ref that points to it. After that comes the list of the server's capabilities, which we will discuss a little later (by the way, the protocol convention is that capabilities are transmitted only in the first packet). The next packet is 0000, which indicates that the server has finished transmitting the list of refs.
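The length-prefixed framing described above can be sketched in a few lines. The function names `pkt_line` and `read_pkt_lines` are illustrative and not part of Git; the format itself (four hex digits counting themselves, `0000` as the terminating packet) is as described in the text.

```python
from typing import Iterator, Optional

FLUSH = b"0000"  # special packet that terminates a list


def pkt_line(payload: bytes) -> bytes:
    """Prefix payload with its total length (payload + 4) as four hex digits."""
    return b"%04x" % (len(payload) + 4) + payload


def read_pkt_lines(stream: bytes) -> Iterator[Optional[bytes]]:
    """Yield each payload in turn; yield None for a flush packet."""
    pos = 0
    while pos < len(stream):
        size = int(stream[pos:pos + 4], 16)
        if size == 0:
            yield None
            pos += 4
        else:
            yield stream[pos + 4:pos + size]
            pos += size


encoded = pkt_line(b"unpack ok\n")
decoded = list(read_pkt_lines(pkt_line(b"hello\n") + FLUSH))
print(encoded)           # b'000eunpack ok\n'
print(int("00a5", 16))   # 165, the size of the first packet in the example
```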
Now that send-pack knows the state of the server, it determines which commits exist locally but are missing on the server. The send-pack process passes this information to the receive-pack process for each ref that is about to be updated. For example, if we update the master branch and add an experiment branch, the send-pack output will look like this:
[Listing: the update commands that send-pack sends for master and experiment]
For each ref that is updated, Git sends a line containing its own length, the old hash, the new hash, and the name of the ref. The first line also carries the client's capabilities; we will describe capabilities in more detail below. A hash consisting of zeros means that the ref did not exist before; after all, we are adding the new experiment branch. When deleting a branch it would be the other way around: the zeros would be on the right.
The client then sends a pack file containing the objects that are not yet on the server. Finally, the server reports the status of the operation, success or error, for example: 000eunpack ok
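The update lines can be sketched as follows. This is a simplified illustration: the function name `update_line` is hypothetical, the hashes are placeholder values rather than real commits, and the capabilities are attached to the first line after a NUL byte, as the pack protocol documentation describes.

```python
from typing import Optional, Sequence

ZERO_ID = "0" * 40  # "no such ref": used when creating or deleting


def update_line(old: str, new: str, ref: str,
                capabilities: Optional[Sequence[str]] = None) -> bytes:
    """Build one length-prefixed ref update line: old hash, new hash, ref name."""
    line = f"{old} {new} {ref}"
    if capabilities:
        line += "\0" + " ".join(capabilities)  # only on the first line
    payload = line.encode() + b"\n"
    return b"%04x" % (len(payload) + 4) + payload


OLD = "a" * 40  # placeholder hash of the current tip of master
NEW = "b" * 40  # placeholder hash of the new commit

update = update_line(OLD, NEW, "refs/heads/master", ["report-status"])
create = update_line(ZERO_ID, NEW, "refs/heads/experiment")
delete = update_line(NEW, ZERO_ID, "refs/heads/experiment")
print(update[:4], create[:4], delete[:4])
```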
Downloading data
To retrieve data from a remote repository, a second pair of processes is used: fetch-pack and upload-pack (similarly, you can find a separate executable file, git-upload-pack). Just as with pushing, your “git clone” command will run the following under the hood: “$ ssh -x git@server "git-upload-pack 'simplegit-progit.git'"”, and communication between the two processes will begin. As soon as the connection is established, the server sends the following to the client:
[Listing: the ref advertisement that upload-pack sends to the client]
This is very similar to the response from receive-pack; only the capabilities differ. In addition, upload-pack reports what HEAD points to (symref=HEAD:refs/heads/master), so that the client knows which branch to check out when cloning.
At this stage the fetch-pack process looks at the objects it already has. For each object it is missing it sends the word “want” followed by the SHA-1 of the required object, and for each object it already has it sends the word “have” followed by that object's SHA-1. At the end of the list it writes “done”, which tells the upload-pack process to start sending a pack file with the necessary data:
[Listing: the want, have and done lines sent by fetch-pack]
That is, using just two commands, “want” and “have”, the client and the server agree on what data the client needs and what it already has. The server can therefore prepare and send only the missing data rather than everything, which greatly reduces the amount of data exchanged.
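A simplified sketch of the client side of this exchange is shown below. The function names and the placeholder hashes are assumptions for the example; a real client sends its “have” lines in batches and reacts to the server's acknowledgements in between, which this sketch leaves out.

```python
from typing import Sequence

FLUSH = b"0000"


def pkt(text: str) -> bytes:
    """Encode one text line as a length-prefixed packet."""
    payload = text.encode() + b"\n"
    return b"%04x" % (len(payload) + 4) + payload


def fetch_request(wants: Sequence[str], haves: Sequence[str],
                  capabilities: Sequence[str] = ()) -> bytes:
    """Build the want/have/done request in one go (no incremental negotiation)."""
    out = []
    for index, object_id in enumerate(wants):
        line = f"want {object_id}"
        if index == 0 and capabilities:
            line += " " + " ".join(capabilities)  # capabilities ride on the first want
        out.append(pkt(line))
    out.append(FLUSH)                             # end of the want list
    out.extend(pkt(f"have {object_id}") for object_id in haves)
    out.append(pkt("done"))
    return b"".join(out)


WANTED = "a" * 40  # placeholder hash of an object the client lacks
OWNED = "b" * 40   # placeholder hash of an object the client already has
request = fetch_request([WANTED], [OWNED])
print(request)
```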
Protocol capabilities
We mentioned above that the first message, both when uploading and when downloading data, carries the capabilities that the server supports and that the client can use to get the desired result. Let's take a closer look at them.
Let us recall an example of the first message from the server when downloading data:
[Listing: the first line of the upload-pack ref advertisement, with its capability list]
In this message, the server lists the capabilities it supports. Having read it, the client can respond with the list of those server-declared capabilities that it wants to use. Let's go through the capabilities shown in the example:
multi_ack – this capability optimizes the search for a base commit during the exchange between client and server (a base commit is a commit that both the client and the server have and from which the other commit or commits can be reached along the object tree). For example, suppose we have the following tree of Git objects (green shows the commits that are on the server, blue shows the commits that the client has):
Let's say the client wants to get commits X and Y (and, accordingly, the missing commits preceding them). During the exchange the client “says” “have F, S” (in fact these are two messages, combined here for brevity), but the server knows nothing about these objects, since only the client has them. Then the client “says” “have E, R”, which the server does not know about either. The process continues until the client “says” “have D”. The server has commit D and responds “ACK D continue”, informing the client that a “base” commit has been found: both sides have D and the originally requested commit Y is reachable from it, so the client does not need to announce the remaining commits C, B, A on that line of history. But since the client also wants commit X, the search for a “base” commit continues along the S-R-Q branch. If the server did not support multi_ack, the client would also have sent the C-B-A chain before a common “base” commit for X and Y was agreed on.
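The effect can be shown with a toy model. This is not the real negotiation algorithm: the function `haves_sent`, the way the histories are listed, and the assumption that the client alternates between two lines of history are all simplifications made for illustration. Commit names follow the example above, with A assumed to be the shared root.

```python
from typing import List, Sequence, Set


def haves_sent(lines_of_history: Sequence[Sequence[str]],
               server_has: Set[str], multi_ack: bool) -> List[str]:
    """Return the commits a client announces with "have", in order.

    Each line of history is listed newest first. With multi_ack the client
    stops walking a line as soon as the server acknowledges a commit on it.
    """
    walkers = [iter(line) for line in lines_of_history]
    active = list(range(len(walkers)))
    sent: List[str] = []
    while active:
        for index in list(active):
            commit = next(walkers[index], None)
            if commit is None:              # this line of history is exhausted
                active.remove(index)
                continue
            if commit not in sent:
                sent.append(commit)
            if multi_ack and commit in server_has:
                active.remove(index)        # "ACK <commit> continue": stop here
    return sent


client_lines = [["F", "E", "D", "C", "B", "A"], ["S", "R", "Q", "A"]]
server_commits = {"A", "B", "C", "D"}

with_ack = haves_sent(client_lines, server_commits, multi_ack=True)
without_ack = haves_sent(client_lines, server_commits, multi_ack=False)
print(with_ack)
print(without_ack)
```

In this model the client skips C and B entirely when the early acknowledgement is available, which is exactly the saving the capability is meant to provide.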
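multi_ack_detailed – an extension of multi_ack. According to the Git protocol documentation, it lets the server answer in more detail: instead of “ACK ... continue” it replies with “ACK ... common” for a commit both sides have and “ACK ... ready” once it has enough information to build the pack file, so the client gets a better picture of the server's state.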
thin-pack – the ability of the server and the client to send and process “thin” packs. Git stores data in a pack file (in some cases several files, which can always be combined into one) that contains all the information about Git objects. To reduce the amount of data transferred between the client and the server, objects in a pack can be stored as deltas against base objects that are not included in the pack itself but are known to exist on the receiving side. Such a pack is called “thin”. The sender must be able to produce thin packs, and the receiver must be ready to complete them by adding the missing base objects. This capability contributes a great deal to the speed and efficiency of Git transfers.
side-band, side-band-64k – this capability lets the server send, and the client receive, several data streams multiplexed over one connection: the pack file itself, progress information and error messages. As we said above, data is transmitted in packets; with side-band a packet can be up to 1000 bytes long, and with side-band-64k up to 65520 bytes (nowadays, of course, only side-band-64k is used in practice). As we remember, each packet begins with four hexadecimal digits that give its size (including those 4 bytes); they are followed by 1 byte with the stream code, and the rest of the packet is data. The stream code indicates the type of data being transferred:
Code | Data type |
---|---|
1 | pack file data |
2 | Operation progress data |
3 | Error data |
Thanks to this capability, we see the progress of our clone or push command, intermediate messages, as well as errors and warnings.
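Demultiplexing such a stream could look roughly like this. The function names and the sample payloads are made up for the example; the layout (four hex digits, one stream-code byte, then data) and the three codes come from the description and table above.

```python
from typing import Dict


def sideband_packet(code: int, data: bytes) -> bytes:
    """Frame data for one stream: length prefix, stream code byte, payload."""
    return b"%04x" % (len(data) + 5) + bytes([code]) + data


def demux(stream: bytes) -> Dict[int, bytes]:
    """Split a side-band stream into pack data (1), progress (2) and errors (3)."""
    channels = {1: bytearray(), 2: bytearray(), 3: bytearray()}
    pos = 0
    while pos < len(stream):
        size = int(stream[pos:pos + 4], 16)
        if size == 0:                       # flush packet: end of the stream
            break
        code = stream[pos + 4]              # indexing bytes gives an int
        channels[code] += stream[pos + 5:pos + size]
        pos += size
    return {code: bytes(data) for code, data in channels.items()}


# Sample stream with invented contents.
sample = (sideband_packet(2, b"Counting objects: 3\n")
          + sideband_packet(1, b"PACK")
          + b"0000")
streams = demux(sample)
print(streams)
```

A client would typically write stream 1 into the pack file being received and print streams 2 and 3 to the terminal.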
ofs-delta – this capability means that the server and the client can use the pack v2 format in which a delta refers to its base object by its position (offset) in the pack file rather than by object identifier, which speeds up processing.
shallow – with this capability the client can use the additional commands “deepen”, “shallow” and “unshallow” to receive not the entire repository but only a limited slice of it, down to a certain depth in commits (limits by date are negotiated through a related capability, deepen-since). This is what the client relies on when we pass, for example, the “--depth 1” flag to “git clone”, which gives us a repository containing only the state of the last commit of the desired branch. The full history is not downloaded, which is very useful in CI/CD pipelines, where as a rule only the latest state of a specific branch is needed: you can greatly reduce the amount of downloaded data and speed up, for example, building the application and rolling it out to a test environment.
no-progress – the ability to disable, at the client's request, the data stream with code 2 (we talked about it above, under side-band and side-band-64k). When we run “git clone” with the “-q” or “--quiet” flag, we decline progress messages, and the client tells the server so.
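For CI scripts, the two flags mentioned above are usually combined. The sketch below only assembles the command line; the helper name `clone_command` and the repository URL are assumptions, while `--depth`, `--branch` and `--quiet` are standard `git clone` options.

```python
from typing import List, Optional


def clone_command(url: str, depth: Optional[int] = None,
                  branch: Optional[str] = None, quiet: bool = False) -> List[str]:
    """Assemble a git clone invocation for a shallow and/or quiet clone."""
    cmd = ["git", "clone"]
    if depth is not None:
        cmd += ["--depth", str(depth)]    # triggers the shallow negotiation
    if branch:
        cmd += ["--branch", branch]
    if quiet:
        cmd.append("--quiet")             # client asks the server for no progress output
    cmd.append(url)
    return cmd


ci_clone = clone_command("git@server:simplegit-progit.git", depth=1, quiet=True)
print(ci_clone)
```

The resulting list can be handed to `subprocess.run(ci_clone, check=True)` in a build script.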
include-tag – the ability of the server to automatically include annotated tag objects in the transferred pack file when the objects they point to are being sent. As a rule, this capability is always used, since tags are common in projects.
symref=HEAD:refs/heads/master – with this capability the server tells the client which ref the special HEAD pointer refers to, and the client uses this information to decide which default branch to check out during “git clone”.
agent – with this capability the server reports its Git version, and the client can send its own in response. The exchange does not affect protocol behavior; it is purely informational and ends up in logs, where it may help with debugging.
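Putting the list together, here is a sketch of how a client might parse the first line of the server's advertisement into a hash, a ref name and a dictionary of capabilities. The function name, the placeholder hash and the agent value are invented for the example; the NUL byte that separates the ref from the capability list is part of the protocol.

```python
from typing import Dict, Tuple, Union

Capabilities = Dict[str, Union[str, bool]]


def parse_first_ref_line(payload: bytes) -> Tuple[str, str, Capabilities]:
    """Split "<hash> <ref>\\0<cap> <cap>=<value> ..." into its parts."""
    ref_part, _, caps_part = payload.rstrip(b"\n").partition(b"\0")
    object_id, _, ref_name = ref_part.partition(b" ")
    capabilities: Capabilities = {}
    for item in caps_part.split():
        key, _, value = item.partition(b"=")
        # Plain flags map to True; key=value capabilities keep their value.
        capabilities[key.decode()] = value.decode() if value else True
    return object_id.decode(), ref_name.decode(), capabilities


# Invented advertisement line: placeholder hash and a made-up agent string.
line = (b"a" * 40 + b" HEAD\0multi_ack thin-pack side-band-64k ofs-delta shallow "
        b"no-progress include-tag symref=HEAD:refs/heads/master agent=git/2.0.0\n")
object_id, ref_name, caps = parse_first_ref_line(line)
print(ref_name, caps["symref"])
```

A capability that may legitimately appear more than once (such as symref) would need a list instead of a single value; the sketch keeps only the last occurrence.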
Conclusion
In this modest article we took a look under the hood of Git, at the communication over the Git protocol between the client and server processes that happens when we retrieve data and when we push our changes to a remote repository. We have not examined every aspect and nuance of the large inner world of Git, as I warned at the very beginning. But, as they say, big things begin with small ones, and the road is mastered by the one who walks it. I strongly advise you to read a book about Git if you use it, and I’m sure you do, otherwise you wouldn’t have read this far. I won't be too lazy to paste the link to the book again.
Thank you for your attention!