Behind the scenes of Google Colab


Let’s take a look inside Google Colab and find out how you can customize Colab to fit your work needs, rather than adjusting to the limitations of the tool.


Google Colaboratory, better known as “Colab”, is a free hosted environment for Jupyter notebooks. In addition to running Python and R notebooks, Colab provides free access to a limited number of GPUs and TPUs.

Colab quickly became the de facto environment for working with Jupyter notebooks, but using it for anything other than notebooks is incredibly difficult. This is especially true for ML engineers who want to build models and take them beyond the notebook stage. Notebooks are ideal for research, but they do not play well with the broader MLOps toolkit, which codifies training into a formal pipeline.

Behind the scenes

Colab’s secret sauce is its backend: servers in Google’s infrastructure run your code at the click of a button. So our first step is to analyze this backend API. The easiest way is to watch the calls Colab makes to it during normal operation:

  • We launch Chrome DevTools, open the Network tab and run a code cell.

  • DevTools starts recording every Colab request – and almost immediately we find something interesting:

It looks like the URL /tun/m/<id>/socket.io is a proxy to the Jupyter socket on the remote machine. If we open the Files panel (by default it shows the /content directory) in the left pane of the Colab UI, we see another interesting request:

This time, the response body is JSON listing the files on the remote host. It looks like the URL /tun/m/<id>/api/contents/ points to a service that provides file metadata:

Double-clicking a file in the Files panel loads and displays it in Colab. If we click on /content/sample_data/README.md, we notice a request to /tun/m/<id>/files/ that returns the contents of the file:

Evidently, https://colab.research.google.com/tun/m/<id>/ is a reverse proxy to the server running the Colab instance, exposing the /socket.io, /files and /api/contents endpoints.

Let’s see which of these services are running inside the Colab container itself. To do this, install lsof inside the container and run lsof -iTCP -sTCP:LISTEN to list all processes listening on a TCP port:
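In a notebook cell this boils down to something like the following (the apt-get step is only needed if lsof isn’t already present in the image):

!apt-get install -y -qq lsof
# list every process listening on a TCP port
!lsof -iTCP -sTCP:LISTEN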

Yes! The colab-fileshim, node and jupyter-notebook processes all look like promising surfaces to explore. We’ve already dealt with the Files panel, so let’s take a look at colab-fileshim first. Its PID is 28, so we can check the /proc filesystem to see its full command line:
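Since /proc/<pid>/cmdline stores the arguments NUL-separated, a cell like this (assuming PID 28, as above) makes it readable:

# print the full command line of PID 28, replacing NUL separators with spaces
!cat /proc/28/cmdline | tr '\0' ' '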

The next step is to explore /usr/local/bin/colab-fileshim.py. Ironically, we can do this by opening it in the Files panel itself. For the most part it’s an uninteresting file server; the notable part is that it responds to requests on localhost:3453/files with actual file contents, and on localhost:3453/api/contents with JSON metadata. So Colab forwards these requests from the tunnel URL to port 3453 of the instance itself.
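We can sanity-check this from the notebook by hitting the port directly (the path mirrors the tunnel request we saw earlier):

# fetch README.md straight from colab-fileshim, bypassing the tunnel
!curl -s localhost:3453/files/content/sample_data/README.md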

In Chrome DevTools’ Network tab, we can right-click a request, copy the corresponding cURL command and replay it. Here is the cURL command for viewing README.md:

$ curl 'https://colab.research.google.com/tun/m/m-s-3oy94z70yrj59/files/content/sample_data/README.md?authuser=0' \
  -H 'authority: colab.research.google.com' \
  -H 'x-colab-tunnel: Google' \
  -H 'accept: */*' \
  -H 'dnt: 1' \
  -H 'accept-language: en-US,en;q=0.9' \
  -H 'sec-fetch-site: same-origin' \
  -H 'sec-fetch-mode: cors' \
  -H 'sec-fetch-dest: empty' \
  -H 'referer: https://colab.research.google.com/' \
  -H 'cookie: <<REDACTED>>' \
  -H 'range: bytes=0-930' \
  --compressed

Executing this command in a terminal on the local machine returns the contents of the README, and after some trial and error we find that most of the headers can be dropped, leaving only these:

$ curl 'https://colab.research.google.com/tun/m/m-s-3oy94z70yrj59/files/content/sample_data/README.md?authuser=0' \
  -H 'x-colab-tunnel: Google' \
  -H 'cookie: <<REDACTED>>'

The x-colab-tunnel header is intended to prevent us (or attackers) from making these requests from normal browser tabs, ostensibly to stop XSS attacks. The cookie header is responsible for Google authentication, which proves that we have the right to access the notebook instance. The cookie is long and unwieldy, so let’s store it in the $COLAB_COOKIE shell variable.

$ COLAB_COOKIE="<<PREVIOUSLY REDACTED VALUE>>"
# Usage: $ curl ... -H "cookie: $COLAB_COOKIE"

Replacing Colab’s servers with our own

Let’s see if we can use the discovered reverse proxy to tunnel our own requests. Rather than messing with the existing colab-fileshim.py server, we’ll simply replace the process with our own! Run pkill -f colab-fileshim to kill it, then start our own server on the same port. For the demo, Python’s built-in HTTP server will serve our files at localhost:3453/files, as sketched below.
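A minimal sketch of that demo as notebook cells (the /serve directory and message.txt are illustrative; the /files prefix comes from serving a directory named files):

# kill the stock file server to free up port 3453
!pkill -f colab-fileshim
# stage some files of our own under a files/ prefix
!mkdir -p /serve/files
!echo "Hi! You've reached our own file server!" > /serve/files/message.txt
!cp /etc/shadow /serve/files/shadow
# serve them on the same port colab-fileshim used (this cell keeps running)
!cd /serve && python3 -m http.server 3453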

Voila! Now let’s rewrite the cURL command to download our files!

$ curl 'https://colab.research.google.com/tun/m/m-s-3oy94z70yrj59/files/message.txt?authuser=0' \
  -H "x-colab-tunnel: Google" -H "cookie: $COLAB_COOKIE"
Hi! You've reached our own file server!
$ curl 'https://colab.research.google.com/tun/m/m-s-3oy94z70yrj59/files/shadow?authuser=0' \
  -H "x-colab-tunnel: Google" -H "cookie: $COLAB_COOKIE"
root:*:18585:0:99999:7:::
daemon:*:18585:0:99999:7:::
bin:*:18585:0:99999:7:::
sys:*:18585:0:99999:7:::
sync:*:18585:0:99999:7:::
# ...

Pay attention to the log lines in the Colab cell. They prove that the requests were handled by our server:

Serving HTTP on 0.0.0.0 port 3453 (http://0.0.0.0:3453/) ...
172.28.0.1 - - [22/Jun/2022 16:43:10] "GET /files/message.txt HTTP/1.1" 200 -
172.28.0.1 - - [22/Jun/2022 16:43:16] "GET /files/shadow HTTP/1.1" 200 -

Unfortunately, due to the x-colab-tunnel: Google header requirement, the server cannot be easily accessed from a browser.

Further exploration

Let’s take a look at another interesting process: node. Checking /proc/7/cmdline shows that the process is running /datalab/web/app.js, and /datalab/web turns out to contain a fairly standard Node.js application. Along with the /socketio/ route, it also exposes a /_proxy/{port}/ route. This should let us access any URL on any port of the Colab instance:
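To test this, we can stand up a toy server on an arbitrary port from a notebook cell. This sketch (the port, title and HTML are all illustrative) simply echoes back the request path:

import http.server
import threading

class EchoHandler(http.server.BaseHTTPRequestHandler):
    # respond to any GET with a page that echoes the path we received
    def do_GET(self):
        body = ('<html><head><title>Colab Forwarded Server!</title></head>'
                '<body><h1>Hi from Colab!</h1>'
                f'<h2>path={self.path}</h2></body></html>')
        self.send_response(200)
        self.send_header('Content-Type', 'text/html')
        self.end_headers()
        self.wfile.write(body.encode())

server = http.server.HTTPServer(('0.0.0.0', 1234), EchoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()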

$ curl 'https://colab.research.google.com/tun/m/m-s-3oy94z70yrj59/_proxy/1234/some/path?authuser=0' \
  -H "x-colab-tunnel: Google" -H "cookie: $COLAB_COOKIE"
<html><head><title>Colab Forwarded Server!</title></head><body><h1>Hi from Colab!</h1><h2>path=/some/path</h2></body></html>%

If only we could view this page in a browser tab… Unfortunately, Colab refuses to forward requests unless they carry the x-colab-tunnel: Google header. Trying to visit these URLs from a browser yields an HTTP 400 error:

Displaying entire web pages

Fortunately, we can use a Chrome extension that injects HTTP headers into browser requests. Let’s configure one to send x-colab-tunnel: Google with every request:

Now we can open tunnel URLs right in the browser!

To Jupyter!

Finally, let’s look at the third and last interesting process: jupyter-notebook, which listens on port 9000. We can try to reach that port from the browser using our proxy-plus-header trick by visiting /tun/m/<id>/_proxy/9000. Unfortunately, instead of the Jupyter UI we get an HTTP 500 error page.

Weird. We try to run !curl -i localhost:9000 from the notebook itself, but we still get an error message:

The previous lsof output gives us a hint: instead of listening on 0.0.0.0/:: (all IPs on all interfaces), Jupyter listens only on the private IP assigned to the Colab instance. This was presumably done to avoid exposing the Jupyter interface, though of course Google didn’t go out of their way to hide it.

To get around this listening-address restriction, we need a process that listens on all interfaces and forwards the traffic it receives to the private IP Jupyter is bound to. For this, we can install socat (“Socket Cat”) and have it shuttle traffic between localhost:9000 and $HOSTNAME:9000:
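As notebook cells, a sketch of the forwarding setup might look like this (binding to localhost avoids a port conflict with Jupyter’s own binding on the private IP, and nohup keeps socat alive in the background):

!apt-get install -y -qq socat
# listen on localhost:9000 and relay each connection to the private IP
!nohup socat TCP-LISTEN:9000,bind=localhost,fork TCP:$(hostname):9000 > /dev/null 2>&1 &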

It works! Reloading the URL in the browser now shows fragments of the Jupyter UI, but it’s obviously broken.

Jupyter expects to be served from the root of the domain (/), but our Colab tunnel lives under the path /tun/m/<id>/_proxy/9000. This breaks all absolute paths to resources like CSS and JS files. There is no simple fix here: we would need a whole (sub)domain to forward traffic to our server.

Showing the Jupyter UI

Luckily, Colab has a well-hidden but official solution for this! Oddly enough, it’s hidden so well that finding it took me longer than finding the internal reverse proxy!

To learn how to use Colab’s official port forwarding, open the Code Snippets tab in the left sidebar and find the output-handling snippet. Click “View Source Notebook” and you land in advanced_outputs.ipynb, a compilation of advanced Colab snippets that showcase the platform’s scarcely documented features. The snippet we need is in the “Browsing to servers executing on the kernel” section.

We can use this snippet to serve the Jupyter UI from its own subdomain:
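The snippet boils down to a call like the following (assuming Jupyter is still reachable on port 9000 behind our socat relay):

# ask Colab to expose the kernel's port 9000 on its own proxied subdomain
from google.colab import output
output.serve_kernel_port_as_window(9000)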

Now we can click the link, add /tree to the URL to appease Jupyter, and see a fully working Jupyter UI! Well, almost fully working: Google seems to have limited the official proxy to GET requests only, so you can view notebooks but not run them.
