Understanding MavenGate. Is it really that scary?

Source: https://o-bake-tyan.livejournal.com/184979.html

Earlier this year, researchers at OverSecured published an article describing the MavenGate supply chain attack in detail, so I will not dwell on how the attack works. In short, an attacker can buy up a library developer's domain, gain the ability to publish updates to that library on the author's behalf and, in theory, add something malicious to it. Application developers who do not check the contents will then pull the infected library into their projects. The situation is unpleasant.

According to OverSecured, 18% of all dependencies in public repositories such as MavenCentral, jCenter, and jitpack may be susceptible to this attack. This is quite a lot, especially considering that open-source projects mostly use either dependencies from public repositories or the same open-source libraries from jitpack.

That made me curious how susceptible open-source Android applications actually are to this attack, and how many libraries inside those projects are already “infected”.

Selection of test subjects

The easiest option is to take the most popular open source applications and check them. I googled them and the list turned out to be like this:

I left browsers out of the study: their security is a separate topic, and browsers are mostly written in native code, while MavenGate is more about JVM code. I also excluded applications that are more than 50% C/C++, for the same reason. Maven artifacts can of course contain native code, but I come across such libraries very rarely.

Research idea

Once the subjects were selected, I had to decide what to do with them. The attack replaces a library with a look-alike that carries a “surprise”, so the two files will have different hash sums. This means that if someone managed to slip something bad into the library in one repository, its hash will differ from the hash of the same library in another repository.

However, Maven repositories store not only jar files but also aar files (Android Archive), which can carry an application manifest and are generally closer to the Android system than plain jars. The same library can be built as both a jar and an aar. The two look different inside, so their hashes differ as well. This has to be taken into account: jars should be compared with jars and aars with aars.
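The hash comparison across repositories can be sketched as follows. This is a minimal illustration, not the script used in the study: the function names are hypothetical, SHA-256 is an assumed choice of hash, and the URL layout is the standard Maven repository layout (it reproduces the sarif4k links quoted in the results).

```python
import hashlib

# Base URLs of two of the repositories mentioned in the article.
REPOSITORIES = {
    "maven_central": "https://repo.maven.apache.org/maven2",
    "jitpack": "https://jitpack.io",
}


def artifact_url(base, group, artifact, version, packaging="jar"):
    """Build a download URL using the standard Maven repository layout."""
    group_path = group.replace(".", "/")
    return f"{base}/{group_path}/{artifact}/{version}/{artifact}-{version}.{packaging}"


def sha256_hex(data: bytes) -> str:
    """Hex digest of the file content (the hash algorithm is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def hashes_match(files_by_repo: dict) -> bool:
    """True if every repository that has the file served identical bytes.

    Repositories where the file is missing are passed as None and skipped.
    """
    digests = {sha256_hex(data) for data in files_by_repo.values() if data is not None}
    return len(digests) <= 1
```

The bytes themselves could be fetched with `urllib.request.urlopen(url).read()`. The check has to be run separately for `packaging="jar"` and `packaging="aar"`, since the two formats never hash the same.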

The attack also involves buying the domain if the library author has not kept hold of it, so you need to find out who owns the domain, when it was registered, when it was renewed, and when it expires. If the domain was bought right after the news came out, something bad may be brewing. Whois services can help with this.

These are the two main ideas behind the dependency analysis. If you go through each library, calculate its hash in each of the popular repositories, reverse its group ID to get the domain, send a whois request, and collect everything in a table, you can see clearly which libraries are completely safe and which could potentially carry a “surprise”.
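Turning a group ID back into a domain might look like the sketch below. The function name and the `labels` parameter are hypothetical, and taking only the first two labels is a heuristic assumption that will be wrong for multi-part suffixes such as `co.uk`.

```python
def group_to_domain(group_id: str, labels: int = 2) -> str:
    """Reverse the leading labels of a Maven group ID into a domain name.

    Assumption: the domain is encoded in the first `labels` segments of the
    group ID, written in reverse order, as the Maven naming convention suggests.
    """
    parts = group_id.split(".")[:labels]
    return ".".join(reversed(parts))


print(group_to_domain("com.squareup.okhttp3"))         # squareup.com
print(group_to_domain("io.github.detekt.sarif4k", 3))  # detekt.github.io
```

Group IDs that start with `io.github` or `com.github` map to hosting subdomains rather than to a domain the author registered, so a whois lookup on them says little about the library author.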

Then the table will have columns:

  • Domain

  • Library name

  • Version

  • Domain owner

  • Domain creation date

  • Domain update date

  • Domain registration end date

  • jar hash of the library from repository 1

  • aar hash of the library from repository 1

  • jar hash of the library from repository N

  • aar hash of the library from repository N

In order to clearly show in the table which library can actually pose a danger and which cannot, you need to set a coloring rule for each row of this table:

Let me explain each case in this diagram.

  • No color – the hashes match and the domain was last updated before the news came out; everything is fine.

  • Blue – the hashes match, but the domain was purchased or updated recently, which may mean that it no longer belongs to the library’s author.

  • Yellow – the domain was bought before the news was published, so presumably an attacker could not have planted a library with a “surprise”, yet the hashes do not match, which is very strange. There may be various reasons for this. For example, the developer may simply not have noticed that different builds were uploaded to different repositories under the same version.

  • Red – the extreme case: the domain was updated recently and the hashes do not match. Such a library raises a lot of questions: you need to download it, see what might be wrong and, if it turns out to be infected, take action.
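The four cases above reduce to two yes/no questions, which can be sketched as a small function. The function name is hypothetical, and treating a domain updated exactly on the reference date as “recent” is an assumption; the reference date itself is the February 1, 2024 cutoff used in this study.

```python
from datetime import date
from typing import Optional

REFERENCE_DATE = date(2024, 2, 1)  # cutoff used in the article


def row_color(hashes_match: bool, domain_updated: date,
              reference: date = REFERENCE_DATE) -> Optional[str]:
    """Return the row color for one library, or None for "no color"."""
    # Assumption: an update on the reference date itself counts as recent.
    recently_updated = domain_updated >= reference
    if hashes_match:
        return "blue" if recently_updated else None
    return "red" if recently_updated else "yellow"
```

For example, `row_color(False, date(2024, 3, 1))` returns `"red"`, while the same mismatch on a domain last updated in 2023 returns `"yellow"`.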

Getting dependencies

Having defined the algorithm, I needed to figure out how to get the dependencies. Since we are dealing with Maven repositories and Android projects, the first thing that comes to mind is the Gradle build tool, which already knows how to work with dependencies. You might be tempted to parse the build.gradle files and look for the dependencies block. That can certainly be done, but there are also transitive dependencies that need to be collected and analyzed, as well as dependencies brought in by plugins. Each application can use different plugins, and those can bring their own dependencies. Most often they do not, but the possibility exists.

Without thinking twice, I used the task:

./gradlew dependencies

But this command lists dependencies only for the root project, which is not the whole picture: there may be subprojects (Gradle modules) with their own dependencies and plugins, and they have to be included in the analysis of the project as well. For each module, the dependencies are listed with:

./gradlew <module name>:dependencies

To get a list of modules, you can call the task:

./gradlew projects
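Extracting the module list from that output could be sketched like this. The regular expression assumes the usual tree format printed by the `projects` task (lines such as `+--- Project ':app'`); the function name is hypothetical.

```python
import re

# Matches module paths in lines like "+--- Project ':app'".
# The root project line ("Root project 'name'") has no leading colon and is skipped.
PROJECT_RE = re.compile(r"Project '(:[^']+)'")


def parse_modules(projects_output: str) -> list:
    """Return module paths such as ':app' in order of first appearance."""
    modules = []
    for path in PROJECT_RE.findall(projects_output):
        if path not in modules:
            modules.append(path)
    return modules


def dependency_commands(modules: list) -> list:
    """Build one './gradlew <module>:dependencies' command per module."""
    return [f"./gradlew {module}:dependencies" for module in modules]
```

Each returned command can then be run with `subprocess.run(..., capture_output=True, text=True)` and its output saved for parsing.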

As you might guess, I parsed all this output with regular expressions. Then, for each library, I used whois to look up its domain and queried the most popular Maven repositories (Maven Central, jCenter, jitpack, plus clojars just in case) to get the library hash. Finally, I put everything into a pandas DataFrame so that it could be exported to, say, an Excel spreadsheet.
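A rough sketch of such regular-expression parsing is shown below. It is not the original script: the pattern and function name are hypothetical, and it assumes the usual tree output of the `dependencies` task, where `->` marks the version Gradle actually resolved. Entries without a plain version (for example, ones managed by constraints) are skipped.

```python
import re

DEP_RE = re.compile(
    r"[+\\]--- "                                    # tree branch marker
    r"(?P<group>[\w.\-]+):(?P<artifact>[\w.\-]+)"   # group:artifact
    r"(?::(?P<declared>[\w.\-+]+))?"                # declared version, if any
    r"(?: -> (?P<resolved>[\w.\-+]+))?"             # version picked by Gradle
)


def parse_dependencies(output: str) -> set:
    """Collect unique (group, artifact, version) triples from the task output."""
    found = set()
    for line in output.splitlines():
        match = DEP_RE.search(line)
        if not match:
            continue  # headers, blank lines, "project :module" entries
        version = match.group("resolved") or match.group("declared")
        if version:
            found.add((match.group("group"), match.group("artifact"), version))
    return found
```

The resulting triples map naturally onto table rows, for instance via `pandas.DataFrame(sorted(found), columns=["group", "artifact", "version"])`.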

Why Excel? Because it is the easiest place to post-process the data by hand when necessary, look for patterns and, in the end, load it back into pandas so that some other utility can continue the processing. The data can also be sorted right in Excel by any column, which is very convenient.

It is important to clarify that I did not include the Google repository, since it is unlikely that anyone from outside can upload anything there; Google will not allow it.

I also did not analyze libraries whose authorship clearly belongs to “big brother”. We are talking about Google, JetBrains, etc. These guys simply will not allow their libraries to be compromised.

I used February 1, 2024 as the reference date for whois, despite the fact that the news was released on January 17.

Result

In the end I ran the analysis on all the applications mentioned above. Under the coloring rule described earlier, blue rows (a recently updated domain) turned up most often. In principle this is logical: developers started paying closer attention to their code in public repositories after the MavenGate news came out.

It turned out something like this:

At the end I will attach a link to all xls files to make it easier to read.

I didn't find any yellow ones in any application, which confirms that developers do in fact upload identical libraries to different repositories.

However, I did find one “red” library, in the K-9 Mail application. As you can see in the screenshot, its hashes clearly differ between MavenCentral and jitpack. It would seem that this is it, bingo! One of the two libraries must behave differently.

But no! If you manually download these two archives from the links:

https://repo.maven.apache.org/maven2/io/github/detekt/sarif4k/sarif4k/0.4.0/sarif4k-0.4.0.jar
https://jitpack.io/io/github/detekt/sarif4k/sarif4k/0.4.0/sarif4k-0.4.0.jar

You can verify that the hashes are actually different:

So, what's the difference? You need to look inside the archive and compare. This can be done using windiff:

As you can see from the windiff output, this is just a kotlin module, with only the manifest being different. Let's look at this manifest:

If you look closely, you can see that the lines are simply swapped. This could have happened because the developer built the library for different repositories at different times, or under a different build configuration. In fact, it does not matter why the lines are swapped: it changes nothing when an application is built with this library. Had the executable code been different, that would definitely have been bingo!
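The same comparison can be done without a GUI diff tool. The sketch below is an alternative to WinDiff, not what was used in the study: it hashes every entry inside the two archives and reports which entries differ, then checks whether two text files contain the same lines in a different order. All function names are hypothetical.

```python
import hashlib
import io
import zipfile


def entry_digests(jar_bytes: bytes) -> dict:
    """Map each file inside the archive to the SHA-256 of its uncompressed content."""
    with zipfile.ZipFile(io.BytesIO(jar_bytes)) as jar:
        return {
            name: hashlib.sha256(jar.read(name)).hexdigest()
            for name in jar.namelist()
            if not name.endswith("/")  # skip directory entries
        }


def differing_entries(jar_a: bytes, jar_b: bytes) -> list:
    """Names of entries that are missing on one side or whose content differs."""
    a, b = entry_digests(jar_a), entry_digests(jar_b)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))


def same_lines_any_order(text_a: str, text_b: str) -> bool:
    """True if two text files contain the same lines, ignoring their order."""
    return sorted(text_a.splitlines()) == sorted(text_b.splitlines())
```

If `differing_entries` returns only `META-INF/MANIFEST.MF` and `same_lines_any_order` is true for the two manifests, the archives differ in metadata ordering only, which is the situation described above.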

In total, across all the applications I got 947 libraries, of which:

  • 448 libraries are colorless, that is, completely safe as far as this study is concerned

  • 498 libraries are blue: the domain was updated recently

  • 0 libraries are yellow: fortunately, the strangest category is empty

  • 1 library is red, but we have already seen that it is harmless
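As a quick sanity check of these figures, the shares can be recomputed from the counts; the variable names are illustrative only.

```python
counts = {"colorless": 448, "blue": 498, "yellow": 0, "red": 1}
total = sum(counts.values())  # 947

# Share of each color, in percent, rounded to one decimal place.
shares = {color: round(100 * n / total, 1) for color, n in counts.items()}

# Colorless and blue rows are the ones where hashes match across repositories.
matching_hashes = round(100 * (counts["colorless"] + counts["blue"]) / total, 1)

print(shares)           # {'colorless': 47.3, 'blue': 52.6, 'yellow': 0.0, 'red': 0.1}
print(matching_hashes)  # 99.9
```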

Conclusions

I think that 1 potentially infected library out of 947 is too small a share to attack any significant number of JVM applications, especially given that this library clearly comes from GitHub and turned out to be harmless.

When Gradle pulls a library from a public repository, there is roughly a 50% chance (498 of the 947 libraries in this study) that its domain was updated after February 1, 2024.

Based on the figures from this study, in 99.9% of cases libraries downloaded from different repositories are the same, which means that the order in which repositories are declared in build.gradle/pom.xml files doesn't matter.

Yes, there may be a situation where libraries are already compromised and have long contained something bad (red rule), but this situation is beyond the scope of the study, since it is often difficult to understand which library was the original. This needs to be done in a separate study with the addition of code analyzers, but what's the point if the library hashes are almost always the same.

It turns out that MavenGate is not as scary as it is “painted”.

Ultimately, no one is stopping you from using the simplest and most effective way to ensure the security of your applications on the JVM platform – using a package mirror in the company's environment, where code analyzers work, which can reliably determine which library is “healthy” and which is not.

The results of the study can be found at link.

If you have any questions, write in the comments. I'll try to answer 🙂
