How to copy all packages from nuget.org

Let's say you want to keep a copy of every package on nuget.org, just in case. How do you discover and download all of them without attracting the attention of the service administrators?

NuGet Protocols

How does the NuGet client discover packages? The client has a list of packages that the user wants to install. It must figure out where to download them from, recursively resolve their dependencies, and actually perform the download. To obtain all the necessary information, it queries the service API.

The especially curious know that there are two versions of the NuGet protocol, v2 and v3, each with its own “source URL”.

V2 is based on OData (XML, a strange query syntax, and so on) and is essentially an interface to the database. As far as I know, there is no official documentation for it, only unofficial.

In principle, v2 lets you list all the packages, but the service administrators do not like it when people hit v2 heavily, so they rate-limit v2 requests and cut back its capabilities. Don't use v2.

The v3 protocol was designed to improve scalability. Almost all of v3 is served from static files distributed via a CDN, which is much easier to scale than a web service backed by a database. Only search requires some computing power to operate.

More about v3

The protocol is more or less adequately documented. A request to the v3 source URL returns JSON with a list of “services” provided by the NuGet server implementation.

From the actual nuget.org response, we are interested in the following services:

{
    "@id": "https://api.nuget.org/v3/catalog0/index.json",
    "@type": "Catalog/3.0.0",
    "comment": "Index of the NuGet package catalog."
},

{
    "@id": "https://api.nuget.org/v3-flatcontainer/",
    "@type": "PackageBaseAddress/3.0.0",
    "comment": "Base URL of where NuGet packages are stored, in the format https://api.nuget.org/v3-flatcontainer/{id-lower}/{version-lower}/{id-lower}.{version-lower}.nupkg"
},    

All (or almost all) JSON objects returned by v3 are actually JSON-LD, if you happen to know what that is and how to use it. This is why they contain curious properties whose names start with @, which can be safely ignored. It may also be the reason for some “peculiarities” in parsing certain objects; more on that below.
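
As a minimal sketch of the lookup described above, the following Python picks a resource URL out of the service index by its @type. It assumes the index keeps its entries in a top-level resources array, as the nuget.org response does. The names fetch_json and find_resource are illustrative and not part of any NuGet library.

```python
import gzip
import json
import urllib.request


def fetch_json(url):
    """Download a URL and decode the body as JSON."""
    with urllib.request.urlopen(url) as response:
        body = response.read()
        # Some servers send gzip-compressed bodies even when not asked to.
        if response.headers.get("Content-Encoding") == "gzip":
            body = gzip.decompress(body)
    return json.loads(body)


def find_resource(service_index, resource_type):
    """Return the @id of the first resource with a matching @type, else None."""
    for resource in service_index.get("resources", []):
        types = resource.get("@type") or []
        if isinstance(types, str):  # @type may be a string or a list
            types = [types]
        if resource_type in types:
            return resource["@id"]
    return None
```

For example, find_resource(index, "Catalog/3.0.0") applied to the nuget.org response should give the catalog index URL shown above.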

Catalog

We will use Catalog/3.0.0 to list all packages. The catalog is a log of everything that has happened to NuGet packages since the “beginning of time.” By reading the catalog from start to finish, we replay every change made on nuget.org, which gives us a list of all packages. Nothing is ever removed from the catalog; entries are only appended to the end.

The v3 protocol did not exist from the very beginning of nuget.org, so when it was created, all packages that existed at the time of launch were imported. This produced a large number of entries at the beginning of the catalog with nearly identical timestamps (February 1, 2015). All packages added after that have a timestamp close to their time of publication.

https://api.nuget.org/v3/catalog0/index.json returns JSON with approximately the following structure:

{
  "@id": "https://api.nuget.org/v3/catalog0/index.json",
  "@type": [
    "CatalogRoot",
    "AppendOnlyCatalog",
    "Permalink"
  ],
  "commitId": "a304b4af-3a2c-4653-8ba8-2cdfd667951d",
  "commitTimeStamp": "2024-10-10T02:42:09.3106213Z",
  "count": 20671,
  "nuget:lastCreated": "2024-10-10T02:41:49.91Z",
  "nuget:lastDeleted": "2024-10-09T16:35:48.4746061Z",
  "nuget:lastEdited": "2024-10-10T02:41:49.91Z",
  "items": [
    {
      "@id": "https://api.nuget.org/v3/catalog0/page7713.json",
      "@type": "CatalogPage",
      "commitId": "9f4532df-09d2-473e-a5b5-acfe3fa3935a",
      "commitTimeStamp": "2018-12-29T16:00:42.7935125Z",
      "count": 533
    }
    ...
  ]
}

This is the “catalog index”: information about all of its pages. To avoid having one huge file, the catalog is divided into pages.

The properties starting with nuget: in this response are undocumented service fields related to catalog generation.

commitId – GUID of the last commit (a group of entries written together; details below). If it has changed since the last read, new entries have been added.

commitTimeStamp – time of the last commit. All timestamps are in UTC.

count – number of catalog pages.

items – an array of objects with information about each page: the GUID of its last commit, the time of its last commit, the number of entries, and a link to the page.

The items array is not sorted. If you want the pages in the order they were created, you will have to sort it by the commitTimeStamp field.
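
Sorting by commitTimeStamp can be sketched as follows. Note that the number of fractional-second digits varies (compare "02:41:49.91Z" with "02:42:09.3106213Z" in the sample above), so a plain string comparison is not reliable. This sketch assumes every timestamp is UTC and ends in Z, as in the samples; parse_timestamp and pages_in_order are names I made up for illustration.

```python
from datetime import datetime, timezone


def parse_timestamp(value):
    """Parse a catalog timestamp such as 2024-10-10T02:42:09.3106213Z."""
    text = value.rstrip("Z")
    head, _, fraction = text.partition(".")
    # Pad or truncate the fraction to microseconds (6 digits).
    microseconds = int((fraction + "000000")[:6])
    moment = datetime.strptime(head, "%Y-%m-%dT%H:%M:%S")
    return moment.replace(microsecond=microseconds, tzinfo=timezone.utc)


def pages_in_order(catalog_index):
    """Return the page descriptors of the catalog index, oldest commit first."""
    return sorted(
        catalog_index["items"],
        key=lambda page: parse_timestamp(page["commitTimeStamp"]),
    )
```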

Catalog pages

If we send a request with the page address, we will receive, for example:

{
  "@id": "https://api.nuget.org/v3/catalog0/page0.json",
  "@type": "CatalogPage",
  "commitId": "19a4aedc-5139-4df5-81a3-b40aeabb3f3c",
  "commitTimeStamp": "2015-02-01T06:30:11.7477681Z",
  "count": 540,
  "items": [
    {
      "@id": "https://api.nuget.org/v3/catalog0/data/2015.02.01.06.22.45/adam.jsgenerator.1.1.0.json",
      "@type": "nuget:PackageDetails",
      "commitTimeStamp": "2015-02-01T06:22:45.8488496Z",
      "nuget:id": "Adam.JSGenerator",
      "nuget:version": "1.1.0",
      "commitId": "b3f4fc8a-7522-42a3-8fee-a91d5488c0b1"
    },
    {
      "@id": "https://api.nuget.org/v3/catalog0/data/2015.02.01.06.22.45/agatha-rrsl.1.2.0.json",
      "@type": "nuget:PackageDetails",
      "commitTimeStamp": "2015-02-01T06:22:45.8488496Z",
      "nuget:id": "Agatha-rrsl",
      "nuget:version": "1.2.0",
      "commitId": "b3f4fc8a-7522-42a3-8fee-a91d5488c0b1"
    },
    ...,
    {
      "@id": "https://api.nuget.org/v3/catalog0/data/2015.02.01.06.30.11/superfarter.1.0.0.json",
      "@type": "nuget:PackageDetails",
      "commitTimeStamp": "2015-02-01T06:30:11.7477681Z",
      "nuget:id": "SuperFarter",
      "nuget:version": "1.0.0",
      "commitId": "19a4aedc-5139-4df5-81a3-b40aeabb3f3c"
    }
  ],
  "parent": "https://api.nuget.org/v3/catalog0/index.json",
  "@context": {
    "@vocab": "http://schema.nuget.org/catalog#",
    "nuget": "http://schema.nuget.org/schema#",
    "items": {
      "@id": "item",
      "@container": "@set"
    },
    "parent": {
      "@type": "@id"
    },
    "commitTimeStamp": {
      "@type": "http://www.w3.org/2001/XMLSchema#dateTime"
    },
    "nuget:lastCreated": {
      "@type": "http://www.w3.org/2001/XMLSchema#dateTime"
    },
    "nuget:lastEdited": {
      "@type": "http://www.w3.org/2001/XMLSchema#dateTime"
    }
  }
}

Here we see the already familiar commitId, commitTimeStamp, count, and items in the root object, with the same meaning as in the previous response but scoped to the current page. Accordingly, items contains data about catalog entries rather than pages.

parent contains a link to the catalog index.

@context can be ignored.

The elements of the items array contain a link to the catalog entry itself, the type of the entry, the timestamp and GUID of the commit it belongs to, and the identifier and version of the package. Again, the array is not sorted.

Groups of records

As you can see in the example above, the first two elements of items have the same commitTimeStamp and commitId values. This is a consequence of how the process that generates the catalog works.

The generator process wakes up every few minutes and checks the database for fresh changes. All detected changes are written to one page with a single commitId and commitTimeStamp. If adding the changes to the current page would exceed a certain threshold, the process creates a new page and places the entries there. This is also why the page size is not constant.

There is probably a limit on the size of a single commit as well; otherwise there would be one (or several) huge pages at the beginning of the catalog, and this is clearly not the case.

Another observation: at the beginning of the catalog the pages hold approximately 550 entries, and at the end more than 2700. The size was changed in 2022 to slow down the growth of the page count.
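
To see the commit grouping on a real page, you can bucket its items by commitId. This is only a sketch for exploring the data; group_by_commit is a hypothetical helper, and it relies on nothing beyond the commitId field shown in the page sample above.

```python
from collections import defaultdict


def group_by_commit(page):
    """Map each commitId on a catalog page to the list of its entries."""
    groups = defaultdict(list)
    for item in page["items"]:
        groups[item["commitId"]].append(item)
    return dict(groups)


def commit_sizes(page):
    """Return {commitId: number of entries written in that commit}."""
    return {commit: len(items) for commit, items in group_by_commit(page).items()}
```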

Entry types

The documentation describes two types of catalog entries:

  • nuget:PackageDetails – created for every new package, and also whenever the package metadata changes.

  • nuget:PackageDelete – created when a package is deleted. Deletion happens only in exceptional cases, so such entries are rare.

A few words about changes to package metadata. Until 2018, uploaded packages could be edited on the website. Because the metadata lives inside the package in a .nuspec file, the site repackaged the package with the new metadata. In 2018, authors and repositories gained the ability to sign packages (an author signature and a repository signature can be present at the same time). That put an end to editing packages after publication, and the feature was removed.

Accordingly, for packages published before 2018 you can find several catalog entries for one (identifier; version) pair with different metadata. In this case, the entries following the first contain updated package metadata, and the current version of the metadata is in the last entry.

The documentation describes the cases in which new entries for an existing package can still be added today.
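
Replaying the two entry types into a current list of packages might look like the sketch below. It assumes the entries are supplied in commit order (oldest first) and that lowercasing the identifier and version is enough to match entries for the same package; replay_entries is an illustrative name.

```python
def replay_entries(entries):
    """Replay catalog entries and return {(id, version): latest leaf URL}.

    `entries` must be page items in commit order, oldest first.
    """
    packages = {}
    for entry in entries:
        key = (entry["nuget:id"].lower(), entry["nuget:version"].lower())
        if entry["@type"] == "nuget:PackageDetails":
            # A later entry for the same package supersedes the earlier one.
            packages[key] = entry["@id"]
        elif entry["@type"] == "nuget:PackageDelete":
            packages.pop(key, None)
    return packages
```

Whatever is left in the dictionary after the last entry is the set of packages that should currently exist.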

PackageDetails entry

{
  "@id": "https://api.nuget.org/v3/catalog0/data/2015.02.01.08.38.05/bclcontrib-abstract.spring.0.1.6.json",
  "@type": "nuget:PackageDetails",
  "commitId": "0502702c-6a9e-4eb6-93a6-e0798a3a0dc7",
  "commitTimeStamp": "2015-02-01T08:38:05.5456876Z",
  "nuget:id": "BclContrib-Abstract.Spring",
  "nuget:version": "0.1.6"
},

It contains the identifier and version of the package, as well as a link to the “leaf” (catalog leaves are analogous to the leaf nodes of a tree), which holds more information. A leaf of type PackageDetails contains the data from the metadata section of the package's .nuspec file, dependency information, the list of files, the creation and publication timestamps, the package size and hash, the commit information (GUID and timestamp, the same as on the catalog page), and the package visibility flag.
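
One practical use of the size and hash in the leaf is checking a downloaded .nupkg. The sketch below assumes the leaf exposes them as packageSize, packageHash (standard base64), and packageHashAlgorithm (for example SHA512); these field names are from my reading of the catalog documentation, so verify them against a real leaf before relying on this. matches_leaf is a made-up helper name.

```python
import base64
import hashlib


def matches_leaf(package_bytes, leaf):
    """Check downloaded package bytes against the size and hash in a leaf."""
    if len(package_bytes) != leaf["packageSize"]:
        return False
    # hashlib expects lowercase algorithm names such as "sha512".
    algorithm = leaf["packageHashAlgorithm"].lower()
    digest = hashlib.new(algorithm, package_bytes).digest()
    return base64.b64encode(digest).decode("ascii") == leaf["packageHash"]
```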

PackageDelete entry

Looks like this:

    {
      "@id": "https://api.nuget.org/v3/catalog0/data/2015.10.28.10.44.16/imagesbuttoncontrol.1.0.0.json",
      "@type": "nuget:PackageDelete",
      "commitId": "15d2ae77-d9e4-413e-a5da-f3ea3d5abeb1",
      "commitTimeStamp": "2015-10-28T10:44:16.9226556Z",
      "nuget:id": "ImagesButtonControl",
      "nuget:version": "1.0.0"
    },

There is no point in downloading the leaf this entry refers to, because it says exactly the same thing.

“Features” of parsing

I don't know whether this comes from the JSON-LD format itself, from the library used to generate it, or whether these are just bugs, but in the JSON files hosted on api.nuget.org some properties can be represented either as a string or numeric literal or as an array.

For example, in the dependencies section you can sometimes find something like this:

{
  "@id": "https://api.nuget.org/v3/catalog0/data/2016.02.21.11.06.01/dingu.generic.repo.ef7.1.0.0-beta2.json#dependencygroup/.netplatform5.4/nuget.commandline",
  "@type": "PackageDependency",
  "id": "NuGet.CommandLine",
  "range": "[3.3.0, )"
},
{
  "@id": "https://api.nuget.org/v3/catalog0/data/2016.02.21.11.06.01/dingu.generic.repo.ef7.1.0.0-beta2.json#dependencygroup/.netplatform5.4/system.runtime",
  "@type": "PackageDependency",
  "id": "System.Runtime",
  "range": [
    "[4.0.10, )",
    "[4.0.21-beta-23516, )"
  ]
},

The range property may be a string or an array. It is not a big problem; just keep in mind that this can happen.
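
A simple way to cope with this is to normalize such properties to a list before using them. A minimal sketch (as_list is a hypothetical helper, not something NuGet provides):

```python
def as_list(value):
    """Normalize a JSON value that may be a scalar, a list, or missing."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


def dependency_ranges(dependency):
    """Return every version range of a PackageDependency object as a list."""
    return as_list(dependency.get("range"))
```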

Download packages

If we are only interested in the packages themselves, we don't have to download the catalog leaves. To download a package, we only need its identifier and version: the comment on the PackageBaseAddress/3.0.0 resource states everything needed to construct a download link. Documentation lovers will find the same thing written there: convert the package identifier and version to lower case and append them to the resource address according to the scheme:

https://api.nuget.org/v3-flatcontainer/{id-lower}/{version-lower}/{id-lower}.{version-lower}.nupkg

So all you have to do is walk through the catalog entries, generate links to the packages, and download them. Keep in mind that a package can have several entries but only needs to be downloaded once, and that a package may have been deleted, in which case a request to the constructed link returns 404.
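
Building the link and tolerating a 404 can be sketched like this. The URL scheme is the one quoted above; the sketch assumes the version string from the catalog can be used as-is after lowercasing. package_url and download_package are illustrative names.

```python
import urllib.error
import urllib.request


def package_url(base_address, package_id, version):
    """Build the .nupkg URL from the PackageBaseAddress scheme."""
    id_lower = package_id.lower()
    version_lower = version.lower()
    base = base_address.rstrip("/")
    return f"{base}/{id_lower}/{version_lower}/{id_lower}.{version_lower}.nupkg"


def download_package(url):
    """Return the package bytes, or None if the package no longer exists."""
    try:
        with urllib.request.urlopen(url) as response:
            return response.read()
    except urllib.error.HTTPError as error:
        if error.code == 404:  # deleted package
            return None
        raise
```

For the first entry of the page sample above, package_url("https://api.nuget.org/v3-flatcontainer/", "Adam.JSGenerator", "1.1.0") gives https://api.nuget.org/v3-flatcontainer/adam.jsgenerator/1.1.0/adam.jsgenerator.1.1.0.nupkg.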

Because entries are only ever appended to the end of the catalog, keeping up with packages as they are published is simple: remember the commitId, check the catalog index every few minutes to see whether the last commit identifier has changed, and when it has, look for new entries in the last pages of the catalog. Using the timestamps as a guide lets you find everything quickly.
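
The polling loop could be sketched roughly as follows. Fetching the index is passed in as a function so the loop itself stays independent of the HTTP client; watch_catalog, fetch_index, on_change, and the five-minute default interval are my own assumptions, not anything prescribed by nuget.org.

```python
import time


def watch_catalog(fetch_index, on_change, last_commit_id=None,
                  interval_seconds=300, max_polls=None):
    """Poll the catalog index and call on_change(index) when commitId changes.

    Returns the last commitId seen. max_polls=None means poll forever.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        index = fetch_index()
        if index["commitId"] != last_commit_id:
            on_change(index)  # e.g. read the newest pages here
            last_commit_id = index["commitId"]
        polls += 1
        if max_polls is None or polls < max_polls:
            time.sleep(interval_seconds)
    return last_commit_id
```

Persist the returned commitId (and the commitTimeStamp of the last entry you processed) between runs so that a restart does not re-read the whole catalog.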

If the goal is to clone nuget.org so that you can point the NuGet client at your clone and restore packages from it, you will also need to recreate the RegistrationsBaseUrl/3.6.0 resource. The client uses only this resource to compute the full dependency tree, so that should be enough. The article is already long, so I will leave this as an exercise for the reader.

I don't know how much space is needed. The last time I tried this was several years ago, and a 4 terabyte disk was not enough. The number of packages has grown several times over since then. These days the total volume is surely well over 10 TB. Go for it.
