HTML Agility Pack and AngleSharp

Beginning of the series: “PowerShell: HTML parser, deprecated methods.”

Getting the “HTML Agility Pack” and “AngleSharp” libraries

Both libraries (sets of classes) can be downloaded for free from the “www.nuget.org” package repository. I tried two ways of doing this.

With “PackageManagement” (formerly “OneGet”)

First, using the HTML Agility Pack library as an example, I tried the “PackageManagement” package manager (formerly called “OneGet”) that is built into the Windows PowerShell shell. The cmdlets for working with this package manager are listed in the documentation on Microsoft's website.

The following command worked for me:

> Install-Package -Name "HtmlAgilityPack" -Source "nuget.org" `
                  -Scope CurrentUser -SkipDependencies

But working with this package manager is, in my opinion, not much of a pleasure. Some additional configuration may be required, as well as updating some modules of the “Windows PowerShell” shell: the “NuGet” provider module must be installed, the correct package source URL must be specified, and so on. In addition, as the command above shows, I had to use the -SkipDependencies parameter to skip the dependencies (the packages that the “HTML Agility Pack” library depends on). Without this parameter, nothing was downloaded at all.

The -Scope CurrentUser parameter is, of course, optional. I added it because I did not want to elevate to administrator rights. By default, packages are installed in the Program Files folder (which requires administrator rights); with this parameter, they are installed in the current user's folder (for which the current user's rights are enough).

The -Source "nuget.org" parameter selects the package source with the URL https://api.nuget.org/v3/index.json.
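The preparation steps mentioned above (installing the “NuGet” provider and registering the package source) can be sketched roughly as follows. This is only a sketch under assumptions: it assumes Windows PowerShell 5.1 with the built-in PackageManagement module, and whether the provider can actually work with the v3 URL depends on the installed provider version.

```powershell
# Assumption: Windows PowerShell 5.1 with the built-in PackageManagement module.
# Make sure the NuGet provider is available; install it for the current user if not.
if (-not (Get-PackageProvider -Name "NuGet" -ListAvailable -ErrorAction SilentlyContinue)) {
    Install-PackageProvider -Name "NuGet" -Scope CurrentUser -Force
}

# Register a package source named "nuget.org" if no such source exists yet.
if (-not (Get-PackageSource -Name "nuget.org" -ErrorAction SilentlyContinue)) {
    Register-PackageSource -Name "nuget.org" -ProviderName "NuGet" `
                           -Location "https://api.nuget.org/v3/index.json"
}

# Show the registered sources, so that the URL can be checked by eye.
Get-PackageSource | Format-Table -Property Name, ProviderName, Location
```

If a source with this name is already registered under a different URL, the sketch leaves it untouched; the last command makes that visible.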

The idea behind this package manager is interesting (bringing a number of package managers together under one roof), but it is poorly implemented. Besides, the site of this package manager says that it is no longer under active development. Work on the program stopped in 2019, three years ago. The site also says that only maintenance releases will be published.

Using the “nuget.exe” utility

This is the method I like best: fast, clear, no problems. The utility can be downloaded from the “Downloads” section of the “www.nuget.org” package repository site. Its current version is 6.2.1 (2022), and its size is 6.75 MB.

I installed the two libraries into the current folder (the project folder) by running the following commands in the Windows PowerShell shell:

> .\nuget install htmlagilitypack
> .\nuget install anglesharp

Contents of the libraries

The current version of the Html Agility Pack library is 1.11.43 (dated June 2, 2022). The library itself is the file “HtmlAgilityPack.dll”. Its classes, along with their methods and properties, are described in the file “HtmlAgilityPack.xml”, which can be viewed in any text editor.

When downloading from the package repository, the following tree of files and folders was created on my computer:

HtmlAgilityPack.1.11.43
│   .signature.p7s
│   HtmlAgilityPack.1.11.43.nupkg
└───lib
    ├───Net35
    ├───Net40
    ├───Net40-client
    ├───Net45
    ├───NetCore45
    ├───netstandard1.3
    ├───netstandard1.6
    ├───netstandard2.0
    ├───portable-net45+netcore45+wp8+MonoAndroid+MonoTouch
    ├───portable-net45+netcore45+wpa81+wp8+MonoAndroid+MonoTouch
    └───uap10.0

Each subfolder of the lib folder contains its own build of the Html Agility Pack library. In principle, I can use any of them in my PowerShell script: they are the same library built for different implementations of the .NET platform. I tried all eleven builds on a simple test case, and each of them worked for me without problems (a practical example follows later in the article). The size of the library (the file “HtmlAgilityPack.dll”) ranges from 130 to 165 KB, which, in my opinion, is very small.
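The builds and their sizes can be listed with a few lines of PowerShell. Below is a minimal sketch; the function name Get-LibraryBuild is hypothetical and chosen only for illustration, while the paths in the usage comment come from the folder tree above.

```powershell
# Lists every build of a library found under a package's lib folder,
# together with the name of its target folder and the file size.
function Get-LibraryBuild {
    param(
        [Parameter(Mandatory = $true)] [string] $LibPath,   # e.g. "HtmlAgilityPack.1.11.43\lib"
        [Parameter(Mandatory = $true)] [string] $FileName   # e.g. "HtmlAgilityPack.dll"
    )
    Get-ChildItem -LiteralPath $LibPath -Recurse -File -Filter $FileName |
        ForEach-Object {
            [pscustomobject]@{
                Target = $_.Directory.Name                    # folder name, e.g. "netstandard2.0"
                SizeKB = [math]::Round($_.Length / 1KB, 1)    # size in kilobytes
                Path   = $_.FullName
            }
        } |
        Sort-Object -Property Target
}

# Usage:
# Get-LibraryBuild -LibPath "HtmlAgilityPack.1.11.43\lib" -FileName "HtmlAgilityPack.dll" |
#     Format-Table -Property Target, SizeKB
```

The function only inspects files on disk. It does not load the assemblies, so it says nothing about whether a given build works in a particular PowerShell session.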

An overview of the library builds contained in the package can be found on the package page, in the “Frameworks” tab. This tab lists all the targets that a library for the .NET platform could currently support, with the ones actually included in the package marked in blue.

For some reason, no dependencies (packages on which the downloaded package depends) were downloaded for the “Html Agility Pack” library, although the “Dependencies” tab on the package page shows that some builds of the library do have dependencies.

The current version of the AngleSharp library is 0.17.1 (dated June 2, 2022). The library is the file “AngleSharp.dll”, and its description is in the file “AngleSharp.xml”.

When downloading from the package repository, the following tree of files and folders was created on my computer:

.\
├───AngleSharp.0.17.1
│   │   .signature.p7s
│   │   AngleSharp.0.17.1.nupkg
│   │   logo.png
│   └───lib
│       ├───net461
│       ├───net472
│       └───netstandard2.0
├───System.Buffers.4.5.1
├───System.Memory.4.5.4
├───System.Numerics.Vectors.4.5.0
├───System.Runtime.CompilerServices.Unsafe.6.0.0
└───System.Text.Encoding.CodePages.6.0.0

The five folders at the same level as the AngleSharp.0.17.1 folder are the dependency packages of the AngleSharp library. The “nuget.exe” utility downloaded them together with the “AngleSharp” package. Each of the three subfolders of the lib folder contains a build of the “AngleSharp” library for a particular implementation of the “.NET” platform. However, all three builds (the file “AngleSharp.dll”) appear to be identical, since they have exactly the same size, down to the byte (862 KB).
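Equal size is only a hint: files of the same length can still differ in content. A more reliable check is to compare file hashes. Here is a minimal sketch using the standard Get-FileHash cmdlet; the function name Test-IdenticalFile is hypothetical, and the paths in the usage comment come from the folder tree above.

```powershell
# Returns $true when all the given files have the same SHA-256 hash,
# which for practical purposes means their contents are identical.
function Test-IdenticalFile {
    param(
        [Parameter(Mandatory = $true)] [string[]] $Path
    )
    # Compute one hash string per file.
    $hashes = foreach ($p in $Path) {
        (Get-FileHash -LiteralPath $p -Algorithm SHA256).Hash
    }
    # All files match when only one distinct hash remains.
    return (@($hashes | Select-Object -Unique).Count -eq 1)
}

# Usage:
# $lib = "AngleSharp.0.17.1\lib"
# Test-IdenticalFile -Path "$lib\net461\AngleSharp.dll",
#                          "$lib\net472\AngleSharp.dll",
#                          "$lib\netstandard2.0\AngleSharp.dll"
```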

You will not necessarily need all five dependency packages when working with the AngleSharp library. So far, my script has needed only one of them: System.Text.Encoding.CodePages.

Sample Code Using Libraries

For both libraries I decided to use the build from the “netstandard2.0” folder, because this is the only target for which both libraries ship a build.

As a reminder, I am writing a very simple example of an HTML parser as a script for the Windows PowerShell shell, version 5.1, on Windows 10. I read the HTML code to be analyzed from a UTF-8 encoded file (without a BOM) into the variable $html of type System.String. To keep the article short, I leave the details of these operations outside its scope.
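For reference, one possible way to perform this loading step is sketched below. The function name Read-HtmlFile and the file name test.html are hypothetical. The encoding is given explicitly because in Windows PowerShell 5.1 the Get-Content cmdlet, called without the -Encoding parameter, reads a file without a BOM using the system ANSI code page, which garbles non-ASCII text.

```powershell
# Reads a whole file as UTF-8 text and returns it as a single System.String.
function Read-HtmlFile {
    param(
        [Parameter(Mandatory = $true)] [string] $Path
    )
    # .NET methods resolve relative paths against the process working directory,
    # which may differ from the current PowerShell location, so resolve the path first.
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath
    return [System.IO.File]::ReadAllText($fullPath, [System.Text.Encoding]::UTF8)
}

# Usage ("test.html" is a hypothetical file name):
# $html = Read-HtmlFile -Path "test.html"
```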

Example 1. Using the “Html Agility Pack” library:

Add-Type -Path "HtmlAgilityPack.1.11.43\lib\netstandard2.0\HtmlAgilityPack.dll"
$dom = New-Object -TypeName "HtmlAgilityPack.HtmlDocument"
$dom.LoadHtml($html)
""
foreach ($node in $dom.DocumentNode.DescendantNodes()) {
    # Skip whitespace-only text nodes (spaces, newlines, etc.)
    if (($node.Name -eq "#text") -and ($node.OuterHtml.Trim() -eq "")) {
        continue
    }
    # For special nodes ("#text", "#comment", ...) also print their markup
    $content = ""
    if ($node.Name.StartsWith("#")) {
        $content = ", '" + $node.OuterHtml.Trim() + "'"
    }
    $node.Name + $content
}
""

Example 2. Using the “AngleSharp” library:

Add-Type -Path ("System.Text.Encoding.CodePages.6.0.0\lib\" +
                "netstandard2.0\System.Text.Encoding.CodePages.dll")
Add-Type -Path "AngleSharp.0.17.1\lib\netstandard2.0\AngleSharp.dll"
$parser = New-Object -TypeName "AngleSharp.Html.Parser.HtmlParser"
$dom = $parser.ParseDocument($html)
""
foreach ($elem in $dom.All) {
    $elem.tagName
}
""

In the case of the AngleSharp library, as the code above shows, I first had to load one of the dependencies downloaded along with it, System.Text.Encoding.CodePages.6.0.0, into the Windows PowerShell session. Without this the library did not work for me: the script failed with an error.

The test code differs somewhat between the two examples, but at this stage that does not matter to me. What matters now is to check how the libraries are loaded and whether they can be used for an HTML parser at all.

Test script results

The content of the test file with HTML code (UTF-8 encoded, without a BOM):

<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>Название страницы</title>
</head>
<body>
  <!-- содержимое страницы -->
  <p>Текст.</p>
</body>
</html>

Result for example 1. Using the “Html Agility Pack” library:

<!DOCTYPE html>
html
head
meta
title
Название страницы
body
<!-- содержимое страницы -->
p
Текст.

Result for example 2. Using the “AngleSharp” library:

HTML
HEAD
META
TITLE
BODY
P

Conclusions

Both libraries can be used to create an HTML parser. I am not yet ready to say which one is more convenient in terms of functionality; that will become clearer once I have written a working prototype of the program I need. For now, I have decided to use the Html Agility Pack library because it is smaller and has no dependencies.
