HTML parser, deprecated methods

I recently wrote a PowerShell script whose first step reads the text of a JSON file and turns it into an object. An object is obviously easier to work with than plain text, even text in JSON format. In PowerShell this is easy to do, because the language has a convenient cmdlet, ConvertFrom-Json, designed for exactly this.
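
For illustration, here is a minimal sketch of that conversion. The JSON literal and its property names are made up for this example and are not taken from my actual script.

```powershell
# Hypothetical JSON text; in a real script it would come from a file.
$json = '{"name": "report", "items": [1, 2, 3]}'

# Turn the JSON text into an object with ordinary properties.
$obj = $json | ConvertFrom-Json

$obj.name          # report
$obj.items.Count   # 3
```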

The presence of this cmdlet made me think that PowerShell would have a similar cmdlet for a text file containing HTML code, so I wanted to try writing a small script to parse such a file. However, my assumption turned out to be wrong. PowerShell has a ConvertTo-Html cmdlet but no ConvertFrom-Html cmdlet. That is, you can easily output a report by automatically converting it to an HTML page, but you have to come up with something yourself to convert an existing text file with HTML code into an object.
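
To show the direction that does exist, here is a small sketch of ConvertTo-Html. The objects and the page title are invented for the example.

```powershell
# Hypothetical report data.
$rows = @(
    [pscustomobject]@{ Name = 'alpha'; Value = 1 }
    [pscustomobject]@{ Name = 'beta';  Value = 2 }
)

# ConvertTo-Html emits the page as an array of text lines.
$page = $rows | ConvertTo-Html -Property Name, Value -Title 'Report'

# Join the lines to view (or save) the page as one string.
$page -join "`n"
```

The output is HTML text only; nothing here gives you an object tree back from HTML.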

More about what I need

As you know, HTML code has a tree structure. The problem is that we start with nothing but a plain text file of HTML code. The programmer cannot immediately loop over the branches of the HTML document tree to find the required HTML elements. First, the individual branches of the tree, that is, the HTML elements, have to be identified in the text. To do this, you need to find the HTML tags that mark the boundaries of the elements and then store the individual elements in separate variables. These variables can be array elements or, better still, object properties. The result should be an object containing the tree of the HTML document.

Finding and isolating the individual elements in the text is formally called “lexical analysis”, while arranging the isolated elements from linear text into a tree structure (in our case, into an object) is called “parsing” (syntactic analysis). When these operations are automated, the lexical and syntactic analysis of the source text is usually performed by the same program, which is often called a “parser”.
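
The two stages can be sketched with a toy example. This is only an illustration under strong assumptions (simple, well-formed markup with no attributes, comments, or void elements); the node shape with Name, Text, and Children is my own invention, and real HTML needs a real parser.

```powershell
# Toy input: simple, well-formed markup only.
$text = '<html><head><title>Test</title></head><body>Text.</body></html>'

# Lexical analysis: cut the linear text into tag tokens and text tokens.
$tokens = [regex]::Matches($text, '<[^>]+>|[^<]+') | ForEach-Object { $_.Value }

# Helper that creates a tree node (hypothetical shape, for illustration).
function New-Node([string]$Name) {
    [pscustomobject]@{
        Name     = $Name
        Text     = ''
        Children = [System.Collections.Generic.List[object]]::new()
    }
}

# Parsing: arrange the tokens into a tree, tracking open elements on a stack.
$root  = New-Node '#document'
$stack = [System.Collections.Generic.Stack[object]]::new()
$stack.Push($root)

foreach ($token in $tokens) {
    $current = $stack.Peek()
    if ($token -match '^</') {
        [void]$stack.Pop()               # closing tag: the element is complete
    }
    elseif ($token -match '^<(\w+)') {
        $node = New-Node $Matches[1]
        $current.Children.Add($node)     # attach to the currently open element
        $stack.Push($node)
    }
    else {
        $current.Text += $token          # text belongs to the open element
    }
}

$root.Children[0].Children[1].Text       # Text.
```

Even this tiny version shows why the job is split in two: the regular expression knows nothing about nesting, and the stack knows nothing about characters.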

That is, what I want (as a first stage) is an HTML parser. As input, the HTML parser receives a text file with HTML code encoded in UTF-8 (without a BOM). Its output is an object containing the tree of the HTML document. Such an object is also called the “Document Object Model” of the HTML document, or “DOM” for short.

If you have ever tried to write even some simple parser, you can understand that this is not an easy task. Moreover, it is not easy for modern HTML code. This is not a task that can be taken on alone. Thus, it is obvious that I need a ready-made solution created by other people, that is, some kind of library designed for this task. It’s good that such libraries exist and are easy to find. It is more difficult to choose the best solution from the available libraries.

A note on the environment. I am writing a script to run in Windows PowerShell 5.1 and PowerShell 7 on the Windows 10 operating system. Windows PowerShell 5.1 reads a script file that has no BOM using the system's ANSI code page, so the script file must be encoded in UTF-8 with a BOM if the script is supposed to output messages in Russian to the console.
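
Whether a file starts with the UTF-8 BOM (the bytes EF BB BF) can be checked with a few lines. This is a sketch; the function name Test-Utf8Bom is hypothetical, and it expects an absolute path because .NET file methods do not follow the current PowerShell location.

```powershell
function Test-Utf8Bom {
    param([Parameter(Mandatory)][string]$Path)
    # Read the raw bytes and compare the first three with the UTF-8 BOM.
    $bytes = [System.IO.File]::ReadAllBytes($Path)
    return ($bytes.Length -ge 3 -and
            $bytes[0] -eq 0xEF -and $bytes[1] -eq 0xBB -and $bytes[2] -eq 0xBF)
}

# Demonstration on a temporary file written with and without a BOM.
$tmp = [System.IO.Path]::GetTempFileName()

[System.IO.File]::WriteAllText($tmp, 'Write-Host "Hello"', [System.Text.UTF8Encoding]::new($true))
$withBom = Test-Utf8Bom -Path $tmp       # True

[System.IO.File]::WriteAllText($tmp, 'Write-Host "Hello"', [System.Text.UTF8Encoding]::new($false))
$withoutBom = Test-Utf8Bom -Path $tmp    # False

Remove-Item -LiteralPath $tmp
```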

I called the test file with the HTML code “file.html”. In this file, I wrote the following text in UTF-8 encoding (without the BOM mark):

<html>
    <head>
        <title>Тестовая страница</title>
    </head>
    <body>
        Текст.
    </body>
</html>

I will not describe here in detail how to get text from a file into a variable with all the necessary checks. The acquisition itself is performed as follows (variable $file contains the path to the desired file):

$html = get-content $file -Encoding utf8

Thus we have a variable $html that contains the HTML code read from the specified file.
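
One detail worth knowing about this call: without the -Raw switch, Get-Content returns a multi-line file as an array of lines rather than as one string. The sketch below shows the difference on a temporary file; the file name and contents are made up for the example.

```powershell
# Create a small UTF-8 file without a BOM (hypothetical test content).
$file = Join-Path ([System.IO.Path]::GetTempPath()) 'get-content-demo.html'
[System.IO.File]::WriteAllText($file, "<html>`n<body>Text.</body>`n</html>",
    [System.Text.UTF8Encoding]::new($false))

$lines = Get-Content -Path $file -Encoding UTF8        # array, one item per line
$text  = Get-Content -Path $file -Encoding UTF8 -Raw   # the whole file as one string

$lines.Count            # 3
$text.GetType().Name    # String

Remove-Item -LiteralPath $file
```

If you need the text exactly as it is in the file, including line breaks, -Raw is the safer choice.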

Deprecated Methods

First, I will give the code of two methods, and then I will give some explanations.

Method 1.

$dom = New-Object -ComObject "HTMLFile"
$enc = [System.Text.Encoding]::Unicode   # UTF-16LE
$dom.write($enc.GetBytes($html))

Method 2.

Add-Type -AssemblyName System.Windows.Forms
$webBrowser = New-Object -TypeName "System.Windows.Forms.WebBrowser"
$webBrowser.DocumentText = ""            # Any text can be assigned here
$webBrowser.Document.write($html)
$dom = $webBrowser.Document

In the first method, we create the desired object through COM (ProgID “HTMLFile”) and work with it through the IHTMLDocument2 interface. This object is implemented by Microsoft's “Trident” (aka “MSHTML”) browser engine. The engine appeared in the Internet Explorer browser in 1997 and remained in it through the last version of that browser (version 11), released in 2013.

In the second method, to obtain the desired object, we first create an object of the “WebBrowser” class. This class represents one of the interface elements (“controls”) included in the “Windows Forms” library (with this library you can build a window interface for your program). After some manipulations, we get an object of the “HtmlDocument” class from the “Document” property of the “WebBrowser” object.

In fact, the “HtmlDocument” class is a wrapper around the “IHTMLDocument2” interface used in the first method. That is, although “HtmlDocument” and “IHTMLDocument2” have some differences, in general they provide approximately the same capabilities. You could even say that the two methods are essentially the same.

If you want to use either of these methods, note that the “write” method takes a parameter of a different type in each. In the first method it takes a byte array (type [byte[]]) holding the text in UTF-16LE encoding, which is why the text read from the UTF-8 file is converted to UTF-16LE bytes. Note that it is a byte array that is passed, not a regular System.String! In the second method, “write” takes an ordinary string of type System.String.
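
The difference between the two argument types is easy to see on a short string. The sample text below is arbitrary.

```powershell
$text  = 'Hi'

# UTF-16LE: every character of this string becomes two bytes, low byte first.
$bytes = [System.Text.Encoding]::Unicode.GetBytes($text)

$text.GetType().FullName    # System.String
$bytes.GetType().FullName   # System.Byte[]
$bytes -join ' '            # 72 0 105 0
```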

The second method assigns text to the “DocumentText” property of the “WebBrowser” object. The text itself has no special meaning, so it can be anything. This line is needed because the assignment causes an object of the “HtmlDocument” class to be created in the “Document” property of the “WebBrowser” object, and you can then work with that object. I was unable to create a standalone “HtmlDocument” object outside the context of a “WebBrowser” object. The fact is that the controls of the “Windows Forms” library are designed specifically for building a window interface and are not intended to be used the way the script above uses them. We can say that we are trying to hammer a nail with a camera.

I used the same test code for both methods; it can be appended to either of the two snippets above. Here it is:

$dom.GetType()
""
$dom.all[0].outerHTML
""
foreach ($elem in $dom.all) {
    "--$($elem.tagName)"
}

Result 1.

IsPublic IsSerial Name                           BaseType
-------- -------- ----                           --------
True     False    __ComObject                    System.MarshalByRefObject

<HTML><HEAD><TITLE>Тестовая страница</TITLE></HEAD>
<BODY>Текст.</BODY></HTML>

--HTML
--HEAD
--TITLE
--BODY

Result 2.

IsPublic IsSerial Name                           BaseType
-------- -------- ----                           --------
True     False    HtmlDocument                   System.Object

<HTML><HEAD><TITLE>Тестовая страница</TITLE></HEAD>
<BODY>Текст.</BODY></HTML>

--HTML
--HEAD
--TITLE
--BODY

It can be seen that the objects used are different. Otherwise, the result is the same.

Conclusions

I’m not going to use these methods in my script, since these classes are obsolete: they are based on the old Internet Explorer 11 engine, which, as far as I understand, does not support many of the additions made to the HTML standard in recent years. I just wanted to try them, because so many answers on the Stack Overflow website point to these methods. They still work and can be used for some simple cases.

For my script, I’m going to consider one of two suitable libraries: “HTML Agility Pack” or “AngleSharp”. Another interesting technology in this context is “WebDriver”, but it involves a real browser, so it is apparently not something to use from PowerShell. I’m not sure whether I need it.
