How to store more data in QR codes

Encoding data in decimal takes many more characters than encoding the same data in base64: 06513249 versus YWJj. However, that rule of thumb does not carry over to QR codes, which work much better with decimal digits. There is no magic here: all those extra digits are simply stored as efficiently as if there were no encoding at all. With decimal encoding, QR codes can store more data and are also easier to scan.

In this article I will explain:

  • how, in practice, using decimal encoding (slightly) reduces the density of a QR code containing a URL;

  • why it works this way: decimal data is entirely URL-safe and is stored efficiently in the QR code's numeric mode, while base64 must be stored in binary mode, where it uses only 64 of the 256 possible byte values and so wastes 2 of every 8 stored bits;

  • how to cram as much data as possible into a URL in a QR code.


Two QR codes that encode the same data: the one on the left, which uses base64, is slightly denser than the one on the right, which uses decimal (compare the sections in the top center). The larger modules (the small squares) of the decimal QR code make it easier to scan.

In the article “Mechanical sympathy for QR codes: improving registration in New South Wales”, I looked into the QR codes used for COVID contact tracing. It turns out that several states put a bunch of information into their QR code URLs as a base64-encoded JSON object, presumably because it's convenient.

They used 228-character URLs like this: https://www.service.nsw.gov.au/campaign/service-nsw-mobile-app?data=eyJ0IjoiY292aWQxOV9idXNpbmVzcyIsImJpZCI6IjEyMTMyMSIsImJuYW1lIjoiVGVzdCBOU1cgR292ZXJubWVudCBRUiBjb2RlIiwiYmFkZHJlc3MiOiJCdXNpbmVzcyBhZGRyZXNzIGdvZXMgaGVyZSAifQ== where the eyJ... part is the blob of data. At error correction level H, this can be encoded into an 81×81 QR code (version 16).

If you still need to store the data as JSON (that article explains why this isn't necessary), there are more efficient ways than base64: rewriting the data in decimal gives a 353-character URL: https://www.service.nsw.gov.au/campaign/service-nsw-mobile-app?data=072685680885510189821994892577900638215789419258463239488533499278955911240512279111633336286737089008384293066931974311305533337894591404330656702603998035920596585517131555967430155259257402711671699276432408209151397638174974409842883898456527289026013404155725275860173673194594939.

Any sane person would say that this is about 50% longer. Fortunately, QR codes see things a little differently: the decimal URL needs about 20% fewer modules (squares). It fits into a 73×73 QR code (version 14).

QR codes with fewer modules are easier to scan.

Complex data in URLs

QR codes can store arbitrary data, but they are typically used to store a URL, so that scanning the code takes you to a site with all the useful information. On the other hand, a QR code may be placed somewhere without Internet access, so the code itself should carry a sufficient amount of useful information. The result is URLs with a lot of data in them.

However, a URL is a limited container: inserting arbitrary data into it can cause problems with special characters being misinterpreted or mangled. Luckily, URLs are mostly just text, and there are many ways to convert arbitrary data to text: the Wikipedia page on binary-to-text encoding describes as many as 28.

One of the most common is probably base64: it is built into every browser, including the one you are reading this article in. Base64 encodes every 3 bytes as 4 characters drawn from a restricted 64-character alphabet. It is a reasonable default choice: it can be embedded in JSON, HTML/XML, CSS, and of course URLs.

All this means that base64 data can end up in QR codes whenever a URL carrying a lot of data is encoded. Unfortunately, base64 is a poor way to encode binary data in a QR code: the choice of alphabet forces the QR code to store the data in an unnecessarily inefficient way.
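As a quick illustration of the 3-bytes-to-4-characters rule, here is a small sketch using Python's standard `base64` module. The sample inputs are arbitrary; the second one is chosen only because it produces the two characters that differ between the standard and URL-safe alphabets.

```python
import base64

# Every 3 input bytes become 4 output characters.
encoded = base64.b64encode(b"abc").decode("ascii")
print(encoded)  # YWJj

# The URL-safe variant swaps "+" and "/" for "-" and "_".
tricky = bytes([0xFB, 0xFF, 0xFE])
standard = base64.b64encode(tricky).decode("ascii")
url_safe = base64.urlsafe_b64encode(tricky).decode("ascii")
print(standard, url_safe)  # +//+ -__-
```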

Encoding schemes

There is a huge variety of ways to encode arbitrary bytes into a restricted character set. We will take a detailed look at base64, base10, and base45 from RFC 9285, which is designed for QR codes (others, such as base16 (hexadecimal), base32 or base36, are clearly inferior).

| encoding | characters | input:output ratio | example | URL-safe? |
| --- | --- | --- | --- | --- |
| none | any | 1 | abc | no |
| base64 | 0–9, A–Z, a–z, +, / | 1.33 | YWJj | partially |
| base64url | 0–9, A–Z, a–z, -, _ | 1.33 | YWJj | yes |
| base45 | 0–9, A–Z, $%*+-.:/, space | 1.5 | 0EC92 | no |
| base10 | 0–9 | 2.41 | 06513249 | yes |

Some of these are URL-safe: the encoded data can be inserted directly into a URL without any special escaping or processing. Others are not, or may require special handling on the server. We are going to put data in URLs, so URL safety is a must. That leaves the two fully URL-safe options from the table: base64url and base10.

I made up “base10” by treating the bytes as one huge (little-endian) base-256 integer and then printing that integer in base 10. This doesn't scale to large data, but it works fine for the couple of kilobytes that fit in a QR code. A Python version could look like this:

import math

_DIGITS_PER_BYTE = math.log(2**8, 10)

def b10encode(data: bytes) -> str:
    raw = str(int.from_bytes(data, byteorder="little"))
    # Handle leading zeros so that, for instance, b"", b"\x00"
    # and b"\x00\x00" each have their own unique encoding, not
    # just "0".
    encoded_length = math.ceil(len(data) * _DIGITS_PER_BYTE)
    prefix = "0" * (encoded_length - len(raw))
    return prefix + raw

def b10decode(s: str) -> bytes:
    # Deduce the length of the result from the input, matching
    # b10encode's zero-padding (NB. a real implementation
    # should validate that the length is valid)
    decoded_length = math.floor(len(s) / _DIGITS_PER_BYTE)
    return int(s).to_bytes(
        length=decoded_length,
        byteorder="little",
    )

In the table, the “input:output ratio” column shows how many output characters are needed (on average) per input byte.

With base10, for example, long inputs come out very close to the average, while short inputs vary more:

| input (hexadecimal) | output | ratio |
| --- | --- | --- |
| 01 | 001 | 3/1 = 3 |
| 12 34 56 | 05649426 | 8/3 = 2.67 |
| FF FE…01 00 | 00000496…7615 (617 digits) | 617/256 = 2.41 |
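The ratios above can be reproduced with a short sketch. It restates the encoder in a compact form (using `zfill` for the zero padding, which should behave the same as the prefix logic in `b10encode`); the `ratio` helper and the `samples` list are names introduced here purely for illustration.

```python
import math

DIGITS_PER_BYTE = math.log(2**8, 10)  # about 2.408

def b10encode(data: bytes) -> str:
    # Same scheme as described above: little-endian integer, zero-padded
    # to a length that depends only on the number of input bytes.
    raw = str(int.from_bytes(data, byteorder="little"))
    return raw.zfill(math.ceil(len(data) * DIGITS_PER_BYTE))

def ratio(data: bytes) -> float:
    """Output digits per input byte."""
    return len(b10encode(data)) / len(data)

samples = [
    bytes.fromhex("01"),
    bytes.fromhex("123456"),
    bytes(range(255, -1, -1)),  # FF FE ... 01 00, 256 bytes
]
for sample in samples:
    # Prints: input length, output length, ratio rounded to 2 places
    print(len(sample), len(b10encode(sample)), round(ratio(sample), 2))
```

The longer the input, the less the rounding up to a whole number of digits matters, which is why the ratio settles near 2.41.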

QR code modes

A QR code stores data as a bitstream, and the input data is encoded in segments. Each segment can use one of four modes, and each mode supports different input. For example, the “alphanumeric” mode supports only digits, capital letters, and some punctuation, and defines how those characters are mapped to bits for storage.

| mode | symbols | bits per symbol | overhead |
| --- | --- | --- | --- |
| Numeric | 10: 0–9 | 3.33 | 0.34% |
| Alphanumeric | 45: 0–9, A–Z, $%*+-.:/, space | 5.5 | 0.15% |
| Binary | 256: arbitrary bytes | 8 | 0% |
| Kanji | 8189 | 13 | 0.0041% |

In numeric and alphanumeric modes, several input characters are stored together, which gives a fractional number of bits per character. For example, numeric mode stores each group of 3 digits in 10 bits, so 123456 is encoded as two chunks, 123 and then 456, in 20 bits.
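To make the grouping concrete, here is a minimal sketch of numeric-mode packing. It covers only the digit payload, not the mode indicator or character count that a real QR code puts in front of it. The 7-bit and 4-bit widths for a trailing group of 2 or 1 digits come from the QR code specification rather than from this article, and `numeric_mode_bits` is a name made up for this example.

```python
def numeric_mode_bits(digits: str) -> str:
    """Pack decimal digits the way QR numeric mode does (payload only)."""
    if digits and not (digits.isascii() and digits.isdigit()):
        raise ValueError("numeric mode only stores the digits 0-9")
    # Bits used for a group of 1, 2 or 3 digits.
    width = {1: 4, 2: 7, 3: 10}
    chunks = []
    for start in range(0, len(digits), 3):
        group = digits[start:start + 3]
        chunks.append(format(int(group), f"0{width[len(group)]}b"))
    return "".join(chunks)

bits = numeric_mode_bits("123456")
print(bits, len(bits))  # 00011110110111001000 20
```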

Packing characters into bits this way has a cost, albeit a small one. Conveniently, 10^3 is just a little less than 2^10, and 45^2 is just a little less than 2^11.

When a QR code contains multiple segments, it is still one long bitstream, so choosing the segments carefully lets you pick the optimal mode for each substantial substring of the input.
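The overhead column in the mode table can be derived by comparing the bits each mode spends per symbol with the information one symbol actually carries, log2 of the number of symbols. A small sketch, with the `MODES` list and `overhead_percent` introduced here for illustration:

```python
import math

# (mode, distinct symbols, bits the mode spends per symbol)
MODES = [
    ("numeric", 10, 10 / 3),       # 3 digits per 10 bits
    ("alphanumeric", 45, 11 / 2),  # 2 characters per 11 bits
    ("binary", 256, 8),
    ("kanji", 8189, 13),
]

def overhead_percent(symbols: int, bits_per_symbol: float) -> float:
    # Information actually carried by one symbol, in bits.
    ideal = math.log2(symbols)
    return (bits_per_symbol / ideal - 1) * 100

for name, symbols, bits in MODES:
    print(f"{name}: {overhead_percent(symbols, bits):.4f}%")
```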

A la decimal mode

The mode required to store encoded data depends on the output symbol set. For the encodings we looked at above:

| encoding | symbols | QR mode |
| --- | --- | --- |
| base64url | A–Z, a–z, 0–9, -, _ | Binary |
| base10 | 0–9 | Numeric |

Base64url contains lowercase letters and therefore requires binary mode when stored in a QR code. 3 input bytes (24 input bits) turn into 4 output characters, and those 4 characters must be stored as 4×8 = 32 bits in the QR code. This is an overhead of 33%: 1 input byte is stored as 1.33 bytes in the QR code. Each stored byte can hold 256 different values, but base64 only uses 64 of them. So 75% of the values go unused, which wastes 2 bits in every stored byte.

For base10 the calculation is not so simple, but we can do it:

  • The input data is some number of 8-bit bytes, each of which has 2^8 = 256 possible values.

  • Each input byte turns into log(256, 10) ≈ 2.408 output digits on average.

  • 3 digits are stored in 10 bits.

  • In total, each input byte needs 10 / 3 * log(256, 10) ≈ 8.027 bits of storage.

As a result, 1 input byte is stored on average as 1.0034 bytes: an overhead of 0.34%.
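The same arithmetic for both encodings, as a short sketch (the variable names are made up for this example; segment headers are ignored, so these are averages for long inputs):

```python
import math

# base64url: 3 bytes -> 4 characters, each stored as 8 bits in binary mode.
base64url_bits_per_byte = (4 / 3) * 8

# base10: 1 byte -> log(256, 10) digits, and 3 digits are stored in 10 bits.
base10_bits_per_byte = math.log(256, 10) * (10 / 3)

for name, bits in [("base64url", base64url_bits_per_byte),
                   ("base10", base10_bits_per_byte)]:
    print(f"{name}: {bits:.3f} bits per input byte, "
          f"overhead {(bits / 8 - 1) * 100:.2f}%")
```

This should print about 10.667 bits (33.33% overhead) for base64url and about 8.027 bits (0.34% overhead) for base10.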

This is precisely the overhead of numeric mode itself. Encoding binary data as base10 adds no overhead of its own! The same data needs about 2.4 times as many characters in base10 as it has bytes, but those characters are stored so efficiently in a QR code that nothing is wasted in storage.

Let's summarize

Let's go back to our URL from before: https://www.service.nsw.gov.au/campaign/service-nsw-mobile-app?data=eyJ0IjoiY292aWQxOV9idXNpbmVzcyIsImJpZCI6IjEyMTMyMSIsImJuYW1lIjoiVGVzdCBOU1cgR292ZXJubWVudCBRUiBjb2RlIiwiYmFkZHJlc3MiOiJCdXNpbmVzcyBhZGRyZXNzIGdvZXMgaGVyZSAifQ==

The “data=” parameter contains base64-encoded JSON. This could be encoded better without JSON or base64, but let's assume that we have to use JSON. We can take the blob eyJ0I… and decode it to the underlying JSON: {"t":"covid19_business","bid":"121321","bname":"Test NSW Government QR code","baddress":"Business address goes here "}.

Passing these bytes through the b10encode function gives a very long number: 072685680885510189821994892577900638215789419258463239488533499278955911240512279111633336286737089008384293066931974311305533337894591404330656702603998035920596585517131555967430155259257402711671699276432408209151397638174974409842883898456527289026013404155725275860173673194594939.

Decimal encoding looks long to a human, but for a QR code it is simpler to store and takes up only slightly more space than the raw JSON:

| representation | length |
| --- | --- |
| raw JSON | 118 bytes |
| base64 encoding | 160 characters |
| base64 QR storage | 160 bytes |
| base10 encoding | 285 digits |
| base10 QR storage | 119 bytes |
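These lengths can be checked with a short sketch. It counts only the payload bits of each segment (no mode indicator or character count). The 7-bit and 4-bit widths for a trailing group of 2 or 1 digits are taken from the QR code specification rather than from this article, and `numeric_mode_payload_bits` is a helper name invented for this example.

```python
import base64
import math

raw_json = (
    b'{"t":"covid19_business","bid":"121321",'
    b'"bname":"Test NSW Government QR code",'
    b'"baddress":"Business address goes here "}'
)

def numeric_mode_payload_bits(digit_count: int) -> int:
    # 10 bits per full group of 3 digits; 7 or 4 bits for a trailing
    # group of 2 or 1 digits.
    full, rest = divmod(digit_count, 3)
    return full * 10 + {0: 0, 1: 4, 2: 7}[rest]

base64_chars = len(base64.urlsafe_b64encode(raw_json))
base10_digits = math.ceil(len(raw_json) * math.log(256, 10))
base10_stored_bytes = math.ceil(numeric_mode_payload_bits(base10_digits) / 8)

print("raw JSON:", len(raw_json), "bytes")
print("base64 encoding:", base64_chars, "characters")
print("base64 QR storage:", base64_chars, "bytes")  # 8 bits per character
print("base10 encoding:", base10_digits, "digits")
print("base10 QR storage:", base10_stored_bytes, "bytes")
```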

You may be wondering about other encodings, such as base16 (hexadecimal), base32 and base36 (which all suit alphanumeric mode), or base8 (which suits numeric mode and is easier to work with than base10). They are less efficient than base10 in numeric mode, but potentially more convenient in some cases.

The rest of the URL is not purely numeric, so to see the benefit of this encoding, two segments must be used:

  • one with “boring” URL fragments at the beginning, probably using binary mode

  • one with a large block of data in Base10 format, using numeric mode

With Segno 1.6.1, this might look something like:

import segno

url_prefix = "https://www.service.nsw.gov.au/campaign/service-nsw-mobile-app?data="
# `data` holds the raw JSON bytes; b10encode is the function defined earlier
encoded_data = b10encode(data)
# Two segments, so that they're encoded with separate modes
qr = segno.make([url_prefix, encoded_data])

The NSW COVID registration QR codes used error correction level H. If we take this as a basis and simply change the encoding of the data parameter, we see the result from the beginning of the post:


Two QR codes that encode the same data: the left one uses base64, which gives an 81×81 QR code (version 16), and the right one uses base10, which gives a 73×73 QR code (version 14).

Extreme values

Choosing modes, encodings and segments lets you make the most of the limits of the QR code format. A QR code can store up to 23.6 kilobits ≈ 3.0 KB when using version 40 with error correction level L. This corresponds to approximately 7 thousand decimal digits in numeric mode or 3 thousand bytes in binary mode.
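As a rough check of those limits, the sketch below divides the available bits between a segment header and its payload. The figures it relies on are assumptions taken from the QR code specification, not from this article: 2956 data codewords for version 40 at level L, a 4-bit mode indicator, and a character count of 14 bits (numeric) or 16 bits (binary) at this version. The function names are invented for the example.

```python
# Assumed from the QR code specification for version 40, level L.
DATA_BITS = 2956 * 8  # 23,648 bits, the "23.6 kilobits" above

def max_numeric_digits(bits_available: int, header_bits: int = 4 + 14) -> int:
    payload = bits_available - header_bits
    groups, leftover = divmod(payload, 10)
    # Leftover bits can still hold 2 digits (7 bits) or 1 digit (4 bits).
    extra = 2 if leftover >= 7 else 1 if leftover >= 4 else 0
    return groups * 3 + extra

def max_binary_bytes(bits_available: int, header_bits: int = 4 + 16) -> int:
    return (bits_available - header_bits) // 8

print(max_numeric_digits(DATA_BITS))  # 7089
print(max_binary_bytes(DATA_BITS))    # 2953
```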

We can try this on some test URLs like http://example.com/{encoded_data}. The maximum we can fit into a QR code looks like this:

| encoding | input data | URL length |
| --- | --- | --- |
| base64url | 2.2 KB | 2.9 thousand |
| base10 | 2.9 KB | 7.0 thousand |
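A back-of-the-envelope sketch gives figures in line with this table. It assumes the version 40, level L capacity and header sizes from the QR code specification (2956 data codewords, 4-bit mode indicators, 16-bit and 14-bit character counts), unpadded base64url, and the http://example.com/ prefix stored in its own binary-mode segment. A real encoder may split segments differently, so treat the exact numbers as approximate.

```python
import math

DATA_BITS = 2956 * 8            # version 40, level L (assumed from the spec)
PREFIX = "http://example.com/"  # stored in a binary-mode segment

# base64url: the whole URL is one binary-mode segment (4 + 16 header bits).
url_chars = (DATA_BITS - 20) // 8
b64_chars = url_chars - len(PREFIX)
b64_input_bytes = b64_chars * 3 // 4

# base10: a binary segment for the prefix, then a numeric segment
# (4 + 14 header bits) for the digits.
remaining = DATA_BITS - (20 + 8 * len(PREFIX)) - 18
groups, leftover = divmod(remaining, 10)
digits = groups * 3 + (2 if leftover >= 7 else 1 if leftover >= 4 else 0)
b10_input_bytes = math.floor(digits / math.log(256, 10))

print("base64url:", b64_input_bytes, "bytes in,", url_chars, "URL characters")
print("base10:", b10_input_bytes, "bytes in,", len(PREFIX) + digits, "URL characters")
```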

In theory, the much longer base10 URL doesn't matter: thanks to the QR code's numeric mode, it is stored very efficiently.

However, after this article was published, it was pointed out that iOS scans the huge base10 URL incorrectly, reading http://example.com instead of http://example.com/310….

So URL length matters! This limit appears to be different from ordinary URL restrictions: 8 KB is a reasonable lower bound for what “normal” browsers support these days, so 7 thousand characters should be safe. Indeed, pasting the same link directly into Safari works fine!

So it's better to do your own testing.


Two maximum-size QR codes that most people could hardly tell apart. But you and I know that the second contains 33% more data.

Thank you for your attention!
