The Story of How Graphviz and BOR Cracked Sony's Encryption

I want to dedicate my first article to the story of how I decided to investigate the often-occurring obscure byte strings in PlayStation Portable modules. I couldn't find any documentation in the Homebrew community, so I took it upon myself.


I started reverse engineering PSP games for modding purposes about two years ago, before that I somehow couldn't get around to it, although I watched other modders' YouTube channels. In my local Patapon trilogy modding community, I suddenly became one of the most advanced specialists and started researching for fun in Hydra absolutely every game I had on my PSP-Go. While I was doing this, I started to notice that the data section sometimes had very similar, almost-ASCII lines.

I spent over an hour looking for this screenshot on Discord when I was writing this article…
This whole sequence of non-zero bytes really interested me then.

This whole sequence of non-zero bytes really interested me then.

No one referred to them in the programs, so there was no way to find out what kind of animals they were.

One fine day I noticed that absolutely the same lines were present inside ~SCE headers of some PRX modules.

PRX in the context of Playstation Portable

Now it would be unacceptable for me to continue the story without outlining what PRX is.

On PSP, executable files are in PRX (Playstation Relocatable Executable) format.

In fact, it's a slightly modified ELF:

  1. The field has been rethought p_paddr in the structure Elf32_Phdr

  2. New types of relocation added

  3. The file is encrypted with the KIRK cryptographic system.

  4. A header is assigned, starting with ~PSP

However, there is a non-zero number of modules (like libraries, like a regular .dll/.so), which have another header at the beginning:

Font library file in game resources

Font library file in game resources

PPSPP and the PSP loader itself in this case does the simplest thing – reads the size of this header (bytes [0x4; 0x7]), and then skip it as if it never existed. As you can see, inside this header is a line similar to the one I found in the game data (highlighted). If the PSP itself doesn't address this place, it becomes clear that we are faced with a task similar to the task of understanding a language that no one can read or speak anymore.


My reverse engineering was greatly simplified when I found quite a few games that were released with debug information (I'm not talking about logs, which are present in small quantities in almost all games, but about real “-g” debug information with line numbers, function signatures, basically DWARF in its best form).

In one of the games I saw that there was a symbol next to the line __psp_libinfo__so I started to assume that the meaning of these lines is something like the version of the lib with which the program was statically linked, and I started calling them “libinfo” (libinfo, plural “libinfos”, libinfs, with the stressed “y” sound). All PRX files with a header ~SCEwhich were at my disposal, turned out to be very truncated – not only is there no debug information or section names… There are no sections at all. However, inside each one there was exactly one libinfa, which coincided with the libinfa in the header.

I immediately came up with a plan: to collect a database of such libs, and then use Aho-Corasick to automatically go through the game files to understand what libs are present there.

We'll find out right away which SDK functions make sense to look for in the binary, and which don't! And then we'll get to the meaningful code...

We'll find out right away which SDK functions make sense to look for in the binary, and which don't! And then we'll get to the meaningful code…

It turned out in practice that files with the same name can have different libs, so I started adding a suffix to the duplicates _{номер}.

The base is a regular JSON file, in which the name is mapped to base64 from libbinfa bytes

The base is a regular JSON file, in which the name is mapped base64 from libinfa bytes

It seemed normal to me, well, really, if this is a version of the library, then it is obvious that there will be more than one libinfa. However, I did not believe in such a large number of library revisions. I can only remember where this or that entry came from in my database by referring to the logs of the script that replenishes the existing database. A friend of mine helped me with collecting information, he ran my script on PRX-s from his game storage (he has several hundred games, I feel like).

I decided it was time to crack these strings. The first idea was XOR with some key. I tried 256 keys – all wrong. I didn't try larger key sizes, because either it was unclear how to XOR, or the task was too complicated. Besides, it was obvious that these strings weren't hashes at all: they were too similar…

I mentioned the Aho-Corasick automaton earlier. If you didn't know, it's based on the “boron” data structure.

Bor, aka Trie

(By the way, from the word reTRIEval)

A bor containing the words “he,” “she,” “his,” and “hers”

A bor containing the words “he,” “she,” “his,” and “hers”

This structure allows for relatively efficient (and visual) storage of a set of strings. You can add new elements, delete them, and check whether a string is in the set. The trie is a rooted oriented tree, where the edges represent characters and the nodes represent words. The first word added to the trie is simply split into a chain of characters and attached to the root, with the last node marked as “terminal”. This is necessary so that the structure knows which words are actually contained in it. When adding a new word, the trie does the same thing, but instead of always attaching to the root, it creates a new branch at the point where it no longer has enough characters in the tree (i.e. the new word is sort of budded off in the middle of the branch). This saves space – if two words have a common prefix, it will only be stored once (see photo above). To add a word whose chain is already in the trie, we simply make the last node in it terminal.

Now it is obvious how to search for strings in a tree: we go through all the letters in the input word and try to follow the same path in the tree from the root. If there is no vertex, we immediately say that there is no string in the tree, and if we have successfully passed the entire path, we check whether the vertex is terminal. I am not interested in deletion, but it is done very simply – we search for a word, and then remove the terminal flag from the vertex if it is found. Plus or minus, we also clear the memory…

The implementation of this tree remains unclear, since each node must somehow know all of its children, and the inclusion check must be fast, so that the vector from std::unique_ptr won't work very well. If the alphabet (the set of letters that can appear in lines) is of size Nyou can store an array of at the top N pointers and char convert into an index by subtracting the first character of the alphabet… But it is not always convenient, so I used dictionaries. In addition, I added a “tag” field to all nodes to store some additional information in it.

The resulting code is as follows:

class TrieNode:
    def __init__(self):
        self.value = 0x0
        self.terminal = False
        self.children: Dict[int, TrieNode] = collections.defaultdict(TrieNode)
        self.tag: Any = None

    def __repr__(self):
        return chr(self.value)

    def __str__(self):
        return chr(self.value)


# That's the generator behind the TrieIterator class
def get_trie_words(root: TrieNode):
    if root.terminal:
        # We're a word in the Trie, let's return nothing
        yield b""

    for letter, child in root.children.items():
        suffixes = get_trie_words(child)
        yield from (letter.to_bytes(1, "little") + word for word in suffixes)


def count_trie_words(root: TrieNode):
    count = 1 if root.terminal else 0

    for child in root.children.values():
        count += count_trie_words(child)

    return count


# Constructed by the Trie.__iter__ method
class TrieIterator:
    def __init__(self, node: TrieNode):
        self.gen = get_trie_words(node)

    def __iter__(self):
        return self

    def __next__(self):
        value = next(self.gen)
        if value is None:
            raise StopIteration
        return value


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, word: bytes, tag: Optional[Any] = None):
        current = self.root
        for letter in word:
            current = current.children[letter]
            current.value = letter
        if not current.terminal:
            current.terminal = True
            if tag is not None:
                current.tag = tag

    def __contains__(self, item):
        current = self.root
        for letter in item:
            if letter not in current.children:
                return False
            current = current.children[letter]
        return current.terminal

    def __getitem__(self, item):
        current = self.root
        for letter in item:
            if letter not in current.children:
                raise KeyError(f"{item} not found in Trie!")
            current = current.children[letter]
        return current

    def __iter__(self):
        return TrieIterator(self.root)

    def __len__(self):
        return count_trie_words(self.root)

Is this code efficient? In my opinion, get_trie_words – a terrible crutch, but I have absolutely no reason to rethink the entire structure. Especially since I have no time limits for work…

I immediately decided that the bore needed to be visualized by means of Grafvisa. I won't tell you how to install it, there are quite clear instructions on the Internet. Fortunately, I have already worked with it through Python (pygraphviz allows you to emit *.gv-files), so it wasn't too hard this time. I just had to go through the tree and register the vertices (they're called “nodes” here, not “vertex”), and then do the same with the edges.

Usually, in a forest, you write symbols on the edges, and write whole words in the vertices, but I guessed that the vertices would be huge and the picture would not be very clear, so I also wrote symbols in the vertices. (Well, except for the terminal ones, of course… You still need to write words there). To make it really convenient, I gave all the terminal vertices the module name as a tag and placed it under the line. By the way, here already base64 won't help, we need to stuff our half-ASCII into the graph. I made my own renderer, because Python thinks that \x7F – printable. Maybe, DEL and in fact it is printable, but I don’t need squares in the console and empty space in the picture…

Study

Finally, all the instruments are ready, we can start the analysis! First, I decided to collect statistics. The script analyzer loads the libinf database and prints the following information on the screen:

  1. Libinf lengths

  2. Total alphabet of all libinfs

  3. Symbols found in the i-th position in libintfs

  4. Compressed form of the alphabet – in the sorted alphabet we collapse the ranges of consecutive symbols (say, a leaf ['A','B','C','D','M','X'] turns into a string "[A; D], M, X")

What about symbol frequencies?

I didn't bother with frequency analysis, because I wasn't sure that there was any language behind it. Especially since it's unclear what to compare it with… What if it's encrypted? Shift-JIS with hieroglyphs… And even if it is English, it is still unlikely to obey a known distribution, after all, the sample is very specific.

It turned out that all the lines have the same length of 0x18 bytes, and they have a common prefix of 8 bytes: \yr=`c`0. However, the last observation could have been made using the graph.

This is the trunk of my tree.
I traditionally export to SVG, because usually my graphs are VERY large and do not fit into PNG in acceptable quality...

I traditionally export to SVG, because usually my graphs are VERY large and do not fit into PNG in acceptable quality…

I would not have made further discoveries without the visual component. Looking closely at the SVG, I noticed that the libinfs corresponding to modules with the same names are grouped into subtrees. Moreover, the depth of such subtrees is 4 vertices, not counting the root.

A family of such similar, but so different libmd5.prx

A family so similar, yet so different libmd5.prx

Now I would like to consolidate the terminology: prefix libinf – first 8 bytes, body of the libintha – the next 12 bytes, suffix libinfy – last 4 bytes.

I had a hypothesis that the body is responsible for identifying the module, and the suffix is ​​for the version. Taking this into account, I generated a new SVG, where I already used the cut libinfs as words. I cut off the prefix and suffixes, and used the first one from the subtree as the module name. It became better, I would say, more minimalistic.

I was ready to give up and start packing up my scripts to be torn apart by the PSP Homebrew community, when suddenly it hit me.


Half past one in the morning… I stared with dazed eyes at the group of libraries libsfmt{цифры}when suddenly I noticed that the edges in the subtree had Latin letters written in caps. A little comparison… and I understood what it all meant!

I perceived the two as a kind of terminal symbol that is used to fill the body if there are still bytes left.

I perceived the two as a kind of terminal symbol that is used to fill the body if there are still bytes left.

That's when I realized that this was most likely Caesar cipher! In this case, however, the alphabet is a set of 256 values ​​of one byte. The first 127 are ASCII, the rest are generally nothing, but in practice, we have regional extended ASCII there… but let's not talk about sad things.

def _transform(word: bytes, shift: int):
        ans = bytearray(len(word))
        for index, letter in enumerate(word):
            ans[index] = letter - shift
        return ans.decode("ascii")

I wrote a decipherer for our cipher and ran it on all libinfs, taking the shift equal to -0x12. All the bodies were decrypted and turned out to be just shortened versions of the module names (the twos turned out to be spaces), but the prefix and suffix were not fixed. Then I assumed that they were encrypted with a different shift.

I started with the prefix… I went through all the shifts, found nothing.

Then he took all the suffixes and began to go through the shifts simultaneously for all of them.

The only shift that looked plausible was -0x14

The only shift that looked plausible turned out to be -0x14

I assumed that these were SDK versions. I went to read the PPSSPP resources and confirmed the theory! “500b” most likely means a beta version! Well, perhaps now everything falls into place – we may have few revisions of individual modules, but there were many SDKs themselves.

Then I ran the tests on the prefix again and realized that I had searched poorly…

It was all right before my eyes, but I missed it.
And the box was easy to open!

And the box was easy to open!

I shared the discovery with the creator of PPSSPP (Henrik Rydgård), after which he asked if I could PR the emulator with logging of the loading of such ~SCE modules. I smog. Here you can also see an example of fully decrypted library names (on the screen).

Conclusion

I am extremely glad that I was finally able to break through this secret that has been bothering me for several months!

  1. Firstly, I showed myself in the reverse of the PSP itself, although it would seem that there was nothing left there that was unknown and within my power.

  2. Secondly, it is now officially time to start hunting for static libraries in executables. While I was researching the games at hand, I came across Locoroco 1. There I found previously unknown supposedly standard functions (starting with “sce”, for example scesupMsiolRename). In general, there are now at least four libs about which there is no information on the Internet: supac2ms-SJ9, libwr2pt-SJ2, libmsiol-SJ5, spac2msR-PD0. Let's explore! We found a lib where the last symbol is a zero byte (and after decryption we fly away to the area 0xec), so I added a check via isprint before outputting to the console.

  3. Thirdly, my discovery provided evidence that Lib-PSP iplloader is not the official name of the PSP bootrom (and therefore on the wiki page removed Lib-PSP (from the alternative title).

If there's any interest, I'm planning to do something like a series of articles about Hydra and what to eat it with, and I also owe myself an article about ImHex.

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *