Journey into the world of microprocessor emulators

Background

One evening in August 2023, I was scrolling through YouTube, which was still working steadily, and I, like probably many readers, had a recommendation feed filled mainly with technical material of various kinds: “GTA 3 Code Review”, “Lua in 100 Seconds”, “How NASA writes space‑proof code”, “6502 CPU Emulator in C++“… So, stop, but the last video seems interesting!

In this playlist of 35 videos, the author with the nickname Dave Poo showed the process of developing an emulator for the MOS6502 microprocessor in C++, as the title actually says. Apart from explanatory and introductory videos, the structure of this playlist is as follows: video “tests for instruction X”, video “implementation of instruction X”. This continues until the very end, until the author has described the complete set of MOS6502 instructions.

Having an education in the field of radio electronics and unrealized engineering ambitions, I was extremely interested in this project, and since I had absolutely zero experience in emulators and assembler before, I took this task as a personal challenge. I decided to try to simply repeat the project, but I didn’t think that this topic would fascinate me so much.

Who is this emulator of yours?

In order to answer this question, you need to understand how a real processor works and how it processes commands. There will definitely not be a lecture on microelectronics here, but by the end of reading the post you will understand this process a little better, but for now I will describe it in a nutshell: the processor has registers for storing values ​​(usually temporary) and an instruction decoder that understands what the instructions need to be followed, and exactly how to do it. The processor reads N bytes that encode the instruction and executes it within M clock cycles. As a result, the contents of registers and/or values ​​in memory cells change.

Modern programmers write code in fairly high-level languages: Python, Go, Rust, Ruby, C/C++ and others, and I think it’s no secret that the processor knows nothing about these languages. In order for you to see the coveted “Hello World!”, code in any programming language, even if it is not compiled, is converted into assembler. Even the Java virtual machine, no matter how virtual it may be, still does not work in a vacuum, but with a very real processor of your computer. The assembler is then converted into machine code, and this is where the work of the decoder and all its associated entities comes into play.

So what is a microprocessor emulator? This is a program that simulates the processing of instructions by a real processor, while guaranteeing the correctness of its behavior. The emulator is loaded with machine code, which is executed exactly as it would be executed by a real processor.

There are several classifications of emulators, among which there are, for example, functional ones, focused on the accuracy of executing instructions within a given time frame, or physical ones, which precisely follow the ongoing physical processes. In this case we are talking specifically about a functional emulator.

The journey begins

First of all, open ISA (Instruction Set Architecture). This is the instruction specification for a given processor. I followed Dave Poo and for the first time entered the coveted phrase “MOS6502 ISA” into the search engine. This is where my journey into the world of 8-bit computers began.

The minimum required functionality for our emulator is as follows: reading/writing to memory and decoding/executing instructions. You need to start with the skeleton: memory and processor. I implemented the Memory memory structure as a wrapper over an array (not even a std::vector), which provides an interface for read/write operations, so I don’t see much point in providing the implementation. The structure of the processor, in turn, completely replicates its “iron” counterpart:

struct MOS6502 {  
    WORD PC;                // Program Counter  
    BYTE SP;                // Stack Pointer (+ 0x100 offset)  
    BYTE A;                 // Accumulator  
    BYTE X;                 // X Register  
    BYTE Y;                 // Y Register  
    MOS6502_Status Status;  // Status Register
}

struct MOS6502_Status {  
    BYTE C  :1;             // Carry Flag  
    BYTE Z  :1;             // Zero Flag  
    BYTE I  :1;             // Interrupt Disable  
    BYTE D  :1;             // Decimal Mode  
    BYTE B  :1;             // Break Command  
    BYTE NU :1;             // Not Used  
    BYTE V  :1;             // Overflow Flag  
    BYTE N  :1;             // Negative Flag
}

The processor registers are described here, including the current instruction pointer register and the stack pointer register, as well as the status register. If you look at the Wikipedia page, we will see that the register scheme is identical to the written code, and differs only in the order:

BYTE And WORD – aliases for uint8_t And uint16_t respectively. Just for convenience.

The structure of the MOS6502 processor has not been fully disclosed. Details will be added as the story progresses.

A bit of assembler

Before we move on to the implementation details, it’s worth going back a little to the processor and looking at the operation of a specific instruction in order to better understand the mechanisms of interaction between the processor and registers and memory.
It is worth noting that each instruction has its own code, which can be found in the ISA. Moreover, each instruction has its own signature, which determines the number of arguments.
As an example, let's take something quite simple, for example, the AND instruction, which, as you can guess from the name, implements the “Logical “AND” operation, but if you look in ISA, you will see that, in fact, there is more than one variation entries for this instruction:

  1. AND

  2. AND LL

  3. AND LL,X

  4. AND HHLL

  5. AND HHLL,X

  6. AND HHLL,Y

  7. AND(LL,X)

  8. AND (LL),Y

An untrained reader may wonder why there are as many as eight recording variations for one operation. You shouldn't run ahead of the locomotive, these are just different addressing modes that communicate where take data.

In fact, eight addressing modes is not enough. Modern processors have dozens of different ways to access one or another memory location.

Description of the instructions taken from very convenient sitestates that “the instruction performs a bitwise logical “AND” of register A (accumulator) with a byte of data from memory,” thus, a programmer from back in 1975 (or a peerless Ben Eater) there are as many (or only) eight ways to choose where exactly to get the number with which the “Logical “AND” operation will be applied to register A. Each type of addressing serves its own tasks: some make it possible to conveniently work with data arrays, others allow you to use fast , the so-called zero page of memory, others allow you to work with pointer tables, but let's move on to examples…

Let's get specific

There are some problems that need to be dealt with from the end. While working on the project, I, unfortunately, did not use this method, but I have the opportunity to present the material correctly, at least on paper. Using the AND instruction as an example, we will begin to dive into the operation of the emulator and smoothly, you won’t even notice how, we will move on to the intricacies.

So, we need the operation “Logical “AND” of register A with a byte of data, so we’ll write it:

void GenericAND(MOS6502 &cpu, const BYTE value) {  
    cpu.A &= value;
}

That's all! Did you think it would be something more complicated? That's right, it was not in vain that we talked about addressing modes.

AND

The so-called Immediate mode, also known as “mode with an immediate operand”, when the operand (data byte) is located immediately after the instruction code, works only with constants, does not work with variables:

In code, this mode looks like this:

AND #$0A ; знак "#" используется как маркер Immidiate-режима

And in the emulator it is implemented like this:

void MOS6502_AND_IM(Memory &memory, MOS6502 &cpu) {  
    const BYTE value = cpu.FetchByte(memory);  
    GenericAND(cpu, value);  
}

Method FetchByte returns the next byte of data after the instruction, which is passed to GenericAND. With this addressing mode, the instruction is executed in 2 machine cycles and occupies 2 bytes of memory (instruction code + data byte).
Pretty straightforward, isn't it? Let's move on!

AND LL

Mode with reading from the “zero page” of memory (ZeroPage). With this addressing mode, immediately after the instruction there is also a byte, but this is no longer the value of the operand for the “AND” operation, but address of this operand in page zero. Similar to accessing a variable on the stack.

The zero page is the first page of memory with addresses 0x00-0xFF. This mode is the most time-efficient (after Immidiate), because the address value occupies one byte (instead of two in absolute addressing mode), which means that only one additional read operation is required to access the final data. For this reason, frequently used data is usually stored here.

This addressing mode in code will look like this:

AND $03

And in the context of the emulator like this:

void MOS6502_AND_ZP(Memory &memory, MOS6502 &cpu) {  
    const BYTE value = cpu.GetZeroPageValue(memory);  
    GenericAND(cpu, value);  
}

And again two lines… Although this time instead FetchByte there is a function GetZeroPageValue. Don't let this scare you, because under the hood it does exactly what is shown in the picture above: reads a byte address and gets the value at that address:

BYTE GetZeroPageValue(const Memory &memory) {  
    const BYTE targetAddress = FetchByte(memory);  
    return ReadByte(memory, targetAddress);  
}

Since an additional read operation from memory has been added, in this addressing mode the instruction is already executed for 3 cycles. It still takes up 2 bytes in memory (instruction code + byte address). Data in zero page not included in the deal do not participate in the calculations.

AND LL,X

The same zero page addressing mode, but with indexing by the X register (ZeroPage,X). This means that the offset from the X register will be additionally added to the address of the zero page. This mode, by analogy with the previous one, is somewhat reminiscent of an array on the stack. In such a situation, you just need to increment the X register, repeating the same instruction.

In assembler, this addressing mode will be written like this:

AND $04,X

In the emulator, as you might guess, it is also slightly different from the previous version:

void MOS6502_AND_ZPX(Memory &memory, MOS6502 &cpu) {  
    const BYTE value = cpu.GetZeroPageValue(memory, cpu.X);  
    GenericAND(cpu, value);  
}

Method GetZeroPageValue in addition to obtaining the zero page address, performs the described offset operation on the X register:

WORD GetZeroPageAddress(const Memory& memory, const BYTE offsetAddress) {  
    const BYTE targetAddress = FetchByte(memory);  
    cycles++;  
    return targetAddress + offsetAddress;  
}  
  
BYTE GetZeroPageValue(const Memory& memory, const BYTE offsetAddress) {  
    const BYTE targetAddress = GetZeroPageAddress(memory, offsetAddress);  
    return ReadByte(memory, targetAddress);  
}

Someone has already noticed the mysterious increment cycles? This increment emulates an additional machine cycle for adding the address to the X register.

We'll talk more about the cycle counter a little later, but I'll describe its purpose in a nutshell: I largely strive to comply with the ISA, which, among other things, specifies the number of cycles per instruction. I know for sure that, for example, the AND instruction in Immediate mode is executed in 2 cycles: 1 cycle for Fetch and 1 cycle for decoding. In tests, along with the result of the operation, I also check the number of cycles spent. If everything matches, it means that the instructions were implemented correctly, and if not, an error has crept in somewhere.

In this embodiment, the instruction still occupies 2 bytes in memory (instruction code + byte address), but 4 cycles are executed, due to the additional cycle for the offset.
Let's move on, there's not much left!

AND HHLL

Absolute addressing mode (Absolute). In this mode, the operand is a 2-byte address from which a data byte will be read to perform the AND operation. In this mode, you can obtain data from any memory location, since 2 bytes are involved in addressing, however, this instruction takes longer to execute. It looks like accessing a variable on the heap.

In code, this addressing mode looks like this:

AND $2000

Implementation in the emulator should no longer raise unnecessary questions:

void MOS6502_AND_ABS(Memory &memory, MOS6502 &cpu) {  
    const BYTE value = cpu.GetAbsValue(memory);  
    GenericAND(cpu, value);  
}

… just like access to the address:

BYTE GetAbsValue(const Memory& memory) {  
    const WORD targetAddress = FetchWord(memory);  
    return ReadByte(memory, targetAddress);  
}

With this addressing mode, the instruction occupies 3 bytes in memory (instruction code + 2 address bytes) and is executed within 4 cycles.

AND HHLL,X / AND HHLL,Y

I decided to describe these two instructions in one place because they use a common mechanism of action. As in ZeroPage,Xthe offset from the index register is added to the absolute address. Works like accessing an array on the heap.

The assembly form of the instruction in this mode looks like this:

AND $2000,X
или
AND $2000,Y

Implementation in C++ should also not raise any questions:

void MOS6502_AND_ABS(Memory &memory, MOS6502 &cpu, BYTE affectingRegister) {  
    const BYTE value = cpu.GetAbsValue(memory, affectingRegister);  
    GenericAND(cpu, value);  
}  
  
void MOS6502_AND_ABSX(Memory &memory, MOS6502 &cpu) {  
    MOS6502_AND_ABS(memory, cpu, cpu.X);  
}  
  
void MOS6502_AND_ABSY(Memory &memory, MOS6502 &cpu) {  
    MOS6502_AND_ABS(memory, cpu, cpu.Y);  
}

Memory access methods are trivial:

WORD GetAbsAddress(const Memory& memory, const BYTE offsetAddress) {  
    const WORD absAddress = FetchWord(memory);  
    const WORD targetAddress = absAddress + offsetAddress;  
    if (IsPageCrossed(targetAddress, absAddress))  
        cycles++;  
    return targetAddress;  
}

BYTE GetAbsValue(const Memory& memory, const BYTE offsetAddress) {  
    const WORD targetAddress = GetAbsAddress(memory, offsetAddress);  
    return ReadByte(memory, targetAddress);  
}

The only interesting detail here is the check IsPageCrossed. In absolute offset addressing mode, it is necessary to additionally check whether the memory page (0xXX00 – 0xXXFF) has been crossed, and if so, an extra cycle is spent. Thus, the instruction occupies 3 bytes in memory, and 4 cycles are executed (+1 if a memory page was traversed).

AND(LL,X)

This addressing mode is called Indexed Indirect. He is very similar to ZeroPage,X – we read the byte address of the zero page in exactly the same way, we also add to it the offset from the X register, but here another level of indirection is added – from the received address in the zero page we read the next address, from which we will already obtain the value of the operand. The picture makes me think of an array of pointers that is on the stack.

Let's repeat here in words: read the data byte after the instruction, add the value of the X register to the received byte, read two bytes of the absolute address at the received address and take the value from the received address for use in the “AND” operation. It's getting a little tricky, but don't get discouraged, we're almost done.
The assembler looks harmless:

AND ($04,X)

Implementation, as always, in two lines:

void MOS6502_AND_INDX(Memory &memory, MOS6502 &cpu) {  
    const BYTE value = cpu.GetIndXAddressValue(memory);  
    GenericAND(cpu, value);  
}

Under the hood it’s a little more complicated:

WORD GetIndXAddress(const Memory& memory) {  
    const BYTE targetAddress = FetchByte(memory) + X;  
    cycles++;  
    return ReadWord(memory, targetAddress);  
}

BYTE GetIndXAddressValue(const Memory& memory) {  
    const WORD targetAddress = GetIndXAddress(memory);  
    return ReadByte(memory, targetAddress);  
}

In memory, such an instruction takes two bytes (instruction code + byte address), but is executed in as many as 6 cycles. Additional levels of indirection come at a price.

AND (LL),Y

Mode Indirect Indexed It’s not just that it differs from the previous mode only in the order of words – it has a slightly different algorithm of actions. To be specific, the offset is added not to the first byte read, but to the address read from the zero page. Well, register Y is involved here, not X. Indexed Indirect I drew an analogy with an array of pointers, but here it would be more correct to talk about a pointer to an array, and the Y register acts as an index.

In assembler this mode looks like this:

AND ($04),Y

And in the emulator code… well, as always:

void MOS6502_AND_INDY(Memory &memory, MOS6502 &cpu) {  
    const BYTE value = cpu.GetIndYAddressValue(memory);  
    GenericAND(cpu, value);  
}

When calculating the address, memory page traversal is also taken into account:

WORD GetIndYAddress(const Memory& memory) {  
    const BYTE ZeroPageAddress = FetchByte(memory);  
    const WORD EffectiveAddress = ReadWord(memory, ZeroPageAddress);  
    const WORD TargetAddress = EffectiveAddress + Y;  
    if (IsPageCrossed(TargetAddress, EffectiveAddress))  
        cycles++;  
    return TargetAddress;  
}

BYTE GetIndYAddressValue(const Memory& memory) {  
    const WORD TargetAddress = GetIndYAddress(memory);  
    return ReadByte(memory, TargetAddress);  
}

Thus, the instruction takes up the same two bytes (instruction code + byte address) and is executed in 5 cycles. Here, a check for crossing a memory page is also necessary, so 1 more cycle is added to the 5 cycles if a memory page has been crossed.

So we have looked at all the addressing modes for the AND instruction. It’s not in vain that I specify what exactly it is for, since there are several more modes that are used in other instructions. I will describe them briefly, without details:

  • Implicit – a mode in which the instruction already contains all the necessary data and does not require operands. Such instructions include, for example, instructions CLC – resetting the overflow flag.

  • Accumulator – addressing mode, which allows you to transfer the accumulator register as an operand. For example, this mode is in the instructions LSR – bitwise shift to the right.

  • ZeroPage,Y – different from the one described above ZeroPage,X only the register that is used as an index.

  • Relative – a special mode that is used in branch instructions and allows you to set the relative offset, including using labels. For example, BEQ LABEL means “go to the LABEL if the Zero flag is set”, and the jump can be done both forward and backward.

  • Indirect – addressing mode, which is used only in instructions JMP. Its essence lies in the fact that an absolute address is passed as an operand, from which another absolute address is read. For example, if 0xCD and 0xAB are written to addresses 0x4000 and 0x4001, respectively, then after executing the instruction JMP ($4000) program execution will continue at address 0xABCD (don't forget about the reverse byte order).

Using the AND instruction as an example, I wanted to show what building blocks the code in the project consists of. I hope I succeeded. Using this approach, I had to implement 55 instructions, each of which had from 1 to 7 addressing modes, resulting in an impressive set of 151 functions.

Note on Implementation Instructions

There are a number of “hidden” entities in the code. We have already seen one of them – the cycle counter. I promise that we will cover this topic in more detail very soon, but before we move on to the next stage of development, we need to mention one more such entity, which is called the “status register”.
This register was already mentioned at the beginning, when I just moved on to the description of the code base, but just in case I’ll duplicate it:

struct MOS6502_Status {  
    BYTE C  :1;             // Carry Flag  
    BYTE Z  :1;             // Zero Flag  
    BYTE I  :1;             // Interrupt Disable  
    BYTE D  :1;             // Decimal Mode  
    BYTE B  :1;             // Break Command  
    BYTE NU :1;             // Not Used  
    BYTE V  :1;             // Overflow Flag  
    BYTE N  :1;             // Negative Flag
}

This is a bit field structure. It, like in a real processor, takes 1 byte. Each bit in this register is responsible for the state of the processor and the result of the last operation. The names of the flags are disclosed in the code comments, but I will list a few of them:

  • Carry Flag – as a result of calculations in the last operation, the most significant bit was carried.

  • Zero Flag – the calculation result of the last operation is zero.

  • Interrupt Disable – state of the interrupt handler.

  • Decimal Mode – Operating mode BCD.

  • etc. Using information from site already known to usyou can find out which flags change their state as a result of the execution of the instruction. If we look at the AND instruction, we will see that the flags can change Zero (the result is zero) and Negative (the result is negative). In the previous section I showed the implementation of a generic function GenericANDwhich consisted of one line. Remember? I deceived you. In fact, it consists of two lines, and I deliberately left out this fact because I wanted to keep the flow of the story. I present to your attention the function GenericAND in her true form:

void GenericAND(MOS6502 &cpu, const BYTE value) {  
    cpu.A &= value;  
    cpu.Status.UpdateStatusByValue(cpu.A, MOS6502_Status_Z | MOS6502_Status_N);  
}

The function hasn't changed much. It added a call to a method for updating the status register based on the input value. In the AND instruction, the result is stored in the A register, so it is passed as the first argument. The second argument is a mask of the flags we are interested in. Well, the implementation, frankly speaking, is straightforward:

void UpdateStatusByValue(const BYTE &targetRegister, const BYTE mask) {  
    if (mask & MOS6502_Status_Z)  
        Z = (targetRegister == 0);  
    if (mask & MOS6502_Status_N)  
        N = (targetRegister & MOS6502_Status_N) > 0;  
}

You see that there is built-in processing of exactly two flags? Perhaps there is an architectural hole here, but there is a logical explanation for this. Having only the result of the calculation in hand, you can set only these two flags. The remaining flags have two paths:

  • they are set and reset without checks: for example, instructions CLC forcefully resets the flag Carry.

  • they need additional calculations: for example, flags Overflow And Carry in addition instructions (ADC) and subtraction (SBC) are calculated in the implementation and set manually. In these cases, the following function is used:

void SetStatusFlagValue(const BYTE statusFlag, const bool value) noexcept {  
    if (value)  
        *(BYTE *) (this) |= statusFlag;  
    else  
        *(BYTE *) (this) &= ~statusFlag;  
}

Yes, perhaps casting this to BYTE* is not the best idea, but we'll leave that for comments 🙂

About tests and subtleties

Dave Poo I used test-driven development in my videos. I didn’t use this approach in my home projects, so for the sake of variety I decided to go the same route.
The project uses GoogleTestand the test looks like a small memory snapshot, on the basis of which hypotheses are made about the final state of the processor after executing the compiled microprogram.
Since we started looking at the instructions ANDthen let's look at the tests associated with it. Here is the code to test the instructions AND ABS:

class MOS6502_ANDFixture : public MOS6502_TestFixture {  
public:  
	...
    void AND_ABS_CanDoAND(BYTE initialValue, BYTE memoryValue) {  
        // given:  
        cpu.A = initialValue;  
        mem[0xFFFC] = 0x00;  
        mem[0xFFFD] = 0xFF;  
        mem[0xFF00] = AND_ABS;  
        mem[0xFF01] = 0x80;  
        mem[0xFF02] = 0x44;  
        mem[0xFF03] = STOP_OPCODE;  
        mem[0x4480] = memoryValue;  
  
        cyclesExpected = 4;  
  
        // when:  
        cyclesPassed = cpu.Run(mem);  
  
        // then:  
        EXPECT_EQ(cpu.A, memoryValue & initialValue);  
        CheckCyclesCount();  
    }
	...
}

...
  
TEST_F(MOS6502_ANDFixture, AND_ABS_CanDoAND) {  
    AND_ABS_CanDoAND(0xFF, 0xF);  
}  
  
TEST_F(MOS6502_ANDFixture, AND_ABS_CanAffectZeroFlag) {  
    AND_ABS_CanDoAND(0xFF, 0x0);  
    EXPECT_TRUE(cpu.Status.Z);  
}  
  
TEST_F(MOS6502_ANDFixture, AND_ABS_CanAffectNegativeFlag) {  
    AND_ABS_CanDoAND(0xFF, 0xFF);  
    EXPECT_TRUE(cpu.Status.N);  
}

...

Clarification is clearly needed here.
Method AND_ABS_CanDoAND V Fixture-class contains the body of the test and the same microcode that we talked about just above. The method is needed simply to avoid duplication of code in similar tests with different input parameters – I think this does not raise any questions.
As arguments to the method, the initial value of register A and the value in memory with which the Logical “AND” operation will be performed are passed. I will say in advance that I am not particularly versed in the topic of writing tests, but I tried to design them using GWT.

Section given

In this section we set the initial state of the system.
First of all, let's write down the initial value of register A, and if there are no questions here, then we cannot avoid them further, because we are moving on to the “memory impression”, and this part needs to start with a little explanation.

The first two addresses were not chosen at all by chance, and here’s why: according to the documentation, at the moment power is supplied to the microprocessor, it reads the starting address of the program from a strictly defined address – this FFFC/FFFDaccordingly, the first thing we do is write into these cells any address we like, from which the execution of the program will begin. This is done in order to maintain compatibility with real programs.

  1. Let's write it in FFFC/FFFD meaning 00FFwhich will give us the address FF00 (remember about little endianness?).

  2. By address FF00 is written AND_ABSthis is the instruction code (implemented as an enum) that has the meaning 0x2D.

  3. Next in FF01/FF02 goes the address from where the value for the “AND” operation will be read, written in reverse order, this is 0x80 And 0x44 accordingly (forms the address 4480).

  4. To the address 4480 the value of the operand for the instruction is written, the second argument passed to the method is substituted here.

  5. Next in FF03 a mysterious one is coming STOP_OPCODE.

At the very end, we set the value of the expected number of cycles, with which we will subsequently compare the actual number of cycles produced by our emulator.

Wait, what is this STOP_OPCODE? In fact, there is no such thing. This is a kind of trick designed to stop the emulator from working. It so happened that there were empty spaces in the instruction table for the MOS6502, and without thinking twice, I reserved one of them for myself, as a signal that the emulator should finish its work.

As you can see, there are quite a lot of empty cells here. There is also an extended list, which already occupies the entire table, and in it the main set of instructions is supplemented with so-called “illegal” instructions. No matter how hard I tried, I could not understand what this meant, but at least I learned that such instructions are supported only by some specific chips, so at the moment I have not implemented their support.

When section

This section lists the actions that will change the state of the system. In our case, this is one single line with a method call Run for our processor:

...
// when:  
	cyclesPassed = cpu.Run(mem);
...

This method, as you might guess, starts the emulator and returns the actual number of cycles spent executing the program. We touch on the loop counter again, but it, like the method itself, Runwe'll look at it a little later, but for now let's just take it for granted. I ask this for the last time, honestly!

Then section

Here the final state of the processor is checked. In the generalized method AND_ABS_CanDoAND The correctness of the operation result and the equality of the expected and actual number of cycles are checked. Individual tests test for additional conditions. So, for example, in the test AND_ABS_CanAffectZeroFlag with arguments 0xFF And 0x00where the result is zero, the state of the flag is checked Z in the status register, which obviously should be equal to one.

Code Execution and Loop Counter

Congratulations! You've read to the point where everything gets even worse. more difficult more interesting! So let's start simple.

Cycle counter

It's no secret that the processor needs time for each operation. This time is expressed in machine cycles, which, in turn, consist of clock cycles. A non-obvious fact is that there is an unfixed number of clock cycles in one machine cycle. Why is this so? Because each primitive operation is described in one machine cycle, but each such operation requires a different number of clock cycles. Writing a byte to a register is a primitive operation, just like writing a byte to memory. They take one machine cycle, but a different number of clock cycles. Somehow…

You didn’t see the counter in the MOS6502 structure code, although I clearly refer to it as a field, did you notice? Yes, that's right. The code has already undergone a number of significant changes, and the structure of the microprocessor has become a class inherited from the base implementation, and our U32 cycles variable is already there.

This counter was introduced to somehow control the number of cycles, and, as expected, it is incremented in each elementary operation. Here are a few trivial examples:

BYTE FetchByte(const Memory &memory) {  
    const BYTE Data = memory[PC++];  
    cycles++;  
    return Data;  
}  
  
WORD FetchWord(const Memory &memory) {  
    const BYTE Lo = FetchByte(memory);  
    const BYTE Hi = FetchByte(memory);  
    return Lo | (Hi << 8);  
}  
  
BYTE ReadByte(const Memory &memory, const WORD address) {  
    const BYTE Data = memory[address];  
    cycles++;  
    return Data;  
}

I think explanations are unnecessary here. As already mentioned, each primitive operation is accompanied by an increment of the cycle counter. Out-of-order incrementation in addressing modes has already been demonstrated. Also, if during the execution of the instruction an additional action is performed, then I also increment the counter locally.
For example, an additional increment in a bit shift operation:

void MOS6502_LSR_ACC(Memory &memory, MOS6502 &cpu) {  
    const bool Carry = cpu.A & 1;  
    cpu.A <<= 1;  
    cpu.cycles++;  
    cpu.Status.UpdateStatusByValue(cpu.A, MOS6502_Status_Z | MOS6502_Status_N);  
    cpu.Status.C = Carry;  
}

And here is another example from a set of instructions for setting a specific status register flag:

void GenericSE(MOS6502 &cpu, const BYTE statusFlag) {  
    cpu.Status.SetStatusFlagValue(statusFlag, true);  
    cpu.cycles++;  
}  
  
void MOS6502_SEC_IMPL(Memory &memory, MOS6502 &cpu) {  
    GenericSE(cpu, MOS6502_Status_C);  
}  
  
void MOS6502_SED_IMPL(Memory &memory, MOS6502 &cpu) {  
    GenericSE(cpu, MOS6502_Status_D);  
}  
  
void MOS6502_SEI_IMPL(Memory &memory, MOS6502 &cpu) {  
    GenericSE(cpu, MOS6502_Status_I);  
}

Everything here is quite logical, but there are a couple of nuances that are worth mentioning.
There are a number of instructions that do not seem to require additional machine cycles, at least judging by the ISA and the official technical documentation. Remember the implementation of the instructions AND? There is no increment. You may not yet have enough knowledge to agree or disagree with me, but I will explain with an example:
Instructions AND V Immediate mode takes 2 machine cycles: 1 cycle for reading and decoding the instruction itself, and 1 cycle for reading a byte from memory. All! The cycles, judging by the documentation, have ended. Where is the loop for logical “AND” or at least for writing the result to a register? Another example is the comparison instruction CMP. The number of machine cycles spent in it is exactly the same as in the instructions ANDalthough in CMP For comparison, subtraction is used, which seems to be not free…

Theoretically, I can imagine an engineering solution in which the value is written to the register simultaneously with the calculation, but I have not found confirmation of this guess.

However! There are situations when cycles not enough! If we were able to develop, albeit theoretical, a solution for situations where the instruction does not seem to waste cycles, then there are at least two related places in the project where overspending cycles. I struggled with this problem for a long time, but could not find a solution. Let me show you what's going on here.
There is an instruction to call a software interrupt and an instruction to return from the subroutine of this interrupt:

void MOS6502_BRK_IMPL(Memory &memory, MOS6502 &cpu) {  
    cpu.PushProgramCounterToStack(memory);  
    cpu.PushStatusToStack(memory);  
    cpu.PC = cpu.ReadWord(memory, 0xFFFE);  
    cpu.Status.SetStatusFlagValue(MOS6502_Status_B, true);  
    cpu.cycles--; // temporary fix extra cycle  
}  
  
void MOS6502_RTI_IMPL(Memory &memory, MOS6502 &cpu) {  
    cpu.PopStatusFromStack(memory);  
    cpu.PC = cpu.PopAddressFromStack(memory);  
    cpu.Status.SetStatusFlagValue(MOS6502_Status_B, false);  
    cpu.cycles--; // temporary fix extra cycle  
}

Have you already seen this mysterious decrement with the corresponding comment? All my attempts to get to the bottom of the truth turned out to be a complete failure. I double-checked all the called functions in the hope that I just missed something somewhere, but everything is set correctly everywhere. It doesn’t add up, and that’s it! As they say, “let’s write it in the backlog.”

Why, in general, was the cycle counter made? At the moment, this counter is just a counter that is given to us after the program runs, and in the current implementation it is needed only for tests, and the emulator itself runs at full capacity of the host processor. In the future, I want to transfer the entire emulator to a clocking system so that I can run programs on it that rely heavily on timings, for example, games. You know that MOS6502 was the basis of many set-top boxes?

Code Execution

Before you is the tidbit of the presented epic. We've taken apart every little block that makes up the emulator, looked at the concepts around which all functionality is built, seen inaccuracies and shortcomings, but there's still some feeling of inferiority. It’s as if everything works well individually, but it’s unclear how it’s connected to each other. Have I described your feelings correctly?

Let's finally look at the method, which is the most important in the entire emulator, because it is it that implements the operation of the microprocessor:

U32 MOS6502::Run(Memory &memory) {  
    PC = (memory[0xFFFD] << 8) | memory[0xFFFC];  
  
    bool DecodeSuccess;  
  
    do {  
        const BYTE opCode = FetchByte(memory);  
        DecodeSuccess = DecodeInstruction(opCode, memory, *this);  
    } while (DecodeSuccess);  
  
    cycles--;       // revert fetch cycle  
    PC--;           // revert extra PC increment of last fetch
    return cycles;  
}

We already know that the processor first reads the starting address from FFFC/FFFD and adds the resulting value to the instruction pointer register PC. After this an endless loop begins Fetch->Decodewhich aborts if decoding fails.

Method FetchByte We've already seen it, but I'll repeat it just in case:

BYTE FetchByte(const Memory &memory) {  
    const BYTE Data = memory[PC++];  
    cycles++;  
    return Data;  
}

What happens here is that we read a byte from the address which is stored in the register PCafter which we increment the register and return the read byte. Thus, with each fetch we will move forward, which ensures sequential reading of the code.

Yes, reading occurs only forward, but no one has canceled JMP type jump instructions, so if we started execution from address FFF0, this does not mean that in 16-bit memory, where addresses end in FFFF, only 16 execution cycles are available to us .

The read byte is sent to the decoder, but before we get to that, it's worth explaining what a “decoding error” means. Remember STOP_OPCODE? It is needed precisely to monitor this situation: the emulator will work either indefinitely (if so intended by the loaded program) or until it encounters STOP_OPCODE (or any other opcode that does not match any instructions).
When exiting the loop, we roll back the last unsuccessful fetch and return the number of loops. Well, now the cherry on the cake!

Decoder

Do you have any idea how a decoder works? This is a kind of black box, to which an instruction code (number) is given as input, and inside this instruction itself is executed. It sounds easy. What if you figure it out?

In most emulator projects that I have seen, the decoder is made using switch-case. Surprising? I was surprised. For the MOS6502, I can still somehow imagine such an implementation, because there are only 151 instructions, taking into account all modes. Is it worth trying to convey my emotions when I saw a switch decoder in the Intel 8086 emulator, in which, taking into account the addressing modes, several thousand instructions are obtained ? I’m afraid to even imagine how it’s possible to support such a huge number of lines of rather tricky code!
Dave Poo in his videos was no different from the others and also started with a switch-case implementation. At first I followed him, and then suddenly I realized that I didn’t like this approach. I imagined the difficulties of debugging, the amount of code stored in one place, and decided to develop my own approach to solve this problem.
I didn't think long about the implementation, and my version survived exactly one iteration. It’s not hard to guess what the first one was, because then I thought like this: “I have a number, and this number corresponds to a function.” What might be the first naive solution? Of course, a hash table. That evening I proudly wrote the following:

std::map<BYTE, std::function<void(MOS6502&,Memory&)>> InstructionTable = { ... };

Do I need to describe the disadvantages of this solution? Speed, memory footprint, overhead of unrolling std::function… Now let's rework my first thought: we have not just a number, but a very specific list of numbers, each of which corresponds to a very specific function. Have you already guessed what the second and final iteration became? Challenge table!
For a year now this design has looked like this:

static void MOS6502_INVALID_OP(Memory&, MOS6502&) {}

using OpSignature = void (*)(Memory&, MOS6502&);

constexpr static OpSignature Ops[] =  
        {  
#ifndef ADD_CALL  
#   define ADD_CALL(call) MOS6502_##call  
#   include "MOS6502/MOS6502_OpCodesList.h"  
#   undef ADD_CALL  
#endif  
        };

In order:

  • function MOS6502_INVALID_OP is a placeholder for all unused opcodes, including STOP_OPCODE.

  • OpSignature – an alias for a pointer to a function that accepts a memory reference and a processor reference.

  • constexpr static OpSignature Ops[] – call table (array of function pointers).

Of course, I didn’t want to manually fill a table with 256 records, so I wrote a macro that generates this table for me at the preprocessor stage. The macro picks up the file OpCodesList and unfolds ADD_CALL in the array declaration. The file itself looks like this:

ADD_CALL(BRK_IMPL),   ADD_CALL(ORA_INDX),   ADD_CALL(INVALID_OP), 
ADD_CALL(INVALID_OP), ADD_CALL(INVALID_OP), ADD_CALL(ORA_ZP),   
ADD_CALL(ASL_ZP),     ADD_CALL(INVALID_OP), ADD_CALL(PHP_IMPL),
ADD_CALL(ORA_IM),     ADD_CALL(ASL_ACC),    ADD_CALL(INVALID_OP),    
ADD_CALL(INVALID_OP), ADD_CALL(ORA_ABS),    ADD_CALL(ASL_ABS),
...

Of course, I didn’t want to fill out this file manually either. It is filled in by a script written in Pythonwhich in turn parses the file with enumwhere the opcodes are listed. In general, I made my work as easy as my imagination allowed.

Where is the decoder itself? Yes, here he is!

bool DecodeInstruction(const BYTE opcode, Memory &memory, MOS6502 &cpu) {  
    const auto &instruction = Ops[opcode];  
    if(opcode == STOP_OPCODE || instruction == MOS6502_INVALID_OP)  
        return false;  
    instruction(memory, cpu);  
    return true;  
}

I hope no one now wonders why I wrote all these explanations. Yes, the decoder takes up five lines of code, but now we perfectly understand the purpose of each of these five lines: we took the pointer to the function (instruction) that interests us by index; determined whether the instruction code is valid; followed the instructions.
After reading all this text, it may seem that the decoder, the core of the emulator, looks terribly elementary, and this is exactly what I wanted. It's time to move on to the afterword.

Current status and plans

Now I am actively trying to attract caring people to the project, but due to the specifics of the direction, I am obviously experiencing some problems with this.
Some people are not interested in the project in principle, and I understand such people perfectly: it is difficult to be interested in bits and bytes in the era of metaprogramming. Others are interested, but cannot cope with the high enough entry threshold – you need to spend a certain amount of time, attention and effort to get involved in the project without losing interest. There are those who provide one-time assistance in this or that issue, but still my goal is to gather a community, albeit a small one, around the project.
At the moment (early October 2024), the project has been in development for just over a year. The project repository is located on the home GitLab with mirroring in GitHubthere is a small Trello board where tasks are jotted down.
I am working on the project in my free time from my main job, and this is the state of the project now:

  • MOS6502: full set of instructions, test coverage, running small programs (for example, calculating CRC 8/16/32 checksum), ability to run your programs using vasm.

  • I8080: full set of instructions, test coverage.

  • I8086: work has begun on the implementation of instructions, the decoder and processing of addressing modes are ready.

The development plans are ambitious, and allow for vertical growth (increasing the functionality of processor modules and building on their basis full-fledged systems such as PPU), and horizontally (adding modules of new microprocessors such as Motorolla 68000 or Zilog Z80), while I consider one of the key growth goals to be the development of a method for interactive interaction with the emulator, i.e. emulation of hardware input/output to run, for example, a BASIC interpreter.

This concludes my epic. Thank you for your time. I'm waiting for everyone in the comments! Telegram contacts: @dimanchique, and of course, good old mail.

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *