How do you tell a beginning professional from a sophisticated hobbyist in digital circuit design?


What is the main difference between an FPGA hobbyist, nostalgic in retirement for the KR580IK80, and an aspiring microarchitect aiming for a job at a cutting-edge processor company or a venture-backed startup?

In a few words: understanding the concept of a pipeline. The young professional has it; the old hobbyist does not.

You can see this clearly if you search for FPGA texts for beginners. If the text is written by a programmer who wants to try an FPGA purely for a change, he usually never gets as far as the pipeline. He blinks LEDs, talks about state machines, and perhaps starts putting together an FPGA implementation of some old 8-bit processor.

(Some of these people even write books: a certain Robert Dunne, for example, implemented a processor as a state machine with fetch / decode / execute states, but never got to the pipeline.)

All of this happens because understanding the pipeline takes real mental effort, like pressing a barbell. And a brain that has already been shaped by decades of programming in C and assembly resists, because the idea is counterintuitive to it.

But this has to be overcome, because if you come to an interview at some AMD for a junior designer position, you will not be asked how to blink LEDs or how to fit a Radio RK-86 into a Xilinx chip. You will be asked to stand in front of the interviewer and write on the whiteboard a pipelined implementation of, say, a multiply-add. Or to do it on a computer cut off from the Internet, so you can’t even google the solution. Sadistic, huh?

This is what the next lesson of the Skolkovo School of Synthesis of Digital Circuits will be about.

This lesson will take place on December 4 at the Altair Technopark MIREA
(Prospect Vernadsky, 86, bldg. 2, Metro Yugo-Zapadnaya)

Here is a live link to the stream:

Some time ago I wrote an exam-style test for students at Northwestern Polytechnic University (NPU) in Fremont, California, where I helped Timur Paltashev from AMD teach computer architecture. The test gave three implementations of an arithmetic unit that raises a number to the fifth power: combinational, sequential and pipelined.

I give the test below, preceded by an allegorical explanation, taken from another text of mine, of what you need to know to pass it:

Explaining the concepts of combinational, sequential and pipelined computing

Imagine that you need to organize the work of the military registration and enlistment office. This can be done in several ways, depending on your goals.

If the goal is to minimize the number of examination rooms, you can seat all members of the commission in one large room, which the pre-conscripts enter one at a time. This is an example of combinational computing. Its disadvantage is that examining all the pre-conscripts will take a long time, and the members of the commission will be bored. [Each of the four commission members corresponds to a multiplier in the schematic below.]

Here is how it looks as code, along with the schematic synthesized from it (before mapping, placement and routing). A combinational cascade of four multiplication operations:

module pow_5_implementation_1
(
    input  [17:0] n,
    output [17:0] n_pow_5
);

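    // Four chained multipliers. The result is truncated to 18 bits,
    // so it is exact only while n ** 5 < 2 ** 18, that is, for n <= 12.
    assign n_pow_5 = n * n * n * n * n;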

endmodule
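
The quiz questions further down assume that the output of the combinational implementation is connected to a clocked register. A minimal sketch of what such a wrapper might look like follows; the module name pow_5_implementation_1_registered and the instance name i_comb are made up for illustration and are not part of the original test.

```verilog
// Hypothetical wrapper, not part of the original test.
// Compile together with pow_5_implementation_1 above.
module pow_5_implementation_1_registered
(
    input             clock,
    input      [17:0] n,
    output reg [17:0] n_pow_5
);

    wire [17:0] n_pow_5_comb;

    // The purely combinational cascade of multipliers.
    pow_5_implementation_1 i_comb
    (
        .n       ( n            ),
        .n_pow_5 ( n_pow_5_comb )
    );

    // A single register captures the result on every clock edge.
    always @(posedge clock)
        n_pow_5 <= n_pow_5_comb;

endmodule
```

With this wrapper the whole cascade has to settle between two clock edges, which is worth keeping in mind when you compare the three implementations.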

If the goal is to minimize the number of specialists, the work can be arranged differently: a single military doctor covers several specialties (ophthalmologist, dentist, urologist), and you hire a special assistant who temporarily takes the pre-conscript out of the room while the doctor changes the equipment in it: hangs an eye chart on the wall, rolls in a dental chair (the current setup is determined by the state of the state machine). Then, at the bell (the clock signal), the assistant brings the pre-conscript back for the next stage of the examination.

This is an example of the sequential organization of computation. Microcode, popular in the 1970s, is a special case of this organization, a trick that I will not cover in this post.

Accordingly (here we have only one multiplication operation, repeated four times, but you could build an example with an ALU that performs a different operation in each clock cycle):

module pow_5_implementation_3
(
    input         clock,
    input         reset_n,
    input         run,
    input  [17:0] n,
    output        ready,
    output [17:0] n_pow_5
);

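    // One-hot shift register that counts the clocks after run:
    // the single 1 reaches bit 0 four clocks after run is sampled.
    reg [4:0] shift;

    always @(posedge clock or negedge reset_n)
        if (! reset_n)
            shift <= 0;
        else if (run)
            shift <= 5'b10000;
        else
            shift <= shift >> 1;

    assign ready = shift [0];

    // r_n holds the argument, mul accumulates the power.
    reg [17:0] r_n, mul;

    always @(posedge clock)
        if (run)
        begin
            r_n <= n;
            mul <= n;
        end
        else
        begin
            // One shared multiplier, reused on every clock.
            mul <= mul * r_n;
        end

    // Meaningful only while ready is 1: mul keeps multiplying afterwards.
    assign n_pow_5 = mul;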

endmodule
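
To see the run / ready handshake of the sequential version in action, a small self-checking testbench along the following lines could be used. It is only a sketch: the module name pow_5_implementation_3_tb, the clock period and the test value 3 are arbitrary choices, not taken from the original test.

```verilog
// Hypothetical testbench, not part of the original test.
// Compile together with pow_5_implementation_3 above.
module pow_5_implementation_3_tb;

    reg         clock   = 0;
    reg         reset_n = 0;
    reg         run     = 0;
    reg  [17:0] n       = 0;
    wire        ready;
    wire [17:0] n_pow_5;

    pow_5_implementation_3 dut
        (clock, reset_n, run, n, ready, n_pow_5);

    always #5 clock = ~ clock;

    initial
    begin
        // Hold reset for two clocks, then release it.
        repeat (2) @(posedge clock);
        reset_n <= 1;

        // Present the argument and pulse run for exactly one clock.
        @(posedge clock);
        n   <= 3;
        run <= 1;
        @(posedge clock);
        run <= 0;

        // Wait until the one-hot bit reaches shift [0].
        while (! ready)
            @(posedge clock);

        // The result is sampled in the cycle when ready is 1.
        if (n_pow_5 === 18'd243)
            $display ("PASS: 3 ** 5 = %0d", n_pow_5);
        else
            $display ("FAIL: got %0d, expected 243", n_pow_5);

        $finish;
    end

endmodule
```

Note that the unit cannot accept a new argument while it is busy: a run pulse in the middle of a calculation restarts it.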

Finally, if you have many rooms, you can organize a pipeline, which provides the highest throughput, ideally limited only by the one slowest person. In a less than ideal case, a pre-conscript may start bickering and slow down the process for everyone (a stall) or for everyone behind him (a slip). There are all sorts of out-of-order methods against this, which I will not cover in this post: our “military registration and enlistment office” (the pipeline that computes the function) will work perfectly.

In the photo above, there is only one room. Imagine instead several rooms with doctors that the pre-conscript passes through, just as the number n passes through five “rooms” (registers built from D flip-flops), in front of which the doctors sit: the four multiplication operations.

module pow_5_implementation_2
(
    input             clock,
    input      [17:0] n,
    output reg [17:0] n_pow_5
);

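    // Delayed copies of n, so that each stage multiplies by the value
    // of n that belongs to the same calculation.
    reg [17:0] n_1, n_2, n_3;
    // Intermediate powers held between the stages.
    reg [17:0] n_pow_2, n_pow_3, n_pow_4;

    always @(posedge clock)
    begin
        n_1 <= n;
        n_2 <= n_1;
        n_3 <= n_2;

        // Four multipliers, each working on a different n on every clock.
        n_pow_2 <= n * n;
        n_pow_3 <= n_pow_2 * n_1;
        n_pow_4 <= n_pow_3 * n_2;
        n_pow_5 <= n_pow_4 * n_3;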
    end

endmodule
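
The key property of the pipelined version is that a new argument can enter on every clock while earlier ones are still in flight. The testbench below is a rough sketch of how this could be checked; the module name pow_5_implementation_2_tb, the stimulus function and the number of iterations are assumptions made for the example.

```verilog
// Hypothetical testbench, not part of the original test.
// Compile together with pow_5_implementation_2 above.
module pow_5_implementation_2_tb;

    reg         clock = 0;
    reg  [17:0] n     = 0;
    wire [17:0] n_pow_5;

    pow_5_implementation_2 dut (clock, n, n_pow_5);

    always #5 clock = ~ clock;

    // Argument driven in iteration k: the values 1..7, repeating.
    function integer stimulus (input integer k);
        stimulus = (k % 7) + 1;
    endfunction

    integer i, a, errors;

    initial
    begin
        errors = 0;

        for (i = 0; i < 20; i = i + 1)
        begin
            @(posedge clock);

            // The argument driven five iterations ago has by now passed
            // all four register stages, so its fifth power is on n_pow_5.
            if (i >= 5)
            begin
                a = stimulus (i - 5);

                if (n_pow_5 !== a * a * a * a * a)
                begin
                    errors = errors + 1;
                    $display ("MISMATCH in iteration %0d: got %0d", i, n_pow_5);
                end
            end

            // A new argument enters on every clock, without waiting.
            n <= stimulus (i);
        end

        if (errors == 0)
            $display ("PASS");
        else
            $display ("FAIL: %0d mismatches", errors);

        $finish;
    end

endmodule
```

The offset of five iterations comes from the testbench itself: it drives n with a nonblocking assignment, so the pipeline samples each value one clock after the testbench writes it.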

Now you can answer for yourself the questions that I asked students at Northwestern Polytechnic University about these three designs:

7. Which implementation is likely to allow the highest maximum frequency (assuming that the outputs of the combinational implementation are connected to clocked registers)?

a) pow_5_implementation_1; b) pow_5_implementation_2; c) pow_5_implementation_3

8. Which implementation is likely to use the smallest number of gates?

9. Which implementation is likely to have the highest throughput (number of calculated pow_5 (n) results per second)?

10. Which implementation is going to have the smallest latency in clock cycles (assuming that the outputs of the combinational implementation are connected to clocked registers)?

And the crowning question:

11. The testbench instantiated all three implementations of pow_5.

module testbench;

    reg         clock;
    reg         reset_n;
    reg         run;
    reg  [17:0] n;
    wire        ready;

    wire [17:0] n_pow_5_implementation_1;
    wire [17:0] n_pow_5_implementation_2;
    wire [17:0] n_pow_5_implementation_3;

    initial
    begin
        clock = 1;

        forever # 50 clock = ! clock;
    end

    initial
    begin
        repeat (2) @(posedge clock);
        reset_n <= 0;
        repeat (2) @(posedge clock);
        reset_n <= 1;
    end

    pow_5_implementation_1 pow_5_implementation_1
        (n, n_pow_5_implementation_1);

    pow_5_implementation_2 pow_5_implementation_2
        (clock, n, n_pow_5_implementation_2);

    pow_5_implementation_3 pow_5_implementation_3
        (clock, reset_n, run, n, ready, n_pow_5_implementation_3);

    integer i;

    initial
    begin
        #0
        $dumpvars;

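        // Labels follow the argument order: implementation_2 is the pipeline,
        // implementation_3 is the sequential version.
        $monitor ("clock %b reset_n %b n %d comb %d pipe %d run %b ready %b seq %d",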
            clock,
            reset_n,
            n,
            n_pow_5_implementation_1,
            n_pow_5_implementation_2,
            run,
            ready,
            n_pow_5_implementation_3);

        @(posedge reset_n);
        @(posedge clock);

        for (i = 0; i < 50; i = i + 1)
        begin
            n   <= i & 7;
            run <= (i == 0 || ready);

            @(posedge clock);
        end

        $finish;
    end

endmodule

An engineer simulated the testbench and got the following waveform. However, he forgot the order in which he added the last three signals to the waveform. Can you determine which signal is the output of the combinational implementation, which of the sequential non-pipelined implementation, and which of the sequential pipelined implementation?

a) The order is (from upper n_pow_5 … to lower n_pow_5 …): combinational, sequential non-pipelined implementation, pipelined

b) combinational, pipelined, sequential non-pipelined implementation

c) pipelined, combinational, sequential non-pipelined implementation

d) pipelined, sequential non-pipelined implementation, combinational

e) sequential non-pipelined implementation, combinational, pipelined

f) sequential non-pipelined implementation, pipelined, combinational

If you have figured out this exercise, it will be much easier for you to understand how pipelines work in CPUs, GPUs and network chips. The principle is the same, but there are a lot of bells and whistles around it: stalls, queues, backpressure, credits and so on. Knowing all of this is exactly what people are paid for in the chip design groups at Apple, NVIDIA, Cisco, Syntacore, NIISI (the cool Russian processor Komdiv-64) and others.
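
As a taste of those bells and whistles, here is a rough sketch of the same pow_5 pipeline extended with a valid bit per stage and a global stall. The module name pow_5_pipeline_with_valid and the signals reset_n, enable, n_valid and n_pow_5_valid are hypothetical; a single global enable is only the simplest possible stall scheme, and real designs tend to use finer-grained handshakes, queues and credit counters.

```verilog
// Hypothetical extension of pow_5_implementation_2, for illustration only.
module pow_5_pipeline_with_valid
(
    input             clock,
    input             reset_n,
    input             enable,         // 0 = stall: every stage holds its data
    input             n_valid,        // n carries a real argument this cycle
    input      [17:0] n,
    output            n_pow_5_valid,  // n_pow_5 carries a real result
    output reg [17:0] n_pow_5
);

    reg [17:0] n_1, n_2, n_3;
    reg [17:0] n_pow_2, n_pow_3, n_pow_4;

    // valid [k] marks whether stage k currently holds real data:
    // bit 0 belongs to n_pow_2, bit 3 to n_pow_5.
    reg [3:0] valid;

    always @(posedge clock or negedge reset_n)
        if (! reset_n)
            valid <= 4'b0;
        else if (enable)
            valid <= { valid [2:0], n_valid };

    // The data path needs no reset: the valid bits tell which
    // stages contain garbage.
    always @(posedge clock)
        if (enable)
        begin
            n_1 <= n;
            n_2 <= n_1;
            n_3 <= n_2;

            n_pow_2 <= n * n;
            n_pow_3 <= n_pow_2 * n_1;
            n_pow_4 <= n_pow_3 * n_2;
            n_pow_5 <= n_pow_4 * n_3;
        end

    assign n_pow_5_valid = valid [3];

endmodule
```

While enable is 0, nothing moves and nothing is lost; when it returns to 1, the pipeline continues from where it stopped.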

The next lesson at the school will take place as scheduled. Here is the whole program:

  • October 30, 2021: 1. Introduction to the design flow and exercises with combinational logic.

  • November 13, 2021: 2. Architecture: the processor from the programmer’s point of view.

  • November 20, 2021: 3. Sequential logic and finite state machines.

  • November 27, 2021: 4. Analysis of the training project: recognition and generation of sounds and melodies.

  • December 4, 2021: 5. Pipelines and systolic arrays, with an application to artificial intelligence.

  • December 11, 2021: 6. Analysis of the educational project: a modular graphic game with sprites.

  • December 18, 2021: 7. Microarchitecture of a single-cycle processor.

  • December 25, 2021: 8. Microarchitecture of a pipelined processor.

  • January 15, 2022: 9. Designing the processor cache and measuring its performance.

  • January 22, 2022: 10. Building blocks and design techniques: FIFO queues and credit counters.

  • January 29, 2022: 11. Building blocks and design techniques: arbiters, banks and memory sharing.

  • February 5, 2022: 12. Trying the RTL2GDSII flow: how mass-produced chips are developed. Part I.

  • February 12, 2022: 13. Trying the RTL2GDSII flow: how mass-produced chips are developed. Part II.

  • February 19, 2022: 14. Mock interview for the position of a digital chip designer.

  • February 26, 2022: 15. Review of the mock interviews, with the awarding of incentive prizes.

Join online or in person. There are even free boards with FPGA chips (paid for by Syntacore / Core Microprocessors, Maxim Maslov and Cadence Design Systems). For details, see the post “Preparing for the Skolkovo School of Synthesis of Digital Circuits: literature, FPGA boards and sensors”.
