1. Introduction
  2. Testing
    1. Test Case Generation
    2. Bluespec Testbench
    3. Dummy Implementation
  3. Designs
    1. First Implementation
    2. Radix-2 Signed Multiplication
    3. Radix-4 Multiplication
    4. Radix-8 Multiplication
    5. Radix-16 Consideration
  4. Reduced Adder Design for Radix-8
    1. Lower Area, Higher Delay
    2. Experimentation
    3. Exercise for the Reader
  5. Built-in Multiplier
  6. Next Time

Introduction

Let’s write a simple multiplier that we can test and eventually put into our processor. The eventual aim is to implement the RISC-V “M” Standard Extension for Integer Multiplication and Division. It’s a small extension that takes only a couple paragraphs per operation in the RISC-V specification.

This post is the first in a series of worked examples and explorations using Bluespec. It’s also an opportunity for me to use the Bluespec lexer and Bluespec extension for VS Code I worked so hard on.

I begin by discussing how we create our testbench (in this house, we believe in test-driven development) and continue by going through a sequence of multiplier designs from the non-functional to decent. We don’t go into advanced multipliers; that’s more for a later post.

For each design, I present a Bluespec excerpt and some critical-path, area, and cycle numbers. The synthesis-related numbers come from using the Minispec synth tool. The numbers are probably worse than if we implemented directly in Verilog, but that’s alright for now.

All the code here was written from scratch and is hosted on this GitHub repo. I didn’t keep all the different implementations in the most recent commit, but you can go through the commit history for earlier stages.

Testing

It’s both quicker and easier to debug a little module like a multiplier by running it through unit tests than by putting it into a huge processor and seeing whether the processor breaks. In the hardware world, I believe this is called verification.

In this case, I chose to do it as a two-step process. We first generate the tests (in whatever way we want), then we write a Bluespec testbench that consumes the tests and probes an instance of our multiplier-to-be-tested.

Test Case Generation

I chose to generate test cases as a hexadecimal file using a C script. If you’re unfamiliar with compiled languages like C: I write my multiplier.c file, compile it with gcc multiplier.c -o multiplier, and execute the binary with ./multiplier. I’ve tucked most of the compilation commands I’ll be using into a Makefile.

My C script writes a text file with one test per line, each line being hexadecimal that my testbench will consume. Each line is basically a packed [a, b, ab], where a times b equals ab. There are 32 bits allocated to each of a and b, and 64 bits for ab, since we might need to supply the upper 32 bits as per the RISC-V “M” extension specification.

Here’s what generating each test case looks like. We perform the multiplication in C, then log both operands and the output in hex. Later, our testbench will feed these operands into our Bluespec multiplier and check that the outcome is equal to the outcome we got in C.

#include <stdint.h>
#include <stdio.h>

// in main(), we instantiate the file pointer with something like this:
// FILE *fptr = fopen("test_cases.vmh", "w");

// Log the case
void log_multiply(FILE* fptr, int32_t a, int32_t b) {
    int64_t ab = (int64_t)a * (int64_t)b;
    #ifdef DEBUG
    printf("%x times %x is %lx\n", a, b, ab);
    #endif
    // format is a, b, ab_upper, ab_lower
    fprintf(fptr, "%08x%08x%016lx\n", a, b, ab);
}

We use two batches of tests: special cases, then random cases. I also cap off the test cases with a sentinel value so our testbench knows when to end.

void end_case(FILE* fptr) {
    printf("deadbeef cap\n");
    fprintf(fptr, "%032x", 0xdeadbeef);
}

Special Cases

The first batch checks a set of hand-picked test cases. I selected these particular test cases because they were either simple cases that we could check by eye (so we know that our test generation itself works) or edge cases like multiplying INT32_MIN by itself, which yields the largest signed product.

Between them, these test cases should cover the breadth of possible inputs (e.g., multiplying by 0, multiplying different signs, multiplying same signs). Another way of saying it: if we pass this small set of tests, we have a pretty strong signal that our multiplier is fully correct.

In our case, our special cases are:

void specific_multiply(FILE* fptr) {
    // Non-negatives first
    log_multiply(fptr, 0, 0);  // 0 identity, easy
    log_multiply(fptr, 0, 1);  
    log_multiply(fptr, 1, 1);  // 1 identity, easy
    log_multiply(fptr, INT32_MAX, 0);
    log_multiply(fptr, INT32_MAX, 1);
    log_multiply(fptr, INT32_MAX, INT32_MAX);
    // Signed below
    log_multiply(fptr, -1, 0);
    log_multiply(fptr, -1, 1);  // sign extended
    log_multiply(fptr, -1, -1);  // negative times negative?
    log_multiply(fptr, INT32_MIN, 0);
    log_multiply(fptr, INT32_MIN, 1);
    log_multiply(fptr, INT32_MIN, INT32_MIN);  // biggest possible result
    log_multiply(fptr, INT32_MAX, INT32_MIN);
}

I know there’s a lot of repeated boilerplate here. If we wanted to, we could store all the test pairs in a separate array or file, or in any number of ways, and then iterate through calling log_multiply(fptr, a, b) for each a, b.
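For instance, a table-driven version might look something like this (a hypothetical sketch; specific_multiply_table isn’t in the repo):

#include <stdint.h>
#include <stdio.h>

// Sketch: the same special cases, driven from a table.
static const int32_t test_pairs[][2] = {
    {0, 0}, {0, 1}, {1, 1},
    {INT32_MAX, 0}, {INT32_MAX, 1}, {INT32_MAX, INT32_MAX},
    {-1, 0}, {-1, 1}, {-1, -1},
    {INT32_MIN, 0}, {INT32_MIN, 1}, {INT32_MIN, INT32_MIN},
    {INT32_MAX, INT32_MIN},
};

void specific_multiply_table(FILE* fptr) {
    for (size_t i = 0; i < sizeof(test_pairs) / sizeof(test_pairs[0]); i++)
        log_multiply(fptr, test_pairs[i][0], test_pairs[i][1]);
}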

I think a two-step process is good enough for this worked example, so we’re keeping it a little simple at the cost of some repeated code. If we had a more complicated design that we were testing with many more test cases, then we might want to encode our human-readable cases separately from the C code.

In the test_cases.vmh hex file, these cases look like this:

00000000000000000000000000000000
00000000000000010000000000000000
00000001000000010000000000000001
7fffffff000000000000000000000000
7fffffff00000001000000007fffffff
7fffffff7fffffff3fffffff00000001
ffffffff000000000000000000000000
ffffffff00000001ffffffffffffffff
ffffffffffffffff0000000000000001
80000000000000000000000000000000
8000000000000001ffffffff80000000
80000000800000004000000000000000
7fffffff80000000c000000080000000

Each 32-bit segment is represented by eight characters. For example, look at the third line, which corresponds to 1 times 1 equals 1. On the left half we have a 00000001 for each operand, and on the right, we have a 1 with 15 leading zeros as the product.
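If we ever wanted to sanity-check a line programmatically, unpacking it back is straightforward. Here’s a quick decoder sketch in C (hypothetical, not part of the actual test flow):

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

// Split one 32-character line back into a, b, and the 64-bit product.
void decode_line(const char* line) {
    uint32_t a, b;
    uint64_t ab;
    sscanf(line, "%8" SCNx32 "%8" SCNx32 "%16" SCNx64, &a, &b, &ab);
    printf("a=%08" PRIx32 " b=%08" PRIx32 " ab=%016" PRIx64 "\n", a, b, ab);
}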

Most of our bugs should be caught by our well-selected test cases.

Randomized Tests

The second batch is much larger and consists of randomized tests. For a simple design like a multiplier, we shouldn’t be getting any surprises in this batch. Randomized tests are more significant for complex designs where the input space is bigger or bothersome to write special cases for (e.g., testing correct functionality of a cache). But it’s good practice to do it here too.

Because our multiplier is simple, these are mostly to measure performance (in terms of cycles per multiplication, asymptotically) and partly to give us the confidence that our multiplier works even when we don’t know the inputs.

It doesn’t so much matter here since I’m both testing and developing the multipliers, but randomized tests can also protect from bad multipliers that just hardcode special cases. Or maybe that’s just why they did it in school.

#include <stdlib.h>
#include <stdint.h>

void rand_multiply(FILE* fptr) {
    int32_t a = rand() - rand();  // covers nearly the full signed range
    int32_t b = rand() - rand();
    log_multiply(fptr, a, b);
}

These test cases are a lot less pleasant to read than our earlier ones, but we put them in test_cases.vmh all the same. Because we’ve already validated the generation logic with our hand-picked cases, we can be confident that these are correct too.

f2161e7bfbd104f5003a3531831017b7
f69277035113f1d3fd039b9ee4faea79
f8e5e6de58f1fdd3fd8850c001a4aefa
19d0ac8848a2aa54075317c0791aeca0
aacbcda9489a4485e7d5f9a31e2cbccd
05257b1e54e6803e01b4eea67196d144
c0fc893844312d6cef36f99af260bba0
076ec261eba76b3bff68c66bda0c575b

For my testing, I wrote 13 special test cases and generated 900 randomized test cases. We can measure performance by dividing our total cycles by 913 as soon as we pass all tests.

For completeness, this is what my main() looks like (though you can look for yourself in the repo):

int main() {
    const int SEED = 2;
    const int cases = 900;

    srand(SEED);
    FILE *fptr = fopen("test_cases.vmh", "w");

    specific_multiply(fptr);

    for (int i = 0; i < cases; i++) {
        rand_multiply(fptr);
    }

    end_case(fptr);
}

Other Means of Generation

Alternatively, I could’ve generated test cases directly in Bluespec, or in any other language that can write to a file.

If I had a reference Bluespec or Verilog implementation of a multiplier, I could’ve also used that to generate test cases during runtime inside of the testbench. That would’ve been a fine option, considering Bluespec actually has a built-in multiplier with the * operator, which I discuss later. We could’ve produced a pair of operands and fed them into both our multiplier-under-test and our reference multiplier. Then, we could’ve seen if they produced the same results.

I figured it would be easier to write our test cases using C since it has higher-level constructs than Bluespec. Also, I haven’t yet learned how to write a series of sequential test cases in Bluespec like I did in C.

Bluespec Testbench

Once we have our test cases in test_cases.vmh, we can make our Bluespec testbench consume them. The way I did it is to have our testbench instantiate a BRAM that loads from our hex file. Then, we can send read requests to the BRAM for each test case, one line at a time.

module mkMultiplierUnitTest(Empty);
    let cfg = defaultValue;
    cfg.loadFormat = tagged Hex "test_cases.vmh";
    BRAM1Port#(MaxTestAddress, TestPacket) tests <- mkBRAM1Server(cfg);

    MultiplierUnit dut <- mkMultiplierUnit;
    // ... everything else ...
endmodule

My testbench mkMultiplierUnitTest uses five rules:

  • puts submits a new read request for every line of our BRAM.
  • question receives the BRAM response, queries our multiplier-under-test, and enqueues the expected result.
    • Terminates if it detects our sentinel value deadbeef.
  • answer receives the multiplier response and compares it to our expected result, which it then discards.
    • Terminates if we receive a wrong answer.
  • tick (minor) increments our cycle counter.
  • terminate (minor) ends our simulation if we end up stalling.

We can’t test something whose interface we don’t know, so here’s our MultiplierUnit interface. We only worry about two methods: one to start computation, and one to free the multiplier and get the result. Internally, the implementation should have these methods guarded so that they’re only callable when the module is ready to perform each one.

typedef Bit#(32) Word;
typedef Vector#(2, Word) Pair;

interface MultiplierUnit;
    method Action start(Pair in);
    method ActionValue#(Pair) result;
endinterface

Our entire testbench then looks like this:

typedef Bit#(10) MaxTestAddress;  // 10 bits for max 1024 tests
typedef Vector#(4, Word) TestPacket;

module mkMultiplierUnitTest(Empty);
    let cfg = defaultValue;
    cfg.loadFormat = tagged Hex "test_cases.vmh";
    MultiplierUnit dut <- mkMultiplierUnit;  // design under test

    BRAM1Port#(MaxTestAddress, TestPacket) tests <- mkBRAM1Server(cfg);
    Reg#(MaxTestAddress) request_index <- mkReg(0);
    FIFO#(Pair) expected <- mkFIFO;
    Reg#(Word) cycles <- mkReg(0);
    Reg#(Word) last_solved <- mkReg(0);

    function Action conclude;
        action
        $display("Ended at %0d cycles after solving the %0d test", cycles, last_solved);
        $finish;
        endaction
    endfunction

    rule tick;
        cycles <= cycles + 1;
    endrule

    // This rule keeps us requesting
    rule puts;
        request_index <= request_index + 1;
        let request = BRAMRequest{
            write: unpack(0),
            address: request_index
        };
        tests.portA.request.put(request);
    endrule

    // This rule queries the dut
    rule question;
        TestPacket current_test <- tests.portA.response.get();
        current_test = reverse(current_test);
        // Reverse order because of the way the vmh is written/read

        Pair operands = unpack({current_test[0], current_test[1]});
        Pair results = unpack({current_test[2], current_test[3]});

        dut.start(operands);
        expected.enq(results);

        if (current_test[3] == 'hdeadbeef) begin
            $display("deadbeef detected; finishing at %0d cycles", cycles);
            conclude;
        end
    endrule

    rule answer;
        Pair result <- dut.result;
        last_solved <= last_solved + 1;
        if (result != expected.first) begin
            $display("Result was %x but expected %x", result, expected.first);
            conclude;
        end
        expected.deq;
    endrule

    rule terminate if (cycles > 'hFFFF);
        $display("Emergency exit");
        $finish;
    endrule
endmodule

We should test our testbench so we know it actually works. For that, we can write a dummy MultiplierUnit implementation that can receive requests and serve responses without the burden of producing correct answers. It should fail tests and trigger the failure message.

Dummy Implementation

Just as above, we’re using these definitions.

typedef Bit#(32) Word;
typedef Vector#(2, Word) Pair;

interface MultiplierUnit;
    method Action start(Pair in);
    method ActionValue#(Pair) result;
endinterface

typedef enum {
    Idle,
    Busy,
    Ready
} MultiplierState deriving (Bits, Eq, FShow);

This is a “multiplier” that does nothing. Its sole purpose is to let us run our testbench and fail without stalling. Which it does, beautifully.

(* synthesize *)
module mkMultiplierUnit(MultiplierUnit);
    Reg#(MultiplierState) state <- mkReg(Idle); 
    FIFO#(Pair) last_inputs <- mkFIFO;

    method Action start(Pair in) if (state == Idle);
        last_inputs.enq(in);
        state <= Ready;
    endmethod

    method ActionValue#(Pair) result if (state == Ready);
        last_inputs.deq;
        state <= Idle;
        return last_inputs.first;
    endmethod
endmodule
Area: 1208.08 um^2
Critical-path delay: 145.5 ps
Cycles: N/A
Coverage: Only when inputs = outputs

The large area despite having almost nothing can probably be chalked up to the FIFO that contains a Pair, equivalent to a Bit#(64) of state.

A stopped clock can be correct. Since our dummy implementation returns its inputs as outputs, we pass the 0 times 0 = 0 test, albeit nothing else. Now that we know our testbench works, we can start worrying about implementing the multiplier.

Designs

First Implementation

Let’s use the Hennessy and Patterson designs for integer multiplication, following their section on Radix-2 Multiplication and Division. We’ll start with unsigned multiplication and handle signed multiplication later.

This works for non-negative integers; we would need to add some complexity to accommodate negative integers.

(* synthesize *)
module mkMultiplierUnit(MultiplierUnit);
    Reg#(MultiplierState) state <- mkReg(Idle); 
    Reg#(Word) a <- mkRegU;
    Reg#(Word) b <- mkRegU;
    Reg#(Word) p <- mkRegU;
    Reg#(Bit#(5)) index <- mkRegU;  // only need 0 through 31

    rule work(state == Busy);
        Bit#(33) new_p = (a[0] == 'b1) ? {0,p} + {0,b} : {0,p};  // (1); may need carry
        p <= {new_p[32:1]};
        a <= {new_p[0], a[31:1]};
        index <= index + 1;

        if (index == 31) state <= Ready;
    endrule

    method Action start(Pair in) if (state == Idle);
        a <= in[0];
        b <= in[1];
        p <= 0;
        index <= 0;
        state <= Busy;
    endmethod

    method ActionValue#(Pair) result if (state == Ready);
        state <= Idle;
        return unpack({p, a});
    endmethod
endmodule

Our multiplier is a state machine that goes between Idle, Busy, and Ready. It waits for a request in Idle from start, then it sets a bunch of registers and transitions to Busy. It will stay in Busy for 32 cycles worth of work, incrementing the index by 1 each time. When it’s on its last cycle, it will transition to Ready, where it waits for our testbench to pick up the product through a result call. Once result is called, the multiplier transitions back to Idle, ready for the next request.

Algorithm

The main operation is in the line labeled (1). We’re basically doing long multiplication in binary, one digit at a time. Hennessy and Patterson explain it better in their textbook.

We have 3 registers: a, b, and p.

a <= in[0];
b <= in[1];
p <= 0;
  • a and b are our initial operands. b stays constant, and a changes as we run the algorithm. The lower bits of a correspond to the operand, and the upper bits will gradually be filled with the lower bits of our product. We make the space by consuming and throwing away the lowest bit of a each cycle.
  • p starts as 0 and is part of our running sum.
  • At termination, a will hold the lower bits of our product and p will hold the upper bits.

Each cycle, we perform two steps inside of work.

Step 1: we inspect the LSB of a (a[0]) to determine our new_p.

  • If it’s 0, we do nothing and just use new_p = p.
  • If it’s 1, we add b to p and use new_p = p + b.
  • We do this addition in Bit#(33) in case of overflow.
Bit#(33) new_p = (a[0] == 'b1) ? {0,p} + {0,b} : {0,p};  // (1); may need carry

Step 2: we shift new_p and a right by one bit, and store the resulting new_p >> 1 into p. We can think of p and a as stuck together, with p on the left and a on the right. The total 64-bit product will eventually be {p, a}. We make the space for the MSB of new_p by shifting out the a[0] that we consumed. Think of long multiplication.

p <= {new_p[32:1]};
a <= {new_p[0], a[31:1]};
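To convince ourselves the bookkeeping is right, here’s a small software model of the same loop in C (my own sketch; radix2_multiply is a hypothetical function that only mirrors the Bluespec above):

#include <stdint.h>

// Unsigned radix-2 shift-add, mirroring the work rule.
uint64_t radix2_multiply(uint32_t a, uint32_t b) {
    uint32_t p = 0;
    for (int index = 0; index < 32; index++) {
        // Step 1: conditionally add b, keeping the 33-bit result
        uint64_t new_p = (a & 1) ? (uint64_t)p + b : (uint64_t)p;
        // Step 2: shift {p, a} right by one; new_p[0] becomes a's MSB
        a = (uint32_t)(new_p << 31) | (a >> 1);
        p = (uint32_t)(new_p >> 1);
    }
    return ((uint64_t)p << 32) | a;  // {p, a}
}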

I believe there are some startup cycles because it takes our BRAM a little bit of time to get set up. Asymptotically, when running many tests, I can see that the cycles per test tend toward 34. Internally, only 32 of those cycles correspond to work happening in the multiplier; the other 2 have to do with input/output.

Area: 1273.08 um^2
Critical-path delay: 249.57 ps
Cycles: 272 cycles to the 7th test (38.86 average cycles per multiply; 32 internal, 34 external)
Coverage: Non-negative only.

I generate the area/delay numbers with synth, and the cycles and test numbers come from my $display statements. Since we pass the non-negative tests, let’s extend our multiplier to work for signed multiplication.

Radix-2 Signed Multiplication

We can focus on implementation because we already have tests that cover multiplication with signed values.

We use Booth recoding to multiply with signed numbers. The main difference is that at each step, we inspect the current and previous bit of a rather than just the current bit. Almost all of the change is just to our work rule, with almost everything else staying the same.

It’s a little tricky to explain the idea behind the algorithm, but it gets clearer once we have more cases at higher radices. Essentially, we read our two-bit case {a[0], last_a} as a signed digit: a[0] acts as a one-bit two’s complement number (so it carries weight -1), and last_a is the carry from the previous operation (weight +1).

With each case:

  • 00 is 0(-1) + 0(1) = 0 + 0 = 0
  • 01 is 0(-1) + 1(1) = 0 + 1 = 1
  • 10 is 1(-1) + 0(1) = -1 + 0 = -1
  • 11 is 1(-1) + 1(1) = -1 + 1 = 0

We use these cases to determine what to do with b in our new_p addition. Depending on the case, we mux between new_p = p, new_p = p + b, or (new!) new_p = p - b.

Reg#(Bit#(1)) last_a <- mkRegU;
rule work(state == Busy);
    // Booth recoding
    Bit#(33) p_ = signExtend(p);
    Bit#(33) b_ = signExtend(b);
    // This turns into a mux choosing between 3.
    Bit#(33) new_p = case ({a[0], last_a}) matches
        2'b01: {p_ + b_};
        2'b10: {p_ - b_};
        default: {p_};  // to massage compiler
    endcase;

    last_a <= a[0];
    p <= {new_p[32:1]};
    a <= {new_p[0], a[31:1]};
    index <= index + 1;

    if (index == 31) state <= Ready;
endrule
Area: 1612.23 um^2
Critical-path delay: 309.79 ps
Cycles: 443 cycles for 13 tests (34 cycles per multiply - 2 overhead = 32)
Coverage: All 32-bit integer multiply
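As a sanity check on the recoding itself, here’s the same idea as a C model (a sketch; booth2_multiply is hypothetical): each step contributes digit * b * 2^index, with the digit read off {a[0], last_a} exactly as in the table above.

#include <stdint.h>

// Signed radix-2 Booth recoding: digit = last_a - a[0], i.e. -1, 0, or +1.
int64_t booth2_multiply(int32_t a_in, int32_t b) {
    uint32_t a = (uint32_t)a_in;
    int64_t product = 0;
    int last_a = 0;
    for (int index = 0; index < 32; index++) {
        int digit = last_a - (int)(a & 1);
        product += (int64_t)digit * b * ((int64_t)1 << index);
        last_a = (int)(a & 1);
        a >>= 1;
    }
    return product;
}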

Our job is complete if we’re okay with 34 cycles per multiply. Internally, there are only 32 cycles’ worth of work; the last 2 cycles come from constant testbench overhead. But let’s just worry about the internal work.

But we might want a quicker multiplier. We don’t necessarily need a critical path as short as 310 ps if the rest of our arithmetic units force us onto a slower clock anyway, and we may be able to spare more than 1600 um^2 of area if it means we can get through multiplications faster.

Let’s try and get better performance. There are a few ways to do better multiplication. This time, let’s just focus on higher-radix multiplication. Next time, we can explore other options.

Radix-4 Multiplication

We can make things faster by inspecting two bits at a time instead of one. We can then finish in 16 internal cycles rather than 32, since we’re doing 16 cycles of 2 bits each.

Our case statement becomes a little uglier, muxing between 5 options rather than 3. Most of the change happens in the work rule. That’s because we need to add up to +/- 2b, instead of just +/- b like last time.

As a syntactic aside, I wrote the variable for 2*b as b2 instead of 2b because the Bluespec language specification requires that variable identifiers start with a lowercase letter. I raise syntax errors when identifiers start with numbers in my extension for VS Code, though not currently in my lexer for Rouge.

The Bluespec language specification is a little contradictory though, since identifiers can start with a $ or _ instead of a letter. For spots of ambiguity, I defer to the compiler. The Bluespec compiler raises a compile-time syntax error when it encounters a variable identifier starting with a number.

rule work(state == Busy);
    // Booth recoding
    Bit#(34) p_ = signExtend(p);
    Bit#(34) b_ = signExtend(b);
    Bit#(34) b2_ = signExtend({b, 1'b0});  // 2*b = b << 1

    // Mux that chooses between 5; the default is optimized out.
    Bit#(34) new_p = case ({a[1:0], last_a})
        3'b111, 3'b000: {p_};
        3'b001, 3'b010: {p_ + b_};
        3'b101, 3'b110: {p_ - b_};
        3'b011: {p_ + b2_};
        3'b100: {p_ - b2_};
        default: {0};  // never happens
    endcase;

    p <= {new_p[33:2]};
    a <= {new_p[1:0], a[31:2]};
    last_a <= a[1];
    index <= index + 2;

    if (index == 30) state <= Ready;  // on last step
endrule

Our cases follow the same rules as before, where a[1:0] is now a 2-bit two’s complement number and the last_a is still the 1-bit carry.

e.g., 3'b011 corresponds to 0(-2) + 1(1) + 1(1) = 2, so we add 2b.
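In other words, the recoding collapses to a small formula. As a C-flavored illustration (a hypothetical helper, not part of the design):

// Recode {a[1:0], last_a} into a digit in {-2, -1, 0, +1, +2}:
// a[1] carries weight -2; a[0] and last_a each carry weight +1.
int radix4_digit(int a1, int a0, int last_a) {
    return -2 * a1 + a0 + last_a;
}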

Compiler Optimizations

Initially, I kept a b2 register holding a value of 2b precomputed at the start, but then I realized it didn’t save any work, because we can just do a single bit shift to double b, which is negligible in hardware.

Something interesting is that in Bluespec, the following statements result in the same circuit:

// This is if we store `2b` in a register
b2 <= 2*{0, in[1]};  // "multiply"
b2 <= {0, in[1]} + {0, in[1]};  // addition
b2 <= {in[1], 1'b0};    // simple fixed shift

// Total Circuit in Multiplier:
// Area: 2575.68 um^2
// Critical-path delay: 322.36 ps

The compiler sometimes performs these types of optimizations, but sometimes it doesn’t. For a Bluespec developer, there’s a careful balance between writing beautiful (potentially-) optimizable code and ugly (pretty-sure-is-) optimal code. The optimization side of the Bluespec compiler is not as mature as that of software compilers like Clang.

Because a multiplication by 2 only requires a fixed bit shift, we don’t need to store the value of 2b. In terms of synthesis, performing the bit shift every time to save state reduces the area (because we drop a Bit#(34) register) but very slightly increases the critical path. This is the design we’ll go with for Radix-4.

Area: 2328.83 um^2
Critical-path delay: 324.62 ps
Cycles: 16435 cycles for 913 tests

Compared to radix-2 multiplication, we cut our internal cycles in half (now 16) while increasing the area by 44% and the critical path by less than 5%. The speed-up is close to double. It seems like this was well worth it.

Radix-8 Multiplication

Just as we went from 1 bit to 2 bits, we can inspect 3 bits at a time to further reduce the cycles. Of course, 32 is not a multiple of 3, so we need to handle the last cycle as a special case. We would go from 16 to about 11 cycles and hope it doesn’t cost us much in terms of delay and area.

You’ll notice that the cases become slightly more complicated. In Radix-2 we muxed between 3 choices. In Radix-4 we muxed between 5 choices. In Radix-8, we also work with 3b and 4b, so that gives us 9 choices (one for plus and one for minus). That’s just in terms of outcomes.

In terms of cases, in Radix-2 we had 2^2=4 cases; in Radix-4 we had 2^3=8 cases; in Radix-8 we’ll have 2^4=16 cases. Quite a lot to keep track of, but it’s fine once you know the pattern.

Like before, each case is itself like a two’s complement number that tells us what to do with b. In Radix-8, we look at 4 bits: the current a[2:0] and last_a. The last_a acts like a carry bit, so we just read a[2:0] as a two’s complement number, keeping in mind the carry. You’ll see if you inspect the cases carefully in the upcoming code excerpt.
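Written out the same way as before, the digit formula just gains a bit (again a hypothetical helper, purely for illustration):

// Recode {a[2:0], last_a} into a digit in {-4, ..., +4}:
// a[2] carries weight -4; a[1] carries +2; a[0] and last_a carry +1 each.
int radix8_digit(int a2, int a1, int a0, int last_a) {
    return -4 * a2 + 2 * a1 + a0 + last_a;
}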

We do need to take care of our loop termination condition. With Radix-2 and Radix-4, we could just stop when the previous index was 31 or 30 respectively, because the current step would bring the next index to 32 (or 0): a full reset. 32 is cleanly divisible by 1 and 2.

With Radix-8, we might want to stop at index 29, because 29 + 3 = 32. But 29 isn’t an index we would ever land on, since our index moves in steps of 3. We can either stop when the previous index is 30, bringing the next one to 33, or at 27, bringing the next one to 30. Both options require a special case at the end: either backtracking by a turn of 1 bit, or doing one more turn of 2 bits.

I think it’s easier to do one more turn of 2 bits (which is just one step of Radix-4) than it is to backtrack. To backtrack, we’d need to subtract b according to the last digit as well as restore last_a. But either should work.
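To spell out the schedule we end up choosing, here’s the arithmetic as a trivial C check (illustrative only):

#include <assert.h>

// 10 turns of 3 bits in work (index 0, 3, ..., 27), then 2 bits in result.
void check_schedule(void) {
    int bits = 0;
    for (int index = 0; index <= 27; index += 3)
        bits += 3;  // one work turn
    bits += 2;      // final radix-4 turn in result
    assert(bits == 32);
}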

Since not much else is going on when we’re calling result, I’m putting the last step in the result method, though you could also put it in a separate rule or the same work rule. The tradeoff is that putting the step in result may spread out the critical path, which can help, hinder, or do nothing depending on whether the processor’s critical path contains work, contains result, or contains neither. In exchange, we can shave off a cycle from work, giving us only 10 internal cycles.

There’s also the concern of instantiating new adders in result, but hold onto that thought for now.

Computing 3b

We should precompute 3b because, rather than only requiring a shift like b2 and b4, it requires an addition. If we computed it in work, we might have two layers of adders: one to compute 3b, then another to add it to new_p. Conceptually, that could wreck our critical-path delay, so we precompute 3b in start. I also run the numbers later in a small experiment.

Unlike computing b2, computing b3 seems to give the compiler some trouble producing an efficient circuit, so we need to tune by hand.

Check the differences in synthesis:

b3_ <= 3*signExtend(in[1]);
// Area: 5222.38 um^2
// Critical-path delay: 411.48 ps

Bit#(35) b2 = signExtend({in[1], 1'b0});
b3_ <= b2 + signExtend(in[1]);
// Area: 5261.75 um^2
// Critical-path delay: 341.64 ps

b3_ <= signExtend(in[1]) + signExtend(in[1]) + signExtend(in[1]);
// Same as the b2 + b1 (second one)

It’s interesting that 3*signExtend(in[1]) gets the compiler to produce something much heavier than repeated addition or our “clever” addition. The repeated addition seems to get transformed into the clever addition, but I’m surprised the Bluespec compiler doesn’t optimize the first into either of the other two. I’m not sure what it’s doing.

New Rules

We precompute b3, but b2 and b4 are once again results of shifts. We add in the many new cases for work. Hopefully you can see the pattern.

// Constants; we can pull them out of the `work` rule to share with `result`
Bit#(35) b_ = signExtend(b);
Bit#(35) b2_ = signExtend({b, 1'b0});  // compiler expands lone 0
Bit#(35) b4_ = signExtend({b, 2'b0});
Bit#(35) p_ = signExtend(p);

rule work(state == Busy);  // handles 10 cycles of 3 bits each
    // Booth recoding
    // Mux choosing between 9; default is optimized out.
    Bit#(35) new_p = case ({a[2:0], last_a})
        4'b100_0:           {p_ - b4_};
        4'b101_0, 4'b100_1: {p_ - b3_};
        4'b110_0, 4'b101_1: {p_ - b2_};
        4'b111_0, 4'b110_1: {p_ - b_};
        4'b000_0, 4'b111_1: {p_};
        4'b001_0, 4'b000_1: {p_ + b_};
        4'b010_0, 4'b001_1: {p_ + b2_};
        4'b011_0, 4'b010_1: {p_ + b3_};
        4'b011_1:           {p_ + b4_};
        default: {0};  // never happens
    endcase;
    p <= {new_p[34:3]};
    a <= {new_p[2:0], a[31:3]};
    last_a <= a[2];
    index <= index + 3;

    if (index == 27) state <= Ready;  // on last step
endrule
method Action start(Pair in) if (state == Idle);
    // ... other boilerplate ...
    b3_ <= signExtend(in[1]) + signExtend(in[1]) + signExtend(in[1]);
endmethod
method ActionValue#(Pair) result if (state == Ready);  // handles last 2 bits
    state <= Idle;

    // Radix-2 case for last 2 bits
    Bit#(35) new_p = case ({a[1:0], last_a})
        3'b111, 3'b000: {p_};
        3'b001, 3'b010: {p_ + b_};
        3'b101, 3'b110: {p_ - b_};
        3'b011: {p_ + b2_};
        3'b100: {p_ - b2_};
        default: {0};  // never happens
    endcase;

    let p_ = {new_p[33:2]};
    let a_ = {new_p[1:0], a[31:2]};

    return unpack({p_, a_});
endmethod

Area Analysis

Area: 5261.75 um^2
Critical-path delay: 341.64 ps
Cycles: 10957 for 913 tests (10 internal cycles; 12 external cycles per)

Compared to Radix-4, we once again have about a 5% increase in critical path but a 125% increase in area. In exchange, we’ve gone from 16 internal cycles to 10 internal cycles, improving our time by over a third.

The large increase in area is because in Bluespec, we instantiate an adder with each call of +, except in very special cases where we have repeated subexpressions (e.g., two adders always perform identical calculations every cycle). So we’ve instantiated more adders. You may have noticed a similar effect when we did Radix-4 from Radix-2.

If we wanted, we could even instantiate a full multiplier with *.

For an area-efficient multiplier implementation, we may want to write one that uses only one adder. We would mux in the proper multiple of b instead of muxing between the results of several adders. So far, we’ve been generous with the amount of area we’re prepared to use. I discuss ways around it later.

In fact, Hennessy and Patterson’s description of higher-radix multiplication (what we’re doing) is in their section Speeding Up Multiplication with a Single Adder. Nine adders is starting to get a little high for a single adder design.

It’s interesting that our area increased by only 125%, since an initial count of our adders (just counting the uses of +) suggests we now have 8 + 1 + 4 = 13 adders against our initial 4. It’s such a clean count that I would guess the compiler optimized out a few adders. Cycle-by-cycle, our result adders perform the same calculations as a subset of the work adders, so it makes sense for the two components to share the same adders and just wire the sums to both work and result.

Not Precomputing 3b

What happens if we don’t precompute 3b? Let’s do a little experiment and compare to the precomputed numbers:

// Precomputed
Area: 5261.75 um^2
Critical-path delay: 341.64 ps

// Not precomputed
Area: 4830.56 um^2
Critical-path delay: 481.22 ps

As you can see, we increase our delay by 40% because work must have two layers of adders if we compute b3 at runtime.

Radix-16 Consideration

If we continue onto Radix-16, we can expect more adders (unless we start sharing adders between multiples of b, as discussed above).

For work, we’d need to account up to +/- 8b, meaning 16 adders in that rule alone.

For start, we previously had one adder because we needed to compute b3. Now, we also need to compute b5, b6, and b7. Let’s say we save b6 by performing a shift on b3. Rather than having a second layer of adders in start, we can add one more cycle in between start and work to precompute b5 and b7 by adding b2 + b3 and b4 + b3, which also saves us having to compute it every cycle. The precomputation cycle might be helpful if Bluespec doesn’t have a good 3-way single-layer adder as a built-in.
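Concretely, the precomputation plan would look something like this sketch of the data flow (in C notation for brevity; the names are hypothetical):

// All multiples of b needed for radix-16 Booth digits -8..+8.
// Shifts are free wiring; only b3, b5, and b7 need real adders.
int64_t b2 = b << 1, b4 = b << 2, b8 = b << 3;  // free shifts
int64_t b3 = b2 + b;   // in start: one adder
int64_t b6 = b3 << 1;  // free, once b3 exists
int64_t b5 = b2 + b3;  // extra precompute cycle: one adder
int64_t b7 = b4 + b3;  // extra precompute cycle: one adder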

Total, that means we need 16 + 3 = 19 adders, over double our current 9 adders.

What would we get in return? As long as we can mux between 17 (versus previous 9) values without an issue, and as long as our p_ can tolerate the fanout to all those adders, our critical path might not increase too much. We would just get a much higher area.

Our internal cycles should be 32/4 = 8, plus 1 for setup, for 9 internal cycles. Our earlier Radix-8 implementation saved an internal cycle by moving the last step from work into result, so we can think of it as going from 10 to 9 internal cycles, or 11 to 9 if we discount the saved cycle. It seems like a lot of added complexity and area to save one or two cycles per multiplication.

Let us not do Radix-16 for now.

The calculus might be different if we were doing 64-bit multiplication: there, we would be going from 21 cycles to 17. For 128-bit multiplication, we’d go from 42 to 33 cycles. It’s the saved cycle in Radix-8 and the precomputation cycle in Radix-16 that create a lot of relative overhead when we’re working with 32-bit multiplication.

Reduced Adder Design for Radix-8

All the above Radix-n designs are supposed, in theory and according to Hennessy and Patterson, to use only a single adder. Because of the way we’ve implemented them, we’ve instantiated far more than a single adder.

In Bluespec, using the + operator, like calling any function, often results in the compiler inlining it and therefore synthesizing a brand-new adder, unless it can determine that it can get away with fewer, e.g., when two adders always compute the same value, as happened with Radix-8.

In theory, the compiler could optimize for things like area and critical-path delay, but the Bluespec compiler isn’t nearly as sophisticated as C compilers. The developer often needs to optimize by hand for area and delay.

I would bet that more people have contributed to optimizing C compilers like GCC and Clang than have used Bluespec at all, so it isn’t quite a fair race. Perhaps optimizing RTL output for things like use of + isn’t even within scope for current Bluespec compiler engineers.

Let’s say we don’t want to instantiate so many adders. Let’s say we only want one.

If we want to add without synthesizing adders everywhere, we can instantiate our own Adder submodule and use that. Then, we can add by calling the corresponding method, and as long as we’re using a synthesized module, the compiler will throw a compilation error if it detects we’re overusing the single add port.

interface Adder#(type t);
    method t add(t a, t b);
endinterface

module mkAdder(Adder#(t)) provisos (Bits#(t, t_bits), Arith#(t));
    method t add(t a, t b) = a + b;
endmodule

(* synthesize *)
module mkAdder35(Adder#(Bit#(35)));
    Adder#(Bit#(35)) adder <- mkAdder;
    return adder;
endmodule

The first module is polymorphic, which gives us the freedom to produce different adders for different types. To massage the compiler into synthesis, I put in the second module and hard-coded it to work for 35 bits, which allows us to force a single port for the adder.

The overall multiplier still gets synthesized either way, but I want to mandate a single port for the adder. Non-synthesized modules can “grow” new ports, and then we’re right where we started.

Let’s drop this into place for our Radix-8 design like so:

Bit#(35) operand = case ({a[2:0], last_a})
    4'b100_0:           {- b4_};
    4'b101_0, 4'b100_1: {- b3_};
    4'b110_0, 4'b101_1: {- b2_};
    4'b111_0, 4'b110_1: {- b_};
    4'b000_0, 4'b111_1: {0};
    4'b001_0, 4'b000_1: {b_};
    4'b010_0, 4'b001_1: {b2_};
    4'b011_0, 4'b010_1: {b3_};
    4'b011_1:           {b4_};
    default: {0};  // never happens
endcase;

let new_p = adder.add(p_, operand);

Lower Area, Higher Delay

We get the following synthesis changes:

Area: 5261.75 um^2 -> 2960.05 um^2 (remember our Radix-4 was 2328.83)
Critical-path delay: 341.64 ps -> 520.25 ps  (big increase if we can't afford)

It’s strange that the critical path is even higher than when we had two layers of adders from our non-precomputed b3 experiment.

My first guess is that the much greater critical-path delay comes from the fan-in of the 9-way mux on the b operands feeding the adder. Although maybe that’s not it, since we previously must’ve had fan-out from p_ going into eight different adders. Hmmm. Unless fan-in is much more harmful than fan-out, that shouldn’t be it.

My second guess is that our “negative” operands require a lot of hardware before going into the adder, versus all of this happening in one go inside a built-in adder. We might effectively have two layers of adders: an implied unary 0 - b_ for the first four cases to generate the negative operand, and then the actual adder. That is, maybe in Bluespec, the - operator doesn’t instantiate an a + (-b), but a genuine subtractor.
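For reference, the trick a fused add/sub port exploits is the two’s complement identity p - b = p + ~b + 1 (a fact of binary arithmetic, nothing Bluespec-specific), which lets subtraction reuse the same adder by inverting b and feeding a carry-in of 1:

#include <stdint.h>

// Subtraction as addition: invert the operand and add one.
uint64_t sub_via_add(uint64_t p, uint64_t b) {
    return p + ~b + 1;  // equals p - b, modulo 2^64
}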

Experimentation

I don’t have the diagnostic tools to determine whether my second guess is correct, visually or from the command line, so let’s test that guess experimentally.

When I replace the more elegant operand case statement with two separate statements, one selecting the operand and one selecting between add and sub (for this experiment, assume the Adder interface also grows a matching sub method), we save some marginal area and path at the expense of much uglier code. I plan on reverting.

Bit#(35) operand = case ({a[2:0], last_a})
    4'b100_0:           b4_;
    4'b101_0, 4'b100_1: b3_;     
    4'b110_0, 4'b101_1: b2_;     
    4'b111_0, 4'b110_1: b_;     
    4'b000_0, 4'b111_1: 0;
    4'b001_0, 4'b000_1: b_;    
    4'b010_0, 4'b001_1: b2_;     
    4'b011_0, 4'b010_1: b3_;     
    4'b011_1:           b4_;
endcase;

Bool is_add = case ({a[2:0], last_a})
    4'b100_0, 4'b101_0, 4'b100_1,
    4'b110_0, 4'b101_1, 4'b111_0,
    4'b110_1, 4'b000_0, 4'b111_1: False;
    4'b001_0, 4'b000_1, 4'b010_0,
    4'b001_1, 4'b011_0, 4'b010_1,
    4'b011_1:                     True;
endcase;

let new_p = (is_add) ? adder.add(p_, operand) :
                       adder.sub(p_, operand);
Area: 2948.08 um^2 (from 2960.05)
Critical-path delay: 494.37 ps (from 520.25)
Critical path: ~b[0] -> p[31:0][30](DFF_in)

It’s a rather marginal change though, so I don’t think that was it. A third guess: maybe it’s because our adder is shared between too many spots: work, start, result.

If we then only use a shared adder in work, instantiating standalone adders for start and result, we get a delay of 432 ps, still much higher than the 342 ps of the 9-adder implementation (and with the obvious cost of higher area from more adders). But that changes our critical path to go through last_a -> p[31:0][31](DFF_in), so it could be an issue with our selector fan-out. Maybe that was it?

Exercise for the Reader

I’m starting to run out of guesses. This is starting to creep into a lower level of abstraction than I’m accustomed to. (I got most of my expertise in big processor things.) Where’s all the delay coming from? Consider this an exercise for the reader.

If you have an idea why our delay has ballooned so much, contact me and I’ll see about following up in a later post. You can play around with the source on the corresponding GitHub repo as long as you have the Bluespec compiler and the synth tool. Otherwise, you can let me know and I can test it later.

It might be an issue with our synthesis tool. It could be falling into a local minimum. Maybe we could also draw diagrams and compare the two designs.

My current method of synthesis is primitive and uses the synth tool in the Minispec repository. There may be a better way that can get us more information about the area and delay, but I haven’t worked through such a process yet. I would only be a little surprised if nobody’s made an easy way to get timing and area information for a Bluespec design.

We don’t have great tools for diagnosing timing and area issues, so while we can still think conceptually and look at the numbers we have, we can’t yet draw strong conclusions about area and delay directly from our Bluespec.

Built-in Multiplier

Bluespec offers a built-in multiplier as a primitive through the * operation. Here’s an implementation of our MultiplierUnit interface using it:

(* synthesize *)
module mkBuiltInMultiplierUnit(MultiplierUnit);
    FIFO#(Bit#(64)) data <- mkFIFO;

    method Action start(Pair in);
        data.enq(signExtend(in[0]) * signExtend(in[1]));
    endmethod

    method ActionValue#(Pair) result;
        data.deq;
        return unpack(data.first);
    endmethod
endmodule

It can do one multiplication per cycle. What’s the catch? You can probably guess: area and delay.

Area: 16272.28 um^2
Critical-path delay: 858.07 ps
Cycles: 914 cycles for 912 tests (basically one cycle per multiply)

Surprisingly, the penalty doesn’t seem untenably high. Wow. I only found this out at the end of experimenting with the other multipliers.

For built-in multiplication compared to our many-adder interpretation of Radix-8, we have a 150% increase in delay and a 200% increase in area. Compared to our single-adder interpretation of Radix-8, we have a 450% increase in area and a 65% increase in delay. It’s a big jump, but the single cycle is crazy; we save 9 internal cycles, and we save the external cycles too. One multiplication per cycle!

It almost seems worth the cost. With my slow processor and its 1000-1200 ps critical-path delay, maybe I could even embed the built-in multiplier as-is as a single cycle multiplier. I would only need to replace it if I started speeding up the rest of the stages.

The efficiency probably comes from the built-in multiplier most likely being implemented directly in Verilog, or through a very low-level use of Bluespec. If we’re willing to give up the single-cycle latency (e.g., by waiting or pipelining), we can definitely do better in Verilog, and we might be able to do better in Bluespec.

Maybe in a later post I’ll attempt to write a better multiplier in Verilog. In that case, it might still be useful to sketch things out first in Bluespec, just to check that the algorithm works cycle-wise. Once we have a Verilog module, we can still embed it in a Bluespec design for simulation and synthesis, just like an IP block.

Next Time

Next time, we’ll look at some more advanced multiplier designs. I’m particularly interested in increasing throughput, either through using faster adders or by pipelining. And I’m planning on looking at designs that intentionally use many adders.

Our aim right now is to increase multiplier throughput while assuming that our compiler gives us reasonable but not necessarily optimal hardware. Because we would be trying to fit this multiplier as a functional unit in our processor, we don’t have to worry all that much about the critical path as long as it’s below the critical path of all the other stages of the processor, and as long as the area isn’t prohibitively large.

Something else the reduced adder design allows us to do is implement multipliers that use more specialized adders, without just relying on the Bluespec built-in + operator. That might come in handy later if we look at designs that use particular adders, like carry-save adders. It might also be worth taking a peek at what state-of-the-art Bluespec processors actually use, in designs like Flute or Toooba.

If we do implement adders, then we’ll want to construct a testbench similar to the one we constructed for multipliers. We might need to focus a post on adders in particular before focusing on applying them toward multipliers. All in due time.

At some point, I’ll write a similar post about basic division, since it’s the other half of the RISC-V “M” extension.