Martin Chan

Bean Stew

2024-01-16T00:00:00-05:00

Some people like to fish out the carrots and celery from the long cooking process and replace them with fresher vegetables for eating. When I’m cooking for one, I try to adhere to the 80/20 principle. What’s the least effort that can give us the best results?

I haven’t posted in a while (almost two months!) since I committed to my MEng. So here is a low-ish effort post to get me back into the swing of things. I moved back into Cambridge last week after a few months in Philadelphia. Shout out to the United States Postal Service for their role in shipping most of my stuff all the way here.

My spring room is still mostly empty. I have most of my furniture sans mattress standing by. I’m just waiting for friends to help me do some heavy lifting to prep the room for living. I have a housemate who graciously let me live in her room while she’s away for an internship, so I’ve set up shop there for January.

I’ve mostly settled in. We live near a bunch of restaurants, but I thought it would be prudent to start cooking for myself again, especially in January when I have more time. I made a big pot of bean stew and chicken a few days ago.

Bean Stew Background

I very loosely “followed” (and skipped half the steps of) the cassoulet recipe from Serious Eats. It’s a fancy name for a rustic beans and chicken dish. They include a lot of little optimizations, but I stuck to the low-hanging fruit.

The biggest step I skipped was braising the poultry. I started with searing the chicken like they recommend, but I think my heat was so low that the chicken ended up cooked by the end of searing. I decided to just dice the seared chicken and set it aside as a stew topping.

My bean of choice was the pinto bean. You can probably use other beans instead. I’m just more accustomed to pinto from my days of eating Chipotle. When I feel more adventurous, I’ll probably try other beans. I know they’re out there.

Altogether, this made about 3750g of bean stew. I used 1250g to split between 5 servings (so 250g each) and stored 2500g in the freezer for later. I topped each serving with 100g of diced chicken and some bacon (didn’t bother to measure). This is what I call a meal kit, good for one meal.

And this is what each meal kit looks like. To reheat, I usually microwave it for a couple minutes to warm up the beans, then I finish it off in the air fryer to crisp up the chicken and bacon. When I’m feeling fancy, I add some shredded cheese on top. I use exclusively borosilicate glass to make sure the glass doesn’t explode on me.

The remaining 2500g of bean stew in jars in my freezer. I’ll thaw them out when I need them. Each one is good for about 3-4 meal kits.

I forgot to note the raw chicken weight, but I ended up with 800g of cooked chicken. I’ll need to cook more meat to accompany the stew later on, and I’m still investigating easy but tasty ways to do that. I’ve been searing by hand, which does the job and is tasty, but it takes a lot of time and causes a lot of oil splatter. I wonder if I can get similar results from the oven. Or maybe I just continue what I’m doing and just crank up the heat.

I think the stew’s alright. It’s a little celery-forward, which I wasn’t expecting since I only used like three stalks. It’s otherwise a pretty boring neutral stew. It would probably benefit from some change, but I don’t have enough food experience yet to know what change it needs. Maybe it doesn’t have enough umami (whatever that is) or acid (e.g., lemon juice or vinegar). I’m a little hesitant to use acid early on when cooking beans because it stops them from softening, but there’s no harm in adding it when they’re done cooking. Or maybe it needs more salt or more fat. Who knows. This wasn’t meant to be for experimentation.

One of my goals for the new year is to figure out how to use acid and other food science principles in cooking. It’ll take some reading (what are some good resources?) and some dedicated experimentation. I’ll probably need to compare recipes with more or less acid added to taste what the difference is.

Recipe

Two pounds dry pinto beans, soaked the night before in salted water.
One pound bacon
Mirepoix
- Three large onions
- Some celery
- Some carrots
Better than Bouillon (or similar) and gelatin
Bunch of chicken thighs.
Miscellaneous herbs and spices (I added black pepper and thyme) and optionally cooking oil.

Process

There’s more than one way to do it. I first seared the chicken in some cooking oil. It was only meant to be a sear, but I ended up cooking it through, so I figured I could just leave it at that and skip the braise. It’s not the same, but we’re going for minimal effort.

While I waited for the chicken to sear, I chopped up my vegetables and sliced my bacon. You could skip the step of adding cooking oil if you cook the bacon before the chicken, since the bacon releases fat. I did the bacon second because I wanted to cut my vegetables while the chicken seared, and I didn’t want bacon residue on my cutting board touching my vegetables. But it’s really no big deal since it all gets cooked in the end.

After searing the chicken (no pictures), I cooked down the bacon and fished it out. The bacon releases a lot of fat. I probably could’ve left all the fat in the pot, but recipes suggested removing some so as not to overdo the fat. I like cooking with leftover fat, so I use the bacon fat elsewhere like making quesadillas or later searing something else.

I like to use whole packets when I can. It’s easier to keep track of (less counting) and you don’t need to worry about having half a packet of bacon leftover. It’s probably only an issue for people who cook as infrequently as I do.

Boy dinner.

I lazily chopped my vegetables. You could do something nicer, but these vegetables will shrivel up after a few hours of cooking anyway. Serious Eats likes recommending that you do much coarser chops and fish them out once they’ve given their essence to the stew. I didn’t bother. Since I was leaving the vegetables in, I think I should’ve given it a finer dice just so it looks nicer with the beans.

My baking sheet of seared chicken on the right (also acts as a snack while cooking) and my mirepoix cooking on the left. The mug is full of bacon fat. The red dutch oven is one of my most beloved cooking vessels.

After giving the vegetables some time in the pot to meet each other, I added in half the cooked bacon and my raw beans. The beans will take a while to soften up and cook. I don’t know if adding the bacon back in does anything, but it’s kind of like pork and beans.

My pot right after I added the beans and half the bacon back in. It’s a lot of beans relative to the pot. But the vegetables will melt down and give the beans more space.

I didn’t use pre-made chicken stock. Instead, I used a couple spoons of Better than Bouillon and a packet of gelatin. That’s a step I didn’t skip, since it’s meant to give the stew some body. Does it work? I don’t know. But it’s no big deal to put a packet of gelatin powder in warm water. Also, it’s easier to carry a jar of bouillon than it is to carry cartons of stock. Once I added the liquid, I put the pot into the oven at 180 degrees to slowly cook. I didn’t put on the lid, but it doesn’t really matter.

My pot after one hour in the oven. I forgot to take a picture before putting it into the oven. You can see some of the thyme.

After an hour in the oven, I let it have another hour and a half before putting on a lid and turning off the heat. I kept the whole pot in the oven for a few more hours with the heat off. It’s not a critical part of the process. I just knew that the beans needed more time and I had dinner planned with friends. If I’d used this oven before, I probably would’ve kept it on while I was away (don’t tell the fire marshal), but I wanted to be extra careful.

What’s it mean for food safety? Well, the lid had some time to warm up in the oven, and the whole thing was at 180 degrees for at least a little while. I like to think it was food-safe. The beans certainly got soft enough, and there was no overcooking of meat.

My pot after six hours in the oven. I turned off the heat after 2.5 hours, so it sat in the oven (lid closed) in the residual heat for the remaining 3.5.

A small crust developed on top of the beans, which is sort of the point of the gelatin in the stock, but I went against the spirit of cassoulet by using a deep pot rather than a shallow dish. It’s not very much crust relative to the rest of the beans. I finish my meal kits off in the air fryer, so maybe more crust will develop later.

My meal kits after adding some diced chicken on top and some of the bacon.

After some assembly, we’re done. Once I get through the initial wave of meal kits, I’ll prepare more meat and thaw some of the frozen stew. I don’t want to prep too many and have them go bad in the fridge.

Hopefully I’ll have fancier dishes to share once the semester starts rolling again. It’s hard to say how busy the next few months will be for me. But since cooking gets easier with practice, maybe I’ll be able to make fancier things with less effort. Something like a real braise.

Early Literature Review

2023-11-26T00:00:00-05:00

Overview

In anticipation of my upcoming MEng project, I’ve been doing some reading on the ecosystem around language servers and related topics like compilers. Admittedly, there doesn’t seem to be much out there on language server implementation apart from @matklad’s work with rust-analyzer.

The MIT EECS website recommends the substantial work of the thesis be done while in residence, so I’ve mostly been starting a literature review so I can hit the ground running when the semester starts.

I’ve also been trying to brush up on my software engineering skills, which have gotten a little rusty from disuse. I’ll need them when I start implementing.

Language Servers and Compilers

A language server is similar to a compiler in that it takes in a set of source files as input and derives meaning from those files. The main difference is that while compilers use their powers for code generation, language servers use their powers to help the programmer be more productive.

Unlike compiler design, language server design is not yet an established academic discipline. They haven’t started teaching these things in school. There is no counterpart to the dragon book. Language servers barely existed ten years ago, and only as proprietary pieces of IDEs.

Today, language servers are everywhere as productivity tools for programmers. The rise of language servers can be tied directly to the rise of Visual Studio Code and TypeScript in the past decade, both by Microsoft. The Language Server Protocol itself is an open standard started by Microsoft and used to allow code editors and language servers to talk to each other. We could further trace language servers’ lineage to the IDE work pioneered by IntelliJ and Eclipse, a decade before VS Code.

Every modern programming language now has its language server, or several. Language servers have become so common that shipping a language without at least a modest language server is like shipping a language without a compiler.

Scoping

There is rich potential for research into language servers at the intersection of HCI, compilers, and static analysis. Developers are actively investigating ways to add productivity in the code editor, especially for language-specific features. Recent developments with generative AI have brought tools like NVIDIA’s ChipNeMo or GitHub Copilot, but by no means is machine learning the only frontier left.

With computing where it is in 2023, there is a very, very high ceiling for how sophisticated smart editing can get. All that research is at the bleeding edge of the field. For now, most of that is beyond my reach.

My project is not so ambitious. I’m just looking to make a usable language server for Bluespec, a language that doesn’t have one yet. I’m doing it to improve the classroom experience of students who are using Bluespec as a learning tool.

When students are using sophisticated code editing features in all their other classes, e.g., with Python, TypeScript, or C, it can be disheartening if they aren’t provided the same quality of tools for Bluespec. If hardware development is as important as we make it out to be, it should have dignified tools. That’s the same reason why I wrote the Bluespec syntax highlighter for VS Code.

I’m striving to provide simple quality-of-life features that are common across languages, like go-to-definition, hover, and signature help. These would be built on top of the core of the language server, which I would have to architect and build, and for which I would have to (re)write a grammar of Bluespec.

If time remains, there is ample room for useful features specific to Bluespec. A language server for Bluespec, like for any programming language, can take advantage of unique language-specific semantics. One such optimization might be to embed scheduling information from the Bluespec scheduler into the editor, such as with conflicting rules. I’m sure there are plenty of other opportunities.

I was first considering writing the language server in TypeScript, but I’m increasingly leaning toward writing it in Rust thanks to the ecosystem of language server implementation tools built by @matklad and the rust-analyzer team, as well as their exemplary documentation.

It’s common for software languages to have their language servers written in their own language. It should be obvious why it’s not an option for Bluespec.

Resources

The closest thing I’ve found to a canonical source of knowledge on language servers is the writing of @matklad, or Alex Kladov. He leads the rust-analyzer project, the cutting-edge language server for Rust. And he’s been very generous with sharing his experience online. I’m learning most everything I can from his writing (both from his site and the rust-analyzer blog) and videos.

I’ve also been supplementing with documentation from other well-made language server projects like the TypeScript Compiler¹ and others from the growing set of language servers.

And of course, there are the official Microsoft documents specifying the Language Server Protocol and a Visual Studio Code quick-start guide for language server development. The issue with these respectively is that the LSP is only a small part of the challenge of writing a language server, and the quick-start guide doesn’t cover any of the details required for a useful language server.

Our best bet is relying on examples of good language servers to know what to build and how to build it.

Code editors, IDEs, Smart Editing

I plan on ignoring the whole “what is a code editor versus what’s an IDE (☝️🤓)” discourse. The main idea is that language servers are the programs that enable smart editing for code.

Language servers enable features like hover, go-to-definition, code folding, type hints, autocomplete, and all the good things one expects from a modern development experience. We give the engineer all the useful information that we can from statically analyzing the code. Features like these used to be exclusive to fancy programs called IDEs, but now they’re supported by all sorts of code editing programs.

The Language Server Protocol is one way (currently, the standard way) for language servers to communicate with different editors like Visual Studio Code. It’s up to the code editor how to expose smart editing features to the developer, and it’s up to the language server to provide the information to the editor.
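To make the editor-to-server conversation a little more concrete, here is a minimal sketch of what one message looks like on the wire. The LSP base protocol frames each JSON-RPC message with a `Content-Length` header, and `textDocument/definition` is the standard go-to-definition request. The helper names (`frame_message`, `parse_message`), the file URI, and the cursor position are all made up for illustration, and the parser assumes the whole message is already in one buffer.

```python
import json


def frame_message(payload: dict) -> bytes:
    """Encode one JSON-RPC message with the LSP base-protocol header."""
    body = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    header = f"Content-Length: {len(body)}\r\n\r\n".encode("ascii")
    return header + body


def parse_message(data: bytes) -> dict:
    """Decode one framed message (assumes the full message is in `data`)."""
    header, _, rest = data.partition(b"\r\n\r\n")
    length = None
    for line in header.split(b"\r\n"):
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            length = int(value.strip())
    if length is None:
        raise ValueError("missing Content-Length header")
    return json.loads(rest[:length].decode("utf-8"))


# Hypothetical go-to-definition request: the URI and position are invented.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "textDocument/definition",
    "params": {
        "textDocument": {"uri": "file:///tmp/Example.bsv"},
        "position": {"line": 10, "character": 4},
    },
}

wire = frame_message(request)
assert parse_message(wire) == request  # round-trips cleanly
```

The framing is the easy part. Everything interesting happens in whatever computes the answer to that request.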

A good approach, like rust-analyzer’s, may be to isolate the Language Server Protocol part of the project from the language server itself, so that we can handle other protocols should they come along.
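As a rough sketch of that separation (not rust-analyzer’s actual structure, just the general shape), the core can answer questions in its own terms while a thin adapter translates the answers into LSP-shaped JSON. The names `AnalysisCore`, `LspAdapter`, and `Location`, plus the `mkFifo` symbol and its path, are hypothetical. A real go-to-definition request carries a cursor position rather than a symbol name, so looking up by name here is a simplification.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Location:
    path: str
    line: int
    column: int


class AnalysisCore:
    """Protocol-agnostic core: knows about symbols, nothing about LSP."""

    def __init__(self) -> None:
        self._definitions: Dict[str, Location] = {}

    def add_definition(self, name: str, location: Location) -> None:
        self._definitions[name] = location

    def definition_of(self, name: str) -> Optional[Location]:
        return self._definitions.get(name)


class LspAdapter:
    """Thin layer that turns core results into LSP-shaped responses."""

    def __init__(self, core: AnalysisCore) -> None:
        self.core = core

    def definition_response(self, request_id: int, name: str) -> dict:
        location = self.core.definition_of(name)
        if location is None:
            result = None
        else:
            position = {"line": location.line, "character": location.column}
            result = {
                "uri": "file://" + location.path,
                "range": {"start": position, "end": position},
            }
        return {"jsonrpc": "2.0", "id": request_id, "result": result}


core = AnalysisCore()
core.add_definition("mkFifo", Location("/tmp/Fifo.bsv", 3, 7))
adapter = LspAdapter(core)
response = adapter.definition_response(1, "mkFifo")
print(response["result"]["uri"])  # file:///tmp/Fifo.bsv
```

If another protocol came along, only the adapter would need a sibling. The core would stay untouched.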

Compiler (Non-)Reuse

Can we use the existing Bluespec Compiler? It sure seems like it’d save a lot of work! My current reading suggests probably not. We may be best served starting from scratch.

In particular, there are fault tolerance and latency demands on language servers that aren’t imposed on compilers. A language server is expected to be useful even (maybe especially) on code that doesn’t compile. It’s also expected to be responsive, sometimes at keystroke frequency. That responsiveness involves thinking about both latency and energy consumption.

From what I’ve read by @matklad, compilers and language servers are similar only enough to get you into trouble. While they appear similar, their tasks are very different, and that has major implications for the way they need to be implemented.

He writes on his experience writing rust-analyzer in “Why an IDE?”,

LSP did achieve a significant breakthrough — it made people care about implementing IDE backends. Experience shows that re-engineering an existing compiler to power an IDE is often impossible, or isomorphic to a rewrite. How a compiler talks to an editor is the smaller problem. The hard one is building a compiler that can do IDE stuff in the first place. Check out this post for some of the technical details. Starting with this use-case in mind saves a lot of effort down the road.

He writes further in “Why LSP?”,

Before LSP, there simply weren’t a lot of working language-server shaped things. The main reason for that is that building a language server is hard.

The essential complexity for a server is pretty high. It is known that compilers are complicated, and a language server is a compiler and then some.

and

And, when compiler authors start thinking about IDE support, the first thought is “well, IDE is kinda a compiler, and we have a compiler, so problem solved, right?”. This is quite wrong — internally an IDE is very different from a compiler but, until very recently, this wasn’t common knowledge.

Language servers are a counter example to the “never rewrite” rule. Majority of well regarded language servers are rewrites or alternative implementations of batch compilers.

and, most vindicating,

[LSP] moved us from a world where not having a language IDE was normal and no one was even thinking about language servers, to a world where a language without working completion and goto definition looks unprofessional.

It was a joy to find @matklad’s writing because it was consistent with everything I observed when using Bluespec, back when I was a student accustomed to fancy code editing from software languages and put off by Bluespec’s lack of similar support. And @matklad’s writing is so informative in lighting a path forward.

It’s not even that Bluespec’s editor support was always bad. Back when Bluespec was shiny and new, editor support was bad for everything. The world just moved really fast while Bluespec stayed the same. It’s time to catch up.

¹ I know I take pains to distinguish between compilers and language servers. TypeScript’s compiler is new enough and the relationship between TypeScript and JavaScript is such that they made the language server a first class concept in designing the language and compiler.

MEng Thesis Options

2023-11-19T00:00:00-05:00

Is this list going to get any longer?

I recently accepted a position to TA for Constructive Computer Architecture (6.192) for Spring 2024.¹ It would be my first semester as an MEng, so I’ll be expected to submit a thesis proposal by May as part of the MEng thesis component of the degree.

It’s typical for people who have their MEng funded partially or fully by RAships to align their thesis with the research they conduct for funding. Those types of projects often fit into the typical framework of incremental research, like pieces of projects that more senior researchers are directing. I think it’s generally the case that the MEng thesis follows the RA research, rather than vice-versa.

Since I’m not reliant on RA funding², I have more flexibility on my choice of MEng project. That doesn’t mean I should just do anything, but I feel like I should take advantage of that flexibility to do a project that might not conventionally be funded given researchers’ budgets. Some people in my position take the chance to execute on a passion project.

And, because I’ll be taking quite a few classes throughout my MEng (at least 5 out of the 6 maximum³), I think it’s especially important that I find a thesis topic that I’d like to spend a significant amount of time on. The high amount of coursework I’ll have means I will have plenty of opportunity for career-relevant learning and I’ll need to ration my energy. It’ll be easier to muster up my remaining energy for an MEng project that I’m excited about.

The degree includes 12 units of thesis work per semester as an MEng, so spending three semesters would mean about 14 hours per unit x 12 units x 3 semesters ≈ 500 hours of work. You could build quite an impressive project with 500 hours of work, between design, building, documenting, and testing. I’d like to spend that time on something I can be proud of.
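For anyone who wants to check the estimate, the arithmetic is short enough to write out. This assumes each unit stands for about 14 hours of work over a semester, which is the figure used in the estimate above. The variable names are just for illustration.

```python
HOURS_PER_UNIT_PER_SEMESTER = 14  # assumption: about 14 hours of work per unit each term
THESIS_UNITS_PER_SEMESTER = 12
SEMESTERS = 3

total_hours = HOURS_PER_UNIT_PER_SEMESTER * THESIS_UNITS_PER_SEMESTER * SEMESTERS
print(total_hours)  # 504, which rounds to the "about 500 hours" above
```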

One type of MEng project I find very attractive is the kind that is practical and has immediate pedagogical benefits. One prime example is Adam Hartz’s MEng project, which was CAT-SOOP, a learning management system that, a decade later, is now used by almost 20 MIT EECS classes⁴. Another example is Katy Kem’s project, Laboratory Assignments for Teaching Introductory Signal Processing Concepts, which is exactly what the title says and is useful for teaching. The Usable Programming Group has a series of projects on programming education, several of which continue to be used in offerings of 6.102.

These are amazing projects that amplify the efforts of students. And most of them were built by and for MIT students, who (as we are all told) are the future. I think I’d like to do that sort of thing for my MEng project.

Preliminary Idea

A couple months ago, I wrote a Bluespec extension for VS Code. It’s being used by many current students in 6.191 as part of their code editing setup since almost every lab assignment in the class uses Minispec, which I cover with my extension. I like to think it’s being used by a few researchers as well. Next semester, I have no doubt that students will be coming into 6.192 with my extension already installed.

My current idea for an MEng project is to add fuller support for Bluespec in code editors by implementing the Language Server Protocol (LSP) for Bluespec. The idea is that adding this support will drastically increase the productivity of Bluespec students, researchers, and if or when they exist, those in industry.

My plan is to add all the typical bells and whistles from a modern language in a code editor, but eventually add more Bluespec-specific and hardware-specific functionality like annotations for scheduling, ports, or state-change. Already, many programming languages support the LSP, and it’s a growing expectation for modern programming languages. Just because the hardware industry is a little behind the times doesn’t mean we have to be.

I can slot these into my VS Code extension, but the LSP can be used by other editors like vim/neovim, Emacs, Atom⁵, and many others. I still expect most programmers in 2023 to be using VS Code. I don’t see another editor dethroning it in the near future, but the appearance of editor-agnosticism is a pleasant property of the LSP. If ever the mandate shifts, the new editor can still use the LSP.

One worry about the project is that it’s kind of a bet on Bluespec. Will it take off or not? The hardware industry seems kind of ossified around Verilog, SystemVerilog, and in some places VHDL, even despite the existence of shiny new HDLs like SpinalHDL and Chisel.

While I don’t know if Bluespec is the future, I do know that Bluespec is the present (at least at MIT). 6.191 was redesigned to use Minispec only a few years ago. 6.192, I hope, will stick around for as long as there is someone interested in teaching synthesizable computer architecture.

And I’m certain there is zero way Bluespec is taking off without the sort of mature programming language support that the LSP provides. And I’m not sure anyone but me is going to implement that LSP. I’m cautiously optimistic, but I think MIT can be a fount for Bluespec, especially if we continue to teach it consistently. So maybe it is a bet on Bluespec.

It’s more than likely that I will graduate and never use Bluespec again, but I don’t think it’s any more tragic than most MEng projects. At least my project would continue to be used as long as Bluespec and Minispec are used for classes and research. It feels more impactful (certainly more based) than most MEng theses.

Probably the biggest cost for the project is that the skills I’ll be gaining will be more pertinent to software development or programming languages than to computer architecture or hardware design. I think there’s value in being a generally well-rounded engineer, and I hope that the things I learn from TAing for 6.192 and taking all my other classes will cover any weak spots.

The Alternative

While I don’t currently rely on RA funding from a computer architecture lab, I could still select a project that has more direct relevance to the sort of work that I’d like to do. Something I was looking at when I was doing my processor project was the Riscy-OOO project led by Sizhuo Zhang. It’s a sophisticated multicore out-of-order processor written in Bluespec, and a worthy base for other projects.

Most of the people who worked on the Riscy-OOO project are no longer at MIT, but there have been some recent MEng projects that used it for research. My 6.192 TA last year wrote his MEng thesis on adding secure shared memory to the processor as a continuation of the MI6 project. Doing something similar would probably require me to self-lead to a similar degree as with an LSP project, but with the reward of gaining more practice with hardware design and computer architecture (albeit with the cost of using shoddy existing Bluespec infrastructure).

The rub is that if I focus my attention on a more explicitly hardware-related project, then the infrastructure would probably never be built. But I know that a project like implementing the LSP would make Bluespec projects, especially for large code-bases like Riscy-OOO or even the 6.192 class projects, much more tractable for the next person. The lack of editing infrastructure is a significant barrier to productivity for all users of Bluespec, whether student or researcher.

The infrastructure is something best done sooner rather than later. And if not by me, then who? Not to make myself out like a saint, but I feel like I’m in the exact right spot to implement the LSP, as someone who already wrote a basic VS Code extension for Bluespec and who is about to gain access to a bunch of potential test users in the form of 6.192 (and maybe 6.191) students, already accustomed to modern features from their other classes. It seems like exactly the thing a long-term-thinking TA at MIT should do.

Frankly, my impression is that most MEng projects don’t really have much direct relevance to the students’ future careers. I’d be proud enough leaving a project of value to the MIT and Bluespec communities. I’d just need to use my other experiences to get a job in hardware.

¹ Last semester it was two instructors and two TAs. Not sure how we’ll make it with one instructor and one TA. Maybe someone else will come aboard.
² It’s still kind of up in the air how I’ll find funding for Fall 2024. 6.192 is typically only offered in the spring. I’ll probably either attempt to TA for 6.191 or seek RA funding from Arvind or one of his colleagues. I’m hoping that my project will be seen as worth it, or I’ll help more directly with theirs.
³ Some students took their graduate level classes or math electives during undergrad. I didn’t originally plan on MEnging when I was planning my undergrad classes, so I need to take all 4 of my AAGSes and 1 more math elective.
⁴ Of course it helps that Adam was a lecturer for one of the biggest classes in the school for almost as many years.
⁵ Whoever is still using Atom in 2023.

Write-After-Write Bug

2023-11-12T00:00:00-05:00

Could the whole processor project be wrong? Well not the whole thing. But I did find a bug that results in incorrect execution for a small set of cases.

Introduction

The minimal out-of-order processing in my processor project has a bug that I didn’t catch until now. Not only is the out-of-order processing minimal, but it isn’t even fully correct because it ignores write-after-write (WAW) hazards.

Most of the time, it isn’t an issue, and in fact I didn’t catch it with existing tests, but we can write a sequence of instructions that results in our processor giving the wrong answer. It’s the specific case where instructions are sliding past each other but we write to the same destination register in the wrong order.

In this post, I write a test case that demonstrates the write-after-write (WAW) hazard. To make the out-of-order part correct, I would need to either add in proper reordering or induce more stalls for when out-of-order commits may produce incorrect results. I’ve plugged this post into my project write-up as a correction.

Background

I found the bug while reflecting on a recent job interview where I was discussing my processor project. It was the first time I had the opportunity to talk about my project’s out-of-order-ness in any detail with anyone else, let alone an engineer in the field. It’s not the best of circumstances to find a bug, but it’s probably better than finding a bug post-fabrication.

In my processor, the register-read stage induces stalls so that instructions wait for their source registers to be ready (solving read-after-write hazards), but not for their destination registers to be ready. This was perfectly fine when every instruction would complete in order, but we end up with write-after-write hazards when instructions commit out of order and those instructions write to the same destination register.
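Here is a minimal sketch of the difference between the two stall conditions, written as a toy scoreboard in Python rather than anything resembling my actual Bluespec. `Instr`, `can_issue_raw_only`, and `can_issue_raw_and_waw` are hypothetical names, and `pending_writes` stands in for whatever tracks registers that still have a write in flight.

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple


@dataclass(frozen=True)
class Instr:
    text: str
    dest: Optional[int]    # destination register number, or None if no write
    srcs: Tuple[int, ...]  # source register numbers


def can_issue_raw_only(instr: Instr, pending_writes: Set[int]) -> bool:
    """Stall only while a source register has a write in flight (RAW)."""
    return all(src not in pending_writes for src in instr.srcs)


def can_issue_raw_and_waw(instr: Instr, pending_writes: Set[int]) -> bool:
    """Also stall while the destination has an older write in flight (WAW)."""
    return (can_issue_raw_only(instr, pending_writes)
            and instr.dest not in pending_writes)


# An older load to x5 is still in flight when a younger write to x5 arrives.
pending_writes = {5}
younger = Instr("addi x5, x0, 1", dest=5, srcs=(0,))

print(can_issue_raw_only(younger, pending_writes))     # True: slides past the load
print(can_issue_raw_and_waw(younger, pending_writes))  # False: waits for the load
```

Stalling on the destination is the blunt fix. Proper reordering would let the younger instruction run ahead without letting its result land in the wrong order.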

It’s a little embarrassing that I plastered these pipeline visualizations all over my write-up that might look obviously wrong to an expert at a glance, but in context it’s not too too embarrassing. I only ever learned about out-of-order processing on my own during the summer, and it’s hard to get feedback when I don’t have instructors or office hours to draw on like I did for things I learned in class. If I was doing this for school, it would’ve been caught in office hours or after turning in an assignment, or I would’ve been reminded during lecture. It just so happened that I was reminded recently during a job interview.

I do recall a case during office hours last semester when I showed my professor Thomas a visualization where the instructions appeared to be committed out of order. He expressed concern then, like it wasn’t supposed to happen. In that case, I knew it was just because of a bug in my Konata instrumentation, not really the functionality of the processor itself.

I discounted the out-of-order commits when working on the project during the summer because I was expecting my processor to be doing some things out of order. I just forgot this was a case that might give incorrect results. I probably could’ve caught it myself earlier if I had been more careful with the design or had done a closer reading of Hennessy and Patterson during the summer while I was self-studying.

This and similar bugs might have been more obvious to me if I had gained more experience with sophisticated out-of-order processors like the MIT RiscyOO project, or studied more, before attempting to implement similar features myself. On the other hand, I might not have internalized the lesson as well if I didn’t find the issue in a firsthand project like this one. It’s hard to say!

Deficiency in Existing Tests

I didn’t set up a unit test for write-after-write, but I am a little surprised my integration tests didn’t catch the issue. I can think of some reasons why.

My main guess is that the degree of out-of-order processing in my processor was so limited that the bug just didn’t show up in my tests. That, and my integration tests might just be too small for the issue to show up anyway.

The bug appears when two instructions write to the same register and are allowed to slide past each other during the register-read stage. That can only happen across two separate functional units, like the ALU functional unit and the memory functional unit.

Because my processor’s instruction window was something like 2-3 instructions (very modest compared to the dozens or hundreds of instructions in high performance processors), and because all instructions spend only 1 cycle (in the case of ALU) or 2 cycles (in the case of memory) in execution, it was rare that I would have two instructions at the head of their respective queues that want to write to the same register.

Put another way, I think the bug would’ve been more likely to show up if the instruction window were wider or if the execution stage took longer. Implementing either my improved instruction issue buffer or longer-latency functional units like a multiplier or floating point unit might have caused the issue to appear.
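The condition above can be sketched as a small static check over an instruction list. This is only an illustration in Python, not anything from the processor or its test suite: the `Instr` tuple, the `"alu"`/`"mem"` unit labels, the `window` parameter, and the three-instruction program are all made up, and it assumes (as described above) that instructions within the same functional unit stay in order.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    text: str  # assembly text, for display only
    unit: str  # hypothetical functional unit label: "alu" or "mem"
    dest: str  # destination register, or "" if the instruction writes none


def waw_candidates(program: List[Instr], window: int) -> List[Tuple[int, int]]:
    """Return index pairs (i, j), i < j, that could reorder and clobber each other.

    Two instructions are candidates when they fit in the same window, write
    the same register, and go through different functional units.
    """
    pairs = []
    for i, older in enumerate(program):
        for j in range(i + 1, min(i + window, len(program))):
            younger = program[j]
            same_dest = older.dest != "" and older.dest == younger.dest
            if same_dest and older.unit != younger.unit:
                pairs.append((i, j))
    return pairs


# A made-up sequence: two writes to a0 separated by an unrelated instruction.
program = [
    Instr("lw a0, 0(sp)", "mem", "a0"),
    Instr("addi a2, a2, 1", "alu", "a2"),
    Instr("li a0, 0", "alu", "a0"),
]

print(waw_candidates(program, window=2))  # narrow window: the a0 writes never meet
print(waw_candidates(program, window=3))  # wider window: the a0 writes can overlap
```

With a window of 2 the two writes to a0 are never in flight together, so nothing is flagged. Widening the window to 3 flags the pair, which matches the intuition that a larger window would make the bug easier to hit.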

Furthermore, most of my test cases were simple C programs optimized using GCC’s O2 optimization flag. We saw previously that the compiler would rearrange our instructions in such a way that we could maximize use of instruction-level-parallelism, such as when we needed to save registers to the stack.

One of these optimizations may be that the compiler tries not to give us instances where two nearby instructions try to write to the same register. Sometimes that isn’t an option due to restrictions in RISC-V calling convention, but my instruction issue window is so small and my test programs are so simple that it just wasn’t an issue.

An example of saving registers to the stack, taken from my processor write-up

I had more tests available to me through RISCOF, but I had stopped supporting it when I migrated the processor from a rigid execution pipeline to the flexible functional unit set-up. I wonder if those tests would’ve caught the bug. I would need to re-add support to find out.

Test Case

We can write a test so that we can see that the issue truly exists and, when we eventually fix it, to help verify that we fixed it.

Conceptually, we want to write to the same register in two consecutive instructions, one that goes through the ALU functional unit and another that goes through the memory functional unit. We might later also want coverage to check that our two ALU functional units don’t also display the WAW hazard, e.g., if we have two consecutive li to the same register. But let’s start small.

In C, the test might look like this:

#include <stdlib.h> // for exit()

int main() {
    int a;
    a = 1;
    a = 0;

    if (a == 0) {
        exit(0); // test passed
    } else if (a == 1) {
        exit(1); // test failed because we used the old value of `a`
    } else {
        exit(2); // test failed for another reason (maybe `a` is uninitialized)
    }
}

For the sake of this test case, I’m going to write it directly in RISC-V. I don’t know if the case I described is something that a compiler would produce from C code, but I do know the situation in RISC-V assembly where it could happen.

All of my existing tests are compiled from C. This new test doesn’t exactly fit into my testing framework as-is, since it’s not doing any of the MMIO to indicate whether the test passed or failed. Instead, I’m judging visually from the pipeline visualization. It would take a bit more boilerplate before I could integrate it into my existing battery of correctness tests for automated testing.

Still, I want to do a quick mockup to illustrate the issue. The following test should fail with our current design. A pass should result in a forward branch, while a fail should result in no branch.

In practice, I don’t think a compiler would ever generate code like this. In this mockup, the lw a0, 0(sp) is dead code because the value is never used before the following li a0, 0. A compiler would probably remove it altogether (though maybe not with the O0 flag). But this arrangement is useful to isolate the bug, since it’s a very short sequence of instructions.

; I'm using ';' because the Rogue syntax highlighter only supports NASM assembler
  
test:
  ; these two instructions put 1 into 0(sp) (it might violate calling convention)
  li a1, 1
  sw a1, 0(sp)  ; save 1 to 0(sp)
  
  ; these instructions both use a0 as a destination. The second should overwrite the first.
  lw a0, 0(sp)  ; load 1 to a0 (from memory)
  li a0, 0      ; load 0 to a0 (from immediate)
  
  beqz a0, pass ; branch to pass if a0 is zero (otherwise continue to fail)

fail:
  li a0, 1  ; kind of like the exit(1); we fail
  unimp  ; repeated several times for cosmetic reasons (processor exits)

pass:
  li a0, 0  ; kind of like the exit(0); we pass
  unimp  ; repeated several times for cosmetic reasons (processor exits)

The test should go to pass, executing li a0, 0 and unimp, which triggers an exit on my processor. But I believe with my implementation that the beqz instruction is going to use the lw value instead, causing the test to fail.

And indeed it does fail, corresponding to Frame 2 of the following animation. We should be getting a result similar to Frame 3, where we exit with 0 in a0 after a forward branch.

Three frames showing the li-lw, lw-li, and nop-li cases respectively, which differ only in the instructions at pc a4 and a8. We should certainly not get the same results in Frame 2 as in Frame 1.

This animation uses three different pairs of instructions to set the value of a0. Going frame by frame, this is what’s happening:

Frame 1 has li-lw. It’s true that the li commits earlier than it should, but it doesn’t mess with the correctness of the program’s output because the value from li is never used before being replaced by the value from lw. In this case, the bug is benign.
Frame 2 has lw-li, corresponding exactly to our assembly test case above. Our processor gives the wrong result, since it should branch but doesn’t.
Frame 3 has nop-li. We branch because we load a0 with the li instruction only, without an lw instruction.

When I say the result is correct or not, I’m only looking at the value we get at the end of the test. In reality, the state of our processor for all three frames can be incorrect if you stopped it at particular cycles because the register file and memory are updated as soon as the W stage concludes for a given instruction. More sophisticated processors might require changes to the processor state to occur in program order through something like a reorder buffer. We don’t currently worry about interrupts or traps, so the out-of-order commits can be benign.

Looking at the visualizations, it’s no mystery why both orderings of li-lw and lw-li behave like li-lw with our processor. In lw-li, while lw is waiting on the previous sw to complete, our processor (mistakenly) allows the li to slide before the lw. The value that ends up in a0 is therefore the one written by the lw instruction, even though lw was supposed to happen before li.
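The three frames can be mimicked with a toy model in which the register file simply takes writes in the order they complete. This is a rough Python sketch, not the processor’s logic: the `Write` tuple, the `final_registers` helper, and the cycle numbers are invented for illustration, and only the relative order of the cycles matters.

```python
from typing import Dict, List, NamedTuple


class Write(NamedTuple):
    index: int  # position in program order
    cycle: int  # cycle when the write reaches the register file (made up)
    dest: str   # destination register
    value: int  # value being written


def final_registers(writes: List[Write], in_order: bool = False) -> Dict[str, int]:
    """Apply writes in completion order, or in program order if in_order is set."""
    if in_order:
        ordered = sorted(writes, key=lambda w: w.index)
    else:
        ordered = sorted(writes, key=lambda w: (w.cycle, w.index))
    regs: Dict[str, int] = {}
    for w in ordered:
        regs[w.dest] = w.value  # a later write simply overwrites an earlier one
    return regs


# lw writes 1 and finishes late (it waits on the sw); li writes 0 and finishes early.
cases = {
    "li-lw": [Write(0, 2, "a0", 0), Write(1, 5, "a0", 1)],
    "lw-li": [Write(0, 5, "a0", 1), Write(1, 2, "a0", 0)],
    "nop-li": [Write(1, 2, "a0", 0)],
}

for name, writes in cases.items():
    got = final_registers(writes)["a0"]
    want = final_registers(writes, in_order=True)["a0"]
    print(name, "a0 =", got, "expected", want)
```

In this model only the lw-li case disagrees with program order: a0 ends up as 1 when it should be 0, while li-lw and nop-li land on the expected value either way.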

Another Hint

Another hint that my processor was bugged is that the Konata pipeline visualizer doesn’t really support instructions committing in the wrong order, as if it expects in-order commits. It looks fine when I make my visualizations without hiding flushed operations, but the bug appears when I do hide flushed operations.

Two frames illustrating the same nop-li case with flushed operations shown and hidden.

Hiding flushed operations makes Konata sort the instructions by when they commit. Because my processor has them (incorrectly) commit out of order, Konata “reorders” my instructions for me in the visualization.

Not only that, but Konata also removes the final li and unimp instructions. I think that must be because the processor terminates simulation in Writeback without the final Konata commits, which maybe were supposed to happen in a later cycle.

This also suggests that the way Konata measures whether an instruction is flushed might be by seeing whether it was committed to Konata, rather than seeing if it had been flushed per se. These are subtly different according to the Konata log format. The distinction only really matters in corner cases like this, since most of the time all instructions are either flushed or committed. These last two instructions just happened to have neither.

How do we fix it?

We have two main options for the fix.

1. We can add in register renaming with a reorder buffer (like prescribed in Hennessy and Patterson) so we can execute instructions that use the same destination register without dependencies between the inputs. This is the harder fix but should make our processor more capable of bona fide out-of-order processing.
- For the reorder buffer, we could either use Bluespec’s CompletionBuffer package or write one ourselves.
- I suspect the register renaming part would be tougher to implement, but maybe not so if I can find a decent strategy elsewhere. For all I know, there might already be an idiomatic implementation in Bluespec.
2. We can introduce stalls so that an instruction is held back from issuing while an earlier in-flight instruction has the same destination register. It might slow down our processor but it should be a simpler fix.
- We would still be able to allow some out-of-order commits as long as they don’t affect the same destination registers.
- A more conservative fix would be to fully disable out-of-order issuing altogether, if you want to call that a fix.
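The stall-based option can be sketched with a small model that tracks which destination registers still have a write in flight. This is a Python illustration under loud assumptions, not a Bluespec design: the `Instr` tuple, the `run` function, the latencies, and the one-issue-per-cycle rule are all hypothetical simplifications.

```python
from typing import Dict, List, NamedTuple


class Instr(NamedTuple):
    text: str     # assembly text, for display only
    dest: str     # destination register
    value: int    # value the instruction will write
    latency: int  # cycles from issue to writeback (made-up numbers)


def run(program: List[Instr], stall_on_waw: bool) -> Dict[str, int]:
    """Issue in program order and apply register writes in completion order."""
    pending: Dict[str, int] = {}  # dest register -> cycle of its latest scheduled write
    events = []                   # (writeback_cycle, program_index, dest, value)
    cycle = 0
    for index, instr in enumerate(program):
        if stall_on_waw and instr.dest in pending:
            # Hold the instruction until the earlier write to this register lands.
            cycle = max(cycle, pending[instr.dest] + 1)
        done = cycle + instr.latency
        pending[instr.dest] = done
        events.append((done, index, instr.dest, instr.value))
        cycle += 1  # assumption: at most one instruction issues per cycle
    regs: Dict[str, int] = {}
    for _done, _index, dest, value in sorted(events):
        regs[dest] = value
    return regs


# The lw-li pair from the test case: a slow load followed by a fast immediate.
program = [
    Instr("lw a0, 0(sp)", "a0", 1, latency=4),
    Instr("li a0, 0", "a0", 0, latency=1),
]

print(run(program, stall_on_waw=False))  # li finishes first, then lw overwrites it
print(run(program, stall_on_waw=True))   # li waits for lw, so its value survives
```

Without the stall, a0 finishes with the stale value from lw. With the stall, li waits and a0 ends at 0. Instructions that write different registers are never held back in this model, so some out-of-order completion is still allowed.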

I can’t do either of these in time for this week’s post, so I’ll defer the fix to some other time.

Prose that I Like

2023-11-05T00:00:00-04:00

Diane in Bojack Horseman S02E09 12:44: You can do anything you want in life! I mean, not everyone can write for the New Yorker, but there’s always the Atlantic!

I’ve been a long time reader and admirer of the New Yorker. It’s a rather fashionable magazine that publishes some of the best written essays I’ve read. As someone with some interest in writing, I look upon the New Yorker’s best pieces as models for the craft.

I set up this blog in part to practice writing so I can get closer to the quality I love to read. I’ve been trying to post weekly with some consistency because I think a steady trickle of practice will get me farther than irregular bursts. It also helps keep time while I’m not employed, though I don’t intend on stopping when I am.

I’ve generally written these posts by the week, but I plan on accumulating a backlog of content to release over time. I’d like to be able to share something modest each week and occasionally release large, well-prepared posts like my processor write-up. I don’t think I’ve got enough energy at this point in my life to write something of that caliber on a weekly basis, nor do I have enough content to justify it.

This week, I’d like to share a short list of pieces I admire. They have a neatness that I try to emulate in both prose and code. Each of these pieces was written deep into a celebrated writer’s career. While I’m trying to make my career in engineering, I can still admire their art.

John McPhee has a column called “The Writing Life” in the New Yorker. My favorite piece is “Omission,” on leaving some details unwritten.
Ann Patchett wrote an essay collection called These Precious Days. I read the eponymous essay after first encountering her New Yorker piece “How to Practice,” on living. Both are solid reads.
Annie Dillard wrote many things, but I’ve only read her 1989 memoir The Writing Life. It speaks to living and writing. I think it’s worth reading in its entirety. There are no copies I can link to, but those of you who know me can ask me for mine. My phone wallpaper is a screenshot of an excerpt with one line highlighted: “How we spend our days is, of course, how we spend our lives.”

Supermarket Fried Chicken

2023-10-29T00:00:00-04:00

A picture I took recently from a walk back from ShopRite after picking up a set of dark fried chicken (thighs and drums). I often start eating on the way back, since the chicken doesn’t get any fresher.

I’ve long been a proponent of getting fried chicken at supermarkets. I can’t speak for more local establishments, but of America’s big three fried chicken chains (Chick-fil-A, Popeyes, KFC), I think only Chick-fil-A is worth the money. More often than not, grocery stores that offer fried chicken do it better and at far better prices than the other mainstream chains.

One of the tipping points that turned me off from mainstream fast food fried chicken was a bad experience I had at the KFC in Allston when I was at MIT. In my childhood, KFC used to be synonymous with fried chicken, and indeed it used to be at the top. But earlier this month (October 2023), it fell to #3 behind Popeyes, and it sits even further behind Chick-fil-A at #1. KFC does well internationally, just not so great domestically anymore.

My single visit to the location in Allston left no questions as to why. The brand has fallen off, and it has fallen off hard.

Somehow KFC manages to be the most expensive and the worst in quality and portion size. Other locations might be doing better, but my visit to the Allston location of KFC only confirmed my earlier impressions.

The drumstick was smaller than some drumettes you see from chicken wings. Frankly, I don’t think Colonel Sanders would be proud of this product.

To get to the KFC in Allston, my dear friend Syd and I had to bike four miles away from the MIT campus.

I’ve had Popeyes a few times, but I don’t really like the way it tastes. I think it’s something in their frying oil. Even if Popeyes (or other chains like Jollibee) are actually better than supermarket fried chicken, it’s by such a thin margin that it’s not worth the increased expense. The only exception is Chick-fil-A, which truly does do it better.

Of the big three, I think there’s no mystery as to why Chick-fil-A has gotten so successful compared to the other chains. I’m no business analyst, so I can’t say anything about their methods in logistics or business strategy, but I do know that the product they offer is miles better than its competitors. I have yet to have a fried chicken sandwich anywhere else that holds up against Chick-fil-A’s. It’s just so good.

There’s only one Chick-fil-A within 10 miles of MIT’s campus, and I would often bike there and back for their spicy chicken sandwiches. Former Boston Mayor Thomas Menino released a letter in 2012 pushing against the company’s attempt to open a Boston location. This location only opened in January 2022, and I went there frequently from Spring 2022 through the 2022-2023 school year.

Chick-fil-A has had some baggage in the past with regard to charitable contributions to anti-LGBT groups, and part of that is rooted in their uncommonly religious background compared to other large American chains. It’s not enough baggage to shake it from its #1 position on customer satisfaction for the 9th year running on the American Customer Satisfaction Index, nor is it unadulterated enough that conservative Christian groups won’t take the opportunity to accuse it of pandering to liberal America.

I’ve heard rumors that the company has reformed its charitable contributions in recent years to become more palatable to liberal consumers. It’s a natural approach when a company is trying to expand outside of the American South, especially into the massive liberal urban coastal markets. Chick-fil-A has gotten so far despite its past baggage because it has no real competition on quality of product.

In terms of whether it’s moral as a consumer to support Chick-fil-A as it is today, I wouldn’t sweat it any more than supporting any other big company like Nike or Nestle. You know how they are.

But enough about fast food restaurants. For a fraction of the price, you can get fried chicken at or above the caliber of non-Chick-fil-A chains at your local supermarket, at least in my experience living in Cambridge and Philadelphia. I often like to get a batch of fried chicken once every few visits to the supermarket.

As a policy, I only ever get dark meat (thighs and drums). A chicken breast cannot survive hours under a supermarket heat lamp the same way a chicken thigh can, and it will not measure up as favorably against its restaurant counterpart.

While I’ve been here in Philadelphia, I’ve been getting fried chicken from my local ShopRite. I haven’t sampled supermarket fried chicken from other stores, mainly because I’ve been staying put in my neighborhood. But it’s as good here as it’s been anywhere else, with delicious golden skin and acceptable meat.

4-piece dark (much rarer) sold hot in December 2019 at my local ShopRite when I visited for winter break freshman year. I imagine it must’ve been a mislabeling for the chicken to be sold so cheaply, since the 8-piece was usually $6.49 (now it’s $9, not a big increase). The quality hasn’t changed much in the years since. It’s just as tasty as ever. If you want a good picture of the chicken today, scroll all the way up.

When I was at MIT, I would get supermarket fried chicken at several nearby supermarkets (there, Shaw’s and Star Market). I don’t know if the ingredients vary by location, but I know that the manner of preparation certainly does. Each location had its own style. I never kept track of how it varied.

Here’s a recent picture of a fried chicken thigh my dear friend Syd got from a Star Market in Somerville. We often used to get groceries and supermarket fried chicken together. Admittedly, it doesn’t look very pretty. It kind of looks like an amateur fry job that you might see at a high school bake sale with fried Oreos.

Over the years, I’ve tried frying my own chicken several times according to a copycat Chick-fil-A recipe on Serious Eats by (MIT alum) J. Kenji López-Alt. It’s an excellent recipe, but fried chicken really is one of those foods that both benefits from scale and can’t be feasibly meal prepped. I think most people are better off buying than making.

Whether you’re frying a few servings or many servings, there’s a comparable amount of preparation and cleanup because of the large quantity of oil. Meanwhile, you can’t exactly prepare ten servings of fried chicken the same way you might prepare a big pot of chili and expect the quality to remain steady over a week.

I fried this batch of chicken according to the Serious Eats recipe when I was still living at my dorm East Campus at MIT. I used a communal deep fryer that 5e got in either 2019 or 2020, but I think this occasion was the only time the fryer had ever been used.

Fried chicken connoisseurs might claim that supermarket fried chicken is too mushy or too soft on account of the way it’s made and stored. It can certainly be true, but I would say the quality can vary dramatically by location and time of day. I’ve only ever had good experiences with my local ShopRite in Philadelphia, but I have had bad experiences with some supermarkets in New England.

Here’s some so-called fried chicken at a no-name supermarket in Connecticut. I bought it at a rest stop on the way back to Philadelphia at the start of winter break from when I was living on Cape Cod for the pandemic academic year 2020-2021. It was tolerable but not particularly good, kind of like most food that was sold on the Cape in the off-season. It was the worst supermarket chicken I’ve ever had. Still better than my visit to KFC.

The quantities in which the chicken is sold can also vary. In Cambridge, Star Markets tend to sell fried chicken by the piece, but my experience in Philadelphia has been that they’re sold in sets of four to eight pieces, most commonly eight. Here, they’re offered in both 8-piece regular (2 each of breasts, wings, thighs, and drums) and 8-piece dark (4 each of thighs and drums). As mentioned, I always get the dark.

On an empty stomach, I’m unable to eat more than three or (pushing it) four dark pieces in a single sitting. The leftovers, I tend to either eat cold straight from the fridge or reheat in an air fryer. Surprisingly, air frying leftover supermarket fried chicken can get pretty close to its quality when fresh.

Gallery

Some ShopRite fried chicken I bought during January 2021 when I was visiting Philadelphia from Cape Cod for winter break my sophomore year.

A picture I took of an 8-piece dark set sold cold in June 2021 at my local ShopRite when I was living in Philadelphia for the second pandemic summer. I assume that hot fried chicken from the previous day (or days) is relabeled and sold cold in the following days. It can be enjoyed as-is or reheated. I typically only buy fried chicken hot.

Sometimes, instead of getting fried chicken, I like to get the roasted leg quarters. It’s like rotisserie chicken but just the yummy dark meat, and it’s even cheaper. It’s too much to eat in one sitting so I like to eat the skin first and then keep the meat for later meals.

One Hundred Installations

2023-10-22T00:00:00-04:00

Introduction
Caveat
Impact
Effort vs Impact
Who Else?
Who Else, Really?

Introduction

Earlier this week, my Bluespec extension for VS Code hit 100 installations, slightly under a month from when I released it. The number of actual users must be lower, but I’ve heard from multiple contacts in MIT’s 6.191 (both students and TAs) that people are finding it useful.

Thanks to my extension, there are now a bunch of folks who have industry-standard syntax highlighting for their Bluespec homework assignments and projects instead of plain black-and-white.

I remember going into it thinking, “well, if the only person this helps is me, that’s enough.” I’m very proud that it’s ended up helping others.

Caveat

While reaching 100 installations is something I’m pleased with, it’s a coarse measure of impact that would’ve happened even if my extension was complete garbage.¹

Installations are something that users can’t take back. If someone installs an extension, sees that it sucks, and immediately uninstalls it, there’s no mechanism that accounts for it in the popularity count. It still counts as an installation.

The number also reliably goes up because of how empty the field is. There are only three results that show up from searching Bluespec on the Marketplace, and installing VS Code extensions is so easy that people can just install all three and stick with the one that’s best.

I think my extension’s quality (both in implementation and documentation) has a positive effect for retention, but I have no way of measuring it. Practically nobody leaves ratings or reviews. The most popular extensions on the Marketplace with their 50M+ installations only have hundreds of ratings. Mine has three ratings.²

Impact

In all, I know a good handful of students taking 6.191 who are using my extension for their Bluespec (or I suppose Minispec). Some of them switched from the two barebones extensions on the Marketplace, and some of them switched from plaintext black-and-white.

When I released the extension, I told two current 6.191 TAs that I made my extension available, just in case it’d be useful for them or their students. One of them told me that they already had students asking for a VS Code extension for Minispec in office hours.

More recently, I was told by a friend that one of the other TAs (whom I’d never seen or heard of) was helping their students install my extension for their VS Code setups.

That was a sublime feeling, knowing that people I hadn’t met were recommending my tool to others and helping them install it.

It also feels nice, like in an artisanal way, to produce something beautiful (and I would consider my extension beautiful) and see it appreciated by others. It feels like what a carpenter might feel after having made a nice stool appreciated by themselves and others.

Effort vs Impact

In all, I didn’t spend a herculean effort on the extension, and yet it’s paid off handsomely. I spent about a week of effort and now that effort’s been amortized across dozens of my peers at MIT and Bluespec users at other institutions. I genuinely think the extension has already saved enough cognitive effort cumulatively across its users to outweigh the amount of time I spent developing it, and the degree by which that’s true is only going to grow with more users.³

Is my Bluespec extension technically impressive? Not really. It’s basically a set of regex rules in a trench coat.⁴ But has it filled a niche that has already made dozens of people happier to write Bluespec in VS Code? Yes.

Firehose (now Hydrant) was built in a weekend in 2017. Every semester, practically every undergrad at MIT uses it to plan their classes. It’s brought immeasurable comfort (even joy) to the undergrad community at MIT, and it has become as quintessentially MIT as the Great Dome. I want to give appropriate credit to the maintainers from SIPB like CJ without whom the tool would no longer exist, but I’ve always thought it was remarkable that such an impactful tool was first created in a single weekend.⁵

It especially feels weird having this extension as a project next to my months-long-and-technically-difficult (but in-hindsight-kind-of-unimpressive⁶) project working on my processor.⁷

Who Else?

These things bring to mind the apocryphal story of Columbus’s egg. It’s like, well anyone could’ve done it. The thing is (and not to toot my own horn), nobody did do it.⁸ Why not?

My extension still has a ways to go to be a truly great extension⁹, but it’s a little mind-blowing that the niche it filled was so, so empty considering the amount of positive impact per unit effort. It’s by no means a massive user base, but definitely at least a thousand Bluespec users per year stand to benefit, if the other extension gives us any indication.

Before I did it, I would’ve expected maybe course staff in 6.191 (with their instructors or army of TAs) to provide a modest syntax highlighter the same way that other introductory classes at MIT have built up their class infrastructure.¹⁰ It would’ve taken a week, maybe two, of a single TA’s hours out of their ten TAs, and it would’ve paid dividends for several semesters because such a tool can be reused.

But I think the course staff hasn’t, as an institution, hopped onto the VS Code hype train yet.¹¹ Or, just as likely, they just haven’t sought to take on that sort of pedagogical burden of providing syntax highlighting tools to students, especially when there are so many other pedagogical burdens to consider.

Instructors have many, many choices on where to expend their energy to improve a class, and syntax highlighters might not be on their radars the same way they were on mine, especially when placed against improving lecture materials or operational aspects like running office hours. The instructors might not have known (or recognized the need), and the TAs might not have wanted to assume that sort of extra responsibility on top of their existing duties.

If not 6.191 course staff, then I would’ve expected Bluespec Inc. or the related B-Lang organization to develop and publish a VS Code extension for Bluespec. They seem to be doing a lot of marketing for Bluespec as a tool, including open-sourcing their Bluespec compiler, but none of that outreach seemed to include adding Bluespec support for the world’s most popular IDE, Visual Studio Code.

It’s a rather small company though, so it’s unsurprising if they’re focused more on other tooling, compiler enhancements, or toolchain integrations that make Bluespec more attractive to hardware companies. They have internal syntax highlighters for Emacs and Vim, but it could also be a generational¹² difference where nobody there uses VS Code, or none of their main customers in the hardware design industry do.

every time I have to review software source code that was written by a hardware company I come away with the feeling that I have encountered an alien form of life
— badidea 🪐 (@0xabad1dea) October 17, 2023

I don’t have data, but I would also be unsurprised if VS Code’s market share was greater among software engineers than it is among hardware engineers. There’s less of a distinction in the EECS undergraduate program at MIT, because everyone who’s writing hardware is probably writing software in another class. I can imagine the industry being a different world.

In this way, VS Code support for Bluespec might be far more important for college students than for industry engineers, but Bluespec Inc. is exclusively focused on the latter, as per their business model.

Who Else, Really?

It kind of all comes down to, who is in a position where they would care enough about Bluespec syntax highlighting to make it better? It’s not like there are syntax highlighting engineers running around looking for languages to develop extensions for. Course staff are generally focused on teaching their class material, and syntax highlighting is a ways off the beaten path. Bluespec Inc. is focused on developing their EDA tools for enterprise, whatever it is that they do.

And hardware engineers (like what I’m aspiring to be) generally want to be working on hardware, not the sort of infrastructural tools that involve playing with JSONs and writing regular expressions. But who else but the hardware engineer would care to work on tools to make hardware engineering more comfortable? It’s weird!¹³

These considerations presume that syntax highlighting for Bluespec is important. It’s a difficult case to prove to instructors or seasoned Bluespec engineers who have forgotten what it’s like to be new. I just know subjectively that, as a student who was learning unfamiliar concepts in an unfamiliar language, the familiarity of high-quality syntax highlighting would’ve made the learning curve easier and freed up some cognitive resources to focus on the technically difficult parts of writing Bluespec.¹⁴

In no place is syntax highlighting more important than for introducing people to Bluespec, whether it is MIT undergraduates in 6.191 or engineers considering picking up Bluespec. These people have not yet developed the Bluespec visual recognition skills that come with experience. Of course, syntax highlighting can remain useful after you do develop those skills.

I could’ve also been caught up thinking about how it should’ve been someone else making the tool, rather than me, a Joe Schmoe looking for a job after graduation. But I don’t know. It would’ve been nice if someone else made it, but it ended up being me. Someone had to do it. No great harm came to me from doing something useful that I wasn’t obligated to do.

Intellectual work, its proper distribution and compensation, and humanity’s open-source project are hard issues to think about in this digital age. Maybe I’ll think more about it another time. I don’t think the answer is necessarily to kick the can around waiting for someone else to pick it up, but I also don’t think the answer is for a self-selected few to pick up cans while the world looks on. But what can you do? Not everything fits within a transactional framework.

This syntax highlighting is a very small-stakes case for a very small audience, but it reminds me of the many actually important projects that really are thanklessly made and maintained by volunteer labor. The highest profile example that comes to mind, while not necessarily code, is the content on Wikipedia. There are countless examples in open-source software, but it’s not a world that I’m very steeped in on the developer side.

As a user, like everyone else, I’m constantly benefiting from software (and more generally, intellectual work) made by the volunteer labor of others. This website itself was generated using Jekyll, an open-source static site generator with really not all that many contributions compared to the value it provides.

I know in the age of hyperscale and impact (especially in software), a hundred installations is small potatoes compared to tools that millions of people use. But it’s my first time having such tangible (if modest) positive impact through something digital I created. It’s a profound feeling to have.

Case-in-point, the only other true Bluespec extension has nearly 10k installations since its release five years ago. Almost all its functionality is from out-of-the-box VS Code language support (C-like // comment recognition, bracket matching) and very basic keyword recognition. My extension had better syntax highlighting within half an hour of development. However, the other extension’s redeeming feature (or context) is that there truly wasn’t anything better on the Marketplace, and this is a case where something is better than nothing. Kudos to them for taking action, however small. ↩
Three ratings is already more than the other true Bluespec extension, which has two ratings across its nearly 10k installations. And if you, dear reader, enjoy the extension, please leave me a rating too. ↩
Of course, not that I necessarily value my time the same as I do anybody else’s time, but in terms of Improving Efficiency™️ for humanity (taken at face value without all the moral baggage that attends whether brute efficiency is good), it feels like I’ve helped people! ↩
For now. We need to start somewhere. It is a nice base from which to add better quality-of-life features. ↩
I’ve long been captivated by the idea of beginnings. For example, why did EC Build start in 2004, why did no code editor command a supermajority of developers until Visual Studio Code, why did nobody make Firehose until Firehose was made? Every beginning must come from some context, but that context isn’t obvious for everything. It’s no surprise that rideshares or bikeshares didn’t take off until the proliferation of smartphones. The beginnings that are most surprising are the ones where there doesn’t seem to be an obvious accompanying shift that made them possible. Although, that might just be me not inspecting closely enough. ↩
Don’t tell my prospective employers that I called it unimpressive. It represents a lot of focus, effort, and difficulty, even if it doesn’t look all that fancy. It’s just subjectively anticlimactic because of what it is. An aspiring but solitary architect would make a dud of a cathedral by themselves. ↩
You could also say that whereas the Bluespec extension was built for impact, the processor was built as a learning exercise, which is true. With three months of solitary effort, I didn’t expect to go toe-to-toe with processors built by teams of hundreds of engineers. ↩
Though, it’s not like anyone but me is saying “anyone could’ve done it.” I just happened to do it and I’m thinking, anyone really could’ve done it, if only they just did it. ↩
I still need to figure out (if I’m still working on the extension) how to have truly useful snippets, and other subtle, quality-of-life changes that make writing Bluespec more comfortable. For that, I’ll need either user data (good luck getting that) or personal experience, which my slow but steady technical blog post series using Bluespec provides. ↩
Some MIT classes have developed impressive pedagogical tools, like 6.101 and Adam Hartz’s CAT-SOOP, 6.102 and its Praxis Tutor, or (the very same) 6.191 and Daniel Sanchez’s Minispec HDL. It’s hard to pick one example, but when I took Software Performance Engineering: 6.106 in Fall 2021, I considered it to be a master class in pedagogy (or at least classroom infrastructure).

Outside of MIT, I know there are some who use Bluespec at IIT Madras, University of Cambridge, UC Irvine, and so forth. I don’t know their standards for classroom (or laboratory) development infrastructure, so I don’t hold it against them. ↩
When I took 6.191 and when many of my friends took it, we were only taught to use the terminal. I still shudder thinking about it. In this year 2023? ↩
VS Code was only released in 2016, and even though it has taken over the world for several years running, it’s probably not super obvious to people who don’t actively interact with young developers writing code. I had a millennial professor who lectures hundreds of students every year and hadn’t even heard of VS Code when I mentioned it in Fall 2022. ↩
When I was writing the syntax highlighter, I was reminded of the ice vendor that was featured recently on Eater’s YouTube channel. The guy was a bartender who wanted access to high-quality ice, but there was nobody in New York who could provide it. He started an ice manufacturing business that now produces clear ice for both his bar and other bars across the city. ↩
Though, the verbosity of Bluespec and most programming languages is nothing compared to the open-close tag system of HTML. The usefulness still holds. ↩

Experimenting with the Synth Tool

2023-10-15T00:00:00-04:00

Introduction
Overview
Background
Synth Tweaks
Verilog Full Adder
Bluespec Full Adder
Verilog Wrapped with Bluespec
Next Time

Introduction

In a series of upcoming posts, I will be presenting worked Bluespec and Verilog examples of different adders for eventual use in my RISC-V processor project. I’ll be using these adders both to replace existing adders and as components in future functional units like my integer multiplier or floating point unit.

Before all that, I need to perform some tests and set up some infrastructure. It’s no good to blindly implement components, so I spend this post experimenting with synth to identify quirks and see how it interacts with Bluespec and Verilog when we involve wrappers, which are required to import Verilog into Bluespec.

I also tweak synth to accept Verilog directly, which will be helpful to evaluate Verilog components in the same way I evaluate Bluespec components. Some upcoming posts will see whether we actually get any performance gains from implementing modules in Verilog rather than Bluespec.

This post also serves as a visual walkthrough of using the Minispec synth tool. Documentation on its use is sparse, so I figured I may as well write some here.

Overview

I begin by discussing some tweaks I made to my fork of Daniel Sanchez’s synth tool for Minispec. These tweaks enable the rest of this post.

Then, I demonstrate the use of synth on a Verilog implementation of a full adder. I also show an example using boolean and bitwise operators where quirks in our downstream synthesis tools can create suboptimal circuits, so we should take synthesis results with a grain of salt. Because a full adder creates only a simple circuit, I also include gate-level logic circuit visualizations created using synth with several cell libraries.

I also demonstrate the use of synth on Bluespec implementations of full adders, including showing the resulting Verilog files from compilation and some strange properties that emerge when we nest Bluespec wrappers, including losing and gaining efficiency in the resulting circuits.

Afterward, I demonstrate Bluespec’s ability to directly use Verilog implementations in Bluespec designs, which will be helpful if we find Verilog implementations to be more efficient than our Bluespec ones. However, I found no performance difference with simple circuits like full adders, so that would require more testing with more complex circuits to see whether implementing in Verilog is worth the trouble. We’ll explore these things and more next time.

Background

For the past couple weeks, I’ve slowed down on technical blogging because I’ve been practicing my Verilog with the wonderful exercises on HDLBits. I’m starting to exhaust their Verilog material, so it’s about time to apply what I’ve learned. With all this practice, I’m now able to do two things:

I can now inspect and understand the .v files that result from compiling my .bsv files. Simple Bluespec modules can give us legible Verilog. With complex modules, it takes more effort but can be done, especially when side-by-side with the Bluespec source code.
I can now write .v files directly and import them as IP blocks into my Bluespec designs through the import "BVI" feature. This works best for simple modules that are done more efficiently in Verilog.

I like Bluespec for its high-level constructs and abstractions. One common criticism of the language is that the Verilog outputted by the Bluespec compiler might not be performant enough to supplant writing Verilog by hand. The trade-off is acceptable for complex top-level modules that can’t be prototyped quickly in Verilog, but in small, reusable components like adders and FIFOs, it can make sense to go lower in abstraction.

(In this post, I found no evidence with the simple full adder example that Bluespec produces any less performant circuitry than Verilog. It’s too soon to draw conclusions on this front, since we’d need more complex modules.)

This is especially the case when the optimizing compiler isn’t mature enough. There was probably a point in history when C compilers didn’t produce performant enough assembly for developers to program exclusively in C. Bluespec may very well be at that point right now with producing performant Verilog. In an ideal world, the Bluespec compiler should be able to automatically make the same optimizations a human designer would.

To understand how our Bluespec turns into Verilog, we can refer to the BSC User Guide. People interested in greater detail should check out the chapter “Verilog back end” and especially the subsection “Bluespec to Verilog mapping”, which describes how .bsv files are transformed into Verilog .v files.

You can also read the chapter “Embedding RTL in a BSV design” in the BSV Reference Guide where they discuss importing Verilog modules into Bluespec for use in the Verilog backend. As per the User Guide, the Bluesim backend is currently incapable of using Verilog directly. When we import, we’d need to use Verilog simulators or write Bluespec implementations for simulation in Bluesim. This makes it a little less convenient to import Verilog when we use Bluesim for simulation, like I currently do.

Synth Tweaks

The synth synthesis tool we use from Daniel Sanchez’s Minispec compiles our Bluespec .bsv files into Verilog .v files, then does a bunch of processing with yosys and ABC to determine our area and critical-path delay.

It’s a nicely designed tool, but I need to make a series of tweaks to make it work better for my purposes. The main change is that I’d like to be able to synthesize Verilog files directly, but I also make a bug fix and a cosmetic change. You can see my modified version on my fork on GitHub. I don’t know how widely applicable my changes are, so I don’t plan on making a pull request.

Accepting Verilog Inputs

The synth tool was built to consume Minispec and Bluespec, but internally it compiles both into Verilog .v files for synthesis with downstream tools like yosys and ABC. There might be established tools for generating area and delay numbers for Verilog designs, but I like Minispec’s synth and I have trouble finding off-the-shelf synthesis tools. (I suspect many of them are proprietary.)

I modified synth to be able to accept Verilog modules directly for synthesis. It’s just a matter of being able to skip the Minispec/Bluespec compilation step of the synth tool and using the Verilog .v files directly.

It’s also a matter of moving the .v files in the current directory into the synthDir so that they can be consumed as needed by other modules (specified by the .use files). This is especially important for Bluespec import "BVI" statements because the .v files from the compilation will assume that the imported .v files will be available for synthesis.

When we eventually do Verilog simulation, we’ll also need to ensure that our .v files are moved to build for simulation.
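Here’s a minimal sketch of the two ideas above: skip compilation when the input is already Verilog, and stage the local .v files where downstream synthesis can find them. It is not the actual code in synth. The function names and the rule "anything that isn’t .v gets compiled" are assumptions I made up for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def needs_compilation(source):
    """Hypothetical rule: only non-Verilog sources go through the compile step."""
    return Path(source).suffix != ".v"


def stage_verilog(src_dir, synth_dir):
    """Copy every .v file in src_dir into synth_dir and return the staged names."""
    src_dir, synth_dir = Path(src_dir), Path(synth_dir)
    synth_dir.mkdir(parents=True, exist_ok=True)
    staged = []
    for v_file in sorted(src_dir.glob("*.v")):  # non-recursive on purpose
        shutil.copy2(v_file, synth_dir / v_file.name)
        staged.append(v_file.name)
    return staged


def demo():
    # Build a throwaway project directory so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as tmp:
        project = Path(tmp)
        (project / "FullAdder.v").write_text("module FullAdder(); endmodule\n")
        (project / "Wrapper.bsv").write_text("// Bluespec source\n")
        print("compile FullAdder.v?", needs_compilation("FullAdder.v"))
        print("compile Wrapper.bsv?", needs_compilation("Wrapper.bsv"))
        print("staged:", stage_verilog(project, project / "synthDir"))


demo()
```

Copying rather than moving keeps the originals in place, which is convenient if the same files later need to be staged into build for simulation.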

Alternatives

When I was thinking about how to measure the performance of both Bluespec and Verilog modules, I briefly considered using the wrapper-only route. I wouldn’t need to modify the synth tool as long as all my Verilog modules were presented as Bluespec modules.

I decided that it would be a little too roundabout to need to wrap all my Verilog modules in Bluespec just to synthesize them. I may want to synthesize separately even before importing these modules into a Bluespec design. It’s not much trouble, but it requires writing a bit of boilerplate.

It wasn’t so hard to modify the synth program. It’s written in Python, so I just needed to read through it and figure out what to change.

Buffer Configurations

I had already tweaked my installation of synth during my processor project. During the step where the program synthesizes with three buffer configurations, one of them would suddenly require much, much, much more computation than the other two. It’s no problem for small designs, but it would take so much computation for synthesizing my L1 caches that the synthesis would crash.

To locate the issue, I looked at the different output logs from synth to see where the tool was stalling. I found that synth would generate several configurations and select the best one. synth would crash because one of these configurations would stall.

There are 6 different outputs because the tool tries 3 buffer configurations and both -O0 (ox) and -O1 (ob) optimization parameters.

I “fixed” the issue by making synth skip the configuration prone to stalling. I don’t know whether it’s a true fix because it might result in worse generated circuits for some designs. I checked it makes no difference for my full adder implementations.
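As a rough sketch of the bookkeeping, here is how 3 buffer configurations and 2 optimization levels give 6 runs, and what skipping one buffer configuration leaves. The buffer names are placeholders (I’m not reproducing synth’s real ones), and I’m assuming the skip removes the whole buffer configuration rather than a single run.

```python
from itertools import product

BUFFER_CONFIGS = ["buf_a", "buf_b", "buf_c"]  # hypothetical names
OPT_LEVELS = ["-O0", "-O1"]


def synthesis_runs(skip=()):
    """Enumerate (buffer config, optimization level) pairs, minus skipped configs."""
    return [
        (buf, opt)
        for buf, opt in product(BUFFER_CONFIGS, OPT_LEVELS)
        if buf not in skip
    ]


all_runs = synthesis_runs()
trimmed_runs = synthesis_runs(skip={"buf_c"})  # pretend buf_c is the one that stalls
print(len(all_runs), "runs before,", len(trimmed_runs), "runs after skipping")
```

The cost of this kind of fix is visible in the enumeration: the best circuit is now chosen from fewer candidates, so a design that only does well under the skipped configuration would come out worse.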

SVG Tweaks

I also adjusted the color scheme of the svg generator to output dark mode circuit visualizations, just because that’s what I use for everything, including this blog.

If I were submitting a pull request, I would want to make it configurable from the command line. But because I only ever need dark mode, I just changed the color values in the svg file in synth.

Verilog Full Adder

In this section, I wanted to test out my changes to synth by synthesizing a simple full adder module written in Verilog.

I also run a little experiment with using different operators. Below, I choose to use boolean operators (e.g., &&, ||) even though I could use bitwise operators (e.g., &, |). I explain more soon.

The synth tool customarily requires us to have every module accept a CLK, which can remain unused.

module FullAdder(
    input CLK, a, b, c_in,
    output reg sum, c_out  // assigned in an always block, so declared reg
    );
    always @(*) begin  // generally I would prefer always_comb in SystemVerilog
        sum = a ^ b ^ c_in;
        c_out = (a&&b) || (a&&c_in) || (b&&c_in);
    end
endmodule

With my above tweak, I can run synth FullAdder.v FullAdder to generate synthesis logs.
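Before trusting any synthesis numbers, it’s worth a quick sanity check that the two equations really describe a full adder. This is a small Python model of the logic (not of the Verilog semantics in general): for every input combination, the pair (c_out, sum) should equal the two-bit sum of the three input bits.

```python
from itertools import product


def full_adder(a, b, c_in):
    """Same equations as the module above, on 0/1 integers."""
    s = a ^ b ^ c_in
    c_out = (a & b) | (a & c_in) | (b & c_in)
    return s, c_out


for a, b, c_in in product((0, 1), repeat=3):
    s, c_out = full_adder(a, b, c_in)
    # The carry is the 2's place and the sum is the 1's place.
    assert 2 * c_out + s == a + b + c_in
    print(a, b, c_in, "->", "sum", s, "c_out", c_out)
```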

Basic Cell Library

Synthesizing FullAdder from file FullAdder.v as a Verilog module.
Synthesizing circuit with std cell library = basic, O1, target delay = 1 ps

Gates: 14
Area: 10.11 um^2
Critical-path delay: 51.75 ps (not including setup time of endpoint flip-flop)

Critical path: b -> sum
               Gate/port   Fanout        Gate delay (ps)  Cumulative delay (ps) 
               ---------   ------        ---------------  --------------------- 
                       b        3                    7.6                    7.6 
                   NAND2        3                   14.3                   21.9 
                     INV        1                    8.4                   30.3 
                    NOR2        1                    6.1                   36.4 
                   NAND2        1                    8.6                   45.0 
                   NAND2        1                    6.7                   51.7 
                     sum        0                    0.0                   51.7 

Area breakdown:
               Gate type    Gates       Area/gate (um^2)       Area/type (um^2)
               ---------    -----       ----------------       ----------------
                     INV        4                  0.532                  2.128
                   NAND2        8                  0.798                  6.384
                    NOR2        2                  0.798                  1.596
                   Total       14                                        10.108
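The totals in the area breakdown are just gate counts multiplied by per-gate areas. A short check, with the numbers copied from the log above:

```python
# (gate count, area per gate in um^2), copied from the basic-library log
AREA_BREAKDOWN = {
    "INV": (4, 0.532),
    "NAND2": (8, 0.798),
    "NOR2": (2, 0.798),
}


def totals(breakdown):
    """Return (total gates, total area in um^2 rounded to 3 decimals)."""
    gates = sum(count for count, _ in breakdown.values())
    area = sum(count * per_gate for count, per_gate in breakdown.values())
    return gates, round(area, 3)


total_gates, total_area = totals(AREA_BREAKDOWN)
print(total_gates, "gates,", total_area, "um^2")
```

The log’s headline "Area: 10.11 um^2" is the same total rounded to two decimals.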

The synth tool includes an svg diagram visualizer for circuits made with the standard (basic) cell library. We get that by using the -v flag, e.g., synth FullAdder.v FullAdder -v.

Let’s see what this looks like.

Notice the synthesis mostly uses INV, NAND2 and a couple NOR2 gates, whereas a textbook full adder might only use XOR2, AND2, and OR2 gates. Modern physical design (or at least the kind that they teach in schools) preferentially uses NAND gates because they result in an overall cheaper circuit.
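To make the NAND preference a little more concrete, here is the classic nine-gate, NAND-only full adder modeled in Python. This is a well-known textbook construction, not the circuit that synth produced above (that one also uses INV and NOR2).

```python
from itertools import product


def nand(x, y):
    """2-input NAND on 0/1 integers."""
    return 1 - (x & y)


def nand_full_adder(a, b, c_in):
    """Full adder built from exactly nine NAND2 gates."""
    n1 = nand(a, b)
    a_xor_b = nand(nand(a, n1), nand(b, n1))      # gates 2-4: a XOR b
    n4 = nand(a_xor_b, c_in)                      # gate 5
    s = nand(nand(a_xor_b, n4), nand(c_in, n4))   # gates 6-8: (a XOR b) XOR c_in
    c_out = nand(n1, n4)                          # gate 9: a&b | (a XOR b)&c_in
    return s, c_out


for a, b, c_in in product((0, 1), repeat=3):
    s, c_out = nand_full_adder(a, b, c_in)
    assert 2 * c_out + s == a + b + c_in
print("nine-NAND full adder matches integer addition for all 8 inputs")
```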

Boolean Quirks

By accident, I noticed there’s a quirk that happens when I use bitwise versus boolean operators. I think it must be an issue with the downstream optimization because semantically, it shouldn’t matter whether we’re using boolean operators or bitwise operators when each operand is a single bit. Indeed, we’ll see later that the downstream gate placement can vary unpredictably.
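Here’s a small Python model of why the two operator families should agree on single bits. I model a logical operator as "treat any nonzero value as true, return 0 or 1", which is my simplification for plain two-state values, and compare it to the bitwise operator. They only diverge once an operand is wider than one bit.

```python
from itertools import product


def logical_and(x, y):
    """Model of a logical AND: any nonzero operand counts as true."""
    return int(bool(x) and bool(y))


def logical_or(x, y):
    """Model of a logical OR: any nonzero operand counts as true."""
    return int(bool(x) or bool(y))


# For single-bit operands, logical and bitwise operators agree everywhere.
single_bit_agree = all(
    logical_and(a, b) == (a & b) and logical_or(a, b) == (a | b)
    for a, b in product((0, 1), repeat=2)
)
print("agree on single bits:", single_bit_agree)

# With a 2-bit operand they can differ: 0b10 is "true", but shares no bits with 0b01.
print("logical:", logical_and(0b10, 0b01), "bitwise:", 0b10 & 0b01)
```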

We get a different circuit when we use c_out = (a&b) | (a&c_in) | (b&c_in);, even if semantically we should get the same thing.

It’s technically up to the engineer whether this circuit is better or worse. It results in 16 rather than 14 gates, but we shave off about a third of a ps of delay. I would probably go with the original 14-gate circuit, since the 16-gate circuit is only 0.7% faster (51.39 ps vs 51.75 ps) but about 16% larger (11.704 um^2 vs 10.108 um^2).

Critical-path delay: 51.39 ps (not including setup time of endpoint flip-flop)
  Gate/port   Fanout        Gate delay (ps)  Cumulative delay (ps) 
  ---------   ------        ---------------  --------------------- 
          a        4                    9.8                    9.8 
      NAND2        2                   12.2                   22.0 
        INV        1                    7.7                   29.7 
       NOR2        1                    6.3                   36.0 
      NAND2        1                    8.6                   44.6 
      NAND2        1                    6.8                   51.4 
        sum        0                    0.0                   51.4 

  Gate type    Gates       Area/gate (um^2)       Area/type (um^2)
  ---------    -----       ----------------       ----------------
        INV        4                  0.532                  2.128
      NAND2       10                  0.798                  7.980
       NOR2        2                  0.798                  1.596
      Total       16                                        11.704
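For anyone who wants to redo the trade-off arithmetic, here it is with the delay and area figures taken from the two logs (boolean operators: 51.75 ps and 10.108 um^2; bitwise operators: 51.39 ps and 11.704 um^2).

```python
def percent_change(old, new):
    """Signed percent change going from old to new."""
    return 100.0 * (new - old) / old


BOOLEAN = {"delay_ps": 51.75, "area_um2": 10.108}
BITWISE = {"delay_ps": 51.39, "area_um2": 11.704}

delay_change = percent_change(BOOLEAN["delay_ps"], BITWISE["delay_ps"])
area_change = percent_change(BOOLEAN["area_um2"], BITWISE["area_um2"])

print(f"delay: {delay_change:+.1f}%")  # negative means the bitwise circuit is faster
print(f"area:  {area_change:+.1f}%")   # positive means the bitwise circuit is larger
```

In other words, the bitwise version pays a double-digit percentage in area for a sub-1% improvement in delay.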

In some cases, we can use the --retime flag with synth to re-generate a more efficient and logically equivalent circuit. For whatever reason, it didn’t work with this one.

Extended Cell Library

We can also get different results with different cell libraries. I generally stick with basic, but there’s no reason why we can’t use the other ones. They just give us different gates. The main difference with this library for the full adder is that we gain access to NAND3 gates, which we use for c_out.

I synthesize using the -l option with a cell library name, e.g., synth FullAdder.v FullAdder -l extended -v. I trimmed the following log for conciseness.

[Extended]
Critical-path delay: 49.98 ps (not including setup time of endpoint flip-flop)
  Gate type    Gates       Area/gate (um^2)       Area/type (um^2)
  ---------    -----       ----------------       ----------------
        INV        3                  0.532                  1.596
      NAND2        8                  0.798                  6.384
      NAND3        2                  1.064                  2.128
      Total       13                                        10.108

Multisize Cell Library

Here, we use a few different gates other than NAND2, but we still stick mostly with NAND2.

[Multisize]
Critical-path delay: 48.84 ps (not including setup time of endpoint flip-flop)
  Gate type    Gates       Area/gate (um^2)       Area/type (um^2)
  ---------    -----       ----------------       ----------------
     INV_X1        1                  0.532                  0.532
   NAND2_X1        5                  0.798                  3.990
   NAND3_X1        1                  1.064                  1.064
     OR2_X2        1                  1.330                  1.330
   XNOR2_X1        1                  1.596                  1.596
      Total        9                                         8.512

Full Cell Library

We can synthesize with a more diverse full cell library, but synth doesn’t currently support generating circuit diagrams for it. It’s probably just a matter of adding in the svg components for all the different gates.

[Full]
Critical-path delay: 47.63 ps (not including setup time of endpoint flip-flop)
  Gate type    Gates       Area/gate (um^2)       Area/type (um^2)
  ---------    -----       ----------------       ----------------
    AND2_X1        1                  1.064                  1.064
     INV_X1        1                  0.532                  0.532
   NAND2_X1        2                  0.798                  1.596
   NAND3_X1        1                  1.064                  1.064
    NOR2_X1        1                  0.798                  0.798
   OAI21_X1        2                  1.064                  2.128
     OR2_X2        1                  1.330                  1.330
      Total        9                                         8.512
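To compare the four cell libraries side by side, here is a small script that tabulates the gate counts, areas, and critical-path delays reported in the logs above. The numbers are copied from those logs. The percentages are just computed relative to the basic library.

```python
# name: (gates, area in um^2, critical-path delay in ps), copied from the logs
RESULTS = {
    "basic": (14, 10.108, 51.75),
    "extended": (13, 10.108, 49.98),
    "multisize": (9, 8.512, 48.84),
    "full": (9, 8.512, 47.63),
}


def fastest(results):
    """Name of the library with the smallest critical-path delay."""
    return min(results, key=lambda name: results[name][2])


def delay_saving(results, name, baseline="basic"):
    """Percent reduction in delay relative to the baseline library."""
    return 100.0 * (1 - results[name][2] / results[baseline][2])


print(f"{'library':<10} {'gates':>5} {'area':>8} {'delay':>7} {'vs basic':>9}")
for name, (gates, area, delay) in RESULTS.items():
    saving = delay_saving(RESULTS, name)
    print(f"{name:<10} {gates:>5} {area:>8.3f} {delay:>7.2f} {saving:>8.1f}%")
print("fastest:", fastest(RESULTS))
```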

Bluespec Full Adder

In this section, I wanted to synthesize a simple Bluespec full adder and inspect the resulting Verilog files and synthesis outputs. I also wanted to test whether the choice in boolean or bitwise operators made a difference in the resulting circuit like it did for the Verilog full adder.

Implementing in Bluespec gives us some more design choices. Bluespec’s richer type system distinguishes between booleans Bool and bits Bit#(1). Typically, I would prefer the bitwise implementation because semantically, the bits of a full adder generally represent parts of larger bit vector operands and sums.

But like in the above Verilog case, there may be performance implications in our downstream tools for using boolean versus bitwise operators. Until such a time that the performance quirk gets optimized out, I need to weigh the trade-offs between a more performant circuit with the boolean implementation, versus semantic accuracy with the bitwise implementation.

It may even turn out that it’s easier to work with the bitwise implementation, or that the quirk only appears when we’re synthesizing the full adder directly and not as a component. Because it’s only two gates, I’m leaning toward using the bitwise implementation for future components. In this section, we test both.

Switching between boolean and bitwise in Bluespec is a little trickier than in Verilog because I need to not only change the operators, but also the types. If you want the bitwise implementation, just replace Bool with Bit#(1) and the operators !=, &&, and || with ^, &, and |.

typedef struct {
    Bool sum;
    Bool c_out;
} FullAdderResult deriving (Bits, Eq);

interface FullAdder;
    method FullAdderResult exec(Bool a, Bool b, Bool c_in);
endinterface

(* synthesize, always_enabled, no_default_reset *)
module mkFullAdder(FullAdder);
    method FullAdderResult exec(Bool a, Bool b, Bool c_in);
        return FullAdderResult {
            sum : a != b != c_in,  // no logical xor
            c_out : (a&&b) || (a&&c_in) || (b&&c_in)
        };
    endmethod
endmodule

For such a simple design, the Bluespec generates circuits identical to the corresponding (bitwise or boolean) implementations in Verilog, so I don’t bother reproducing the synthesis logs.

There are some minor differences in the visualizations:

The ordering of the operands (doesn’t matter in a full adder),
The {sum, c_out} are bused into a 2-bit output, and
If we don’t include no_default_reset and always_enabled attributes, there would be an unused RST_N and RDY_exec driver on the visualization.
- In the following visualizations, I omitted the attributes, so they don’t correspond exactly with the above excerpt. So, imagine there’s only the synthesize attribute.

Notice that the operands are prefixed with exec. That’s because this whole circuit corresponds to the exec method of the module. We’d have a different looking circuit if we had other methods or rules to synthesize.

Bitwise Implementation

Boolean Implementation

You may also notice the unused RDY_exec. We can remove it by adding the always_enabled attribute next to the synthesize attribute, and it’ll be gone. It wouldn’t change the resulting circuit’s delay or area, since the unused RDY_exec signal gets optimized out anyway.

We could further remove the unused CLK and RST_N ports with the attributes no_default_clock and no_default_reset. We won’t remove the clock since the synth tool requires a clock port to synthesize a module. But there’s no reason why we can’t remove the RST_N.

I add the no_default_reset and always_enabled attributes into the Bluespec excerpt above, but I’ve kept the drivers in the visualizations so you can see what I’m talking about.

Resulting Verilog Files

For the above visualizations, I didn’t add any attributes other than synthesize. To generate the following Verilog, I added the always_enabled, no_default_reset attributes (just like the Bluespec excerpt above).

These Verilog files are generated by the Bluespec compiler for use in downstream tools like synth, other Verilog synthesis tools, or Verilog simulators.

Note that I present these files in the reverse order from the visualizations above.

Boolean Implementation

The compiled Verilog for such a simple circuit as the boolean implementation of the full adder is very legible, though it uses Verilog 1995 style declaration. The calculation of the carry also uses a boolean simplification.

module mkFullAdder(CLK,

		   exec_a,
		   exec_b,
		   exec_c_in,
		   exec);
  input  CLK;

  // value method exec
  input  exec_a;
  input  exec_b;
  input  exec_c_in;
  output [1 : 0] exec;

  // signals for module outputs
  wire [1 : 0] exec;

  // value method exec
  assign exec =
	     { (exec_a != exec_b) != exec_c_in,
	       exec_a && (exec_b || exec_c_in) || exec_b && exec_c_in } ;
endmodule  // mkFullAdder (boolean implementation)
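The simplification is easy to check by brute force. This sketch compares the carry as I wrote it in Bluespec against the carry expression in the generated Verilog above, over all eight input combinations. Python’s and/or have the same relative precedence as && and ||, so the expressions can be transcribed directly.

```python
from itertools import product


def carry_as_written(a, b, c_in):
    """(a&&b) || (a&&c_in) || (b&&c_in), as in the Bluespec source."""
    return (a and b) or (a and c_in) or (b and c_in)


def carry_as_generated(a, b, c_in):
    """exec_a && (exec_b || exec_c_in) || exec_b && exec_c_in, as compiled."""
    return a and (b or c_in) or b and c_in


carries_match = all(
    carry_as_written(a, b, c) == carry_as_generated(a, b, c)
    for a, b, c in product((False, True), repeat=3)
)
print("carry expressions equivalent:", carries_match)
```

The compiler has factored a out of the first two terms, which saves one AND in the expression even though the downstream tools are free to restructure it again.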

Bitwise Implementation

Unfortunately, the bitwise implementation doesn’t result in as legible a Verilog file. The compiler makes liberal use of internal signals and wire instantiations.

There’s no boolean simplification like above. I would’ve originally guessed the lack of simplification is why the design costs more gates, but we saw earlier that this happens even when we write directly in Verilog, and we’ll see later that we sometimes regain efficiency with some strange wrapping.

module mkFullAdder(CLK,

		   exec_a,
		   exec_b,
		   exec_c_in,
		   exec);
  input  CLK;

  // value method exec
  input  exec_a;
  input  exec_b;
  input  exec_c_in;
  output [1 : 0] exec;

  // signals for module outputs
  wire [1 : 0] exec;

  // remaining internal signals
  wire x__h20, x__h37, x__h40, x__h52, x__h54, y__h53, y__h55;

  // value method exec
  assign exec = { x__h20, x__h40 } ;

  // remaining internal signals
  assign x__h20 = x__h37 ^ exec_c_in ;
  assign x__h37 = exec_a ^ exec_b ;
  assign x__h40 = x__h52 | y__h53 ;
  assign x__h52 = x__h54 | y__h55 ;
  assign x__h54 = exec_a & exec_b ;
  assign y__h53 = exec_b & exec_c_in ;
  assign y__h55 = exec_a & exec_c_in ;
endmodule  // mkFullAdder (bitwise implementation)

Wrappers around Bluespec

In Bluespec, we can wrap a module’s implementation in another module. It looks like this:

(* synthesize *)
module mkFullAdderWrapper(FullAdder);
    FullAdder _adder <- mkFullAdder;
    return _adder;
endmodule

The underlying Verilog instantiates the inner module and connects its ports with the external module’s ports. It’s all done in wires, so we might expect no difference in the resulting circuit.

In this section, I investigate whether there’s any overhead in synthesizing wrapped Bluespec. For thoroughness, I check nested wrappers too, like when we wrap a wrapper.

Losing Efficiency

When I experimented using synth, I saw using a wrapper can (but might not) affect the resulting circuit. Wrapping our boolean implementation gives us a 16-gate circuit (like with the bitwise implementation) instead of our original 14-gate circuit.

We might chalk this up to overhead from wrapping, but we shouldn’t be getting any overhead from just connecting wires.

It must have to do with the downstream tools. Similar to the boolean versus bitwise case, there’s something preventing the synthesis tool from optimizing the resulting gate placements.

module mkFullAdderWrapper(CLK,
			  RST_N,

			  exec_a,
			  exec_b,
			  exec_c_in,
			  exec,
			  RDY_exec);
  input  CLK;
  input  RST_N;

  // value method exec
  input  exec_a;
  input  exec_b;
  input  exec_c_in;
  output [1 : 0] exec;
  output RDY_exec;

  // signals for module outputs
  wire [1 : 0] exec;
  wire RDY_exec;

  // ports of submodule _unnamed_
  wire [1 : 0] _unnamed_$exec;
  wire _unnamed_$exec_a, _unnamed_$exec_b, _unnamed_$exec_c_in;

  // value method exec
  assign exec = _unnamed_$exec ;
  assign RDY_exec = 1'd1 ;

  // submodule _unnamed_
  mkFullAdder _unnamed_(.CLK(CLK),
			.exec_a(_unnamed_$exec_a),
			.exec_b(_unnamed_$exec_b),
			.exec_c_in(_unnamed_$exec_c_in),
			.exec(_unnamed_$exec));

  // submodule _unnamed_
  assign _unnamed_$exec_a = exec_a ;
  assign _unnamed_$exec_b = exec_b ;
  assign _unnamed_$exec_c_in = exec_c_in ;
endmodule  // mkFullAdderWrapper

I also tried adding a second layer of wrapper. If the first wrapper reduced performance (for unknown reasons), maybe a second wrapper would reduce performance even more.

(* synthesize *)
module mkFullAdderWrapper2(FullAdder);
    FullAdder _adder <- mkFullAdderWrapper;
    return _adder;
endmodule

But we didn’t lose performance! The resulting circuit is back to 14-gate, which is the same as the unwrapped boolean implementation.

At first, I found that wrapping three times gets us the 16-gate, and wrapping four times gets us the 14-gate. There was a cycle of gaining and losing performance, even when the Verilog for each layer of wrapper was practically identical to the last.

(When I went back to verify, the results changed, which I soon discuss.)

Gaining Efficiency

I ran the same wrapper experiment with the bitwise implementation. If synth gave us the 16-gate for bitwise, maybe we’d get 16-gate no matter the wrapper.

Surprisingly, adding a wrapper actually gave us the 14-gate circuit. The tool was telling us that our full adder was more performant with a wrapper. Adding more wrappers resulted in several 14-gate, and one 16-gate. There didn’t seem to be any pattern.

This is only if we don’t specify no_default_reset; otherwise they’re all 16-gate. (Don’t ask me why.)

Nondeterminism

The day after, I found that each arrangement of wrappers didn’t necessarily result in the same circuit as the day before. I don’t believe I really changed anything, so I wonder if it’s a nondeterministic bug.
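One way to pin this down would be to rerun the same synthesis several times and tally the reported gate counts. Here’s a sketch of that. The "Gates: N" line format comes from the logs earlier in this post, but I’m assuming the report is printed to stdout, and the helper names are made up. The function that shells out is only defined, not called, since it needs synth installed.

```python
import re
import subprocess
from collections import Counter

GATES_RE = re.compile(r"^Gates:\s*(\d+)\s*$", re.MULTILINE)


def parse_gate_count(log_text):
    """Pull N out of a 'Gates: N' line, or return None if there isn't one."""
    match = GATES_RE.search(log_text)
    return int(match.group(1)) if match else None


def tally_gate_counts(cmd, runs=5):
    """Run cmd repeatedly and count how often each gate total shows up."""
    counts = Counter()
    for _ in range(runs):
        result = subprocess.run(cmd, capture_output=True, text=True)
        counts[parse_gate_count(result.stdout)] += 1
    return counts


sample_log = "Gates: 14\nArea: 10.11 um^2\nCritical-path delay: 51.75 ps\n"
print(parse_gate_count(sample_log))

# With synth on the PATH, something like this would do the repeated runs:
# print(tally_gate_counts(["synth", "FullAdder.v", "FullAdder"], runs=10))
```

If repeated runs on an unchanged input ever gave more than one gate count, that would point to nondeterminism in the tools rather than to anything I changed between days.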

It’s interesting that the mere action of adding more wrappers can be enough to massage the synthesis tool into giving us the more efficient 14-gate circuit. It shows that the downstream bug isn’t just restricted to the kind of operator you use.

The main takeaway is that we should be wary about how much stock we put into our synthesis numbers. Even for a circuit as simple as a full adder, there seems to be inefficient gate placement. For much more complex designs, we should consider the synthesis numbers to be only approximate, at least until we secure more sophisticated downstream synthesis tools.

Verilog Wrapped with Bluespec

Wrapping Bluespec modules in Bluespec can be useful, but the real use comes with wrapping other languages in Bluespec.

Bluespec offers support for bindings between Bluespec modules and Verilog modules (going down in abstraction, at the cost of productivity) or Bluespec functions and C functions (going up in abstraction, at the cost of performance).

For us, I’m focusing on wrapping Verilog because it might allow us to write more performant components to use in our Bluespec, like adders.

According to the BSC User Guide:

Using the import "BVI" syntax, a designer can specify that the implementation of a particular BSV module is an RTL (Verilog or VHDL) module, as described in the BSV Reference Guide. The module is treated exactly as if it were originally written in BSV and then converted to hardware by the compiler, but instead of the .v file being generated by the compiler, it was supplied independently of any BSV code. It may have been written by hand or supplied by a vendor as an IP, etc.

The main thing I’d like to see is whether the synthesis of a Bluespec-wrapped Verilog module is identical to a Verilog module synthesized directly. Given the above description, it should be, since it’s exactly what we practiced by playing with synth and Bluespec-wrapped Bluespec.

Let’s take our Verilog full adder and wrap it in Bluespec. Remember that each of the boolean implementations, in Verilog and in Bluespec, resulted in a 14-gate circuit. But with the capriciousness of the downstream synthesis, I would accept a 16-gate circuit too. This is especially true because we got 16-gate circuits from wrapping implementations that would’ve given us 14-gate circuits.

An import "BVI" statement also requires us to declare the mappings between the Bluespec interface and the Verilog ports. I’ve modified my Verilog full adder to output {sum, c_out} as a single reg [1:0] to be consistent with my Bluespec exec method, which packs the two values together. In Bluespec, FullAdderResult is a struct, but we implicitly pack/unpack to bits as necessary when we’re working with foreign modules.

module FullAdderVerilog(
    input CLK, a, b, c_in,
    output reg [1:0] out
    );
    always @(*) begin
        out[1] = a ^ b ^ c_in;
        out[0] = (a&&b) || (a&&c_in) || (b&&c_in);
    end
endmodule

import "BVI" FullAdderVerilog =
module mkFullAdderVerilog(FullAdder);
    method out exec(a, b, c_in);
endmodule
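As a quick software sanity check of the boolean expressions in the Verilog above, here is a minimal C sketch. It is not part of the original code, and the names full_adder_model and full_adder_model_matches_addition are made up for illustration. It packs {sum, c_out} into two bits the same way the Verilog does and compares against ordinary addition for all eight input combinations.

```c
#include <stdint.h>

// Hypothetical software model of the FullAdderVerilog logic.
// Bit 1 of the result is sum and bit 0 is c_out, matching {sum, c_out}.
uint8_t full_adder_model(int a, int b, int c_in) {
    int sum = a ^ b ^ c_in;
    int c_out = (a && b) || (a && c_in) || (b && c_in);
    return (uint8_t)((sum << 1) | c_out);
}

// Returns 1 if the model agrees with a + b + c_in for all 8 inputs, else 0.
int full_adder_model_matches_addition(void) {
    for (int i = 0; i < 8; i++) {
        int a = (i >> 2) & 1;
        int b = (i >> 1) & 1;
        int c_in = i & 1;
        int total = a + b + c_in;  // ranges over 0..3
        // Low bit of the total is the sum; high bit is the carry out.
        int expected = ((total & 1) << 1) | (total >> 1);
        if (full_adder_model(a, b, c_in) != expected) {
            return 0;
        }
    }
    return 1;
}
```

This only checks the logic, not anything about gate counts; those still come from synth.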

We can’t directly synthesize foreign modules, but we can wrap them and synthesize the wrapper.

(* synthesize *)
module mkFullAdderVerilogWrapper(FullAdder);
    FullAdder _adder <- mkFullAdderVerilog;
    return _adder;
endmodule

After synthesis, I found there’s no overhead to wrapping the Verilog, but the same quirks from wrapping Bluespec reappeared to give us either 14-gate or 16-gate circuits. We should be good to go in terms of embedding Verilog into our Bluespec designs.

The main drawback of this is that while importing Verilog is fine for using the Verilog backend for Bluespec (e.g., to run simulations with Verilog tools), it doesn’t work for using the Bluesim backend, which requires all modules to be implemented in Bluespec and compiled into .ba files. We would need to either re-implement the Verilog modules in Bluespec with conditional compilation, find a Verilog simulator, or not use Verilog implementations at all.

If using the Bluespec-recommended method of conditional compilation, we need to be extra careful that our Verilog implementation of a module is cycle-equivalent to our Bluespec implementation of that same module. Otherwise, we may run into trouble with correctness when we simulate with Bluesim and find our results to be different than our results in, say, Vivado. However, I think whatever can be implemented in Verilog can usually be implemented more easily in Bluespec.

If it turns out that implementing in Verilog gives us no benefit over implementing in Bluespec, then I might just stick with Bluespec implementations for use in Bluesim. The full adder example gave no evidence of greater overhead in Bluespec, so at least it’s clean enough for simple modules.

Next Time

This time, we tweaked synth to work better for our goals, and we did some investigation on the interplay between Bluespec, Verilog, and the synth tool.

Next time, we can see about implementing adders in both Bluespec and Verilog, which synth allows us to quantitatively evaluate. For correctness, we’ll check against the built-in + operator (it looks like Bluespec’s + just wraps around Verilog’s +) as we implement a simple ripple-carry adder and several types of carry-lookahead adder.

As we implement adders, I’ll continue to evaluate synthesis differences between Bluespec and Verilog. If performance permits, we might end up not actually needing to use any Verilog implementations in our processor, allowing us to maintain a strictly Bluespec code base.

We’ll see about using these adders later on in our multiplication unit and in other places.

East Campus Build 2023

2023-10-01T00:00:00-04:00

Preview

A view of this year’s EC Fort 2023 from the track, with the tennis bubble and the Prudential Center behind.

Introduction

Every year since 2004¹, the students of MIT’s East Campus dormitory (EC) have built and disassembled a large wooden structure as a monument to the community. Last year, it was a climbing wall fort and rollercoaster.² This year, it was a double climbing wall and a three-story fort.

Usually, the EC build happens in the courtyard between the two parallel buildings of East Campus. It happens during the few weeks before fall classes start, and it’s partially meant to drum up excitement among the incoming first-year students. What a marvelous sight, moving onto campus for college and seeing students building something so grand next to their dorm, of their own design, and all by themselves!

Things are different in 2023 because the physical dorm itself has become a construction site, having just started its 2-year renovation project that runs from Summer 2023 through Summer 2025. Instead, the community was granted permission to use the field next to the tennis bubble as a build site.

One up-side is that the build site is much closer to dorm row. It’s easier to do external outreach this way since students moving into most of the dorms can see and hear the build. They might otherwise only find out about it through word-of-mouth, since East Campus is on the other side of campus.

One major down-side is that the people primarily working on the project need to commute way farther to get to it. Mustering up student labor to complete the project has historically been hard enough in normal times. It was much easier to get masses of people involved when the build was less than two hundred feet away from their bedroom.

I only helped out for a few days during my visit to Cambridge this year. Many of my friends were working on build, and I expected that they would appreciate the extra hand. I was much more involved with build for 2021 and 2022, and I always remember us being shorthanded.

This year, I helped out for about 15-20 hours total between building, monitoring for East Side Party, and deconstructing.

Background

I’ve never been very involved in the planning process for build. Usually there are a few head engineers who do much of the planning. They write up the designs and these get signed off on by several school offices and a professional engineer. They also do procurement for the materials and equipment. Some others raise funds through sponsors, though much of the money sometimes comes from the dorm house tax.

This year, like last year, the head engineer was my friend Anhad Sawhney ‘25.³ There are loads of pictures and CAD models available on his website, both of this year’s and of last year’s builds. There was also a team of build leads who were formally in charge of leading the build effort during the weeks when we’re cutting wood, assembling the build, and taking it down. I don’t remember all of them, but the most significant one I remember was Jordan P.A ‘24, who was on site day-in, day-out, putting in major sweat and hours to keep the project on track.

It takes so many person-hours to successfully complete the EC Build. It takes the student engineers months of preparation, but it also takes many students many hours to do the actual labor of carrying, cutting, screwing together, and eventually unscrewing the wood. It’s not particularly glamorous work, but you really can’t simply think up a fort and have it manifest itself. It takes major hours and major elbow grease. And it can be hard to get that from a labor pool of MIT students who might not be accustomed to getting their hands dirty with physical labor, especially on hot August days.

Construction

Here are some of the materials and the storage pod where we put the tools away at night. In the foreground is one of our pallets of plywood. You can see some good progress in the background already.

The state of the fort on the day I arrived. Pretty much the entire skeleton was built before I got there. The front has two climbing walls that will eventually be painted with murals and have climbing holds screwed in. Each plywood sheet is 4x8 feet, so each wall is about 14x16 feet.

This year’s theme was Revolution. The red burning man on black background is the East Campus mascot.

This year, I helped with the stairs, some miscellaneous tasks that simply needed doing, and, as a reasonably experienced builder, some minor people support.

Stairs

The stairs were a pain point in this year’s build, so I was part of the team working on getting them up. I helped screw in about half the tread 2x4s and began cutting the plywood risers that would stop people’s toes from going too far past each stair. A full build is many, many medium-sized tasks that are split into little tasks. And every single one of them needs to be done to complete the build.

I helped screw in many of the supports beneath the treads, which are the horizontal pieces you see here. We needed to alternate which side got screwed into the bottom and which got screwed into the side, which was a little funny-looking. I’m told it was structurally optimal, like screwing in your tires in a star shape. It has something to do with load distribution.

I wasn’t directly involved in this part, but this was when the stairs were farther along. Here, they’ve started adding the tread tops. I cut a few of the risers here, which stop your toe from going underneath the next step. A small team was responsible for the outer railing, and it looks like they were doing a pretty good job.

Here’s a back view of the stairs as they got further along. I passed cutting out the risers to someone else as I went to work on other things.

People Support

I don’t have any pictures for this part, but an understated part of the build process is being able to make good use of the available labor. Many times, first-years would come up, interested in the project and maybe even interested in helping out. But they don’t stick around for long unless they have the appropriate guidance.

Several times during the build, I stopped to make sure that the people who were showing up were being taken care of. If they wanted to help, I would find them someone more experienced who could do with a hand. Optimally, it would be for tasks that the newcomer would be excited about. On the other side, sometimes when a team was short-handed, I would go out and try to find the people who could come help. There were no formal structures in the build for these people-support kind of things; it was more holistically seeing where people were and where people needed to be. In the weeks before classes, there are people floating around for all sorts of reasons. And I like meeting with people.

In previous years, I would spend some time doing some outreach trying to get more people to build. Sometimes, I would be like a little foreman. People would be part of my crew and we would collectively work on a task like building trapdoors or building railings. There’s another thing: I could keep my nose down and just do the physical tasks that needed doing, but I could also get more person-hours onto the project by making sure that the right people were in the right places. It’s kind of like the capital versus consumer goods that they teach in microeconomics with the production-possibility frontier. There’s a balance between time spent directly on building versus on expanding the team.

We also had a scaled-down grilling operation this year, in part because this year’s labor force was scaled down. In previous years, I would also be one of the runners getting food (what we call rush burgers) from the grills to the builders. It was a good way to keep morale up when I was too tired to continue hauling or screwing things in.

Miscellaneous Tasks

I helped with a few things that just needed doing.

One of the things that’s almost always needed is hauling: just carrying things from point A to point B. Wood needed to be carried from the pallets to the saws; mattresses needed to be carried from the loading point to the climbing walls; just all sorts of things needed to be carried from somewhere to somewhere else.

Some things don’t need time or effort so much as someone with the right know-how in the right place. At some point, we needed to switch out the jigsaw blades. I was the only one on-site with any experience maintaining jigsaws, so it was on me to swap out the blades and quality-test the jigsaws to make sure they were still fit for construction.

The jigsaw blades we were using. I took this to keep track of which ones we had left.

One of the days, we needed to continue spray painting the steel plates we had. Each coat needs time to dry, so I got there early one day and did a couple coats. These would go into this year’s iteration of wAo, managed by Lili. Some of the coats were meant to protect the steel from corrosion.

A few spray-painted pieces of steel that were going to be used in the wAo ride.

My dear friend Kat Jander ‘25 was on the lighting team and needed some extra hands assembling and transporting her awesome cannonball letters, so I left the build site for a few hours to help her with that. She had already done all the design work and production work in the months before, but the project just needed some tedious screwing things together to get it across the finish line.

Kat working on putting together the letters for suspension from the cable.

East Side Party

The deadline for construction is the night of the East Side Party, where one or several EC residents DJ from the third floor of the fort. All students are invited to hang out, dance, and see the product of the builders’ hard work. It’s also a time when we run the rides like Space Trainer and wAo, and open up the climbing wall.

I think it’s meant to kick off REX, where dorms host a bunch of cool events to welcome the first-years. I was an orientation leader in 2021, and I could swear that the attendance of the East Side Party that year was far higher than any orientation event. And this is happening before most upperclassmen are invited back to campus.

During the East Side Party this year, I accompanied my friends who were watching the climbing wall to make sure things were staying safe during the event.

The lights designed and built by my dear friend Kat. Read more about the particular project on her website.

The view from the third floor of the fort during the East Side Party overlooking the rest of the site. Buildings in the background include New Vassar, the Metropolitan Storage Warehouse, the Z Center, and the new music buildings, still under construction.

A view of the fort from the materials pile. This was as we were closing up for the night right after East Side Party. We still needed to put away a bunch of stuff in preparation for deconstruction, which was happening very soon. On the right is the tennis bubble. Also notice the Simpson Strong-Tie banner. They’ve been good sponsors of the EC Build for many years, providing us with their wonderful fasteners.

Deconstruction

On my last full day in Cambridge, I helped out for a little bit in the morning and a few hours at night in the deconstruction effort. It was the day before Registration Day, and the team (especially hardworking Jordan P.A.) was getting stressed out that not very much progress was being made on deconstruction. It would be even harder to work on getting the build down once classes start, so Registration Day was a soft deadline.

The people who were showing up were working very hard, but there was severe difficulty getting people to show up. It’s been true even in the best of times, like in previous years when our target audience lived right next to the build site. But now, volunteers would need to actively show up, and not many did.

I remember being very involved in deconstruction last year in 2022, and it was the exact same situation as this year: the build group chat would have 100 people, and frequent desperate bumps, and fewer than six people (and the same six people!) would show up every day to help out. The difference was that then, I could see the lurkers through their bedroom windows from the courtyard. Now, they were just out of sight.

Maybe things were different before the pandemic. I don’t think this pattern is sustainable. A spectacle of a community’s labor needs a community’s labor. There needs to be major overhaul in assessing or cultivating interested student labor in seeing the project through, or the project needs to be downscaled or shelved. Of course, it’s harder when the dorm is under renovation.

When I was disassembling some of the railings for deconstruction, I noticed some of them developed tan lines. I don’t actually know whether it’s from sun exposure or dirt. Maybe both.

The state of the fort in disassembly the morning I showed up.

The state of the fort in disassembly my last night in Cambridge. I wasn’t there the whole day: just for an hour at the front and a few hours at the back. There was a whole lot of progress. I’m told it was the most people (about 10) on site since the East Side Party, to the relief of the hard-working students in charge.

I always thought it was weird how soon we need to deconstruct. The full build would only be up for a few days to a week. Maybe it’s just been the case these past few post-pandemic years, and maybe in earlier years the builds would be up for longer. I don’t know. But the designs go through so much review, and the build takes so long, that it feels a bit like a waste that the builds have to go down so quickly.

The build uses more-or-less the same techniques as typical American wooden-frame houses do. I suspect (with absolutely zero education in mechanical engineering) that the structure could stand safely for months, if not years, with how large the safety margins are. It seems like a lot of overengineering for something that comes up and comes right back down. I get needing to take it down this year because we were using borrowed fields, but I feel like they could stay up just fine in the East Campus courtyard. Maybe it’s the Institute’s fear of liability or something.

I suppose it’s not so different than running a school play. Professional actors doing Broadway shows rehearse for months and do eight shows a week for months and months afterward. For school plays, actors have a couple months of (part time) rehearsal for a handful of shows, and then it’s all over. Both EC build and school plays have that bizarre ratio between preparation and runtime.

It’s probably not a controversial take that the EC build is more for the spectacle than it is for the structure itself.

Disposal

There’s a step that comes after deconstruction.

Usually after deconstruction, the builders and other EC residents salvage some of the lumber for their own projects. Most typical is the loft, which you might see in my room tour or other people’s posts. At some point I’ll write a more comprehensive post about my experience with my salvaged wood in 2022.

All the wood that remains after salvage (which this year was quite a lot, considering there was no East Campus in which to squirrel away all the wood) is taken for disposal.

The pandemic that shut down 2020 notwithstanding. ↩
Here’s an article about a previous build from 2016, though it also mentions Next Haunt, which is something Next House does. See also EC Build 2014. ↩
Anhad is a genius. He’s going to change the world one day. ↩

Basic Multiplier

2023-09-28T00:00:00-04:00

Introduction
Testing
Designs
Reduced Adder Design for Radix-8
Built-in Multiplier
Next Time

Introduction

Let’s write a simple multiplier that we can test and eventually put into our processor. The eventual aim is to implement the RISC-V “M” Standard Extension for Integer Multiplication and Division. It’s a small extension that takes only a couple paragraphs per operation in the RISC-V specification.

This post is the first in a series of worked examples and explorations using Bluespec. It’s also an opportunity for me to use the Bluespec lexer and Bluespec extension for VS Code I worked so hard on.

I begin by discussing how we create our testbench (in this house, we believe in test-driven development) and continue by going through a sequence of multiplier designs from the non-functional to decent. We don’t go into advanced multipliers; that’s more for a later post.

For each design, I present a Bluespec excerpt and some critical-path, area, and cycle numbers. The synthesis-related numbers come from using the Minispec synth tool. The numbers are probably worse than if we implemented directly in Verilog, but that’s alright for now.

All the code here was written from scratch and is hosted on this GitHub repo. I didn’t keep all the different implementations in the most recent commit, but you can go through the commit history for earlier stages.

Testing

It’s both quicker and easier to debug a little module like a multiplier by running it through unit tests than by putting it into a huge processor and seeing whether the processor breaks. In the hardware world, I believe this is called verification.

In this case, I chose to do it as a two-step process. We first generate the tests (in whatever way we want), then we write a Bluespec testbench that consumes the tests and probes an instance of our multiplier-to-be-tested.

Test Case Generation

I chose to generate test cases as a hexadecimal file using a C script. If you’re unfamiliar with compiled languages like C, I just write my multiplier.c file, run it through a compiler with gcc multiplier.c -o multiplier, and execute the binary ./multiplier. I’ve tucked most of the compilation commands I’ll be using into a Makefile.

My C script outputs into a text file with one test per line, and each line as hexadecimal that my testbench will consume. Each line is basically a packed [a, b, ab], where a times b equals ab. There are 32 bits allocated to each of a and b, and 64 bits for ab, since we might need to supply the upper 32 bits as per the RISC-V “M” extension specification.

Here’s what generating each test case looks like. We perform the multiplication in C, then log both operands and the output in hex. Later, our testbench will feed these operands into our Bluespec multiplier and check that the outcome is equal to the outcome we got in C.

#include <stdio.h>
#include <stdint.h>

// in main(), we instantiate the file pointer with something like this:
// FILE *fptr = fopen("test_cases.vmh", "w");

// Log the case
void log_multiply(FILE* fptr, int32_t a, int32_t b) {
    int64_t ab = (int64_t)a * (int64_t)b;
    #ifdef DEBUG
    printf("%x times %x is %lx\n", a, b, ab);
    #endif
    // format is a, b, ab_upper, ab_lower
    fprintf(fptr, "%08x%08x%016lx\n", a, b, ab);
}

We use two batches of tests: special cases, then random cases. I also cap off the test cases with a sentinel value so our testbench knows when to end.

void end_case(FILE* fptr) {
    printf("deadbeef cap\n");
    fprintf(fptr, "%032x", 0xdeadbeef);
}

Special Cases

The first batch is to check a set of hand-picked test cases. I selected these particular test cases because they were either simple cases that we could check by eye (so we know that our test generation itself works) or edge cases like multiplying INT32_MIN by itself, which yields the largest signed product.

Between these test cases, we should be covering the breadth of possible inputs (e.g., multiplying by 0, multiplying different signs, multiplying same signs). Another way of saying it is that if we pass this small set of tests, we have a pretty strong sign that our multiplier is fully correct.

In our case, our special cases are:

void specific_multiply(FILE* fptr) {
    // Non-negatives first
    log_multiply(fptr, 0, 0);  // 0 identity, easy
    log_multiply(fptr, 0, 1);  
    log_multiply(fptr, 1, 1);  // 1 identity, easy
    log_multiply(fptr, INT32_MAX, 0);
    log_multiply(fptr, INT32_MAX, 1);
    log_multiply(fptr, INT32_MAX, INT32_MAX);
    // Signed below
    log_multiply(fptr, -1, 0);
    log_multiply(fptr, -1, 1);  // sign extended
    log_multiply(fptr, -1, -1);  // negative times negative?
    log_multiply(fptr, INT32_MIN, 0);
    log_multiply(fptr, INT32_MIN, 1);
    log_multiply(fptr, INT32_MIN, INT32_MIN);  // biggest possible result
    log_multiply(fptr, INT32_MAX, INT32_MIN);
}

I know there’s a lot of repeated boilerplate here. If we wanted to, we could store all the test pairs in a separate array or file, or in any number of ways, and then iterate through calling log_multiply(fptr, a, b) for each a, b.

I think a two-step process is good enough for this worked example, so we’re keeping it a little simple at the cost of some repeated code. If we had a more complicated design that we were testing with many more test cases, then we might want to encode our human-readable cases separately from the C code.
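For reference, a table-driven version of the special cases could look roughly like the sketch below. This is not what’s in the repo: the MulCase struct, the SPECIAL_CASES array, and table_multiply are hypothetical names, and log_multiply is repeated (without the debug print, and with fixed-width format macros) only so the sketch compiles on its own.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

// One operand pair per entry (hypothetical type, not from the repo).
typedef struct {
    int32_t a;
    int32_t b;
} MulCase;

// The same 13 special cases, kept as data instead of as repeated calls.
static const MulCase SPECIAL_CASES[] = {
    {0, 0}, {0, 1}, {1, 1},
    {INT32_MAX, 0}, {INT32_MAX, 1}, {INT32_MAX, INT32_MAX},
    {-1, 0}, {-1, 1}, {-1, -1},
    {INT32_MIN, 0}, {INT32_MIN, 1}, {INT32_MIN, INT32_MIN},
    {INT32_MAX, INT32_MIN},
};

// Stand-in for log_multiply: writes a, b, then the 64-bit product as hex.
void log_multiply(FILE *fptr, int32_t a, int32_t b) {
    int64_t ab = (int64_t)a * (int64_t)b;
    fprintf(fptr, "%08" PRIx32 "%08" PRIx32 "%016" PRIx64 "\n",
            (uint32_t)a, (uint32_t)b, (uint64_t)ab);
}

// Walks the table and logs each case; returns how many cases were logged.
size_t table_multiply(FILE *fptr) {
    size_t n = sizeof(SPECIAL_CASES) / sizeof(SPECIAL_CASES[0]);
    for (size_t i = 0; i < n; i++) {
        log_multiply(fptr, SPECIAL_CASES[i].a, SPECIAL_CASES[i].b);
    }
    return n;
}
```

Calling table_multiply(fptr) in place of specific_multiply(fptr) should write the same 13 lines. Adding a case then means adding one entry to the array.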

In the test_cases.vmh hex file, these cases look like this:

00000000000000000000000000000000
00000000000000010000000000000000
00000001000000010000000000000001
7fffffff000000000000000000000000
7fffffff00000001000000007fffffff
7fffffff7fffffff3fffffff00000001
ffffffff000000000000000000000000
ffffffff00000001ffffffffffffffff
ffffffffffffffff0000000000000001
80000000000000000000000000000000
8000000000000001ffffffff80000000
80000000800000004000000000000000
7fffffff80000000c000000080000000

Each 32-bit segment is represented by eight characters. For example, look at the third line, which corresponds to 1 times 1 equals 1. On the left half we have a 00000001 for each operand, and on the right, we have a 1 with 15 leading zeros as the product.
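To make the layout concrete, here is a minimal C sketch that splits one line back into its fields and recomputes the product. The helper names parse_case and case_is_consistent are made up for illustration and aren’t in the repo. The sketch also assumes the usual two’s complement behavior when converting the unsigned bit patterns back to signed integers.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

// Hypothetical helper: splits one 32-character line of test_cases.vmh
// into a, b, and ab. Returns 1 on success, 0 if the line is malformed.
int parse_case(const char *line, int32_t *a, int32_t *b, int64_t *ab) {
    uint32_t ua, ub;
    uint64_t uab;
    // 8 hex digits for each operand, then 16 hex digits for the product.
    if (sscanf(line, "%8" SCNx32 "%8" SCNx32 "%16" SCNx64,
               &ua, &ub, &uab) != 3) {
        return 0;
    }
    // Reinterpret the bit patterns as signed values.
    *a = (int32_t)ua;
    *b = (int32_t)ub;
    *ab = (int64_t)uab;
    return 1;
}

// Returns 1 if the recorded product equals a * b recomputed in 64 bits.
int case_is_consistent(const char *line) {
    int32_t a, b;
    int64_t ab;
    if (!parse_case(line, &a, &b, &ab)) {
        return 0;
    }
    return ab == (int64_t)a * (int64_t)b;
}
```

For example, case_is_consistent("7fffffff80000000c000000080000000") returns 1 for the INT32_MAX times INT32_MIN line. The deadbeef sentinel line fails this check by design (0 times 0 is not deadbeef), so a reader using this helper would need to stop before it.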

Most of our bugs should be caught by our well-selected test cases.

Randomized Tests

The second batch is much larger and consists of randomized tests. For a simple design like a multiplier, we shouldn’t be getting any surprises in this batch. Randomized tests are more significant for complex designs where the input space is bigger or bothersome to write special cases for (e.g., testing correct functionality of a cache). But it’s good practice to do it here too.

Because our multiplier is simple, these are mostly to measure performance (in terms of cycles per multiplication, asymptotically) and partly to give us the confidence that our multiplier works even when we don’t know the inputs.

It doesn’t so much matter here since I’m both testing and developing the multipliers, but randomized tests can also protect from bad multipliers that just hardcode special cases. Or maybe that’s just why they did it in school.

#include <stdint.h>
#include <stdlib.h>

void rand_multiply(FILE* fptr) {
    int32_t a = rand() - rand();  // mixes signs; exact range depends on RAND_MAX
    int32_t b = rand() - rand();
    log_multiply(fptr, a, b);
}

These test cases are a lot less pleasant to read than our earlier test cases, but we put them in the test_cases.vmh all the same. Because we’ve checked our previous test cases, we can be confident that these work.

f2161e7bfbd104f5003a3531831017b7
f69277035113f1d3fd039b9ee4faea79
f8e5e6de58f1fdd3fd8850c001a4aefa
19d0ac8848a2aa54075317c0791aeca0
aacbcda9489a4485e7d5f9a31e2cbccd
05257b1e54e6803e01b4eea67196d144
c0fc893844312d6cef36f99af260bba0
076ec261eba76b3bff68c66bda0c575b
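One caveat on rand() - rand(): the C standard only guarantees that RAND_MAX is at least 32767, so on some platforms the difference may cover far fewer than 32 bits, and even with a 31-bit RAND_MAX it can never produce INT32_MIN. If full coverage mattered, one option is to assemble the 32 bits from several calls. The sketch below is an alternative, not what I used, and rand_int32 is a made-up name. It would also produce different test vectors than the ones shown above.

```c
#include <stdint.h>
#include <stdlib.h>

// Hypothetical alternative to rand() - rand(): builds a 32-bit pattern
// from four rand() calls, taking 8 bits from each call.
int32_t rand_int32(void) {
    uint32_t bits = 0;
    for (int i = 0; i < 4; i++) {
        // Skip the lowest 4 bits, which are weak in some rand()
        // implementations. Even a 15-bit RAND_MAX leaves 11 bits here.
        bits = (bits << 8) | ((uint32_t)(rand() >> 4) & 0xFFu);
    }
    // Reinterpret as signed (assumes two's complement conversion).
    return (int32_t)bits;
}
```

With this, rand_multiply would call rand_int32() for each of a and b instead of subtracting two rand() results.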

For my testing, I wrote 13 special test cases and generated 900 randomized test cases. We can measure performance by dividing our total cycles by 913 as soon as we pass all tests.

For completeness, this is what my main() looks like (though you can look for yourself in the repo):

int main() {
    const int SEED = 2;
    const int cases = 900;

    srand(SEED);
    FILE *fptr = fopen("test_cases.vmh", "w");

    specific_multiply(fptr);

    for (int i = 0; i < cases; i++) {
        rand_multiply(fptr);
    }

    end_case(fptr);
}

Other Means of Generation

Alternatively, I could’ve also generated test cases in Bluespec directly, or using any other language that can write to a file.

If I had a reference Bluespec or Verilog implementation of a multiplier, I could’ve also used that to generate test cases during runtime inside of the testbench. That would’ve been a fine option, considering Bluespec actually has a built-in multiplier with the * operator, which I discuss later. We could’ve produced a pair of operands and fed them into both our multiplier-under-test and our reference multiplier. Then, we could’ve seen if they produced the same results.
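The same compare-against-a-reference idea can be sketched in software. In the C sketch below, shift_add_multiply is a made-up stand-in for a multiplier-under-test (it is not any of the Bluespec designs), and C’s built-in multiplication plays the role of the reference. It assumes two’s complement conversions between signed and unsigned integers.

```c
#include <stdint.h>

// Hypothetical multiplier-under-test: a software shift-and-add loop.
// Operands are sign-extended to 64 bits and handled as bit patterns,
// so the arithmetic wraps modulo 2^64 and signed inputs work too.
int64_t shift_add_multiply(int32_t a, int32_t b) {
    uint64_t multiplicand = (uint64_t)(int64_t)a;
    uint64_t multiplier = (uint64_t)(int64_t)b;
    uint64_t product = 0;
    for (int i = 0; i < 64; i++) {
        if ((multiplier >> i) & 1u) {
            product += multiplicand << i;
        }
    }
    return (int64_t)product;
}

// Reference: the built-in multiplication, widened to 64 bits first.
int64_t reference_multiply(int32_t a, int32_t b) {
    return (int64_t)a * (int64_t)b;
}

// Returns 1 if both multipliers agree on this operand pair, else 0.
int agrees_with_reference(int32_t a, int32_t b) {
    return shift_add_multiply(a, b) == reference_multiply(a, b);
}
```

A runtime testbench would do the equivalent of calling agrees_with_reference on each generated operand pair, rather than reading expected products from a file.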

I figured it would be easier to write our test cases using C since it has higher-level constructs than Bluespec. Also, I haven’t yet learned how to write a series of sequential test cases in Bluespec like I did in C.

Bluespec Testbench

Once we have our test cases in our test_cases.vmh, we can make our Bluespec testbench consume them. The way I did it is having our testbench instantiate a BRAM that loads from our hex file. Then, we can send read requests to the BRAM for each test case, one line at a time.

module mkMultiplierUnitTest(Empty);
    let cfg = defaultValue;
    cfg.loadFormat = tagged Hex "test_cases.vmh";
    BRAM1Port#(MaxTestAddress, TestPacket) tests <- mkBRAM1Server(cfg);

    MultiplierUnit dut <- mkMultiplierUnit;
    // ... everything else ...
endmodule

My testbench mkMultiplierUnitTest uses five rules:

- puts submits a new read request for every line of our BRAM.
- question receives the BRAM response, queries our multiplier-under-test, and enqueues the expected result.
  - Terminates if it detects our sentinel value deadbeef.
- answer receives the multiplier response and compares it to our expected result, which it then discards.
  - Terminates if we receive a wrong answer.
- tick (minor) increments our cycle counter.
- terminate (minor) ends our simulation if we end up stalling.

We can’t test something whose interface we don’t know, so here’s our MultiplierUnit interface below. We only worry about two methods: one to start computation, and one to free the multiplier and get the result. Internally, the implementation should have these methods guarded so that they’re only callable when the module is ready to perform each method.

typedef Bit#(32) Word;
typedef Vector#(2, Word) Pair;

interface MultiplierUnit;
    method Action start(Pair in);
    method ActionValue#(Pair) result;
endinterface

Our entire testbench then looks like this:

typedef Bit#(10) MaxTestAddress;  // 10 bits for max 1024 tests
typedef Vector#(4, Word) TestPacket;

module mkMultiplierUnitTest(Empty);
    let cfg = defaultValue;
    cfg.loadFormat = tagged Hex "test_cases.vmh";
    MultiplierUnit dut <- mkMultiplierUnit;  // design under test

    BRAM1Port#(MaxTestAddress, TestPacket) tests <- mkBRAM1Server(cfg);
    Reg#(MaxTestAddress) request_index <- mkReg(0);
    FIFO#(Pair) expected <- mkFIFO;
    Reg#(Word) cycles <- mkReg(0);
    Reg#(Word) last_solved <- mkReg(0);

    function Action conclude;
        action
        $display("Ended at %0d cycles after solving the %0d test", cycles, last_solved);
        $finish;
        endaction
    endfunction

    rule tick;
        cycles <= cycles + 1;
    endrule

    // This rule keeps us requesting
    rule puts;
        request_index <= request_index + 1;
        let request = BRAMRequest{
            write: unpack(0),
            address: request_index
        };
        tests.portA.request.put(request);
    endrule

    // This rule queries the dut
    rule question;
        TestPacket current_test <- tests.portA.response.get();
        current_test = reverse(current_test);
        // Reverse order because of the way the vmh is written/read

        Pair operands = unpack({current_test[0], current_test[1]});
        Pair results = unpack({current_test[2], current_test[3]});

        dut.start(operands);
        expected.enq(results);

        if (current_test[3] == 'hdeadbeef) begin
            $display("deadbeef detected; finishing at %0d cycles", cycles);
            conclude;
        end
    endrule

    rule answer;
        Pair result <- dut.result;
        last_solved <= last_solved + 1;
        if (result != expected.first) begin
            $display("Result was %x but expected %x", result, expected.first);
            conclude;
        end
        expected.deq;
    endrule

    rule terminate if (cycles > 'hFFFF);
        $display("Emergency exit");
        $finish;
    endrule
endmodule
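
The expected words in each test packet have to come from somewhere we trust. One option is a small software reference. The sketch below is in Python rather than Bluespec, and it assumes signed 32-bit operands with the 64-bit product split into an upper and a lower word. The name expected_words is made up for illustration, and the sketch says nothing about how the words are laid out in test_cases.vmh.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(word):
    """Interpret a 32-bit word as a two's complement integer."""
    word &= MASK32
    return word - (1 << 32) if word & 0x80000000 else word

def expected_words(x, y):
    """Return the (upper, lower) 32-bit words of the signed 64-bit product."""
    product = (to_signed32(x) * to_signed32(y)) & 0xFFFFFFFFFFFFFFFF
    return (product >> 32) & MASK32, product & MASK32

if __name__ == "__main__":
    upper, lower = expected_words(0xFFFFFFFF, 2)   # -1 * 2 = -2
    print(f"{upper:08x} {lower:08x}")
```

Here -1 times 2 comes out as ffffffff fffffffe, which is -2 as a 64-bit two’s complement value.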

We should test our testbench so we know it actually works. For that, we can write a dummy MultiplierUnit implementation that can receive requests and serve responses without the burden of producing correct answers. It should fail tests and trigger the failure message.

Dummy Implementation

Just as above, we’re using these definitions.

typedef Bit#(32) Word;
typedef Vector#(2, Word) Pair;

interface MultiplierUnit;
    method Action start(Pair in);
    method ActionValue#(Pair) result;
endinterface

typedef enum {
    Idle,
    Busy,
    Ready
} MultiplierState deriving (Bits, Eq, FShow);

This is a “multiplier” that does nothing. Its sole purpose is to let us run our testbench and fail without stalling. Which it does, beautifully.

(* synthesize *)
module mkMultiplierUnit(MultiplierUnit);
    Reg#(MultiplierState) state <- mkReg(Idle); 
    FIFO#(Pair) last_inputs <- mkFIFO;

    method Action start(Pair in) if (state == Idle);
        last_inputs.enq(in);
        state <= Ready;
    endmethod

    method ActionValue#(Pair) result if (state == Ready);
        last_inputs.deq;
        state <= Idle;
        return last_inputs.first;
    endmethod
endmodule

Area: 1208.08 um^2
Critical-path delay: 145.5 ps
Cycles: N/A
Coverage: Only when inputs = outputs

The large area, despite the module doing almost nothing, can probably be chalked up to the FIFO that holds a Pair, which is equivalent to Bit#(64) of storage per entry.

A stopped clock can be correct. Since our dummy implementation returns its inputs as outputs, we pass the 0 times 0 = 0 test, albeit nothing else. Now that we know our testbench works, we can start worrying about implementing the multiplier.

Designs

First Implementation

Let’s use the Hennessy and Patterson designs for integer multiplication. Let’s start with unsigned multiplication, and we’ll handle signed multiplication later. We’ll use their Radix-2 Multiplication and Division.

This works for non-negative integers. We would need to add some complexity for accommodating negative integers.

(* synthesize *)
module mkMultiplierUnit(MultiplierUnit);
    Reg#(MultiplierState) state <- mkReg(Idle); 
    Reg#(Word) a <- mkRegU;
    Reg#(Word) b <- mkRegU;
    Reg#(Word) p <- mkRegU;
    Reg#(Bit#(5)) index <- mkRegU;  // only need 0 through 31

    rule work(state == Busy);
        Bit#(33) new_p = (a[0] == 'b1) ? {0,p} + {0,b} : {0,p};  // (1); may need carry
        p <= {new_p[32:1]};
        a <= {new_p[0], a[31:1]};
        index <= index + 1;

        if (index == 31) state <= Ready;
    endrule

    method Action start(Pair in) if (state == Idle);
        a <= in[0];
        b <= in[1];
        p <= 0;
        index <= 0;
        state <= Busy;
    endmethod

    method ActionValue#(Pair) result if (state == Ready);
        state <= Idle;
        return unpack({p, a});
    endmethod
endmodule

Our multiplier is a state machine that goes between Idle, Busy, and Ready. It waits for a request in Idle from start, then it sets a bunch of registers and transitions to Busy. It will stay in Busy for 32 cycles worth of work, incrementing the index by 1 each time. When it’s on its last cycle, it will transition to Ready, where it waits for our testbench to pick up the product through a result call. Once result is called, the multiplier transitions back to Idle, ready for the next request.
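
To convince ourselves of the cycle count, here is a rough cycle-level model of the same state machine in Python. It is only a sketch: the class and function names are hypothetical, the asserts stand in for the Bluespec method guards, and it ignores everything about scheduling that the real compiler handles.

```python
MASK32 = 0xFFFFFFFF

class MultiplierModel:
    """Software model of the Idle -> Busy -> Ready state machine."""

    def __init__(self):
        self.state = "Idle"
        self.a = self.b = self.p = self.index = 0

    def start(self, a, b):
        assert self.state == "Idle"          # mirrors the method guard
        self.a, self.b = a & MASK32, b & MASK32
        self.p, self.index = 0, 0
        self.state = "Busy"

    def work(self):
        assert self.state == "Busy"          # mirrors the rule condition
        new_p = self.p + (self.b if self.a & 1 else 0)   # 33-bit sum
        self.p = (new_p >> 1) & MASK32
        self.a = ((new_p & 1) << 31) | (self.a >> 1)
        if self.index == 31:
            self.state = "Ready"
        self.index = (self.index + 1) & 0x1F  # 5-bit counter wraps to 0

    def result(self):
        assert self.state == "Ready"
        self.state = "Idle"
        return self.p, self.a                # (upper word, lower word)

def multiply(a, b):
    """Drive the model to completion; return (product, work cycles used)."""
    m = MultiplierModel()
    m.start(a, b)
    cycles = 0
    while m.state == "Busy":
        m.work()
        cycles += 1
    upper, lower = m.result()
    return (upper << 32) | lower, cycles
```

Calling multiply(3, 5) returns the product 15 along with 32 work cycles, matching the 32 internal cycles described above.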

Algorithm

The main operation is in the line labeled (1). We’re basically doing long multiplication in binary, one digit at a time. Hennessy and Patterson explain it better in their textbook.

We have 3 registers: a, b, and p.

a <= in[0];
b <= in[1];
p <= 0;

a and b are our initial operands. b stays constant, and a changes as we run the algorithm. The lower bits of a correspond to the operand, and the upper bits will gradually be filled with the lower bits of our product. We make the space by consuming and throwing away the lowest bit of a each cycle.
p starts as 0 and is part of our running sum.
At termination, a will hold the lower bits of our product and p will hold the upper bits.

Each cycle, we perform two steps inside of work.

Step 1, we inspect the LSB of a (a[0]) to determine our new_p.

If it’s 0, we do nothing and just use new_p = p.
If it’s 1, we add b to p and use new_p = p + b.
We do this addition in Bit#(33) in case of overflow.

Bit#(33) new_p = (a[0] == 'b1) ? {0,p} + {0,b} : {0,p};  // (1); may need carry

Step 2, we shift new_p and a right by one bit, and store the resulting new_p >> 1 into p. We can think of p and a as stuck together, with p on the left and a on the right. The total 64-bit product will eventually be {p, a}. The LSB of new_p moves into the MSB of a, and we make the space for it by shifting out the a[0] that we consumed. Think of long multiplication.

p <= {new_p[32:1]};
a <= {new_p[0], a[31:1]};

I believe there are some startup cycles because it takes our BRAM a little bit of time to get set. Asymptotically, I can see when running many tests that the cycles per test tends to 34. Internally, only 32 of those cycles correspond to work happening in the multiplier. The other 2 have to do with input/output.

Area: 1273.08 um^2
Critical-path delay: 249.57 ps
Cycles: 272 cycles to the 7th test. (38.86 cycles per multiply (internal 32, external 34))
Coverage: Non-negative only.

I generate the area/delay numbers with synth, and the cycles and test numbers with my $display statements. Since we pass the non-negative tests, let’s extend our multiplier to work for signed multiplication.

Radix-2 Signed Multiplication

We can focus on implementation because we already have tests that cover multiplication with signed values.

We use Booth recoding to multiply with signed numbers. The main difference is that at each step, we inspect the current and previous bit of a rather than just the current bit. Almost all of the change is just to our work rule, with almost everything else staying the same.

It’s a little tricky to explain the idea behind the algorithm, but it gets clearer once we see more cases at higher radix. Essentially, we read our two-bit case {a[0], last_a} as a sum: a[0] is a one-bit two’s complement number (so its single bit has weight -1), and last_a is the carry from the previous step (with weight +1).

With each case:

00 is 0(-1) + 0(1) = 0 + 0 = 0
01 is 0(-1) + 1(1) = 0 + 1 = 1
10 is 1(-1) + 0(1) = -1 + 0 = -1
11 is 1(-1) + 1(1) = -1 + 1 = 0

We use these cases to determine what to do with b in our new_p addition. Depending on the case, we mux between new_p = p, new_p = p + b, or (new!) new_p = p - b.

Reg#(Bit#(1)) last_a <- mkRegU;

rule work(state == Busy);
    // Booth recoding
    Bit#(33) p_ = signExtend(p);
    Bit#(33) b_ = signExtend(b);
    // This turns into a mux choosing between 3.
    Bit#(33) new_p = case ({a[0], last_a}) matches
        2'b01: {p_ + b_};
        2'b10: {p_ - b_};
        default: {p_};  // to massage compiler
    endcase;

    last_a <= a[0];
    p <= {new_p[32:1]};
    a <= {new_p[0], a[31:1]};
    index <= index + 1;

    if (index == 31) state <= Ready;
endrule

Area: 1612.23 um^2
Critical-path delay: 309.79 ps
Cycles: 443 cycles for 13 tests (34 cycles per multiply - 2 overhead = 32)
Coverage: All 32-bit integer multiply
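
For a software cross-check of the Booth version, here is a Python sketch that mirrors the work rule step by step. It assumes last_a is cleared to 0 when a multiplication starts (the start method isn’t shown for this version), and the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF

def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def booth_radix2(a, b):
    """Signed 32x32 multiply, one bit per step; returns the 64-bit product word."""
    a &= MASK32
    p, last_a = 0, 0                     # assumption: last_a starts at 0
    b_signed = sign_extend(b, 32)
    for _ in range(32):
        digit = last_a - (a & 1)         # 01 -> +1, 10 -> -1, 00 and 11 -> 0
        new_p = (sign_extend(p, 32) + digit * b_signed) & 0x1FFFFFFFF  # 33 bits
        last_a = a & 1
        p = new_p >> 1                   # new_p[32:1]
        a = ((new_p & 1) << 31) | (a >> 1)
    return (p << 32) | a
```

The digit line is the whole trick: it collapses the four cases into last_a minus a[0], which is then multiplied by b before the single addition into p.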

Our job is complete if we’re okay with 34 cycles per multiply. Internally, there’s only 32 cycles’ worth of work; the other 2 cycles come from constant testbench overhead, so let’s just worry about the internal work.

We might want a quicker multiplier, though. We don’t necessarily need a critical path of 310 ps if the rest of our arithmetic units force us onto a slower clock anyway, and we may be able to spare more than 1600 um^2 of area if it buys us more multiplications in the same time.

Let’s try and get better performance. There are a few ways to do better multiplication. This time, let’s just focus on higher-radix multiplication. Next time, we can explore other options.

Radix-4 Multiplication

We can make things faster by inspecting two bits at a time instead of one. We can then finish in 16 internal cycles rather than 32, since we’re doing 16 cycles of 2 bits each.

Our case statement becomes a little uglier, muxing between 5 options rather than 3. Most of the change happens in the work rule. That’s because we need to add up to +/- 2b, instead of just +/- b like last time.

For a syntactical aside, I wrote the variable for 2*b as b2 instead of 2b because the Bluespec language specification requires that variable identifiers start with a lowercase letter. I raise syntax errors when identifiers start with numbers in my extension for VS Code, though not currently in my lexer for Rouge.

The Bluespec language specification is a little contradictory though, since identifiers can start with a $ or _ instead of a letter. For spots of ambiguity, I defer to the compiler. The Bluespec compiler raises a compile-time syntax error when it encounters a variable identifier starting with a number.

rule work(state == Busy);
    // Booth recoding
    Bit#(34) p_ = signExtend(p);
    Bit#(34) b_ = signExtend(b);
    Bit#(34) b2_ = signExtend({b, 1'b0});  // 2*b = b << 1

    // Mux that chooses between 5; the default is optimized out.
    Bit#(34) new_p = case ({a[1:0], last_a})
        3'b111, 3'b000: {p_};
        3'b001, 3'b010: {p_ + b_};
        3'b101, 3'b110: {p_ - b_};
        3'b011: {p_ + b2_};
        3'b100: {p_ - b2_};
        default: {0};  // never happens
    endcase;

    p <= {new_p[33:2]};
    a <= {new_p[1:0], a[31:2]};
    last_a <= a[1];
    index <= index + 2;

    if (index == 30) state <= Ready;  // on last step
endrule

Our cases follow the same rules as before, where a[1:0] is now a 2-bit two’s complement number and the last_a is still the 1-bit carry.

e.g., 3'b011 corresponds to 0(-2) + 1(1) + 1(1) = 2, so we add 2b.
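
The same reading works for any window size, so we can generate the case table instead of writing it by hand. This Python sketch is my own illustration (booth_digit is a hypothetical name): it reads the window as a k-bit two’s complement number and adds the carry bit.

```python
def booth_digit(window, last_a, k):
    """Multiple of b selected by a k-bit window of a plus the carry bit last_a.

    The window is read as a k-bit two's complement number, then last_a is added.
    """
    signed_window = window - (1 << k) if window >> (k - 1) else window
    return signed_window + last_a

if __name__ == "__main__":
    for case in range(8):                          # {a[1:0], last_a}
        window, last_a = case >> 1, case & 1
        print(f"3'b{case:03b} -> {booth_digit(window, last_a, 2):+d} * b")
```

With k = 2 this prints 0, +1, +1, +2, -2, -1, -1, 0 for cases 000 through 111, which lines up with the five-way mux in the work rule. With k = 1 it gives back the Radix-2 table.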

Compiler Optimizations

Initially, I added a b2 register that held a value of 2b precomputed in start, but then I realized it didn’t save any work: we can double b with a single fixed bit shift, which costs next to nothing in hardware.

Something interesting is that in Bluespec, the following statements result in the same circuit:

// This is if we store `2b` in a register
b2 <= 2*{0, in[1]};  // "multiply"
b2 <= {0, in[1]} + {0, in[1]};  // addition
b2 <= {in[1], 1'b0};    // simple fixed shift

// Total Circuit in Multiplier:
// Area: 2575.68 um^2
// Critical-path delay: 322.36 ps

The compiler sometimes performs these types of optimizations, but sometimes it doesn’t. For a Bluespec developer, there’s a careful balance between writing beautiful (potentially-) optimizable code and ugly (pretty-sure-is-) optimal code. The optimization part of the Bluespec compiler is not so mature as software compilers like Clang.

Because a multiplication by 2 only requires a fixed bit shift, we don’t need to store the value of 2b. In terms of synthesis, choosing to perform the bit shift every time to save state reduces the area (because we save a Bit#(34) register) but very slightly increases the path. This is the design we’ll go with for Radix-4.

Area: 2328.83 um^2
Critical-path delay: 324.62 ps
Cycles: 16435 cycles for 913 tests

Compared to radix-2 multiplication, we cut our internal cycles in half (now 16) while increasing the area by 44% and critical path only by less than 5%. The speed-up is close to double. It seems like this was well worth it.

Radix-8 Multiplication

Just as we went from 1 bit to 2 bits, we can inspect 3 bits at a time to further reduce the cycles. Of course, 32 is not a multiple of 3, so we need to handle the last cycle as a special case. We would get from 16 to about 11 cycles and hope it doesn’t cost us much in terms of delay and area.

You’ll notice that the cases become slightly more complicated. In Radix-2 we muxed between 3 choices. In Radix-4 we muxed between 5 choices. In Radix-8, we also work with 3b and 4b, so that gives us 9 choices (zero, plus one positive and one negative for each of b, 2b, 3b, and 4b). That’s just in terms of outcomes.

In terms of cases, in Radix-2 we had 2^2=4 cases; in Radix-4 we had 2^3=8 cases; in Radix-8 we’ll have 2^4=16 cases. Quite a lot to keep track of, but it’s fine once you know the pattern.

Like before, each case is itself like a two’s complement number that tells us what to do with b. In Radix-8, we look at 4 bits of a: a[2:0] and last_a. The last_a acts like a carry bit, so we just read a[2:0] as a two’s complement number, keeping in mind the carry. You’ll see if you inspect the cases carefully in the upcoming code excerpt.

We do need to take care of our loop terminating condition. With Radix-2 and Radix-4, we could just stop at previous index being 31 and 30 respectively because the current step would’ve brought the next index to 32 (or 0): a full reset. 32 is cleanly divisible by 2 and 1.

With Radix-8, we might want to stop at index 29, because 29 + 3 = 32. But 29 isn’t an index we would stop at because it’s not a multiple of 3. We can either stop at previous index being 30, bringing our next one to 33, or at 27, bringing our next one to 30. Both cases require us to do a special case at the end, either backtracking by a turn of 1 bit or doing one more turn of 2 bits.

I think it’s easier to do one more turn of 2 bits (which is just one step of Radix-4) than it is to do a backtrack. To backtrack, we’d need to also subtract b according to the last digit, as well as restore last_a. But either should work.

Since not much else is going on when we’re calling result, I’m putting the last step in the result method, though you could also put it in a separate rule or in the same work rule. The tradeoff is that putting the step in result may spread out the critical path, which can help, hinder, or do nothing depending on whether the processor’s critical path contains work, contains result, or contains neither. In exchange, we can shave off a cycle from work, giving us only 10 internal cycles.

There’s also the concern of instantiating new adders in result, but hold onto that thought for now.

Computing 3b

We should precompute 3b because rather than only requiring a shift like b2 and b4, it requires an addition. If we do it in work, then we may have two layers of adders: one to compute 3b, then another to add it to new_p. Conceptually, it might wreck our critical-path delay, so we precompute 3b in start. I also run the numbers later in a small experiment.

Unlike with b2, the compiler seems to have some trouble giving us an efficient circuit for b3, so we need to tune by hand.

Check the differences in synthesis:

b3_ <= 3*signExtend(in[1]);
// Area: 5222.38 um^2
// Critical-path delay: 411.48 ps

Bit#(35) b2 = signExtend({in[1], 1'b0});
b3_ <= b2 + signExtend(in[1]);
// Area: 5261.75 um^2
// Critical-path delay: 341.64 ps

b3_ <= signExtend(in[1]) + signExtend(in[1]) + signExtend(in[1]);
// Same as the b2 + b1 (second one)

Well, it’s interesting that 3*signExtend(in[1]) gets the compiler to produce something much slower (about 70 ps more delay for roughly the same area) than repeated addition or our “clever” addition. The repeated addition seems to get transformed into the clever addition, but I’m surprised the Bluespec compiler doesn’t optimize the first into either the second or third. I’m not sure what it’s doing.

New Rules

We precompute b3, but b2 and b4 are once again results of shifts. We add in the many new cases for work. Hopefully you can see the pattern.

// Constants; we can pull them out of the `work` rule to share with `result`
Bit#(35) b_ = signExtend(b);
Bit#(35) b2_ = signExtend({b, 1'b0});  // compiler expands lone 0
Bit#(35) b4_ = signExtend({b, 2'b0});
Bit#(35) p_ = signExtend(p);

rule work(state == Busy);  // handles 10 cycles of 3 bits each
    // Booth recoding
    // Mux choosing between 9; default is optimized out.
    Bit#(35) new_p = case ({a[2:0], last_a})
        4'b100_0:           {p_ - b4_};
        4'b101_0, 4'b100_1: {p_ - b3_};
        4'b110_0, 4'b101_1: {p_ - b2_};
        4'b111_0, 4'b110_1: {p_ - b_};
        4'b000_0, 4'b111_1: {p_};
        4'b001_0, 4'b000_1: {p_ + b_};
        4'b010_0, 4'b001_1: {p_ + b2_};
        4'b011_0, 4'b010_1: {p_ + b3_};
        4'b011_1:           {p_ + b4_};
        default: {0};  // never happens
    endcase;
    p <= {new_p[34:3]};
    a <= {new_p[2:0], a[31:3]};
    last_a <= a[2];
    index <= index + 3;

    if (index == 27) state <= Ready;  // on last step
endrule

method Action start(Pair in) if (state == Idle);
    // ... other boilerplate ...
    b3_ <= signExtend(in[1]) + signExtend(in[1]) + signExtend(in[1]);
endmethod

method ActionValue#(Pair) result if (state == Ready);  // handles last 2 bits
    state <= Idle;

    // Radix-4 step for the last 2 bits
    Bit#(35) new_p = case ({a[1:0], last_a})
        3'b111, 3'b000: {p_};
        3'b001, 3'b010: {p_ + b_};
        3'b101, 3'b110: {p_ - b_};
        3'b011: {p_ + b2_};
        3'b100: {p_ - b2_};
        default: {0};  // never happens
    endcase;

    // Named p_out/a_out so they don't shadow the module-level p_ used above
    let p_out = {new_p[33:2]};
    let a_out = {new_p[1:0], a[31:2]};

    return unpack({p_out, a_out});
endmethod
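
To check that ten 3-bit steps plus one 2-bit step really cover all 32 bits, here is a Python model of the mixed-radix scheme. It is a sketch with hypothetical names, it assumes last_a starts at 0, and it computes the digit arithmetically rather than through the case table.

```python
MASK32 = 0xFFFFFFFF

def signed(value, bits):
    """Interpret the low `bits` bits of value as a two's complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def booth_step(p, a, last_a, b, k):
    """One radix-2**k Booth step on the {p, a} register pair."""
    window = a & ((1 << k) - 1)
    digit = signed(window, k) + last_a             # between -2**(k-1) and 2**(k-1)
    new_p = (signed(p, 32) + digit * signed(b, 32)) & ((1 << (32 + k)) - 1)
    last_a = (a >> (k - 1)) & 1                    # top bit of the consumed window
    p = new_p >> k
    a = ((new_p & ((1 << k) - 1)) << (32 - k)) | (a >> k)
    return p, a, last_a

def booth_radix8(a, b):
    """Ten 3-bit steps (the work rule) followed by one 2-bit step (result)."""
    p, a, last_a = 0, a & MASK32, 0                # assumption: last_a starts at 0
    for _ in range(10):
        p, a, last_a = booth_step(p, a, last_a, b, 3)
    p, a, _ = booth_step(p, a, last_a, b, 2)
    return (p << 32) | a
```

The final 2-bit step reuses the same carry bit left behind by the last 3-bit step, which is why mixing radices at the end works without any extra bookkeeping.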

Area Analysis

Area: 5261.75 um^2
Critical-path delay: 341.64 ps
Cycles: 10957 for 913 tests (10 internal cycles; 12 external cycles per)

Compared to Radix-4, we once again have about a 5% increase in critical path but a 125% increase in area. In exchange, we’ve gone from 16 internal cycles to 10 internal cycles, improving our time by over a third.

The large increase in area is because in Bluespec, we instantiate an adder with each call of +, except in very special cases where we have repeated subexpressions (e.g., two adders always perform identical calculations every cycle). So we’ve instantiated more adders. You may have noticed a similar effect when we did Radix-4 from Radix-2.

If we wanted, we could even instantiate a full multiplier with *.

For an area-efficient multiplier implementation, we may want to write one that uses only one adder. We would mux in the proper multiple of b instead of muxing between the results of several adders. So far, we’ve been generous with the amount of area we’re prepared to use. I discuss ways around it later.

In fact, Hennessy and Patterson’s description of higher-radix multiplication (what we’re doing) is in their section Speeding Up Multiplication with a Single Adder. Nine adders is starting to get a little high for a single adder design.

It’s interesting that our area increased by 125%, since an initial count of our adders (just counting each +) suggests we now have 8 + 1 + 4 = 13 adders to Radix-4’s 4. A 125% increase is a factor of 2.25, which is close to 9/4, about what we’d expect from 9 adders if adders dominated the area. It’s such a clean number that I would guess the compiler optimized out a few adders (the four in result). Cycle-by-cycle, our result adders perform the same calculations as a subset of the work adders, so it makes sense for the two components to share the same adders and just wire the sums to both work and result.

Not Precomputing 3b

What happens if we don’t precompute 3b? Let’s do a little experiment and compare to the precomputed numbers:

// Precomputed
Area: 5261.75 um^2
Critical-path delay: 341.64 ps

// Not precomputed
Area: 4830.56 um^2
Critical-path delay: 481.22 ps

As you can see, we increase our delay by 40% because work must have two layers of adders if we compute b3 at runtime.

Radix-16 Consideration

If we continue onto Radix-16, we can expect more adders (unless we start sharing adders between multiples of b, as discussed above).

For work, we’d need to account for up to +/- 8b, meaning 16 adders in that rule alone.

For start, we previously had one adder because we needed to compute b3. Now, we also need to compute b5, b6, and b7. Let’s say we save b6 by performing a shift on b3. Rather than having a second layer of adders in start, we can add one more cycle in between start and work to precompute b5 and b7 by adding b2 + b3 and b4 + b3, which also saves us having to compute it every cycle. The precomputation cycle might be helpful if Bluespec doesn’t have a good 3-way single-layer adder as a built-in.

In total, that means we need 16 + 3 = 19 adders, over double our current 9 adders.

What would we get in return? As long as we can mux between 17 (versus previous 9) values without an issue, and as long as our p_ can tolerate the fanout to all those adders, our critical path might not increase too much. We would just get a much higher area.

Our internal cycles should be 32/4 = 8, plus 1 for setup, for 9 internal cycles. Our earlier Radix-8 implementation saved an internal cycle by moving the last step from work to result, so we can think of it as going from 10 to 9 internal cycles, or 11 to 9 if we discount the saved cycle. It seems like a lot of added complexity and area to save one or two cycles per multiplication.

Let us not do Radix-16 for now.

The calculus might be different if we were doing 64-bit multiplication. In that case, we would be going from 21 cycles to 17 cycles. If we did this for 128-bit multiplication, we’d go from 42 to 33 cycles (32 steps plus the precomputation cycle). It’s the saved cycle in Radix-8 and the precomputation cycle in Radix-16 that result in a lot of overhead when we’re working with 32-bit multiplication.
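
The cycle accounting can be written down as two small formulas. This Python sketch uses hypothetical function names and bakes in the assumptions from this section: Radix-8 folds its leftover bits into result (so the width should not be a multiple of 3), and Radix-16 spends one extra cycle on precomputation.

```python
def radix8_internal_cycles(width):
    """Steps of 3 bits in work; the leftover 1 or 2 bits are folded into result."""
    return width // 3

def radix16_internal_cycles(width):
    """Steps of 4 bits plus one precomputation cycle for the odd multiples of b."""
    return width // 4 + 1

if __name__ == "__main__":
    for width in (32, 64):
        print(width, radix8_internal_cycles(width), radix16_internal_cycles(width))
```

For 32 bits this gives 10 versus 9, and for 64 bits it gives 21 versus 17.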

Reduced Adder Design for Radix-8

All the above Radix-n designs are supposed to use only a single adder, both in theory and according to Hennessy and Patterson. Because of the way we’ve implemented them, we’ve instantiated way more than a single adder.

In Bluespec, using the + operator, like calling any function, often results in the compiler inlining it and therefore synthesizing a brand new adder. The exception is when the compiler can determine that it can get away with fewer, such as when two adders always compute the same value, which is what happened in our Radix-8 design.

In theory, the compiler could be the one to optimize for things like area and critical-path delay, but the Bluespec compiler isn’t nearly as sophisticated as mainstream C compilers. The developer often needs to optimize by hand for area and delay.

I would bet that there are more people who have contributed to C optimizing compilers like GCC and Clang than there are people who have used Bluespec period, so it isn’t quite fair a race. Perhaps optimizing RTL output for things like use of + isn’t even within scope for current Bluespec compiler engineers.

Let’s say we don’t want to instantiate so many adders. Let’s say we only want one.

If we want to add without synthesizing adders everywhere, we can instantiate our own Adder submodule and use that. Then, we can add by calling the corresponding method, and as long as we’re using a synthesized module, the compiler will throw a compilation error if it detects we’re overusing the single add port.

interface Adder#(type t);
    method t add(t a, t b);
endinterface

module mkAdder(Adder#(t)) provisos (Bits#(t, t_bits), Arith#(t));
    method t add(t a, t b) = a + b;
endmodule

(* synthesize *)
module mkAdder35(Adder#(Bit#(35)));
    Adder#(Bit#(35)) adder <- mkAdder;
    return adder;
endmodule

The first module is polymorphic, which gives us freedom in producing adders for different types and widths. To massage the compiler for synthesis, I put in the second module and hard-coded it to work for 35 bits, which allows us to force a single port for the adder.

The overall multiplier still gets synthesized either way, but I want to mandate a single port for the adder. Non-synthesized modules can “grow” new ports, and then we’re right where we started.

Let’s drop this into place for our Radix-8 design like so:

Bit#(35) operand = case ({a[2:0], last_a})
    4'b100_0:           {- b4_};
    4'b101_0, 4'b100_1: {- b3_};
    4'b110_0, 4'b101_1: {- b2_};
    4'b111_0, 4'b110_1: {- b_};
    4'b000_0, 4'b111_1: {0};
    4'b001_0, 4'b000_1: {b_};
    4'b010_0, 4'b001_1: {b2_};
    4'b011_0, 4'b010_1: {b3_};
    4'b011_1:           {b4_};
    default: {0};  // never happens
endcase;

let new_p = adder.add(p_, operand);

Lower Area, Higher Delay

We get the following synthesis changes:

Area: 5261.75 um^2 -> 2960.05 um^2 (remember our Radix-4 was 2328.83)
Critical-path delay: 341.64 ps -> 520.25 ps  (a big increase if we can't afford it)

It’s strange that the critical path is even higher than when we had two layers of adders from our non-precomputed b3 experiment.

My first guess for the much greater critical-path delay is fan-in from the 9-way muxing on the b operands into the adder. Then again, maybe that’s not it, since we previously must’ve had fan-out from p_ going into eight different adders. Hmmm. Unless fan-in is much more harmful than fan-out, that shouldn’t be it.

My second guess is that our “negative” operands require a lot of hardware before going into the adder, versus all this happening in one go with a built-in adder. We might effectively have two layers of adders, with an implied unary 0 - b_ for the first four cases to generate our negative operand, then going into the actual adder. That is, maybe in Bluespec, a - operator doesn’t instantiate an a + (-b), but a genuine subtractor.
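
To make the second guess concrete, here is the two’s complement identity behind it, sketched in Python on 35-bit values. This is general arithmetic, not a claim about what the Bluespec compiler or the synthesis tool actually emits, and the function names are hypothetical.

```python
WIDTH = 35
MASK = (1 << WIDTH) - 1

def negate(x):
    """Two's complement negation: invert every bit, then add 1."""
    return ((x ^ MASK) + 1) & MASK

def sub_via_negate(p, b):
    """Two additions in series: the +1 inside negate, then the sum with p."""
    return (p + negate(b)) & MASK

def sub_via_carry_in(p, b):
    """One addition: the inverted operand, with the +1 supplied as a carry-in."""
    carry_in = 1
    return (p + (b ^ MASK) + carry_in) & MASK
```

Both functions return the same bits as p - b. The difference is structural: forming the negated operand first implies an incrementer ahead of the adder, while a dedicated subtractor can fold the +1 into the adder’s carry-in.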

Experimentation

I don’t have the diagnostic tools to determine whether my second guess is correct visually or from the command line, so let’s test that guess experimentally.

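When I replace the more elegant operand case statement with two separate case statements, one choosing the multiple of b and one choosing between add and sub on the adder (sub being an extra method on top of the Adder interface shown earlier), we save a marginal amount of area and delay at the expense of much uglier code. I plan on reverting.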

Bit#(35) operand = case ({a[2:0], last_a})
    4'b100_0:           b4_;
    4'b101_0, 4'b100_1: b3_;     
    4'b110_0, 4'b101_1: b2_;     
    4'b111_0, 4'b110_1: b_;     
    4'b000_0, 4'b111_1: 0;
    4'b001_0, 4'b000_1: b_;    
    4'b010_0, 4'b001_1: b2_;     
    4'b011_0, 4'b010_1: b3_;     
    4'b011_1:           b4_;
endcase;

Bool is_add = case ({a[2:0], last_a})
    4'b100_0, 4'b101_0, 4'b100_1,
    4'b110_0, 4'b101_1, 4'b111_0,
    4'b110_1, 4'b000_0, 4'b111_1: False;
    4'b001_0, 4'b000_1, 4'b010_0,
    4'b001_1, 4'b011_0, 4'b010_1,
    4'b011_1:                     True;
endcase;

let new_p = (is_add) ? adder.add(p_, operand) :
                       adder.sub(p_, operand);

Area: 2948.08 um^2 (from 2960.05)
Critical-path delay: 494.37 ps (from 520.25)
Critical path: ~b[0] -> p[31:0][30](DFF_in)

It’s a rather marginal change though, so I don’t think that was it. A third guess: maybe it’s because our adder is shared between too many spots: work, start, result.

If we then only use a shared adder in work, instantiating standalone adders for start and result, we get a delay of 432 ps, still much higher than 342 ps from the 9-adder implementation (and with obvious cost of higher area from more adders). But that changes our critical path to going through last_a -> p[31:0][31](DFF_in), so it could be an issue with our selector fan-out. Maybe that was it?

Exercise for the Reader

I’m starting to run out of guesses. This is starting to creep into a lower level of abstraction than I’m accustomed to. (I got most of my expertise in big processor things.) Where’s all the delay coming from? Consider this an exercise for the reader.

If you have an idea why our delay has ballooned so much, contact me and I’ll see about following up in a later post. You can play around with the source on the corresponding GitHub repo as long as you have the Bluespec compiler and the synth tool. Otherwise, you can let me know and I can test it later.

It might be an issue with our synthesis tool. It could be falling into a local minimum. Maybe we could also draw diagrams and compare the two designs.

My current method of synthesis is primitive and uses the synth tool in the Minispec repository. There may be a better way that can get us more information about the area and delay, but I haven’t worked through such a process yet. I would only be a little surprised if nobody’s made an easy way to get timing and area information for a Bluespec design.

We don’t have great tools for diagnosing timing and area issues. We can still think conceptually and look at the numbers we have, but we can’t yet draw strong conclusions about area and delay directly from our Bluespec.

Built-in Multiplier

Bluespec offers a built-in multiplier as a primitive through the * operation. Here’s an implementation of our MultiplierUnit interface using it:

(* synthesize *)
module mkBuiltInMultiplierUnit(MultiplierUnit);
    FIFO#(Bit#(64)) data <- mkFIFO;

    method Action start(Pair in);
        data.enq(signExtend(in[0]) * signExtend(in[1]));
    endmethod

    method ActionValue#(Pair) result;
        data.deq;
        return unpack(data.first);
    endmethod
endmodule

It can do one multiplication per cycle. What’s the catch? You can probably guess: area and delay.

Area: 16272.28 um^2
Critical-path delay: 858.07 ps
Cycles: 914 cycles for 912 tests (basically one cycle per multiply)

Surprisingly, the penalty doesn’t seem untenably high. Wow. I only found this at the end of experimenting with the other multipliers.

Comparing built-in multiplication to our many-adder interpretation of Radix-8, we have about a 150% increase in delay and a 210% increase in area. Compared to our single-adder interpretation of Radix-8, we have a 450% increase in area and a 65% increase in delay. It’s a big increase, but the single cycle is crazy: we save 9 internal cycles, and we save the external cycles too. One multiplication per cycle!

It almost seems worth the cost. With my slow processor and its 1000-1200 ps critical-path delay, maybe I could even embed the built-in multiplier as-is as a single cycle multiplier. I would only need to replace it if I started speeding up the rest of the stages.
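
One rough way to compare the designs is internal cycles times critical-path delay. The Python sketch below uses the numbers reported above and assumes, unrealistically, that each multiplier alone sets the clock period. In a processor with a slower clock, only the cycle counts would matter. The names are hypothetical.

```python
# (design, internal cycles per multiply, critical-path delay in ps), from the numbers above
DESIGNS = [
    ("Radix-2 signed",        32, 309.79),
    ("Radix-4",               16, 324.62),
    ("Radix-8, many adders",  10, 341.64),
    ("Radix-8, single adder", 10, 520.25),
    ("Built-in *",             1, 858.07),
]

def latency_ps(cycles, delay_ps):
    """Time for one multiply if the clock period equaled the critical path."""
    return cycles * delay_ps

if __name__ == "__main__":
    for name, cycles, delay in DESIGNS:
        print(f"{name:24s} {latency_ps(cycles, delay):9.2f} ps")
```

Under that assumption, the built-in multiplier finishes a multiply in about 858 ps, against roughly 3416 ps for the many-adder Radix-8 design.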

The efficiency probably comes from the built-in multiplier most likely being written directly in Verilog, or with a very low-level use of Bluespec. If we’re willing to give up the single-cycle latency (and either wait or pipeline), we can definitely do better in Verilog, and we might be able to do better in Bluespec.

Maybe in a later post I’ll attempt to write a better multiplier in Verilog. In that case, it might still be useful to sketch things out first in Bluespec, just to check that the algorithm works cycle-wise. Once we have a Verilog module, we can still embed it in a Bluespec design for simulation and synthesis, just like an IP block.

Next Time

Next time, we’ll look at some more advanced multiplier designs. I’m particularly interested in increasing throughput, either through using faster adders or by pipelining. And I’m planning on looking at designs that intentionally use many adders.

Our aim right now is to increase multiplier throughput while assuming that our compiler gives us reasonable but not necessarily optimal hardware. Because we would be trying to fit this multiplier as a functional unit in our processor, we don’t have to worry all that much about the critical path as long as it’s below the critical path of all the other stages of the processor, and as long as the area isn’t prohibitively large.

Something else the reduced adder design allows us to do is to implement multipliers that use more specialized adders instead of relying only on the Bluespec built-in + operator. That might come in handy later if we look at designs that use particular adders, like carry-save adders. It might also be worth taking a peek at what state-of-the-art Bluespec processors actually use, in designs like Flute or Toooba.

If we do implement adders, then we’ll want to construct a testbench similar to the one we constructed for multipliers. We might need to focus a post on adders in particular before focusing on applying them toward multipliers. All in due time.

At some point, I’ll write a similar post about basic division, since it’s the other half of the RISC-V “M” extension.