<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.3.2">Jekyll</generator><link href="https://www.martinchan.org/feed.xml" rel="self" type="application/atom+xml" /><link href="https://www.martinchan.org/" rel="alternate" type="text/html" /><updated>2024-11-15T16:31:29-05:00</updated><id>https://www.martinchan.org/feed.xml</id><title type="html">Martin Chan</title><subtitle>Personal website for Martin Chan, a current MEng student at MIT in computer systems with an interest in computer architecture.</subtitle><author><name>Martin Chan</name><email>martinch@mit.edu</email></author><entry><title type="html">Bean Stew</title><link href="https://www.martinchan.org/blog/bean-stew/" rel="alternate" type="text/html" title="Bean Stew" /><published>2024-01-16T00:00:00-05:00</published><updated>2024-01-16T00:00:00-05:00</updated><id>https://www.martinchan.org/blog/bean-stew</id><content type="html" xml:base="https://www.martinchan.org/blog/bean-stew/"><![CDATA[<figure>
<p><img src="/assets/media/bean-stew/Image00007.jpg" alt="A pot full of bean stew." /></p>
  <figcaption>Some people like to fish out the carrots and celery from the long cooking process and replace them with fresher vegetables for eating. When I’m cooking for one, I try to adhere to the <a href="https://en.wikipedia.org/wiki/Pareto_principle">80/20 principle</a>. What’s the least effort that can give us the best results?</figcaption>
</figure>

<p>I haven’t posted in a while (almost two months!) since I committed to my MEng. So here is a low-ish effort post to get me back into the swing of things. I moved back into Cambridge last week after a few months in Philadelphia. Shout out to the United States Postal Service for their role in shipping most of my stuff all the way here.</p>

<p>My spring room is still mostly empty. I have most of my furniture sans mattress standing by. I’m just waiting for friends to help me do some heavy lifting to prep the room for living. I have a housemate who graciously let me live in her room while she’s away for an internship, so I’ve set up shop there for January.</p>

<p>I’ve mostly settled in. We live near a bunch of restaurants, but I thought it would be prudent to start cooking for myself again, especially in January when I have more time. I made a big pot of bean stew and chicken a few days ago.</p>

<h2 id="bean-stew-background">Bean Stew Background</h2>

<p>I very loosely “followed” (and skipped half the steps of) the <a href="https://www.seriouseats.com/traditional-french-cassoulet-recipe">cassoulet recipe from Serious Eats</a>. It’s a fancy name for a rustic beans and chicken dish. They include a lot of little optimizations, but I stuck to the low-hanging fruit.</p>

<p>The biggest step I skipped was braising the poultry. I started with searing the chicken like they recommend, but I think my heat was so low that the chicken ended up cooked by the end of searing. I decided to just dice the seared chicken and set it aside as a stew topping.</p>

<p>My bean of choice was the pinto bean. You can probably use other beans instead. I’m just more accustomed to pinto from my days of eating Chipotle. When I feel more adventurous, I’ll probably try other beans. I know they’re out there.</p>

<p>Altogether, this made about 3750g of bean stew. I used 1250g to split between 5 servings (so 250g each) and stored 2500g in the freezer for later. I topped each serving with 100g of diced chicken and some bacon (didn’t bother to measure). This is what I call a meal kit, good for one meal.</p>

<figure>
<p><img src="/assets/media/bean-stew/Image00009.jpg" alt="A glass container of beans with chicken and bacon on top." /></p>
  <figcaption>And this is what each meal kit looks like. To reheat, I usually microwave it for a couple minutes to warm up the beans, then I finish it off in the air fryer to crisp up the chicken and bacon. When I’m feeling fancy, I add some shredded cheese on top. I use exclusively borosilicate glass to make sure the glass doesn’t explode on me.</figcaption>
</figure>

<figure>
<p><img src="/assets/media/bean-stew/Image00010.jpg" alt="" /></p>
  <figcaption>The remaining 2500g of bean stew in jars in my freezer. I’ll thaw them out when I need them. Each one is good for about 3-4 meal kits.</figcaption>
</figure>

<p>I forgot to note the raw chicken weight, but I ended up with 800g of cooked chicken. I’ll need to cook more meat to accompany the stew later on, and I’m still investigating easy but tasty ways to do that. I’ve been searing by hand, which does the job and is tasty, but it takes a lot of time and causes a lot of oil splatter. I wonder if I can get similar results from the oven. Or maybe I just continue what I’m doing and just crank up the heat.</p>
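<p>As a back-of-the-envelope check on how much more meat that is, here is a small Python sketch. It assumes every future meal kit uses the same 250g of stew and 100g of chicken as the first five, and that none of the 800g of cooked chicken disappears as a snack. The variable names are only for illustration.</p>

```python
# Figures from the post; the per-kit amounts are assumed to stay the same.
STEW_TOTAL_G = 3750
STEW_PER_KIT_G = 250
CHICKEN_PER_KIT_G = 100
CHICKEN_COOKED_G = 800
KITS_MADE = 5

# Stew that went into the freezer after the first five kits.
stew_frozen_g = STEW_TOTAL_G - KITS_MADE * STEW_PER_KIT_G
kits_from_freezer = stew_frozen_g // STEW_PER_KIT_G

# Cooked chicken left over, and how much more the frozen stew would need.
chicken_left_g = CHICKEN_COOKED_G - KITS_MADE * CHICKEN_PER_KIT_G
chicken_needed_g = kits_from_freezer * CHICKEN_PER_KIT_G - chicken_left_g

print(f"Frozen stew: {stew_frozen_g}g, enough for {kits_from_freezer} more kits")
print(f"Chicken left: {chicken_left_g}g; still need about {chicken_needed_g}g cooked")
```

<p>Under those assumptions, the freezer covers ten more kits, and the leftover chicken only covers three of them.</p>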

<p>I think the stew’s alright. It’s a little celery-forward, which I wasn’t expecting since I only used like three stalks. It’s otherwise a pretty boring neutral stew. It would probably benefit from some change, but I don’t have enough food experience yet to know what change it needs. Maybe it doesn’t have enough umami (whatever that is) or acid (e.g., lemon juice or vinegar). I’m a little hesitant to use acid early on when cooking beans because it stops them from softening, but there’s no harm in adding it when they’re done cooking. Or maybe it needs more salt or more fat. Who knows. This wasn’t meant to be for experimentation.</p>

<p>One of my goals for the new year is to figure out how to use acid and other food science principles in cooking. It’ll take some reading (what are some good resources?) and some dedicated experimentation. I’ll probably need to compare recipes with more or less acid added to taste what the difference is.</p>

<h2 id="recipe">Recipe</h2>

<ul>
  <li>Two pounds dry pinto beans, soaked the night before in salted water.</li>
  <li>One pound bacon</li>
  <li>Mirepoix
    <ul>
      <li>Three large onions</li>
      <li>Some celery</li>
      <li>Some carrots</li>
    </ul>
  </li>
  <li>Better than Bouillon (or similar) and gelatin</li>
  <li>Bunch of chicken thighs.</li>
  <li>Miscellaneous herbs and spices (I added black pepper and thyme) and optionally cooking oil.</li>
</ul>

<h2 id="process">Process</h2>

<p>There’s more than one way to do it. I first seared the chicken in some cooking oil. It was only meant to be a sear, but I ended up cooking it through, so I figured I could just leave it at that and skip the braise. It’s not the same, but we’re going for minimal effort.</p>

<p>While I waited for the chicken to sear, I chopped up my vegetables and sliced my bacon. You could skip the step of adding cooking oil if you cook the bacon before the chicken, since the bacon releases fat. I did the bacon second because I wanted to cut my vegetables while the chicken seared, and I didn’t want bacon residue on my cutting board touching my vegetables. But it’s really no big deal since it all gets cooked in the end.</p>

<p>After searing the chicken (no pictures), I cooked down the bacon and fished it out. The bacon releases a lot of fat. I probably could’ve left all the fat in the pot, but recipes suggested removing some so as not to overdo it. I like cooking with leftover fat, so I save the bacon fat for other things, like making quesadillas or searing something else later.</p>

<figure>
<p><img src="/assets/media/bean-stew/Image00001.jpg" alt="Cutting board with a one pound packet of bacon sliced into pieces." /></p>
  <figcaption>I like to use whole packets when I can. It’s easier to keep track of (less counting) and you don’t need to worry about having half a packet of bacon leftover. It’s probably only an issue for people who cook as infrequently as I do.</figcaption>
</figure>

<figure>
<p><img src="/assets/media/bean-stew/Image00002.jpg" alt="A container with one pound of cooked bacon." /></p>
  <figcaption>Boy dinner.</figcaption>
</figure>

<p>I lazily chopped my vegetables. You could do something nicer, but these vegetables will shrivel up after a few hours of cooking anyway. Serious Eats likes recommending that you do much coarser chops and fish them out once they’ve given their essence to the stew. I didn’t bother. Since I was leaving the vegetables in, I think I should’ve given them a finer dice just so they look nicer with the beans.</p>

<figure>
<p><img src="/assets/media/bean-stew/Image00003.jpg" alt="Dutch oven full of chopped onion, celery, carrot on the left, and a baking sheet with about two pounds of seared chicken on the right." /></p>
  <figcaption>My baking sheet of seared chicken on the right (also acts as a snack while cooking) and my mirepoix cooking on the left. The mug is full of bacon fat. The red dutch oven is one of my most beloved cooking vessels.</figcaption>
</figure>

<p>After giving the vegetables some time in the pot to meet each other, I added in half the cooked bacon and my raw beans. The beans will take a while to soften up and cook. I don’t know if adding the bacon back in does anything, but it’s kind of like <a href="https://en.wikipedia.org/wiki/Pork_and_beans">pork and beans</a>.</p>

<figure>
<p><img src="/assets/media/bean-stew/Image00004.jpg" alt="Dutch oven full of beans." /></p>
  <figcaption>My pot right after I added the beans and half the bacon back in. It’s a lot of beans relative to the pot. But the vegetables will melt down and give the beans more space.</figcaption>
</figure>

<p>I didn’t use pre-made chicken stock. Instead, I used a couple spoons of Better than Bouillon and a packet of gelatin. That’s a step I didn’t skip, since it’s meant to give the stew some body. Does it work? I don’t know. But it’s no big deal to put a packet of gelatin powder in warm water. Also, it’s easier to carry a jar of bouillon than it is to carry cartons of stock. Once I added the liquid, I put the pot into the oven at 180 degrees to slowly cook. I didn’t put on the lid, but it doesn’t really matter.</p>

<figure>
<p><img src="/assets/media/bean-stew/Image00005.jpg" alt="" /></p>
  <figcaption>My pot after one hour in the oven. I forgot to take a picture before putting it into the oven. You can see some of the thyme.</figcaption>
</figure>

<p>After an hour in the oven, I let it have another hour and a half before putting on a lid and turning off the heat. I kept the whole pot in the oven for a few more hours with the heat off. It’s not a critical part of the process. I just knew that the beans needed more time and I had dinner planned with friends. If I’d used the oven before, I probably would’ve kept the oven on while I was away (don’t tell the fire marshal), but I wanted to be extra careful.</p>

<p>What’s it mean for food safety? Well, the lid had some time to warm up in the oven, and the whole thing was at 180 degrees for at least a little while. I like to think it was food-safe. The beans certainly got soft enough, and there was no overcooking of meat.</p>

<figure>
<p><img src="/assets/media/bean-stew/Image00006.jpg" alt="" /></p>
  <figcaption>My pot after six hours in the oven. I turned off the heat after 2.5 hours, so it sat in the oven (lid closed) in the residual heat for the remaining 3.5.</figcaption>
</figure>

<p>A small crust developed on top of the beans, which is sort of the point of the gelatin in the stock, but I went against the spirit of cassoulet by using a deep pot rather than a shallow dish. It’s not very much crust relative to the rest of the beans. I finish my meal kits off in the air fryer, so maybe more crust will develop later.</p>

<figure>
<p><img src="/assets/media/bean-stew/Image00008.jpg" alt="" /></p>
  <figcaption>My meal kits after adding some diced chicken on top and some of the bacon.</figcaption>
</figure>

<p>After some assembly, we’re done. Once I get through the initial wave of meal kits, I’ll prepare more meat and thaw some of the frozen stew. I don’t want to prep too many and have them go bad in the fridge.</p>

<p>Hopefully I’ll have fancier dishes to share once the semester starts rolling again. It’s hard to say how busy the next few months will be for me. But since cooking gets easier with practice, maybe I’ll be able to make fancier things with less effort. Something like a real braise.</p>]]></content><author><name>Martin Chan</name><email>martinch@mit.edu</email></author><category term="post" /><summary type="html"><![CDATA[I moved back to Cambridge in preparation for the spring. And I made a bean stew as my first meal prep cycle.]]></summary></entry><entry><title type="html">Early Literature Review</title><link href="https://www.martinchan.org/blog/early-literature/" rel="alternate" type="text/html" title="Early Literature Review" /><published>2023-11-26T00:00:00-05:00</published><updated>2023-11-26T00:00:00-05:00</updated><id>https://www.martinchan.org/blog/early-literature</id><content type="html" xml:base="https://www.martinchan.org/blog/early-literature/"><![CDATA[<h2 id="overview">Overview</h2>

<p>In anticipation of my upcoming MEng project, I’ve been doing some reading on the ecosystem around language servers and related topics like compilers. Admittedly, there doesn’t seem to be much out there on language server implementation apart from <a href="https://matklad.github.io/">@matklad</a>’s work with <a href="https://github.com/rust-lang/rust-analyzer"><code class="language-plaintext highlighter-rouge">rust-analyzer</code></a>.</p>

<p>The MIT EECS website recommends the substantial work of the thesis be done while in residence, so I’ve mostly been starting a literature review so I can hit the ground running when the semester starts.</p>

<p>I’ve also been trying to brush up on my software engineering skills, which have gotten a little rusty from disuse. I’ll need them when I start implementing.</p>

<h2 id="language-servers-and-compilers">Language Servers and Compilers</h2>

<p>A language server is similar to a compiler in that it takes in a set of source files as input and derives meaning from those files. The main difference is that while compilers use their powers for code generation, language servers use their powers to help the programmer be more productive.</p>

<p>Unlike compiler design, language server design is not yet an established academic discipline. They haven’t started teaching these things in school. There is no counterpart to the <a href="https://en.wikipedia.org/wiki/Compilers:_Principles,_Techniques,_and_Tools">dragon book</a>. Language servers barely existed ten years ago, and only as proprietary pieces of IDEs.</p>

<p>Today, language servers are everywhere as productivity tools for programmers. The rise of language servers can be tied directly to the rise of Visual Studio Code and TypeScript in the past decade, both by Microsoft. The <a href="https://microsoft.github.io/language-server-protocol/">Language Server Protocol</a> itself is an open standard started by Microsoft and used to allow code editors and language servers to talk to each other. We could further trace language servers’ lineage to the IDE work pioneered by <a href="https://martinfowler.com/bliki/PostIntelliJ.html">IntelliJ</a> and Eclipse, a decade before VS Code.</p>

<p>Every modern programming language now has its language server, or several. Language servers have become so common that shipping a language without at least a modest language server is like shipping a language without a compiler.</p>

<h3 id="scoping">Scoping</h3>

<p>There is rich potential for research into language servers at the intersection of <abbr title="Human-Computer Interaction">HCI</abbr>, compilers, and static analysis. Developers are actively investigating ways to add productivity in the code editor, especially for language-specific features. Recent developments with generative AI have brought tools like <a href="https://blogs.nvidia.com/blog/llm-semiconductors-chip-nemo/">NVIDIA’s ChipNeMo</a> or <a href="https://en.wikipedia.org/wiki/GitHub_Copilot">GitHub Copilot</a>, but by no means is machine learning the only frontier left.</p>

<p>With computing where it is in 2023, there is a very, <em>very</em> high ceiling for how sophisticated smart editing can get. All that research is at the bleeding edge of the field. For now, most of that is beyond my reach.</p>

<p>My project is not so ambitious. I’m just looking to make a usable language server for Bluespec, a language that doesn’t have one yet. I’m doing it to improve the classroom experience of students who are using Bluespec as a learning tool.</p>

<p>When students are using sophisticated code editing features in all their other classes, e.g., with Python, TypeScript, or C, it can be disheartening if they aren’t provided the same quality of tools for Bluespec. If hardware development is as important as we make it out to be, it should have dignified tools. That’s the same reason why I wrote the <a href="/projects/vscode-bsv/">Bluespec syntax highlighter for VS Code</a>.</p>

<p>I’m striving to provide simple, quality-of-life features that are common across languages, like <a href="https://rust-analyzer.github.io/manual.html#go-to-definition">go-to-definition</a>, <a href="https://rust-analyzer.github.io/manual.html#hover">hover</a>, signature help, and similar features. These would be built on top of the core of the language server, which I would have to architect and build, and for which I would have to (re)write a <a href="https://rust-analyzer.github.io/blog/2020/10/24/introducing-ungrammar.html">grammar</a> of Bluespec.</p>

<p>If time remains, there is ample room for useful features specific to Bluespec. A language server for Bluespec, like for any programming language, can take advantage of unique language-specific semantics. One such feature might be to surface scheduling information from the Bluespec scheduler in the editor, such as which rules conflict. I’m sure there are plenty of other opportunities.</p>

<p>I was first considering writing the language server in TypeScript, but I’m increasingly leaning toward writing it in Rust thanks to the ecosystem of language server implementation tools built by @matklad and the <a href="https://github.com/rust-lang/rust-analyzer"><code class="language-plaintext highlighter-rouge">rust-analyzer</code></a> team, as well as their exemplary documentation.</p>

<p>It’s common for software languages to have their language servers written in their own language. It should be obvious why it’s not an option for Bluespec.</p>

<h3 id="resources">Resources</h3>

<p>The closest thing I’ve found to a canonical source of knowledge on language servers is the writing of <a href="https://matklad.github.io/">@matklad</a>, or Alex Kladov. He leads the <a href="https://github.com/rust-lang/rust-analyzer"><code class="language-plaintext highlighter-rouge">rust-analyzer</code></a> project, the cutting-edge language server for Rust. And he’s been very generous with sharing his experience online. I’m learning most everything I can from his writing (both from <a href="https://matklad.github.io/">his site</a> and the <a href="https://rust-analyzer.github.io/blog"><code class="language-plaintext highlighter-rouge">rust-analyzer</code> blog</a>) and <a href="https://www.youtube.com/playlist?list=PLhb66M_x9UmrqXhQuIpWC5VgTdrGxMx3y">videos</a>.</p>

<p>I’ve also been supplementing with documentation from other well-made language server projects like the <a href="https://github.com/microsoft/TypeScript-Compiler-Notes/tree/main/intro#the-typescript-compiler">TypeScript Compiler</a><sup id="fnref:tscompiler" role="doc-noteref"><a href="#fn:tscompiler" class="footnote" rel="footnote">1</a></sup> and others from <a href="https://microsoft.github.io/language-server-protocol/implementors/servers/">the growing set of language servers</a>.</p>

<p>And of course, there are the official Microsoft documents <a href="https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/">specifying the Language Server Protocol</a> and a Visual Studio Code <a href="https://code.visualstudio.com/api/language-extensions/language-server-extension-guide#implementing-a-language-server">quick-start guide</a> for language server development. The issue with these respectively is that the LSP is only a small part of the challenge of writing a language server, and the quick-start guide doesn’t cover any of the details required for a useful language server.</p>

<p>Our best bet is relying on examples of good language servers to know what to build and how to build it.</p>

<h3 id="code-editors-ides-smart-editing">Code editors, IDEs, Smart Editing</h3>

<p>I plan on ignoring the whole “what is a code editor versus what’s an IDE (☝️🤓)” discourse. The main idea is that language servers are the programs that enable smart editing for code.</p>

<p>Language servers enable features like hover, go-to-definition, code folding, type hints, autocomplete, and all the good things one expects from a modern development experience. We give the engineer all the useful information that we can from statically analyzing the code. Features like these used to be exclusive to fancy programs called <a href="https://en.wikipedia.org/wiki/Integrated_development_environment">IDEs</a>, but now they’re supported by all sorts of code editing programs.</p>

<p>The Language Server Protocol is one way (currently, the standard way) for language servers to communicate with different editors like Visual Studio Code. It’s up to the code editor how to expose smart editing features to the developer, and it’s up to the language server to provide the information to the editor.</p>
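<p>To make that conversation concrete, here is a minimal Python sketch of what an editor might send when the cursor hovers over a symbol. The <code class="language-plaintext highlighter-rouge">textDocument/hover</code> method and the <code class="language-plaintext highlighter-rouge">Content-Length</code> framing come from the LSP specification. The helper name, file URI, and cursor position are made up for illustration.</p>

```python
import json


def frame_lsp_message(payload: dict) -> bytes:
    """Encode a JSON-RPC payload with the Content-Length header LSP expects."""
    body = json.dumps(payload).encode("utf-8")
    header = f"Content-Length: {len(body)}\r\n\r\n".encode("ascii")
    return header + body


# Hypothetical request: the file URI and cursor position are made up.
hover_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "textDocument/hover",
    "params": {
        "textDocument": {"uri": "file:///example/Fifo.bsv"},
        "position": {"line": 10, "character": 4},  # both are zero-based in LSP
    },
}

message = frame_lsp_message(hover_request)
print(message.decode("utf-8"))
```

<p>The language server replies with a response carrying the same <code class="language-plaintext highlighter-rouge">id</code>, and the editor decides how to display whatever comes back.</p>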

<p>A good approach, like rust-analyzer’s, may be to isolate the Language Server Protocol part of the project from the language server itself, so that we can handle other protocols should they come along.</p>
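<p>Here is a rough Python sketch of that separation, not rust-analyzer’s actual design. The class names <code class="language-plaintext highlighter-rouge">AnalysisCore</code> and <code class="language-plaintext highlighter-rouge">LspAdapter</code> are hypothetical, and the “analysis” is a placeholder that only reports the word under the cursor. The point is that only the adapter knows what an LSP message looks like.</p>

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class HoverInfo:
    text: str


class AnalysisCore:
    """Protocol-agnostic core: knows about source text, not about LSP."""

    def __init__(self) -> None:
        self._docs: Dict[str, List[str]] = {}

    def set_text(self, path: str, text: str) -> None:
        self._docs[path] = text.splitlines()

    def hover(self, path: str, line: int, column: int) -> Optional[HoverInfo]:
        # Placeholder analysis: report the word under the cursor, if any.
        lines = self._docs.get(path, [])
        if not 0 <= line < len(lines):
            return None
        for match in re.finditer(r"\w+", lines[line]):
            if match.start() <= column < match.end():
                return HoverInfo(f"word under cursor: {match.group()}")
        return None


class LspAdapter:
    """Thin layer that translates LSP requests into calls on the core."""

    def __init__(self, core: AnalysisCore) -> None:
        self.core = core

    def handle(self, request: dict) -> dict:
        reply = {"jsonrpc": "2.0", "id": request["id"]}
        if request["method"] == "textDocument/hover":
            params = request["params"]
            pos = params["position"]
            # Simplification: treats the LSP character offset as a string index.
            info = self.core.hover(
                params["textDocument"]["uri"], pos["line"], pos["character"]
            )
            reply["result"] = None if info is None else {"contents": info.text}
        else:
            # -32601 is the JSON-RPC "method not found" error code.
            reply["error"] = {"code": -32601, "message": "Method not found"}
        return reply


core = AnalysisCore()
core.set_text("file:///example/Fifo.bsv", "module mkFifo(Fifo);\nendmodule")
adapter = LspAdapter(core)
response = adapter.handle({
    "jsonrpc": "2.0",
    "id": 1,
    "method": "textDocument/hover",
    "params": {
        "textDocument": {"uri": "file:///example/Fifo.bsv"},
        "position": {"line": 0, "character": 8},
    },
})
print(response)
```

<p>If another protocol came along, it would get its own adapter, and <code class="language-plaintext highlighter-rouge">AnalysisCore</code> would stay untouched.</p>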

<h3 id="compiler-non-reuse">Compiler (Non-)Reuse</h3>

<p>Can we use the existing <a href="https://github.com/B-Lang-org/bsc">Bluespec Compiler</a>? It sure seems like it’d save a lot of work! My current reading suggests probably not. We may be best served starting from scratch.</p>

<p>In particular, there are fault tolerance and latency demands on language servers that aren’t imposed on compilers. A language server is expected to be useful even (maybe especially) on code that doesn’t compile. It’s also expected to be responsive, sometimes at keystroke frequency. That responsiveness involves thinking about both latency and energy consumption.</p>
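<p>As a toy illustration of the fault tolerance part, here is a Python sketch of a parser that records errors and keeps going instead of stopping at the first one. The <code class="language-plaintext highlighter-rouge">name = integer;</code> language is invented for this example and has nothing to do with Bluespec’s real grammar. A real language server would use a far more careful recovery strategy.</p>

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Binding:
    name: str
    value: int


@dataclass
class ParseError:
    statement: str
    message: str


def parse_tolerant(source: str) -> Tuple[List[Binding], List[ParseError]]:
    """Parse `name = integer;` statements, recovering at each semicolon."""
    bindings: List[Binding] = []
    errors: List[ParseError] = []
    for raw in source.split(";"):
        statement = raw.strip()
        if not statement:
            continue
        name, sep, value = (part.strip() for part in statement.partition("="))
        if not sep:
            errors.append(ParseError(statement, "expected '='"))
        elif not name.isidentifier():
            errors.append(ParseError(statement, "expected a name before '='"))
        elif not value.isdecimal():
            errors.append(ParseError(statement, "expected an integer after '='"))
        else:
            bindings.append(Binding(name, int(value)))
    return bindings, errors


bindings, errors = parse_tolerant("width = 32; depth = ; entries 8; ways = 4")
for binding in bindings:
    print("ok:", binding)
for error in errors:
    print("error:", error)
```

<p>Two of the four statements are malformed, yet the other two still come back as bindings that features like hover could use. A batch compiler is free to give up at the first error; a language server is not.</p>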

<p>From what I’ve read by @matklad, compilers and language servers are similar only enough to get you into trouble. While compilers and language servers <em>appear</em> similar, their tasks are very different, and that has extreme implications for the way they need to be implemented.</p>

<p>He writes on his experience writing rust-analyzer in “<a href="https://matklad.github.io/2020/11/11/yde.html">Why an IDE?</a>”,</p>

<blockquote>
  <p>LSP did achieve a significant breakthrough — it made people care about implementing IDE backends. Experience shows that re-engineering an existing compiler to power an IDE is often impossible, or isomorphic to a rewrite. How a compiler talks to an editor is the smaller problem. The hard one is building a compiler that can do IDE stuff in the first place. Check out <a href="https://rust-analyzer.github.io/blog/2020/07/20/three-architectures-for-responsive-ide.html">this post</a> for some of the technical details. Starting with this use-case in mind saves a lot of effort down the road.</p>
</blockquote>

<p>He writes further in “<a href="https://matklad.github.io/2022/04/25/why-lsp.html">Why LSP?</a>”,</p>

<blockquote>
  <p>Before LSP, there simply weren’t a lot of working language-server shaped things. The main reason for that is that building a language server is hard.</p>

  <p>The essential complexity for a server is pretty high. It is known that compilers are complicated, and a language server is a compiler <strong>and then some</strong>.</p>
</blockquote>

<p>and</p>

<blockquote>
  <p>And, when compiler authors start thinking about IDE support, the first thought is “well, IDE is kinda a compiler, and we have a compiler, so problem solved, right?”. This is quite wrong — internally an IDE is very different from a compiler but, until very recently, this wasn’t common knowledge.</p>

  <p>Language servers are a counter example to the <a href="https://www.joelonsoftware.com/2000/04/06/things-you-should-never-do-part-i/">“never rewrite”</a> rule. Majority of well regarded language servers are rewrites or alternative implementations of batch compilers.</p>
</blockquote>

<p>and, most vindicating,</p>

<blockquote>
  <p>[LSP] moved us from a world where not having a language IDE was normal and no one was even thinking about language servers, to a world where a language without working completion and goto definition looks unprofessional.</p>
</blockquote>

<p>It was a joy to find @matklad’s writing because it was consistent with everything I observed when using Bluespec, back when <em>I</em> was a student accustomed to fancy code editing from software languages and put off by Bluespec’s lack of similar support. And @matklad’s writing is so informative in lighting a path forward.</p>

<p>It’s not even that Bluespec’s editor support was always bad. Back when Bluespec was shiny and new, editor support was bad for <em>everything</em>. The world just moved really fast while Bluespec stayed the same. It’s time to catch up.</p>

<hr />
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:tscompiler" role="doc-endnote">
      <p>I know I take pains to distinguish between compilers and language servers. TypeScript’s compiler is new enough and the relationship between TypeScript and JavaScript is such that they made the language server a first class concept in designing the language and compiler. <a href="#fnref:tscompiler" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Martin Chan</name><email>martinch@mit.edu</email></author><category term="post" /><summary type="html"><![CDATA[I started my literature review for my tentative MEng project. I discuss language servers, their relation to compilers, and amazing resources from @matklad and the rust-analyzer project.]]></summary></entry><entry><title type="html">MEng Thesis Options</title><link href="https://www.martinchan.org/blog/thesis-options/" rel="alternate" type="text/html" title="MEng Thesis Options" /><published>2023-11-19T00:00:00-05:00</published><updated>2023-11-19T00:00:00-05:00</updated><id>https://www.martinchan.org/blog/thesis-options</id><content type="html" xml:base="https://www.martinchan.org/blog/thesis-options/"><![CDATA[<figure>
<p><img src="/assets/media/meng-options/info.png" alt="Screenshot of a list of faculty and TA for 6.192. It's only Arvind and me, respectively." /></p>
  <figcaption>Is this list going to get any longer?</figcaption>
</figure>

<p>I recently accepted a position to TA for <em>Constructive Computer Architecture</em> (6.192) for Spring 2024.<sup id="fnref:last" role="doc-noteref"><a href="#fn:last" class="footnote" rel="footnote">1</a></sup> It will be my first semester as an MEng, so I’ll be expected to submit a <a href="https://www.eecs.mit.edu/academics/undergraduate-programs/meng-program/thesis-proposal/">thesis proposal</a> by May as part of the <a href="https://www.eecs.mit.edu/academics/undergraduate-programs/meng-program/meng-thesis/">MEng thesis component</a> of the degree.</p>

<p>It’s typical for people who have their MEng funded partially or fully by RAships to align their thesis with the research they conduct for funding. Those types of projects often fit into the typical framework of incremental research, like pieces of projects that more senior researchers are directing. I think it’s generally the case that the MEng thesis follows the RA research, rather than vice-versa.</p>

<p>Since I’m not reliant on RA funding<sup id="fnref:change" role="doc-noteref"><a href="#fn:change" class="footnote" rel="footnote">2</a></sup>, I have more flexibility on my choice of MEng project. That doesn’t mean I should just do anything, but I feel like I should take advantage of that flexibility to do a project that might not conventionally be funded given researchers’ budgets. Some people in my position take the chance to execute on a passion project.</p>

<p>And, because I’ll be taking quite a few classes throughout my MEng (at least 5 out of the 6 maximum<sup id="fnref:advance" role="doc-noteref"><a href="#fn:advance" class="footnote" rel="footnote">3</a></sup>), I think it’s especially important that I find a thesis topic that I’d like to spend a significant amount of time on. The high amount of coursework I’ll have means I will have plenty of opportunity for career-relevant learning and I’ll need to ration my energy. It’ll be easier to muster up my remaining energy for an MEng project that I’m excited about.</p>

<p>The degree includes 12 units of thesis work per semester as an MEng. At MIT, one unit corresponds to about 14 hours of work over a semester, so spending three semesters would mean about <code class="language-plaintext highlighter-rouge">14 hours x 12 units x 3 semesters ≈ 500 hours</code> of work. You could build quite an impressive project with 500 hours of work, between design, building, documenting, and testing. I’d like to spend that time on something I can be proud of.</p>

<p>One type of MEng project I find very attractive is the kind that is practical and has immediate pedagogical benefits. One prime example is <a href="https://hz.mit.edu/">Adam Hartz</a>’s MEng project, which was <a href="https://catsoop.org/">CAT-SOOP</a>, a learning management system that, a decade later, is now used by almost 20 MIT EECS classes<sup id="fnref:hartz" role="doc-noteref"><a href="#fn:hartz" class="footnote" rel="footnote">4</a></sup>. Another example is <a href="https://www.katykem.com/">Katy Kem</a>’s project, <a href="https://dspace.mit.edu/handle/1721.1/119529"><em>Laboratory Assignments for Teaching Introductory Signal Processing Concepts</em></a>, which is exactly what the title says and is useful for teaching. The <a href="http://up.csail.mit.edu/">Usable Programming Group</a> has a series of projects on programming education, several of which continue to be used in offerings of 6.102.</p>

<p>These are amazing projects that amplify the efforts of students. And most of them were built by and for MIT students, who (as we are all told) are the future. I think I’d like to do that sort of thing for my MEng project.</p>

<h2 id="preliminary-idea">Preliminary Idea</h2>

<p>A couple months ago, I wrote a <a href="/projects/vscode-bsv/">Bluespec extension for VS Code</a>. It’s being used by many current students in 6.191 as part of their code editing setup since almost every lab assignment in the class uses Minispec, which I cover with my extension. I like to think it’s being used by a few researchers as well. Next semester, I have no doubt that students will be coming into 6.192 with my extension already installed.</p>

<p>My current idea for an MEng project is to add fuller support for Bluespec in code editors by building a language server for Bluespec that speaks the <a href="https://microsoft.github.io/language-server-protocol/">Language Server Protocol</a> (LSP). The idea is that adding this support will drastically increase the productivity of Bluespec students, researchers, and, if or when they exist, those in industry.</p>

<p>My plan is to add all the typical <a href="https://code.visualstudio.com/api/language-extensions/programmatic-language-features">bells and whistles</a> from a modern language in a code editor, but eventually add more Bluespec-specific and hardware-specific functionality like annotations for scheduling, ports, or state-change. Already, <a href="https://microsoft.github.io/language-server-protocol/implementors/servers/">many programming languages</a> support the LSP, and it’s a <a href="https://news.ycombinator.com/item?id=21638232">growing expectation for modern programming languages</a>. Just because the hardware industry is a little behind the times doesn’t mean we have to be.</p>

<p>I can slot these into my <a href="https://code.visualstudio.com/api/language-extensions/language-server-extension-guide">VS Code</a> extension, but the LSP can be used by other editors like <a href="vim/neovim">vim/neovim</a>, <a href="https://www.gnu.org/software/emacs/manual/html_mono/eglot.html">Emacs</a>, <a href="https://github.com/atom/atom-languageclient">Atom</a><sup id="fnref:whoever" role="doc-noteref"><a href="#fn:whoever" class="footnote" rel="footnote">5</a></sup>, and <a href="https://microsoft.github.io/language-server-protocol/implementors/tools/">many others</a>. I still anticipate most programmers in 2023 to be using VS Code. I don’t see another editor dethroning it for the near future, but the appearance of editor-agnosticism is a pleasant property of the LSP. If ever the mandate shifts, the new editor can still use the LSP.</p>

<p>One worry about the project is that it’s kind of a bet on Bluespec. Will it take off or not? The hardware industry seems kind of ossified around Verilog, SystemVerilog, and in some places VHDL, despite the existence of shiny new HDLs like SpinalHDL and Chisel.</p>

<p>While I don’t know if Bluespec is the future, I <em>do</em> know that Bluespec is the present (at least at MIT). 6.191 was redesigned to use Minispec only a few years ago. 6.192, I hope, will stick around for as long as there is someone interested in teaching synthesizable computer architecture.</p>

<p>And I’m certain there is zero way Bluespec is taking off without the sort of mature programming language support that the LSP provides. And I’m not sure anyone but me is going to implement that LSP. I’m cautiously optimistic, but I think MIT can be a fount for Bluespec, especially if we continue to teach it consistently. So maybe it <em>is</em> a bet on Bluespec.</p>

<p>It’s more than likely that I will graduate and never use Bluespec again, but I don’t think it’s any more tragic than most MEng projects. At least my project would continue to be used as long as Bluespec and Minispec are used for classes and research. It feels more impactful (certainly more based) than most MEng theses.</p>

<p>Probably the biggest cost for the project is that the skills I’ll be gaining will be more pertinent to software development or programming languages than to computer architecture or hardware design. I think there’s value in being a generally well-rounded engineer, and I hope that the things I learn from TAing for 6.192 and taking all my other classes will cover any weak spots.</p>

<h2 id="the-alternative">The Alternative</h2>

<p>While I don’t currently rely on RA funding from a computer architecture lab, I could still select a project that has more direct relevance to the sort of work that I’d like to do. Something I was looking at when I was doing my <a href="/projects/processor/">processor project</a> was the <a href="https://github.com/csail-csg/riscy-OOO">Riscy-OOO project</a> led by <a href="https://people.csail.mit.edu/szzhang/">Sizhuo Zhang</a>. It’s a sophisticated multicore out-of-order processor written in Bluespec, and a worthy base for other projects.</p>

<p>Most of the people who worked on the Riscy-OOO project are no longer at MIT, but there have been some recent MEng projects that used it for research. My 6.192 TA last year wrote <a href="https://dspace.mit.edu/handle/1721.1/151629">his MEng thesis</a> on adding secure shared memory to the processor as a continuation of the <a href="https://dl.acm.org/doi/10.1145/3352460.3358310">MI6 project</a>. Doing something similar would probably require me to self-lead to a similar degree as with an LSP project, but with the reward of gaining more practice with hardware design and computer architecture (albeit with the cost of using shoddy existing Bluespec infrastructure).</p>

<p>The rub is that if I focus my attention on a more explicitly hardware-related project, then the infrastructure would probably never be built. But I know that a project like implementing the LSP would make Bluespec projects, <em>especially</em> for large code-bases like Riscy-OOO or even the 6.192 class projects, much more tractable for the next person. The lack of editing infrastructure is a significant barrier to productivity for all users of Bluespec, whether student or researcher.</p>

<p>The infrastructure is something best done sooner rather than later. And if not by me, then who? Not to make myself out like a saint, but I feel like I’m in the exact right spot to implement the LSP, as someone who already wrote a basic VS Code extension for Bluespec and who is about to gain access to a bunch of potential test users in the form of 6.192 (and maybe 6.191) students, already accustomed to modern features from their other classes. It seems like exactly the thing a long-term-thinking TA at MIT should do.</p>

<p>Frankly, my impression is that most MEng projects <em>don’t</em> really have much direct relevance to the students’ future careers. I’d be proud enough leaving a project of value to the MIT and Bluespec communities. I’d just need to use my other experiences to get a job in hardware.</p>

<hr />
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:last" role="doc-endnote">
      <p>Last semester it was two instructors and two TAs. Not sure how we’ll make it with one instructor and one TA. Maybe someone else will come aboard. <a href="#fnref:last" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:change" role="doc-endnote">
      <p>It’s still kind of up in the air how I’ll find funding for Fall 2024. 6.192 is typically only offered in the spring. I’ll probably either attempt to TA for 6.191 or seek RA funding from Arvind or one of his colleagues. I’m hoping that my project will be seen as worth it, or I’ll help more directly with theirs. <a href="#fnref:change" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:advance" role="doc-endnote">
      <p>Some students took their graduate level classes or math electives during undergrad. I didn’t originally plan on MEnging when I was planning my undergrad classes, so I need to take all 4 of my <a href="https://www.eecs.mit.edu/academics/undergraduate-programs/meng-program/requirements/">AAGSes</a> and 1 more math elective. <a href="#fnref:advance" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:hartz" role="doc-endnote">
      <p>Of course it helps that Adam was a lecturer for one of the biggest classes in the school for almost as many years. <a href="#fnref:hartz" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:whoever" role="doc-endnote">
      <p><a href="https://github.blog/2022-06-08-sunsetting-atom/">Whoever is still using Atom in 2023</a>. <a href="#fnref:whoever" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Martin Chan</name><email>martinch@mit.edu</email></author><category term="post" /><summary type="html"><![CDATA[It looks like I'm going to MEng. A life update blog post discussing the preliminary project idea I have for my MEng thesis.]]></summary></entry><entry><title type="html">Write-After-Write Bug</title><link href="https://www.martinchan.org/blog/waw/" rel="alternate" type="text/html" title="Write-After-Write Bug" /><published>2023-11-12T00:00:00-05:00</published><updated>2023-11-12T00:00:00-05:00</updated><id>https://www.martinchan.org/blog/waw</id><content type="html" xml:base="https://www.martinchan.org/blog/waw/"><![CDATA[<figure>
<p><img src="/assets/media/waw/wrong.png" alt="The teaser for my processor project with a red X crossing it out" /></p>
  <figcaption>Could the <strong>whole <a href="/projects/processor/">processor project</a></strong> be wrong? Well not the whole thing. But I did find a bug that results in incorrect execution for a small set of cases.</figcaption>
</figure>

<h2 id="introduction">Introduction</h2>
<p>The minimal out-of-order processing in my <a href="/projects/processor/">processor project</a> has a bug that I didn’t catch until now. Not only is the out-of-order processing minimal, but it isn’t even fully correct because it ignores write-after-write (WAW) hazards.</p>

<p><em>Most</em> of the time, it isn’t an issue, and in fact I didn’t catch it <a href="#deficiency-in-existing-tests">with existing tests</a>, but we <em>can</em> write a <a href="#test-case">sequence of instructions</a> that results in our processor giving the wrong answer. It’s the specific case where instructions are sliding past each other but we write to the same destination register in the wrong order.</p>

<p>In this post, I write a test case that demonstrates the write-after-write (WAW) hazard. To make the out-of-order part correct, I would need to either add in proper <a href="https://en.wikipedia.org/wiki/Re-order_buffer">reordering</a> or induce more stalls for when out-of-order commits may produce incorrect results. I’ve plugged this post into my project write-up as a <a href="/projects/processor/#out-of-order-processing">correction</a>.</p>

<h2 id="background">Background</h2>

<p>I found the bug while reflecting on a recent job interview where I was discussing my processor project. It was the first time I had the opportunity to talk about my project’s out-of-order-ness in any detail with anyone else, let alone an engineer in the field. It’s not the best of circumstances to find a bug, but it’s probably better than finding a bug <a href="https://en.wikipedia.org/wiki/Semiconductor_device_fabrication">post-fabrication</a>.</p>

<p>In my processor, the register-read stage induces stalls so that instructions wait for their source registers to be ready (solving read-after-write hazards), but <em>not</em> for their destination registers to be ready. This was perfectly fine when every instruction would complete in order, but we end up with <a href="https://en.wikipedia.org/wiki/Hazard_(computer_architecture)#Write_after_write_(WAW)">write-after-write hazards</a> when instructions commit out of order <em>and</em> those instructions write to the same destination register.</p>
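<p>To make the gap concrete, here is a minimal sketch in C of a register-read check that only looks at source registers. It is not my actual Bluespec code: the <code class="language-plaintext highlighter-rouge">Scoreboard</code> and <code class="language-plaintext highlighter-rouge">Instr</code> types and the function name are hypothetical, made up for illustration.</p>

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_REGS 32

/* Hypothetical scoreboard: pending[r] is true while an in-flight
   instruction still has to write register r. */
typedef struct {
    bool pending[NUM_REGS];
} Scoreboard;

/* Hypothetical decoded instruction: two sources and one destination. */
typedef struct {
    int rs1;
    int rs2;
    int rd;
} Instr;

/* Stall only on the sources (read-after-write). The destination `rd`
   is never consulted, which is where a write-after-write hazard can
   slip through. */
bool can_issue_raw_only(const Scoreboard *sb, const Instr *in) {
    assert(in->rs1 >= 0 && in->rs1 < NUM_REGS);
    assert(in->rs2 >= 0 && in->rs2 < NUM_REGS);
    return !sb->pending[in->rs1] && !sb->pending[in->rs2];
}
```

<p>With a check like this, an instruction whose destination is still waiting on an earlier write is let through, as long as its sources are ready.</p>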

<p>It’s a little embarrassing that I plastered these pipeline visualizations all over my write-up that might look obviously wrong to an expert at a glance, but in context it’s not <em>too</em> too embarrassing. I only ever learned about out-of-order processing on my own during the summer, and it’s hard to get feedback when I don’t have instructors or office hours to draw on like I did for things I learned in class. If I was doing this for school, it would’ve been caught in office hours or after turning in an assignment, or I would’ve been reminded during lecture. It just so happened that I was reminded recently during a job interview.</p>

<p>I <em>do</em> recall a case during office hours last semester when I showed my professor Thomas a visualization where the instructions appeared to be committed out of order. He expressed concern then, like it wasn’t supposed to happen. In that case, I knew it was just because of a bug in my <a href="https://github.com/shioyadan/Konata/tree/master">Konata</a> instrumentation, not really the functionality of the processor itself.</p>

<p>I discounted the out-of-order commits when working on the project during the summer because I was expecting my processor to be doing <em>some</em> things out of order. I just forgot this was a case that might give incorrect results. I probably could’ve caught it myself earlier if I had been more careful with the design or done a closer reading of Hennessy and Patterson during the summer while I was self-studying.</p>

<p>This and similar bugs might be more obvious to me if I had more experience with sophisticated out-of-order processors like the <a href="https://github.com/csail-csg/riscy-OOO">MIT RiscyOO project</a>, or if I had studied them more closely before attempting to implement similar features myself. On the other hand, I might not have internalized the lesson as well if I didn’t find the issue in a firsthand project like this one. It’s hard to say!</p>

<h2 id="deficiency-in-existing-tests">Deficiency in Existing Tests</h2>
<p>I didn’t set up a unit test for write-after-write, but I am a little surprised my integration tests didn’t catch the issue. I can think of some reasons why.</p>

<p>My main guess is that the degree of out-of-order processing in my processor was so limited that the bug just didn’t show up in my tests. That, and my integration tests might just be too small for the issue to show up anyway.</p>

<p>The condition for the bug to appear is that two instructions write to the same register and are allowed to slide past each other during the register read stage. The only case where this can happen is when the two instructions go through separate functional units, like the ALU functional unit and the memory functional unit.</p>

<p>Because my processor’s <a href="https://en.wikipedia.org/wiki/Instruction_window">instruction window</a> was something like 2-3 instructions (very modest compared to the dozens or hundreds of instructions in high performance processors), <em>and</em> because all instructions spend only 1 cycle (in the case of ALU) or 2 cycles (in the case of memory) in execution, it was rare that I would have two instructions at the head of their respective queues that want to write to the same register.</p>

<p>Conversely, I think the bug might have shown up if the instruction window were wider or if the execution stage took longer. Implementing either my improved instruction issue buffer or lengthier functional units like a <a href="/blog/basic-multiplier/">multiplier</a> or floating point unit might have caused the issue to appear.</p>

<p>Furthermore, most of my test cases were simple C programs optimized using <code class="language-plaintext highlighter-rouge">GCC</code>’s <code class="language-plaintext highlighter-rouge">O2</code> optimization flag. We saw previously that the compiler would rearrange our instructions in such a way that we could maximize use of instruction-level-parallelism, such as when we needed to save registers to the stack.</p>

<p>One of these optimizations may be that the compiler tries not to give us instances where two nearby instructions try to write to the same register. Sometimes that isn’t an option due to restrictions in the <a href="https://riscv.org/wp-content/uploads/2015/01/riscv-calling.pdf">RISC-V calling convention</a>, but my instruction issue window is so small and my test programs are so simple that it just wasn’t an issue.</p>

<figure>
<p><img src="/assets/media/processor/saving_registers.png" alt="Konata visualization of stack saving with very moderate out-of-order processing." /></p>
  <figcaption>An example of saving registers to the stack, taken from my <a href="/projects/processor/#out-of-order-processing">processor write-up</a></figcaption>
</figure>

<p>I had more tests available to me through <a href="https://riscof.readthedocs.io/en/stable/">RISCOF</a>, but I had stopped supporting it when I migrated the processor from a rigid execution pipeline to the flexible functional unit set-up. I wonder if those tests would’ve caught the bug. I would need to re-add support to find out.</p>

<h2 id="test-case">Test Case</h2>
<p>We can write a test so that we can see that the issue truly exists and, when we eventually fix it, to help verify that we fixed it.</p>

<p>Conceptually, we want to write to the same register in two consecutive instructions, one that goes through the ALU functional unit, and another that goes through the memory functional unit. We might later also want coverage to check that our two ALU functional units don’t also display the WAW hazard, e.g., if we have two consecutive <code class="language-plaintext highlighter-rouge">li</code> to the same register. But let’s start small.</p>

<p>In C, the test might look like this:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;stdlib.h&gt;</span>

<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
    <span class="kt">int</span> <span class="n">a</span><span class="p">;</span>
    <span class="n">a</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
    <span class="n">a</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>

    <span class="k">if</span> <span class="p">(</span><span class="n">a</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">exit</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span> <span class="c1">// test passed</span>
    <span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">a</span> <span class="o">==</span> <span class="mi">1</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">exit</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span> <span class="c1">// test failed because we used the old value of `a`</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="n">exit</span><span class="p">(</span><span class="mi">2</span><span class="p">);</span> <span class="c1">// test failed for another reason (maybe `a` is uninitialized)</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>For the sake of this test case, I’m going to write it directly in RISC-V. I don’t know if the case I described is something that a compiler would produce from C code, but I do know the situation in RISC-V assembly where it could happen.</p>

<p>All of my existing tests are compiled from C. This new test doesn’t exactly fit into my testing framework as-is, since it’s not doing any of the MMIO to indicate whether the test passed or failed. Instead, I’m judging visually from the pipeline visualization. It would take a bit more boilerplate before I could integrate it into my existing battery of correctness tests for automated testing.</p>

<p>Still, I want to do a quick mockup to illustrate the issue. The following test should fail with our current design. A pass should result in a forward branch, while a fail should result in no branch.</p>

<p>In practice, I don’t think a compiler would ever generate code like this. In this mockup, the <code class="language-plaintext highlighter-rouge">lw a0, 0(sp)</code> is dead code because the value is never used before the following <code class="language-plaintext highlighter-rouge">li a0, 0</code>. A compiler would probably remove it altogether (though maybe not with the <code class="language-plaintext highlighter-rouge">O0</code> flag). But this arrangement is useful to isolate the bug, since it’s a very short sequence of instructions.</p>

<div class="language-nasm highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">; I'm using ';' because the Rouge syntax highlighter only supports NASM assembler</span>
  
<span class="nl">test:</span>
  <span class="c1">; these two instructions put 1 into 0(sp) (it might violate calling convention)</span>
  <span class="nf">li</span> <span class="nv">a1</span><span class="p">,</span> <span class="mi">1</span>
  <span class="nf">sw</span> <span class="nv">a1</span><span class="p">,</span> <span class="mi">0</span><span class="p">(</span><span class="nb">sp</span><span class="p">)</span>  <span class="c1">; save 1 to 0(sp)</span>
  
  <span class="c1">; these instructions both use a0 as a destination. The second should overwrite the first.</span>
  <span class="nf">lw</span> <span class="nv">a0</span><span class="p">,</span> <span class="mi">0</span><span class="p">(</span><span class="nb">sp</span><span class="p">)</span>  <span class="c1">; load 1 to a0 (from memory)</span>
  <span class="nf">li</span> <span class="nv">a0</span><span class="p">,</span> <span class="mi">0</span>      <span class="c1">; load 0 to a0 (from immediate)</span>
  
  <span class="nf">beqz</span> <span class="nv">a0</span><span class="p">,</span> <span class="nv">pass</span> <span class="c1">; branch to pass if a0 is zero (otherwise continue to fail)</span>

<span class="nl">fail:</span>
  <span class="nf">li</span> <span class="nv">a0</span><span class="p">,</span> <span class="mi">1</span>  <span class="c1">; kind of like the exit(1); we fail</span>
  <span class="nf">unimp</span>  <span class="c1">; repeated several times for cosmetic reasons (processor exits)</span>

<span class="nl">pass:</span>
  <span class="nf">li</span> <span class="nv">a0</span><span class="p">,</span> <span class="mi">0</span>  <span class="c1">; kind of like the exit(0); we pass</span>
  <span class="nf">unimp</span>  <span class="c1">; repeated several times for cosmetic reasons (processor exits)</span>
</code></pre></div></div>

<p>The test should go to <code class="language-plaintext highlighter-rouge">pass</code>, executing <code class="language-plaintext highlighter-rouge">li a0, 0</code> and <code class="language-plaintext highlighter-rouge">unimp</code>, which triggers an exit on my processor. But I believe with my implementation that the <code class="language-plaintext highlighter-rouge">beqz</code> instruction is going to use the <code class="language-plaintext highlighter-rouge">lw</code> value instead, causing the test to fail.</p>

<p>And indeed it does fail, corresponding to Frame 2 of the following animation. We should be getting a result similar to Frame 3, where we exit with <code class="language-plaintext highlighter-rouge">0</code> in <code class="language-plaintext highlighter-rouge">a0</code> after a forward branch.</p>

<figure>
<p><img src="/assets/media/waw/test.apng" alt="A short animation using Konata showing three pipeline visualizations illustrating the WAW bug." /></p>
  <figcaption>Three frames showing the <code class="language-plaintext highlighter-rouge">li-lw</code>, <code class="language-plaintext highlighter-rouge">lw-li</code>, and <code class="language-plaintext highlighter-rouge">nop-li</code> cases respectively, which differ only in the instructions at <code class="language-plaintext highlighter-rouge">pc</code> <code class="language-plaintext highlighter-rouge">a4</code> and <code class="language-plaintext highlighter-rouge">a8</code>. We should certainly not get the same results in Frame 2 as in Frame 1.</figcaption>
</figure>

<p>This animation uses three different pairs of instructions to set the value of <code class="language-plaintext highlighter-rouge">a0</code>. Going frame by frame, this is what’s happening:</p>
<ul>
  <li>Frame 1 has <code class="language-plaintext highlighter-rouge">li-lw</code>. It’s true that the <code class="language-plaintext highlighter-rouge">li</code> commits earlier than it should, but it doesn’t mess with the correctness of the program’s output because the value from <code class="language-plaintext highlighter-rouge">li</code> is never used before being replaced by the value from <code class="language-plaintext highlighter-rouge">lw</code>. In this case, the bug is benign.</li>
  <li>Frame 2 has <code class="language-plaintext highlighter-rouge">lw-li</code>, corresponding exactly to our assembly test case above. Our processor gives the wrong result, since it <em>should</em> branch but doesn’t.</li>
  <li>Frame 3 has <code class="language-plaintext highlighter-rouge">nop-li</code>. We branch because we load <code class="language-plaintext highlighter-rouge">a0</code> with the <code class="language-plaintext highlighter-rouge">li</code> instruction only, without an <code class="language-plaintext highlighter-rouge">lw</code> instruction.</li>
</ul>

<p>When I say the result is correct or not, I’m only looking at the value we get at the end of the test. In reality, the state of our processor for all three frames can be incorrect if you stopped it at particular cycles because the register file and memory are updated as soon as the <code class="language-plaintext highlighter-rouge">W</code> stage concludes for a given instruction. In more sophisticated processors, they might require changes to the processor state to occur in the correct order through something like a <a href="https://en.wikipedia.org/wiki/Re-order_buffer">reorder buffer</a>. We don’t currently worry about interrupts or traps, so the out-of-order commits can be benign.</p>

<p>Looking at the visualizations, it’s no mystery why both orderings of <code class="language-plaintext highlighter-rouge">li-lw</code> and <code class="language-plaintext highlighter-rouge">lw-li</code> behave like <code class="language-plaintext highlighter-rouge">li-lw</code> with our processor. In <code class="language-plaintext highlighter-rouge">lw-li</code>, while <code class="language-plaintext highlighter-rouge">lw</code> is waiting on the previous <code class="language-plaintext highlighter-rouge">sw</code> to complete, our processor (mistakenly) allows the <code class="language-plaintext highlighter-rouge">li</code> to slide before the <code class="language-plaintext highlighter-rouge">lw</code>. The value that ends up in <code class="language-plaintext highlighter-rouge">a0</code> is therefore the one entered by the <code class="language-plaintext highlighter-rouge">lw</code> instruction, even if <code class="language-plaintext highlighter-rouge">lw</code> was supposed to happen before <code class="language-plaintext highlighter-rouge">li</code>.</p>
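<p>The same reasoning can be sketched as a toy model in C. This is only an illustration, not a model of my actual pipeline: the <code class="language-plaintext highlighter-rouge">Write</code> record and both function names are hypothetical, and the commit cycles in the usage note are made-up numbers rather than values read off the visualization. The idea is that my processor leaves whichever value was written <em>last in time</em>, while the correct result is the value from the instruction that is <em>last in program order</em>.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of one write to a0 reaching the W stage. */
typedef struct {
    int program_order;  /* position in the instruction stream */
    int commit_cycle;   /* cycle in which the write lands */
    int value;          /* value written to a0 */
} Write;

/* What the register holds if writes land in whatever order they commit. */
int final_value_by_commit(const Write *w, size_t n) {
    assert(n > 0);
    size_t last = 0;
    for (size_t i = 1; i < n; i++) {
        if (w[i].commit_cycle > w[last].commit_cycle) {
            last = i;
        }
    }
    return w[last].value;
}

/* What the register should hold: the value from the youngest instruction. */
int final_value_by_program_order(const Write *w, size_t n) {
    assert(n > 0);
    size_t last = 0;
    for (size_t i = 1; i < n; i++) {
        if (w[i].program_order > w[last].program_order) {
            last = i;
        }
    }
    return w[last].value;
}
```

<p>For <code class="language-plaintext highlighter-rouge">lw-li</code>, suppose the <code class="language-plaintext highlighter-rouge">lw</code> (value 1) commits in cycle 7 and the <code class="language-plaintext highlighter-rouge">li</code> (value 0) slides ahead and commits in cycle 5. The first function returns 1 and the second returns 0, which is exactly the mismatch in Frame 2. For <code class="language-plaintext highlighter-rouge">li-lw</code>, both functions agree, which is why the bug is benign in Frame 1.</p>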

<h2 id="another-hint">Another Hint</h2>

<p>Another hint that my processor was bugged is that the Konata pipeline visualizer doesn’t really support instructions committing in the wrong order, as if it expects in-order commits. It looks fine when I make my visualizations without hiding flushed operations, but the bug appears when I do hide flushed operations.</p>

<figure>
<p><img src="/assets/media/waw/flushed.apng" alt="Two frame animation in Konata" /></p>
  <figcaption>Two frames illustrating the same <code class="language-plaintext highlighter-rouge">nop-li</code> case with flushed operations shown and hidden.</figcaption>
</figure>

<p>Hiding flushed operations makes Konata sort the instructions by when they commit. Because my processor has them (incorrectly) commit out of order, Konata “reorders” my instructions for me in the visualization.</p>

<p>Not only that, but Konata also removes the final <code class="language-plaintext highlighter-rouge">li</code> and <code class="language-plaintext highlighter-rouge">unimp</code> instructions. I think that must be because the processor terminates simulation in <code class="language-plaintext highlighter-rouge">Writeback</code> without the final Konata commits, which maybe were supposed to happen in a later cycle.</p>

<p>This also suggests that the way Konata measures whether an instruction is flushed might be by seeing whether it was <em>committed</em> to Konata, rather than seeing if it had been <em>flushed</em> per se. These are subtly different according to the <a href="https://github.com/shioyadan/Konata/blob/master/docs/kanata-log-format.md">Konata log format</a>. The distinction only really matters in corner cases like this, since most of the time all instructions are either flushed or committed. These last two instructions just happened to have neither.</p>

<h2 id="how-do-we-fix-it">How do we fix it?</h2>
<p>We have two main options for the fix.</p>
<ul>
  <li>We can add in register renaming with a reorder buffer (as prescribed in Hennessy and Patterson) so we can execute instructions that use the same destination register without dependencies between the inputs. This is the harder fix but should make our processor more capable of bona fide out-of-order processing.
    <ul>
      <li>For the reorder buffer, we could either use Bluespec’s <code class="language-plaintext highlighter-rouge">CompletionBuffer</code> package or write one ourselves.</li>
      <li>I suspect the register renaming part would be tougher to implement, but maybe not so if I can find a decent strategy elsewhere. For all I know, there might already be an idiomatic implementation in Bluespec.</li>
    </ul>
  </li>
  <li>We can introduce stalls so that an instruction is held back from issuing if its destination register is also the destination of an earlier instruction that hasn’t written back yet. It might slow down our processor but it should be a simpler fix.
    <ul>
      <li>We would still be able to allow some out-of-order commits as long as they don’t affect the same destination registers.</li>
      <li>A more conservative fix would be to fully disable out-of-order issuing altogether, if you want to call that a fix.</li>
    </ul>
  </li>
</ul>
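<p>The stall-based option can be sketched in C as follows. Again, this is a rough sketch under my own assumptions and not the Bluespec I would actually write: the <code class="language-plaintext highlighter-rouge">Scoreboard</code> and <code class="language-plaintext highlighter-rouge">Instr</code> types and all the function names are hypothetical. The only change from a source-only check is that the destination register is checked too.</p>

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_REGS 32

/* Hypothetical scoreboard: pending[r] is true while an in-flight
   instruction still has to write register r. */
typedef struct {
    bool pending[NUM_REGS];
} Scoreboard;

/* Hypothetical decoded instruction: two sources and one destination. */
typedef struct {
    int rs1;
    int rs2;
    int rd;
} Instr;

static bool busy(const Scoreboard *sb, int reg) {
    assert(reg >= 0 && reg < NUM_REGS);
    /* x0 is hardwired to zero in RISC-V, so it never causes a hazard. */
    return reg != 0 && sb->pending[reg];
}

/* Stall on pending sources (RAW) and on a pending destination (WAW). */
bool can_issue(const Scoreboard *sb, const Instr *in) {
    bool raw = busy(sb, in->rs1) || busy(sb, in->rs2);
    bool waw = busy(sb, in->rd);
    return !raw && !waw;
}

/* Mark the destination as pending when the instruction issues. */
void issue(Scoreboard *sb, const Instr *in) {
    if (in->rd != 0) {
        sb->pending[in->rd] = true;
    }
}

/* Clear the pending bit once the instruction writes back. */
void writeback(Scoreboard *sb, const Instr *in) {
    sb->pending[in->rd] = false;
}
```

<p>Because the WAW stall guarantees at most one in-flight writer per register, a single pending bit per register is enough here. In the test case above, the <code class="language-plaintext highlighter-rouge">li a0, 0</code> would wait until the <code class="language-plaintext highlighter-rouge">lw a0, 0(sp)</code> writes back, while an instruction writing a different register could still slide past.</p>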

<p>Neither of these I can do in time for this week’s post, so I’ll defer it to some other time.</p>]]></content><author><name>Martin Chan</name><email>martinch@mit.edu</email></author><category term="post" /><summary type="html"><![CDATA[A technical blog post describing the write-after-write bug I found in my processor project.]]></summary></entry><entry><title type="html">Prose that I Like</title><link href="https://www.martinchan.org/blog/prose/" rel="alternate" type="text/html" title="Prose that I Like" /><published>2023-11-05T00:00:00-04:00</published><updated>2023-11-05T00:00:00-04:00</updated><id>https://www.martinchan.org/blog/prose</id><content type="html" xml:base="https://www.martinchan.org/blog/prose/"><![CDATA[<figure>
<p><img src="/assets/media/prose/bojack.png" alt="Screenshot of Bojack Horseman" /></p>
  <figcaption>Diane in Bojack Horseman S02E09 12:44: You can do anything you want in life! I mean, not everyone can write for the New Yorker, but there’s always the Atlantic!</figcaption>
</figure>

<p>I’ve been a long time reader and admirer of the <em>New Yorker</em>. It’s a rather fashionable magazine that publishes some of the best written essays I’ve read. As someone with some interest in writing, I look upon the <em>New Yorker</em>’s best pieces as models for the craft.</p>

<p>I set up this blog in part to practice writing so I can get <a href="https://www.goodreads.com/quotes/309485-nobody-tells-this-to-people-who-are-beginners-i-wish">closer to the quality I love to read</a>. I’ve been trying to post weekly with some consistency because I think a steady trickle of practice will get me farther than irregular bursts. It also helps keep time while I’m not employed, though I don’t intend on stopping when I am.</p>

<p>I’ve generally written these posts by the week, but I plan on accumulating a backlog of content to release over time. I’d like to be able to share something modest each week and occasionally release large, well-prepared posts like my <a href="/projects/processor/">processor write-up</a>. I don’t think I’ve got enough energy at this point in my life to write something of that caliber on a weekly basis, nor do I have enough content to justify it.</p>

<p>This week, I’d like to share a short list of pieces I admire. They have a neatness that I try to emulate in both prose and code. Each of these pieces was written deep into a celebrated writer’s career. While I’m trying to make my career in engineering, I can still admire their art.</p>

<ul>
  <li>
    <p>John McPhee has a column called “<a href="https://www.newyorker.com/magazine/the-writing-life">The Writing Life</a>” in the <em>New Yorker</em>. My favorite piece is “<a href="https://www.newyorker.com/magazine/2015/09/14/omission">Omission</a>,” on leaving some details unwritten.</p>
  </li>
  <li>
    <p>Ann Patchett wrote an essay collection called <em>These Precious Days</em>. I read the <a href="https://harpers.org/archive/2021/01/these-precious-days-ann-patchett-psilocybin-tom-hanks-sooki-raphael/">eponymous essay</a> after first encountering her <em>New Yorker</em> piece “<a href="https://www.newyorker.com/magazine/2021/03/08/how-to-practice">How to Practice</a>,” on living. Both are solid reads.</p>
  </li>
  <li>
    <p>Annie Dillard wrote many things, but I’ve only read her 1989 memoir <em>The Writing Life</em>. It speaks to living and writing. I think it’s worth reading in its entirety. There are no copies I can link to, but those of you who know me can ask me for mine. My phone wallpaper is a screenshot of an excerpt with one line highlighted: “<em>How we spend our days is, of course, how we spend our lives.</em>”</p>
  </li>
</ul>]]></content><author><name>Martin Chan</name><email>martinch@mit.edu</email></author><category term="post" /><summary type="html"><![CDATA[A short post where I give a small list of writers and pieces of prose that I like.]]></summary></entry><entry><title type="html">Supermarket Fried Chicken</title><link href="https://www.martinchan.org/blog/fried-chicken/" rel="alternate" type="text/html" title="Supermarket Fried Chicken" /><published>2023-10-29T00:00:00-04:00</published><updated>2023-10-29T00:00:00-04:00</updated><id>https://www.martinchan.org/blog/fried-chicken</id><content type="html" xml:base="https://www.martinchan.org/blog/fried-chicken/"><![CDATA[<!-- 1. _
{:toc}
{::options toc_levels="1..3"/} -->

<figure>
<p><img src="/assets/media/fried-chicken/box.jpg" alt="I'm holding an 8-piece box of dark fried chicken from ShopRite while walking in a parking lot. It's $9 for 8 pieces." /></p>
  <figcaption>A picture I took recently from a walk back from ShopRite after picking up a set of dark fried chicken (thighs and drums). I often start eating on the way back, since the chicken doesn’t get any fresher.</figcaption>
</figure>

<p>I’ve long been a proponent of getting fried chicken at supermarkets. I can’t speak for more local establishments, but of America’s big three fried chicken chains (Chick-fil-A, Popeyes, KFC), I think only Chick-fil-A is worth the money. More often than not, grocery stores that offer fried chicken do it better and at far better prices than the other mainstream chains.</p>

<p>One of the tipping points that turned me off from mainstream fast food fried chicken was a bad experience I had at the KFC in Allston when I was at MIT. In my childhood, KFC used to be synonymous with fried chicken, and indeed it used to be at the top. But it fell to <a href="https://www.cnbc.com/2023/10/04/popeyes-overtakes-kfc-as-no-2-chicken-chain-Chick-fil-A-stays-on-top.html">#3 behind Popeyes in October 2023</a>, and even further behind Chick-fil-A at #1. KFC does well internationally, just not so well domestically anymore.</p>

<p>My single visit to the location in Allston left no questions as to why. The brand has fallen off, and it has fallen off hard.</p>

<figure>
<p><img src="/assets/media/fried-chicken/kfc.jpg" alt="The KFC two piece dark meat meal, with a thigh, drumstick, biscuit, and mashed potatoes." /></p>
  <figcaption>Somehow KFC manages to be the most expensive and the worst in quality and portion size. Other locations might be doing better, but my visit to the Allston location of KFC only confirmed my earlier impressions.</figcaption>
</figure>

<figure>
<p><img src="/assets/media/fried-chicken/kfc_focus.jpg" alt="A highlight of the KFC drumstick." /></p>
  <figcaption>The drumstick was smaller than some drumettes you see from chicken wings. Frankly, I don’t think Colonel Sanders would be proud of this product.</figcaption>
</figure>

<figure>
<p><img src="/assets/media/fried-chicken/allston.png" alt="Google map screenshot showing the route from MIT's campus to the KFC in Allston." /></p>
  <figcaption>To get to the KFC in Allston, my dear friend Syd and I had to bike four miles from the MIT campus.</figcaption>
</figure>

<p>I’ve had Popeyes a few times, but I don’t really like the way it tastes. I think it’s something in their frying oil. Even if Popeyes (or other chains like Jollibee) are actually better than supermarket fried chicken, it’s by such a thin margin that it’s not worth the increased expense. The only exception is Chick-fil-A, which truly does do it better.</p>

<p>Of the big three, I think there’s no mystery as to why Chick-fil-A has gotten so successful compared to the other chains. I’m no business analyst, so I can’t say anything about their methods in logistics or business strategy, but I do know that the product they offer is miles better than its competitors. I have yet to have a fried chicken sandwich anywhere else that holds up against Chick-fil-A’s. It’s just so good.</p>

<figure>
<p><img src="/assets/media/fried-chicken/boylston.jpg" alt="Google map screenshot showing the route from MIT's campus to the Chick-fil-A on Boylston" /></p>
  <figcaption>There’s only one Chick-fil-A within 10 miles of MIT’s campus, and I would often bike there and back for their spicy chicken sandwiches. Former Boston Mayor Thomas Menino <a href="https://patch.com/massachusetts/boston/boston-mayor-meninos-letter-chik-fil-makes-rounds-again">released a letter in 2012 pushing against</a> the company’s attempt to open a Boston location. This location <a href="https://www.boston.com/food/restaurants/2021/12/13/chick-fil-a-opening-first-boston-restaurant-winter/">only opened in January 2022</a>, and I went there frequently from Spring 2022 through the 2022-2023 school year.</figcaption>
</figure>

<p>Chick-fil-A has had some baggage in the past with regard to charitable contributions to anti-LGBT groups, and part of that is rooted in their uncommonly religious background compared to other large American chains. It’s not enough baggage to shake it from its #1 position on customer satisfaction for the 9th year running on the <a href="https://www.nrn.com/consumer-trends/chick-fil-ranked-no-1-customer-satisfaction-9th-straight-year">American Customer Satisfaction Index</a>, nor is it unadulterated enough that conservative Christian groups won’t <a href="https://www.christianpost.com/voices/Chick-fil-A-is-way-worse-than-we-thought.html">take the opportunity to accuse it of pandering to liberal America</a>.</p>

<p>I’ve heard rumors that the company has reformed its charitable contributions in recent years to become more palatable to liberal consumers. It’s a natural approach when a company is trying to expand outside of the American South, especially into the massive liberal urban coastal markets. Chick-fil-A has gotten so far despite its past baggage because it has no real competition on quality of product.</p>

<p>In terms of whether it’s moral as a consumer to support Chick-fil-A as it is today, I wouldn’t sweat it any more than supporting any other big company like Nike or Nestle. You know how they are.</p>

<p>But enough about fast food restaurants. For a fraction of the price, you can get fried chicken at or above the caliber of non-Chick-fil-A chains at your local supermarket, at least in my experience living in Cambridge and Philadelphia. I often like to get a batch of fried chicken once every few visits to the supermarket.</p>

<p>As a policy, I only ever get dark meat (thighs and drums). A chicken breast cannot survive hours under a supermarket heat lamp the same way a chicken thigh can, and it will not measure up as favorably against its restaurant counterpart.</p>

<p>While I’ve been here in Philadelphia, I’ve been getting fried chicken from my local <a href="https://en.wikipedia.org/wiki/ShopRite_(United_States)">ShopRite</a>. I haven’t sampled supermarket fried chicken from other stores, mainly because I’ve been staying put in my neighborhood. But it’s as good here as it’s been anywhere else, with delicious golden skin and acceptable meat.</p>

<figure>
<p><img src="/assets/media/fried-chicken/2019.jpg" alt="Close-up of a fried chicken box with a sell-by in December 2019, $2 for 4 pieces." /></p>
  <figcaption>4-piece dark (much rarer) sold hot in December 2019 at my local ShopRite when I visited for winter break freshman year. I imagine it must’ve been a mislabeling for the chicken to be sold so cheaply, since the 8-piece was usually $6.49 (now it’s $9, not a big increase). The quality hasn’t changed much in the years since. It’s just as tasty as ever. If you want a good picture of the chicken today, scroll all the way up.</figcaption>
</figure>

<p>When I was at MIT, I would get supermarket fried chicken at several nearby supermarkets (there, <a href="https://en.wikipedia.org/wiki/Shaw%27s_and_Star_Market">Shaw’s and Star Market</a>). I don’t know if the ingredients vary by location, but I know that the manner of preparation certainly does. Each location had its own style. I never kept track of how it varied.</p>

<figure>
<p><img src="/assets/media/fried-chicken/bag.jpg" alt="A close up of a fried chicken thigh in a paper bag." /></p>
  <figcaption>Here’s a recent picture of a fried chicken thigh my dear friend Syd got from a Star Market in Somerville. We often used to get groceries and supermarket fried chicken together. Admittedly, it doesn’t look very pretty. It kind of looks like an amateur fry job that you might see at a high school bake sale with <a href="https://en.wikipedia.org/wiki/Deep-fried_Oreo">fried Oreos</a>.</figcaption>
</figure>

<p>Over the years, I’ve tried frying my own chicken several times according to a <a href="https://www.seriouseats.com/homemade-Chick-fil-A-sandwiches-recipe">copycat Chick-fil-A recipe on Serious Eats</a> by (MIT alum) J. Kenji López-Alt. It’s an excellent recipe, but fried chicken really is one of those foods that both benefits from scale and can’t be feasibly meal prepped. I think most people are better off buying than making.</p>

<p>Whether you’re frying a few servings or many servings, there’s a comparable amount of preparation and cleanup because of the large quantity of oil. Meanwhile, you can’t exactly prepare ten servings of fried chicken the same way you might prepare a big pot of chili and expect the quality to remain steady over a week.</p>

<figure>
<p><img src="/assets/media/fried-chicken/homemade.jpg" alt="Pieces of fried chicken thighs in a fryer basket, draining from oil." /></p>
  <figcaption>I fried this batch of chicken according to the Serious Eats recipe when I was still living at my dorm East Campus at MIT. I used a communal deep fryer that 5e got in either 2019 or 2020, but I think this occasion was the only time the fryer had ever been used.</figcaption>
</figure>

<p>Fried chicken connoisseurs might claim that supermarket fried chicken is too mushy or too soft on account of the way it’s made and stored. It can certainly be true, but I would say the quality can vary dramatically by location and time of day. I’ve only ever had good experiences with my local ShopRite in Philadelphia, but I have had bad experiences with some supermarkets in New England.</p>

<figure>
<p><img src="/assets/media/fried-chicken/connecticut.jpg" alt="The mushiest, worst fried chicken you've ever seen." /></p>
  <figcaption>Here’s some so-called fried chicken at a no-name supermarket in Connecticut. I bought it at a rest stop on the way back to Philadelphia at the start of winter break from when I was living on <a href="https://en.wikipedia.org/wiki/Cape_Cod">Cape Cod</a> for the pandemic academic year 2020-2021. It was tolerable but not particularly good, kind of like most food that was sold on the Cape in the off-season. It was the worst supermarket chicken I’ve ever had. Still better than my visit to KFC.</figcaption>
</figure>

<p>The quantities in which the chicken is sold can also vary. In Cambridge, Star Markets tend to sell fried chicken by the piece, but my experience in Philadelphia has been that they’re sold in sets of four to eight pieces, most commonly eight. Here, they’re offered in both 8-piece regular (2 each of breasts, wings, thighs, and drums) and 8-piece dark (4 each of thighs and drums). As mentioned, I always get the dark.</p>

<p>On an empty stomach, I’m unable to eat more than three or (pushing it) four dark pieces in a single sitting. The leftovers, I tend to eat either cold straight from the fridge or reheated in an air fryer. Surprisingly, air frying leftover supermarket fried chicken can get it pretty close to its quality when fresh.</p>

<h2 id="gallery">Gallery</h2>

<figure>
<p><img src="/assets/media/fried-chicken/jan_2021.jpg" alt="Another picture of fried chicken, same as the other Philadelphia ShopRite ones. Pretty good!" /></p>
  <figcaption>Some ShopRite fried chicken I bought during January 2021 when I was visiting Philadelphia from Cape Cod for winter break my sophomore year.</figcaption>
</figure>

<figure>
<p><img src="/assets/media/fried-chicken/2021.jpg" alt="Close-up of a fried chicken box with a sell-by in June 2021, $6.49 for 8 pieces." /></p>
  <figcaption>A picture I took of an 8-piece dark set sold cold in June 2021 at my local ShopRite when I was living in Philadelphia for the second pandemic summer. I assume that hot fried chicken from the previous day (or days) is relabeled and sold cold in the following days. It can be enjoyed as-is or reheated. I typically only buy fried chicken hot.</figcaption>
</figure>

<figure>
<p><img src="/assets/media/fried-chicken/quarters.jpg" alt="Roasted chicken quarters in plastic boxes on a heated foodservice shelf." /></p>
  <figcaption>Sometimes, instead of getting fried chicken, I like to get the roasted leg quarters. It’s like rotisserie chicken but just the yummy dark meat, and it’s even cheaper. It’s too much to eat in one sitting so I like to eat the skin first and then keep the meat for later meals.</figcaption>
</figure>]]></content><author><name>Martin Chan</name><email>martinch@mit.edu</email></author><category term="post" /><summary type="html"><![CDATA[I talk about my experiences with supermarket fried chicken in Philadelphia and in Cambridge, as well as mainstream fried chicken from KFC and Chick-fil-A.]]></summary></entry><entry><title type="html">One Hundred Installations</title><link href="https://www.martinchan.org/blog/one-hundred/" rel="alternate" type="text/html" title="One Hundred Installations" /><published>2023-10-22T00:00:00-04:00</published><updated>2023-10-22T00:00:00-04:00</updated><id>https://www.martinchan.org/blog/one-hundred</id><content type="html" xml:base="https://www.martinchan.org/blog/one-hundred/"><![CDATA[<ol id="markdown-toc">
  <li><a href="#introduction" id="markdown-toc-introduction">Introduction</a></li>
  <li><a href="#caveat" id="markdown-toc-caveat">Caveat</a></li>
  <li><a href="#impact" id="markdown-toc-impact">Impact</a></li>
  <li><a href="#effort-vs-impact" id="markdown-toc-effort-vs-impact">Effort vs Impact</a></li>
  <li><a href="#who-else" id="markdown-toc-who-else">Who Else?</a></li>
  <li><a href="#who-else-really" id="markdown-toc-who-else-really">Who Else, Really?</a></li>
</ol>

<h2 id="introduction">Introduction</h2>
<p>Earlier this week, my <a href="/projects/vscode-bsv/">Bluespec extension for VS Code</a> hit <strong><a href="https://marketplace.visualstudio.com/items?itemName=MartinChan.bluespec">100 installations</a></strong>, slightly under a month from when I released it. The number of actual users must be lower, but I’ve heard from multiple contacts in MIT’s 6.191 (both students and TAs) that people are finding it useful.</p>

<p><img src="/assets/media/one-hundred/marketplace.png" alt="VS Code marketplace header for my extension, showing 114 installs and 3 reviews" /></p>

<p>Thanks to my extension, there are now a bunch of folks who have industry-standard syntax highlighting for their Bluespec homework assignments and projects instead of plain black-and-white.</p>

<p>I remember going into it thinking, “well, if the only person this helps is me, that’s enough.” I’m very proud that it’s ended up helping others.</p>

<h2 id="caveat">Caveat</h2>

<p>While reaching 100 installations is something I’m pleased with, it’s a coarse measure of impact, and the milestone would’ve been reached even if my extension were complete garbage.<sup id="fnref:which" role="doc-noteref"><a href="#fn:which" class="footnote" rel="footnote">1</a></sup></p>

<p>Installations are something that users can’t take back. If someone installs an extension, sees that it sucks, and immediately uninstalls it, there’s no mechanism that accounts for it in the popularity count. It still counts as an installation.</p>

<p>The number also reliably goes up because of how <em>empty</em> the field is. There are only three results that show up from <a href="https://marketplace.visualstudio.com/search?term=bluespec&amp;target=VSCode">searching Bluespec on the Marketplace</a>, and installing VS Code extensions is so easy that people can just install all three and stick with the one that’s best.</p>

<p>I think my extension’s quality (both in <a href="https://www.martinchan.org/projects/vscode-bsv/">implementation</a> and <a href="https://marketplace.visualstudio.com/items?itemName=MartinChan.bluespec">documentation</a>) has a positive effect for retention, but I have no way of measuring it. Practically nobody leaves ratings or reviews. The most popular extensions on the Marketplace with their 50M+ installations only have hundreds of ratings. Mine has three ratings.<sup id="fnref:already_more" role="doc-noteref"><a href="#fn:already_more" class="footnote" rel="footnote">2</a></sup></p>

<h2 id="impact">Impact</h2>

<p>In all, I know a good handful of students taking 6.191 who are using my extension for their Bluespec (or I suppose Minispec). Some of them switched from the two barebones extensions on the Marketplace, and some of them switched from plaintext black-and-white.</p>

<p>When I released the extension, I told two current 6.191 TAs that I made my extension available, just in case it’d be useful for them or their students. One of them told me that they already had students asking for a VS Code extension for Minispec in office hours.</p>

<p>More recently, I was told by a friend that one of the other TAs (whom I’d never seen or heard of) was helping their students install my extension for their VS Code setups.</p>

<p><em>That</em> was a sublime feeling, knowing that people I hadn’t met were recommending my tool to others and helping them install it.</p>

<p>It also feels nice, like in an artisanal way, to produce something beautiful (and I would consider my extension beautiful) and see it appreciated by others. It feels like what a carpenter might feel after having made a nice stool appreciated by themselves and others.</p>

<h2 id="effort-vs-impact">Effort vs Impact</h2>

<p>In all, I didn’t spend a herculean effort on the extension, and yet it’s paid off handsomely. I spent about a week of effort and now that effort’s been amortized across dozens of my peers at MIT and Bluespec users at other institutions. I genuinely think the extension has already saved enough cognitive effort cumulatively across its users to outweigh the amount of time I spent developing it, and the degree by which that’s true is only going to grow with more users.<sup id="fnref:efficiency" role="doc-noteref"><a href="#fn:efficiency" class="footnote" rel="footnote">3</a></sup></p>

<p>Is my Bluespec extension technically impressive? Not really. It’s basically a set of regex rules in a trench coat.<sup id="fnref:for_now" role="doc-noteref"><a href="#fn:for_now" class="footnote" rel="footnote">4</a></sup> But has it filled a niche that has already made dozens of people happier to write Bluespec in VS Code? Yes.</p>

<p><a href="https://thetech.com/2017/12/07/firehose-updated">Firehose</a> (now <a href="https://hydrant.mit.edu/">Hydrant</a>) was built in a weekend in 2017. Every semester, practically every undergrad at MIT uses it to plan their classes. It’s brought immeasurable comfort (even joy) to the undergrad community at MIT, and it has become as quintessentially MIT as the <a href="https://en.wikipedia.org/wiki/Campus_of_the_Massachusetts_Institute_of_Technology#Maclaurin_Buildings_and_Great_Dome_(1916)">Great Dome</a>. I want to give appropriate credit to the <a href="https://mitadmissions.org/blogs/entry/you-dont-have-to-be-a-founder/">maintainers</a> from <a href="https://sipb.mit.edu/">SIPB</a> like <a href="https://cjquines.com/">CJ</a> without whom the tool would no longer exist, but I’ve always thought it was remarkable that such an impactful tool was first created in a single weekend.<sup id="fnref:start" role="doc-noteref"><a href="#fn:start" class="footnote" rel="footnote">5</a></sup></p>

<p>It especially feels weird having this extension as a project next to my months-long-and-technically-difficult (but in-hindsight-kind-of-unimpressive<sup id="fnref:employers" role="doc-noteref"><a href="#fn:employers" class="footnote" rel="footnote">6</a></sup>) project working on <a href="/projects/processor/">my processor</a>.<sup id="fnref:impact" role="doc-noteref"><a href="#fn:impact" class="footnote" rel="footnote">7</a></sup></p>

<h2 id="who-else">Who Else?</h2>

<p>These things bring to mind the apocryphal story of <a href="https://en.wikipedia.org/wiki/Egg_of_Columbus">Columbus’s egg</a>. It’s like, well anyone could’ve done it. The thing is (and not to toot my own horn), nobody did do it.<sup id="fnref:not_like" role="doc-noteref"><a href="#fn:not_like" class="footnote" rel="footnote">8</a></sup> Why not?</p>

<p>My extension still has a ways to go to be a truly great extension<sup id="fnref:ways_to_go" role="doc-noteref"><a href="#fn:ways_to_go" class="footnote" rel="footnote">9</a></sup>, but it’s a little mind-blowing that the niche it filled was so, so empty considering the amount of positive impact per unit effort. It’s by no means a massive user base, but definitely at least a thousand Bluespec users per year stand to benefit, if the <a href="https://marketplace.visualstudio.com/items?itemName=raamakrishnan.bluespec-system-verilog">other extension</a> gives us any indication.</p>

<p>Before I did it, I would’ve expected maybe course staff in 6.191 (with their instructors or army of TAs) to provide a modest syntax highlighter the same way that other introductory classes at MIT have built up their class infrastructure.<sup id="fnref:infrastructure" role="doc-noteref"><a href="#fn:infrastructure" class="footnote" rel="footnote">10</a></sup> It would’ve taken a week, maybe two, of a single TA’s hours out of their ten TAs, and it would’ve paid dividends for several semesters because such a tool can be reused.</p>

<p>But I think the course staff hasn’t, as an institution, hopped onto the VS Code hype train yet.<sup id="fnref:terminal" role="doc-noteref"><a href="#fn:terminal" class="footnote" rel="footnote">11</a></sup> Or, just as likely, they just haven’t sought to take on that sort of pedagogical burden of providing syntax highlighting tools to students, especially when there are so many other pedagogical burdens to consider.</p>

<p>Instructors have many, many choices about where to expend their energy to improve a class, and syntax highlighters might not be on their radars the same way they were on mine, especially when placed against improving lecture materials or operational aspects like running office hours. The instructors might not have known (or recognized the need), and the TAs might not have wanted to assume that sort of extra responsibility on top of their existing duties.</p>

<p>If not 6.191 course staff, then I would’ve expected <a href="https://bluespec.com/">Bluespec Inc.</a> or the related <a href="https://github.com/B-Lang-org/Documentation">B-Lang organization</a> to develop and publish a VS Code extension for Bluespec. They seem to be doing a <a href="https://bluespec.com/news">lot of marketing</a> for Bluespec as a tool, including <a href="https://bluespec.com/2020/09/06/bluespec-inc-to-open-source-its-proven-bsv-high-level-hdl-tools/">open-sourcing</a> their <a href="https://github.com/B-Lang-org/bsc">Bluespec compiler</a>, but none of that outreach seemed to include adding Bluespec support for the world’s most popular IDE, Visual Studio Code.</p>

<p>It’s a rather small company though, so it’s unsurprising if they’re focused more on other tooling, compiler enhancements, or toolchain integrations that make Bluespec more attractive to hardware companies. They have internal syntax highlighters for Emacs and Vim, but it could also be a generational<sup id="fnref:generational" role="doc-noteref"><a href="#fn:generational" class="footnote" rel="footnote">12</a></sup> difference where nobody there uses VS Code, or none of their main customers in the hardware design industry do.</p>

<blockquote class="twitter-tweet" data-dnt="true" data-theme="dark"><p lang="en" dir="ltr">every time I have to review software source code that was written by a hardware company I come away with the feeling that I have encountered an alien form of life</p>&mdash; badidea 🪐 (@0xabad1dea) <a href="https://twitter.com/0xabad1dea/status/1714268650402873473?ref_src=twsrc%5Etfw">October 17, 2023</a></blockquote>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

<p>I don’t have data, but I would also be unsurprised if VS Code’s market share was greater among software engineers than it is among hardware engineers. There’s less of a distinction in the EECS undergraduate program at MIT, because everyone who’s writing hardware is probably writing software in another class. I can imagine the industry being a different world.</p>

<p>In this way, VS Code support for Bluespec might be far more important for college students than for industry engineers, but Bluespec Inc. is exclusively focused on the latter, as per their business model.</p>

<h2 id="who-else-really">Who Else, Really?</h2>

<p>It kind of all comes down to, who is in a position where they would care enough about Bluespec syntax highlighting to make it better? It’s not like there are syntax highlighting engineers running around looking for languages to develop extensions for. Course staff are generally focused on teaching their class material, and syntax highlighting is a ways off the beaten path. Bluespec Inc. is focused on developing their EDA tools for enterprise, whatever it is that they do.</p>

<p>And hardware engineers (like what I’m aspiring to be) generally want to be working on hardware, not the sort of infrastructural tools that involve playing with JSONs and writing regular expressions. But who else but the hardware engineer would care to work on tools to make hardware engineering more comfortable? It’s weird!<sup id="fnref:ice" role="doc-noteref"><a href="#fn:ice" class="footnote" rel="footnote">13</a></sup></p>

<p>These considerations presume that syntax highlighting for Bluespec is important. It’s a difficult case to prove to instructors or seasoned Bluespec engineers who have <a href="https://en.wikipedia.org/wiki/Curse_of_knowledge">forgotten what it’s like to be new</a>. I just know subjectively that, as a student who was learning unfamiliar concepts in an unfamiliar language, the familiarity of high-quality syntax highlighting would’ve made the learning curve easier and freed up some cognitive resources to focus on the technically difficult parts of writing Bluespec.<sup id="fnref:verbose" role="doc-noteref"><a href="#fn:verbose" class="footnote" rel="footnote">14</a></sup></p>

<p>In no place is syntax highlighting more important than for <em>introducing</em> people to Bluespec, whether it is MIT undergraduates in 6.191 or engineers considering picking up Bluespec. These people have not yet developed the Bluespec visual recognition skills that come with experience. Of course, syntax highlighting can remain useful after you do develop those skills.</p>

<p>I could’ve also been caught up thinking about how it should’ve been someone else making the tool, rather than me, a Joe Schmoe looking for a job after graduation. But I don’t know. It would’ve been nice if someone else made it, but it ended up being me. Someone had to do it. No great harm came to me from doing something useful that I wasn’t obligated to do.</p>

<p>Intellectual work, its proper distribution and compensation, and humanity’s open-source project are hard issues to think about in this digital age. Maybe I’ll think more about it another time. I don’t think the answer is necessarily to kick the can around waiting for someone else to pick it up, but I also don’t think the answer is for a self-selected few to pick up cans while the world looks on. But what can you do? Not everything fits within a transactional framework.</p>

<p>This syntax highlighting is a <em>very</em> small-stakes case for a <em>very</em> small audience, but it reminds me of the many actually important projects that really are <a href="https://xkcd.com/2347/">thanklessly made and maintained</a> by volunteer labor. The highest profile example that comes to mind, while not necessarily code, is the content on <a href="https://www.wikipedia.org/">Wikipedia</a>. There are countless examples in open-source software, but it’s not a world that I’m very steeped in on the developer side.</p>

<p>As a <em>user</em>, like everyone else, I’m constantly benefiting from software (and more generally, intellectual work) made by the volunteer labor of others. This website itself was generated using <a href="https://jekyllrb.com/">Jekyll</a>, an open-source static site generator with really <a href="https://opencollective.com/jekyll">not all that many contributions</a> compared to the value it provides.</p>

<p>I know in the age of hyperscale and impact (especially in software), a hundred installations is small potatoes compared to tools that millions of people use. But it’s my first time having such tangible (if modest) positive impact through something digital I created. It’s a profound feeling to have.</p>

<hr />
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:which" role="doc-endnote">
      <p>Case in point, <a href="https://marketplace.visualstudio.com/items?itemName=raamakrishnan.bluespec-system-verilog">the only other true Bluespec extension</a> has nearly 10k installations since its release five years ago. Almost all its functionality is from out-of-the-box VS Code language support (C-like <code class="language-plaintext highlighter-rouge">//</code> comment recognition, bracket matching) and very basic keyword recognition. My extension had better syntax highlighting within half an hour of development. However, the other extension’s redeeming feature (or context) is that there <em>truly wasn’t</em> anything better on the Marketplace, and this is a case where something is better than nothing. Kudos to them for taking action, however small. <a href="#fnref:which" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:already_more" role="doc-endnote">
      <p>Three ratings is already more than the other true Bluespec extension, which has two ratings across its nearly 10k installations. And if <em>you</em>, dear reader, enjoy the extension, please leave me a rating too. <a href="#fnref:already_more" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:efficiency" role="doc-endnote">
      <p>Of course, not that I necessarily value <em>my</em> time the same as I do anybody else’s time, but in terms of Improving Efficiency™️ for humanity (taken at face value without all the moral baggage that attends whether brute efficiency is good), it feels like I’ve helped people! <a href="#fnref:efficiency" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:for_now" role="doc-endnote">
      <p>For now. We need to start somewhere. It <em>is</em> a nice base from which to add better quality-of-life features. <a href="#fnref:for_now" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:start" role="doc-endnote">
      <p>I’ve long been captivated by the idea of beginnings. For example, why did <a href="/blog/ec-build-2023/">EC Build</a> start in 2004, why did no code editor command a supermajority of developers until Visual Studio Code, why did nobody make Firehose until Firehose was made? Every beginning must come from some context, but that context isn’t obvious for everything. It’s no surprise that rideshares or bikeshares didn’t take off until the proliferation of smartphones. The beginnings that are most surprising are the ones where there doesn’t seem to be an obvious accompanying shift that made them possible. Although, that might just be me not inspecting closely enough. <a href="#fnref:start" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:employers" role="doc-endnote">
      <p>Don’t tell my prospective employers that I called it unimpressive. It represents a lot of focus, effort, and difficulty, even if it doesn’t look all that fancy. It’s just subjectively anticlimactic because of what it is. An aspiring but solitary architect would make a dud of a cathedral by themselves. <a href="#fnref:employers" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:impact" role="doc-endnote">
      <p>You could also say that whereas the Bluespec extension was built for impact, the processor was built as a learning exercise, which is true. With three months of solitary effort, I didn’t expect to go toe-to-toe with processors built by teams of hundreds of engineers. <a href="#fnref:impact" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:not_like" role="doc-endnote">
      <p>Though, it’s not like anyone but me is saying “anyone could’ve done it.” I just happened to do it and I’m thinking, anyone really could’ve done it, if only they just did it. <a href="#fnref:not_like" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:ways_to_go" role="doc-endnote">
      <p>I still need to figure out (if I’m still working on the extension) how to have truly useful snippets, and other subtle, quality-of-life changes that make writing Bluespec more comfortable. For that, I’ll need either user data (good luck getting that) or personal experience, which my slow but steady technical blog post series using Bluespec provides. <a href="#fnref:ways_to_go" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:infrastructure" role="doc-endnote">
      <p>Some MIT classes have developed impressive pedagogical tools, like 6.101 and Adam Hartz’s <a href="https://catsoop.org/docs/about">CAT-SOOP</a>, 6.102 and its <a href="https://web.mit.edu/6.102/www/sp23/tools/getting-started/#praxis-tutor">Praxis Tutor</a>, or (the very same) 6.191 and Daniel Sanchez’s <a href="https://github.com/minispec-hdl/minispec">Minispec HDL</a>. It’s hard to pick one example, but when I took Software Performance Engineering: 6.106 in Fall 2021, I considered it to be a master class in pedagogy (or at least classroom infrastructure).</p>

      <p>Outside of MIT, I know there are some who use Bluespec at <a href="https://timesofindia.indiatimes.com/city/chennai/iit-m-creates-shakti-indias-1st-microprocessor/articleshow/66454041.cms">IIT Madras</a>, <a href="https://www.cl.cam.ac.uk/research/security/ctsrd/beri/">University of Cambridge</a>, <a href="https://ics.uci.edu/~swjun/">UC Irvine</a>, and so forth. I don’t know their standards for classroom (or laboratory) development infrastructure, so I don’t hold it against them. <a href="#fnref:infrastructure" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:terminal" role="doc-endnote">
      <p>When I took 6.191 and when many of my friends took it, we were only taught to use the terminal. I still shudder thinking about it. In this year 2023? <a href="#fnref:terminal" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:generational" role="doc-endnote">
      <p>VS Code was only released in 2016, and even though it has taken over the world for several years running, it’s probably not super obvious to people who don’t actively interact with young developers writing code. I had a millennial professor who lectures hundreds of students every year and hadn’t even heard of VS Code when I mentioned it in Fall 2022. <a href="#fnref:generational" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:ice" role="doc-endnote">
      <p>When I was writing the syntax highlighter, I was reminded of the ice vendor that was <a href="https://www.youtube.com/watch?v=ET8mqVGDQ1s">featured recently</a> on <em>Eater</em>’s YouTube channel. The guy was a bartender who wanted access to high-quality ice, but there was nobody in New York who could provide it. He started an ice manufacturing business that now produces clear ice for both his bar and other bars across the city. <a href="#fnref:ice" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:verbose" role="doc-endnote">
      <p>Though, the verbosity of Bluespec and most programming languages is nothing compared to the open-close tag system of HTML. The usefulness still holds. <a href="#fnref:verbose" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Martin Chan</name><email>martinch@mit.edu</email></author><category term="post" /><summary type="html"><![CDATA[I reflect on my Bluespec extension for VS Code, which has now been released for a month and has gotten over a hundred installations.]]></summary></entry><entry><title type="html">Experimenting with the Synth Tool</title><link href="https://www.martinchan.org/blog/tweaking-synth/" rel="alternate" type="text/html" title="Experimenting with the Synth Tool" /><published>2023-10-15T00:00:00-04:00</published><updated>2023-10-15T00:00:00-04:00</updated><id>https://www.martinchan.org/blog/tweaking-synth</id><content type="html" xml:base="https://www.martinchan.org/blog/tweaking-synth/"><![CDATA[<ol id="markdown-toc">
  <li><a href="#introduction" id="markdown-toc-introduction">Introduction</a></li>
  <li><a href="#overview" id="markdown-toc-overview">Overview</a></li>
  <li><a href="#background" id="markdown-toc-background">Background</a></li>
  <li><a href="#synth-tweaks" id="markdown-toc-synth-tweaks">Synth Tweaks</a>    <ol>
      <li><a href="#accepting-verilog-inputs" id="markdown-toc-accepting-verilog-inputs">Accepting Verilog Inputs</a></li>
      <li><a href="#buffer-configurations" id="markdown-toc-buffer-configurations">Buffer Configurations</a></li>
      <li><a href="#svg-tweaks" id="markdown-toc-svg-tweaks">SVG Tweaks</a></li>
    </ol>
  </li>
  <li><a href="#verilog-full-adder" id="markdown-toc-verilog-full-adder">Verilog Full Adder</a>    <ol>
      <li><a href="#basic-cell-library" id="markdown-toc-basic-cell-library">Basic Cell Library</a></li>
      <li><a href="#extended-cell-library" id="markdown-toc-extended-cell-library">Extended Cell Library</a></li>
      <li><a href="#multisize-cell-library" id="markdown-toc-multisize-cell-library">Multisize Cell Library</a></li>
      <li><a href="#full-cell-library" id="markdown-toc-full-cell-library">Full Cell Library</a></li>
    </ol>
  </li>
  <li><a href="#bluespec-full-adder" id="markdown-toc-bluespec-full-adder">Bluespec Full Adder</a>    <ol>
      <li><a href="#bitwise-implementation" id="markdown-toc-bitwise-implementation">Bitwise Implementation</a></li>
      <li><a href="#boolean-implementation" id="markdown-toc-boolean-implementation">Boolean Implementation</a></li>
      <li><a href="#resulting-verilog-files" id="markdown-toc-resulting-verilog-files">Resulting Verilog Files</a></li>
      <li><a href="#wrappers-around-bluespec" id="markdown-toc-wrappers-around-bluespec">Wrappers around Bluespec</a></li>
    </ol>
  </li>
  <li><a href="#verilog-wrapped-with-bluespec" id="markdown-toc-verilog-wrapped-with-bluespec">Verilog Wrapped with Bluespec</a></li>
  <li><a href="#next-time" id="markdown-toc-next-time">Next Time</a></li>
</ol>

<h2 id="introduction">Introduction</h2>

<p>In a series of upcoming posts, I will be presenting worked Bluespec and Verilog examples of different adders for eventual use in my <a href="/projects/processor/">RISC-V processor project</a>. I’ll be using these adders both to replace existing adders and as components in future functional units like my <a href="/blog/basic-multiplier/">integer multiplier</a> or floating point unit.</p>

<p>Before all that, I need to perform some tests and set up some infrastructure. It’s no good to blindly implement components, so I spend this post experimenting with <code class="language-plaintext highlighter-rouge">synth</code> to identify quirks and see how it interacts with Bluespec and Verilog when we involve wrappers, which are required to import Verilog into Bluespec.</p>

<p>I also tweak <code class="language-plaintext highlighter-rouge">synth</code> to accept Verilog directly, which will be helpful to evaluate Verilog components in the same way I evaluate Bluespec components. Some upcoming posts will see whether we actually get any performance gains from implementing modules in Verilog rather than Bluespec.</p>

<p>This post also serves as a visual walkthrough of using the Minispec <code class="language-plaintext highlighter-rouge">synth</code> tool. There’s sparse documentation anywhere on its use, so I figured I may as well write some here.</p>

<h2 id="overview">Overview</h2>

<p>I begin by discussing <a href="#synth-tweaks">some tweaks</a> I made to <a href="https://github.com/mchanphilly/minispec">my fork</a> of Daniel Sanchez’s <a href="https://github.com/minispec-hdl/minispec/tree/main/synth"><code class="language-plaintext highlighter-rouge">synth</code></a> tool for Minispec. These tweaks enable the rest of this post.</p>

<p>Then, I demonstrate the use of <code class="language-plaintext highlighter-rouge">synth</code> on a <a href="#verilog-full-adder">Verilog implementation of a full adder</a>. I also show an example using <a href="#boolean-quirks">boolean and bitwise operators</a> where quirks in our downstream synthesis tools can create suboptimal circuits, so we should take synthesis results with a grain of salt. Because a full adder creates only a simple circuit, I also include <a href="https://en.wikipedia.org/wiki/Logic_gate">gate-level logic circuit</a> visualizations created using <code class="language-plaintext highlighter-rouge">synth</code> with several cell libraries.</p>

<p>I also demonstrate the use of <code class="language-plaintext highlighter-rouge">synth</code> on <a href="#bluespec-full-adder">Bluespec implementations of full adders</a>, including showing the <a href="#resulting-verilog-files">resulting Verilog files from compilation</a> and some strange properties that emerge <a href="#wrappers-around-bluespec">when we nest Bluespec wrappers</a>, including losing and gaining efficiency in the resulting circuits.</p>

<p>Afterward, I demonstrate Bluespec’s ability to <a href="#verilog-wrapped-with-bluespec">directly use Verilog implementations in Bluespec designs</a>, which will be helpful if we find Verilog implementations to be more efficient than our Bluespec ones. However, I found no performance difference with simple circuits like full adders, so it will take more testing with more complex circuits to see whether implementing in Verilog is worth the trouble. We’ll explore these things and more <a href="#next-time">next time</a>.</p>

<h2 id="background">Background</h2>

<p>For the past couple weeks, I’ve slowed down on technical blogging because I’ve been practicing my Verilog with the wonderful exercises on <a href="https://hdlbits.01xz.net/wiki/Main_Page">HDLBits</a>. I’m starting to exhaust their Verilog material, so it’s about time to apply what I’ve learned. With all this practice, I’m now able to do two things:</p>

<ul>
  <li>I can now inspect and understand the <code class="language-plaintext highlighter-rouge">.v</code> files that result from compiling my <code class="language-plaintext highlighter-rouge">.bsv</code> files. Simple Bluespec modules can give us legible Verilog. With complex modules, it takes more effort but can be done, especially when side-by-side with the Bluespec source code.</li>
  <li>I can now write <code class="language-plaintext highlighter-rouge">.v</code> files directly and import them as <a href="https://en.wikipedia.org/wiki/Semiconductor_intellectual_property_core">IP blocks</a> into my Bluespec designs through the <code class="language-plaintext highlighter-rouge">import "BVI"</code> feature. This works best for simple modules that are done more efficiently in Verilog.</li>
</ul>

<p>I like Bluespec for its high-level constructs and abstractions. One common criticism of the language is that the Verilog outputted by the <a href="https://github.com/B-Lang-org/bsc">Bluespec compiler</a> might not be performant enough to supplant writing Verilog by hand. The trade-off is acceptable for complex top-level modules that can’t be prototyped quickly in Verilog, but in small, reusable components like adders and FIFOs, it can make sense to go lower in abstraction.</p>

<p>(In this post, I found no evidence with the simple full adder example that Bluespec produces any less performant circuitry than Verilog. It’s too soon to draw conclusions on this front, since we’d need more complex modules.)</p>

<p>This is especially the case when the optimizing compiler isn’t mature enough. There was probably a point in history when C compilers didn’t produce performant enough assembly for developers to program exclusively in C. Bluespec may very well be at that point right now with producing performant Verilog. In an ideal world, the Bluespec compiler should be able to <a href="https://en.wikipedia.org/wiki/Electronic_design_automation">automatically</a> make the same optimizations a human designer would.</p>

<p>To understand how our Bluespec turns into Verilog, we can refer to the <a href="https://github.com/B-Lang-org/bsc/releases/latest/download/bsc_user_guide.pdf">BSC User Guide</a>. People interested in greater detail should check out the chapter “Verilog back end” and especially the subsection “Bluespec to Verilog mapping”, which describes how <code class="language-plaintext highlighter-rouge">.bsv</code> files are transformed into Verilog <code class="language-plaintext highlighter-rouge">.v</code> files.</p>

<p>You can also read the chapter “Embedding RTL in a BSV design” in the <a href="https://github.com/B-Lang-org/bsc/releases/latest/download/BSV_lang_ref_guide.pdf">BSV Reference Guide</a> where they discuss importing Verilog modules into Bluespec for use in the Verilog backend. As per the User Guide, the Bluesim backend is currently incapable of using Verilog directly. When we import, we’d need to use Verilog simulators or write Bluespec implementations for simulation in Bluesim. This makes it a little less convenient to import Verilog when we use Bluesim for simulation, like I currently do.</p>

<h2 id="synth-tweaks">Synth Tweaks</h2>
<p>The <a href="https://github.com/minispec-hdl/minispec/tree/main/synth"><code class="language-plaintext highlighter-rouge">synth</code></a> synthesis tool we use from Daniel Sanchez’s <a href="https://github.com/minispec-hdl/minispec/">Minispec</a> compiles our Bluespec <code class="language-plaintext highlighter-rouge">.bsv</code> files into Verilog <code class="language-plaintext highlighter-rouge">.v</code> files, then does a bunch of processing with <a href="https://github.com/YosysHQ/yosys"><code class="language-plaintext highlighter-rouge">yosys</code></a> and <a href="https://github.com/berkeley-abc/abc"><code class="language-plaintext highlighter-rouge">ABC</code></a> to determine our area and critical-path delay.</p>

<p>It’s a nicely designed tool, but I need to make a series of tweaks to make it work better for my purposes. The main change is that I’d like to be able to synthesize Verilog files directly, but I also make a bug fix and a cosmetic change. You can see my modified version on <a href="https://github.com/mchanphilly/minispec">my fork on GitHub</a>. I don’t know how widely applicable my changes are, so I don’t plan on making a pull request.</p>

<h3 id="accepting-verilog-inputs">Accepting Verilog Inputs</h3>

<p>The <code class="language-plaintext highlighter-rouge">synth</code> tool was built to consume Minispec and Bluespec, but internally it compiles both into Verilog <code class="language-plaintext highlighter-rouge">.v</code> files for synthesis with the downstream tools <code class="language-plaintext highlighter-rouge">yosys</code> and <code class="language-plaintext highlighter-rouge">ABC</code>. There might be established tools for generating area and delay numbers for Verilog designs, but I like Minispec’s <code class="language-plaintext highlighter-rouge">synth</code> and have had trouble finding off-the-shelf synthesis tools. (I suspect many of them are proprietary.)</p>

<p>I modified <code class="language-plaintext highlighter-rouge">synth</code> to be able to accept Verilog modules directly for synthesis. It’s just a matter of being able to skip the Minispec/Bluespec compilation step of the <code class="language-plaintext highlighter-rouge">synth</code> tool and using the Verilog <code class="language-plaintext highlighter-rouge">.v</code> files directly.</p>

<p>It’s also a matter of moving the <code class="language-plaintext highlighter-rouge">.v</code> files in the current directory into the <code class="language-plaintext highlighter-rouge">synthDir</code> so that they can be consumed as needed by other modules (specified by the <code class="language-plaintext highlighter-rouge">.use</code> files). This is especially important for Bluespec <code class="language-plaintext highlighter-rouge">import "BVI"</code> statements because the <code class="language-plaintext highlighter-rouge">.v</code> files from the compilation will assume that the imported <code class="language-plaintext highlighter-rouge">.v</code> files will be available for synthesis.</p>
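<p>To make those two steps concrete, here is a minimal Python sketch. It is not the actual code in <code class="language-plaintext highlighter-rouge">synth</code>: the helper names <code class="language-plaintext highlighter-rouge">needs_compilation</code> and <code class="language-plaintext highlighter-rouge">stage_verilog_files</code> are hypothetical, and checking the file extension is only one assumed way of telling Verilog inputs apart from Minispec or Bluespec ones.</p>

```python
import shutil
from pathlib import Path


def needs_compilation(source_file):
    """Return True for Minispec/Bluespec sources, False for Verilog.

    Hypothetical helper: the extension check is an assumption about how
    one might tell the two kinds of input apart.
    """
    return Path(source_file).suffix != ".v"


def stage_verilog_files(work_dir, synth_dir):
    """Copy every .v file in work_dir into synth_dir and return the copied names."""
    synth_path = Path(synth_dir)
    synth_path.mkdir(parents=True, exist_ok=True)
    copied = []
    for verilog_file in sorted(Path(work_dir).glob("*.v")):
        shutil.copy(verilog_file, synth_path / verilog_file.name)
        copied.append(verilog_file.name)
    return copied
```

<p>In this sketch, a call like <code class="language-plaintext highlighter-rouge">stage_verilog_files(".", "synthDir")</code> would run before synthesis, so that any module pulled in through <code class="language-plaintext highlighter-rouge">import "BVI"</code> can be found next to the compiled output.</p>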

<p>When we eventually do Verilog simulation, we’ll also need to ensure that our <code class="language-plaintext highlighter-rouge">.v</code> files are moved to <code class="language-plaintext highlighter-rouge">build</code> for simulation.</p>

<h4 id="alternatives">Alternatives</h4>
<p>When I was thinking about how to measure the performance of both Bluespec and Verilog modules, I briefly considered using the wrapper-only route. I wouldn’t need to modify the <code class="language-plaintext highlighter-rouge">synth</code> tool as long as all my Verilog modules were presented as Bluespec modules.</p>

<p>I decided that it would be a little too roundabout to need to wrap all my Verilog modules in Bluespec just to synthesize them. I may want to synthesize separately even before importing these modules into a Bluespec design. It’s not much trouble, but it requires writing a bit of boilerplate.</p>

<p>It wasn’t so hard to modify the <code class="language-plaintext highlighter-rouge">synth</code> program. It’s written in Python, so I just needed to read through it and figure out what to change.</p>

<h3 id="buffer-configurations">Buffer Configurations</h3>

<p>I had already tweaked my installation of <code class="language-plaintext highlighter-rouge">synth</code> during my <a href="/projects/processor/">processor project</a>. During the step where the program synthesizes with three buffer configurations, one of them would suddenly require much, much, <em>much</em> more computation than the other two. It’s no problem for small designs, but it would take so much computation for synthesizing my L1 caches that the synthesis would crash.</p>

<p>To locate the issue, I looked at the different output logs from <code class="language-plaintext highlighter-rouge">synth</code> to see where the tool was stalling. I found that <code class="language-plaintext highlighter-rouge">synth</code> would generate several configurations and select the best one. The crash happened because one of these configurations would stall.</p>

<figure>
<p><img src="/assets/media/extending-synth/buffers.png" alt="Debug output showing two normal critical paths and one very large critical path." /></p>
  <figcaption>There are 6 different outputs because the tool tries 3 buffer configurations and both -O0 (<code class="language-plaintext highlighter-rouge">ox</code>) and -O1 (<code class="language-plaintext highlighter-rouge">ob</code>) optimization parameters.</figcaption>
</figure>

<p>I “fixed” the issue by making <code class="language-plaintext highlighter-rouge">synth</code> skip the configuration prone to stalling. I don’t know whether it’s a true fix because it might result in worse generated circuits for some designs. I checked it makes no difference for my full adder implementations.</p>

<h3 id="svg-tweaks">SVG Tweaks</h3>

<p>I also adjusted the color scheme of the <code class="language-plaintext highlighter-rouge">svg</code> generator to output dark mode circuit visualizations, just because that’s what I use for everything, including this blog.</p>

<p>If I were submitting a pull request, I would want to make it configurable from the command line. But because I only ever need dark mode, I just changed the color values in the <code class="language-plaintext highlighter-rouge">svg</code> file in <code class="language-plaintext highlighter-rouge">synth</code>.</p>

<h2 id="verilog-full-adder">Verilog Full Adder</h2>

<p>In this section, I wanted to test out my changes to <code class="language-plaintext highlighter-rouge">synth</code> by synthesizing a simple full adder module written in Verilog.</p>

<p>I also run a little experiment with using different operators. Below, I choose to use boolean operators (e.g., <code class="language-plaintext highlighter-rouge">&amp;&amp;</code>, <code class="language-plaintext highlighter-rouge">||</code>) even though I could use bitwise operators (e.g., <code class="language-plaintext highlighter-rouge">&amp;</code>, <code class="language-plaintext highlighter-rouge">|</code>). <a href="#boolean-quirks">I explain more soon.</a></p>

<p>The <code class="language-plaintext highlighter-rouge">synth</code> tool customarily requires us to have every module accept a <code class="language-plaintext highlighter-rouge">CLK</code>, which can remain unused.</p>

<div class="language-verilog highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">module</span> <span class="n">FullAdder</span><span class="p">(</span>
    <span class="kt">input</span> <span class="n">CLK</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c_in</span><span class="p">,</span>
    <span class="kt">output</span> <span class="kt">reg</span> <span class="n">sum</span><span class="p">,</span> <span class="n">c_out</span>
    <span class="p">);</span>
    <span class="k">always</span> <span class="o">@</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">begin</span>  <span class="c1">// generally I would prefer always_comb in SystemVerilog</span>
        <span class="n">sum</span> <span class="o">=</span> <span class="n">a</span> <span class="o">^</span> <span class="n">b</span> <span class="o">^</span> <span class="n">c_in</span><span class="p">;</span>
        <span class="n">c_out</span> <span class="o">=</span> <span class="p">(</span><span class="n">a</span><span class="o">&amp;&amp;</span><span class="n">b</span><span class="p">)</span> <span class="o">||</span> <span class="p">(</span><span class="n">a</span><span class="o">&amp;&amp;</span><span class="n">c_in</span><span class="p">)</span> <span class="o">||</span> <span class="p">(</span><span class="n">b</span><span class="o">&amp;&amp;</span><span class="n">c_in</span><span class="p">);</span>
    <span class="k">end</span>
<span class="k">endmodule</span>
</code></pre></div></div>

<p>With my <a href="#accepting-verilog-inputs">above tweak</a>, I can run <code class="language-plaintext highlighter-rouge">synth FullAdder.v FullAdder</code> to generate synthesis logs.</p>

<h3 id="basic-cell-library">Basic Cell Library</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Synthesizing FullAdder from file FullAdder.v as a Verilog module.
Synthesizing circuit with std cell library = basic, O1, target delay = 1 ps

Gates: 14
Area: 10.11 um^2
Critical-path delay: 51.75 ps (not including setup time of endpoint flip-flop)

Critical path: b -&gt; sum
               Gate/port   Fanout        Gate delay (ps)  Cumulative delay (ps) 
               ---------   ------        ---------------  --------------------- 
                       b        3                    7.6                    7.6 
                   NAND2        3                   14.3                   21.9 
                     INV        1                    8.4                   30.3 
                    NOR2        1                    6.1                   36.4 
                   NAND2        1                    8.6                   45.0 
                   NAND2        1                    6.7                   51.7 
                     sum        0                    0.0                   51.7 

Area breakdown:
               Gate type    Gates       Area/gate (um^2)       Area/type (um^2)
               ---------    -----       ----------------       ----------------
                     INV        4                  0.532                  2.128
                   NAND2        8                  0.798                  6.384
                    NOR2        2                  0.798                  1.596
                   Total       14                                        10.108
</code></pre></div></div>
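<p>As far as I can tell from the numbers, the cumulative delay column is the running sum of the gate delays, and the total area is the gate count times the area per gate, summed over gate types. Here is a short Python check of that reading, using values copied from the log above. The variable names are my own for this example.</p>

```python
from itertools import accumulate

# Rows copied from the critical path in the log: (gate or port, gate delay in ps).
critical_path = [
    ("b", 7.6), ("NAND2", 14.3), ("INV", 8.4),
    ("NOR2", 6.1), ("NAND2", 8.6), ("NAND2", 6.7), ("sum", 0.0),
]

# Rows copied from the area breakdown: gate type -> (count, area per gate in um^2).
area_breakdown = {"INV": (4, 0.532), "NAND2": (8, 0.798), "NOR2": (2, 0.798)}

# Running sum of gate delays, rounded to the log's one decimal place.
cumulative_delays = [
    round(total, 1) for total in accumulate(delay for _, delay in critical_path)
]

total_gates = sum(count for count, _ in area_breakdown.values())
total_area = round(sum(count * area for count, area in area_breakdown.values()), 3)

print(cumulative_delays)
print(total_gates, total_area)
```

<p>The recomputed values match the log’s cumulative column (ending at 51.7 ps) and its totals (14 gates, 10.108 um^2).</p>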

<p>The <code class="language-plaintext highlighter-rouge">synth</code> tool includes an <code class="language-plaintext highlighter-rouge">svg</code> diagram visualizer for circuits made with the standard (basic) cell library. We get that by using the <code class="language-plaintext highlighter-rouge">-v</code> flag, e.g., <code class="language-plaintext highlighter-rouge">synth FullAdder.v FullAdder -v</code>.</p>

<p>Let’s see what this looks like.</p>

<p><img src="/assets/media/extending-synth/fa_v/fa_v_std.svg" alt="SVG diagram of the Verilog full adder with std cell library" /></p>

<p>Notice the synthesis mostly uses <code class="language-plaintext highlighter-rouge">INV</code>, <code class="language-plaintext highlighter-rouge">NAND2</code> and a couple <code class="language-plaintext highlighter-rouge">NOR2</code> gates, whereas a textbook full adder might only use <code class="language-plaintext highlighter-rouge">XOR2</code>, <code class="language-plaintext highlighter-rouge">AND2</code>, and an <code class="language-plaintext highlighter-rouge">OR2</code>. Modern physical design (or at least the kind that they teach in schools) preferentially uses <code class="language-plaintext highlighter-rouge">NAND</code> gates because they result in an overall cheaper circuit.</p>
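<p>To illustrate that <code class="language-plaintext highlighter-rouge">NAND</code> gates alone are enough, here is a small Python model of the well-known nine-<code class="language-plaintext highlighter-rouge">NAND2</code> full adder. This is the classic construction, not the circuit that <code class="language-plaintext highlighter-rouge">synth</code> produced above (which also uses <code class="language-plaintext highlighter-rouge">INV</code> and <code class="language-plaintext highlighter-rouge">NOR2</code>), and the function names are just for this example.</p>

```python
from itertools import product


def nand(x, y):
    """2-input NAND on single bits (0 or 1)."""
    return 1 - (x & y)


def full_adder_nand(a, b, c_in):
    """Classic full adder built from nine NAND2 gates."""
    n1 = nand(a, b)
    a_xor_b = nand(nand(a, n1), nand(b, n1))          # 4-gate XOR of a and b
    n2 = nand(a_xor_b, c_in)
    total = nand(nand(a_xor_b, n2), nand(c_in, n2))   # XOR with the carry-in
    c_out = nand(n1, n2)                              # a*b OR (a^b)*c_in
    return total, c_out


# Exhaustively compare against integer addition for all eight input combinations.
for a, b, c_in in product((0, 1), repeat=3):
    total, c_out = full_adder_nand(a, b, c_in)
    assert 2 * c_out + total == a + b + c_in
```

<p>The two intermediate signals <code class="language-plaintext highlighter-rouge">n1</code> and <code class="language-plaintext highlighter-rouge">n2</code> are shared between the sum and carry logic, which is why the whole adder needs only nine gates.</p>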

<h4 id="boolean-quirks">Boolean Quirks</h4>
<p>By accident, I noticed there’s a quirk that happens when I use bitwise versus boolean operators. I think it must be an issue with the downstream optimization because semantically, it shouldn’t matter whether we’re using boolean operators or bitwise operators when each operand is a single bit. Indeed, we’ll <a href="#nondeterminism">see later</a> that the downstream gate placement can vary unpredictably.</p>
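<p>As a sanity check on the claim that the two operator styles agree for single-bit operands, here is a quick exhaustive comparison of the carry-out expression in Python. It models only the values 0 and 1 (no <code class="language-plaintext highlighter-rouge">x</code> or <code class="language-plaintext highlighter-rouge">z</code> states), and Python’s <code class="language-plaintext highlighter-rouge">and</code>/<code class="language-plaintext highlighter-rouge">or</code> stand in for Verilog’s boolean operators, so treat it as a sketch rather than a proof about the Verilog semantics.</p>

```python
from itertools import product


def carry_boolean(a, b, c_in):
    """Carry-out written with logical operators, like the && and || version."""
    return int((a and b) or (a and c_in) or (b and c_in))


def carry_bitwise(a, b, c_in):
    """Carry-out written with bitwise operators, like the & and | version."""
    return (a & b) | (a & c_in) | (b & c_in)


# Collect every single-bit input combination where the two forms disagree.
mismatches = [
    bits for bits in product((0, 1), repeat=3)
    if carry_boolean(*bits) != carry_bitwise(*bits)
]
print(mismatches)
```

<p>The list of mismatches comes out empty, so any difference in the synthesized circuit has to come from the tools rather than from the logic.</p>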

<p>We get a different circuit when we use <code class="language-plaintext highlighter-rouge">c_out = (a&amp;b) | (a&amp;c_in) | (b&amp;c_in);</code>, even if semantically we should get the same thing.</p>

<p><img src="/assets/media/extending-synth/fa_v/fa_v_bitwise.svg" alt="SVG of a Verilog full adder circuit with a couple extra gates." /></p>

<p>It’s technically up to the engineer whether this circuit is better or worse. It results in 16 rather than 14 gates, but we shave off about a third of a <code class="language-plaintext highlighter-rouge">ps</code> of delay. I would probably go with the original 14-gate circuit since the 16-gate version is only 0.7% faster (<code class="language-plaintext highlighter-rouge">51.39 ps</code> vs <code class="language-plaintext highlighter-rouge">51.75 ps</code>) but about 16% larger (<code class="language-plaintext highlighter-rouge">11.704 um^2</code> vs <code class="language-plaintext highlighter-rouge">10.108 um^2</code>).</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Critical-path delay: 51.39 ps (not including setup time of endpoint flip-flop)
  Gate/port   Fanout        Gate delay (ps)  Cumulative delay (ps) 
  ---------   ------        ---------------  --------------------- 
          a        4                    9.8                    9.8 
      NAND2        2                   12.2                   22.0 
        INV        1                    7.7                   29.7 
       NOR2        1                    6.3                   36.0 
      NAND2        1                    8.6                   44.6 
      NAND2        1                    6.8                   51.4 
        sum        0                    0.0                   51.4 

  Gate type    Gates       Area/gate (um^2)       Area/type (um^2)
  ---------    -----       ----------------       ----------------
        INV        4                  0.532                  2.128
      NAND2       10                  0.798                  7.980
       NOR2        2                  0.798                  1.596
      Total       16                                        11.704
</code></pre></div></div>
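<p>The trade-off between the two versions can be recomputed from the two logs. This is a small Python sketch using the reported totals; the dictionary and function names are just for illustration.</p>

```python
# Numbers copied from the two synthesis logs above.
boolean_version = {"gates": 14, "area_um2": 10.108, "delay_ps": 51.75}
bitwise_version = {"gates": 16, "area_um2": 11.704, "delay_ps": 51.39}


def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


delay_change = percent_change(boolean_version["delay_ps"], bitwise_version["delay_ps"])
area_change = percent_change(boolean_version["area_um2"], bitwise_version["area_um2"])

print(f"delay: {delay_change:+.1f}%")
print(f"area: {area_change:+.1f}%")
```

<p>The bitwise version comes out roughly 0.7% faster and roughly 15.8% larger than the boolean version.</p>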

<p>In some cases, we can use the <a href="https://en.wikipedia.org/wiki/Retiming"><code class="language-plaintext highlighter-rouge">--retime</code></a> flag with <code class="language-plaintext highlighter-rouge">synth</code> to re-generate a more efficient and logically equivalent circuit. For whatever reason, it didn’t work with this one.</p>

<h3 id="extended-cell-library">Extended Cell Library</h3>

<p>We can also get different results with different cell libraries. I generally stick with <code class="language-plaintext highlighter-rouge">basic</code>, but there’s no reason why we can’t use the other ones. They just give us different gates. The main difference with this library for the full adder is that we gain access to <code class="language-plaintext highlighter-rouge">NAND3</code> gates, which we use for <code class="language-plaintext highlighter-rouge">c_out</code>.</p>

<p>I synthesize using the <code class="language-plaintext highlighter-rouge">-l</code> option with a cell library name, e.g., <code class="language-plaintext highlighter-rouge">synth FullAdder.v FullAdder -l extended -v</code>. I trimmed the following log for conciseness.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[Extended]
Critical-path delay: 49.98 ps (not including setup time of endpoint flip-flop)
  Gate type    Gates       Area/gate (um^2)       Area/type (um^2)
  ---------    -----       ----------------       ----------------
        INV        3                  0.532                  1.596
      NAND2        8                  0.798                  6.384
      NAND3        2                  1.064                  2.128
      Total       13                                        10.108
</code></pre></div></div>

<p><img src="/assets/media/extending-synth/fa_v/fa_v_extended.svg" alt="SVG diagram of the Verilog full adder with extended cell library" /></p>

<h3 id="multisize-cell-library">Multisize Cell Library</h3>

<p>Here, we use a few different gates other than <code class="language-plaintext highlighter-rouge">NAND2</code>, but we still stick mostly with <code class="language-plaintext highlighter-rouge">NAND2</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[Multisize]
Critical-path delay: 48.84 ps (not including setup time of endpoint flip-flop)
  Gate type    Gates       Area/gate (um^2)       Area/type (um^2)
  ---------    -----       ----------------       ----------------
     INV_X1        1                  0.532                  0.532
   NAND2_X1        5                  0.798                  3.990
   NAND3_X1        1                  1.064                  1.064
     OR2_X2        1                  1.330                  1.330
   XNOR2_X1        1                  1.596                  1.596
      Total        9                                         8.512
</code></pre></div></div>

<p><img src="/assets/media/extending-synth/fa_v/fa_v_multisize.svg" alt="SVG diagram of the Verilog full adder with the multisize cell library" /></p>

<h3 id="full-cell-library">Full Cell Library</h3>

<p>We can synthesize with a more diverse <code class="language-plaintext highlighter-rouge">full</code> cell library, but <code class="language-plaintext highlighter-rouge">synth</code> doesn’t currently support generating circuit diagrams for it. It’s probably just a matter of adding in the <code class="language-plaintext highlighter-rouge">svg</code> components for all the different gates.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[Full]
Critical-path delay: 47.63 ps (not including setup time of endpoint flip-flop)
  Gate type    Gates       Area/gate (um^2)       Area/type (um^2)
  ---------    -----       ----------------       ----------------
    AND2_X1        1                  1.064                  1.064
     INV_X1        1                  0.532                  0.532
   NAND2_X1        2                  0.798                  1.596
   NAND3_X1        1                  1.064                  1.064
    NOR2_X1        1                  0.798                  0.798
   OAI21_X1        2                  1.064                  2.128
     OR2_X2        1                  1.330                  1.330
      Total        9                                         8.512
</code></pre></div></div>

<h2 id="bluespec-full-adder">Bluespec Full Adder</h2>

<p>In this section, I wanted to synthesize a simple Bluespec full adder and inspect the resulting Verilog files and synthesis outputs. I also wanted to test whether the choice between boolean and bitwise operators made a difference in the resulting circuit like it did for the <a href="#verilog-full-adder">Verilog full adder</a>.</p>

<p>Implementing in Bluespec gives us some more design choices. Bluespec’s richer type system distinguishes between booleans <code class="language-plaintext highlighter-rouge">Bool</code> and bits <code class="language-plaintext highlighter-rouge">Bit#(1)</code>. Typically, I would prefer the bitwise implementation because semantically, the bits of a full adder generally represent parts of larger bit vector operands and sums.</p>

<p>But like in the above <a href="#boolean-quirks">Verilog case</a>, there may be performance implications in our downstream tools for using boolean versus bitwise operators. Until such a time that the performance quirk gets optimized out, I need to weigh the trade-offs between a more performant circuit with the boolean implementation, versus semantic accuracy with the bitwise implementation.</p>

<p>It may even turn out that it’s easier to work with the bitwise implementation, or that the quirk only appears when we’re synthesizing the full adder directly and not as a component. Because it’s only two gates, I’m leaning toward using the bitwise implementation for future components. In this section, we test both.</p>

<p>Switching between boolean and bitwise in Bluespec is a little trickier than in Verilog because I need to change not only the operators, but also the types. If you want the bitwise implementation, just replace <code class="language-plaintext highlighter-rouge">Bool</code> with <code class="language-plaintext highlighter-rouge">Bit#(1)</code> and the operators <code class="language-plaintext highlighter-rouge">!=</code>, <code class="language-plaintext highlighter-rouge">&amp;&amp;</code>, and <code class="language-plaintext highlighter-rouge">||</code> with <code class="language-plaintext highlighter-rouge">^</code>, <code class="language-plaintext highlighter-rouge">&amp;</code>, and <code class="language-plaintext highlighter-rouge">|</code>, respectively.</p>

<div class="language-bluespec highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">typedef</span><span class="w"> </span><span class="kd">struct</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nc">Bool</span><span class="w"> </span><span class="nv">sum</span><span class="p">;</span><span class="w">
    </span><span class="nc">Bool</span><span class="w"> </span><span class="nv">c_out</span><span class="p">;</span><span class="w">
</span><span class="p">}</span><span class="w"> </span><span class="nc">FullAdderResult</span><span class="w"> </span><span class="kd">deriving</span><span class="w"> </span><span class="p">(</span><span class="no">Bits</span><span class="p">,</span><span class="w"> </span><span class="no">Eq</span><span class="p">);</span><span class="w">

</span><span class="kd">interface</span><span class="w"> </span><span class="nc">FullAdder</span><span class="p">;</span><span class="w">
    </span><span class="kd">method</span><span class="w"> </span><span class="nc">FullAdderResult</span><span class="w"> </span><span class="nf">exec</span><span class="p">(</span><span class="nc">Bool</span><span class="w"> </span><span class="nv">a</span><span class="p">,</span><span class="w"> </span><span class="nc">Bool</span><span class="w"> </span><span class="nv">b</span><span class="p">,</span><span class="w"> </span><span class="nc">Bool</span><span class="w"> </span><span class="nv">c_in</span><span class="p">);</span><span class="w">
</span><span class="kd">endinterface</span><span class="w">

</span><span class="fm">(* synthesize, always_enabled, no_default_reset *)</span><span class="w">
</span><span class="kd">module</span><span class="w"> </span><span class="nf">mkFullAdder</span><span class="p">(</span><span class="nc">FullAdder</span><span class="p">);</span><span class="w">
    </span><span class="kd">method</span><span class="w"> </span><span class="nc">FullAdderResult</span><span class="w"> </span><span class="nf">exec</span><span class="p">(</span><span class="nc">Bool</span><span class="w"> </span><span class="nv">a</span><span class="p">,</span><span class="w"> </span><span class="nc">Bool</span><span class="w"> </span><span class="nv">b</span><span class="p">,</span><span class="w"> </span><span class="nc">Bool</span><span class="w"> </span><span class="nv">c_in</span><span class="p">);</span><span class="w">
        </span><span class="k">return</span><span class="w"> </span><span class="nc">FullAdderResult</span><span class="w"> </span><span class="p">{</span><span class="w">
            </span><span class="nv">sum</span><span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="nv">a</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="nv">b</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="nv">c_in</span><span class="p">,</span><span class="w">  </span><span class="o">//</span><span class="w"> </span><span class="nv">no</span><span class="w"> </span><span class="nv">logical</span><span class="w"> </span><span class="kr">xor</span><span class="w">
            </span><span class="nv">c_out</span><span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="p">(</span><span class="nv">a</span><span class="o">&amp;&amp;</span><span class="nv">b</span><span class="p">)</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="p">(</span><span class="nv">a</span><span class="o">&amp;&amp;</span><span class="nv">c_in</span><span class="p">)</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="p">(</span><span class="nv">b</span><span class="o">&amp;&amp;</span><span class="nv">c_in</span><span class="p">)</span><span class="w">
        </span><span class="p">};</span><span class="w">
    </span><span class="kd">endmethod</span><span class="w">
</span><span class="kd">endmodule</span><span class="w">
</span></code></pre></div></div>

<p>For such a simple design, the Bluespec generates circuits identical to the corresponding (bitwise or boolean) implementations in Verilog, so I don’t bother reproducing the synthesis logs.</p>

<p>There <em>are</em> some minor differences in the visualizations:</p>
<ul>
  <li>The ordering of the operands (doesn’t matter in a full adder),</li>
  <li>The <code class="language-plaintext highlighter-rouge">{sum, c_out}</code> are bused into a 2-bit output, and</li>
  <li>If we don’t include the <code class="language-plaintext highlighter-rouge">no_default_reset</code> and <code class="language-plaintext highlighter-rouge">always_enabled</code> attributes, there would be unused <code class="language-plaintext highlighter-rouge">RST_N</code> and <code class="language-plaintext highlighter-rouge">RDY_exec</code> drivers in the visualization.
    <ul>
      <li>In the following visualizations, I omitted the attributes, so they don’t correspond exactly with the above excerpt. So, imagine there’s only the <code class="language-plaintext highlighter-rouge">synthesize</code> attribute.</li>
    </ul>
  </li>
</ul>

<p>Notice that the operands are prefixed with <code class="language-plaintext highlighter-rouge">exec</code>. That’s because this whole circuit corresponds to the <code class="language-plaintext highlighter-rouge">exec</code> method of the module. We’d have a different looking circuit if we had other methods or rules to synthesize.</p>

<h3 id="bitwise-implementation">Bitwise Implementation</h3>

<p><img src="/assets/media/extending-synth/fa_bsv/fa_bsv_bitwise.svg" alt="SVG diagram of bitwise Bluespec full adder" /></p>

<h3 id="boolean-implementation">Boolean Implementation</h3>

<p><img src="/assets/media/extending-synth/fa_bsv/fa_bsv_bool.svg" alt="SVG diagram of boolean Bluespec full adder" /></p>

<p>You may also notice the unused <code class="language-plaintext highlighter-rouge">RDY_exec</code>. We can remove it by adding the <code class="language-plaintext highlighter-rouge">always_enabled</code> attribute next to the <code class="language-plaintext highlighter-rouge">synthesize</code> attribute. It wouldn’t change the resulting circuit’s delay or area, since the unused <code class="language-plaintext highlighter-rouge">RDY_exec</code> signal gets optimized out anyway.</p>

<p>We could further remove the unused <code class="language-plaintext highlighter-rouge">CLK</code> and <code class="language-plaintext highlighter-rouge">RST_N</code> ports with the attributes <code class="language-plaintext highlighter-rouge">no_default_clock</code> and <code class="language-plaintext highlighter-rouge">no_default_reset</code>. We won’t remove the clock since the <code class="language-plaintext highlighter-rouge">synth</code> tool requires a clock port to synthesize a module. But there’s no reason why we can’t remove the <code class="language-plaintext highlighter-rouge">RST_N</code>.</p>

<p>I added the <code class="language-plaintext highlighter-rouge">no_default_reset</code> and <code class="language-plaintext highlighter-rouge">always_enabled</code> attributes to the Bluespec excerpt above, but I’ve kept the drivers in the visualizations so you can see what I’m talking about.</p>

<h3 id="resulting-verilog-files">Resulting Verilog Files</h3>
<p>For the above visualizations, I didn’t add any attributes other than <code class="language-plaintext highlighter-rouge">synthesize</code>. To generate the following Verilog, I added the <code class="language-plaintext highlighter-rouge">always_enabled, no_default_reset</code> attributes (just like the Bluespec excerpt above).</p>

<p>These Verilog files are generated by the Bluespec compiler for use in downstream tools like <code class="language-plaintext highlighter-rouge">synth</code>, other Verilog synthesis tools, or Verilog simulators.</p>

<p>Note that I present these files in the reverse order of the visualizations above.</p>

<h4 id="boolean-implementation-1">Boolean Implementation</h4>
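<p>The compiled Verilog for such a simple circuit as the boolean implementation of the full adder is very legible, though it uses Verilog-1995-style port declarations. The calculation of the carry also uses a boolean simplification: the compiler factors <code class="language-plaintext highlighter-rouge">exec_a</code> out of the first two product terms.</p>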

<div class="language-verilog highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">module</span> <span class="n">mkFullAdder</span><span class="p">(</span><span class="n">CLK</span><span class="p">,</span>

		   <span class="n">exec_a</span><span class="p">,</span>
		   <span class="n">exec_b</span><span class="p">,</span>
		   <span class="n">exec_c_in</span><span class="p">,</span>
		   <span class="n">exec</span><span class="p">);</span>
  <span class="kt">input</span>  <span class="n">CLK</span><span class="p">;</span>

  <span class="c1">// value method exec</span>
  <span class="kt">input</span>  <span class="n">exec_a</span><span class="p">;</span>
  <span class="kt">input</span>  <span class="n">exec_b</span><span class="p">;</span>
  <span class="kt">input</span>  <span class="n">exec_c_in</span><span class="p">;</span>
  <span class="kt">output</span> <span class="p">[</span><span class="mi">1</span> <span class="o">:</span> <span class="mi">0</span><span class="p">]</span> <span class="n">exec</span><span class="p">;</span>

  <span class="c1">// signals for module outputs</span>
  <span class="kt">wire</span> <span class="p">[</span><span class="mi">1</span> <span class="o">:</span> <span class="mi">0</span><span class="p">]</span> <span class="n">exec</span><span class="p">;</span>

  <span class="c1">// value method exec</span>
  <span class="k">assign</span> <span class="n">exec</span> <span class="o">=</span>
	     <span class="o">{</span> <span class="p">(</span><span class="n">exec_a</span> <span class="o">!=</span> <span class="n">exec_b</span><span class="p">)</span> <span class="o">!=</span> <span class="n">exec_c_in</span><span class="p">,</span>
	       <span class="n">exec_a</span> <span class="o">&amp;&amp;</span> <span class="p">(</span><span class="n">exec_b</span> <span class="o">||</span> <span class="n">exec_c_in</span><span class="p">)</span> <span class="o">||</span> <span class="n">exec_b</span> <span class="o">&amp;&amp;</span> <span class="n">exec_c_in</span> <span class="o">}</span> <span class="p">;</span>
<span class="k">endmodule</span>  <span class="c1">// mkFullAdder (boolean implementation)</span>
</code></pre></div></div>
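
<p>As a sanity check on that simplification, here is a minimal testbench sketch that compares the carry expression from the Bluespec source with the factored form in the compiled Verilog. The module name <code class="language-plaintext highlighter-rouge">carry_equiv_tb</code> and its signal names are made up for illustration, and the sketch assumes a Verilog-2001 simulator (for example, Icarus Verilog).</p>

<div class="language-verilog highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Hypothetical testbench, not part of the generated files.
module carry_equiv_tb;
    reg a, b, c_in;
    integer i, mismatches;

    // Carry as written in the Bluespec source (three product terms).
    wire carry_source = (a &amp;&amp; b) || (a &amp;&amp; c_in) || (b &amp;&amp; c_in);
    // Carry as emitted by the Bluespec compiler (a factored out).
    wire carry_compiled = a &amp;&amp; (b || c_in) || b &amp;&amp; c_in;

    initial begin
        mismatches = 0;
        // Walk through all 8 combinations of the three inputs.
        for (i = 0; i &lt; 8; i = i + 1) begin
            {a, b, c_in} = i[2:0];
            #1;  // let the continuous assignments settle
            if (carry_source !== carry_compiled) begin
                mismatches = mismatches + 1;
                $display("MISMATCH a=%b b=%b c_in=%b", a, b, c_in);
            end
        end
        if (mismatches == 0)
            $display("Carry expressions agree on all 8 input combinations.");
        $finish;
    end
endmodule
</code></pre></div></div>

<p>Because <code class="language-plaintext highlighter-rouge">&amp;&amp;</code> binds tighter than <code class="language-plaintext highlighter-rouge">||</code> in Verilog, the compiled expression needs no extra parentheses to mean the same thing as the source.</p>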

<h4 id="bitwise-implementation-1">Bitwise Implementation</h4>
<p>Unfortunately, the bitwise implementation doesn’t result in as legible a Verilog file. The compiler makes liberal use of internal signals and wire declarations.</p>

<p>There’s no boolean simplification like above. I would’ve originally guessed the lack of simplification is why the design costs more gates, but <a href="#boolean-quirks">we saw earlier</a> that this happens even when we write directly in Verilog, and <a href="#wrappers-around-bluespec">we’ll see later</a> that we sometimes regain efficiency with some strange wrapping.</p>

<div class="language-verilog highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">module</span> <span class="n">mkFullAdder</span><span class="p">(</span><span class="n">CLK</span><span class="p">,</span>

		   <span class="n">exec_a</span><span class="p">,</span>
		   <span class="n">exec_b</span><span class="p">,</span>
		   <span class="n">exec_c_in</span><span class="p">,</span>
		   <span class="n">exec</span><span class="p">);</span>
  <span class="kt">input</span>  <span class="n">CLK</span><span class="p">;</span>

  <span class="c1">// value method exec</span>
  <span class="kt">input</span>  <span class="n">exec_a</span><span class="p">;</span>
  <span class="kt">input</span>  <span class="n">exec_b</span><span class="p">;</span>
  <span class="kt">input</span>  <span class="n">exec_c_in</span><span class="p">;</span>
  <span class="kt">output</span> <span class="p">[</span><span class="mi">1</span> <span class="o">:</span> <span class="mi">0</span><span class="p">]</span> <span class="n">exec</span><span class="p">;</span>

  <span class="c1">// signals for module outputs</span>
  <span class="kt">wire</span> <span class="p">[</span><span class="mi">1</span> <span class="o">:</span> <span class="mi">0</span><span class="p">]</span> <span class="n">exec</span><span class="p">;</span>

  <span class="c1">// remaining internal signals</span>
  <span class="kt">wire</span> <span class="n">x__h20</span><span class="p">,</span> <span class="n">x__h37</span><span class="p">,</span> <span class="n">x__h40</span><span class="p">,</span> <span class="n">x__h52</span><span class="p">,</span> <span class="n">x__h54</span><span class="p">,</span> <span class="n">y__h53</span><span class="p">,</span> <span class="n">y__h55</span><span class="p">;</span>

  <span class="c1">// value method exec</span>
  <span class="k">assign</span> <span class="n">exec</span> <span class="o">=</span> <span class="o">{</span> <span class="n">x__h20</span><span class="p">,</span> <span class="n">x__h40</span> <span class="o">}</span> <span class="p">;</span>

  <span class="c1">// remaining internal signals</span>
  <span class="k">assign</span> <span class="n">x__h20</span> <span class="o">=</span> <span class="n">x__h37</span> <span class="o">^</span> <span class="n">exec_c_in</span> <span class="p">;</span>
  <span class="k">assign</span> <span class="n">x__h37</span> <span class="o">=</span> <span class="n">exec_a</span> <span class="o">^</span> <span class="n">exec_b</span> <span class="p">;</span>
  <span class="k">assign</span> <span class="n">x__h40</span> <span class="o">=</span> <span class="n">x__h52</span> <span class="o">|</span> <span class="n">y__h53</span> <span class="p">;</span>
  <span class="k">assign</span> <span class="n">x__h52</span> <span class="o">=</span> <span class="n">x__h54</span> <span class="o">|</span> <span class="n">y__h55</span> <span class="p">;</span>
  <span class="k">assign</span> <span class="n">x__h54</span> <span class="o">=</span> <span class="n">exec_a</span> <span class="o">&amp;</span> <span class="n">exec_b</span> <span class="p">;</span>
  <span class="k">assign</span> <span class="n">y__h53</span> <span class="o">=</span> <span class="n">exec_b</span> <span class="o">&amp;</span> <span class="n">exec_c_in</span> <span class="p">;</span>
  <span class="k">assign</span> <span class="n">y__h55</span> <span class="o">=</span> <span class="n">exec_a</span> <span class="o">&amp;</span> <span class="n">exec_c_in</span> <span class="p">;</span>
<span class="k">endmodule</span>  <span class="c1">// mkFullAdder (bitwise implementation)</span>
</code></pre></div></div>

<h3 id="wrappers-around-bluespec">Wrappers around Bluespec</h3>

<p>In Bluespec, we can wrap a module’s implementation in another module. It looks like this:</p>

<div class="language-bluespec highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="fm">(* synthesize *)</span><span class="w">
</span><span class="kd">module</span><span class="w"> </span><span class="nf">mkFullAdderWrapper</span><span class="p">(</span><span class="nc">FullAdder</span><span class="p">);</span><span class="w">
    </span><span class="nc">FullAdder</span><span class="w"> </span><span class="nv">_adder</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="na">mkFullAdder</span><span class="p">;</span><span class="w">
    </span><span class="k">return</span><span class="w"> </span><span class="nv">_adder</span><span class="p">;</span><span class="w">
</span><span class="kd">endmodule</span><span class="w">
</span></code></pre></div></div>

<p>The underlying Verilog instantiates the inner module and connects its ports with the external module’s ports. It’s all done in wires, so we might expect no difference in the resulting circuit.</p>

<p>In this section, I investigate whether there’s any overhead in synthesizing wrapped Bluespec. For thoroughness, I check nested wrappers too, like when we wrap a wrapper.</p>

<h4 id="losing-efficiency">Losing Efficiency</h4>
<p>When I experimented with <code class="language-plaintext highlighter-rouge">synth</code>, I saw that using a wrapper <em>can</em> (but doesn’t always) affect the resulting circuit. Wrapping our boolean implementation gives us a 16-gate circuit (like with the bitwise implementation) instead of our original 14-gate circuit.</p>

<p>We might chalk this up to overhead from wrapping, but we shouldn’t be getting any overhead from just connecting wires.</p>

<p>It must have to do with the downstream tools. Similar to the boolean versus bitwise case, there’s something preventing the synthesis tool from optimizing the resulting gate placements.</p>

<div class="language-verilog highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">module</span> <span class="n">mkFullAdderWrapper</span><span class="p">(</span><span class="n">CLK</span><span class="p">,</span>
			  <span class="n">RST_N</span><span class="p">,</span>

			  <span class="n">exec_a</span><span class="p">,</span>
			  <span class="n">exec_b</span><span class="p">,</span>
			  <span class="n">exec_c_in</span><span class="p">,</span>
			  <span class="n">exec</span><span class="p">,</span>
			  <span class="n">RDY_exec</span><span class="p">);</span>
  <span class="kt">input</span>  <span class="n">CLK</span><span class="p">;</span>
  <span class="kt">input</span>  <span class="n">RST_N</span><span class="p">;</span>

  <span class="c1">// value method exec</span>
  <span class="kt">input</span>  <span class="n">exec_a</span><span class="p">;</span>
  <span class="kt">input</span>  <span class="n">exec_b</span><span class="p">;</span>
  <span class="kt">input</span>  <span class="n">exec_c_in</span><span class="p">;</span>
  <span class="kt">output</span> <span class="p">[</span><span class="mi">1</span> <span class="o">:</span> <span class="mi">0</span><span class="p">]</span> <span class="n">exec</span><span class="p">;</span>
  <span class="kt">output</span> <span class="n">RDY_exec</span><span class="p">;</span>

  <span class="c1">// signals for module outputs</span>
  <span class="kt">wire</span> <span class="p">[</span><span class="mi">1</span> <span class="o">:</span> <span class="mi">0</span><span class="p">]</span> <span class="n">exec</span><span class="p">;</span>
  <span class="kt">wire</span> <span class="n">RDY_exec</span><span class="p">;</span>

  <span class="c1">// ports of submodule _unnamed_</span>
  <span class="kt">wire</span> <span class="p">[</span><span class="mi">1</span> <span class="o">:</span> <span class="mi">0</span><span class="p">]</span> <span class="mi">_u</span><span class="n">nnamed_</span><span class="p">$</span><span class="n">exec</span><span class="p">;</span>
  <span class="kt">wire</span> <span class="mi">_u</span><span class="n">nnamed_</span><span class="p">$</span><span class="n">exec_a</span><span class="p">,</span> <span class="mi">_u</span><span class="n">nnamed_</span><span class="p">$</span><span class="n">exec_b</span><span class="p">,</span> <span class="mi">_u</span><span class="n">nnamed_</span><span class="p">$</span><span class="n">exec_c_in</span><span class="p">;</span>

  <span class="c1">// value method exec</span>
  <span class="k">assign</span> <span class="n">exec</span> <span class="o">=</span> <span class="mi">_u</span><span class="n">nnamed_</span><span class="p">$</span><span class="n">exec</span> <span class="p">;</span>
  <span class="k">assign</span> <span class="n">RDY_exec</span> <span class="o">=</span> <span class="mi">1'd1</span> <span class="p">;</span>

  <span class="c1">// submodule _unnamed_</span>
  <span class="n">mkFullAdder</span> <span class="mi">_u</span><span class="n">nnamed_</span><span class="p">(.</span><span class="n">CLK</span><span class="p">(</span><span class="n">CLK</span><span class="p">),</span>
			<span class="p">.</span><span class="n">exec_a</span><span class="p">(</span><span class="mi">_u</span><span class="n">nnamed_</span><span class="p">$</span><span class="n">exec_a</span><span class="p">),</span>
			<span class="p">.</span><span class="n">exec_b</span><span class="p">(</span><span class="mi">_u</span><span class="n">nnamed_</span><span class="p">$</span><span class="n">exec_b</span><span class="p">),</span>
			<span class="p">.</span><span class="n">exec_c_in</span><span class="p">(</span><span class="mi">_u</span><span class="n">nnamed_</span><span class="p">$</span><span class="n">exec_c_in</span><span class="p">),</span>
			<span class="p">.</span><span class="n">exec</span><span class="p">(</span><span class="mi">_u</span><span class="n">nnamed_</span><span class="p">$</span><span class="n">exec</span><span class="p">));</span>

  <span class="c1">// submodule _unnamed_</span>
  <span class="k">assign</span> <span class="mi">_u</span><span class="n">nnamed_</span><span class="p">$</span><span class="n">exec_a</span> <span class="o">=</span> <span class="n">exec_a</span> <span class="p">;</span>
  <span class="k">assign</span> <span class="mi">_u</span><span class="n">nnamed_</span><span class="p">$</span><span class="n">exec_b</span> <span class="o">=</span> <span class="n">exec_b</span> <span class="p">;</span>
  <span class="k">assign</span> <span class="mi">_u</span><span class="n">nnamed_</span><span class="p">$</span><span class="n">exec_c_in</span> <span class="o">=</span> <span class="n">exec_c_in</span> <span class="p">;</span>
<span class="k">endmodule</span>  <span class="c1">// mkFullAdderWrapper</span>
</code></pre></div></div>
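
<p>Functionally, a wrapper like this should be invisible. Here is a minimal sketch for checking that in simulation, assuming a Verilog-2001 simulator. The inner module is the boolean <code class="language-plaintext highlighter-rouge">mkFullAdder</code> condensed to ANSI-style ports, while <code class="language-plaintext highlighter-rouge">wrapper_sketch</code> and <code class="language-plaintext highlighter-rouge">wrapper_tb</code> are hypothetical, hand-written stand-ins rather than compiler output.</p>

<div class="language-verilog highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Boolean full adder, condensed from the generated Verilog
// (same expressions, ANSI-style ports).
module mkFullAdder(
    input CLK, exec_a, exec_b, exec_c_in,
    output [1:0] exec
    );
    assign exec = { (exec_a != exec_b) != exec_c_in,
                    exec_a &amp;&amp; (exec_b || exec_c_in) || exec_b &amp;&amp; exec_c_in };
endmodule

// Hypothetical hand-written wrapper: it only forwards ports through wires.
module wrapper_sketch(
    input CLK, exec_a, exec_b, exec_c_in,
    output [1:0] exec
    );
    mkFullAdder inner(.CLK(CLK), .exec_a(exec_a), .exec_b(exec_b),
                      .exec_c_in(exec_c_in), .exec(exec));
endmodule

// Hypothetical testbench: drive both modules with all 8 input combinations.
module wrapper_tb;
    reg a, b, c_in;
    wire [1:0] direct, wrapped;
    integer i, mismatches;

    mkFullAdder    u_direct (.CLK(1'b0), .exec_a(a), .exec_b(b),
                             .exec_c_in(c_in), .exec(direct));
    wrapper_sketch u_wrapped(.CLK(1'b0), .exec_a(a), .exec_b(b),
                             .exec_c_in(c_in), .exec(wrapped));

    initial begin
        mismatches = 0;
        for (i = 0; i &lt; 8; i = i + 1) begin
            {a, b, c_in} = i[2:0];
            #1;  // let the combinational logic settle
            if (direct !== wrapped) begin
                mismatches = mismatches + 1;
                $display("MISMATCH a=%b b=%b c_in=%b direct=%b wrapped=%b",
                         a, b, c_in, direct, wrapped);
            end
        end
        if (mismatches == 0)
            $display("Wrapped and unwrapped adders agree on all 8 inputs.");
        $finish;
    end
endmodule
</code></pre></div></div>

<p>A check like this only speaks to the logic. If both modules compute the same outputs, any difference in gate count has to come from how the synthesis tool handles the extra hierarchy.</p>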

<p>I also tried adding a second layer of wrapper. If the first wrapper reduced performance (for unknown reasons), maybe a second wrapper would reduce performance even more.</p>

<div class="language-bluespec highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="fm">(* synthesize *)</span><span class="w">
</span><span class="kd">module</span><span class="w"> </span><span class="nf">mkFullAdderWrapper2</span><span class="p">(</span><span class="nc">FullAdder</span><span class="p">);</span><span class="w">
    </span><span class="nc">FullAdder</span><span class="w"> </span><span class="nv">_adder</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="na">mkFullAdderWrapper</span><span class="p">;</span><span class="w">
    </span><span class="k">return</span><span class="w"> </span><span class="nv">_adder</span><span class="p">;</span><span class="w">
</span><span class="kd">endmodule</span><span class="w">
</span></code></pre></div></div>

<p>But we didn’t lose performance! The resulting circuit is <em>back to 14-gate</em>, which is the same as the unwrapped boolean implementation.</p>

<p>At first, I found that wrapping three times gets us the 16-gate, and wrapping four times gets us the 14-gate. There was a cycle of gaining and losing performance, even when the Verilog for each layer of wrapper was practically identical to the last.</p>

<p>(When I went back to verify, the results changed, which <a href="#nondeterminism">I soon discuss</a>.)</p>

<h4 id="gaining-efficiency">Gaining Efficiency</h4>
<p>I ran the same wrapper experiment with the bitwise implementation. If <code class="language-plaintext highlighter-rouge">synth</code> gave us the 16-gate for bitwise, maybe we’d get 16-gate no matter the wrapper.</p>

<p>Surprisingly, adding a wrapper actually gave us the 14-gate circuit. The tool was telling us that our full adder was more performant <em>with</em> a wrapper. Adding more wrappers resulted in several 14-gate, and one 16-gate. There didn’t seem to be any pattern.</p>

<p>This is only if we <em>don’t</em> specify <code class="language-plaintext highlighter-rouge">no_default_reset</code>; otherwise they’re all 16-gate. (Don’t ask me why.)</p>

<h4 id="nondeterminism">Nondeterminism</h4>

<p>The day after, I found that each arrangement of wrappers didn’t necessarily result in the same circuit as the day before. I don’t believe I really changed anything, so I wonder if it’s a nondeterministic bug.</p>

<p>It’s interesting that the mere action of adding more wrappers can be enough to massage the synthesis tool into giving us the more efficient 14-gate circuit. It shows that the downstream bug isn’t just restricted to the kind of operator you use.</p>

<p>The <strong>main takeaway</strong> is that we should be wary about how much stock we put into our synthesis numbers. Even for a circuit as simple as a full adder, there seems to be inefficient gate placement. For much more complex designs, we should consider the synthesis numbers to be only approximate, at least until we secure more sophisticated downstream synthesis tools.</p>

<h2 id="verilog-wrapped-with-bluespec">Verilog Wrapped with Bluespec</h2>

<p>Wrapping Bluespec modules in Bluespec can be useful, but the real use comes with wrapping <em>other</em> languages in Bluespec.</p>

<p>Bluespec offers support for <a href="https://en.wikipedia.org/wiki/Language_binding">bindings</a> between Bluespec modules and Verilog modules (going down in abstraction, at the cost of productivity) or Bluespec functions and C functions (going up in abstraction, at the cost of performance).</p>

<p>Here, I’m focusing on wrapping Verilog because it might allow us to write more performant components, like adders, to use in our Bluespec.</p>

<p>According to the BSC User Guide:</p>

<blockquote>
  <p>Using the <code class="language-plaintext highlighter-rouge">import "BVI"</code> syntax, a designer can specify that the implementation of a particular BSV module is an RTL (Verilog or VHDL) module, as described in the BSV Reference Guide. The module is treated exactly as if it were originally written in BSV and then converted to hardware by the compiler, but instead of the <code class="language-plaintext highlighter-rouge">.v</code> file being generated by the compiler, it was supplied independently of any BSV code. It may have been written by hand or supplied by a vendor as an IP, etc.</p>
</blockquote>

<p>The main thing I’d like to see is whether the synthesis of a Bluespec-wrapped Verilog module is identical to a Verilog module synthesized directly. Given the above description, it should be, since it’s exactly what we practiced by playing with <code class="language-plaintext highlighter-rouge">synth</code> and Bluespec-wrapped Bluespec.</p>

<p>Let’s take our <a href="#verilog-full-adder">Verilog full adder</a> and wrap it in Bluespec. Remember that each of the boolean implementations, in Verilog and in Bluespec, resulted in a 14-gate circuit. But with the capriciousness of the downstream synthesis, I would accept a 16-gate circuit too. This is especially true because we got 16-gate circuits from wrapping implementations that would’ve given us 14-gate circuits.</p>

<p>An <code class="language-plaintext highlighter-rouge">import "BVI"</code> statement also requires us to declare the mappings between the Bluespec interface and the Verilog ports. I’ve modified my Verilog full adder to output <code class="language-plaintext highlighter-rouge">{sum, c_out}</code> as a single <code class="language-plaintext highlighter-rouge">reg [1:0]</code> to be consistent with my Bluespec <code class="language-plaintext highlighter-rouge">exec</code> method, which packs the two values together. In Bluespec, <code class="language-plaintext highlighter-rouge">FullAdderResult</code> is a <code class="language-plaintext highlighter-rouge">struct</code>, but we implicitly pack/unpack to bits as necessary when we’re working with foreign modules.</p>

<div class="language-verilog highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">module</span> <span class="n">FullAdderVerilog</span><span class="p">(</span>
    <span class="kt">input</span> <span class="n">CLK</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c_in</span><span class="p">,</span>
    <span class="kt">output</span> <span class="kt">reg</span> <span class="p">[</span><span class="mi">1</span><span class="o">:</span><span class="mi">0</span><span class="p">]</span> <span class="n">out</span>
    <span class="p">);</span>
    <span class="k">always</span> <span class="o">@</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">begin</span>
        <span class="n">out</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">a</span> <span class="o">^</span> <span class="n">b</span> <span class="o">^</span> <span class="n">c_in</span><span class="p">;</span>
        <span class="n">out</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">a</span><span class="o">&amp;&amp;</span><span class="n">b</span><span class="p">)</span> <span class="o">||</span> <span class="p">(</span><span class="n">a</span><span class="o">&amp;&amp;</span><span class="n">c_in</span><span class="p">)</span> <span class="o">||</span> <span class="p">(</span><span class="n">b</span><span class="o">&amp;&amp;</span><span class="n">c_in</span><span class="p">);</span>
    <span class="k">end</span>
<span class="k">endmodule</span>
</code></pre></div></div>

<div class="language-bluespec highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">import</span><span class="w"> </span><span class="dl">"</span><span class="s">BVI</span><span class="dl">"</span><span class="w"> </span><span class="nc">FullAdderVerilog</span><span class="w"> </span><span class="o">=
</span><span class="kd">module</span><span class="w"> </span><span class="nf">mkFullAdderVerilog</span><span class="p">(</span><span class="nc">FullAdder</span><span class="p">);</span><span class="w">
    </span><span class="kd">method</span><span class="w"> </span><span class="nv">out</span><span class="w"> </span><span class="nf">exec</span><span class="p">(</span><span class="nv">a</span><span class="p">,</span><span class="w"> </span><span class="nv">b</span><span class="p">,</span><span class="w"> </span><span class="nv">c_in</span><span class="p">);</span><span class="w">
</span><span class="kd">endmodule</span><span class="w">
</span></code></pre></div></div>
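
<p>As a sanity check on the packing order, here’s a minimal C sketch of the same boolean equations. It isn’t part of the actual design, and the function name <code class="language-plaintext highlighter-rouge">full_adder</code> is just for illustration. It returns <code class="language-plaintext highlighter-rouge">{sum, c_out}</code> with <code class="language-plaintext highlighter-rouge">sum</code> in bit 1 and <code class="language-plaintext highlighter-rouge">c_out</code> in bit 0, mirroring <code class="language-plaintext highlighter-rouge">out[1]</code> and <code class="language-plaintext highlighter-rouge">out[0]</code> above, and compares all eight input combinations against ordinary addition.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;assert.h&gt;
#include &lt;stdio.h&gt;

/* Software model of the full adder above (hypothetical helper name).
 * Returns {sum, c_out} packed into two bits: sum in bit 1, c_out in bit 0. */
unsigned full_adder(unsigned a, unsigned b, unsigned c_in) {
    unsigned sum = a ^ b ^ c_in;
    unsigned c_out = (a &amp; b) | (a &amp; c_in) | (b &amp; c_in);
    return (sum &lt;&lt; 1) | c_out;
}

int main(void) {
    /* Exhaustively compare against ordinary addition for all 8 inputs. */
    for (unsigned i = 0; i &lt; 8; i++) {
        unsigned a = (i &gt;&gt; 2) &amp; 1, b = (i &gt;&gt; 1) &amp; 1, c_in = i &amp; 1;
        unsigned total = a + b + c_in; /* ranges over 0..3 */
        unsigned out = full_adder(a, b, c_in);
        assert((out &gt;&gt; 1) == (total &amp; 1)); /* sum is the low bit of the total */
        assert((out &amp; 1) == (total &gt;&gt; 1)); /* c_out is the high bit of the total */
        printf("%u %u %u: sum=%u c_out=%u\n", a, b, c_in, out &gt;&gt; 1, out &amp; 1);
    }
    return 0;
}
</code></pre></div></div>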

<p>We can’t directly synthesize foreign modules, but we <em>can</em> wrap them and synthesize the wrapper.</p>

<div class="language-bluespec highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="fm">(* synthesize *)</span><span class="w">
</span><span class="kd">module</span><span class="w"> </span><span class="nf">mkFullAdderVerilogWrapper</span><span class="p">(</span><span class="nc">FullAdder</span><span class="p">);</span><span class="w">
    </span><span class="nc">FullAdder</span><span class="w"> </span><span class="nv">_adder</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="na">mkFullAdderVerilog</span><span class="p">;</span><span class="w">
    </span><span class="k">return</span><span class="w"> </span><span class="nv">_adder</span><span class="p">;</span><span class="w">
</span><span class="kd">endmodule</span><span class="w">
</span></code></pre></div></div>

<p>After synthesis, I found there’s no overhead to wrapping the Verilog, but <a href="#wrappers-around-bluespec">the same quirks from wrapping Bluespec reappeared</a> to give us either 14-gate or 16-gate circuits. We should be good to go in terms of embedding Verilog into our Bluespec designs.</p>

<p>The main drawback of this is that while importing Verilog is fine for using the Verilog backend for Bluespec (e.g., to run simulations with Verilog tools), it doesn’t work for using the Bluesim backend, which requires all modules to be implemented in Bluespec and compiled into <code class="language-plaintext highlighter-rouge">.ba</code> files. We would need to either re-implement the Verilog modules in Bluespec with conditional compilation, find a Verilog simulator, or not use Verilog implementations at all.</p>

<p>If using the Bluespec-recommended method of conditional compilation, we need to be extra careful that our Verilog implementation of a module is cycle-equivalent to our Bluespec implementation of that same module. Otherwise, we may run into trouble with correctness when we simulate with Bluesim and find that our results differ from our results in, say, Vivado. However, I think whatever can be implemented in Verilog can usually be implemented more easily in Bluespec.</p>

<p>If it turns out that implementing in Verilog gives us no benefit over implementing in Bluespec, then I might just stick with Bluespec implementations for use in Bluesim. The full adder example gave no evidence of greater overhead in Bluespec, so at least it’s clean enough for simple modules.</p>

<h2 id="next-time">Next Time</h2>
<p>This time, we tweaked <code class="language-plaintext highlighter-rouge">synth</code> to work better for our goals, and we did some investigation on the interplay between Bluespec, Verilog, and the <code class="language-plaintext highlighter-rouge">synth</code> tool.</p>

<p>Next time, we can see about implementing adders in both Bluespec and Verilog, which <code class="language-plaintext highlighter-rouge">synth</code> allows us to quantitatively evaluate. For correctness, we’ll check against the built-in <code class="language-plaintext highlighter-rouge">+</code> operator (it looks like Bluespec’s <code class="language-plaintext highlighter-rouge">+</code> just wraps around Verilog’s <code class="language-plaintext highlighter-rouge">+</code>) as we implement a simple <a href="https://en.wikipedia.org/wiki/Adder_(electronics)#Ripple-carry_adder">ripple-carry adder</a> and several types of <a href="https://en.wikipedia.org/wiki/Carry-lookahead_adder">carry-lookahead adder</a>.</p>

<p>As we implement adders, I’ll continue to evaluate synthesis differences between Bluespec and Verilog. If performance permits, we might end up not actually needing to use any Verilog implementations in our processor, allowing us to maintain a strictly Bluespec code base.</p>

<p>We’ll see about using these adders later on in our multiplication unit and in other places.</p>]]></content><author><name>Martin Chan</name><email>martinch@mit.edu</email></author><category term="post" /><summary type="html"><![CDATA[Technical blog post experimenting and tweaking with Minispec's synthesis tool. I walk through some features using Verilog and Bluespec full adders as the main example.]]></summary></entry><entry><title type="html">East Campus Build 2023</title><link href="https://www.martinchan.org/blog/ec-build-2023/" rel="alternate" type="text/html" title="East Campus Build 2023" /><published>2023-10-01T00:00:00-04:00</published><updated>2023-10-01T00:00:00-04:00</updated><id>https://www.martinchan.org/blog/ec-build-2023</id><content type="html" xml:base="https://www.martinchan.org/blog/ec-build-2023/"><![CDATA[<h2 id="preview">Preview</h2>
<figure>
<p><img src="/assets/media/ec-build-2023/from_tracks.jpg" alt="" /></p>
  <figcaption>A view of this year’s EC Fort 2023 from the track, with the tennis bubble and the Prudential Center behind.</figcaption>
</figure>

<!-- No Table of Contents this time -->

<h2 id="introduction">Introduction</h2>

<p>Every year since 2004<sup id="fnref:pandemic" role="doc-noteref"><a href="#fn:pandemic" class="footnote" rel="footnote">1</a></sup>, the students of MIT’s <a href="https://ec.mit.edu/">East Campus dormitory</a> (EC) have built and disassembled a large wooden structure as a monument to the community. Last year, it was a climbing wall fort and <a href="https://news.mit.edu/2022/featured-video-building-roller-coaster-0926">rollercoaster</a>.<sup id="fnref:reprise" role="doc-noteref"><a href="#fn:reprise" class="footnote" rel="footnote">2</a></sup> This year, it was a double climbing wall and a three-story fort.</p>

<p>Usually, the EC build happens in the courtyard between the two parallel buildings of East Campus. It happens during the few weeks before fall classes start, and it’s partially meant to drum up excitement among the incoming first-year students. What a marvelous sight, moving onto campus for college and seeing students building something so grand next to their dorm, of their own design, and all by themselves!</p>

<p>Things are different in 2023 because the physical dorm itself has become a construction site, having just started its <a href="https://studentlife.mit.edu/east-campus-renovation">2-year renovation project</a> that runs from Summer 2023 through Summer 2025. Instead, the community was granted permission to use the field next to the tennis bubble as a build site.</p>

<p>One up-side is that the build site is much closer to dorm row. It’s easier to do external outreach this way since students moving into most of the dorms can see and hear the build. They might otherwise only find out about it through word-of-mouth, since East Campus is on the other side of campus.</p>

<p>One major down-side is that the people primarily working on the project need to commute way farther to get to it. Mustering up student labor to complete the project has historically been hard enough in normal times. It was much easier to get masses of people involved when the build was less than two hundred feet away from their bedroom.</p>

<p>I only helped out for a few days during my visit to Cambridge this year. Many of my friends were working on build, and I expected that they would appreciate the extra hand. I was much more involved with build for 2021 and 2022, and I always remember us being shorthanded.</p>

<p>This year, I helped out for about 15-20 hours total between building, monitoring for East Side Party, and deconstructing.</p>

<h2 id="background">Background</h2>

<p>I’ve never been very involved in the planning process for build. Usually there are a few head engineers who do much of the planning. They write up the designs and these get signed off on by several school offices and a professional engineer. They also do procurement for the materials and equipment. Some others raise funds through sponsors, though much of the money sometimes comes from the dorm house tax.</p>

<p>This year, like last year, the head engineer was my friend <a href="https://www.anhadsawhney.com/">Anhad Sawhney ‘25</a>.<sup id="fnref:anhad" role="doc-noteref"><a href="#fn:anhad" class="footnote" rel="footnote">3</a></sup> There are loads of pictures and CAD models available on his website, both of <a href="https://www.anhadsawhney.com/#/east-campus-fort/">this year’s</a> and of <a href="https://www.anhadsawhney.com/#/rollercoaster/">last year’s</a> builds. There was also a team of build leads who were formally in charge of leading the build effort during the weeks when we’re cutting wood, assembling the build, and taking it down. I don’t remember all of them, but the most significant one I remember was Jordan P.A ‘24, who was on site day-in, day-out, putting in major sweat and hours to keep the project on track.</p>

<p>It takes so many person-hours to successfully complete the EC Build. It takes the student engineers months of preparation, but it also takes many students many hours to do the actual labor of carrying, cutting, screwing together, and eventually unscrewing the wood. It’s not particularly glamorous work, but you really can’t simply think up a fort and have it manifest itself. It takes major hours and major elbow grease. And it can be hard to get that from a labor pool of MIT students who might not be accustomed to getting their hands dirty with physical labor, especially on hot August days.</p>

<h2 id="construction">Construction</h2>

<figure>
<p><img src="/assets/media/ec-build-2023/materials.jpg" alt="Pallets of plywood, 2x4s, other lumber, and a shipping container for storage." /></p>
  <figcaption>Here are some of the materials and the storage pod where we put the tools away at night. In the foreground is one of our pallets of plywood. You can see some good progress in the background already.</figcaption>
</figure>

<figure>
<p><img src="/assets/media/ec-build-2023/blanks.jpg" alt="Partially assembled fort with two big blank climbing walls." /></p>
  <figcaption>The state of the fort on the day I arrived. Pretty much the entire skeleton was built before I got there. The front has two climbing walls that will eventually be painted with murals and have climbing holds screwed in. Each plywood sheet is 4x8 feet, so each wall is about 14x16 feet.</figcaption>
</figure>

<figure>
<p><img src="/assets/media/ec-build-2023/murals.jpg" alt="Two murals on the climbing wall, one with a revolution fist and the other a satire of Liberty in the French Revolution." /></p>
  <figcaption>This year’s theme was <strong>Revolution</strong>. The red burning man on black background is the East Campus mascot.</figcaption>
</figure>

<p>This year, I helped with the stairs, some miscellaneous tasks that simply needed doing, and, as a reasonably experienced builder, some minor people support.</p>

<h3 id="stairs">Stairs</h3>
<p>The stairs were a pain point in this year’s build, so I was part of the team working on getting them up. I helped screw in about half the tread 2x4s and began cutting the plywood risers that would stop people’s toes from going too far past each stair. A full build is many, many medium sized tasks that are split into little tasks. And every single one of them needs to be done to complete the build.</p>

<figure>
<p><img src="/assets/media/ec-build-2023/stairs_up.jpg" alt="2x4s on top of a stair without any plywood yet." /></p>
  <figcaption>I helped screw in many of the supports beneath the treads, which are the horizontal pieces you see here. We needed to alternate which side got screwed into the bottom and which got screwed into the side, which was a little funny-looking. I’m told it was structurally optimal, like screwing in your tires in a star shape. It has something to do with load distribution.</figcaption>
</figure>

<figure>
<p><img src="/assets/media/ec-build-2023/stairs_front.jpg" alt="Slightly more complete stairs." /></p>
  <figcaption>I wasn’t directly involved in this part, but this was when the stairs were farther along. Here, they’ve started adding the tread tops. I cut a few of the risers here, which stop your toe from going underneath the next step. A small team was responsible for the outer railing, and it looks like they were doing a pretty good job.</figcaption>
</figure>

<figure>
<p><img src="/assets/media/ec-build-2023/stairs_back.jpg" alt="" /></p>
  <figcaption>Here’s a back view of the stairs as they got further along. I passed cutting out the risers to someone else as I went to work on other things.</figcaption>
</figure>

<h3 id="people-support">People Support</h3>
<p>I don’t have any pictures for this part, but an understated part of the build process is being able to make good use of the available labor. Many times, first-years would come up, interested in the project and maybe even interested in helping out. But they don’t stick around for long unless they have the appropriate guidance.</p>

<p>Several times during the build, I stopped to make sure that the people who were showing up were being taken care of. If they wanted to help, I would find them someone more experienced who could do with a hand. Optimally, it would be for tasks that the newcomer would be excited about. On the other side, sometimes when a team was short-handed, I would go out and try to find the people who could come help. There were no formal structures in the build for these people-support kind of things; it was more holistically seeing where people were and where people needed to be. In the weeks before classes, there are people floating around for all sorts of reasons. And I like meeting with people.</p>

<p>In previous years, I would spend some time doing some outreach trying to get more people to build. Sometimes, I would be like a little foreman. People would be part of my crew and we would collectively work on a task like building trapdoors or building railings. There’s another thing: I could keep my nose down and just do the physical tasks that needed doing, but I could also get more person-hours onto the project by making sure that the right people were in the right places. It’s kind of like the capital versus consumer goods that they teach in microeconomics with the <a href="https://en.wikipedia.org/wiki/Production%E2%80%93possibility_frontier">production-possibility frontier</a>. There’s a balance between time spent directly on building versus on expanding the team.</p>

<p>We also had a scaled-down grilling operation this year, in part because this year’s labor force was scaled down. In previous years, I would also be one of the runners getting food (what we call <a href="https://mitadmissions.org/blogs/entry/a-place/">rush burgers</a>) from the grills to the builders. It was a good way to keep morale up when I was too tired to continue hauling or screwing things in.</p>

<h3 id="miscellaneous-tasks">Miscellaneous Tasks</h3>
<p>I helped with a few things that just needed doing.</p>

<p>One of the things that’s almost always needed is hauling: just carrying things from point A to point B. Wood needed to be carried from the pallets to the saws; mattresses needed to be carried from the loading point to the climbing walls; just all sorts of things needed to be carried from somewhere to somewhere else.</p>

<p>Some things don’t need much time or effort so much as someone with the right know-how in the right place. At some point, we needed to switch out the jigsaw blades. I was the only one on-site with any experience maintaining jigsaws, so it was on me to swap out the blades and quality-test the jigsaws to make sure they were still fit for construction.</p>

<figure>
<p><img src="/assets/media/ec-build-2023/jigsaw_blades.jpg" alt="A packet of jigsaw blades." /></p>
  <figcaption>The jigsaw blades we were using. I took this to keep track of which ones we had left.</figcaption>
</figure>

<p>One of the days, we needed to continue spray painting the steel plates we had. Each coat needs time to dry, so I got there early one day and did a couple coats. These would go into this year’s iteration of <a href="https://www.gregoryxie.com/wao/">wAo</a>, managed by Lili. Some of the coats were meant to protect the steel from corrosion.</p>

<figure>
<p><img src="/assets/media/ec-build-2023/spray_paint.jpg" alt="" /></p>
  <figcaption>A few spray-painted pieces of steel that were going to be used in the wAo ride.</figcaption>
</figure>

<p>My dear friend <a href="http://www.katrinajander.com/">Kat Jander ‘25</a> was on the lighting team and needed some extra hands assembling and transporting her awesome <a href="http://www.katrinajander.com/2023/09/cannonball-project.html">cannonball letters</a>, so I left the build site for a few hours to help her with that. She had already done all the design work and production work in the months before, but the project just needed some tedious screwing things together to get it across the finish line.</p>

<figure>
<p><img src="/assets/media/ec-build-2023/kat_lights.jpeg" alt="" /></p>
  <figcaption>Kat working on putting together the letters for suspension from the cable.</figcaption>
</figure>

<h2 id="east-side-party">East Side Party</h2>
<p>The deadline for construction is the night of the East Side Party, where one or several EC residents DJ from the third floor of the fort. All students are invited to hang out, dance, and see the product of the builders’ hard work. It’s also a time when we run the rides like Space Trainer and wAo, and open up the climbing wall.</p>

<p>I think it’s meant to kick off <a href="https://ec.mit.edu/rex_cpw">REX</a>, where dorms host a bunch of cool events to <a href="https://dormcon.mit.edu/rex/">welcome the first-years</a>. I was an orientation leader in 2021, and I could swear that the attendance of the East Side Party that year was <em>far</em> higher than any orientation event. And this is happening before most upperclassmen are invited back to campus.</p>

<p>During the East Side Party this year, I accompanied my friends who were watching the climbing wall to make sure things were staying safe during the event.</p>

<figure>
<p><img src="/assets/media/ec-build-2023/night_lights.jpg" alt="A string of circular lights strung between two towers of the fort." /></p>
  <figcaption>The lights designed and built by my dear friend Kat. Read more about <a href="http://www.katrinajander.com/2023/09/cannonball-project.html">the particular project on her website</a>.</figcaption>
</figure>

<figure>
<p><img src="/assets/media/ec-build-2023/ec_party.jpg" alt="" /></p>
  <figcaption>The view from the third floor of the fort during the East Side Party overlooking the rest of the site. Buildings in the background include New Vassar, the <a href="https://betterworld.mit.edu/met-warehouse/">Metropolitan Storage Warehouse</a>, the Z Center, and the new music buildings, still under construction.</figcaption>
</figure>

<figure>
<p><img src="/assets/media/ec-build-2023/night_build.jpg" alt="The complete fort from the side." /></p>
  <figcaption>A view of the fort from the materials pile. This was as we were closing up for the night right after East Side Party. We still needed to put away a bunch of stuff in preparation for deconstruction, which was happening very soon. On the right is the tennis bubble. Also notice the <a href="https://www.strongtie.com/">Simpson Strong-Tie</a> banner. They’ve been good sponsors of the EC Build for many years, providing us with their wonderful fasteners.</figcaption>
</figure>

<h2 id="deconstruction">Deconstruction</h2>

<p>On my last full day in Cambridge, I helped out for a little bit in the morning and a few hours at night in the deconstruction effort. It was the day before Registration Day, and the team (especially hardworking Jordan P.A.) was getting stressed out that not very much progress was being made on deconstruction. It would be even harder to work on getting the build down once classes start, so Registration Day was a soft deadline.</p>

<p>The people who were showing up were working very hard, but there was severe difficulty getting people to show up. It’s been true even in the best of times, like in previous years when our target audience lived right next to the build site. But now, volunteers would need to actively show up, and not many did.</p>

<p>I remember being very involved in deconstruction last year in 2022, and it was the exact same situation as this year: the build group chat would have 100 people, and frequent desperate bumps, and fewer than six people (and the same six people!) would show up every day to help out. The difference was that then, I could see the lurkers through their bedroom windows from the courtyard. Now, they were just out of sight.</p>

<p>Maybe things were different before the pandemic. I don’t think this pattern is sustainable. A spectacle of a community’s labor needs a community’s labor. There needs to be a major overhaul in assessing or cultivating interested student labor in seeing the project through, or the project needs to be downscaled or shelved. Of course, it’s harder when the dorm is under renovation.</p>

<figure>
<p><img src="/assets/media/ec-build-2023/tan_lines.jpg" alt="A few 2x4s with stripes running across." /></p>
  <figcaption>When I was disassembling some of the railings for deconstruction, I noticed some of them developed tan lines. I don’t actually know whether it’s from sun exposure or dirt. Maybe both.</figcaption>
</figure>

<figure>
<p><img src="/assets/media/ec-build-2023/disassemble_1.jpg" alt="Fort with the walls removed and the towers mostly still there." /></p>
  <figcaption>The state of the fort in disassembly the morning I showed up.</figcaption>
</figure>

<figure>
<p><img src="/assets/media/ec-build-2023/disassemble_2.jpg" alt="Fort with walls and most of the second floor gone." /></p>
  <figcaption>The state of the fort in disassembly my last night in Cambridge. I wasn’t there the whole day: just for an hour at the front and a few hours at the back. There was a whole lot of progress. I’m told it was the most people (about 10) on site since the East Side Party, to the relief of the hard-working students in charge.</figcaption>
</figure>

<p>I always thought it was weird how soon we need to deconstruct. The full build would only be up for a few days to a week. Maybe it’s just been the case these past few post-pandemic years, and maybe in earlier years the builds would be up for longer. I don’t know. But the designs go through so much review, and the build takes so long, that it feels a bit like a waste that the builds have to go down so quickly.</p>

<p>The build uses more-or-less the <a href="https://en.wikipedia.org/wiki/Framing_(construction)">same techniques</a> as typical American wooden-frame houses do. I suspect (with absolutely zero education in mechanical engineering) that the structure could stand safely for months, if not years, with how large the safety margins are. It seems like a lot of overengineering for something that comes up and comes right back down. I get needing to take it down this year because we were using borrowed fields, but I feel like they could stay up just fine in the East Campus courtyard. Maybe it’s the Institute’s fear of liability or something.</p>

<p>I suppose it’s not so different from running a school play. Professional actors doing Broadway shows rehearse for months and do eight shows a week for months and months afterward. For school plays, actors have a couple months of (part time) rehearsal for a handful of shows, and then it’s all over. Both EC build and school plays have that bizarre ratio between preparation and runtime.</p>

<p>It’s probably not a controversial take that the EC build is more for the spectacle than it is for the structure itself.</p>

<h2 id="disposal">Disposal</h2>
<p>There’s a step that comes after deconstruction.</p>

<p>Usually after deconstruction, the builders and other EC residents salvage some of the lumber for their own projects. Most typical is the loft, which you might see in my <a href="/blog/dorm-room/">room tour</a> or <a href="https://mitadmissions.org/blogs/entry/a-mans-reach-should-exceed-his-grasp/">other people’s posts</a>. At some point I’ll write a more comprehensive post about my experience with my salvaged wood in 2022.</p>

<p>All the wood that remains after salvage (which this year was quite a lot, considering there was no East Campus in which to squirrel away all the wood) is taken for disposal.</p>

<hr />
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:pandemic" role="doc-endnote">
      <p>The <a href="https://en.wikipedia.org/wiki/COVID-19_pandemic">pandemic that shut down 2020</a> notwithstanding. <a href="#fnref:pandemic" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:reprise" role="doc-endnote">
      <p>Here’s an article about a <a href="https://spectrum.mit.edu/spring-2017/building-traditions/">previous build from 2016</a>, though it also mentions Next Haunt, which is something Next House does. See also <a href="https://mitadmissions.org/blogs/entry/roller-coasters-and-dance-parties/">EC Build 2014</a>. <a href="#fnref:reprise" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:anhad" role="doc-endnote">
      <p>Anhad is a genius. He’s going to change the world one day. <a href="#fnref:anhad" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Martin Chan</name><email>martinch@mit.edu</email></author><category term="post" /><summary type="html"><![CDATA[I talk about my experience at the 2023 annual MIT East Campus build effort. I helped out for a little bit during my visit to Cambridge a few weeks ago.]]></summary></entry><entry><title type="html">Basic Multiplier</title><link href="https://www.martinchan.org/blog/basic-multiplier/" rel="alternate" type="text/html" title="Basic Multiplier" /><published>2023-09-28T00:00:00-04:00</published><updated>2023-09-28T00:00:00-04:00</updated><id>https://www.martinchan.org/blog/basic-multiplier</id><content type="html" xml:base="https://www.martinchan.org/blog/basic-multiplier/"><![CDATA[<ol id="markdown-toc">
  <li><a href="#introduction" id="markdown-toc-introduction">Introduction</a></li>
  <li><a href="#testing" id="markdown-toc-testing">Testing</a>    <ol>
      <li><a href="#test-case-generation" id="markdown-toc-test-case-generation">Test Case Generation</a></li>
      <li><a href="#bluespec-testbench" id="markdown-toc-bluespec-testbench">Bluespec Testbench</a></li>
      <li><a href="#dummy-implementation" id="markdown-toc-dummy-implementation">Dummy Implementation</a></li>
    </ol>
  </li>
  <li><a href="#designs" id="markdown-toc-designs">Designs</a>    <ol>
      <li><a href="#first-implementation" id="markdown-toc-first-implementation">First Implementation</a></li>
      <li><a href="#radix-2-signed-multiplication" id="markdown-toc-radix-2-signed-multiplication">Radix-2 Signed Multiplication</a></li>
      <li><a href="#radix-4-multiplication" id="markdown-toc-radix-4-multiplication">Radix-4 Multiplication</a></li>
      <li><a href="#radix-8-multiplication" id="markdown-toc-radix-8-multiplication">Radix-8 Multiplication</a></li>
      <li><a href="#radix-16-consideration" id="markdown-toc-radix-16-consideration">Radix-16 Consideration</a></li>
    </ol>
  </li>
  <li><a href="#reduced-adder-design-for-radix-8" id="markdown-toc-reduced-adder-design-for-radix-8">Reduced Adder Design for Radix-8</a>    <ol>
      <li><a href="#lower-area-higher-delay" id="markdown-toc-lower-area-higher-delay">Lower Area, Higher Delay</a></li>
      <li><a href="#experimentation" id="markdown-toc-experimentation">Experimentation</a></li>
      <li><a href="#exercise-for-the-reader" id="markdown-toc-exercise-for-the-reader">Exercise for the Reader</a></li>
    </ol>
  </li>
  <li><a href="#built-in-multiplier" id="markdown-toc-built-in-multiplier">Built-in Multiplier</a></li>
  <li><a href="#next-time" id="markdown-toc-next-time">Next Time</a></li>
</ol>

<h2 id="introduction">Introduction</h2>
<p>Let’s write a simple multiplier that we can test and eventually put into our <a href="/projects/processor/">processor</a>. The eventual aim is to implement the RISC-V “M” Standard Extension for Integer Multiplication and Division. It’s a small extension that takes only a couple paragraphs per operation in the RISC-V specification.</p>

<p>This post is the first in a series of worked examples and explorations using Bluespec. It’s also an opportunity for me to use the <a href="/projects/bluespec-lexer/">Bluespec lexer</a> and <a href="/projects/vscode-bsv/">Bluespec extension for VS Code</a> I worked so hard on.</p>

<p>I begin by discussing how we create our testbench (in this house, we believe in <a href="https://en.wikipedia.org/wiki/Test-driven_development">test-driven development</a>) and continue by going through a sequence of multiplier designs from the non-functional to decent. We don’t go into advanced multipliers; that’s more for a later post.</p>

<p>For each design, I present a Bluespec excerpt and some critical-path, area, and cycle numbers. The synthesis-related numbers come from using the <a href="https://github.com/minispec-hdl/minispec">Minispec</a> <code class="language-plaintext highlighter-rouge">synth</code> tool. The numbers are probably worse than if we implemented directly in Verilog, but that’s alright for now.</p>

<p>All the code here was written from scratch and is hosted on <a href="https://github.com/mchanphilly/basic-multiplier/tree/main">this GitHub repo</a>. I didn’t keep all the different implementations in the most recent commit, but you can go through the commit history for earlier stages.</p>

<h2 id="testing">Testing</h2>
<p>It’s both quicker and easier to debug a little module like a multiplier by running it through <a href="https://en.wikipedia.org/wiki/Unit_testing">unit tests</a> than by putting it into a huge processor and seeing whether the processor breaks. In the hardware world, I believe this is called <em>verification</em>.</p>

<p>In this case, I chose to do it as a two-step process. We first generate the tests (in whatever way we want), then we write a Bluespec testbench that consumes the tests and probes an instance of our multiplier-to-be-tested.</p>

<h3 id="test-case-generation">Test Case Generation</h3>
<p>I chose to generate test cases as a hexadecimal file using a C script. If you’re unfamiliar with compiled languages like C, I just write my <code class="language-plaintext highlighter-rouge">multiplier.c</code> file, run it through a compiler with <code class="language-plaintext highlighter-rouge">gcc multiplier.c -o multiplier</code>, and execute the binary <code class="language-plaintext highlighter-rouge">./multiplier</code>. I’ve tucked most of the compilation commands I’ll be using into a <a href="https://github.com/mchanphilly/basic-multiplier/blob/main/Makefile">Makefile</a>.</p>

<p>My <a href="https://github.com/mchanphilly/basic-multiplier/blob/main/multiplier.c">C script</a> outputs into a text file with one test per line, and each line as hexadecimal that my testbench will consume. Each line is basically a packed <code class="language-plaintext highlighter-rouge">[a, b, ab]</code>, where <code class="language-plaintext highlighter-rouge">a times b equals ab</code>. There are 32 bits allocated to each of <code class="language-plaintext highlighter-rouge">a</code> and <code class="language-plaintext highlighter-rouge">b</code>, and 64 bits for <code class="language-plaintext highlighter-rouge">ab</code>, since we might need to supply the upper 32 bits as per the RISC-V “M” extension specification.</p>
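<p>As a minimal sketch of why the product gets 64 bits, here is how the two 32-bit halves relate to the RISC-V instructions: <code class="language-plaintext highlighter-rouge">MUL</code> returns the lower word and <code class="language-plaintext highlighter-rouge">MULH</code> returns the upper word of the signed product. The helper names <code class="language-plaintext highlighter-rouge">product_upper</code> and <code class="language-plaintext highlighter-rouge">product_lower</code> are made up for illustration and aren’t in the repo.</p>

```c
#include <stdint.h>

// Hypothetical helper: upper 32 bits of the signed 64-bit product,
// i.e. the word that a signed-by-signed MULH would return.
uint32_t product_upper(int32_t a, int32_t b) {
    int64_t ab = (int64_t)a * (int64_t)b;
    return (uint32_t)((uint64_t)ab >> 32);
}

// Hypothetical helper: lower 32 bits of the product, i.e. the MUL result.
uint32_t product_lower(int32_t a, int32_t b) {
    int64_t ab = (int64_t)a * (int64_t)b;
    return (uint32_t)((uint64_t)ab & 0xffffffffu);
}
```

<p>Concatenating the upper and lower words gives back the 16 hex digits that end each test line.</p>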

<p>Here’s what generating each test case looks like. We perform the multiplication in C, then log both operands and the output in hex. Later, our testbench will feed these operands into our Bluespec multiplier and check that the outcome is equal to the outcome we got in C.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;stdint.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
</span>
<span class="c1">// in main(), we instantiate the file pointer with something like this:</span>
<span class="c1">// FILE *fptr = fopen("test_cases.vmh", "w");</span>

<span class="c1">// Log the case</span>
<span class="kt">void</span> <span class="nf">log_multiply</span><span class="p">(</span><span class="kt">FILE</span><span class="o">*</span> <span class="n">fptr</span><span class="p">,</span> <span class="kt">int32_t</span> <span class="n">a</span><span class="p">,</span> <span class="kt">int32_t</span> <span class="n">b</span><span class="p">)</span> <span class="p">{</span>
    <span class="kt">int64_t</span> <span class="n">ab</span> <span class="o">=</span> <span class="p">(</span><span class="kt">int64_t</span><span class="p">)</span><span class="n">a</span> <span class="o">*</span> <span class="p">(</span><span class="kt">int64_t</span><span class="p">)</span><span class="n">b</span><span class="p">;</span>
    <span class="cp">#ifdef DEBUG
</span>    <span class="n">printf</span><span class="p">(</span><span class="s">"%x times %x is %lx</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">ab</span><span class="p">);</span>
    <span class="cp">#endif
</span>    <span class="c1">// format is a, b, ab_upper, ab_lower</span>
    <span class="n">fprintf</span><span class="p">(</span><span class="n">fptr</span><span class="p">,</span> <span class="s">"%08x%08x%016lx</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">ab</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<p>We use two batches of tests: special cases, then random cases. I also cap off the test cases with a <a href="https://en.wikipedia.org/wiki/Sentinel_value">sentinel value</a> so our testbench knows when to end.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">end_case</span><span class="p">(</span><span class="kt">FILE</span><span class="o">*</span> <span class="n">fptr</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"deadbeef cap</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
    <span class="n">fprintf</span><span class="p">(</span><span class="n">fptr</span><span class="p">,</span> <span class="s">"%032x"</span><span class="p">,</span> <span class="mh">0xdeadbeef</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<h4 id="special-cases">Special Cases</h4>
<p>The first batch is to check a set of hand-picked test cases. I selected these particular test cases because they were either simple cases that we could check by eye (so we know that our test generation itself works) or edge cases like multiplying <code class="language-plaintext highlighter-rouge">INT32_MIN</code> by itself, which yields the largest signed product.</p>

<p>Between these test cases, we should be covering the breadth of possible inputs (e.g., multiply by 0, multiplying different signs, multiplying same signs). Another way of saying it is that if we pass this small set of tests, we have a pretty strong sign that our multiplier is fully correct.</p>

<p>In our case, our special cases are:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">specific_multiply</span><span class="p">(</span><span class="kt">FILE</span><span class="o">*</span> <span class="n">fptr</span><span class="p">)</span> <span class="p">{</span>
    <span class="c1">// Non-negatives first</span>
    <span class="n">log_multiply</span><span class="p">(</span><span class="n">fptr</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>  <span class="c1">// anything times 0 is 0, easy</span>
    <span class="n">log_multiply</span><span class="p">(</span><span class="n">fptr</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>  
    <span class="n">log_multiply</span><span class="p">(</span><span class="n">fptr</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>  <span class="c1">// 1 identity, easy</span>
    <span class="n">log_multiply</span><span class="p">(</span><span class="n">fptr</span><span class="p">,</span> <span class="n">INT32_MAX</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="n">log_multiply</span><span class="p">(</span><span class="n">fptr</span><span class="p">,</span> <span class="n">INT32_MAX</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
    <span class="n">log_multiply</span><span class="p">(</span><span class="n">fptr</span><span class="p">,</span> <span class="n">INT32_MAX</span><span class="p">,</span> <span class="n">INT32_MAX</span><span class="p">);</span>
    <span class="c1">// Signed below</span>
    <span class="n">log_multiply</span><span class="p">(</span><span class="n">fptr</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="n">log_multiply</span><span class="p">(</span><span class="n">fptr</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>  <span class="c1">// sign extended</span>
    <span class="n">log_multiply</span><span class="p">(</span><span class="n">fptr</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">);</span>  <span class="c1">// negative times negative?</span>
    <span class="n">log_multiply</span><span class="p">(</span><span class="n">fptr</span><span class="p">,</span> <span class="n">INT32_MIN</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
    <span class="n">log_multiply</span><span class="p">(</span><span class="n">fptr</span><span class="p">,</span> <span class="n">INT32_MIN</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
    <span class="n">log_multiply</span><span class="p">(</span><span class="n">fptr</span><span class="p">,</span> <span class="n">INT32_MIN</span><span class="p">,</span> <span class="n">INT32_MIN</span><span class="p">);</span>  <span class="c1">// biggest possible result</span>
    <span class="n">log_multiply</span><span class="p">(</span><span class="n">fptr</span><span class="p">,</span> <span class="n">INT32_MAX</span><span class="p">,</span> <span class="n">INT32_MIN</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>I know there’s a lot of repeated boilerplate here. If we wanted to, we could store all the test pairs in an array or a separate file and then iterate through them, calling <code class="language-plaintext highlighter-rouge">log_multiply(fptr, a, b)</code> for each <code class="language-plaintext highlighter-rouge">a, b</code> pair.</p>
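<p>For the curious, a table-driven version might look roughly like the sketch below. The <code class="language-plaintext highlighter-rouge">operand_pair</code> type and <code class="language-plaintext highlighter-rouge">SPECIAL_CASES</code> array are hypothetical names, not what’s in the repo, and I’ve assumed the <code class="language-plaintext highlighter-rouge">PRIx32</code>/<code class="language-plaintext highlighter-rouge">PRIx64</code> macros from <code class="language-plaintext highlighter-rouge">inttypes.h</code> so the format string doesn’t depend on the width of <code class="language-plaintext highlighter-rouge">long</code>.</p>

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

// Hypothetical pair type; not part of the original multiplier.c.
typedef struct {
    int32_t a;
    int32_t b;
} operand_pair;

// The same 13 hand-picked cases, kept as data instead of as calls.
static const operand_pair SPECIAL_CASES[] = {
    {0, 0}, {0, 1}, {1, 1},
    {INT32_MAX, 0}, {INT32_MAX, 1}, {INT32_MAX, INT32_MAX},
    {-1, 0}, {-1, 1}, {-1, -1},
    {INT32_MIN, 0}, {INT32_MIN, 1}, {INT32_MIN, INT32_MIN},
    {INT32_MAX, INT32_MIN},
};

#define NUM_SPECIAL_CASES (sizeof SPECIAL_CASES / sizeof SPECIAL_CASES[0])

// Same packed line format as before: a, b, then the 64-bit product.
void log_multiply(FILE *fptr, int32_t a, int32_t b) {
    int64_t ab = (int64_t)a * (int64_t)b;
    fprintf(fptr, "%08" PRIx32 "%08" PRIx32 "%016" PRIx64 "\n",
            (uint32_t)a, (uint32_t)b, (uint64_t)ab);
}

// Walk the table and log one line per pair.
void specific_multiply(FILE *fptr) {
    for (size_t i = 0; i < NUM_SPECIAL_CASES; i++) {
        log_multiply(fptr, SPECIAL_CASES[i].a, SPECIAL_CASES[i].b);
    }
}
```

<p>Adding a case then means adding one row to the table rather than another call.</p>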

<p>I think a two-step process is good enough for this worked example, so we’re keeping it a little simple at the cost of some repeated code. If we had a more complicated design that we were testing with many more test cases, then we might want to encode our human-readable cases separately from the C code.</p>

<p>In the <code class="language-plaintext highlighter-rouge">test_cases.vmh</code> hex file, these cases look like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>00000000000000000000000000000000
00000000000000010000000000000000
00000001000000010000000000000001
7fffffff000000000000000000000000
7fffffff00000001000000007fffffff
7fffffff7fffffff3fffffff00000001
ffffffff000000000000000000000000
ffffffff00000001ffffffffffffffff
ffffffffffffffff0000000000000001
80000000000000000000000000000000
8000000000000001ffffffff80000000
80000000800000004000000000000000
7fffffff80000000c000000080000000
</code></pre></div></div>
<p>Each 32-bit segment is represented by eight characters. For example, look at the third line, which corresponds to <code class="language-plaintext highlighter-rouge">1 times 1 equals 1</code>. On the left half we have a <code class="language-plaintext highlighter-rouge">00000001</code> for each operand, and on the right, we have a <code class="language-plaintext highlighter-rouge">1</code> with 15 leading zeros as the product.</p>
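<p>To make the layout concrete, here is a small sketch that splits one line back into its fields and recomputes the product. The <code class="language-plaintext highlighter-rouge">test_line</code> type and both function names are hypothetical. It also assumes the usual two’s-complement conversion when the operands are reinterpreted as signed.</p>

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

// Hypothetical decoded form of one line of test_cases.vmh.
typedef struct {
    uint32_t a;
    uint32_t b;
    uint64_t ab;
} test_line;

// Reads 8 + 8 + 16 hex digits. Returns 1 if all three fields parsed.
int parse_test_line(const char *line, test_line *out) {
    return sscanf(line, "%8" SCNx32 "%8" SCNx32 "%16" SCNx64,
                  &out->a, &out->b, &out->ab) == 3;
}

// Recomputes the signed product and compares it with the logged one.
int line_is_consistent(const test_line *t) {
    int64_t product = (int64_t)(int32_t)t->a * (int64_t)(int32_t)t->b;
    return (uint64_t)product == t->ab;
}
```

<p>Running the last special case, <code class="language-plaintext highlighter-rouge">7fffffff80000000c000000080000000</code>, through <code class="language-plaintext highlighter-rouge">parse_test_line</code> gives <code class="language-plaintext highlighter-rouge">a = 0x7fffffff</code>, <code class="language-plaintext highlighter-rouge">b = 0x80000000</code>, and the 64-bit product <code class="language-plaintext highlighter-rouge">0xc000000080000000</code>.</p>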

<p>Most of our bugs should be caught by our well-selected test cases.</p>

<h4 id="randomized-tests">Randomized Tests</h4>
<p>The second batch is much larger and consists of randomized tests. For a simple design like a multiplier, we shouldn’t be getting any surprises in this batch. Randomized tests are more significant for complex designs where the input space is bigger or bothersome to write special cases for (e.g., testing correct functionality of a cache). But it’s good practice to do it here too.</p>

<p>Because our multiplier is simple, these are mostly to measure performance (in terms of cycles per multiplication, asymptotically) and partly to give us the confidence that our multiplier works even when we don’t know the inputs.</p>

<p>It doesn’t matter much here since I’m both testing and developing the multipliers, but randomized tests can also protect against bad multipliers that just hardcode special cases. Or maybe that’s just why they did it in school.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;stdlib.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;stdint.h&gt;</span><span class="cp">
</span>
<span class="kt">void</span> <span class="nf">rand_multiply</span><span class="p">(</span><span class="kt">FILE</span><span class="o">*</span> <span class="n">fptr</span><span class="p">)</span> <span class="p">{</span>
    <span class="kt">int32_t</span> <span class="n">a</span> <span class="o">=</span> <span class="n">rand</span><span class="p">()</span> <span class="o">-</span> <span class="n">rand</span><span class="p">();</span>  <span class="c1">// spans negatives and positives, up to +/- RAND_MAX</span>
    <span class="kt">int32_t</span> <span class="n">b</span> <span class="o">=</span> <span class="n">rand</span><span class="p">()</span> <span class="o">-</span> <span class="n">rand</span><span class="p">();</span>
    <span class="n">log_multiply</span><span class="p">(</span><span class="n">fptr</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>These test cases are a lot less pleasant to read than our earlier test cases, but we put them in the <code class="language-plaintext highlighter-rouge">test_cases.vmh</code> all the same. Because the special cases confirmed that our test generation works, we can trust the expected products here without checking them by eye.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>f2161e7bfbd104f5003a3531831017b7
f69277035113f1d3fd039b9ee4faea79
f8e5e6de58f1fdd3fd8850c001a4aefa
19d0ac8848a2aa54075317c0791aeca0
aacbcda9489a4485e7d5f9a31e2cbccd
05257b1e54e6803e01b4eea67196d144
c0fc893844312d6cef36f99af260bba0
076ec261eba76b3bff68c66bda0c575b
</code></pre></div></div>
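<p>One caveat about <code class="language-plaintext highlighter-rouge">rand() - rand()</code>: the difference never exceeds <code class="language-plaintext highlighter-rouge">RAND_MAX</code> in magnitude, so <code class="language-plaintext highlighter-rouge">INT32_MIN</code> is never drawn, and on a platform where <code class="language-plaintext highlighter-rouge">RAND_MAX</code> is only 32767 the operands would be tiny. If that mattered, one possible alternative is to assemble all 32 bits a byte at a time. This is only a sketch; <code class="language-plaintext highlighter-rouge">rand_u32</code> and <code class="language-plaintext highlighter-rouge">rand_i32</code> are hypothetical helpers that I didn’t use for the numbers in this post.</p>

```c
#include <stdint.h>
#include <stdlib.h>

// Hypothetical helper: build 32 random bits from four rand() calls.
uint32_t rand_u32(void) {
    uint32_t value = 0;
    for (int i = 0; i < 4; i++) {
        // RAND_MAX is at least 32767, so bits 7..14 always exist.
        value = (value << 8) | (uint32_t)((rand() >> 7) & 0xff);
    }
    return value;
}

// Reinterpret the bits as a signed operand, so every 32-bit pattern
// (INT32_MIN included) is at least representable.
int32_t rand_i32(void) {
    return (int32_t)rand_u32();
}
```

<p>With the same seed, this produces the same sequence on a given C library, so the test file stays reproducible.</p>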
<p>For my testing, I wrote 13 special test cases and generated 900 randomized test cases. We can measure performance by dividing our total cycles by 913 as soon as we pass all tests.</p>

<p>For completeness, this is what my <code class="language-plaintext highlighter-rouge">main()</code> looks like (though you can look for yourself in the <a href="https://github.com/mchanphilly/basic-multiplier/blob/main/multiplier.c">repo</a>):</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
    <span class="k">const</span> <span class="kt">int</span> <span class="n">SEED</span> <span class="o">=</span> <span class="mi">2</span><span class="p">;</span>
    <span class="k">const</span> <span class="kt">int</span> <span class="n">cases</span> <span class="o">=</span> <span class="mi">900</span><span class="p">;</span>

    <span class="n">srand</span><span class="p">(</span><span class="n">SEED</span><span class="p">);</span>
    <span class="kt">FILE</span> <span class="o">*</span><span class="n">fptr</span> <span class="o">=</span> <span class="n">fopen</span><span class="p">(</span><span class="s">"test_cases.vmh"</span><span class="p">,</span> <span class="s">"w"</span><span class="p">);</span>

    <span class="n">specific_multiply</span><span class="p">(</span><span class="n">fptr</span><span class="p">);</span>

    <span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">cases</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">rand_multiply</span><span class="p">(</span><span class="n">fptr</span><span class="p">);</span>
    <span class="p">}</span>

    <span class="n">end_case</span><span class="p">(</span><span class="n">fptr</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>

<h4 id="other-means-of-generation">Other Means of Generation</h4>

<p>Alternatively, I could’ve generated test cases directly in Bluespec, or in any other language that can write to a file.</p>

<p>If I had a reference Bluespec or Verilog implementation of a multiplier, I could’ve also used that to generate test cases during runtime inside of the testbench. That would’ve been a fine option, considering Bluespec actually <em>has</em> a built-in multiplier with the <code class="language-plaintext highlighter-rouge">*</code> operator, which I <a href="#built-in-multiplier">discuss later</a>. We could’ve produced a pair of operands and fed them into both our multiplier-under-test and our reference multiplier. Then, we could’ve seen if they produced the same results.</p>
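<p>The same idea can be sketched in C, with a software shift-and-add multiplier standing in for the multiplier-under-test and the built-in <code class="language-plaintext highlighter-rouge">*</code> as the reference. All three function names are hypothetical, and the shift-and-add routine is only an illustration, not one of the designs from this post.</p>

```c
#include <stdint.h>

// Hypothetical stand-in for the design under test: shift-and-add on the
// sign-extended operands, adding one partial product per set bit of b.
uint64_t shift_add_multiply(int32_t a, int32_t b) {
    uint64_t multiplicand = (uint64_t)(int64_t)a;  // sign-extend to 64 bits
    uint64_t multiplier = (uint64_t)(int64_t)b;
    uint64_t product = 0;
    for (int i = 0; i < 64; i++) {
        if ((multiplier >> i) & 1) {
            product += multiplicand << i;  // unsigned math wraps mod 2^64
        }
    }
    return product;
}

// Reference model: the language's built-in multiply.
uint64_t reference_multiply(int32_t a, int32_t b) {
    return (uint64_t)((int64_t)a * (int64_t)b);
}

// Returns 1 when both models agree on this operand pair.
int models_agree(int32_t a, int32_t b) {
    return shift_add_multiply(a, b) == reference_multiply(a, b);
}
```

<p>Feeding every operand pair through <code class="language-plaintext highlighter-rouge">models_agree</code> removes the need for a stored expected value, at the cost of trusting the reference.</p>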

<p>I figured it would be easier to write our test cases using C since it has higher-level constructs than Bluespec. Also, I haven’t yet learned how to write a series of sequential test cases in Bluespec like I did in C.</p>

<h3 id="bluespec-testbench">Bluespec Testbench</h3>
<p>Once we have our test cases in our <code class="language-plaintext highlighter-rouge">test_cases.vmh</code>, we can make <a href="https://github.com/mchanphilly/basic-multiplier/blob/main/MultiplierUnitTest.bsv">our Bluespec testbench</a> consume them. I did this by having our testbench instantiate a <code class="language-plaintext highlighter-rouge">BRAM</code> that loads from our hex file. Then, we can send <code class="language-plaintext highlighter-rouge">read</code> requests to the <code class="language-plaintext highlighter-rouge">BRAM</code> for each test case, one line at a time.</p>

<div class="language-bluespec highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">module</span><span class="w"> </span><span class="nf">mkMultiplierUnitTest</span><span class="p">(</span><span class="nc">Empty</span><span class="p">);</span><span class="w">
    </span><span class="k">let</span><span class="w"> </span><span class="nv">cfg</span><span class="w"> </span><span class="o">= </span><span class="nv">defaultValue</span><span class="p">;</span><span class="w">
    </span><span class="nv">cfg</span><span class="p">.</span><span class="nv">loadFormat</span><span class="w"> </span><span class="o">= </span><span class="kd">tagged</span><span class="w"> </span><span class="nc">Hex</span><span class="w"> </span><span class="dl">"</span><span class="s">test_cases.vmh</span><span class="dl">"</span><span class="p">;</span><span class="w">
    </span><span class="nc">BRAM1Port#</span><span class="p">(</span><span class="nc">MaxTestAddress</span><span class="p">,</span><span class="w"> </span><span class="nc">TestPacket</span><span class="p">)</span><span class="w"> </span><span class="nv">tests</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="na">mkBRAM1Server</span><span class="p">(</span><span class="nv">cfg</span><span class="p">);</span><span class="w">

    </span><span class="nc">MultiplierUnit</span><span class="w"> </span><span class="nv">dut</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="na">mkMultiplierUnit</span><span class="p">;</span><span class="w">
    </span><span class="c">// ... everything else ...</span><span class="w">
</span><span class="kd">endmodule</span><span class="w">
</span></code></pre></div></div>

<p>My testbench <code class="language-plaintext highlighter-rouge">mkMultiplierUnitTest</code> uses five rules:</p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">puts</code> submits a new read request for every line of our <code class="language-plaintext highlighter-rouge">BRAM</code></li>
  <li><code class="language-plaintext highlighter-rouge">question</code> receives the <code class="language-plaintext highlighter-rouge">BRAM</code> response, queries our multiplier-under-test, and enqueues the expected result.
    <ul>
      <li>Terminates if it detects our sentinel value <a href="https://en.wikipedia.org/wiki/Deadbeef"><code class="language-plaintext highlighter-rouge">deadbeef</code></a>.</li>
    </ul>
  </li>
  <li><code class="language-plaintext highlighter-rouge">answer</code> receives the multiplier response and compares it to our expected result, which it then discards.
    <ul>
      <li>Terminates if we receive a wrong answer.</li>
    </ul>
  </li>
  <li><code class="language-plaintext highlighter-rouge">tick</code> (minor) increments our cycle counter.</li>
  <li><code class="language-plaintext highlighter-rouge">terminate</code> (minor) ends our simulation if we end up stalling.</li>
</ul>
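<p>If it helps to see the control flow without the hardware, the <code class="language-plaintext highlighter-rouge">puts</code>, <code class="language-plaintext highlighter-rouge">question</code>, and <code class="language-plaintext highlighter-rouge">answer</code> rules can be approximated by a sequential loop in C. This is a rough software analogy under my own assumptions: the rules really fire concurrently, and <code class="language-plaintext highlighter-rouge">run_tests</code>, <code class="language-plaintext highlighter-rouge">multiply_fn</code>, and <code class="language-plaintext highlighter-rouge">builtin_multiply</code> are hypothetical names.</p>

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

// Hypothetical signature for a software model of the multiplier under test.
typedef uint64_t (*multiply_fn)(int32_t a, int32_t b);

// Example model to plug in: the built-in multiply.
uint64_t builtin_multiply(int32_t a, int32_t b) {
    return (uint64_t)((int64_t)a * (int64_t)b);
}

// Walks the lines in order and returns how many tests passed before the
// sentinel, a wrong answer, or a malformed line. *ok is set to 1 only
// when the sentinel is reached without a wrong answer.
size_t run_tests(const char *const *lines, size_t count, multiply_fn dut,
                 int *ok) {
    size_t solved = 0;
    *ok = 0;
    for (size_t i = 0; i < count; i++) {
        uint32_t a, b;
        uint64_t expected;
        if (sscanf(lines[i], "%8" SCNx32 "%8" SCNx32 "%16" SCNx64,
                   &a, &b, &expected) != 3) {
            return solved;  // malformed line
        }
        if ((uint32_t)expected == 0xdeadbeefu) {  // sentinel in the low word
            *ok = 1;
            return solved;
        }
        if (dut((int32_t)a, (int32_t)b) != expected) {
            return solved;  // wrong answer
        }
        solved++;
    }
    return solved;
}
```

<p>As in the testbench, only the lowest word is compared against the sentinel, so a genuine product whose low word happens to be <code class="language-plaintext highlighter-rouge">deadbeef</code> would also end the run early.</p>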

<p>We can’t test something whose interface we don’t know, so here’s our <code class="language-plaintext highlighter-rouge">MultiplierUnit</code> interface below. We only worry about two methods: one to <code class="language-plaintext highlighter-rouge">start</code> computation, and one to free the multiplier and get the <code class="language-plaintext highlighter-rouge">result</code>. Internally, the implementation should have these methods guarded so that they’re only callable when the module is ready to perform each method.</p>
<div class="language-bluespec highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">typedef</span><span class="w"> </span><span class="nc">Bit#</span><span class="p">(</span><span class="mi">32</span><span class="p">)</span><span class="w"> </span><span class="nc">Word</span><span class="p">;</span><span class="w">
</span><span class="kd">typedef</span><span class="w"> </span><span class="nc">Vector#</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span><span class="w"> </span><span class="nc">Word</span><span class="p">)</span><span class="w"> </span><span class="nc">Pair</span><span class="p">;</span><span class="w">

</span><span class="kd">interface</span><span class="w"> </span><span class="nc">MultiplierUnit</span><span class="p">;</span><span class="w">
    </span><span class="kd">method</span><span class="w"> </span><span class="k">Action</span><span class="w"> </span><span class="nf">start</span><span class="p">(</span><span class="nc">Pair</span><span class="w"> </span><span class="kd">in</span><span class="p">);</span><span class="w">
    </span><span class="kd">method</span><span class="w"> </span><span class="k">ActionValue#</span><span class="p">(</span><span class="nc">Pair</span><span class="p">)</span><span class="w"> </span><span class="nf">result</span><span class="p">;</span><span class="w">
</span><span class="kd">endinterface</span><span class="w">
</span></code></pre></div></div>
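<p>To illustrate what the guards buy us, here is a loose C model of the same two-method protocol. It is only an analogy: in Bluespec a guard keeps the calling rule from firing at all, whereas this hypothetical <code class="language-plaintext highlighter-rouge">multiplier_model</code> reports a refused call by returning <code class="language-plaintext highlighter-rouge">false</code>, and it returns the product as one 64-bit value rather than a <code class="language-plaintext highlighter-rouge">Pair</code>.</p>

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical software model of a guarded multiplier unit.
typedef struct {
    bool busy;         // true between an accepted start and its result
    uint64_t product;  // held until the result is collected
} multiplier_model;

// Models start: refuses new operands while a result is outstanding.
bool model_start(multiplier_model *m, int32_t a, int32_t b) {
    if (m->busy) {
        return false;  // guard not ready
    }
    m->product = (uint64_t)((int64_t)a * (int64_t)b);
    m->busy = true;
    return true;
}

// Models result: hands back the product and frees the unit.
bool model_result(multiplier_model *m, uint64_t *out) {
    if (!m->busy) {
        return false;  // nothing to collect yet
    }
    *out = m->product;
    m->busy = false;
    return true;
}
```

<p>A second <code class="language-plaintext highlighter-rouge">model_start</code> before the first result is collected gets refused, which is the behavior the testbench relies on to keep operands and expected results in step.</p>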

<p>Our <a href="https://github.com/mchanphilly/basic-multiplier/blob/main/MultiplierUnitTest.bsv">entire testbench</a> then looks like this:</p>
<div class="language-bluespec highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">typedef</span><span class="w"> </span><span class="nc">Bit#</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span><span class="w"> </span><span class="nc">MaxTestAddress</span><span class="p">;</span><span class="w">  </span><span class="c">// 10 bits for max 1024 tests</span><span class="w">
</span><span class="kd">typedef</span><span class="w"> </span><span class="nc">Vector#</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span><span class="w"> </span><span class="nc">Word</span><span class="p">)</span><span class="w"> </span><span class="nc">TestPacket</span><span class="p">;</span><span class="w">

</span><span class="kd">module</span><span class="w"> </span><span class="nf">mkMultiplierUnitTest</span><span class="p">(</span><span class="nc">Empty</span><span class="p">);</span><span class="w">
    </span><span class="k">let</span><span class="w"> </span><span class="nv">cfg</span><span class="w"> </span><span class="o">= </span><span class="nv">defaultValue</span><span class="p">;</span><span class="w">
    </span><span class="nv">cfg</span><span class="p">.</span><span class="nv">loadFormat</span><span class="w"> </span><span class="o">= </span><span class="kd">tagged</span><span class="w"> </span><span class="nc">Hex</span><span class="w"> </span><span class="dl">"</span><span class="s">test_cases.vmh</span><span class="dl">"</span><span class="p">;</span><span class="w">
    </span><span class="nc">MultiplierUnit</span><span class="w"> </span><span class="nv">dut</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="na">mkMultiplierUnit</span><span class="p">;</span><span class="w">  </span><span class="c">// design under test</span><span class="w">

    </span><span class="nc">BRAM1Port#</span><span class="p">(</span><span class="nc">MaxTestAddress</span><span class="p">,</span><span class="w"> </span><span class="nc">TestPacket</span><span class="p">)</span><span class="w"> </span><span class="nv">tests</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="na">mkBRAM1Server</span><span class="p">(</span><span class="nv">cfg</span><span class="p">);</span><span class="w">
    </span><span class="nc">Reg#</span><span class="p">(</span><span class="nc">MaxTestAddress</span><span class="p">)</span><span class="w"> </span><span class="nv">request_index</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="na">mkReg</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span><span class="w">
    </span><span class="nc">FIFO#</span><span class="p">(</span><span class="nc">Pair</span><span class="p">)</span><span class="w"> </span><span class="nv">expected</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="na">mkFIFO</span><span class="p">;</span><span class="w">
    </span><span class="nc">Reg#</span><span class="p">(</span><span class="nc">Word</span><span class="p">)</span><span class="w"> </span><span class="nv">cycles</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="na">mkReg</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span><span class="w">
    </span><span class="nc">Reg#</span><span class="p">(</span><span class="nc">Word</span><span class="p">)</span><span class="w"> </span><span class="nv">last_solved</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="na">mkReg</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span><span class="w">

    </span><span class="kd">function</span><span class="w"> </span><span class="k">Action</span><span class="w"> </span><span class="nf">conclude</span><span class="p">;</span><span class="w">
        </span><span class="kr">action</span><span class="w">
        </span><span class="nb">$display</span><span class="p">(</span><span class="dl">"</span><span class="s">Ended at </span><span class="se">%0d</span><span class="s"> cycles after solving the </span><span class="se">%0d</span><span class="s"> test</span><span class="dl">"</span><span class="p">,</span><span class="w"> </span><span class="nv">cycles</span><span class="p">,</span><span class="w"> </span><span class="nv">last_solved</span><span class="p">);</span><span class="w">
        </span><span class="nb">$finish</span><span class="p">;</span><span class="w">
        </span><span class="kr">endaction</span><span class="w">
    </span><span class="kd">endfunction</span><span class="w">

    </span><span class="kd">rule</span><span class="w"> </span><span class="nf">tick</span><span class="p">;</span><span class="w">
        </span><span class="na">cycles </span><span class="o">&lt;= </span><span class="nv">cycles</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span><span class="w">
    </span><span class="kd">endrule</span><span class="w">

    </span><span class="c">// This rule keeps us requesting</span><span class="w">
    </span><span class="kd">rule</span><span class="w"> </span><span class="nf">puts</span><span class="p">;</span><span class="w">
        </span><span class="na">request_index </span><span class="o">&lt;= </span><span class="nv">request_index</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span><span class="w">
        </span><span class="k">let</span><span class="w"> </span><span class="nv">request</span><span class="w"> </span><span class="o">= </span><span class="nc">BRAMRequest</span><span class="p">{</span><span class="w">
            </span><span class="nv">write</span><span class="o">:</span><span class="w"> </span><span class="nv">unpack</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span><span class="w">
            </span><span class="nv">address</span><span class="o">:</span><span class="w"> </span><span class="nv">request_index</span><span class="w">
        </span><span class="p">};</span><span class="w">
        </span><span class="nv">tests</span><span class="p">.</span><span class="nv">portA</span><span class="p">.</span><span class="nv">request</span><span class="p">.</span><span class="na">put</span><span class="p">(</span><span class="nv">request</span><span class="p">);</span><span class="w">
    </span><span class="kd">endrule</span><span class="w">

    </span><span class="c">// This rule queries the dut</span><span class="w">
    </span><span class="kd">rule</span><span class="w"> </span><span class="nf">question</span><span class="p">;</span><span class="w">
        </span><span class="nc">TestPacket</span><span class="w"> </span><span class="nv">current_test</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nv">tests</span><span class="p">.</span><span class="nv">portA</span><span class="p">.</span><span class="nv">response</span><span class="p">.</span><span class="na">get</span><span class="p">();</span><span class="w">
        </span><span class="nv">current_test</span><span class="w"> </span><span class="o">= </span><span class="nv">reverse</span><span class="p">(</span><span class="nv">current_test</span><span class="p">);</span><span class="w">
        </span><span class="c">// Reverse order because of the way the vmh is written/read</span><span class="w">

        </span><span class="nc">Pair</span><span class="w"> </span><span class="nv">operands</span><span class="w"> </span><span class="o">= </span><span class="nv">unpack</span><span class="p">({</span><span class="nv">current_test</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span><span class="w"> </span><span class="nv">current_test</span><span class="p">[</span><span class="mi">1</span><span class="p">]});</span><span class="w">
        </span><span class="nc">Pair</span><span class="w"> </span><span class="nv">results</span><span class="w"> </span><span class="o">= </span><span class="nv">unpack</span><span class="p">({</span><span class="nv">current_test</span><span class="p">[</span><span class="mi">2</span><span class="p">],</span><span class="w"> </span><span class="nv">current_test</span><span class="p">[</span><span class="mi">3</span><span class="p">]});</span><span class="w">

        </span><span class="nv">dut</span><span class="p">.</span><span class="na">start</span><span class="p">(</span><span class="nv">operands</span><span class="p">);</span><span class="w">
        </span><span class="nv">expected</span><span class="p">.</span><span class="na">enq</span><span class="p">(</span><span class="nv">results</span><span class="p">);</span><span class="w">

        </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="nv">current_test</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="mi">'hdeadbeef</span><span class="p">)</span><span class="w"> </span><span class="kr">begin</span><span class="w">
            </span><span class="nb">$display</span><span class="p">(</span><span class="dl">"</span><span class="s">deadbeef detected; finishing at </span><span class="se">%0d</span><span class="s"> cycles</span><span class="dl">"</span><span class="p">,</span><span class="w"> </span><span class="nv">cycles</span><span class="p">);</span><span class="w">
            </span><span class="na">conclude</span><span class="p">;</span><span class="w">
        </span><span class="kr">end</span><span class="w">
    </span><span class="kd">endrule</span><span class="w">

    </span><span class="kd">rule</span><span class="w"> </span><span class="nf">answer</span><span class="p">;</span><span class="w">
        </span><span class="nc">Pair</span><span class="w"> </span><span class="nv">result</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="nv">dut</span><span class="p">.</span><span class="na">result</span><span class="p">;</span><span class="w">
        </span><span class="na">last_solved </span><span class="o">&lt;= </span><span class="nv">last_solved</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span><span class="w">
        </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="nv">result</span><span class="w"> </span><span class="o">!=</span><span class="w"> </span><span class="nv">expected</span><span class="p">.</span><span class="nv">first</span><span class="p">)</span><span class="w"> </span><span class="kr">begin</span><span class="w">
            </span><span class="nb">$display</span><span class="p">(</span><span class="dl">"</span><span class="s">Result was </span><span class="se">%x</span><span class="s"> but expected </span><span class="se">%x</span><span class="dl">"</span><span class="p">,</span><span class="w"> </span><span class="nv">result</span><span class="p">,</span><span class="w"> </span><span class="nv">expected.first</span><span class="p">);</span><span class="w">
            </span><span class="na">conclude</span><span class="p">;</span><span class="w">
        </span><span class="kr">end</span><span class="w">
        </span><span class="nv">expected</span><span class="p">.</span><span class="na">deq</span><span class="p">;</span><span class="w">
    </span><span class="kd">endrule</span><span class="w">

    </span><span class="kd">rule</span><span class="w"> </span><span class="nf">terminate</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="nv">cycles</span><span class="w"> </span><span class="o">&gt;</span><span class="w"> </span><span class="mi">'hFFFF</span><span class="p">);</span><span class="w">
        </span><span class="nb">$display</span><span class="p">(</span><span class="dl">"</span><span class="s">Emergency exit</span><span class="dl">"</span><span class="p">);</span><span class="w">
        </span><span class="nb">$finish</span><span class="p">;</span><span class="w">
    </span><span class="kd">endrule</span><span class="w">
</span><span class="kd">endmodule</span><span class="w">
</span></code></pre></div></div>
<p>We should test our testbench so we know it actually works. For that, we can write a dummy <code class="language-plaintext highlighter-rouge">MultiplierUnit</code> implementation that can receive requests and serve responses without the burden of producing correct answers. It should fail tests and trigger the failure message.</p>

<h3 id="dummy-implementation">Dummy Implementation</h3>

<p>Just as above, we’re using these definitions.</p>

<div class="language-bluespec highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">typedef</span><span class="w"> </span><span class="nc">Bit#</span><span class="p">(</span><span class="mi">32</span><span class="p">)</span><span class="w"> </span><span class="nc">Word</span><span class="p">;</span><span class="w">
</span><span class="kd">typedef</span><span class="w"> </span><span class="nc">Vector#</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span><span class="w"> </span><span class="nc">Word</span><span class="p">)</span><span class="w"> </span><span class="nc">Pair</span><span class="p">;</span><span class="w">

</span><span class="kd">interface</span><span class="w"> </span><span class="nc">MultiplierUnit</span><span class="p">;</span><span class="w">
    </span><span class="kd">method</span><span class="w"> </span><span class="k">Action</span><span class="w"> </span><span class="nf">start</span><span class="p">(</span><span class="nc">Pair</span><span class="w"> </span><span class="kd">in</span><span class="p">);</span><span class="w">
    </span><span class="kd">method</span><span class="w"> </span><span class="k">ActionValue#</span><span class="p">(</span><span class="nc">Pair</span><span class="p">)</span><span class="w"> </span><span class="nf">result</span><span class="p">;</span><span class="w">
</span><span class="kd">endinterface</span><span class="w">

</span><span class="kd">typedef</span><span class="w"> </span><span class="kd">enum</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="no">Idle</span><span class="p">,</span><span class="w">
    </span><span class="no">Busy</span><span class="p">,</span><span class="w">
    </span><span class="no">Ready</span><span class="w">
</span><span class="p">}</span><span class="w"> </span><span class="nc">MultiplierState</span><span class="w"> </span><span class="kd">deriving</span><span class="w"> </span><span class="p">(</span><span class="no">Bits</span><span class="p">,</span><span class="w"> </span><span class="no">Eq</span><span class="p">,</span><span class="w"> </span><span class="no">FShow</span><span class="p">);</span><span class="w">
</span></code></pre></div></div>

<p>This is a “multiplier” that does no multiplying: it just hands its inputs back as the result. Its sole purpose is to let us run our testbench and fail without stalling, which it does, beautifully.</p>

<div class="language-bluespec highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="fm">(* synthesize *)</span><span class="w">
</span><span class="kd">module</span><span class="w"> </span><span class="nf">mkMultiplierUnit</span><span class="p">(</span><span class="nc">MultiplierUnit</span><span class="p">);</span><span class="w">
    </span><span class="nc">Reg#</span><span class="p">(</span><span class="nc">MultiplierState</span><span class="p">)</span><span class="w"> </span><span class="nv">state</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="na">mkReg</span><span class="p">(</span><span class="no">Idle</span><span class="p">);</span><span class="w"> 
    </span><span class="nc">FIFO#</span><span class="p">(</span><span class="nc">Pair</span><span class="p">)</span><span class="w"> </span><span class="nv">last_inputs</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="na">mkFIFO</span><span class="p">;</span><span class="w">

    </span><span class="kd">method</span><span class="w"> </span><span class="k">Action</span><span class="w"> </span><span class="nf">start</span><span class="p">(</span><span class="nc">Pair</span><span class="w"> </span><span class="kd">in</span><span class="p">)</span><span class="w"> </span><span class="nf">if</span><span class="w"> </span><span class="p">(</span><span class="nv">state</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="nc">Idle</span><span class="p">);</span><span class="w">
        </span><span class="nv">last_inputs</span><span class="p">.</span><span class="na">enq</span><span class="p">(</span><span class="nv">in</span><span class="p">);</span><span class="w">
        </span><span class="na">state </span><span class="o">&lt;= </span><span class="no">Ready</span><span class="p">;</span><span class="w">
    </span><span class="kd">endmethod</span><span class="w">

    </span><span class="kd">method</span><span class="w"> </span><span class="k">ActionValue#</span><span class="p">(</span><span class="nc">Pair</span><span class="p">)</span><span class="w"> </span><span class="nf">result</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="nv">state</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="no">Ready</span><span class="p">);</span><span class="w">
        </span><span class="nv">last_inputs</span><span class="p">.</span><span class="na">deq</span><span class="p">;</span><span class="w">
        </span><span class="na">state </span><span class="o">&lt;= </span><span class="no">Idle</span><span class="p">;</span><span class="w">
        </span><span class="k">return</span><span class="w"> </span><span class="nv">last_inputs</span><span class="p">.</span><span class="nv">first</span><span class="p">;</span><span class="w">
    </span><span class="kd">endmethod</span><span class="w">
</span><span class="kd">endmodule</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Area: 1208.08 um^2
Critical-path delay: 145.5 ps
Cycles: N/A
Coverage: Only when inputs = outputs
</code></pre></div></div>
<p>The area is large for a design that does almost nothing. That can probably be chalked up to the <code class="language-plaintext highlighter-rouge">FIFO</code>, whose entries each hold a <code class="language-plaintext highlighter-rouge">Pair</code>, equivalent to <code class="language-plaintext highlighter-rouge">Bit#(64)</code> of storage.</p>

<p>A stopped clock can be correct. Since our dummy implementation returns its inputs as outputs, we pass the <code class="language-plaintext highlighter-rouge">0 times 0 = 0</code> test and nothing else. Now that we know our testbench works, we can start worrying about implementing the multiplier.</p>

<h2 id="designs">Designs</h2>

<h3 id="first-implementation">First Implementation</h3>

<p>Let’s use the Hennessy and Patterson designs for integer multiplication, specifically the multiplier from their Radix-2 Multiplication and Division material. We’ll start with unsigned multiplication, which works for non-negative integers, and handle signed multiplication later, since accommodating negative integers adds some complexity.</p>

<div class="language-bluespec highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="fm">(* synthesize *)</span><span class="w">
</span><span class="kd">module</span><span class="w"> </span><span class="nf">mkMultiplierUnit</span><span class="p">(</span><span class="nc">MultiplierUnit</span><span class="p">);</span><span class="w">
    </span><span class="nc">Reg#</span><span class="p">(</span><span class="nc">MultiplierState</span><span class="p">)</span><span class="w"> </span><span class="nv">state</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="na">mkReg</span><span class="p">(</span><span class="no">Idle</span><span class="p">);</span><span class="w"> 
    </span><span class="nc">Reg#</span><span class="p">(</span><span class="nc">Word</span><span class="p">)</span><span class="w"> </span><span class="nv">a</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="na">mkRegU</span><span class="p">;</span><span class="w">
    </span><span class="nc">Reg#</span><span class="p">(</span><span class="nc">Word</span><span class="p">)</span><span class="w"> </span><span class="nv">b</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="na">mkRegU</span><span class="p">;</span><span class="w">
    </span><span class="nc">Reg#</span><span class="p">(</span><span class="nc">Word</span><span class="p">)</span><span class="w"> </span><span class="nv">p</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="na">mkRegU</span><span class="p">;</span><span class="w">
    </span><span class="nc">Reg#</span><span class="p">(</span><span class="nc">Bit#</span><span class="p">(</span><span class="mi">5</span><span class="p">))</span><span class="w"> </span><span class="nv">index</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="na">mkRegU</span><span class="p">;</span><span class="w">  </span><span class="c">// only need 0 through 31</span><span class="w">

    </span><span class="kd">rule </span><span class="nf">work</span><span class="p">(</span><span class="nv">state</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="no">Busy</span><span class="p">);</span><span class="w">
        </span><span class="nc">Bit#</span><span class="p">(</span><span class="mi">33</span><span class="p">)</span><span class="w"> </span><span class="nv">new_p</span><span class="w"> </span><span class="o">= </span><span class="p">(</span><span class="nv">a</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="mi">'b1</span><span class="p">)</span><span class="w"> </span><span class="kp">?</span><span class="w"> </span><span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="nv">p</span><span class="p">}</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="nv">b</span><span class="p">}</span><span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="nv">p</span><span class="p">};</span><span class="w">  </span><span class="c">// (1); may need carry</span><span class="w">
        </span><span class="na">p </span><span class="o">&lt;= </span><span class="p">{</span><span class="nv">new_p</span><span class="p">[</span><span class="mi">32</span><span class="o">:</span><span class="mi">1</span><span class="p">]};</span><span class="w">
        </span><span class="na">a </span><span class="o">&lt;= </span><span class="p">{</span><span class="nv">new_p</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span><span class="w"> </span><span class="nv">a</span><span class="p">[</span><span class="mi">31</span><span class="o">:</span><span class="mi">1</span><span class="p">]};</span><span class="w">
        </span><span class="na">index </span><span class="o">&lt;= </span><span class="nv">index</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span><span class="w">

        </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="nv">index</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="mi">31</span><span class="p">)</span><span class="w"> </span><span class="na">state </span><span class="o">&lt;= </span><span class="no">Ready</span><span class="p">;</span><span class="w">
    </span><span class="kd">endrule</span><span class="w">

    </span><span class="kd">method</span><span class="w"> </span><span class="k">Action</span><span class="w"> </span><span class="nf">start</span><span class="p">(</span><span class="nc">Pair</span><span class="w"> </span><span class="kd">in</span><span class="p">)</span><span class="w"> </span><span class="nf">if</span><span class="w"> </span><span class="p">(</span><span class="nv">state</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="nc">Idle</span><span class="p">);</span><span class="w">
        </span><span class="na">a </span><span class="o">&lt;= </span><span class="kd">in</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span><span class="w">
        </span><span class="na">b </span><span class="o">&lt;= </span><span class="kd">in</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span><span class="w">
        </span><span class="na">p </span><span class="o">&lt;= </span><span class="mi">0</span><span class="p">;</span><span class="w">
        </span><span class="na">index </span><span class="o">&lt;= </span><span class="mi">0</span><span class="p">;</span><span class="w">
        </span><span class="na">state </span><span class="o">&lt;= </span><span class="no">Busy</span><span class="p">;</span><span class="w">
    </span><span class="kd">endmethod</span><span class="w">

    </span><span class="kd">method</span><span class="w"> </span><span class="k">ActionValue#</span><span class="p">(</span><span class="nc">Pair</span><span class="p">)</span><span class="w"> </span><span class="nf">result</span><span class="w"> </span><span class="nf">if</span><span class="w"> </span><span class="p">(</span><span class="nv">state</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="nc">Ready</span><span class="p">);</span><span class="w">
        </span><span class="na">state </span><span class="o">&lt;= </span><span class="no">Idle</span><span class="p">;</span><span class="w">
        </span><span class="k">return</span><span class="w"> </span><span class="nv">unpack</span><span class="p">({</span><span class="nv">p</span><span class="p">,</span><span class="w"> </span><span class="nv">a</span><span class="p">});</span><span class="w">
    </span><span class="kd">endmethod</span><span class="w">
</span><span class="kd">endmodule</span><span class="w">
</span></code></pre></div></div>
<p>Our multiplier is a state machine that goes between <code class="language-plaintext highlighter-rouge">Idle</code>, <code class="language-plaintext highlighter-rouge">Busy</code>, and <code class="language-plaintext highlighter-rouge">Ready</code>. It waits for a request in <code class="language-plaintext highlighter-rouge">Idle</code> from <code class="language-plaintext highlighter-rouge">start</code>, then it sets a bunch of registers and transitions to <code class="language-plaintext highlighter-rouge">Busy</code>. It will stay in <code class="language-plaintext highlighter-rouge">Busy</code> for 32 cycles worth of <code class="language-plaintext highlighter-rouge">work</code>, incrementing the <code class="language-plaintext highlighter-rouge">index</code> by 1 each time. When it’s on its last cycle, it will transition to <code class="language-plaintext highlighter-rouge">Ready</code>, where it waits for our testbench to pick up the product through a <code class="language-plaintext highlighter-rouge">result</code> call. Once <code class="language-plaintext highlighter-rouge">result</code> is called, the multiplier transitions back to <code class="language-plaintext highlighter-rouge">Idle</code>, ready for the next request.</p>
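<p>To sanity-check the cycle count, here is a minimal Python sketch of just the control path. It is a software model, not part of the Bluespec design: the names <code class="language-plaintext highlighter-rouge">ControlModel</code> and <code class="language-plaintext highlighter-rouge">busy_cycles</code> are made up for illustration, and Python <code class="language-plaintext highlighter-rouge">assert</code>s stand in for the method guards (a real guard blocks the caller rather than failing).</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from enum import Enum


class State(Enum):
    IDLE = 0
    BUSY = 1
    READY = 2


class ControlModel:
    """Control path only: tracks state and index, not the product."""

    def __init__(self):
        self.state = State.IDLE
        self.index = 0

    def start(self):
        # Stand-in for the guard on start: only legal in Idle.
        assert self.state == State.IDLE
        self.index = 0
        self.state = State.BUSY

    def work(self):
        # Stand-in for the work rule: fires once per cycle while Busy.
        assert self.state == State.BUSY
        if self.index == 31:
            self.state = State.READY
        self.index = (self.index + 1) % 32  # index is a 5-bit register

    def result(self):
        # Stand-in for the guard on result: only legal in Ready.
        assert self.state == State.READY
        self.state = State.IDLE


def busy_cycles():
    """Count how many times work fires for one request."""
    unit = ControlModel()
    unit.start()
    cycles = 0
    while unit.state == State.BUSY:
        unit.work()
        cycles += 1
    unit.result()
    return cycles
</code></pre></div></div>

<p>Under these assumptions, <code class="language-plaintext highlighter-rouge">busy_cycles()</code> counts 32 firings of <code class="language-plaintext highlighter-rouge">work</code> per request, which matches the description above.</p>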

<h4 id="algorithm">Algorithm</h4>
<p>The main operation is in the line labeled <code class="language-plaintext highlighter-rouge">(1)</code>. We’re basically doing <a href="https://en.wikipedia.org/wiki/Multiplication_algorithm#Long_multiplication">long multiplication</a> in binary, one digit at a time. Hennessy and Patterson explain it better in their textbook.</p>

<p>We have 3 registers: <code class="language-plaintext highlighter-rouge">a</code>, <code class="language-plaintext highlighter-rouge">b</code>, and <code class="language-plaintext highlighter-rouge">p</code>.</p>

<div class="language-bluespec highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">a </span><span class="o">&lt;= </span><span class="kd">in</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span><span class="w">
</span><span class="na">b </span><span class="o">&lt;= </span><span class="kd">in</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span><span class="w">
</span><span class="na">p </span><span class="o">&lt;= </span><span class="mi">0</span><span class="p">;</span><span class="w">
</span></code></pre></div></div>
<ul>
  <li><code class="language-plaintext highlighter-rouge">a</code> and <code class="language-plaintext highlighter-rouge">b</code> are our initial operands. <code class="language-plaintext highlighter-rouge">b</code> stays constant, and <code class="language-plaintext highlighter-rouge">a</code> changes as we run the algorithm. The lower bits of <code class="language-plaintext highlighter-rouge">a</code> correspond to the operand, and the upper bits will gradually be filled with the lower bits of our product. We make the space by consuming and throwing away the lowest bit of <code class="language-plaintext highlighter-rouge">a</code> each cycle.</li>
  <li><code class="language-plaintext highlighter-rouge">p</code> starts as 0 and accumulates the upper bits of our running sum.</li>
  <li>At termination, <code class="language-plaintext highlighter-rouge">a</code> will hold the lower bits of our product and <code class="language-plaintext highlighter-rouge">p</code> will hold the upper bits.</li>
</ul>

<p>Each cycle, we perform two steps inside of <code class="language-plaintext highlighter-rouge">work</code>.</p>

<p>Step 1, we inspect the LSB of <code class="language-plaintext highlighter-rouge">a</code> (<code class="language-plaintext highlighter-rouge">a[0]</code>) to determine our <code class="language-plaintext highlighter-rouge">new_p</code>.</p>
<ul>
  <li>If it’s <code class="language-plaintext highlighter-rouge">0</code>, we do nothing and just use <code class="language-plaintext highlighter-rouge">new_p = p</code>.</li>
  <li>If it’s <code class="language-plaintext highlighter-rouge">1</code>, we add <code class="language-plaintext highlighter-rouge">b</code> to <code class="language-plaintext highlighter-rouge">p</code> and use <code class="language-plaintext highlighter-rouge">new_p = p + b</code>.</li>
  <li>We do this addition in <code class="language-plaintext highlighter-rouge">Bit#(33)</code> so that the carry out of the 32-bit sum isn’t lost.</li>
</ul>

<div class="language-bluespec highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">Bit#</span><span class="p">(</span><span class="mi">33</span><span class="p">)</span><span class="w"> </span><span class="nv">new_p</span><span class="w"> </span><span class="o">= </span><span class="p">(</span><span class="nv">a</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="mi">'b1</span><span class="p">)</span><span class="w"> </span><span class="kp">?</span><span class="w"> </span><span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="nv">p</span><span class="p">}</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="nv">b</span><span class="p">}</span><span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="nv">p</span><span class="p">};</span><span class="w">  </span><span class="c">// (1); may need carry</span><span class="w">
</span></code></pre></div></div>

<p>Step 2, we shift <code class="language-plaintext highlighter-rouge">new_p</code> and <code class="language-plaintext highlighter-rouge">a</code> right by one bit, and store the resulting <code class="language-plaintext highlighter-rouge">new_p &gt;&gt; 1</code> into <code class="language-plaintext highlighter-rouge">p</code>. We can think of <code class="language-plaintext highlighter-rouge">p</code> and <code class="language-plaintext highlighter-rouge">a</code> as stuck together, with <code class="language-plaintext highlighter-rouge">p</code> on the left and <code class="language-plaintext highlighter-rouge">a</code> on the right, so the bit shifted out of <code class="language-plaintext highlighter-rouge">new_p</code> (<code class="language-plaintext highlighter-rouge">new_p[0]</code>) becomes the new MSB of <code class="language-plaintext highlighter-rouge">a</code>. The total 64-bit product will eventually be <code class="language-plaintext highlighter-rouge">{p, a}</code>. We make the space for the carry bit at the top of <code class="language-plaintext highlighter-rouge">new_p</code> by shifting out the <code class="language-plaintext highlighter-rouge">a[0]</code> that we consumed. Think of long multiplication.</p>

<div class="language-bluespec highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">p </span><span class="o">&lt;= </span><span class="p">{</span><span class="nv">new_p</span><span class="p">[</span><span class="mi">32</span><span class="o">:</span><span class="mi">1</span><span class="p">]};</span><span class="w">
</span><span class="na">a </span><span class="o">&lt;= </span><span class="p">{</span><span class="nv">new_p</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span><span class="w"> </span><span class="nv">a</span><span class="p">[</span><span class="mi">31</span><span class="o">:</span><span class="mi">1</span><span class="p">]};</span><span class="w">
</span></code></pre></div></div>

<p>I believe there are some startup cycles because it takes our <code class="language-plaintext highlighter-rouge">BRAM</code> a little bit of time to get ready. Asymptotically, when running many tests, I can see that the cycles per test tend toward 34. Internally, only 32 of those cycles correspond to work happening in the multiplier. The other 2 have to do with input/output.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Area: 1273.08 um^2
Critical-path delay: 249.57 ps
Cycles: 272 cycles to the 7th test. (38.86 cycles per multiply (internal 32, external 34))
Coverage: Non-negative only.
</code></pre></div></div>
<p>I generate the area/delay numbers with <code class="language-plaintext highlighter-rouge">synth</code>, and the cycle and test numbers with my <code class="language-plaintext highlighter-rouge">$display</code> statements. Since we pass the non-negative tests, let’s extend our multiplier to work for signed multiplication.</p>

<h3 id="radix-2-signed-multiplication">Radix-2 Signed Multiplication</h3>
<p>We can focus on implementation because we already have tests that cover multiplication with signed values.</p>

<p>We use <a href="https://en.wikipedia.org/wiki/Booth's_multiplication_algorithm">Booth recoding</a> to multiply signed numbers. The main difference is that at each step, we inspect the current and previous bit of <code class="language-plaintext highlighter-rouge">a</code> rather than just the current bit. The change is almost entirely in our <code class="language-plaintext highlighter-rouge">work</code> rule, with nearly everything else staying the same.</p>

<p>It’s a little tricky to explain the idea behind the algorithm, but it gets clearer as we have more cases with higher radix. Essentially, we read our two-bit case <code class="language-plaintext highlighter-rouge">{a[0], last_a}</code> as a small signed sum: <code class="language-plaintext highlighter-rouge">a[0]</code> is treated as a one-bit <a href="https://en.wikipedia.org/wiki/Two%27s_complement">two’s complement</a> number (so a set bit corresponds to <code class="language-plaintext highlighter-rouge">-1</code>), and <code class="language-plaintext highlighter-rouge">last_a</code> is the <a href="https://en.wikipedia.org/wiki/Carry_(arithmetic)">carry</a> from the previous operation (so a set bit corresponds to <code class="language-plaintext highlighter-rouge">+1</code>).</p>

<p>With each case:</p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">00</code> is <code class="language-plaintext highlighter-rouge">0(-1) + 0(1) = 0 + 0 = 0</code></li>
  <li><code class="language-plaintext highlighter-rouge">01</code> is <code class="language-plaintext highlighter-rouge">0(-1) + 1(1) = 0 + 1 = 1</code></li>
  <li><code class="language-plaintext highlighter-rouge">10</code> is <code class="language-plaintext highlighter-rouge">1(-1) + 0(1) = -1 + 0 = -1</code></li>
  <li><code class="language-plaintext highlighter-rouge">11</code> is <code class="language-plaintext highlighter-rouge">1(-1) + 1(1) = -1 + 1 = 0</code></li>
</ul>
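
<p>One way to convince ourselves that the table is right is to recode a whole word and check that the digits, weighted by powers of two, add back up to the signed value. The Python sketch below does that. The function names are invented for illustration, and it assumes <code class="language-plaintext highlighter-rouge">last_a</code> starts at 0.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def booth_digits(a, width=32):
    """Radix-2 Booth digits of a width-bit pattern, least significant first."""
    digits = []
    last = 0  # plays the role of last_a, assumed to start at 0
    for i in range(width):
        bit = (a // 2**i) % 2
        digits.append(last - bit)  # same as bit*(-1) + last*(1)
        last = bit
    return digits


def signed_value(a, width=32):
    """Interpret a width-bit pattern as a two's complement integer."""
    a %= 2**width
    return a - 2**width if a // 2**(width - 1) == 1 else a


def recoded_value(a, width=32):
    """Sum of digit * 2**i over all Booth digits."""
    return sum(d * 2**i for i, d in enumerate(booth_digits(a, width)))
</code></pre></div></div>

<p>For example, the 4-bit pattern <code class="language-plaintext highlighter-rouge">0110</code> (6) recodes to the digits 0, -1, 0, +1 from the LSB up, and <code class="language-plaintext highlighter-rouge">-2 + 8 = 6</code>.</p>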

<p>We use these cases to determine what to do with <code class="language-plaintext highlighter-rouge">b</code> in our <code class="language-plaintext highlighter-rouge">new_p</code> addition. Depending on the case, we mux between <code class="language-plaintext highlighter-rouge">new_p = p</code>, <code class="language-plaintext highlighter-rouge">new_p = p + b</code>, or (new!) <code class="language-plaintext highlighter-rouge">new_p = p - b</code>.</p>
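
<p>The three-way choice can also be modelled in software before touching the hardware. This is a hedged Python sketch rather than the Bluespec: <code class="language-plaintext highlighter-rouge">multiply_signed</code> and <code class="language-plaintext highlighter-rouge">to_signed</code> are illustrative names, Python’s unbounded integers stand in for the sign-extended 33-bit values, and <code class="language-plaintext highlighter-rouge">last_a</code> is assumed to start at 0.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>WORD = 2**32  # number of distinct 32-bit values


def to_signed(x):
    """Interpret a 32-bit pattern as a two's complement integer."""
    x %= WORD
    return x - WORD if x // 2**31 == 1 else x


def multiply_signed(x, y):
    """Radix-2 Booth model; returns the signed product read from {p, a}."""
    a = x % WORD          # raw bit pattern, consumed LSB first
    b = to_signed(y)      # signed value, standing in for signExtend(b)
    p, last_a = 0, 0      # last_a assumed to start at 0
    for _ in range(32):
        case = (a % 2, last_a)
        if case == (0, 1):
            new_p = p + b
        elif case == (1, 0):
            new_p = p - b
        else:
            new_p = p
        last_a = a % 2
        a = (new_p % 2) * 2**31 + a // 2
        p = new_p // 2    # floor division keeps the sign, like an arithmetic shift
    return p * WORD + a
</code></pre></div></div>

<p>Operands are passed as 32-bit patterns, so a negative operand such as -7 goes in as <code class="language-plaintext highlighter-rouge">-7 % 2**32</code>.</p>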

<div class="language-bluespec highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">Reg#</span><span class="p">(</span><span class="nc">Bit#</span><span class="p">(</span><span class="mi">1</span><span class="p">))</span><span class="w"> </span><span class="nv">last_a</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="na">mkRegU</span><span class="p">;</span><span class="w">
</span></code></pre></div></div>

<div class="language-bluespec highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">rule </span><span class="nf">work</span><span class="p">(</span><span class="nv">state</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="no">Busy</span><span class="p">);</span><span class="w">
    </span><span class="c">// Booth recoding</span><span class="w">
    </span><span class="nc">Bit#</span><span class="p">(</span><span class="mi">33</span><span class="p">)</span><span class="w"> </span><span class="nv">p_</span><span class="w"> </span><span class="o">= </span><span class="nv">signExtend</span><span class="p">(</span><span class="nv">p</span><span class="p">);</span><span class="w">
    </span><span class="nc">Bit#</span><span class="p">(</span><span class="mi">33</span><span class="p">)</span><span class="w"> </span><span class="nv">b_</span><span class="w"> </span><span class="o">= </span><span class="nv">signExtend</span><span class="p">(</span><span class="nv">b</span><span class="p">);</span><span class="w">
    </span><span class="c">// This turns into a mux choosing between 3.</span><span class="w">
    </span><span class="nc">Bit#</span><span class="p">(</span><span class="mi">33</span><span class="p">)</span><span class="w"> </span><span class="nv">new_p</span><span class="w"> </span><span class="o">= </span><span class="kr">case</span><span class="w"> </span><span class="p">({</span><span class="nv">a</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span><span class="w"> </span><span class="nv">last_a</span><span class="p">})</span><span class="w"> </span><span class="kr">matches</span><span class="w">
        </span><span class="mi">2'b01</span><span class="o">:</span><span class="w"> </span><span class="p">{</span><span class="nv">p_</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nv">b_</span><span class="p">};</span><span class="w">
        </span><span class="mi">2'b10</span><span class="o">:</span><span class="w"> </span><span class="p">{</span><span class="nv">p_</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nv">b_</span><span class="p">};</span><span class="w">
        </span><span class="kr">default</span><span class="o">:</span><span class="w"> </span><span class="p">{</span><span class="nv">p_</span><span class="p">};</span><span class="w">  </span><span class="c">// to massage compiler</span><span class="w">
    </span><span class="k">endcase</span><span class="p">;</span><span class="w">

    </span><span class="na">last_a </span><span class="o">&lt;= </span><span class="nv">a</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span><span class="w">
    </span><span class="na">p </span><span class="o">&lt;= </span><span class="p">{</span><span class="nv">new_p</span><span class="p">[</span><span class="mi">32</span><span class="o">:</span><span class="mi">1</span><span class="p">]};</span><span class="w">
    </span><span class="na">a </span><span class="o">&lt;= </span><span class="p">{</span><span class="nv">new_p</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span><span class="w"> </span><span class="nv">a</span><span class="p">[</span><span class="mi">31</span><span class="o">:</span><span class="mi">1</span><span class="p">]};</span><span class="w">
    </span><span class="na">index </span><span class="o">&lt;= </span><span class="nv">index</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span><span class="w">

    </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="nv">index</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="mi">31</span><span class="p">)</span><span class="w"> </span><span class="na">state </span><span class="o">&lt;= </span><span class="no">Ready</span><span class="p">;</span><span class="w">
</span><span class="kd">endrule</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Area: 1612.23 um^2
Critical-path delay: 309.79 ps
Cycles: 443 cycles for 13 tests (34 cycles per multiply - 2 overhead = 32)
Coverage: All 32-bit integer multiply
</code></pre></div></div>

<p>Our job is complete if we’re okay with 34 cycles per multiply. Internally, there are only 32 cycles’ worth of work; the last 2 cycles are constant testbench overhead. From here on, let’s only worry about the internal work.</p>
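
<p>As a sanity check on those 32 internal steps, here is a small software model of the radix-2 <code class="language-plaintext highlighter-rouge">work</code> rule. It’s a sketch in Python, not part of the Bluespec design: the name <code class="language-plaintext highlighter-rouge">booth_radix2</code> is made up, and it assumes that <code class="language-plaintext highlighter-rouge">p</code> and <code class="language-plaintext highlighter-rouge">last_a</code> start at 0 and that the 64-bit product ends up in <code class="language-plaintext highlighter-rouge">{p, a}</code>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def booth_radix2(a, b, width=32):
    """Software model of the radix-2 rule; returns (product, steps)."""
    p = 0                    # upper half, kept as a signed Python int
    lo = a % 2**width        # multiplier register `a`, as raw unsigned bits
    last_a = 0               # assumed reset value of `last_a`
    steps = 0
    for _ in range(width):
        a0 = lo % 2
        digit = last_a - a0      # window 01 gives +1, 10 gives -1, 00/11 give 0
        new_p = p + digit * b    # needs one extra bit, like Bit#(33)
        last_a = a0
        p = new_p // 2           # new_p[32:1]; floor division acts as an arithmetic shift
        lo = (new_p % 2) * 2**(width - 1) + lo // 2    # {new_p[0], a[31:1]}
        steps += 1
    return p * 2**width + lo, steps


print(booth_radix2(7, -3))
</code></pre></div></div>

<p>For instance, <code class="language-plaintext highlighter-rouge">booth_radix2(7, -3)</code> returns <code class="language-plaintext highlighter-rouge">(-21, 32)</code>: the right product after exactly 32 steps.</p>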

<p>But we might want a quicker multiplier. We don’t necessarily need a critical path of 310 ps if the rest of our arithmetic units force us onto a slower clock anyway, and we may be able to spare more than 1600 um^2 of area if it means we can get more multiplication done.</p>

<p>Let’s try to get better performance. There are a few ways to multiply faster. This time, let’s focus on higher-radix multiplication. <a href="#next-time">Next time</a>, we can explore other options.</p>

<h3 id="radix-4-multiplication">Radix-4 Multiplication</h3>

<p>We can make things faster by inspecting two bits at a time instead of one. We can then finish in 16 internal cycles rather than 32, since we’re doing 16 cycles of 2 bits each.</p>

<p>Most of the change happens in the <code class="language-plaintext highlighter-rouge">work</code> rule. Our case statement becomes a little uglier, muxing between 5 options rather than 3, because we now need to add up to <code class="language-plaintext highlighter-rouge">+/- 2b</code> instead of just <code class="language-plaintext highlighter-rouge">+/- b</code> like last time.</p>

<p>As a syntactic aside, I wrote the variable for <code class="language-plaintext highlighter-rouge">2*b</code> as <code class="language-plaintext highlighter-rouge">b2</code> instead of <code class="language-plaintext highlighter-rouge">2b</code> because the Bluespec language specification requires that variable identifiers start with a lowercase letter. My <a href="/projects/vscode-bsv/">extension for VS Code</a> raises a syntax error when an identifier starts with a number, though my <a href="/projects/bluespec-lexer/">lexer for Rouge</a> currently does not.</p>

<p>The Bluespec language specification is a little contradictory though, since identifiers <em>can</em> start with a <code class="language-plaintext highlighter-rouge">$</code> or <code class="language-plaintext highlighter-rouge">_</code> instead of a letter. For spots of ambiguity, I defer to <a href="https://github.com/B-Lang-org/bsc">the compiler</a>. The Bluespec compiler raises a compile-time syntax error when it encounters a variable identifier starting with a number.</p>

<div class="language-bluespec highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">rule </span><span class="nf">work</span><span class="p">(</span><span class="nv">state</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="no">Busy</span><span class="p">);</span><span class="w">
    </span><span class="c">// Booth recoding</span><span class="w">
    </span><span class="nc">Bit#</span><span class="p">(</span><span class="mi">34</span><span class="p">)</span><span class="w"> </span><span class="nv">p_</span><span class="w"> </span><span class="o">= </span><span class="nv">signExtend</span><span class="p">(</span><span class="nv">p</span><span class="p">);</span><span class="w">
    </span><span class="nc">Bit#</span><span class="p">(</span><span class="mi">34</span><span class="p">)</span><span class="w"> </span><span class="nv">b_</span><span class="w"> </span><span class="o">= </span><span class="nv">signExtend</span><span class="p">(</span><span class="nv">b</span><span class="p">);</span><span class="w">
    </span><span class="nc">Bit#</span><span class="p">(</span><span class="mi">34</span><span class="p">)</span><span class="w"> </span><span class="nv">b2_</span><span class="w"> </span><span class="o">= </span><span class="nv">signExtend</span><span class="p">({</span><span class="nv">b</span><span class="p">,</span><span class="w"> </span><span class="mi">1'b0</span><span class="p">});</span><span class="w">  </span><span class="c">// 2*b = b &lt;&lt; 1</span><span class="w">

    </span><span class="c">// Mux that chooses between 5; the default is optimized out.</span><span class="w">
    </span><span class="nc">Bit#</span><span class="p">(</span><span class="mi">34</span><span class="p">)</span><span class="w"> </span><span class="nv">new_p</span><span class="w"> </span><span class="o">= </span><span class="kr">case</span><span class="w"> </span><span class="p">({</span><span class="nv">a</span><span class="p">[</span><span class="mi">1</span><span class="o">:</span><span class="mi">0</span><span class="p">],</span><span class="w"> </span><span class="nv">last_a</span><span class="p">})</span><span class="w">
        </span><span class="mi">3'b111</span><span class="p">,</span><span class="w"> </span><span class="mi">3'b000</span><span class="o">:</span><span class="w"> </span><span class="p">{</span><span class="nv">p_</span><span class="p">};</span><span class="w">
        </span><span class="mi">3'b001</span><span class="p">,</span><span class="w"> </span><span class="mi">3'b010</span><span class="o">:</span><span class="w"> </span><span class="p">{</span><span class="nv">p_</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nv">b_</span><span class="p">};</span><span class="w">
        </span><span class="mi">3'b101</span><span class="p">,</span><span class="w"> </span><span class="mi">3'b110</span><span class="o">:</span><span class="w"> </span><span class="p">{</span><span class="nv">p_</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nv">b_</span><span class="p">};</span><span class="w">
        </span><span class="mi">3'b011</span><span class="o">:</span><span class="w"> </span><span class="p">{</span><span class="nv">p_</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nv">b2_</span><span class="p">};</span><span class="w">
        </span><span class="mi">3'b100</span><span class="o">:</span><span class="w"> </span><span class="p">{</span><span class="nv">p_</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nv">b2_</span><span class="p">};</span><span class="w">
        </span><span class="kr">default</span><span class="o">:</span><span class="w"> </span><span class="p">{</span><span class="mi">0</span><span class="p">};</span><span class="w">  </span><span class="c">// never happens</span><span class="w">
    </span><span class="k">endcase</span><span class="p">;</span><span class="w">

    </span><span class="na">p </span><span class="o">&lt;= </span><span class="p">{</span><span class="nv">new_p</span><span class="p">[</span><span class="mi">33</span><span class="o">:</span><span class="mi">2</span><span class="p">]};</span><span class="w">
    </span><span class="na">a </span><span class="o">&lt;= </span><span class="p">{</span><span class="nv">new_p</span><span class="p">[</span><span class="mi">1</span><span class="o">:</span><span class="mi">0</span><span class="p">],</span><span class="w"> </span><span class="nv">a</span><span class="p">[</span><span class="mi">31</span><span class="o">:</span><span class="mi">2</span><span class="p">]};</span><span class="w">
    </span><span class="na">last_a </span><span class="o">&lt;= </span><span class="nv">a</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span><span class="w">
    </span><span class="na">index </span><span class="o">&lt;= </span><span class="nv">index</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mi">2</span><span class="p">;</span><span class="w">

    </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="nv">index</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="mi">30</span><span class="p">)</span><span class="w"> </span><span class="na">state </span><span class="o">&lt;= </span><span class="no">Ready</span><span class="p">;</span><span class="w">  </span><span class="c">// on last step</span><span class="w">
</span><span class="kd">endrule</span><span class="w">
</span></code></pre></div></div>
<p>Our cases follow the same rules as before, where <code class="language-plaintext highlighter-rouge">a[1:0]</code> is now a 2-bit two’s complement number and the <code class="language-plaintext highlighter-rouge">last_a</code> is still the 1-bit carry.</p>

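<p>For example, <code class="language-plaintext highlighter-rouge">3'b011</code> corresponds to <code class="language-plaintext highlighter-rouge">0(-2) + 1(1) + 1(1) = 2</code> (the two bits of <code class="language-plaintext highlighter-rouge">a[1:0]</code> weigh -2 and 1, and <code class="language-plaintext highlighter-rouge">last_a</code> weighs 1), so we add <code class="language-plaintext highlighter-rouge">2b</code>.</p>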
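
<p>The same weighting can be tabulated for all eight windows. The snippet below is only an illustrative Python sketch (the helper name <code class="language-plaintext highlighter-rouge">radix4_digit</code> is made up); it prints which multiple of <code class="language-plaintext highlighter-rouge">b</code> each window selects, which should line up with the case statement above.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def radix4_digit(a1, a0, last_a):
    """Multiple of b selected by the window {a[1], a[0], last_a}."""
    # a[1] weighs -2 (two's complement sign bit), a[0] weighs 1, last_a weighs 1.
    return -2 * a1 + a0 + last_a


# Map each 3-bit window, written like the case labels, to its multiple of b.
table = {
    f"3'b{a1}{a0}{last_a}": radix4_digit(a1, a0, last_a)
    for a1 in (0, 1)
    for a0 in (0, 1)
    for last_a in (0, 1)
}

for window, digit in table.items():
    print(window, digit)
</code></pre></div></div>

<p>Only five distinct multiples show up (-2, -1, 0, 1, 2), which is why the mux chooses between 5.</p>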

<h4 id="compiler-optimizations">Compiler Optimizations</h4>

<p>Initially, I made available a <code class="language-plaintext highlighter-rouge">b2</code> register that uses a precomputed value of <code class="language-plaintext highlighter-rouge">2b</code> at the start, but then I realized it didn’t save any work because we can just do a single bit shift to double <code class="language-plaintext highlighter-rouge">b</code>, which is negligible in hardware.</p>

<p>Interestingly, in Bluespec the following statements all result in the same circuit:</p>

<div class="language-bluespec highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// This is if we store `2b` in a register</span><span class="w">
</span><span class="na">b2 </span><span class="o">&lt;= </span><span class="mi">2</span><span class="o">*</span><span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="w"> </span><span class="kd">in</span><span class="p">[</span><span class="mi">1</span><span class="p">]};</span><span class="w">  </span><span class="c">// "multiply"</span><span class="w">
</span><span class="na">b2 </span><span class="o">&lt;= </span><span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="w"> </span><span class="kd">in</span><span class="p">[</span><span class="mi">1</span><span class="p">]}</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">{</span><span class="mi">0</span><span class="p">,</span><span class="w"> </span><span class="kd">in</span><span class="p">[</span><span class="mi">1</span><span class="p">]};</span><span class="w">  </span><span class="c">// addition</span><span class="w">
</span><span class="na">b2 </span><span class="o">&lt;= </span><span class="p">{</span><span class="kd">in</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span><span class="w"> </span><span class="mi">1'b0</span><span class="p">};</span><span class="w">    </span><span class="c">// simple fixed shift</span><span class="w">

</span><span class="c">// Total Circuit in Multiplier:</span><span class="w">
</span><span class="c">// Area: 2575.68 um^2</span><span class="w">
</span><span class="c">// Critical-path delay: 322.36 ps</span><span class="w">
</span></code></pre></div></div>

<p>The compiler sometimes performs these types of optimizations, but sometimes it doesn’t. For a Bluespec developer, there’s a careful balance between writing beautiful (potentially-) optimiz<em>able</em> code and ugly (pretty-sure-is-) optimal code. The optimization part of the Bluespec compiler is not as mature as that of software compilers like <a href="https://en.wikipedia.org/wiki/Clang">Clang</a>.</p>

<p>Because a multiplication by 2 only requires a fixed bit shift, we don’t need to store the value of <code class="language-plaintext highlighter-rouge">2b</code>. In terms of synthesis, choosing to perform the bit shift every time to save state reduces the area (because we save a <code class="language-plaintext highlighter-rouge">Bit#(34)</code> register) but very slightly increases the critical-path delay. This is the design we’ll go with for Radix-4.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Area: 2328.83 um^2
Critical-path delay: 324.62 ps
Cycles: 16435 cycles for 913 tests
</code></pre></div></div>

<p>Compared to radix-2 multiplication, we cut our internal cycles in half (now 16) while increasing the area by 44% and the critical path by less than 5%. The speed-up on internal work is close to double (about 1.9x if the clock tracks each design’s own critical path). It seems like this was well worth it.</p>
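
<p>The arithmetic behind that comparison, as a quick Python sketch. It uses only the synthesis numbers quoted above and assumes the clock period equals each design’s own critical-path delay, which a real processor may not allow.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Synthesis numbers quoted in this post.
radix2 = {"area_um2": 1612.23, "delay_ps": 309.79, "cycles": 32}
radix4 = {"area_um2": 2328.83, "delay_ps": 324.62, "cycles": 16}


def internal_time_ps(design):
    """Time for one multiply's internal work, if the clock period equals the delay."""
    return design["cycles"] * design["delay_ps"]


area_ratio = radix4["area_um2"] / radix2["area_um2"]
delay_ratio = radix4["delay_ps"] / radix2["delay_ps"]
speedup = internal_time_ps(radix2) / internal_time_ps(radix4)

print(f"area x{area_ratio:.2f}, delay x{delay_ratio:.3f}, speed-up x{speedup:.2f}")
</code></pre></div></div>

<p>This prints <code class="language-plaintext highlighter-rouge">area x1.44, delay x1.048, speed-up x1.91</code>.</p>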

<h3 id="radix-8-multiplication">Radix-8 Multiplication</h3>

<p>Just as we went from 1 bit to 2 bits, we can inspect 3 bits at a time to further reduce the cycles. Of course, 32 is not a multiple of 3, so we need to handle the last cycle as a special case. We would go from 16 to about 11 cycles and hope it doesn’t cost us much in terms of delay and area.</p>

<p>You’ll notice that the cases become slightly more complicated. In Radix-2 we muxed between 3 choices. In Radix-4 we muxed between 5 choices. In Radix-8, we also work with <code class="language-plaintext highlighter-rouge">3b</code> and <code class="language-plaintext highlighter-rouge">4b</code>, each with a plus and a minus version, so that gives us 9 choices. That’s just in terms of <strong>outcomes</strong>.</p>

<p>In terms of <strong>cases</strong>, in Radix-2 we had <code class="language-plaintext highlighter-rouge">2^2=4</code> cases; in Radix-4 we had <code class="language-plaintext highlighter-rouge">2^3=8</code> cases; in Radix-8 we’ll have <code class="language-plaintext highlighter-rouge">2^4=16</code> cases. Quite a lot to keep track of, but it’s fine once you know the pattern.</p>
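
<p>Both counts can be checked by brute force. This is a hedged Python sketch rather than anything from the Bluespec design: <code class="language-plaintext highlighter-rouge">booth_digit</code> is a made-up helper that reads the top <code class="language-plaintext highlighter-rouge">k</code> bits of a window as a two’s complement number and adds the carry bit.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def booth_digit(window, k):
    """Multiple of b selected by the (k+1)-bit window {a[k-1:0], last_a}."""
    last_a = window % 2
    chunk = window // 2                 # a[k-1:0] as an unsigned number
    sign_bit = chunk // 2**(k - 1)      # top bit of the chunk
    return chunk - sign_bit * 2**k + last_a    # two's complement value plus carry


for k in (1, 2, 3):                     # Radix-2, Radix-4, Radix-8
    digits = [booth_digit(w, k) for w in range(2**(k + 1))]
    print(f"radix-{2**k}: {len(digits)} cases, {len(set(digits))} outcomes")
</code></pre></div></div>

<p>It reports 4 cases and 3 outcomes for Radix-2, 8 and 5 for Radix-4, and 16 and 9 for Radix-8.</p>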

<p>Like before, each case is itself like a two’s complement number that tells us what to do with <code class="language-plaintext highlighter-rouge">b</code>. In Radix-8, we look at 4 bits: <code class="language-plaintext highlighter-rouge">a[2:0]</code> and <code class="language-plaintext highlighter-rouge">last_a</code>. The <code class="language-plaintext highlighter-rouge">last_a</code> acts like a carry bit, so we just read <code class="language-plaintext highlighter-rouge">a[2:0]</code> as a two’s complement number, keeping in mind the carry. You’ll see this if you inspect the cases carefully in the upcoming code excerpt.</p>

<p>We do need to take care of our loop’s terminating condition. With Radix-2 and Radix-4, we could stop when the current index was 31 and 30 respectively, because that step brings the next index to 32 (or 0): a full reset. 32 is cleanly divisible by both 1 and 2.</p>

<p>With Radix-8, we might want to stop at index 29, because 29 + 3 = 32. But we never land on 29, because it’s not a multiple of 3. We can either stop when the index is 30, bringing our next one to 33, or when it is 27, bringing our next one to 30. Either way we need a special case at the end: backtracking by a turn of 1 bit, or doing one more turn of 2 bits.</p>

<p>I think it’s easier to do one more turn of 2 bits (which is just one step of Radix-4) than it is to do a backtrack. To backtrack, we’d need to also subtract <code class="language-plaintext highlighter-rouge">b</code> according to the last digit, as well as restore <code class="language-plaintext highlighter-rouge">last_a</code>. But either should work.</p>
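
<p>To see that ten turns of 3 bits plus one turn of 2 bits cover the multiplier exactly, here is a Python sketch of the recoding alone. The helper <code class="language-plaintext highlighter-rouge">mixed_radix_digits</code> is hypothetical and assumes <code class="language-plaintext highlighter-rouge">last_a</code> starts at 0; it lists the multiple of <code class="language-plaintext highlighter-rouge">b</code> chosen at each turn and checks that the weighted sum gives back the signed 32-bit multiplier.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def mixed_radix_digits(a):
    """Booth-recode a 32-bit value as ten radix-8 digits plus one radix-4 digit.

    Returns (digit, weight) pairs; each digit is the multiple of b to add.
    """
    bits = a % 2**32                    # raw 32-bit pattern of the multiplier
    last_a = 0                          # assumed to start at 0
    pairs = []
    for index in range(0, 30, 3):       # index = 0, 3, ..., 27: ten 3-bit turns
        chunk = (bits // 2**index) % 8
        top = chunk // 4                # a[2] of this window
        pairs.append((chunk - 8 * top + last_a, 2**index))
        last_a = top
    chunk = bits // 2**30               # the 2 bits left over at index 30
    top = chunk // 2
    pairs.append((chunk - 4 * top + last_a, 2**30))
    return pairs


a = -123456789
pairs = mixed_radix_digits(a)
print(len(pairs), sum(digit * weight for digit, weight in pairs) == a)
</code></pre></div></div>

<p>This prints <code class="language-plaintext highlighter-rouge">11 True</code>: eleven turns in total, and the digits add back up to the multiplier.</p>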

<p>Since not much else is going on when we’re calling <code class="language-plaintext highlighter-rouge">result</code>, I’m putting the last step in the <code class="language-plaintext highlighter-rouge">result</code> method, though you could also put it in a separate rule or the same <code class="language-plaintext highlighter-rouge">work</code> rule. The tradeoff is that putting the step in <code class="language-plaintext highlighter-rouge">result</code> may spread out the critical path, which can help, hinder, or do nothing depending on whether the processor’s critical path contains <code class="language-plaintext highlighter-rouge">work</code>, contains <code class="language-plaintext highlighter-rouge">result</code>, or contains neither. In exchange, we can shave off a cycle from <code class="language-plaintext highlighter-rouge">work</code>, giving us only 10 internal cycles.</p>

<p>There’s also the concern of instantiating new adders in <code class="language-plaintext highlighter-rouge">result</code>, but hold onto that thought for now.</p>

<h4 id="computing-3b">Computing 3b</h4>

<p>We should precompute 3b because, unlike <code class="language-plaintext highlighter-rouge">b2</code> and <code class="language-plaintext highlighter-rouge">b4</code>, which only need a shift, it requires an addition. If we compute it in <code class="language-plaintext highlighter-rouge">work</code>, then we may have two layers of adders: one to compute <code class="language-plaintext highlighter-rouge">3b</code>, then another to add it to <code class="language-plaintext highlighter-rouge">p</code> and form <code class="language-plaintext highlighter-rouge">new_p</code>. Conceptually, that might wreck our critical-path delay, so we precompute 3b in <code class="language-plaintext highlighter-rouge">start</code>. I also run the numbers later in <a href="#not-precomputing-3b">a small experiment</a>.</p>

<p>Unlike <code class="language-plaintext highlighter-rouge">b2</code>, <code class="language-plaintext highlighter-rouge">b3</code> seems to give the compiler some trouble: it doesn’t always produce an efficient circuit, so we need to tune by hand.</p>

<p>Check the differences in synthesis:</p>

<div class="language-bluespec highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">b3_ </span><span class="o">&lt;= </span><span class="mi">3</span><span class="o">*</span><span class="nv">signExtend</span><span class="p">(</span><span class="nv">in</span><span class="p">[</span><span class="mi">1</span><span class="p">]);</span><span class="w">
</span><span class="c">// Area: 5222.38 um^2</span><span class="w">
</span><span class="c">// Critical-path delay: 411.48 ps</span><span class="w">

</span><span class="nc">Bit#</span><span class="p">(</span><span class="mi">35</span><span class="p">)</span><span class="w"> </span><span class="nv">b2</span><span class="w"> </span><span class="o">= </span><span class="nv">signExtend</span><span class="p">({</span><span class="nv">in</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span><span class="w"> </span><span class="mi">1'b0</span><span class="p">});</span><span class="w">
</span><span class="na">b3_ </span><span class="o">&lt;= </span><span class="nv">b2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nv">signExtend</span><span class="p">(</span><span class="nv">in</span><span class="p">[</span><span class="mi">1</span><span class="p">]);</span><span class="w">
</span><span class="c">// Area: 5261.75 um^2</span><span class="w">
</span><span class="c">// Critical-path delay: 341.64 ps</span><span class="w">

</span><span class="na">b3_ </span><span class="o">&lt;= </span><span class="nv">signExtend</span><span class="p">(</span><span class="nv">in</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nv">signExtend</span><span class="p">(</span><span class="nv">in</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nv">signExtend</span><span class="p">(</span><span class="nv">in</span><span class="p">[</span><span class="mi">1</span><span class="p">]);</span><span class="w">
</span><span class="c">// Same as the b2 + b1 (second one)</span><span class="w">
</span></code></pre></div></div>

<p>Interestingly, <code class="language-plaintext highlighter-rouge">3*signExtend(in[1]);</code> gets the compiler to produce something much slower than repeated addition or our “clever” addition: about 411 ps versus 342 ps of critical-path delay, at nearly the same area. The repeated addition seems to get transformed into the clever addition, but I’m surprised the Bluespec compiler doesn’t optimize the first into either the second or third. I’m not sure what it’s doing.</p>

<h4 id="new-rules">New Rules</h4>

<p>We precompute <code class="language-plaintext highlighter-rouge">b3</code>, but <code class="language-plaintext highlighter-rouge">b2</code> and <code class="language-plaintext highlighter-rouge">b4</code> are once again results of shifts. We add in the many new cases for <code class="language-plaintext highlighter-rouge">work</code>. Hopefully you can see the pattern.</p>

<div class="language-bluespec highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">// Constants; we can pull them out of the `work` rule to share with `result`</span><span class="w">
</span><span class="nc">Bit#</span><span class="p">(</span><span class="mi">35</span><span class="p">)</span><span class="w"> </span><span class="nv">b_</span><span class="w"> </span><span class="o">= </span><span class="nv">signExtend</span><span class="p">(</span><span class="nv">b</span><span class="p">);</span><span class="w">
</span><span class="nc">Bit#</span><span class="p">(</span><span class="mi">35</span><span class="p">)</span><span class="w"> </span><span class="nv">b2_</span><span class="w"> </span><span class="o">= </span><span class="nv">signExtend</span><span class="p">({</span><span class="nv">b</span><span class="p">,</span><span class="w"> </span><span class="mi">1'b0</span><span class="p">});</span><span class="w">  </span><span class="c">// compiler expands lone 0</span><span class="w">
</span><span class="nc">Bit#</span><span class="p">(</span><span class="mi">35</span><span class="p">)</span><span class="w"> </span><span class="nv">b4_</span><span class="w"> </span><span class="o">= </span><span class="nv">signExtend</span><span class="p">({</span><span class="nv">b</span><span class="p">,</span><span class="w"> </span><span class="mi">2'b0</span><span class="p">});</span><span class="w">
</span><span class="nc">Bit#</span><span class="p">(</span><span class="mi">35</span><span class="p">)</span><span class="w"> </span><span class="nv">p_</span><span class="w"> </span><span class="o">= </span><span class="nv">signExtend</span><span class="p">(</span><span class="nv">p</span><span class="p">);</span><span class="w">

</span><span class="kd">rule </span><span class="nf">work</span><span class="p">(</span><span class="nv">state</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="no">Busy</span><span class="p">);</span><span class="w">  </span><span class="c">// handles 10 cycles of 3 bits each</span><span class="w">
    </span><span class="c">// Booth recoding</span><span class="w">
    </span><span class="c">// Mux choosing between 9; default is optimized out.</span><span class="w">
    </span><span class="nc">Bit#</span><span class="p">(</span><span class="mi">35</span><span class="p">)</span><span class="w"> </span><span class="nv">new_p</span><span class="w"> </span><span class="o">= </span><span class="kr">case</span><span class="w"> </span><span class="p">({</span><span class="nv">a</span><span class="p">[</span><span class="mi">2</span><span class="o">:</span><span class="mi">0</span><span class="p">],</span><span class="w"> </span><span class="nv">last_a</span><span class="p">})</span><span class="w">
        </span><span class="mi">4'b100_0</span><span class="o">:</span><span class="w">           </span><span class="p">{</span><span class="nv">p_</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nv">b4_</span><span class="p">};</span><span class="w">
        </span><span class="mi">4'b101_0</span><span class="p">,</span><span class="w"> </span><span class="mi">4'b100_1</span><span class="o">:</span><span class="w"> </span><span class="p">{</span><span class="nv">p_</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nv">b3_</span><span class="p">};</span><span class="w">
        </span><span class="mi">4'b110_0</span><span class="p">,</span><span class="w"> </span><span class="mi">4'b101_1</span><span class="o">:</span><span class="w"> </span><span class="p">{</span><span class="nv">p_</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nv">b2_</span><span class="p">};</span><span class="w">
        </span><span class="mi">4'b111_0</span><span class="p">,</span><span class="w"> </span><span class="mi">4'b110_1</span><span class="o">:</span><span class="w"> </span><span class="p">{</span><span class="nv">p_</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nv">b_</span><span class="p">};</span><span class="w">
        </span><span class="mi">4'b000_0</span><span class="p">,</span><span class="w"> </span><span class="mi">4'b111_1</span><span class="o">:</span><span class="w"> </span><span class="p">{</span><span class="nv">p_</span><span class="p">};</span><span class="w">
        </span><span class="mi">4'b001_0</span><span class="p">,</span><span class="w"> </span><span class="mi">4'b000_1</span><span class="o">:</span><span class="w"> </span><span class="p">{</span><span class="nv">p_</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nv">b_</span><span class="p">};</span><span class="w">
        </span><span class="mi">4'b010_0</span><span class="p">,</span><span class="w"> </span><span class="mi">4'b001_1</span><span class="o">:</span><span class="w"> </span><span class="p">{</span><span class="nv">p_</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nv">b2_</span><span class="p">};</span><span class="w">
        </span><span class="mi">4'b011_0</span><span class="p">,</span><span class="w"> </span><span class="mi">4'b010_1</span><span class="o">:</span><span class="w"> </span><span class="p">{</span><span class="nv">p_</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nv">b3_</span><span class="p">};</span><span class="w">
        </span><span class="mi">4'b011_1</span><span class="o">:</span><span class="w">           </span><span class="p">{</span><span class="nv">p_</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nv">b4_</span><span class="p">};</span><span class="w">
        </span><span class="kr">default</span><span class="o">:</span><span class="w"> </span><span class="p">{</span><span class="mi">0</span><span class="p">};</span><span class="w">  </span><span class="c">// never happens</span><span class="w">
    </span><span class="k">endcase</span><span class="p">;</span><span class="w">
    </span><span class="na">p </span><span class="o">&lt;= </span><span class="p">{</span><span class="nv">new_p</span><span class="p">[</span><span class="mi">34</span><span class="o">:</span><span class="mi">3</span><span class="p">]};</span><span class="w">
    </span><span class="na">a </span><span class="o">&lt;= </span><span class="p">{</span><span class="nv">new_p</span><span class="p">[</span><span class="mi">2</span><span class="o">:</span><span class="mi">0</span><span class="p">],</span><span class="w"> </span><span class="nv">a</span><span class="p">[</span><span class="mi">31</span><span class="o">:</span><span class="mi">3</span><span class="p">]};</span><span class="w">
    </span><span class="na">last_a </span><span class="o">&lt;= </span><span class="nv">a</span><span class="p">[</span><span class="mi">2</span><span class="p">];</span><span class="w">
    </span><span class="na">index </span><span class="o">&lt;= </span><span class="nv">index</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mi">3</span><span class="p">;</span><span class="w">

    </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="nv">index</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="mi">27</span><span class="p">)</span><span class="w"> </span><span class="na">state </span><span class="o">&lt;= </span><span class="no">Ready</span><span class="p">;</span><span class="w">  </span><span class="c">// on last step</span><span class="w">
</span><span class="kd">endrule</span><span class="w">
</span></code></pre></div></div>

<div class="language-bluespec highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">method</span><span class="w"> </span><span class="k">Action</span><span class="w"> </span><span class="nf">start</span><span class="p">(</span><span class="nc">Pair</span><span class="w"> </span><span class="kd">in</span><span class="p">)</span><span class="w"> </span><span class="nf">if</span><span class="w"> </span><span class="p">(</span><span class="nv">state</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="nc">Idle</span><span class="p">);</span><span class="w">
    </span><span class="c">// ... other boilerplate ...</span><span class="w">
    </span><span class="na">b3_ </span><span class="o">&lt;= </span><span class="nv">signExtend</span><span class="p">(</span><span class="nv">in</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nv">signExtend</span><span class="p">(</span><span class="nv">in</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nv">signExtend</span><span class="p">(</span><span class="nv">in</span><span class="p">[</span><span class="mi">1</span><span class="p">]);</span><span class="w">
</span><span class="kd">endmethod</span><span class="w">
</span></code></pre></div></div>

<div class="language-bluespec highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">method</span><span class="w"> </span><span class="k">ActionValue#</span><span class="p">(</span><span class="nc">Pair</span><span class="p">)</span><span class="w"> </span><span class="nf">result</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="nv">state</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="no">Ready</span><span class="p">);</span><span class="w">  </span><span class="c">// handles last 2 bits</span><span class="w">
    </span><span class="na">state </span><span class="o">&lt;= </span><span class="no">Idle</span><span class="p">;</span><span class="w">

    </span><span class="c">// Radix-4 step for the last 2 bits</span><span class="w">
    </span><span class="nc">Bit#</span><span class="p">(</span><span class="mi">35</span><span class="p">)</span><span class="w"> </span><span class="nv">new_p</span><span class="w"> </span><span class="o">= </span><span class="kr">case</span><span class="w"> </span><span class="p">({</span><span class="nv">a</span><span class="p">[</span><span class="mi">1</span><span class="o">:</span><span class="mi">0</span><span class="p">],</span><span class="w"> </span><span class="nv">last_a</span><span class="p">})</span><span class="w">
        </span><span class="mi">3'b111</span><span class="p">,</span><span class="w"> </span><span class="mi">3'b000</span><span class="o">:</span><span class="w"> </span><span class="p">{</span><span class="nv">p_</span><span class="p">};</span><span class="w">
        </span><span class="mi">3'b001</span><span class="p">,</span><span class="w"> </span><span class="mi">3'b010</span><span class="o">:</span><span class="w"> </span><span class="p">{</span><span class="nv">p_</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nv">b_</span><span class="p">};</span><span class="w">
        </span><span class="mi">3'b101</span><span class="p">,</span><span class="w"> </span><span class="mi">3'b110</span><span class="o">:</span><span class="w"> </span><span class="p">{</span><span class="nv">p_</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nv">b_</span><span class="p">};</span><span class="w">
        </span><span class="mi">3'b011</span><span class="o">:</span><span class="w"> </span><span class="p">{</span><span class="nv">p_</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nv">b2_</span><span class="p">};</span><span class="w">
        </span><span class="mi">3'b100</span><span class="o">:</span><span class="w"> </span><span class="p">{</span><span class="nv">p_</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nv">b2_</span><span class="p">};</span><span class="w">
        </span><span class="kr">default</span><span class="o">:</span><span class="w"> </span><span class="p">{</span><span class="mi">0</span><span class="p">};</span><span class="w">  </span><span class="c">// never happens</span><span class="w">
    </span><span class="k">endcase</span><span class="p">;</span><span class="w">

    </span><span class="k">let</span><span class="w"> </span><span class="nv">p_</span><span class="w"> </span><span class="o">= </span><span class="p">{</span><span class="nv">new_p</span><span class="p">[</span><span class="mi">33</span><span class="o">:</span><span class="mi">2</span><span class="p">]};</span><span class="w">
    </span><span class="k">let</span><span class="w"> </span><span class="nv">a_</span><span class="w"> </span><span class="o">= </span><span class="p">{</span><span class="nv">new_p</span><span class="p">[</span><span class="mi">1</span><span class="o">:</span><span class="mi">0</span><span class="p">],</span><span class="w"> </span><span class="nv">a</span><span class="p">[</span><span class="mi">31</span><span class="o">:</span><span class="mi">2</span><span class="p">]};</span><span class="w">

    </span><span class="k">return</span><span class="w"> </span><span class="nv">unpack</span><span class="p">({</span><span class="nv">p_</span><span class="p">,</span><span class="w"> </span><span class="nv">a_</span><span class="p">});</span><span class="w">
</span><span class="kd">endmethod</span><span class="w">
</span></code></pre></div></div>
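
<p>To convince ourselves that this schedule really covers all 32 bits, here is a small software model. It’s a Python sketch rather than the Bluespec from this post, and the names <code class="language-plaintext highlighter-rouge">bit</code> and <code class="language-plaintext highlighter-rouge">booth_multiply</code> are made up for illustration. It assumes <code class="language-plaintext highlighter-rouge">work</code> runs ten Radix-8 steps (at <code class="language-plaintext highlighter-rouge">index</code> 0, 3, ..., 27) and that <code class="language-plaintext highlighter-rouge">result</code> finishes with the two-bit step above. It ignores the register shifting and simply sums each recoded digit times <code class="language-plaintext highlighter-rouge">b</code> at the right weight.</p>

```python
def bit(x, i):
    """Bit i of x in two's complement; bit -1 plays the role of the initial last_a (0)."""
    if i == -1:
        return 0
    return (x // 2**i) % 2


def booth_multiply(a, b):
    """Multiply signed 32-bit a by b: ten radix-8 steps, then one two-bit step."""
    p = 0
    for k in range(10):  # corresponds to index = 0, 3, ..., 27
        i = 3 * k
        # Digit in -4..4, matching the {a[2:0], last_a} selector in work.
        digit = -4 * bit(a, i + 2) + 2 * bit(a, i + 1) + bit(a, i) + bit(a, i - 1)
        p += digit * b * 8**k
    # Last two bits (31 and 30), with bit 29 acting as last_a: digit in -2..2.
    digit = -2 * bit(a, 31) + bit(a, 30) + bit(a, 29)
    p += digit * b * 2**30
    return p


# Quick spot checks against Python's own multiplication.
assert booth_multiply(7, 3) == 7 * 3
assert booth_multiply(-5, 9) == -5 * 9
```

<p>For any <code class="language-plaintext highlighter-rouge">a</code> in the signed 32-bit range, the model should agree with plain multiplication, which makes it a cheap way to check a recoding table before writing any hardware.</p>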

<h4 id="area-analysis">Area Analysis</h4>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Area: 5261.75 um^2
Critical-path delay: 341.64 ps
Cycles: 10957 for 913 tests (10 internal cycles; 12 external cycles per)
</code></pre></div></div>

<p>Compared to Radix-4, we once again have about a 5% increase in critical path but a 125% increase in area. In exchange, we’ve gone from 16 internal cycles to 10 internal cycles, cutting our internal cycle count by over a third (37.5%).</p>

<p>The large increase in area is because in Bluespec, each use of <code class="language-plaintext highlighter-rouge">+</code> instantiates an adder, except in very special cases where we have repeated subexpressions (e.g., two adders that always perform identical calculations every cycle). So we’ve instantiated more adders. You may have noticed a similar effect when we went from Radix-2 to Radix-4.</p>

<p>If we wanted, we could even <a href="#built-in-multiplier">instantiate a full multiplier</a> with <code class="language-plaintext highlighter-rouge">*</code>.</p>

<p>For an area-efficient multiplier implementation, we may want to write one that uses only one adder. We would mux in the proper multiple of <code class="language-plaintext highlighter-rouge">b</code> instead of muxing between the results of several adders. So far, we’ve been generous with the amount of area we’re prepared to use. <a href="#reduced-adder-design-for-radix-8">I discuss a way to cut that area down later</a>.</p>

<p>In fact, Hennessy and Patterson’s description of higher-radix multiplication (what we’re doing) is in their section <em>Speeding Up Multiplication with a <strong>Single</strong> Adder</em>. Nine adders is starting to get a little high for a single-adder design.</p>

<p>It’s interesting that our area increased by 125%, since an initial count of our adders (just counting uses of <code class="language-plaintext highlighter-rouge">+</code>) suggests we now have 8 + 1 + 4 = 13 adders to our earlier 4. A 125% increase over 4 adders works out to exactly 9 adders, which is such a clean number that I would guess the compiler optimized out a few adders. Cycle-by-cycle, our <code class="language-plaintext highlighter-rouge">result</code> adders perform the same calculations as a subset of the <code class="language-plaintext highlighter-rouge">work</code> adders, so it makes sense for the two components to share the same adders and just wire the sums to both <code class="language-plaintext highlighter-rouge">work</code> and <code class="language-plaintext highlighter-rouge">result</code>.</p>

<h4 id="not-precomputing-3b">Not Precomputing 3b</h4>
<p>What happens if we don’t precompute 3b? Let’s do a little experiment and compare to the precomputed numbers:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Precomputed
Area: 5261.75 um^2
Critical-path delay: 341.64 ps

// Not precomputed
Area: 4830.56 um^2
Critical-path delay: 481.22 ps
</code></pre></div></div>
<p>As you can see, skipping the precomputation saves about 8% in area, but we increase our delay by about 40% because <code class="language-plaintext highlighter-rouge">work</code> must have two layers of adders if we compute <code class="language-plaintext highlighter-rouge">b3</code> at runtime.</p>

<h3 id="radix-16-consideration">Radix-16 Consideration</h3>

<p>If we continue onto Radix-16, we can expect more adders (unless we start sharing adders between multiples of <code class="language-plaintext highlighter-rouge">b</code>, as discussed above).</p>

<p>For <code class="language-plaintext highlighter-rouge">work</code>, we’d need to account for multiples up to <code class="language-plaintext highlighter-rouge">+/- 8b</code>, meaning 16 adders in that rule alone.</p>

<p>For <code class="language-plaintext highlighter-rouge">start</code>, we previously had one adder because we needed to compute <code class="language-plaintext highlighter-rouge">b3</code>. Now, we also need to compute <code class="language-plaintext highlighter-rouge">b5</code>, <code class="language-plaintext highlighter-rouge">b6</code>, and <code class="language-plaintext highlighter-rouge">b7</code>. Let’s say we get <code class="language-plaintext highlighter-rouge">b6</code> for free by performing a shift on <code class="language-plaintext highlighter-rouge">b3</code>. Rather than having a second layer of adders in <code class="language-plaintext highlighter-rouge">start</code>, we can add one more cycle in between <code class="language-plaintext highlighter-rouge">start</code> and <code class="language-plaintext highlighter-rouge">work</code> to precompute <code class="language-plaintext highlighter-rouge">b5</code> and <code class="language-plaintext highlighter-rouge">b7</code> by adding <code class="language-plaintext highlighter-rouge">b2 + b3</code> and <code class="language-plaintext highlighter-rouge">b4 + b3</code>, which also saves us from having to compute them every cycle. The precomputation cycle might be helpful if Bluespec doesn’t have a good 3-way single-layer adder as a built-in.</p>

<p>In total, that means we need 16 + 3 = 19 adders, more than double our current 9 adders.</p>
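
<p>The precomputation plan above can be sketched in a few lines. This is a Python model, not Bluespec, and <code class="language-plaintext highlighter-rouge">radix16_multiples</code> is a hypothetical helper name. It treats multiplying by a power of two as a free shift and counts only the real additions.</p>

```python
def radix16_multiples(b):
    """Build 1b through 8b with shifts (free wiring) and only three additions."""
    m = {1: b, 2: 2 * b, 4: 4 * b, 8: 8 * b}  # shifts of b
    m[3] = m[1] + m[2]  # adder in start, same as for Radix-8
    m[6] = 2 * m[3]     # shift of b3, no adder
    m[5] = m[2] + m[3]  # adder in the extra precomputation cycle
    m[7] = m[4] + m[3]  # adder in the extra precomputation cycle
    return m


multiples = radix16_multiples(11)

precompute_adders = 3  # b3, b5, b7
work_adders = 16       # one per nonzero digit from -8 to +8
total_adders = work_adders + precompute_adders
```

<p>Every positive multiple the <code class="language-plaintext highlighter-rouge">work</code> rule would select is then available as a register or a shift of one, and the negative multiples come from subtracting instead of adding.</p>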

<p>What would we get in return? As long as we can mux between 17 (versus previous 9) values without an issue, and as long as our <code class="language-plaintext highlighter-rouge">p_</code> can tolerate the fanout to all those adders, our critical path might not increase too much. We would just get a much higher area.</p>

<p>Our internal cycles should be 32/4 = 8 for the multiplication itself plus 1 for setup, for 9 internal cycles. Our earlier Radix-8 implementation saved an internal cycle by bringing work from <code class="language-plaintext highlighter-rouge">work</code> to <code class="language-plaintext highlighter-rouge">result</code>, so we can think of it as going from 10 to 9 internal cycles, or 11 to 9 if we discount the saved cycle. It seems like a lot of added complexity and area to save one or two cycles per multiplication.</p>

<p>Let us not do Radix-16 for now.</p>

<p>The calculus might be different if we were doing 64-bit multiplication. In that case, we would be going from 21 cycles to 17 cycles. If we did this for 128-bit multiplication, we’d go from 42 to 33 cycles. It’s the saved cycle in Radix-8 and the precomputation cycle in Radix-16 that result in a lot of overhead when we’re working with 32-bit multiplication.</p>
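
<p>Here is the cycle counting as a quick Python sketch. The function names are hypothetical, and the model assumes the same bookkeeping as above at every width: Radix-8 always folds its final step into <code class="language-plaintext highlighter-rouge">result</code>, and Radix-16 always pays one precomputation cycle.</p>

```python
def radix8_cycles(width):
    """ceil(width / 3) steps, minus the final step folded into result."""
    return -(-width // 3) - 1


def radix16_cycles(width):
    """width / 4 steps, plus one precomputation cycle for b5 and b7."""
    return width // 4 + 1


for width in (32, 64, 128):
    print(width, radix8_cycles(width), radix16_cycles(width))
# prints:
# 32 10 9
# 64 21 17
# 128 42 33
```

<p>The fixed one-cycle costs matter less as the operand gets wider, which is why the gap between the two designs grows with the width.</p>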

<h2 id="reduced-adder-design-for-radix-8">Reduced Adder Design for Radix-8</h2>

<p>All the above Radix-n designs are supposed to use only a single adder, at least in theory and according to Hennessy and Patterson. Because of the way we’ve implemented them, we’ve instantiated way more than a single adder.</p>

<p>In Bluespec, using the <code class="language-plaintext highlighter-rouge">+</code> operator, like calling any function, often results in the compiler inlining it and therefore synthesizing a brand new adder, unless it can determine that it can get away with fewer. That’s what happened when we did <a href="#radix-8-multiplication">Radix-8</a>, where some of our adders always compute the same values.</p>

<p>In theory, it <em>could</em> be the compiler that optimizes for things like area and critical-path delay, but the Bluespec compiler isn’t nearly as sophisticated as a mature C compiler. The developer often needs to optimize by hand for area and delay.</p>

<p>I would bet that there are more people who have contributed to C <a href="https://en.wikipedia.org/wiki/Optimizing_compiler">optimizing compilers</a> like <a href="https://en.wikipedia.org/wiki/GNU_Compiler_Collection">GCC</a> and <a href="https://en.wikipedia.org/wiki/Clang">Clang</a> than there are people who have used Bluespec <em>period</em>, so it isn’t quite a fair race. Perhaps optimizing RTL output for things like use of <code class="language-plaintext highlighter-rouge">+</code> isn’t even within scope for current Bluespec compiler engineers.</p>

<p>Let’s say we don’t want to instantiate so many adders. Let’s say we only want one.</p>

<p>If we want to add without synthesizing adders everywhere, we can instantiate our own <code class="language-plaintext highlighter-rouge">Adder</code> submodule and use that. Then, we can add by calling the corresponding method, and as long as we’re using a synthesized module, the compiler will throw a compilation error if it detects we’re overusing the single add port.</p>

<div class="language-bluespec highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">interface</span><span class="w"> </span><span class="nc">Adder#</span><span class="p">(</span><span class="k">type</span><span class="w"> </span><span class="nv">t</span><span class="p">);</span><span class="w">
    </span><span class="kd">method</span><span class="w"> </span><span class="nv">t</span><span class="w"> </span><span class="nf">add</span><span class="p">(</span><span class="nv">t</span><span class="w"> </span><span class="nv">a</span><span class="p">,</span><span class="w"> </span><span class="nv">t</span><span class="w"> </span><span class="nv">b</span><span class="p">);</span><span class="w">
</span><span class="kd">endinterface</span><span class="w">

</span><span class="kd">module</span><span class="w"> </span><span class="nf">mkAdder</span><span class="p">(</span><span class="nc">Adder#</span><span class="p">(</span><span class="nv">t</span><span class="p">))</span><span class="w"> </span><span class="kr">provisos</span><span class="w"> </span><span class="p">(</span><span class="no">Bits#</span><span class="p">(</span><span class="nv">t</span><span class="p">,</span><span class="w"> </span><span class="nv">t_bits</span><span class="p">),</span><span class="w"> </span><span class="nc">Arith#</span><span class="p">(</span><span class="nv">t</span><span class="p">));</span><span class="w">
    </span><span class="kd">method</span><span class="w"> </span><span class="nv">t</span><span class="w"> </span><span class="nf">add</span><span class="p">(</span><span class="nv">t</span><span class="w"> </span><span class="nv">a</span><span class="p">,</span><span class="w"> </span><span class="nv">t</span><span class="w"> </span><span class="nv">b</span><span class="p">)</span><span class="w"> </span><span class="o">= </span><span class="nv">a</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nv">b</span><span class="p">;</span><span class="w">
</span><span class="kd">endmodule</span><span class="w">

</span><span class="fm">(* synthesize *)</span><span class="w">
</span><span class="kd">module</span><span class="w"> </span><span class="nf">mkAdder35</span><span class="p">(</span><span class="nc">Adder#</span><span class="p">(</span><span class="nc">Bit#</span><span class="p">(</span><span class="mi">35</span><span class="p">)));</span><span class="w">
    </span><span class="nc">Adder#</span><span class="p">(</span><span class="nc">Bit#</span><span class="p">(</span><span class="mi">35</span><span class="p">))</span><span class="w"> </span><span class="nv">adder</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="na">mkAdder</span><span class="p">;</span><span class="w">
    </span><span class="k">return</span><span class="w"> </span><span class="nv">adder</span><span class="p">;</span><span class="w">
</span><span class="kd">endmodule</span><span class="w">
</span></code></pre></div></div>

<p>The first module is polymorphic, which gives us freedom in producing adders for different types. To massage the compiler for synthesis, I put in the second module and hard-coded it to work for 35 bits (a synthesized module needs a concrete interface type), which allows us to force a single port for the adder.</p>

<p>The overall multiplier still gets synthesized either way, but I want to mandate a single port for the adder. Non-synthesized modules can “grow” new ports, and then we’re right where we started.</p>

<p>Let’s drop this into place for our Radix-8 design like so:</p>

<div class="language-bluespec highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">Bit#</span><span class="p">(</span><span class="mi">35</span><span class="p">)</span><span class="w"> </span><span class="nv">operand</span><span class="w"> </span><span class="o">= </span><span class="kr">case</span><span class="w"> </span><span class="p">({</span><span class="nv">a</span><span class="p">[</span><span class="mi">2</span><span class="o">:</span><span class="mi">0</span><span class="p">],</span><span class="w"> </span><span class="nv">last_a</span><span class="p">})</span><span class="w">
    </span><span class="mi">4'b100_0</span><span class="o">:</span><span class="w">           </span><span class="p">{</span><span class="o">-</span><span class="w"> </span><span class="nv">b4_</span><span class="p">};</span><span class="w">
    </span><span class="mi">4'b101_0</span><span class="p">,</span><span class="w"> </span><span class="mi">4'b100_1</span><span class="o">:</span><span class="w"> </span><span class="p">{</span><span class="o">-</span><span class="w"> </span><span class="nv">b3_</span><span class="p">};</span><span class="w">
    </span><span class="mi">4'b110_0</span><span class="p">,</span><span class="w"> </span><span class="mi">4'b101_1</span><span class="o">:</span><span class="w"> </span><span class="p">{</span><span class="o">-</span><span class="w"> </span><span class="nv">b2_</span><span class="p">};</span><span class="w">
    </span><span class="mi">4'b111_0</span><span class="p">,</span><span class="w"> </span><span class="mi">4'b110_1</span><span class="o">:</span><span class="w"> </span><span class="p">{</span><span class="o">-</span><span class="w"> </span><span class="nv">b_</span><span class="p">};</span><span class="w">
    </span><span class="mi">4'b000_0</span><span class="p">,</span><span class="w"> </span><span class="mi">4'b111_1</span><span class="o">:</span><span class="w"> </span><span class="p">{</span><span class="mi">0</span><span class="p">};</span><span class="w">
    </span><span class="mi">4'b001_0</span><span class="p">,</span><span class="w"> </span><span class="mi">4'b000_1</span><span class="o">:</span><span class="w"> </span><span class="p">{</span><span class="nv">b_</span><span class="p">};</span><span class="w">
    </span><span class="mi">4'b010_0</span><span class="p">,</span><span class="w"> </span><span class="mi">4'b001_1</span><span class="o">:</span><span class="w"> </span><span class="p">{</span><span class="nv">b2_</span><span class="p">};</span><span class="w">
    </span><span class="mi">4'b011_0</span><span class="p">,</span><span class="w"> </span><span class="mi">4'b010_1</span><span class="o">:</span><span class="w"> </span><span class="p">{</span><span class="nv">b3_</span><span class="p">};</span><span class="w">
    </span><span class="mi">4'b011_1</span><span class="o">:</span><span class="w">           </span><span class="p">{</span><span class="nv">b4_</span><span class="p">};</span><span class="w">
    </span><span class="kr">default</span><span class="o">:</span><span class="w"> </span><span class="p">{</span><span class="mi">0</span><span class="p">};</span><span class="w">  </span><span class="c">// never happens</span><span class="w">
</span><span class="k">endcase</span><span class="p">;</span><span class="w">

</span><span class="k">let</span><span class="w"> </span><span class="nv">new_p</span><span class="w"> </span><span class="o">= </span><span class="nv">adder</span><span class="p">.</span><span class="nv">add</span><span class="p">(</span><span class="nv">p_</span><span class="p">,</span><span class="w"> </span><span class="nv">operand</span><span class="p">);</span><span class="w">
</span></code></pre></div></div>
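
<p>A table like this is easy to get subtly wrong, so here is a small Python check (a sketch, with hypothetical names <code class="language-plaintext highlighter-rouge">OPERAND_MULTIPLE</code> and <code class="language-plaintext highlighter-rouge">booth_digit</code>). It copies the case statement above as a dictionary from selector to multiple of <code class="language-plaintext highlighter-rouge">b</code>, then compares every entry against the Radix-8 Booth digit formula.</p>

```python
# Selector is {a[2:0], last_a}; the value is the multiple of b fed to the adder.
OPERAND_MULTIPLE = {
    0b100_0: -4,
    0b101_0: -3, 0b100_1: -3,
    0b110_0: -2, 0b101_1: -2,
    0b111_0: -1, 0b110_1: -1,
    0b000_0: 0, 0b111_1: 0,
    0b001_0: 1, 0b000_1: 1,
    0b010_0: 2, 0b001_1: 2,
    0b011_0: 3, 0b010_1: 3,
    0b011_1: 4,
}


def booth_digit(selector):
    """Radix-8 Booth digit: -4*a[2] + 2*a[1] + a[0] + last_a."""
    a2, a1, a0, last_a = (selector // 2**shift % 2 for shift in (3, 2, 1, 0))
    return -4 * a2 + 2 * a1 + a0 + last_a


# Selectors whose table entry disagrees with the formula (expected to be empty).
mismatches = [s for s in range(16) if OPERAND_MULTIPLE[s] != booth_digit(s)]
```

<p>All 16 selector values appear in the table, which is also why the <code class="language-plaintext highlighter-rouge">default</code> branch never fires.</p>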

<h3 id="lower-area-higher-delay">Lower Area, Higher Delay</h3>
<p>We get the following synthesis changes:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Area: 5261.75 um^2 -&gt; 2960.05 um^2 (remember our Radix-4 was 2328.83)
Critical-path delay: 341.64 ps -&gt; 520.25 ps  (big increase if we can't afford)
</code></pre></div></div>

<p>It’s strange that the critical path is even higher than when we had two layers of adders from our <a href="#not-precomputing-3b">non-precomputed b3 experiment</a>.</p>

<p>My first guess is that the much greater critical-path delay comes from fan-in from the 9-way muxing on the <code class="language-plaintext highlighter-rouge">b</code> operands into the adder. Although, maybe that’s not it, since we previously must’ve had fan-out from <code class="language-plaintext highlighter-rouge">p_</code> going into eight different adders. Hmmm. Unless fan-in is much more harmful than fan-out, that shouldn’t be it.</p>

<p>My second guess is that our “negative” operands require a lot of hardware before going into the adder, versus all this happening in one go with a built-in adder. We might effectively have <em>two</em> layers of adders, with an implied unary <code class="language-plaintext highlighter-rouge">0 - b_</code> for the first four cases to generate our negative operand, then going into the actual adder. That is, maybe in Bluespec, a <code class="language-plaintext highlighter-rouge">-</code> operator doesn’t instantiate an <code class="language-plaintext highlighter-rouge">a + (-b)</code>, but a genuine subtractor.</p>
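
<p>To make the second guess concrete, here is a Python sketch of the two ways to compute <code class="language-plaintext highlighter-rouge">p - b</code> on 35-bit values. The function names are hypothetical, and this only models the arithmetic. Whether the Bluespec compiler or the synthesis tool actually produces either structure is exactly the open question.</p>

```python
WIDTH = 35
MOD = 2**WIDTH


def negate_then_add(p, b):
    """Two carry chains in series: first 0 - b, then p + (-b)."""
    neg_b = (0 - b) % MOD
    return (p + neg_b) % MOD


def invert_and_carry_in(p, b):
    """One carry chain: p + NOT(b) + 1, where NOT flips all WIDTH bits."""
    not_b = (MOD - 1) - b
    return (p + not_b + 1) % MOD


samples = [(0, 0), (5, 3), (3, 5), (MOD - 1, 1), (0, MOD - 1), (12345, 2**34)]
agree = all(
    negate_then_add(p, b) == invert_and_carry_in(p, b) == (p - b) % MOD
    for p, b in samples
)
```

<p>Both forms give the same answer, but the first one puts two additions back to back, which is the “two layers of adders” situation described above.</p>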

<h3 id="experimentation">Experimentation</h3>
<p>I don’t have diagnostic tools, visual or command-line, to determine whether my second guess is correct, so let’s test that guess experimentally.</p>

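<p>When I replace the more elegant operand case statement with two separate case statements, one selecting the magnitude of the operand and one selecting between <code class="language-plaintext highlighter-rouge">add</code> and a <code class="language-plaintext highlighter-rouge">sub</code> method on the adder, we save some marginal area and delay at the expense of much uglier code. I plan on reverting.</p>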

<div class="language-bluespec highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">Bit#</span><span class="p">(</span><span class="mi">35</span><span class="p">)</span><span class="w"> </span><span class="nv">operand</span><span class="w"> </span><span class="o">= </span><span class="kr">case</span><span class="w"> </span><span class="p">({</span><span class="nv">a</span><span class="p">[</span><span class="mi">2</span><span class="o">:</span><span class="mi">0</span><span class="p">],</span><span class="w"> </span><span class="nv">last_a</span><span class="p">})</span><span class="w">
    </span><span class="mi">4'b100_0</span><span class="o">:</span><span class="w">           </span><span class="nv">b4_</span><span class="p">;</span><span class="w">
    </span><span class="mi">4'b101_0</span><span class="p">,</span><span class="w"> </span><span class="mi">4'b100_1</span><span class="o">:</span><span class="w"> </span><span class="nv">b3_</span><span class="p">;</span><span class="w">     
    </span><span class="mi">4'b110_0</span><span class="p">,</span><span class="w"> </span><span class="mi">4'b101_1</span><span class="o">:</span><span class="w"> </span><span class="nv">b2_</span><span class="p">;</span><span class="w">     
    </span><span class="mi">4'b111_0</span><span class="p">,</span><span class="w"> </span><span class="mi">4'b110_1</span><span class="o">:</span><span class="w"> </span><span class="nv">b_</span><span class="p">;</span><span class="w">     
    </span><span class="mi">4'b000_0</span><span class="p">,</span><span class="w"> </span><span class="mi">4'b111_1</span><span class="o">:</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w">
    </span><span class="mi">4'b001_0</span><span class="p">,</span><span class="w"> </span><span class="mi">4'b000_1</span><span class="o">:</span><span class="w"> </span><span class="nv">b_</span><span class="p">;</span><span class="w">    
    </span><span class="mi">4'b010_0</span><span class="p">,</span><span class="w"> </span><span class="mi">4'b001_1</span><span class="o">:</span><span class="w"> </span><span class="nv">b2_</span><span class="p">;</span><span class="w">     
    </span><span class="mi">4'b011_0</span><span class="p">,</span><span class="w"> </span><span class="mi">4'b010_1</span><span class="o">:</span><span class="w"> </span><span class="nv">b3_</span><span class="p">;</span><span class="w">     
    </span><span class="mi">4'b011_1</span><span class="o">:</span><span class="w">           </span><span class="nv">b4_</span><span class="p">;</span><span class="w">
</span><span class="k">endcase</span><span class="p">;</span><span class="w">

</span><span class="nc">Bool</span><span class="w"> </span><span class="nv">is_add</span><span class="w"> </span><span class="o">= </span><span class="kr">case</span><span class="w"> </span><span class="p">({</span><span class="nv">a</span><span class="p">[</span><span class="mi">2</span><span class="o">:</span><span class="mi">0</span><span class="p">],</span><span class="w"> </span><span class="nv">last_a</span><span class="p">})</span><span class="w">
    </span><span class="mi">4'b100_0</span><span class="p">,</span><span class="w"> </span><span class="mi">4'b101_0</span><span class="p">,</span><span class="w"> </span><span class="mi">4'b100_1</span><span class="p">,</span><span class="w">
    </span><span class="mi">4'b110_0</span><span class="p">,</span><span class="w"> </span><span class="mi">4'b101_1</span><span class="p">,</span><span class="w"> </span><span class="mi">4'b111_0</span><span class="p">,</span><span class="w">
    </span><span class="mi">4'b110_1</span><span class="p">,</span><span class="w"> </span><span class="mi">4'b000_0</span><span class="p">,</span><span class="w"> </span><span class="mi">4'b111_1</span><span class="o">:</span><span class="w"> </span><span class="nc">False</span><span class="p">;</span><span class="w">
    </span><span class="mi">4'b001_0</span><span class="p">,</span><span class="w"> </span><span class="mi">4'b000_1</span><span class="p">,</span><span class="w"> </span><span class="mi">4'b010_0</span><span class="p">,</span><span class="w">
    </span><span class="mi">4'b001_1</span><span class="p">,</span><span class="w"> </span><span class="mi">4'b011_0</span><span class="p">,</span><span class="w"> </span><span class="mi">4'b010_1</span><span class="p">,</span><span class="w">
    </span><span class="mi">4'b011_1</span><span class="o">:</span><span class="w">                     </span><span class="nc">True</span><span class="p">;</span><span class="w">
</span><span class="k">endcase</span><span class="p">;</span><span class="w">

</span><span class="k">let</span><span class="w"> </span><span class="nv">new_p</span><span class="w"> </span><span class="o">= </span><span class="p">(</span><span class="nv">is_add</span><span class="p">)</span><span class="w"> </span><span class="kp">?</span><span class="w"> </span><span class="nv">adder</span><span class="p">.</span><span class="nv">add</span><span class="p">(</span><span class="nv">p_</span><span class="p">,</span><span class="w"> </span><span class="nv">operand</span><span class="p">)</span><span class="w"> </span><span class="o">:</span><span class="w">
                       </span><span class="nv">adder</span><span class="p">.</span><span class="nv">sub</span><span class="p">(</span><span class="nv">p_</span><span class="p">,</span><span class="w"> </span><span class="nv">operand</span><span class="p">);</span><span class="w">
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Area: 2948.08 um^2 (from 2960.05)
Critical-path delay: 494.37 ps (from 520.25)
Critical path: ~b[0] -&gt; p[31:0][30](DFF_in)
</code></pre></div></div>

<p>It’s a rather marginal change though, so I don’t think that was it. A third guess: maybe it’s because our adder is shared between too many spots: <code class="language-plaintext highlighter-rouge">work</code>, <code class="language-plaintext highlighter-rouge">start</code>, <code class="language-plaintext highlighter-rouge">result</code>.</p>

<p>If we then only use a shared adder in <code class="language-plaintext highlighter-rouge">work</code>, instantiating standalone adders for <code class="language-plaintext highlighter-rouge">start</code> and <code class="language-plaintext highlighter-rouge">result</code>, we get a delay of 432 ps, still much higher than 342 ps from the 9-adder implementation (and with obvious cost of higher area from more adders). But that changes our critical path to going through <code class="language-plaintext highlighter-rouge">last_a -&gt; p[31:0][31](DFF_in)</code>, so it could be an issue with our selector fan-out. Maybe that was it?</p>

<h3 id="exercise-for-the-reader">Exercise for the Reader</h3>
<p>I’m starting to run out of guesses. This is starting to creep into a lower level of abstraction than I’m accustomed to. (I got most of my expertise in <a href="/projects/processor/">big processor things</a>.) Where’s all the delay coming from? Consider this an exercise for the reader.</p>

<blockquote>
  <p>If you have an idea why our delay has ballooned so much, contact me and I’ll see about following up in a later post. You can play around with the source on the <a href="https://github.com/mchanphilly/basic-multiplier/tree/main">corresponding GitHub repo</a> as long as you have the Bluespec compiler and the <a href="https://github.com/minispec-hdl/minispec/tree/main/synth"><code class="language-plaintext highlighter-rouge">synth</code></a> tool. Otherwise, you can let me know and I can test it later.</p>
</blockquote>

<p>It might be an issue with our synthesis tool. It could be falling into a local minimum. Maybe we could also draw diagrams and compare the two designs.</p>

<p>My current method of synthesis is primitive and uses the <a href="https://github.com/minispec-hdl/minispec/tree/main/synth"><code class="language-plaintext highlighter-rouge">synth</code></a> tool in the <a href="https://github.com/minispec-hdl/minispec/tree/main">Minispec</a> repository. There may be a better way that can get us more information about the area and delay, but I haven’t worked through such a process yet. I would only be a little surprised if nobody’s made an easy way to get timing and area information for a Bluespec design.</p>

<p>Without great tools for diagnosing timing and area issues, we can still think conceptually and look at the numbers we have, but we can’t yet draw strong conclusions about area and delay directly from our Bluespec.</p>

<h2 id="built-in-multiplier">Built-in Multiplier</h2>
<p>Bluespec offers a built-in multiplier as a primitive through the <code class="language-plaintext highlighter-rouge">*</code> operator. Here’s an implementation of our <code class="language-plaintext highlighter-rouge">MultiplierUnit</code> interface using it:</p>

<div class="language-bluespec highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="fm">(* synthesize *)</span><span class="w">
</span><span class="kd">module</span><span class="w"> </span><span class="nf">mkBuiltInMultiplierUnit</span><span class="p">(</span><span class="nc">MultiplierUnit</span><span class="p">);</span><span class="w">
    </span><span class="nc">FIFO#</span><span class="p">(</span><span class="nc">Bit#</span><span class="p">(</span><span class="mi">64</span><span class="p">))</span><span class="w"> </span><span class="nv">data</span><span class="w"> </span><span class="o">&lt;-</span><span class="w"> </span><span class="na">mkFIFO</span><span class="p">;</span><span class="w">

    </span><span class="kd">method</span><span class="w"> </span><span class="k">Action</span><span class="w"> </span><span class="nf">start</span><span class="p">(</span><span class="nc">Pair</span><span class="w"> </span><span class="kd">in</span><span class="p">);</span><span class="w">
        </span><span class="nv">data</span><span class="p">.</span><span class="nf">enq</span><span class="p">(</span><span class="na">signExtend</span><span class="p">(</span><span class="nv">in</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="na">signExtend</span><span class="p">(</span><span class="nv">in</span><span class="p">[</span><span class="mi">1</span><span class="p">]));</span><span class="w">
    </span><span class="kd">endmethod</span><span class="w">

    </span><span class="kd">method</span><span class="w"> </span><span class="k">ActionValue#</span><span class="p">(</span><span class="nc">Pair</span><span class="p">)</span><span class="w"> </span><span class="nf">result</span><span class="p">;</span><span class="w">
        </span><span class="nv">data</span><span class="p">.</span><span class="na">deq</span><span class="p">;</span><span class="w">
        </span><span class="k">return</span><span class="w"> </span><span class="nv">unpack</span><span class="p">(</span><span class="nv">data.first</span><span class="p">);</span><span class="w">
    </span><span class="kd">endmethod</span><span class="w">
</span><span class="kd">endmodule</span><span class="w">
</span></code></pre></div></div>

<p>It can do one multiplication per cycle. What’s the catch? You can probably guess: area and delay.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Area: 16272.28 um^2
Critical-path delay: 858.07 ps
Cycles: 914 cycles for 912 tests (basically one cycle per multiply)
</code></pre></div></div>

<p>Surprisingly, the penalty doesn’t seem untenably high. Wow. I only found this at the end of experimenting with the other multipliers.</p>

<p>Compared to our many-adder interpretation of Radix-8, the built-in multiplier has a 200% increase in area and a 150% increase in delay. Compared to our single-adder interpretation of Radix-8, it has a 450% increase in area and a 65% increase in delay. That’s a sizable increase, but the single cycle is crazy; we save the 9 internal cycles, and we save the external cycles too. <em>One multiplication per cycle!</em></p>

<p>It almost seems worth the cost. With my slow <a href="/projects/processor/">processor</a> and its 1000-1200 ps critical-path delay, maybe I could even embed the built-in multiplier as-is as a single-cycle multiplier. I would only need to replace it if I started speeding up the rest of the stages.</p>

<p>The efficiency probably comes from the built-in multiplier most likely being written directly in Verilog, or with a very low-level use of Bluespec. If we’re willing to give up the single-cycle latency (and either wait or pipeline), we can definitely do better in Verilog, and we might be able to do better in Bluespec.</p>

<p>Maybe in a later post I’ll attempt to write a better multiplier in Verilog. In that case, it might still be useful to sketch things out first in Bluespec, just to check that the algorithm works cycle-wise. Once we have a Verilog module, we can still embed it in a Bluespec design for simulation and synthesis, just like an <a href="https://en.wikipedia.org/wiki/Semiconductor_intellectual_property_core">IP block</a>.</p>
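
<p>As a rough sketch of what that embedding could look like, Bluespec’s <code class="language-plaintext highlighter-rouge">import "BVI"</code> mechanism wraps a Verilog module behind an existing interface. Everything specific here is an assumption for illustration: the Verilog module name <code class="language-plaintext highlighter-rouge">my_multiplier</code>, every port name, and the scheduling annotations are hypothetical, and the wrapper is assumed to live in the same package as <code class="language-plaintext highlighter-rouge">MultiplierUnit</code> and <code class="language-plaintext highlighter-rouge">Pair</code>.</p>

<div class="language-bluespec highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Hypothetical wrapper: "my_multiplier" and all port names are placeholders
// for whatever the real Verilog module exposes.
import "BVI" my_multiplier =
module mkVerilogMultiplierUnit(MultiplierUnit);
    default_clock clk(CLK);
    default_reset rst(RST_N);

    // Action method: IN carries the packed Pair, EN_START fires the method,
    // and RDY_START tells Bluespec when the method may fire.
    method start(IN) enable(EN_START) ready(RDY_START);

    // ActionValue method: OUT carries the packed result, EN_RESULT consumes it.
    method OUT result() enable(EN_RESULT) ready(RDY_RESULT);

    // Scheduling annotations are a promise to the compiler about the Verilog's
    // behavior; these are assumed, not derived from a real module.
    schedule (start)  C  (start);
    schedule (result) C  (result);
    schedule (start)  CF (result);
endmodule
</code></pre></div></div>

<p>The compiler trusts these annotations rather than checking them against the Verilog, so the ready signals and schedule would need to match how the real module actually behaves.</p>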

<h2 id="next-time">Next Time</h2>

<p>Next time, we’ll look at some more advanced multiplier designs. I’m particularly interested in increasing throughput, either through using faster adders or by pipelining. And I’m planning on looking at designs that <em>intentionally</em> use many adders.</p>

<p>Our aim right now is to increase multiplier throughput while assuming that our compiler gives us reasonable but not necessarily optimal hardware. Because we would be trying to fit this multiplier as a functional unit in our <a href="/projects/processor/">processor</a>, we don’t have to worry all that much about the multiplier’s critical path as long as it stays below that of the slowest other stage of the processor, and as long as the area isn’t prohibitively large.</p>

<p>The reduced-adder design also lets us implement multipliers that use more specialized adders rather than relying only on the Bluespec built-in <code class="language-plaintext highlighter-rouge">+</code> operator. That might come in handy later if we look at designs that use particular adders, like <a href="https://en.wikipedia.org/wiki/Carry-save_adder">carry-save adders</a>. It might also be worth taking a peek at what state-of-the-art Bluespec processors <em>actually</em> use, in designs like <a href="https://github.com/bluespec/Flute">Flute</a> or <a href="https://github.com/bluespec/Toooba">Toooba</a>.</p>
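
<p>To make the carry-save idea a bit more concrete, here is a minimal sketch of a single 3:2 carry-save step as a Bluespec function. The name <code class="language-plaintext highlighter-rouge">carrySave</code> is hypothetical and isn’t part of the code in this post; it only illustrates the textbook technique of reducing three operands to two without propagating carries.</p>

<div class="language-bluespec highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Hypothetical helper (not from this post's code): one 3:2 carry-save step.
// Invariant: a + b + c == sum + carry (mod 2^n).
function Tuple2#(Bit#(n), Bit#(n)) carrySave(Bit#(n) a, Bit#(n) b, Bit#(n) c);
    // Per-bit sum with no carry chain between bit positions.
    Bit#(n) sum   = a ^ b ^ c;
    // Per-bit majority gives the carry out of each position, shifted up by one.
    Bit#(n) carry = ((a &amp; b) | (a &amp; c) | (b &amp; c)) &lt;&lt; 1;
    return tuple2(sum, carry);
endfunction
</code></pre></div></div>

<p>Only the final <code class="language-plaintext highlighter-rouge">sum + carry</code> needs a carry-propagating adder, which is why this kind of step is attractive when accumulating many partial products.</p>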

<p>If we do implement adders, then we’ll want to construct a testbench similar to the one we constructed for multipliers. We might need to focus a post on adders in particular before focusing on applying them toward multipliers. All in due time.</p>

<p>At some point, I’ll write a similar post about basic division, since it’s the other half of the RISC-V “M” extension.</p>]]></content><author><name>Martin Chan</name><email>martinch@mit.edu</email></author><category term="post" /><summary type="html"><![CDATA[Technical blog post with worked examples of integer multipliers in Bluespec and a simple testbench. The multiplier was written with guidance from the Hennessy and Patterson textbook.]]></summary></entry></feed>