In the previous post, we discussed top-level requirements. Now we drill down one level, identify key components and apply our requirements to them. We also look around for existing cores or applicable specs that might fit the bill.

The Bus

The Bus, or interconnect, is the fabric stitching together the SoC internal components. For this project, the two most relevant SoC internal bus specifications are ARM’s AXI bus and the Open-Source Wishbone bus.

AXI is very powerful, very popular, and very complex. It scales up well to very big SoCs. However, I don’t think it scales down very well to simple SoCs, such as BoxLambda, where low latency and low complexity are more important than high bandwidth and scalability. Hence, for this project, I’m electing to go with Wishbone.

We’ll be using the Wishbone B4 specification.

Sticking to a well-defined internal bus specification certainly helps to meet the Modular Architecture Requirement. Whether we can also accommodate Partial FPGA Reconfiguration using a Wishbone Interconnect remains to be seen.
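To make the trade-off concrete, here is a minimal sketch of a Wishbone B4 classic-cycle slave: a single 32-bit read/write register (so no adr_i port). This is my own illustration rather than a BoxLambda component; the signal names follow the B4 specification:

```systemverilog
// Minimal Wishbone B4 classic-cycle slave: one 32-bit read/write register.
module wb_reg (
  input  logic        clk_i,
  input  logic        rst_i,
  input  logic        cyc_i,   // Bus cycle in progress.
  input  logic        stb_i,   // This slave is selected.
  input  logic        we_i,    // 1 = write, 0 = read.
  input  logic [3:0]  sel_i,   // Byte lane selects.
  input  logic [31:0] dat_i,
  output logic [31:0] dat_o,
  output logic        ack_o
);
  logic [31:0] reg_q;

  always_ff @(posedge clk_i) begin
    if (rst_i) begin
      ack_o <= 1'b0;
      reg_q <= '0;
    end else begin
      // Acknowledge every selected cycle with a one-clock pulse.
      ack_o <= cyc_i & stb_i & ~ack_o;
      if (cyc_i & stb_i & we_i & ~ack_o) begin
        for (int i = 0; i < 4; i++) begin
          if (sel_i[i]) reg_q[8*i +: 8] <= dat_i[8*i +: 8];
        end
      end
    end
  end

  assign dat_o = reg_q;
endmodule
```

Even this toy shows the attraction for a small SoC: a single cyc/stb/ack handshake per transfer, rather than AXI’s five independent channels, keeps both the latency and the logic footprint low.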

The Processor

Processor Word Size

Typical processor word sizes are 8-bit, 16-bit, 32-bit, and 64-bit. Which word size is the best fit for BoxLambda?

  • 8-bit: A good word size.
    • Pros:
      • An 8-bit word (i.e. a byte) is a good natural fit for a pixel value, an ASCII character code, or small integer values.
      • 8-bit processors, their programs, and their data are very compact.
      • 8-bit processors side-step some of the alignment issues seen with larger word sizes.
    • Cons:
      • An 8-bit word is too small to conveniently hold the values you need in a typical program - think calculations and table indices.
      • Toolchain support for higher-level languages is limited.
  • 16-bit: A clumsy compromise between 8 and 32 bits. It made sense when 32-bit processors were not yet readily available. Now, not so much.
  • 32-bit: Another good word size.
    • Pros: 32-bit words can hold most real-world numbers and cover a huge address space. 32-bit machines generally have good toolchain support.
    • Cons: A 32-bit processor is much bigger than its 8-bit counterpart in terms of FPGA real estate, program size, and data size.
  • 64-bit: A big and clunky word size, way too big to handle conveniently, intended for specialized use cases that don’t fit this project.

I’ve decided to go for a 32-bit processor. A 32-bit processor (and associated on-chip memory) will take a bigger chunk out of our FPGA real estate, but I think it’s worth it. I like the convenience of 32-bit registers, and a 32-bit processor may come with a regular GCC toolchain.

Processor Features

In addition to a 32-bit word size, we’re looking for the following features in our microprocessor:

  • Ease of programming, meaning:
    • A simple, well-documented Instruction Set Architecture (ISA). We want to be able to program the machine at the assembly-language level.
    • Shallow Pipeline: It is relatively easy to reason about the behavior of a processor with a two-stage pipeline. It is not very easy to reason about the behavior of a processor with a six-stage pipeline (see the sketch after this list).
    • Good toolchain support, such as GCC, so we can build a software ecosystem for our machine.
  • An accessible and well-documented implementation.
  • It has to fit our FPGA, leaving enough space for the other components.
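To illustrate the shallow-pipeline point: in a two-stage (fetch/execute) pipeline, there is exactly one instruction in flight behind the executing one, so a taken branch only has to flush a single fetch. The skeleton below is a hypothetical illustration of that structure, not any particular core:

```systemverilog
// Toy two-stage pipeline skeleton: a single pipeline register sits between
// fetch and execute, so a taken branch flushes exactly one instruction.
module two_stage_skeleton (
  input  logic        clk_i,
  input  logic        rst_i,
  // Instruction memory with same-cycle (combinational) read, for simplicity.
  output logic [31:0] imem_addr_o,
  input  logic [31:0] imem_rdata_i,
  // Branch decision, computed in the execute stage.
  input  logic        branch_taken_i,
  input  logic [31:0] branch_target_i,
  // Instruction presented to the execute stage.
  output logic [31:0] ex_instr_o,
  output logic        ex_valid_o
);
  logic [31:0] pc_q;

  assign imem_addr_o = pc_q;

  always_ff @(posedge clk_i) begin
    if (rst_i) begin
      pc_q       <= 32'h0;
      ex_instr_o <= 32'h0;
      ex_valid_o <= 1'b0;
    end else if (branch_taken_i) begin
      pc_q       <= branch_target_i;
      ex_valid_o <= 1'b0;          // Flush the one in-flight instruction.
    end else begin
      pc_q       <= pc_q + 32'd4;
      ex_instr_o <= imem_rdata_i;
      ex_valid_o <= 1'b1;
    end
  end
endmodule
```

The entire hazard story fits in one always_ff block. Try writing down the equivalent for a six-stage pipeline with forwarding and stalls, and the ease-of-reasoning argument makes itself.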

With all that in mind, I think RISC-V is a great option.

  • Great ISA, building on lessons learned from previous popular processor architectures.
  • 32-bit support.
  • GCC toolchain support.
  • Open-Source.
  • Well-documented.
  • Very fashionable. Let’s ride that wave :-)

There are a lot of RISC-V implementations to choose from. The Ibex project seems like a good choice:

  • 32-bit RISC-V.
  • High-quality, well-documented implementation.
  • SystemVerilog-based. My preferred HDL.
  • Supports a small, two-stage pipeline configuration (see the sketch below).
  • Very active project.
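To give a feel for what adopting Ibex looks like, here is a sketch of an instantiation with the small configuration selected (WritebackStage = 0 gives the two-stage pipeline). The parameter and port names are taken from the Ibex repository, but I’m omitting the interrupt, debug, and test ports, so treat this as a fragment and check ibex_top.sv for the authoritative interface:

```systemverilog
// Sketch: Ibex in its small, two-stage configuration. Interrupt, debug,
// and DFT ports are omitted; see ibex_top.sv for the full port list.
ibex_top #(
  .RV32E          (1'b0),                 // Full 32-register integer file.
  .RV32M          (ibex_pkg::RV32MFast),  // Multiply/divide support.
  .RV32B          (ibex_pkg::RV32BNone),  // No bit-manipulation extension.
  .WritebackStage (1'b0)                  // 0 = two-stage pipeline.
) u_cpu (
  .clk_i          (clk),
  .rst_ni         (rst_n),
  .hart_id_i      (32'h0),
  .boot_addr_i    (32'h0000_0000),
  // Instruction fetch interface (req/gnt/rvalid handshake).
  .instr_req_o    (instr_req),
  .instr_gnt_i    (instr_gnt),
  .instr_rvalid_i (instr_rvalid),
  .instr_addr_o   (instr_addr),
  .instr_rdata_i  (instr_rdata),
  .instr_err_i    (1'b0),
  // Load/store interface.
  .data_req_o     (data_req),
  .data_gnt_i     (data_gnt),
  .data_rvalid_i  (data_rvalid),
  .data_we_o      (data_we),
  .data_be_o      (data_be),
  .data_addr_o    (data_addr),
  .data_wdata_o   (data_wdata),
  .data_rdata_i   (data_rdata),
  .data_err_i     (1'b0)
  // irq_*, debug_req_i, fetch_enable_i, etc. omitted.
);
```

Note that these memory interfaces are not Wishbone. Hooking Ibex up to our interconnect will take a small adapter that maps the req/gnt/rvalid handshake onto Wishbone’s cyc/stb/ack signals.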

The Memory Controller

SDRAM access is pretty complicated. Memory access requests get queued in the memory controller, scheduled, and turned into sequences of commands whose execution times depend on which memory locations were recently accessed.

There exists a class of memory controllers, called Static Memory Controllers, that absorb these complexities: by design, they create a fixed command schedule for a fixed use case, resulting in very predictable behavior. Static Memory Controllers are far off the beaten path, however. Dynamic Memory Controllers are more common. Dynamic Memory Controllers can handle a variety of use cases with good performance on average. Unfortunately, they sacrifice predictability to achieve this flexibility.

Ideally, we would use an accessible, well-documented, open-source, static memory controller design. Unfortunately, I can’t find one. Rolling our own is not an option either. Doing so would require so much specialized know-how that it would kill this project. Pragmatically, our best option is to use Xilinx’s Memory Interface Generator (MIG) with the Arty A7 (or Nexys A7) parameters as published by Digilent.

The Xilinx memory controller falls squarely into the Dynamic Memory Controller class. How do we fit it into a platform that requires deterministic behavior? I think the best approach is to use a DMA engine to transfer data between SDRAM and on-chip memory. Fixed memory access latency to on-chip memory (from any bus master that requires it) can then be guaranteed using an arbiter, as sketched below. We’ll revisit this topic when we discuss BoxLambda’s architecture.
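To show how an arbiter can provide such a guarantee, here is a minimal sketch (my illustration, not BoxLambda’s actual design) of a two-master time-slot arbiter in front of a single-port on-chip RAM. Each master owns every other cycle, so the worst-case wait for a grant is exactly one cycle, no matter what the other master is doing:

```systemverilog
// Two-master time-slot arbiter: masters are granted on alternating cycles,
// giving each one a fixed, load-independent worst-case grant latency.
module tdm_arbiter2 (
  input  logic clk_i,
  input  logic rst_i,
  input  logic m0_req_i,
  input  logic m1_req_i,
  output logic m0_gnt_o,
  output logic m1_gnt_o
);
  logic slot_q;  // 0: master 0 owns the cycle, 1: master 1 owns it.

  always_ff @(posedge clk_i) begin
    if (rst_i) slot_q <= 1'b0;
    else       slot_q <= ~slot_q;  // Alternate unconditionally.
  end

  assign m0_gnt_o = m0_req_i & ~slot_q;
  assign m1_gnt_o = m1_req_i &  slot_q;
endmodule
```

Alternating unconditionally wastes slots that a work-conserving round-robin arbiter would hand out, but that waste is exactly what buys the fixed worst-case latency.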
