Showing results for 
Show  only  | Search instead for 
Did you mean: 

Shazam! Jan Gray gets 1680 RISC-V processors to dance on the head of a Xilinx Virtex UltraScale+ VU9P FPGA at Hot Chips

Xilinx Employee
Xilinx Employee
0 1 75.8K


Even though I knew this was coming, it’s still hard to write this blog post without grinning. Last week, acknowledged FPGA-based processor wizard Jan Gray of Gray Research LLC presented a Hot Chips poster titled “GRVI Phalanx: A Massively Parallel RISC-V FPGA Accelerator Framework: A 1680-core, 26 MB SRAM Parallel Processor Overlay on Xilinx UltraScale+ VU9P.” Allow me to unpack that title and the details of the GRVI Phalanx for you.


Let’s start with 1680 “austere” processing elements in the GRVI Phalanx, which are based on the 32-bit RISC-V processor architecture. (Is that parallel enough for you?) The GRVI processing element design follows “Jan’s Razor”: In a chip multiprocessor, cut nonessential resources from each CPU, to maximize CPUs per die. Thus, a GRVI processing element is a 3-stage, user-mode RV321 core minus a few nonessential bits and pieces. It looks like this:



GRVI Processing Element.jpg


A GRVI Processing Element




Each GRVI processing element requires ~320 LUTs and runs at 375MHz. Typical of a Jan Gray design, the GRVI processing element is hand-mapped and –floorplanned into the UltraScale+ architecture and then stamped 1680 times into the Virtex UltraScale+ VU9P FPGA on a VCU118 Eval Kit.


Now, dropping a bunch of processor cores onto a large device like the Virtex UltraScale+ VU9P FPGA is interesting but less than useful unless you give all of those cores some memory to operate out of, some way for the processors to communicate with each other and with the world beyond the FPGA package, and some way to program the overall machine.


Therefore, the GRVI processing elements are packaged in clusters containing as many as eight processing elements with 32 to 128Kbytes of RAM, and additional accelerator(s). Each cluster is tied to the other on-chip clusters and to the external-world I/O through a HOPLITE router to a NOC (network on chip) with 100Gbps links between nodes. The HOPLITE router is an FPGA-optimized, directional router designed for a 2D torus network.


A GRVI Phalanx cluster looks like this:



GRVI Phalanx Cluster.jpg


A GRVI Phalanx Cluster




Currently, Gray’s paper says there a multithreaded C++ compiler with message-passing runtime layered on top of a RISC-V RV321MA GCC compiler with future plans to support OpenCL, P4, and other programming tools.


In development: an 80-core educational version of the GRVI Phalanx instantiated in the programmable logic of a “low-end” Zynq Z-7020 SoC on the Digilent PYNQ-Z1 board.


Now if all that were not enough (and you will find a lot more packed into Gray’s poster), there’s a Xilinx Virtex UltraScale+ VU9P available to you. It’s as near as your keyboard and Web browser on the Amazon AWS EC2 F1.2XL and F1.16XL instances and Jan Gray is working on putting the GRVI Phalanx on that platform as well.


Incredibly, it’s all in that Hot Chips poster.

1 Comment