02-15-2012 01:44 AM
I am excited that I can announce the first public release of the optimizing C compiler for the PicoBlaze processor. The compiler toolchain can be downloaded for academic and non-commercial use. The toolchain is built around the LLVM compiler framework and the elfutils low-level binary tools. It consists primarily of the optimizing C compiler, gas-like assembler, and ELF-based linker.
(I am sorry if someone considers this message spam. I've read the User Guidelines, but I really think the compiler can be of some use to the community. There's even a FAQ entry in this board stating that a PicoBlaze compiler is not necessary, hence I wanted to point out a counterexample.)
BIG FAT WARNING #1:
The software is NOT produced, reviewed, nor supported by Xilinx Inc. in any way! Please, do not blame any bugs or flaws in the compiler on the authors of the PicoBlaze processor or their company!
BIG FAT WARNING #2:
As of now the compiler must be considered alpha quality. Do not commit yourself to it unless you understand all its pitfalls and limitations! We are a small research group in a publicly funded institute [www.utia.cz]; the compiler was developed as a part of the SMECY [www.smecy.eu] research project. We cannot guarantee technical support.
Space--the final frontier!
02-15-2012 11:31 AM
A good PicoBlaze(tm) compiler is a good thing. As you so clearly state, this is an alpha level (first look) tool, not intended for anything serious, and not supported, nor endorsed by Xilinx.
But, I for one, encourage those who have the time to help you, and play with it.
I am aware of others who have tried to provide C compilers for PicoBlaze (or similar), so more power to the often overlooked, but increasingly ubiquitous PicoBlaze.
Xilinx San Jose
02-17-2012 04:28 AM
Since implementing my first ‘PSM’ back in 1993, I have always been delighted to hear about projects related to PicoBlaze as well as from customers that directly use it in so many ways within real designs. Thank you to the folks at UTIA for sharing their ‘C’ compiler project with the community and I really hope that it leads to constructive feedback. Thank you also for so clearly setting expectations appropriately at this time.
The desire for a C compiler for PicoBlaze has been present for many years and is the subject of one of my FAQs on this forum.
My personal experience has been that the majority of people making a request for a C compiler have had an application beyond the practical capability of KCPSM3 anyway so it wasn’t really the lack of a compiler that was the issue; they just hadn’t realised what KCPSM3 was and was not suitable for implementing. I appreciate that compilers are often used to program small PIC Microcontrollers but such compilers are typically tuned to the specific devices and the type of applications people use those small and often very limited devices for (e.g. the compiler may only support 8 and 16-bit integers and automate the ‘printing’ of values on a particular type of LCD display connected to specific pins on the device package).
So not surprisingly, I would be very interested to hear from the team at UTIA as to the types of practical application for an FPGA design that they have been able to use their compiler with KCPSM3 to implement. It would certainly help to further set the expectations of other readers.
Since September 2010 the KCPSM6 has been available for Spartan-6, Virtex-6 and 7-Series designs. Not only does this support programs up to 4K instructions and a scratch pad memory of up to 256 bytes, it also includes some new instructions directly intended to help in the support of multi-byte operations. More significantly, multi-byte functions can now be achieved with a logical and scalable coding style requiring fewer instructions. For example, to compare a 32-bit value using KCPSM6 you can use the following...
COMPARE s0, 12
COMPARECY s1, 34
COMPARECY s2, 56
COMPARECY s3, 78
JUMP Z, equal
This type of code is quite an improvement over the equivalent code required by KCPSM3 and I really hoped at the time that I was designing KCPSM6 that it would have appeal for developers of compilers as well as just making assembler code easier to develop and more efficient.
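To make the flag chaining concrete, here is a minimal C sketch of how the COMPARE/COMPARECY sequence above behaves. It assumes that COMPARECY subtracts the incoming carry and leaves Z set only if Z was already set and the new byte result is zero. The names flags_t, compare, comparecy and compare32 are made up for illustration and are not part of any PicoBlaze tool.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flags left behind by a COMPARE/COMPARECY chain. */
typedef struct {
    bool zero;   /* Z: every byte compared so far was equal */
    bool carry;  /* C: borrow so far, i.e. value < constant for the bytes seen */
} flags_t;

/* COMPARE sX, kk: flags from sX - kk, the result itself is discarded. */
flags_t compare(uint8_t sx, uint8_t kk)
{
    flags_t f;
    f.zero  = (sx == kk);
    f.carry = (sx < kk);
    return f;
}

/* COMPARECY sX, kk: flags from sX - kk - C (assumed semantics, see lead-in). */
flags_t comparecy(flags_t prev, uint8_t sx, uint8_t kk)
{
    int diff = (int)sx - (int)kk - (prev.carry ? 1 : 0);
    flags_t f;
    f.zero  = prev.zero && ((diff & 0xff) == 0);
    f.carry = (diff < 0);
    return f;
}

/* 32-bit compare over four byte registers, s[0] being the least significant. */
flags_t compare32(const uint8_t s[4], const uint8_t k[4])
{
    flags_t f = compare(s[0], k[0]);
    for (int i = 1; i < 4; i++)
        f = comparecy(f, s[i], k[i]);
    return f;
}
```

If the four registers hold exactly the four constants, compare32 returns with zero set, which corresponds to the case where JUMP Z, equal is taken. The final carry tells whether the 32-bit register value is below the constant.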
Finally, I was interested to read a little on the UTIA web site about a variant of KCPSM3 to make it more suitable for their compiler. So I think it is worth mentioning that I have heard from several PicoBlaze users over the years that have used PicoBlaze to implement the controlling ‘heart’ of their own ‘micro-processor’ with much greater processing capability in terms of functionality rather than performance. In effect, PicoBlaze was used to implement the Microcode that emulates features that PicoBlaze does not have and to control additional logic attached to it in order to expand its capability to access memory etc. In fact, one example of doing this was to efficiently emulate an 8051 microcontroller and anyone can take a look at this here...
The interesting dynamic in this case is that PicoBlaze has helped to convert some FPGA fabric into a structure that can then be programmed using one of many different compilers already available on the market.
Clearly there are many angles to this space and the ‘correct’ solution depends on the needs of the user and their application. Personally speaking, I am most interested in how well a compiler can be used with PicoBlaze (KCPSM3 and now KCPSM6 as supplied) directly so thanks again to UTIA for facilitating further investigations and almost certainly inspiring more projects.
Principal Engineer, Xilinx UK
02-17-2012 08:30 AM
Why do we employ KCPSM3 instead of KCPSM6?
1) We wanted to implement the same IP cores that use PicoBlaze on both the Virtex 4/5/6 and Spartan 6 devices.
2) More in-house experience with KCPSM3 and an evolutionary design process.
3) As we do all our development on Linux, we needed an assembler that runs on Linux. Until our compiler toolchain was completed, we used the open-source picoasm. I think there was no assembler for KCPSM6 that could be used on Linux.
For what do we use PicoBlaze and the compiler?
We design a programmable accelerator core called Application Specific Vector Processor (ASVP). The ASVP implements a vector-based floating-point processing pipeline. The data processing pipeline is 'programmed' by issuing VLIW-style vector instructions. The vector instructions are loaded one by one into a so-called 'vector-instruction forming buffer' by the PicoBlaze processor and then executed by the data pipeline (called Vector Processing Unit).
This may all sound complicated, but it is in fact very simple: there is a bunch of registers, connected to PicoBlaze I/O ports. PicoBlaze simply sets the registers to a new configuration (thus forming a vector instruction), waits until the data pipeline is ready, then fires the vector instruction. See this picture: http://dl.dropbox.com/u/3124266/vector-unit-v3.pdf
Hence the PicoBlaze code is in principle very simple, but repetitive. It is very tiresome and error-prone to write long sequences of LOADs and OUTPUTs in assembler, with some occasional loop or subroutine call when we need to implement an iterative algorithm.
The compiler simplifies the process greatly. For example, to set a 16-bit vector length in the next vector instruction the programmer in the past had to write two 8-bit constant loads and two outputs to I/O ports. This is simple, but painful. Now all the programmer has to do is call pb2dfu_set_cnt(400); in the C source to set the vector length to 400 elements.
But! There is one condition: the compiler HAS TO perform register allocation AND inlining optimizations to make this workable. Our compiler does both.
Example: The pb2dfu_set_cnt() function, which sets the vector length, is defined in C (not in assembler!) this way:
/** Set the (input) vector length. */
static inline void pb2dfu_set_cnt(uint16_t cnt)
{
    /* OUTPUT value, (port) */
    __output_port(cnt & 0xff, REG_DFUVECL_L);   /* low byte of the vector length */
    __output_port(cnt >> 8, REG_DFUVECL_H);     /* high byte of the vector length */
}
Note the convenient use of >> and & operators to extract the 8-bit parts of a 16-bit value. You think this will generate horrible code like some other compilers do? Hell, no :-)
It will generate the code below, directly in the calling function, thus there isn't even a subroutine overhead:
load s4, -112
output s4, 33 ; 33 = REG_DFUVECL_L
load s4, 1
output s4, 34 ; 34 = REG_DFUVECL_H
All the compile-time constants were folded. Moreover, the compiler optimizes functions as a whole, hence if the constant '1' is required in some further code it will be reused in the register, not re-loaded again. Thus the generated code is sometimes better than what a human assembly programmer would write, and at a much lower cost (= development time).
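The constants in the listing can be checked by hand: 400 is 0x0190, so the low byte is 0x90 and the high byte is 1. The assembler output shows the low byte as -112 because 0x90 read as a signed 8-bit number is 144 - 256. A small C sketch of that arithmetic follows; the helper names are illustrative only and do not come from the compiler.

```c
#include <stdint.h>

/* Low byte of a 16-bit value, the part written to REG_DFUVECL_L. */
uint8_t low_byte(uint16_t v)  { return (uint8_t)(v & 0xff); }

/* High byte of a 16-bit value, the part written to REG_DFUVECL_H. */
uint8_t high_byte(uint16_t v) { return (uint8_t)(v >> 8); }

/* The same 8-bit pattern read as a two's-complement signed number,
   which is how the listing prints the folded constant. */
int as_signed8(uint8_t b)     { return b >= 128 ? (int)b - 256 : (int)b; }
```

For cnt = 400, low_byte gives 0x90, as_signed8 of that gives -112, and high_byte gives 1, matching the two load instructions.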
What is missing in PicoBlaze to enable better compilers?
The COMPARECY instruction in KCPSM6 is a fine step, but what I've run into when implementing the compiler is the lack of efficient support for software stack spilling. Right now, you have to use 3 instructions (LOAD, ADD, STORE) to push/pop a single value from a software stack. Moreover, this 3-instruction 'stack spilling code' is not inert: it modifies FLAGS due to the ADD instruction. This turned out to be a great problem for the LLVM compiler framework, because the optimizing compiler sometimes wants to insert the spilling code in the middle of an ADD->ADDCY chain (or SUB->SUBCY), hence potentially corrupting processor state. All this is explained in detail in the documentation: http://sp.utia.cz/smecy/pblaze-cc-v2/Users_Guide/i
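The hazard can be reproduced in a few lines of C. The sketch below models only the carry flag and treats the spill as a single stack-pointer ADD placed between the two halves of a 16-bit addition. The stack pointer value and all function names are hypothetical, and the real spill sequence emitted by the compiler may differ.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { bool carry; } cpu_flags;

/* ADD sX, sY: 8-bit add, the carry flag is overwritten with the carry out. */
uint8_t op_add(cpu_flags *f, uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + b;
    f->carry = sum > 0xff;
    return (uint8_t)sum;
}

/* ADDCY sX, sY: 8-bit add that also adds the incoming carry flag. */
uint8_t op_addcy(cpu_flags *f, uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + b + (f->carry ? 1u : 0u);
    f->carry = sum > 0xff;
    return (uint8_t)sum;
}

/* 16-bit add as an ADD -> ADDCY chain. With spill_in_middle set, a
   stack-pointer adjustment (an ADD) is placed between the two halves. */
uint16_t add16(uint16_t x, uint16_t y, bool spill_in_middle)
{
    cpu_flags f = { false };
    uint8_t sp = 0x20;                      /* hypothetical stack pointer */
    uint8_t lo = op_add(&f, (uint8_t)x, (uint8_t)y);
    if (spill_in_middle)
        sp = op_add(&f, sp, 1);             /* overwrites the low-byte carry */
    (void)sp;
    uint8_t hi = op_addcy(&f, (uint8_t)(x >> 8), (uint8_t)(y >> 8));
    return (uint16_t)((hi << 8) | lo);
}
```

In this model, adding 0x01ff and 0x0001 gives 0x0200 without the spill and 0x0100 with it, because the carry from the low byte is lost before ADDCY can use it.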
Another issue (not directly related to the compiler) which bugged me for years is the inability in the KCPSM3/6 to stall an INPUT/OUTPUT operation from external hardware. Very often it would suffice to prolong an I/O transaction just by one or two cycles to make the world a more pleasant place. See here: http://sp.utia.cz/smecy/pblaze-cc-v2/Users_Guide/i
We solved the above mentioned issues by in fact using the open-source PacoBlaze http://bleyer.org/pacoblaze/ and prototyping the required features in it. But PacoBlaze is at least 2.5x bigger than KCPSM3 and it is not 'kosher'.
I would be more than happy if at least some of the suggested improvements made it into future PicoBlaze versions.
02-20-2012 05:39 AM
Thanks for all the discussions. You raise an interesting dynamic concerning your (as you describe it) simple application; being that in your application the higher level language and compiler makes writing code less prone to human errors. It is certainly the case that using the appropriate tools for a job is always a good idea and of course that statement isn’t limited to electronics! PicoBlaze is a ‘tool’ and is helpful within its capability so it is good to know when a compiler ‘tool’ enhances that use of its capability. Using either inappropriately will be the cause of trouble so it is important for users to know when they make the most sense and exploit them for that.
Regarding what you consider PicoBlaze requires in order to better support compilers then I think you may have missed an opportunity available to PicoBlaze simply because it lives within programmable hardware. You recognised that the programmability of the hardware allowed you to modify an independent version of PicoBlaze in order to slightly adjust the instruction set and functionality (and thank you for noticing the impact that version had on area and cost). However, you didn’t appear to consider the option to add some compiler friendly peripherals to the existing processor instead.
If you really need a stack (of virtually any size) then this can easily be implemented separately in the Xilinx fabric using an up/down counter to act as an address pointer to a memory (using distributed RAM or one or more BRAMs). Connecting this structure to a pair of input and output ports on PicoBlaze, it can be arranged to automatically ‘push’ (store data and increment pointer) during an OUTPUT instruction and automatically ‘pop’ (decrement pointer and read data) during an INPUT instruction. Not only does this enable the stack to be implemented efficiently (area and performance) whatever the size required, but the INPUT and OUTPUT instructions used to ‘push’ and ‘pop’ have no effect on the flags which was one of your main issues. The nice thing about having programmable hardware is that you can easily customise peripherals to fit with an application. One really simple example that comes to mind is a fairly common requirement to manipulate the bit order within a byte or word (e.g. reverse bit order or nibble swapping) which is so easily achieved by writing the data to an output port and reading it back in via an input port and allowing the programmable interconnect to do all the hard work for you (i.e. connect the bits to the ports as required for your function).
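As a rough behavioural model of the stack peripheral described above, the following C sketch stores on an OUTPUT and increments the pointer, then decrements the pointer and reads on an INPUT. The depth, the type name stack_periph and the function names are assumptions for illustration. A real implementation would be HDL, and overflow or underflow handling is left out.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_DEPTH 1024u   /* hypothetical depth of the backing memory */

typedef struct {
    uint8_t  mem[STACK_DEPTH];
    unsigned ptr;            /* up/down counter acting as address pointer */
} stack_periph;

/* Effect of an OUTPUT to the stack port: store data, then increment pointer. */
void stack_output(stack_periph *s, uint8_t data)
{
    assert(s->ptr < STACK_DEPTH);   /* overflow is not handled in this sketch */
    s->mem[s->ptr] = data;
    s->ptr++;
}

/* Effect of an INPUT from the stack port: decrement pointer, then read data. */
uint8_t stack_input(stack_periph *s)
{
    assert(s->ptr > 0);             /* underflow is not handled in this sketch */
    s->ptr--;
    return s->mem[s->ptr];
}
```

Two calls to stack_output followed by two calls to stack_input return the values in reverse order, and no arithmetic instruction in the processor is involved, so the flags stay untouched.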
Your point about ‘stalling’ an input or output operation can be serviced with KCPSM6, albeit not in exactly the same way that you have literally stretched the instruction in your modified version. The new feature is the ‘sleep’ control which for all kinds of reasons can be used to make KCPSM6 ‘freeze’ for any length of time. As described previously, the art is to exploit the hardware environment to define what will drive that ‘sleep’ control appropriately for the application. I recently saw a nice example from a user that had implemented an AXI bus master using KCPSM6 and had used the sleep control to make KCPSM6 wait for the AXI bus to respond following an OUTPUT instruction, so very similar to your example. I would point out that it was a very deliberate feature of the KCPSM6 sleep mode to ensure that it fully completed an instruction before entering sleep mode. In the specific case of an OUTPUT instruction it means that the ‘write_strobe’ is still only active for one clock cycle. If KCPSM6 were to freeze with the strobe in the High state then this could cause the same value to be written many times to a FIFO or something similarly undesirable. Having covered this dangerous case it would be simple to add additional logic to service your case.
Finally, you should find that KCPSM6 is a superset of KCPSM3 so even if you ignore the additional features and instructions it will almost certainly run the code that your compiler generates today (be aware that different op-codes are required even though the instructions are the same). Multiple users have also reported back that the assembler supplied with KCPSM6 works fine using ‘Wine’ within a 64-bit Linux environment. Hopefully the syntax enhancements in the KCPSM6 assembler would also make the compiler development task easier too.
Principal Engineer, Xilinx UK
02-21-2012 01:59 AM
While I generally agree with you about the possibility to implement a function in reconfigurable hardware as a peripheral instead of computing it in software (e.g. the bit manipulation you mentioned), the software stack in C is a little bit special. In the C language the stack is not really a stack as we know it for example in hardware. It is in fact a memory region that is accessed randomly, i.e. it is not true that only the top element is accessed. Instead, the current top element is identified by the stack pointer register (sF in our compiler). The crucial point is that the compiler needs to access any byte on the stack randomly at any time, typically using constant offsets added to the stack pointer. More on this can be found in Stack Computers, Chapter 7.
Hence external hardware implementing a C stack would be more complicated than you have described: it would consist of a scratch-pad memory, an adder, a stack-pointer register, and decoders. It would duplicate much of the functionality that is already available in the processor. Moreover, this solution would make our compiler completely unusable on a pure PicoBlaze. Thus we decided to go with a solution that is a little bit suboptimal for the original KCPSM3, but superb for a modified PacoBlaze (called PB3A in the documentation):
We noticed that FETCH/STORE instructions do not use the lowest 4 bits of their op-codes. Thus in PB3A we encode a 4-bit constant offset into the unused bits. The new field is called 'ff'. The semantics of FETCH/STORE is modified in a backward-compatible manner (because ff will be zero in old programs):
Syntax: FETCH sX, sY, ff
Semantics: sX := SCRATCHPAD[sY + ff]
Rationale: This modification allows most stack references to be performed in a single instruction.
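A small C model may help to show what the extended instruction does. It implements sX := SCRATCHPAD[sY + ff] as stated above and assumes that STORE mirrors it. The 64-byte scratchpad is the KCPSM3 size; the wrap-around of the address and all names in the code are assumptions of this sketch, not taken from the PB3A documentation.

```c
#include <assert.h>
#include <stdint.h>

#define SCRATCHPAD_SIZE 64u   /* KCPSM3 scratchpad size in bytes */

typedef struct {
    uint8_t reg[16];                      /* s0..sF */
    uint8_t scratchpad[SCRATCHPAD_SIZE];
} pb3a_state;

/* Effective address sY + ff; wrapping to the scratchpad size is assumed. */
unsigned pb3a_address(const pb3a_state *p, unsigned sy, unsigned ff)
{
    assert(sy < 16 && ff < 16);           /* ff is a 4-bit field */
    return (p->reg[sy] + ff) % SCRATCHPAD_SIZE;
}

/* FETCH sX, sY, ff: sX := SCRATCHPAD[sY + ff] */
void pb3a_fetch(pb3a_state *p, unsigned sx, unsigned sy, unsigned ff)
{
    assert(sx < 16);
    p->reg[sx] = p->scratchpad[pb3a_address(p, sy, ff)];
}

/* STORE sX, sY, ff: SCRATCHPAD[sY + ff] := sX (assumed to mirror FETCH) */
void pb3a_store(pb3a_state *p, unsigned sx, unsigned sy, unsigned ff)
{
    assert(sx < 16);
    p->scratchpad[pb3a_address(p, sy, ff)] = p->reg[sx];
}
```

With sF used as the stack pointer, a local variable at offset 3 is read by a single pb3a_fetch(&p, 1, 0xF, 3), and ff = 0 reproduces the original FETCH sX, (sY) behaviour.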
Regarding the sleep mode in KCPSM6: If I understand it correctly, it cannot be used to stall the processor in the middle of an INPUT/OUTPUT instruction. I am looking at page 79 of the KCPSM6 User Guide pdf. The uart_tx FIFO output signal buffer_half_full is used to put the processor in sleep; thus at most half of the FIFO can be used. However, with the port_busy signal in PB3A you can use the full capacity of the FIFO: just connect a FIFO buffer_full signal to PB3A port_busy and route the FIFO write_buffer signal through an AND-gate to disable a FIFO write when it is full.
More importantly the port_busy signal in PB3A can be used to make the processor wait for an input. For example when a FIFO is empty while the processor tries to read (dequeue) an element from it, the processor can be stalled in the middle of the INPUT until there is an element in the FIFO that can be read.
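The effect of port_busy on a full FIFO can be illustrated with a very simplified cycle model in C. It assumes the processor attempts one OUTPUT per clock cycle (real KCPSM3 instructions take longer), that the FIFO depth is 16, and that the consumer removes one element every drain_period cycles. All names and numbers are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define FIFO_CAPACITY 16   /* hypothetical FIFO depth */

typedef struct {
    int cycles;      /* clock cycles until all bytes were written */
    int stalled;     /* cycles the processor was held by port_busy */
    int peak_fill;   /* highest FIFO occupancy observed */
} stall_result;

stall_result simulate_output_stall(int n_bytes, int drain_period)
{
    assert(drain_period > 0);
    stall_result r = { 0, 0, 0 };
    int fill = 0, written = 0;
    while (written < n_bytes) {
        r.cycles++;
        bool port_busy = (fill == FIFO_CAPACITY);   /* buffer_full drives port_busy */
        bool drain = (r.cycles % drain_period == 0) && fill > 0;
        if (!port_busy) {            /* write strobe passes the AND gate */
            fill++;
            written++;
        } else {
            r.stalled++;             /* OUTPUT is stretched by one more cycle */
        }
        if (fill > r.peak_fill)
            r.peak_fill = fill;
        if (drain)
            fill--;                  /* consumer removes one element */
    }
    return r;
}
```

In this model the FIFO fills to its full capacity before the first stall occurs, which is the point made above about using the whole FIFO rather than half of it.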
02-21-2012 03:53 AM
Just to make it clear (mainly for readers of this thread), I agree with your observations and understand your desires for a future version of PicoBlaze. I am listening and this kind of discussion does really have the potential to influence what makes it into a future version. However, engineering is nearly always a trade off, which is why I’m always so careful about what gets included and what has to be left out. To set expectations correctly, KCPSM6 is the architecture for Spartan-6, Virtex-6 and 7-Series devices so I don’t anticipate starting to implement the next variant for some time.
My primary objective has always been to make PicoBlaze so small that it can be included (often many times) in a design almost without consideration. As you have seen, even small changes to the architecture and the use of standard RTL synthesis can make it more than twice the size so we are not necessarily talking about increasing size by a few percent when we make a small modification.
The KCPSM6 version of PicoBlaze is 26 Slices and many users are discovering that even ‘inefficient’ use of KCPSM6 can often be more efficient than a pure hardware state machine or similar structure defined in HDL. It makes me think that even if your compiler is forced into producing what you would consider to be sub-optimum output based on the ‘limitations’ of the processor architecture today that it could still be ‘good enough’ for multiple applications.
My secondary objective has been to ensure that PicoBlaze is able to operate at a clock frequency greater than most people’s RTL code will achieve when synthesized. This is partly born out of a paranoia of having to support thousands of users and what would happen if they all had issues with their design performance because PicoBlaze was too slow. Happily, I appear to have succeeded. More importantly, from a user perspective this non-issue over performance is a key factor in what makes PicoBlaze easy to use (i.e. anything that just works is naturally easy to use). Whilst PicoBlaze is rarely used to implement anything truly time critical, it nearly always has to sit on the same clock as the ‘system’. This then makes synchronous input and output easy and reliable to implement as well. In practice, it means that KCPSM6 is happy at just over 100MHz in the slowest Spartan-6 device and up to 240MHz in a faster Virtex-6 device.
Including an offset to a pointer in your architecture is a nice feature but it also implies the inclusion of logic in the path to the memory address inputs. Whilst an adder is physically quite small, the corresponding increase in propagation delay would result in a lower clock frequency performance.
You have correctly interpreted the ‘sleep’ control on KCPSM6 but what you probably overlooked whilst focussing purely on the functionality is that the sleep control has no impact on the clock frequency performance of the processor in the system as a whole. Your ‘port_busy’ variant for KCPSM3 would appear to suggest a single clock cycle path through system logic which must then disable the majority of the control logic within the processor. This almost certainly lowers clock rate performance quite a bit. As soon as the processor in the system is unable to meet the frequency of the clock in a user design then it immediately impacts ‘ease of use’. By the way, your ‘port_busy’ probably impacted the size of the macro in more ways than you might imagine as well.
As I said earlier, it’s all an engineering trade off and it really depends on what parameters are important for an individual application. We are simply proving that one processor cannot be a perfect fit for everything, which is why there are so many processors available in the market. Not forgetting that Xilinx also have MicroBlaze and now the Zynq device with compilers.
My final thought for today.... KCPSM6 has two banks of registers. It makes me think that this could be a way for your compiler to kind of separate the ‘management’ tasks from the computational tasks. One bank of 16 registers could be dedicated to local variable storage and for the implementation of calculations. The other bank could then be dedicated to flow control and management of your ‘stack’. Given the increased capacity for local variable storage it should be possible to fetch all data required from the stack and then perform the whole operation in one go so that the flags remain valid.
Principal Engineer, Xilinx UK
02-22-2012 10:30 AM
This thread shows two different interesting views:
- a conservative hardware view: PicoBlaze has to stay small and fast and, though this was not mentioned, also easy to use. As a teacher I completely agree with this view. I learned embedded processors with PicoBlaze, I now teach PicoBlaze, and I am very happy with it.
- a software developer view: it would be easier to program PicoBlaze in C, so it is important for PicoBlaze to evolve. I also agree with this view.
Finally, agreeing with both views perhaps means that it's time to leave PicoBlaze where it is and to develop a new architecture for a C compiler. The answer could be MicroBlaze, but the gap between PicoBlaze and MicroBlaze is too large... and I see a place between the 8-bit and 32-bit architectures, don't you?
02-24-2012 04:51 AM
I completely agree that there is an unoccupied space between the 8-bit PicoBlaze and the 32-bit MicroBlaze, probably for a 16-bit architecture. I can even supply a practical example as a case study :-)
In a past project called Apple-Core, long before the compiler, in one subsystem (Hardware Families of Threads) I wanted to use PicoBlaze as a control embedded processor. But soon after I started writing firmware I realized that manipulating 32-bit quantities in KCPSM3 was going to be a huge pain. For about a week I was trying to devise a peripheral to accelerate the functions, and it was going to be a relatively complex FSM with built-in memory structures...
Finally I realized that all I need is only a slightly more powerful processor than KCPSM3. So I took the open-source PacoBlaze and widened all its data paths to 16-bits (i.e. registers, scratchpad, I/O data, I/O addresses).
However, the interesting thing is that the standard 18-bit instruction encoding was kept, hence the existing assembler/disassembler toolchain (picoasm) could be used. In the PB instruction encoding the only issue is the 8-bit constant field (kk). I decided that all kk fields in the original PB instructions would always be zero-extended to 16 bits before being loaded into a register or used in arithmetic. Then I had to implement only two new instructions: LOADSE and LOADHI. The 'LOADSE sX, kk' instruction was used to load a sign-extended 8-bit value kk into the 16-bit sX register. The 'LOADHI sX, kk' instruction was used to set the higher 8 bits of sX to kk (the lower 8 bits of sX are kept). These two instructions are used to load 16-bit quantities into registers (e.g. LOAD s1, 0x34; LOADHI s1, 0x12 => s1 = 0x1234).
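The three load variants described above can be written down as a short C sketch of the register update each one performs. The function names are made up for this illustration and the code only models the 16-bit register value, nothing else of the processor.

```c
#include <stdint.h>

/* LOAD sX, kk: kk is zero-extended to 16 bits. */
uint16_t op_load(uint8_t kk)
{
    return (uint16_t)kk;
}

/* LOADSE sX, kk: kk is sign-extended to 16 bits. */
uint16_t op_loadse(uint8_t kk)
{
    return (kk & 0x80) ? (uint16_t)(0xff00u | kk) : (uint16_t)kk;
}

/* LOADHI sX, kk: upper 8 bits are replaced by kk, lower 8 bits are kept. */
uint16_t op_loadhi(uint16_t sx, uint8_t kk)
{
    return (uint16_t)((sx & 0x00ffu) | ((unsigned)kk << 8));
}
```

The example from the text becomes op_loadhi(op_load(0x34), 0x12), which yields 0x1234, while op_loadse(0xff) yields 0xffff, i.e. -1 as a 16-bit value.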
02-27-2012 01:04 AM - edited 02-27-2012 01:06 AM
Let me jump into this discussion a little from the MicroBlaze point of view.
If you configure MicroBlaze to its minimum configuration it's about 500 LUTs.
PicoBlaze is around 100 LUTs or so, guess Ken will correct me on that.
A 16-bit processor that is C-compiler friendly would be around 200-250 LUTs (a quick estimate)
So that gap between MicroBlaze and a 16-bit processor is around 250-300 LUTs.
That might seem a lot to some, but you have to remember that our FPGA devices contain more and more LUTs.
Looking at the low-cost Spartan series, pick the 2nd smallest device (the smallest and the largest are usually extreme cases).
Spartan3A_400A: 7168 (LUT4) => ~5000 LUT6 => 250 LUT is 5%
Spartan6: LX9 has 5720 (LUT6) => 250 LUT is 4%
Artix7: A200T has 134600 (LUT6) => 250 LUT is 0.2% (Picking the smallest A100T will give 0.4%)
So a difference of 250 LUTs will become less and less important as we get newer FPGA families, and this will just continue as we move to newer process technologies.
For me the space between PicoBlaze and MicroBlaze is getting less important to fill with every new FPGA family.
If I look at what is available with standard microcontrollers, it's either 8-bit or 32-bit; there are not many 16-bit processors left now.
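The percentages quoted above follow from a one-line calculation, shown here as a C helper for anyone who wants to plug in another device. The function name is illustrative; the LUT counts are the ones given in the list.

```c
/* Share of a device taken by extra_luts, in percent. */
double lut_share_percent(double extra_luts, double device_luts)
{
    return 100.0 * extra_luts / device_luts;
}
```

For 250 LUTs this gives 5% of roughly 5000 LUT6 equivalents, about 4.4% of the 5720 LUTs in the LX9, and about 0.19% of the 134600 LUTs in the A200T.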
If you need the absolute smallest processor, pick PicoBlaze.
If you want a high-level programming environment, I think MicroBlaze is the right answer.
You can check the latest MicroBlaze MCS core that was added in ISE 13.4.
It's a small microcontroller system with MicroBlaze, timers, gpio, interrupts and memory.
The delivery is through CoreGen and is included in Webpack so it's free to use.