08-05-2009 04:02 PM
I am testing my own board, which has a Spartan XC3S700A with 16-bit-wide SDR SDRAM running at 50 MHz, using EDK 11.2 and the MPMC RAM controller. I just compared its execution speed to my Xilinx 500E starter kit with DDR. There is a big difference in speed, and I think something is wrong.
On startup it copies a program from a serial flash to the RAM, then executes from RAM. The program is a small C integer benchmark that does some calculations on an 8K array of bytes. Here is the time required to run it:
Starter kit, no cache = 1.25 seconds
My board, no cache = 3.1 seconds
Starter kit, 2K/2K cache = 76 milliseconds
My board, 2K/2K cache = 1.5 seconds
When not cached, RAM is on the PLB 4.6 bus. When cached it is on the XCL bus. RAM passes 8/16/32 bit memory test OK.
I would expect about a 2x performance difference between the DDR on the starter kit and the SDR on my board, which is about what I get with no cache. But the cache helps a lot on the starter kit and hardly at all on my board!
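To put numbers on that comparison, here is a minimal sketch that computes the ratios from the four times listed above. The helper name `speedup` is only for illustration; the inputs are the measured times in milliseconds.

```c
/* Ratio of two run times: how many times faster `after_ms` is than `before_ms`. */
double speedup(double before_ms, double after_ms)
{
    return before_ms / after_ms;
}

/*
 * Using the times above (in milliseconds):
 *   speedup(1250.0,   76.0)  starter kit, gain from cache   (about 16.4x)
 *   speedup(3100.0, 1500.0)  my board, gain from cache      (about 2.07x)
 *   speedup(3100.0, 1250.0)  board vs. kit, no cache        (2.48x)
 *   speedup(1500.0,   76.0)  board vs. kit, 2K/2K cache     (about 19.7x)
 */
```

So the cache buys roughly 16x on the starter kit but only about 2x on my board, and the gap between the two boards grows from about 2.5x uncached to almost 20x cached.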
I am attaching my MHS file (cache setup). Does anyone have any ideas?
08-06-2009 02:13 PM
Does the Xilinx 500E starter kit by some chance use DMA?
Does it have the multi-port memory controller (MPMC) rather than the PLB bus SDRAM controller?
The PLB bus was, and still is, a known bottleneck, which is why the multi-port memory controller was introduced.
10-11-2009 07:58 PM
Did you ever get your memory cache going faster?
Comparing this MHS file to the later one you gave in response to my thread, I see you zero'd out the pipelines. Did that make the difference?
Is that also why you run slower, at 50 MHz? The data sheet says that removing the pipelines speeds things up, unless you run the memory clock so fast that your data misses the clock.
10-12-2009 10:32 AM
I ran a small integer benchmark (using a Spartan 3A-700 with SDR SDRAM and MicroBlaze at 50 MHz), also speed-tested parts of my actual app, and compared a few cache sizes: 0K, 1K, 2K and 4K. Without caching, the MicroBlaze is about as fast as my 10-year-old dog. With caching it is quite fast, and 2K/2K seemed like a good tradeoff, since I need some BRAM for other things. With 2K/2K cache, it runs almost as fast as my 3A Starter Kit with DDR2. My app is not only executing out of RAM, it is also reading and writing a fair amount of data to and from RAM as it runs, so RAM throughput makes a big difference.
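For anyone who wants to try a comparable measurement, a small integer benchmark of this kind might look like the sketch below: a sieve of Eratosthenes over an 8K byte array. This is not the exact code that was timed; the function name `sieve_count` and the one-flag-per-integer layout are assumptions for illustration.

```c
#include <stdint.h>
#include <string.h>

#define SIEVE_SIZE 8192u              /* assumed: 8K bytes, one flag per integer */

static uint8_t sieve_flags[SIEVE_SIZE];

/* Count the primes below `limit` (clamped to SIEVE_SIZE) with a byte-array sieve. */
unsigned sieve_count(unsigned limit)
{
    unsigned i, j, count = 0;

    if (limit > SIEVE_SIZE)
        limit = SIEVE_SIZE;

    memset(sieve_flags, 1, limit);    /* start with every number marked "maybe prime" */

    for (i = 2; i < limit; i++) {
        if (!sieve_flags[i])
            continue;                 /* already crossed out as a multiple */
        count++;
        for (j = i * i; j < limit; j += i)
            sieve_flags[j] = 0;       /* cross out multiples of i */
    }
    return count;
}
```

Calling `sieve_count(8192)` in a loop between two I/O pin toggles or timer reads gives a repeatable workload that mixes instruction fetches with byte reads and writes to RAM, so it exercises both the instruction and data caches.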
The breadboard of my design has been running well for about a month now, and has been very solid.
Disabling pipelines did improve speed. I looked at my notes and it was about 6%, so not a huge change. As I remember, the reported fmax of the MPMC went down a little with this, but I can't find my notes on it. It is currently about 113 MHz, so I am running well below fmax. XPS reports an fmax of 82 MHz for the MicroBlaze. I ran for a while with a 75 MHz oscillator and it ran fine, but this needs to be high reliability, so I am being conservative.
By the way, another thing that improved speed was not optimizing the MicroBlaze for area ... set by PARAMETER C_AREA_OPTIMIZED = 0 in MHS file. There is a checkbox for this in the XPS GUI.
10-12-2009 10:59 AM
In your original post, your board was roughly 20X slower than the starter board (1.5 seconds versus 76 milliseconds), when both already had a 2K/2K cache.
Now you're about as fast. How did you get this major speed up? You already had a cache, so what was the trick?
10-12-2009 01:44 PM
SDRAM is much slower than DDR2, and cache is a must if you are using slower memory. I have a custom board using a Spartan 3AN-700 with SDRAM. While building my project, I forgot to enable the I-cache and D-cache, and everything slowed to a crawl!
10-13-2009 09:48 AM
I tried to figure out what change made such a big difference in speed, so I tried changing my MHS file back to about what I had in my original post, where it was so slow. Basically I made 3 changes to my current MHS file:
- Commented out all lines in the mpmc section that disabled pipelines and FIFOs
- Changed the line in the microblaze section C_AREA_OPTIMIZED from 0 to 1
- Removed the line C_SPLB0_NATIVE_DWIDTH = 32
Rebuilt, and ran my sieve integer benchmark, and it slowed down by 14%. So that is what those changes were giving me. But it does not explain why the setup in my original post was so slow.
Possible explanations I can think of .... (1) my caching was not working right for some unknown reason, or (2) I remember that part of my benchmark called a function to toggle an I/O line for measurement ... I later made this in-line. Maybe calling a function far away in memory space made the cache less effective.
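To illustrate explanation (2), here is a minimal sketch of how a far-away function could hurt a small direct-mapped cache. It assumes a 2 KB cache with 16-byte lines (my actual line length may differ), and the addresses are made up. If the loop body and the called function sit a multiple of 2 KB apart, they map to the same cache line and evict each other on every iteration.

```c
#include <stdint.h>
#include <string.h>

#define CACHE_BYTES 2048u                      /* assumed 2K cache */
#define LINE_BYTES  16u                        /* assumed 4-word cache line */
#define NUM_LINES   (CACHE_BYTES / LINE_BYTES) /* 128 lines */

typedef struct {
    uint32_t tag[NUM_LINES];
    uint8_t  valid[NUM_LINES];
} dm_cache;

/* Simulate one access to a direct-mapped cache; returns 1 on a miss, 0 on a hit. */
static int dm_access(dm_cache *c, uint32_t addr)
{
    uint32_t line = addr / LINE_BYTES;
    uint32_t idx  = line % NUM_LINES;          /* which cache line the address maps to */
    uint32_t tag  = line / NUM_LINES;          /* identifies which 2K block it came from */

    if (c->valid[idx] && c->tag[idx] == tag)
        return 0;
    c->valid[idx] = 1;                         /* miss: fill the line, evicting the old one */
    c->tag[idx]   = tag;
    return 1;
}

/* Misses when a loop at loop_addr calls a function at func_addr on every iteration. */
unsigned count_misses(uint32_t loop_addr, uint32_t func_addr, unsigned iterations)
{
    dm_cache c;
    unsigned i, misses = 0;

    memset(&c, 0, sizeof c);
    for (i = 0; i < iterations; i++) {
        misses += (unsigned)dm_access(&c, loop_addr);
        misses += (unsigned)dm_access(&c, func_addr);
    }
    return misses;
}
```

In this model, `count_misses(0x1000, 0x1040, 1000)` misses only on the first pass, while `count_misses(0x1000, 0x1800, 1000)` misses on every access because the two addresses are exactly 2 KB apart. Whether my original benchmark actually hit this case I cannot say; inlining the function would simply remove the possibility.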
One thing from my notes, at the point where it started running fast, the FPGA current I was measuring on the 1.2 volt line on my breadboard went from 62mA to 130mA. This makes me think something was not right with the cache at the time of my original post.