Regular Contributor
jliu83
Posts: 68
Registered: ‎06-26-2008
0

Spartan 6 MCB Performance

A follow up to this discussion:

http://forums.xilinx.com/t5/Spartan-Family-FPGAs/Spartan6-gt-MCB-Performance/td-p/172526

 

I'd like to see if anyone out there has some measurable data on this.

 

For example, take a single 16-bit-wide DDR2 chip and a 32-bit MCB write port.  If I fill the FIFO of the MCB write port and send a command to the MCB with a user burst length of 64 (the max length), do I automatically lose half of my bandwidth on that transaction based solely on my configured port width?  It sounds like a terrible design if this is true!  I would think that since the transaction covers a continuous stretch of addresses, the MCB controller would optimize it so that the DRAM bursts would not waste bandwidth.

 

Did I misunderstand something, or was the optimization implied in the response?

 

I am having trouble achieving write bandwidth (read is perfectly fine).  Has anyone seen an improvement in performance by increasing the port clocks (i.e. p0_wr_clk, p1_rd_clk, etc.)?

 

What I have found out from my personal testing is that you must pre-emptively load/empty the fifos of the MCB before the empty flag goes high if you want to do full 64 word bursts.  You can do this with the p_wr_count or p_rd_count, but they have a certain latency attached to them (as explained by the MCB user guide).  If you wait for the empty flag to go high, and then start loading your 64 bit word, then you just wasted many valuable clock cycles.

 

For reads, after the command has been sent and you have been asserting p_rd_en for about 30 p_rd_clk cycles, you can issue another command for another burst of 64.  By the time the fetch from the DDR has completed, the read buffer should be nearly empty, and the MCB will start fetching the data for the second command without wasting many cycles.

 

For writes, I have found that you need to wait a bit longer, maybe 40 or so p_wr_en cycles to avoid errors.

 

In the above case, the DDR2 is running at 600 MHz, and p_wr_clk and p_rd_clk are both about 100 MHz.

 

Anyone else have experience with the MCB?

 

Thanks,

-J

Expert Contributor
eteam00
Posts: 7,505
Registered: ‎07-21-2009
0

Re: Spartan 6 MCB Performance

I'd like to see if anyone out there has some measurable data on this.

 

Spartan-6 MCB performance data would be very interesting (and valuable information) to many designers.

 

For example, take a single 16-bit-wide DDR2 chip and a 32-bit MCB write port.  If I fill the FIFO of the MCB write port and send a command to the MCB with a user burst length of 64 (the max length), do I automatically lose half of my bandwidth on that transaction based solely on my configured port width?

 

I do not see how this must be true, if the user port clock frequency matches the DDR2 memory clock frequency (e.g. both memory and user fabric logic clocks are 200MHz).  In order for 'peak' bandwidth of memory and user port to match each other (assuming a user port data width which is double the data width of the memory), the two clock frequencies (for user port and for memory) must match.

 

What I have found out from my personal testing is that you must pre-emptively load/empty the fifos of the MCB before the empty flag goes high if you want to do full 64 word bursts.

 

This is an arcane subject which demands precision in the use of terminology.  At the risk of sounding tedious, please be more specific about "want to do full 64 word bursts".

 

You can do this with the p_wr_count or p_rd_count, but they have a certain latency attached to them (as explained by the MCB user guide).  If you wait for the empty flag to go high, and then start loading your 64 bit word, then you just wasted many valuable clock cycles.

 

Any (fabric port) clock cycles where the user port read/write FIFOs are empty represent 'lost' transfer bandwidth cycles.  This is the nature of FIFOs.  The full condition represents blocking state for the fill side of the FIFO and the empty condition represents a blocking state for the drain side of the FIFO.  Because the MCB FIFOs are designed for asynchronous input/output ports, there will be several clock cycles of latency in the FIFO empty/full flags.  This means that by the time the FIFO empty flag is asserted, the FIFO has been empty for more than a single clock cycle.

 

This presents an inherent conflict if the transfer size exactly matches the capacity of the FIFO.  A write operation to the memory cannot be started until the write data FIFO is full, and the fabric logic cannot fill the FIFO with the next set of data until the FIFO is empty.  A read operation cannot be started until the read data FIFO is empty (to avoid the risk of FIFO overflow).

 

For sequential accesses, there should be advantages (rather than disadvantages) in using the smallest data transfer size which matches burst length of the memory device.  In the case of DDR2 memory and Spartan-6 MCB, this is always 4.  With the small memory burst length, write transactions can be issued without the write data FIFO being full and read transactions can be issued without the read data FIFO being empty.  With DDR2 or DDR3 memory, sequences of consecutive read (or consecutive write) bursts can overlap without 'dead' cycles.

 

Note:  The distinction between memory burst length (between MCB and memory device) and user port burst length (between user port and MCB, numbering the user data words to be transferred) must be understood.

 

For writes, I have found that you need to wait a bit longer, maybe 40 or so p_wr_en cycles to avoid errors.

In the above case, the DDR2 is running at 600 MHz, and p_wr_clk and p_rd_clk are both about 100 MHz.

 

So the peak user port data transfer rate is about 400MB/sec (with 32-bit data port), and the peak memory transfer rate is 1200MB/sec (with 16-bit memory), in your description.
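
The two peak figures above can be reproduced with a short calculation. This is only a sketch: it reads the quoted "600 MHz" as 600 million transfers per second on the 16-bit memory bus (which is what yields 1200MB/sec) and assumes one 32-bit word per 100 MHz user clock. The function name is made up for illustration.

```python
def peak_bandwidth_mb_s(width_bits: int, mega_transfers_per_s: float) -> float:
    """Peak bandwidth in MB/s for a bus of the given width and transfer rate."""
    return width_bits / 8 * mega_transfers_per_s

# Assumption: one 32-bit word per user clock at 100 MHz.
user_port = peak_bandwidth_mb_s(32, 100)
# Assumption: 16-bit DDR2 bus at 600 million transfers per second.
memory = peak_bandwidth_mb_s(16, 600)

print(user_port, memory, memory / user_port)
```

With these inputs a single user port can supply at most one third of what the memory interface can absorb.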

 

-- Bob Elkind

SIGNATURE:
README for newbies is here: http://forums.xilinx.com/t5/New-Users-Forum/README-first-Help-for-new-users/td-p/219369

Summary:
1. Read the manual or user guide. Have you read the manual? Can you find the manual?
2. Search the forums (and search the web) for similar topics.
3. Do not post the same question on multiple forums.
4. Do not post a new topic or question on someone else's thread, start a new thread!
5. Students: Copying code is not the same as learning to design.
6 "It does not work" is not a question which can be answered. Provide useful details (with webpage, datasheet links, please).
7. You are not charged extra fees for comments in your code.
8. I am not paid for forum posts. If I write a good post, then I have been good for nothing.
Regular Contributor
jliu83
Posts: 68
Registered: ‎06-26-2008
0

Re: Spartan 6 MCB Performance

Bob brings up an excellent point in the discussion: that perhaps smaller user burst lengths might yield better results.  Of course, if you have a state machine running the user interface commands, you might lose a few cycles due to the state transitions required to send the extra commands.

 

I never really thought about it that way.


eteam00 wrote:

 

For sequential accesses, there should be advantages (rather than disadvantages) in using the smallest data transfer size which matches burst length of the memory device.  In the case of DDR2 memory and Spartan-6 MCB, this is always 4.  With the small memory burst length, write transactions can be issued without the write data FIFO being full and read transactions can be issued without the read data FIFO being empty.  With DDR2 or DDR3 memory, sequences of consecutive read (or consecutive write) bursts can overlap without 'dead' cycles.

 


With respect to the above comment, I'd like some clarification, assuming sequential access.  For a DDR memory burst of 4, 16 bits wide, this would result in an access of 64 bits in one memory burst.  On a 32-bit user port, I would access the memory using user port bursts of 2 for the best performance.

 

If this is the case, there is definitely some additional overhead incurred due to the increased number of times that the read/write command is issued from the user port.  For example, if I'm writing from a line buffer to memory with 1280 words, and I use user port bursts of 32, I lose 40 user clock cycles during which the commands are sent.  However, I may lose more cycles waiting for my FIFO to empty.  On the other hand, if I use bursts of 2, then I lose 640 cycles just to send commands, UNLESS I can keep p_wr_en high, and keep feeding data, during the cycle in which the command is issued.  The examples in the MCB user guide always have p_rd_en/p_wr_en low when the command is issued, so I don't know if this is possible.  However, in the former method, according to Bob, I will not need to check for p_wr_full.

 

Does anyone have experience with this?  Can anyone confirm that you can issue a command while feeding data into or out of the port?
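
For reference, the command counts quoted above (40 and 640) follow from a simple division. The sketch below assumes, as the post does, that each command costs one user clock cycle when data cannot be transferred in the same cycle; the helper name is hypothetical.

```python
def command_count(words: int, burst_len: int) -> int:
    """Commands needed to move `words` user words in bursts of `burst_len` (rounded up)."""
    return -(-words // burst_len)

LINE_WORDS = 1280  # one video line, one 32-bit word per pixel

for burst_len in (2, 32, 64):
    print(burst_len, command_count(LINE_WORDS, burst_len))
```

If commands can be issued in the same cycle as data, this per-command cost disappears and only the MCB's internal handling of each command remains.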

 

Thanks,

-J

Expert Contributor
eteam00
Posts: 7,505
Registered: ‎07-21-2009
0

Re: Spartan 6 MCB Performance

For example, if I'm writing from a line buffer to memory with 1280 words, and I use user port bursts of 32, I lose 40 user clock cycles during which the commands are sent.

 

Commands can be written to the command FIFO concurrently with writes to the write data FIFO.

Commands can be written to the command FIFO concurrently with reads from the read data FIFO.

 

-- Bob Elkind

Regular Contributor
jliu83
Posts: 68
Registered: ‎06-26-2008

Re: Spartan 6 MCB Performance

So I finished some testing of my own on the MCB.  I'd like to share my results on the forum.

 

My test consists of a custom implementation of a frame buffer (written from scratch).  The video input is 1280 by 1024 at 60 Hz, 32 bits per pixel.  It feeds two line buffers, alternating lines, which then get fed into the MCB.  On the buffer output side, the opposite happens: two line buffers read alternating lines out of the MCB and reconstruct the video.

 

According to Bob's post, the most efficient user port burst length would be 2.  I took Bob's idea and implemented the state machine, polling for p_wr_full.  When p_wr_full is asserted, the state machine pauses until p_wr_full is no longer asserted, at which point more data is written to the MCB FIFO.

 

After this setup, I modified the input buffer state machine to write user burst lengths of 4 and 8, using "p_wr_count" as the signal to pause/unpause the FIFO loading.  If there was enough space for 4 or 8 words of data, the state machine would proceed.
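
The gating decision described above can be modelled in a few lines. This is a behavioural sketch, not RTL and not the state machine from the post: `wr_count` stands in for the value read from p_wr_count, the 64-word FIFO depth is taken from the maximum burst length mentioned in this thread, and the `margin` parameter is a hypothetical allowance for the count being a few cycles stale.

```python
FIFO_DEPTH = 64  # words; assumed from the 64-word maximum user burst length


def can_load_burst(wr_count: int, burst_len: int, margin: int = 0) -> bool:
    """Return True if the write FIFO appears to have room for one more burst.

    wr_count  -- number of words the FIFO reports as occupied (models p_wr_count)
    burst_len -- user burst length about to be loaded
    margin    -- hypothetical slack reserved because the reported count lags
    """
    free_words = FIFO_DEPTH - wr_count - margin
    return free_words >= burst_len


print(can_load_burst(56, 8))            # room for exactly one more burst of 8
print(can_load_burst(57, 8))            # one word short
print(can_load_burst(56, 8, margin=2))  # blocked once slack is reserved
```

The point of checking the count rather than the empty flag is that loading can resume as soon as one burst's worth of space is free, instead of waiting for the FIFO to drain completely.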

 

Also using Bob's idea, the cycle during which the command is sent is not wasted.  During the write/read command, the data is changing and p_wr_en is asserted.

 

The results:

at BL = 2, the line buffer cannot be written out fast enough before the next line begins

at BL = 4, line buffer can write fast enough, but leaves a tiny bit of margin

at BL = 8, line buffer has plenty of margin.

 

My guess is that when a command is sent to the MCB, a few cycles are wasted in the internal state machines just to process the command.  This is just a guess; I have no insight into how the MCB is designed internally.

 

The good thing is that my video buffer works with one input and one output.  Adding a second input or a second output, however, still results in too many wait states, and as a result the lines cannot be written fast enough (i.e., any combination of 3 input/output buffers causes a malfunction).  I might need a second chip to have two concurrent buffers (the application requires two video channels).

 

-J

Expert Contributor
eteam00
Posts: 7,505
Registered: ‎07-21-2009
0

Re: Spartan 6 MCB Performance


J,

 

Excellent post!  I'm sure many of the readers will find this interesting and useful.

 

Is BL=8 giving you better performance than BL=64?

 

Have you considered increasing the user port width, increasing the user port state machine clock frequency, or both?

For example,  memory peak bandwidth is 1.2GB/sec (at 300MHz memory clock), and state machine can only do up to 400MB/sec (4 bytes x 100MHz).

 

One last question:  How are your video pixels 32 bits rather than 24 bits (3 colour channels x 8 bits per channel)?

 

-- Bob Elkind

Regular Contributor
jliu83
Posts: 68
Registered: ‎06-26-2008
0

Re: Spartan 6 MCB Performance

Yes, BL=8 gives better performance than BL=64.  I believe this is mainly because with BL=64 I need to poll p_wr_empty, a signal that seems to have a lot more latency.  I just tested with BL=16, with no gain over BL=8, which means the state machine is probably running as efficiently as possible.

 

With respect to increasing the user port width: I have considered this, but my application requires at least 4 ports (two buffers in, two buffers out).  Right now there is only one DDR2 chip on the dev board (Atlys from Digilent), so I'm limited to using 32-bit ports.

 

I have considered increasing the user port state machine clock, but this is rather a tricky affair.  Since the video coming in is at 108 MHz, I am using this clock out of convenience.  Because the user port is already asynchronous with respect to the clock from the DDR2 memory, I saw no reason to add another asynchronous element and feed the data into the memory at a different clock frequency.  Perhaps this is something I can look into to increase performance, although I expect this path to be very messy.

 

As for your last comment, I am wasting 8 bits with every memory address.  I need to create a pixel packing state machine that translates 24 bits into 32, with a data enable that indicates when to feed into the line buffers.  I was hoping the bandwidth would be large enough that I didn't need this step, but it seems I have no choice if I want to avoid adding a second DDR chip.  This would cut bandwidth by 25% for EACH buffer in/out port, which is significant.

Expert Contributor
eteam00
Posts: 7,505
Registered: ‎07-21-2009
0

Re: Spartan 6 MCB Performance

As for your last comment, I am wasting 8 bits with every memory address.  I need to create a pixel packing state machine that translates 24 bits into 32, with a data enable that indicates when to feed into the line buffers.  I was hoping the bandwidth would be large enough that I didn't need this step, but it seems I have no choice if I want to avoid adding a second DDR chip.  This would cut bandwidth by 25% for EACH buffer in/out port, which is significant.

 

The more practical view is that 4-byte transfers require 33% more bandwidth than the needed 3-byte transfers.  Your penalty is 33%, not 25%.

 

If you do not need the extra bandwidth, then don't bother.

 

A video line is 1280 x 3 bytes.  Using BL=8 (32 bytes) transactions, a video line is 120 transactions.

If you need the bandwidth optimisation, then good luck with your new and improved state machine(s)!
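
The 25% and 33% figures in this exchange describe the same wasted byte measured against different baselines, and the 120-transaction count follows from the line size. A quick check, using only numbers from the thread (variable names are illustrative):

```python
PACKED_BYTES = 3   # 24-bit pixel
PADDED_BYTES = 4   # 24-bit pixel stored in a 32-bit word

# Extra traffic of padded pixels relative to packed pixels (the 33% view).
extra_vs_packed = (PADDED_BYTES - PACKED_BYTES) / PACKED_BYTES
# Traffic saved by packing relative to the padded traffic (the 25% view).
saved_vs_padded = (PADDED_BYTES - PACKED_BYTES) / PADDED_BYTES

LINE_PIXELS = 1280
BURST_BYTES = 8 * 4  # BL=8 on a 32-bit user port

packed_transactions = LINE_PIXELS * PACKED_BYTES // BURST_BYTES
padded_transactions = LINE_PIXELS * PADDED_BYTES // BURST_BYTES

print(round(extra_vs_packed * 100), round(saved_vs_padded * 100))
print(packed_transactions, padded_transactions)
```

So packing brings a line down from 160 BL=8 transactions to 120.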

 

-- Bob Elkind

Regular Contributor
jliu83
Posts: 68
Registered: ‎06-26-2008

Re: Spartan 6 MCB Performance

A follow-up to the post.  I have completed the dual frame buffer and the circuit seems to be stable and functioning correctly.  There are no artifacts and the line buffers are being read/written within the allotted time requirements.  In the end, these are the settings I ended up using to get the frame buffer to work.

 

1.  Boost memory speed to 800MHz.  This had to be done to increase the read/write speed.

2.  For reads, use user port burst length 64, and use px_rd_full.  This seems to give least latency.

3.  For writes, use user port burst length 32, and use px_wr_count to see if there is enough space in the write buffer.  This is a minor improvement over burst lengths 16 and 8.

4.  Pixel packing required for all 4 interfaces to work (2 write, 2 read).

 

Originally I did not notice a difference going between burst lengths of 4 to 8 to 16.  However, it seems that clocks were wasted while commands were being issued to the MCB interface.  This was noticeable only when all 4 buffer cores were implemented.  The margins were definitely better using larger user burst lengths.

 

The pixel packing/unpacking machine takes the 24-bit color and packs it into 32-bit words (and vice versa).  I was unable to read from the line buffers within the allotted time without this step (writing seems to be okay).
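
As an illustration of what such a packing step does to the data, here is a software model in which four 24-bit pixels occupy three 32-bit words. It is a sketch only: the little-endian byte order and the function names are assumptions, and the actual design is a hardware state machine whose layout is not given in the post.

```python
def pack_pixels(pixels):
    """Pack 24-bit pixel values into 32-bit words (4 pixels per 3 words)."""
    data = b"".join(p.to_bytes(3, "little") for p in pixels)
    data += b"\x00" * (-len(data) % 4)  # pad the final word if needed
    return [int.from_bytes(data[i:i + 4], "little") for i in range(0, len(data), 4)]


def unpack_pixels(words, count):
    """Recover `count` 24-bit pixels from packed 32-bit words."""
    data = b"".join(w.to_bytes(4, "little") for w in words)
    return [int.from_bytes(data[i:i + 3], "little") for i in range(0, 3 * count, 3)]


pixels = [0x112233, 0x445566, 0x778899, 0xAABBCC]
words = pack_pixels(pixels)
print(len(words))                         # 3 words for 4 pixels
print(unpack_pixels(words, 4) == pixels)  # round trip

line_words = len(pack_pixels([0] * 1280))
print(line_words)                         # words per 1280-pixel line
```

A 1280-pixel line shrinks from 1280 words to 960 words, which is where the bandwidth saving comes from.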

 

Throughput is calculated as follows:

 

Per channel:

1280x1024x60 Hz, at 24-bit color = 236 MByte/second

 

Total of 2 write channels, 2 read channels

(2 + 2) * 236 MByte/second = 944 MByte/second

 

At 800 MHz, the max theoretical is 1600 MByte/second, so the throughput as a fraction of peak is then

944/1600 ~= 59%

 

Not great, but I'll take it.  I tried turning down the clock rate to 700MHz, but again, the line buffer reads could not achieve the required speed.  Hope this helps people out there trying to use the core.
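
The throughput figures above can be checked with a few lines. The sketch reads "800 MHz" as 800 million transfers per second on the 16-bit memory bus, which is what gives the 1600 MByte/second peak. The last two lines are an extra data point that is not in the post: the same calculation for unpacked 32-bit pixels.

```python
WIDTH, HEIGHT, FPS = 1280, 1024, 60
CHANNELS = 4  # 2 write + 2 read

def stream_mb_s(bytes_per_pixel: int) -> float:
    """Bandwidth of one video channel in MByte/second (1 MByte = 1e6 bytes)."""
    return WIDTH * HEIGHT * FPS * bytes_per_pixel / 1e6

peak = 16 / 8 * 800               # 16-bit bus, 800 MT/s (assumed reading of "800 MHz")

per_channel = stream_mb_s(3)      # packed 24-bit pixels
total = CHANNELS * per_channel
utilisation = total / peak

unpacked_total = CHANNELS * stream_mb_s(4)   # 32 bits per pixel, no packing
unpacked_utilisation = unpacked_total / peak

print(round(per_channel), round(total), round(utilisation * 100))
print(round(unpacked_total), round(unpacked_utilisation * 100))
```

Without packing, the four channels would need roughly 79% of peak memory bandwidth, which is consistent with packing being required.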

 

-J

 

Expert Contributor
eteam00
Posts: 7,505
Registered: ‎07-21-2009
0

Re: Spartan 6 MCB Performance


1.  Boost memory speed to 800MHz.  This had to be done to increase the read/write speed.

 

What is the operating clock frequency for the fabric logic accessing the memory controller on the user side?  Are you still using 108MHz?

 

The memory boost from 667 to 800MT/sec was a very smart move.  This puts the peak bandwidth of the memory roughly at parity with the peak bandwidth of the aggregate of the four user ports.

 

Consider the bandwidth mismatch between a single 32-bit user port operating at 108MHz vs. 16-bit memory interface operating at 800MT/sec.  This represents a single-port bandwidth imbalance of almost 4:1.  This presents some interesting performance consequences -- please keep reading.

 

2.  For reads, use user port burst length 64, and use px_rd_full.  This seems to give least latency.

 

As the READ DATA FIFO depth is 64 words, you would need to completely drain the READ DATA FIFO before issuing a subsequent READ command.  There might be a resulting loss of read bandwidth on this port with this dependency, but this effect is probably dwarfed by the beneficial counter-effect of keeping the memory interface busy with READ traffic (rather than inter-mingled READ and WRITE traffic) for as long as possible.  Please keep reading...

 

3.  For writes, use user port burst length 32, and use px_wr_count to see if there is enough space in the write buffer.  This is a minor improvement over burst lengths 16 and 8.

 

My guess is that the bandwidth loss you are seeing is the result of switching between READ and WRITE transactions to the DRAM.  The fabric side bandwidth of any single user port is much too low to keep the memory interface occupied for very long, and this forces the memory controller to service READ transactions while the WRITE queue is empty, and vice versa.

 

Check the memory device datasheet for the READ followed by WRITE and WRITE followed by READ delays.  They are substantial!  For a -25E speed grade device (DDR2-800, CL5), the number of DQ bus 'dead' cycles between

  • a READ and a WRITE is roughly 1 CK cycle (2.5 ns @ 400MHz memory clock)
  • a WRITE and a READ is roughly 8 CK cycles (20 ns @ 400MHz memory clock)

If READs are always followed by WRITEs, and vice versa, each 32-word transaction consumes roughly 20.5 memory clock cycles rather than the peak performance mode figure of 16 memory clock cycles.  Peak memory bandwidth is effectively 1.25GB/sec rather than 1.6GB/sec, a 22% decrease.  These figures ignore many details (refresh, precharge, etc. etc.), but the issue is made clear enough to earn your attention.
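
The arithmetic in the paragraph above can be reproduced as a rough sketch. It takes the post's figures as given (16 CK cycles of data per transaction, 1 and 8 dead cycles for the two turnarounds) and assumes READs and WRITEs strictly alternate, so each transaction pays the average of the two penalties. Variable names are made up for illustration.

```python
TRANSFER_CK = 16        # data cycles per transaction, as stated in the post
READ_TO_WRITE_CK = 1    # approximate dead cycles, READ followed by WRITE
WRITE_TO_READ_CK = 8    # approximate dead cycles, WRITE followed by READ
PEAK_MB_S = 1600        # 16-bit DDR2 at 800 MT/s

# With strict alternation, half the transactions pay each penalty.
avg_turnaround = (READ_TO_WRITE_CK + WRITE_TO_READ_CK) / 2
cycles_per_transaction = TRANSFER_CK + avg_turnaround

efficiency = TRANSFER_CK / cycles_per_transaction
effective_mb_s = PEAK_MB_S * efficiency
loss = 1 - efficiency

print(cycles_per_transaction, round(effective_mb_s), round(loss * 100))
```

Changing TRANSFER_CK shows how quickly the relative penalty shrinks as each transaction keeps the bus busy for longer.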

 

You can perhaps recover some of the interleaved READ/WRITE penalty cycles by overlapping the WRITE commands of the two WRITE ports and overlapping the READ commands of the two READ ports.  WRITE, WRITE, READ, READ, WRITE, WRITE, READ, READ will be more efficient than WRITE, READ, WRITE, READ, WRITE, READ, WRITE, READ.  You do not have enough user-port bandwidth to issue more than one READ or WRITE transaction from a single port before all of the other 3 ports are serviced, so coordinating the activity of two ports is the next-to-best approximation of increased user-port bandwidth.
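
To put rough numbers on the two orderings mentioned above, the sketch below counts turnaround dead cycles over a command sequence, using the approximate 1-cycle and 8-cycle penalties quoted earlier in this post. It ignores refresh, precharge and everything else, and the function name is hypothetical.

```python
# Approximate DQ bus dead cycles when the access type changes.
PENALTY_CK = {("R", "W"): 1, ("W", "R"): 8}


def dead_cycles(sequence: str) -> int:
    """Sum turnaround dead cycles between consecutive commands ('R' or 'W')."""
    return sum(PENALTY_CK.get(pair, 0) for pair in zip(sequence, sequence[1:]))


grouped = dead_cycles("WWRRWWRR")
alternating = dead_cycles("WRWRWRWR")
print(grouped, alternating)
```

Under these assumptions, grouping the accesses in pairs roughly halves the turnaround overhead for the same eight transactions.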

 

Originally I did not notice a difference going between burst lengths of 4 to 8 to 16.  However, it seems that clocks were wasted while commands were being issued to the MCB interface.

 

This could be explained by the interleaved READ - WRITE turnaround problem.

 

If your memory bandwidth problems are settled, there is no justifiable reason for flogging the problem further, unless you have enough time and energy to indulge in curiosity.  If it happens that curiosity must be served, then perhaps experimenting with 64-word (rather than 32-word) WRITE accesses and alternate priority arbitration schemes might help recover additional memory bandwidth.

 

While each single user port is woefully imbalanced with respect to the 800MT/sec memory interface, the aggregate of all 4 user ports comes very close to peak bandwidth parity with the memory interface.  In such a case, any lost cycles (specifically filling or draining DATA FIFOs) on the user port side for any of the 4 user ports will represent lost memory interface bandwidth.  Any design optimisations which will keep the user-port FIFOs from either EMPTY or FULL condition should result in improved memory buffer efficiency.

 

-- Bob Elkind
