venuadabala
Visitor
1,570 Views
Registered: 12-09-2020

AXI4-Lite interface


Hi,

I want to use BRAM as instruction cache memory for a processor, so I have used the AXI4-Lite interface for the BRAM from the IP catalog. However, I am not sure whether I also have to use an AXI4 controller, and the AXI4-Lite interface is not responding.

0 Kudos
1 Solution

15 Replies
EajksEajks
Participant
1,536 Views
Registered: 11-21-2020

To be really sure of what you mean, we would need a block diagram showing the blocks and the bus characteristics (protocol, width, frequency). It is also unclear what your problem really is and what you expect from us.

This said, I do not think that AXI4-Lite is the right protocol for implementing a cache memory interface: neither for the processor-cache interface, nor for the cache-memory interface.

venuadabala
Visitor
1,522 Views
Registered: 12-09-2020

I need to create an instruction memory inside the FPGA to hold instructions for the processor core, and I am thinking of using BRAM from the IP catalog. However, this BRAM comes with two interfaces: one native and one AXI4/AXI4-Lite. The native interface seems to respond well, but the AXI4-Lite interface is not responding at all. My processor uses a lot of signals, so I need the AXI4-Lite interface. The processor in my design uses a custom interface, so I need to translate that interface to the AXI4-Lite protocol, but first I wish to understand how the AXI4-Lite interface works. I am sorry for any mistakes; this is my first time participating in a forum community.

The interface I wish to translate to the AXI4-Lite protocol is documented at https://core-v-docs-verif-strat.readthedocs.io/projects/cv32e40p_um/en/latest/instruction_fetch.html

I have instantiated the BRAM with the AXI4-Lite interface as follows. I supplied the necessary inputs and a COE file as well, but it is not responding; the signals are going to the high-impedance state.

blk_mem_gen_0 U1 (
    // reset/busy and clocking
    .rsta_busy (rsta),
    .rstb_busy (rstb),
    .s_aclk    (clk),
    .s_aresetn (rst_n),

    // write address channel
    .s_axi_awaddr  (write_addr),
    .s_axi_awvalid (write_addr_valid),
    .s_axi_awready (write_addr_ready),

    // write data channel
    .s_axi_wdata  (write_data),
    .s_axi_wstrb  (write_data_strb),
    .s_axi_wvalid (write_data_valid),
    .s_axi_wready (write_data_ready),

    // write response channel
    .s_axi_bresp  (write_resp),
    .s_axi_bvalid (write_resp_valid),
    .s_axi_bready (write_resp_ready),

    // read address channel
    .s_axi_araddr  (read_addr),
    .s_axi_arvalid (read_addr_valid),
    .s_axi_arready (read_addr_ready),

    // read data channel
    .s_axi_rdata  (read_data),
    .s_axi_rresp  (read_data_resp),
    .s_axi_rvalid (read_data_valid),
    .s_axi_rready (read_data_ready)
);
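
A minimal sketch of the kind of stimulus a testbench would need to complete one read against this instance (everything beyond the instantiation above is illustrative: clock period, widths, the wait on rsta_busy/rstb_busy; the write channels are simply tied off):

module tb_blk_mem;
  reg         clk   = 1'b0;
  reg         rst_n = 1'b0;
  reg  [31:0] read_addr;
  reg         read_addr_valid = 1'b0;
  reg         read_data_ready = 1'b1;   // always ready to accept data
  wire        read_addr_ready, read_data_valid;
  wire [31:0] read_data;
  wire [1:0]  read_data_resp;
  wire        rsta, rstb;

  always #5 clk = ~clk;                 // 100 MHz, illustrative

  // same core as above, with the write channels tied off
  blk_mem_gen_0 U1 (
    .rsta_busy(rsta), .rstb_busy(rstb),
    .s_aclk(clk), .s_aresetn(rst_n),
    .s_axi_awaddr(32'h0), .s_axi_awvalid(1'b0), .s_axi_awready(),
    .s_axi_wdata(32'h0), .s_axi_wstrb(4'h0), .s_axi_wvalid(1'b0), .s_axi_wready(),
    .s_axi_bresp(), .s_axi_bvalid(), .s_axi_bready(1'b1),
    .s_axi_araddr(read_addr), .s_axi_arvalid(read_addr_valid), .s_axi_arready(read_addr_ready),
    .s_axi_rdata(read_data), .s_axi_rresp(read_data_resp),
    .s_axi_rvalid(read_data_valid), .s_axi_rready(read_data_ready)
  );

  initial begin
    repeat (5) @(posedge clk);
    rst_n = 1'b1;                       // release reset...
    wait (!rsta && !rstb);              // ...and wait until the core is out of its own reset
    @(posedge clk);
    read_addr       <= 32'h0000_0000;   // byte address of word 0
    read_addr_valid <= 1'b1;            // hold ARVALID until ARREADY is seen
    @(posedge clk);
    while (!read_addr_ready) @(posedge clk);
    read_addr_valid <= 1'b0;
    while (!read_data_valid) @(posedge clk);
    $display("rdata=%h rresp=%b", read_data, read_data_resp);
    $finish;
  end
endmodule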

0 Kudos
EajksEajks
Participant
1,510 Views
Registered: 11-21-2020

@venuadabala 

Could you share a timing diagram showing that it does not respond properly, please?

0 Kudos
EajksEajks
Participant
1,500 Views
Registered: 11-21-2020

@venuadabala 

A timing diagram could tell us what is going on and why it does not work. But, as you describe it, your design makes no sense: you want to go from a rather basic read interface to the complexity of AXI4-Lite, and then back to the basic interface of the BRAM memory (which in itself is not so different from your processor interface). You only add complexity, and probably latency or extra cycles, and you will therefore decrease the overall processor performance. The objective of a level-1 cache memory is to feed the processor pipeline as fast as it can, so its design needs to be as simple as possible. A BRAM with an AXI4-Lite interface is a BRAM with an additional wrapper, so the premise of keeping the design as simple as possible is already dead wrong...

0 Kudos
dgisselq
Scholar
1,473 Views
Registered: 05-21-2015

@venuadabala ,

I don't get it: why would you want to implement a new AXI interface for the CV32E40P?  Don't they already have one?  It's public and available on GitHub.  You should just be able to clone it and use it--unless this is an academic exercise of some sort.

Xilinx's AXI block RAM controller gets *horrible* performance--especially for non-burst accesses, which are all that AXI-Lite supports.  You are likely to get one read every four clock cycles--not counting the cost of the interconnect, which will be a minimum of another two clock cycles per access--so you are now looking at a CPU that requires six clock cycles to read an instruction from an on-chip memory.  If that is the time it takes your instruction cache to read an instruction, you are talking about a CPU with a CPI (clocks per instruction) of six or more.  Not very fast.

Others have commented on the structure of your design:

CPU -> Fetch -> (AXI infrastructure) -> AXI BRAM (acting as cache)

I don't think this was what your instructor was intending.  A much better implementation would be:

CPU -> Fetch (w/ integrated block RAM for cache) -> (AXI infrastructure) -> Any memory(ies), to include possibly block RAM, flash, or SDRAM

This latter implementation makes a lot more sense.  If you do it right, the CPU should be able to access a new instruction every clock cycle (except cache misses)--a lot faster than every 6th clock cycle.

One point to beware of: I just wrote something like this myself recently.  It's a fun exercise.  Beware that the AXI bus cannot abort any ongoing transactions.  Hence, if you request instructions for addresses 0, 1, 2, 3 and then the CPU jumps to address 7, the requests for addresses 0, 1, 2, and 3 will need to complete before you can get the result for address 7.  (You can almost request address 7 immediately; you just can't abandon prior requests that might be outstanding.)
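
A minimal sketch of that bookkeeping (all names are illustrative, and it assumes the fetch unit issues no new request in the cycle of the jump):

module fetch_flush #(parameter LGDEPTH = 4) (
  input  wire clk, rst_n,
  input  wire ar_request,   // ARVALID && ARREADY this cycle
  input  wire r_return,     // RVALID && RREADY this cycle
  input  wire jump_taken,   // the CPU redirected the fetch stream
  output wire insn_valid    // this return belongs to the current stream
);
  reg [LGDEPTH-1:0] outstanding; // requests issued but not yet returned
  reg [LGDEPTH-1:0] to_flush;    // stale returns we still have to swallow

  always @(posedge clk or negedge rst_n)
    if (!rst_n) outstanding <= 0;
    else        outstanding <= outstanding + ar_request - r_return;

  always @(posedge clk or negedge rst_n)
    if (!rst_n)          to_flush <= 0;
    else if (jump_taken) to_flush <= outstanding - r_return; // everything in flight is now stale
    else if (r_return && to_flush != 0)
                         to_flush <= to_flush - 1'b1;

  // a returned beat is usable only once every pre-jump request has drained
  assign insn_valid = r_return && (to_flush == 0) && !jump_taken;
endmodule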

Dan

dpaul24
Scholar
1,471 Views
Registered: 08-07-2014

@venuadabala ,

I do not have any experience in instruction cache implementation. But from my knowledge of AXI4-Lite, I can say that this protocol would be slow for interfacing between a uP and its cache.

------------FPGA enthusiast------------
Consider giving "Kudos" if you like my answer. Please mark my post "Accept as solution" if my answer has solved your problem
Asking for solutions to problems via PM will be ignored.

0 Kudos
dgisselq
Scholar
1,463 Views
Registered: 05-21-2015

@dpaul24 ,

The AXI-Lite protocol isn't slow.  AXI-Lite will support a throughput of one transaction per cycle, which is as fast as any protocol can go.  Xilinx's AXI-Lite infrastructure, however, is slower than molasses in the wintertime.  The AXI block RAM controller cannot support one burst per cycle; rather, it has a burst overhead of three additional cycles.  A better implementation would be able to support one transaction per cycle across multiple bursts.  Xilinx's interconnect is also known for throttling AXI-Lite transactions down to one transaction in flight at a time.  This is horrible performance, especially when it is possible to build an interconnect that maintains one transaction per clock.  The good news is that, once you get past the interconnect, Xilinx's MIG can handle AXI-Lite transactions at (nearly) one beat per clock of throughput.

Further, given that the CV32E40P is designed as an ASIC core, it would make sense to use something other than Xilinx's (poor) AXI infrastructure.  The good news is that there's enough open-source IP around that doing so is actually quite possible.

Dan

0 Kudos
EajksEajks
Participant
1,413 Views
Registered: 11-21-2020

@dgisselq 

Xilinx's MIG can handle AXI-Lite transactions at one beat per clock cycle? How is that possible? If behind your MIG you have a DDR3/DDR4 memory, transactions on the DDR bus will be forced to be 4W or 8W, even if it means you have to drop/fill data to match the burst length of the DDR with the burst length of your AXI transaction (= 1 in the case of AXI4-Lite).

Back to the processor case: you'll have to design your own AXI4-Lite BRAM-based cache memory to get the performance you expect. In the best case, AXI_ARADDR is directly connected to the BRAM address port and AXI_RDATA is directly connected to the BRAM output port (bypassing the BRAM output register). That produces a synchronous output, but with a significant output delay. Under those conditions you can read continuously, playing with AXI_ARVALID/ARREADY and AXI_RVALID/RREADY to fulfill the protocol requirements.
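
A rough sketch in that spirit (a read-only AXI4-Lite front end on an inferred BRAM; names, widths and the single register stage are placeholders, not a definitive implementation):

module axil_bram_rd #(parameter AW = 10) (
  input  wire          clk,
  input  wire          rst_n,
  // AXI4-Lite read address channel
  input  wire [AW+1:0] s_axil_araddr,   // byte address
  input  wire          s_axil_arvalid,
  output wire          s_axil_arready,
  // AXI4-Lite read data channel
  output reg  [31:0]   s_axil_rdata,
  output wire [1:0]    s_axil_rresp,
  output reg           s_axil_rvalid,
  input  wire          s_axil_rready
);
  reg [31:0] mem [0:(1<<AW)-1];   // inferred BRAM (load via $readmemh/COE in a real design)

  // accept a new address whenever the R channel is free or being drained;
  // this sustains one read per clock when the master keeps RREADY high
  assign s_axil_arready = !s_axil_rvalid || s_axil_rready;
  assign s_axil_rresp   = 2'b00;  // always OKAY for a simple BRAM

  always @(posedge clk)
    if (s_axil_arvalid && s_axil_arready)
      s_axil_rdata <= mem[s_axil_araddr[AW+1:2]]; // one cycle of BRAM latency

  always @(posedge clk or negedge rst_n)
    if (!rst_n)                                s_axil_rvalid <= 1'b0;
    else if (s_axil_arvalid && s_axil_arready) s_axil_rvalid <= 1'b1;
    else if (s_axil_rready)                    s_axil_rvalid <= 1'b0;
endmodule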

BUT... if you have a cache, it also means you have cache hits and cache misses. Cache misses mean you'll have to fetch the instructions from another source (L2/L3/DDR) and multiplex the instructions coming from the BRAM with the ones coming from other levels of the memory hierarchy. Therefore a direct connection from the BRAM to AXI_RDATA won't be possible, and you'll probably have to introduce another pipeline stage. That means you'll either have a more deeply pipelined processor, so that you have two cycles for instruction fetch, OR you read instructions two at a time, averaging one instruction per cycle.

If we really talk about performance, that is where I would worry when I look at the processor interface provided by @venuadabala 

0 Kudos
dgisselq
Scholar
1,394 Views
Registered: 05-21-2015

@EajksEajks ,

Yes, the MIG can handle one beat per clock cycle of throughput.  There's still a lag of 20-24 cycles involved, but it can sustain that kind of throughput.  Any good memory controller, whether for instructions or data, should be able to fill up a lot of that bandwidth.

As for a "self-designed" BRAM controller, best case is a single clock cycle delay when reading from a block RAM.  Achieving this delay would mean you'd need to dump the Xilinx AXI infrastructure (which would introduce at least two if not three cycles of delay).  Xilinx's AXI BRAM controller cannot achieve that performance.

One thing I don't see is how connecting the cache to the write half of any "self-designed" BRAM controller would impact the speed of the read side of that controller.  (We're talking I-cache, not D-cache, here.)  I don't see how anything would get in the way of the read datapath--since you won't need to multiplex it with anything else.  (Only the write datapath of the BRAM will be connected to any SDRAM type of memory ...)  If you optimize reads properly, writes should be able to take place concurrently--with sufficient surrounding logic to make certain that nothing reports read results back until the respective cache line has been filled following a miss.  Am I missing something here?

Dan

0 Kudos
EajksEajks
Participant
1,367 Views
Registered: 11-21-2020

@dgisselq

Thank you for reminding me what throughput is. The point is: if you have AXI4-Lite, which doesn't support bursts, on one side of the MIG, you won't get one beat per clock cycle of throughput. To reach one beat per cycle, after the initial 20-24 cycles of latency you mention, you need 1) AXI, because it supports both bursts and separate pipelined address and data transactions, and 2) bursts whose number of bytes is at least 4 or 8 times the external memory bus width.

Of course I would dump the AXI infrastructure. What's the point of having a pipelined processor and cache memories if you insert an AXI infrastructure between the two!? For the sake of losing cycles?

I never mentioned the BRAM write path. If I did, or if it could be understood that way, it might be due to my poor English. I know that an instruction cache is read-only... on the processor side; someone still has to bring the instructions into the cache. I'm teasing you. Considering cache design and the amount of things you have to do (check tags and so on), BRAM accesses will have to be a multiple of the instruction path width. It is probably better to work directly at the cache line level, which would be 4 or 8 instructions. This allows hiding cycles and introducing optimizations, with the objective of getting as close as possible to an average access time of one instruction per cycle.
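
To illustrate the line-level organization (a sketch only, assuming 32-bit instructions and 4-instruction, 128-bit lines; every name and parameter here is a placeholder):

module icache_lookup #(
  parameter LGLINES = 8                      // 256 lines
) (
  input  wire        clk,
  input  wire [31:0] pc,
  output reg  [31:0] insn,
  output reg         hit
);
  // one BRAM holds the data lines, another the tags; valid bits in registers
  reg [127:0]            data [0:(1<<LGLINES)-1];
  reg [31-LGLINES-4:0]   tag  [0:(1<<LGLINES)-1]; // upper PC bits
  reg [(1<<LGLINES)-1:0] valid;

  wire [LGLINES-1:0] idx = pc[LGLINES+3:4];       // line index
  reg  [127:0] line;
  reg  [1:0]   word;
  reg          tag_match;

  always @(posedge clk) begin
    line      <= data[idx];                       // read the whole line at once
    word      <= pc[3:2];
    tag_match <= valid[idx] && (tag[idx] == pc[31:LGLINES+4]);
  end

  // the word select is the only logic after the BRAM output register
  always @(*) begin
    insn = line[32*word +: 32];
    hit  = tag_match;
  end
endmodule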

0 Kudos
dgisselq
Scholar
1,357 Views
Registered: 05-21-2015

@EajksEajks ,

I've run tests on the MIG.  It doesn't introduce a burst-to-burst delay.  As a result, it will not slow AXI-Lite down (like the rest of Xilinx's infrastructure will).  I've run those tests and seen that result.

As for your third paragraph, we seem to be talking past each other and ... I must be misunderstanding what you are saying.  Didn't you say that "a direct connection from BRAM to AXI_RDATA won't be possible and you'll probably have to introduce another pipeline stage"?  What would this stage accomplish?

Dan

0 Kudos
EajksEajks
Participant
1,344 Views
Registered: 11-21-2020

@dgisselq 

There is no burst in AXI4-Lite. If you do not experience "burst"-to-"burst" delay - let's call it transaction-to-transaction delay - the only reasons I see are:

  •  You have an AXI bus width, or maximum achievable AXI throughput (frequency times bus width), much smaller than the actual DDR throughput. If I take a recent design of mine, a 64-bit-wide DDR4-2400 has a peak data throughput of 8 bytes/transfer * 2400 MT/s = 19.2 GB/s. To cope with this throughput, an AXI bus running at 200 MHz has to be 768 bits or 96 bytes wide (in fact 1024 bits, as it must be a power of 2). In that configuration you would indeed see no transaction-to-transaction delay with AXI4-Lite, because a single AXI4-Lite transfer would be 128 bytes, i.e. 2 x 8W bursts on the DDR4 interface, satisfying the way DDR works. But here we are talking about AXI4-Lite at the processor interface, with its 32-bit width. In this case you'll experience delays, as a single AXI4-Lite transaction doesn't carry enough data for a 4W or 8W transfer. Even if you consider a cache miss and the subsequent cache line fetch, you'll probably fetch 4 instructions at most, and you won't be able to produce a 4W transfer on the DDR; that would only happen with 8 instructions at once. And even if 4W transfers are allowed, I am not sure you can reach peak bandwidth; I tend to think it is only achievable with 8W transfers.
  •  The other alternative: you have back-to-back AXI4-Lite transactions with consecutive addresses, and the MIG is able to group them together. I know that some DDR controllers are able to do that, but I am not sure the MIG can. And anyway, on an instruction flow or an instruction cache miss, you won't enjoy such a long run of consecutive AXI4-Lite transactions.

A last comment about my third paragraph. If you have a cache, it means you can have cache hits or cache misses. Therefore the instruction path going to the processor may have two different sources, depending on how you implement the cache and whether you want to short-circuit the BRAM on an instruction cache refill to lower the miss penalty (the alternative is to always read from the BRAM, but then you might pay the penalty of writing, then reading, the BRAM... anyway, it is very implementation- and architecture-dependent). In this case you'll have two paths to multiplex.
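
The multiplex itself is small; a sketch with purely illustrative names, bypassing the BRAM when the word being fetched is the one currently arriving from the refill:

module insn_mux (
  input  wire [31:0] bram_rdata,   // normal hit path (registered BRAM output)
  input  wire [31:0] refill_data,  // word arriving from L2/DDR during a refill
  input  wire        refill_valid,
  input  wire [31:0] refill_addr,
  input  wire [31:0] fetch_addr,
  output wire [31:0] insn
);
  // short-circuit the BRAM on a refill hit to cut the miss penalty
  assign insn = (refill_valid && refill_addr == fetch_addr) ? refill_data
                                                            : bram_rdata;
endmodule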

0 Kudos
venuadabala
Visitor
1,256 Views
Registered: 12-09-2020

Hi, thanks for the info. But I just want to make sure that I am on the right track.

Use the BRAM with the native protocol and build logic that supports the extra signals needed by my CV32E40P interface. Is that right?

0 Kudos
dgisselq
Scholar
1,239 Views
Registered: 05-21-2015

@venuadabala ,

Were this my product, I'd infer the BRAM and build the entire design in RTL ... so I'm not sure I would agree that you are on the right track.  You might be ... for a different design approach than the one I am familiar with or the one I would recommend, but certainly not for the design methodology that I would use.
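
As a sketch of what inferring the BRAM in RTL could look like against the CV32E40P-style fetch handshake (signal names follow the core's fetch interface as seen from the memory side; the always-granted handshake, file name, and sizes are assumptions, not a definitive implementation):

module insn_mem #(
  parameter LGDEPTH  = 12,                  // 4K words = 16 KB
  parameter INITFILE = "program.hex"        // placeholder image file
) (
  input  wire        clk,
  input  wire        rst_n,
  // CV32E40P-style instruction fetch handshake
  input  wire        instr_req_i,
  output wire        instr_gnt_o,
  output reg         instr_rvalid_o,
  input  wire [31:0] instr_addr_i,
  output reg  [31:0] instr_rdata_o
);
  reg [31:0] mem [0:(1<<LGDEPTH)-1];
  initial $readmemh(INITFILE, mem);         // synthesis infers a preloaded BRAM

  assign instr_gnt_o = instr_req_i;         // memory can always accept a request

  always @(posedge clk or negedge rst_n)
    if (!rst_n) instr_rvalid_o <= 1'b0;
    else        instr_rvalid_o <= instr_req_i && instr_gnt_o; // data valid one cycle later

  always @(posedge clk)
    if (instr_req_i && instr_gnt_o)
      instr_rdata_o <= mem[instr_addr_i[LGDEPTH+1:2]]; // registered read maps to BRAM
endmodule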

Dan


0 Kudos
dpaul24
Scholar
1,206 Views
Registered: 08-07-2014

@venuadabala ,

I would recommend the native protocol.

------------FPGA enthusiast------------
Consider giving "Kudos" if you like my answer. Please mark my post "Accept as solution" if my answer has solved your problem
Asking for solutions to problems via PM will be ignored.

0 Kudos