cancel
Showing results for 
Show  only  | Search instead for 
Did you mean: 
akboken
Adventurer
Adventurer
5,377 Views
Registered: ‎10-19-2015

Resetting a BRAM in an efficient way

 

Hi All.

 

I have a local static buffer which accumulates some input data overtime. I want to have a reset logic (when I want to) that resets the contents of LOCAL_BUFFER. One thing about LOCAL_BUFFER is that it stores data from different iterations of example_top function (that is way it is declared as a static). 

 

I am wondering what is the best way to reset the LOCAL_BUFFER ? LOCAL_BUFFER is implemented as a BRAM. 

 

 

example_top (int a[], int b[], int reset)
{

static int LOCAL_BUFFER[10][10];


// is this an efficient way to reset a buffer ???
if(reset==1) {
for(int i=0;i<10;i++) { 
   for(int i=0;i<10;i++) { 
   LOCAL_BUFFER[i][j] = 0;
   }
}
}


}
Tags (4)
0 Kudos
8 Replies
hpoetzl
Voyager
Voyager
5,370 Views
Registered: ‎06-24-2013

Hey @akboken,

 

// is this an efficient way to reset a buffer ???
if (reset==1) {
    for (int i=0; i<10; i++) { 
        for (int i=0; i<10; i++) { 
            LOCAL_BUFFER[i][j] = 0;
        }
    }
}

Two things I see here ...

  1. You are using the same loop variable and 'j' is undefined 8-)
  2. Your loop will take 100 cycles (50 cycles on dual port BRAM) to finish

Depending on your application, the 100/50 cycles might be fine or even required for the reset, but there are other ways to do it.

 

Options I see here are:

  • Automatically reset the buffer element after read-out.
  • Take a 'note' that a reset is scheduled and 'reset' the value on the next update.

Hope this helps,

Herbert

-------------- Yes, I do this for fun!
0 Kudos
akboken
Adventurer
Adventurer
5,354 Views
Registered: ‎10-19-2015

Hi Hpoetzl,

 

Could you please give an example of following options ? I really interested in automatically resetting and resetting in the next update options you mentioned ? Do you have a code example to do that ?

 

Kind regards,

Akboken.

 

  • Automatically reset the buffer element after read-out.
  • Take a 'note' that a reset is scheduled and 'reset' the value on the next update.
0 Kudos
hpoetzl
Voyager
Voyager
5,342 Views
Registered: ‎06-24-2013

Hey @akboken,

 

I really interested in automatically resetting and resetting in the next update options you mentioned?

Do you have a code example to do that?

 

So the 'reset on read-out' is quite simple: somewhere in your code you 'retrieve' the accumulated value and in many cases this is the perfect place to 'reset' this specific element. For example, if you are counting events, you're often interested in the number of events since the last read-out to accumulate.

 

The 'take a note' approach is based on moving the 'needs reset' information into local memory (i.e. out of the BRAM) and can happen in two ways (the specific solution depends on the requirements of your design).

 

The first approach is to have a single 'flag' which basically says: 'we are in reset'. The trick here is to assert this flag long enough so that every element gets reset once and/or all elements get out of reset at the same time. Note that this approach only works if it is guaranteed that each element will be accessed in a well known timeframe. Also note that this doesn't need to 'lose values' during the 'reset' period, it can still work properly.

 

An example here is if you constantly cycle through all your 100 elements to update them, then the reset procedure only needs to keep the 'flag' for 100 cycles and every element will reset properly (i.e. instead of e.g. incrementing the element value, you just set it to zero. same loop, different path).

 

The second approach here is to use a vector or array with a bit for each element (so in your case, a 100bit vector or 100 flags) which you can easily set or reset in a single cycle. Each bit there represents the respective element and signals that the next 'action' taken is or includes a reset. Note that this can usually be done in a smart way without disturbing the 'normal business'.

 

Here some code fragments (of the top of my head) to illustrate:

uint1 valid[100] = {0};

// increment element at 'index'
if (valid[index]) {
     counter[index]++;
} else {
     counter[index] = 0;
     valid[index] = 1;
}
// add value to element at 'index'
if (valid[index]) {
    sum[index] += value;
} else {
    sum[index] = value;
    valid[index] = 1;
}
// reset on read-out
if (valid[index])
    return sum[index];
else
    return 0;

Hope this helps,

Herbert

-------------- Yes, I do this for fun!
0 Kudos
dgisselq
Scholar
Scholar
5,320 Views
Registered: ‎05-21-2015

Let me offer my own quick two cents ...

 

Another approach to resetting memory, which I have used often, is ... not to.

 

Here's what I mean: I'm usually reading and writing memory's sequentially.  Any time I want to "reset" the memory, I'll set the next read/write pointers to zero, and mark the memory as "not-yet written to."  By then keeping track of the read and write pointers, I can tell which parts of the memory have been written and which are in their "reset" state.  Logic then "reads" memory still in the "reset" state as whatever value you want it to be, even though the memory itself hasn't been reset.

 

This solution isn't all that general.  It's quite application dependent when determining whether or not you can use this approach.  Still, I've managed to use this approach quite successfully for block averager's, other digital filters, FFT's, home-made scope's, and more.

 

Dan

 

u4223374
Advisor
Advisor
5,298 Views
Registered: ‎04-26-2015

@akboken Can you post a bit more code and/or describe the algorithm? In particular, it'd be good to see where this array gets read and written.

 

We've seen three techniques posted here (reset during readout to prepare for the next cycle, store an "empty" flag, and storing pointers that mark which RAM has been initialized). Each works well in some situations and poorly in others. Sometimes, however, the straightforward approach (as you posted) is the only suitable one, and in that case the focus changes to making it as quick as possible. Array partitioning and loop unrolling can often cut the time down substantially, and might even reduce resource consumption.

0 Kudos
bigbrett
Adventurer
Adventurer
5,193 Views
Registered: ‎07-08-2016

I would be incredibly interested in a follow up post, maybe with a toy example, as I've run into this issue many times.

0 Kudos
u4223374
Advisor
Advisor
5,173 Views
Registered: ‎04-26-2015

@bigbrett I'm working on it now.

 

First example: using markers to show which section of RAM is "valid" and which section is "invalid". This is easily done for something where you know that some long, contiguous section is "valid" and some other section is "invalid". A FIFO or ring buffer is a very common example. What I've implemented here is a stack, which is even easier than a ring buffer or FIFO because it only has one index:

#include <stdint.h>

#define OP_PUSH 0
#define OP_POP 1
#define OP_CLEAR 2

#define STACK_LENGTH 256

void stackOps(uint8_t * data, uint8_t operation) {

    static uint16_t stackTop = 0;
    static uint8_t stack[STACK_LENGTH];

    switch (operation) {
    
    case OP_PUSH:
        if (stackTop < STACK_LENGTH) {
            stack[stackTop] = *data;
            stackTop++;
        }
        break;
        
    case OP_POP:
        if (stackTop > 0) {
            stackTop--;
            *data = stack[stackTop];
        } else {
            *data = 0;
        }
        break;
        
    case OP_CLEAR:
        stackTop = 0;
        break;
        
    }
}


The stack never gets "cleared" (in that elements are never written to zero, unless you do that explicitly). Instead, when you tell it to clear, it just resets the stack top marker (showing how many elements are in the stack) to zero. Since there's no way to read elements above the top marker, and you can't increase the top marker without overwriting the elements as you do that, this approach effectively clears the stack with a simple one-line operation.

u4223374
Advisor
Advisor
5,161 Views
Registered: ‎04-26-2015

The next three examples will cover computing the median (middle value) and mode (most common value) of a data stream, by first building a histogram of the stream and then gathering statistics from the histogram.

First up, the obvious approach. We don't want to waste time clearing the histogram, so we fully partition it to allow the clear to happen in one clock cycle.

#include <stdint.h>
#include <ap_int.h>
#include <ap_axi_sdata.h>
#include <hls_stream.h>

typedef ap_axiu<8,1,1,1> data_element;
typedef hls::stream<data_element> data_stream;

void findMedian0(data_stream & streamIn, uint8_t * median, uint8_t * mode) {

    // Array -> cannot be initialized to zero "quickly"...
    uint8_t histogram[256];
    // ... so we partition it into registers.
    #pragma HLS ARRAY_PARTITION variable=histogram dim=1 complete

    for (int i = 0; i < 256; i++) {
#pragma HLS LOOP_TRIPCOUNT min=256 max=256
#pragma HLS UNROLL
        histogram[i] = 0;
    }

    // 8-bit count of the total elements -> can be initialized to zero in one cycle.
    uint8_t totalElements = 0;

    while (true) {
#pragma HLS PIPELINE II=1
#pragma HLS LOOP_TRIPCOUNT min=128 max=255
        data_element dataIn = streamIn.read();
        uint8_t num = dataIn.data;
        // Check whether this address in the histogram is valid.

        histogram[num] ++;
        totalElements++;
        if (dataIn.last) {
            break;
        }
    }

    // Now find the median and mode.
    // For a true median we need to read two values, at least if totalElements is even.
    // If totalElements is even then these both come out at the same value, and it still works.
    uint8_t sum = 0;
    uint8_t median0 = 0;
    uint8_t median1 = 0;
    bool found0 = false;
    bool found1 = false;

    uint8_t modeTmp = 0;
    uint8_t modeCount = 0;

    // Find the median. We do this by going through the array and counting how many elements were found,
    // until we reach half the total value. When totalElements is even, we actually need to find two values
    // ((totalIndex/2) +/- 0.5).

    for (int i = 0; i < 256; i++) {
#pragma HLS PIPELINE II=1
        uint8_t value = histogram[i];

        uint8_t newSum = sum + value;
        if ((sum < ((totalElements - 1) / 2)) && (newSum >= ((totalElements - 1) / 2))) {
            median0 = i;
        }
        if ((sum < (totalElements / 2)) && (newSum >= (totalElements / 2))) {
            median1 = i;
        }

        sum = newSum;
        if (value > modeCount) {
            modeTmp = i;
            modeCount = value;
        }

    }

    *median = (median0 + median1) / 2;
    *mode = modeTmp;
}


Key stats - 390 to 517 cycles, 2161 FFs, 3352 LUTs.

This works, but it's pretty expensive in hardware.

Approach #2 is to have a 256x1-bit register array to mark which elements are cleared. That way, rather than building huge 256x8 multiplexers, HLS only has to build a 256x1 multiplexer.

#include <stdint.h>
#include <ap_int.h>
#include <ap_axi_sdata.h>
#include <hls_stream.h>

typedef ap_axiu<8,1,1,1> data_element;
typedef hls::stream<data_element> data_stream;

void findMedian1(data_stream & streamIn, uint8_t * median, uint8_t * mode) {

    // Array -> cannot be initialized to zero "quickly".
    uint8_t histogram[256];
    // 256-bit "valid" signal -> can be initialized to zero in one cycle.
    // HLS can equally well do this with a fully-partitioned array of bool.
    ap_uint<256> valid = 0;
    // 8-bit count of the total elements -> can be initialized to zero in one cycle.
    uint8_t totalElements = 0;

    while (true) {
#pragma HLS PIPELINE II=1
#pragma HLS LOOP_TRIPCOUNT min=128 max=255
        data_element dataIn = streamIn.read();
        uint8_t num = dataIn.data;
        // Check whether this address in the histogram is valid.
        if (valid[num]) {
            // If it is, increment it.
            histogram[num] ++;
        } else {
            // If it's not, reset it to 1 and mark it valid.
            histogram[num] = 1;
            valid[num] = 1;
        }
        totalElements++;
        if (dataIn.last) {
            break;
        }
    }

    // Now find the median and mode.
    // For a true median we need to read two values, at least if totalElements is even.
    // If totalElements is even then these both come out at the same value, and it still works.
    uint8_t sum = 0;
    uint8_t median0 = 0;
    uint8_t median1 = 0;
    bool found0 = false;
    bool found1 = false;

    uint8_t modeTmp = 0;
    uint8_t modeCount = 0;

    // Find the median. We do this by going through the array and counting how many elements were found,
    // until we reach half the total value. When totalElements is even, we actually need to find two values
    // ((totalIndex/2) +/- 0.5).

    for (int i = 0; i < 256; i++) {
#pragma HLS PIPELINE II=1
        // Disregard invalid histogram values.
        uint8_t value;
        if (valid[i]) {
            value = histogram[i];
        } else {
            value = 0;
        }
        uint8_t newSum = sum + value;
        if ((sum < ((totalElements - 1) / 2)) && (newSum >= ((totalElements - 1) / 2))) {
            median0 = i;
        }
        if ((sum < (totalElements / 2)) && (newSum >= (totalElements / 2))) {
            median1 = i;
        }

        sum = newSum;
        if (value > modeCount) {
            modeTmp = i;
        }

    }

    *median = (median0 + median1) / 2;
    *mode = modeTmp;
}


Key stats - 519 to 773 cycles, 380 FFs, 496 LUTs. The extra cycles are because the first loop is now at II=2; persuading HLS to pipeline a histogram loop at II=1 when using block RAM is tricky because of data dependency issues. on the other hand, for a ~50% increase in runtime, we've reduced the LUT usage by over 85%.


Approach #3 relies on the fact that we do actually step through the whole array at the end, looking for the median and mode. If the array is static (so it holds values between iterations) then we can just clear it here so it's ready for the next round.

#include <stdint.h>
#include <ap_int.h>
#include <ap_axi_sdata.h>
#include <hls_stream.h>

typedef ap_axiu<8,1,1,1> data_element;
typedef hls::stream<data_element> data_stream;

void findMedian2(data_stream & streamIn, uint8_t * median, uint8_t * mode) {

    // Array -> cannot be initialized to zero "quickly".
    static uint8_t histogram[256] = {0}; // This initialization only occurs at power-up (which the FPGA can do).

    // 8-bit count of the total elements -> can be initialized to zero in one cycle.
    uint8_t totalElements = 0;

    while (true) {
#pragma HLS PIPELINE II=1
#pragma HLS LOOP_TRIPCOUNT min=128 max=255
        data_element dataIn = streamIn.read();
        uint8_t num = dataIn.data;
        // Check whether this address in the histogram is valid.

        histogram[num] ++;
        totalElements++;
        if (dataIn.last) {
            break;
        }
    }

    // Now find the median and mode.
    // For a true median we need to read two values, at least if totalElements is even.
    // If totalElements is even then these both come out at the same value, and it still works.
    uint8_t sum = 0;
    uint8_t median0 = 0;
    uint8_t median1 = 0;
    bool found0 = false;
    bool found1 = false;

    uint8_t modeTmp = 0;
    uint8_t modeCount = 0;

    // Find the median. We do this by going through the array and counting how many elements were found,
    // until we reach half the total value. When totalElements is even, we actually need to find two values
    // ((totalIndex/2) +/- 0.5).

    for (int i = 0; i < 256; i++) {
#pragma HLS PIPELINE II=1
        uint8_t value = histogram[i];
        histogram[i] = 0; // Clear the histogram.

        uint8_t newSum = sum + value;
        if ((sum < ((totalElements - 1) / 2)) && (newSum >= ((totalElements - 1) / 2))) {
            median0 = i;
        }
        if ((sum < (totalElements / 2)) && (newSum >= (totalElements / 2))) {
            median1 = i;
        }

        sum = newSum;
        if (value > modeCount) {
            modeTmp = i;
        }

    }

    *median = (median0 + median1) / 2;
    *mode = modeTmp;
}


Key stats - 518 to 772 cycles, 109 FFs, 208 LUTs. Better than a 90% saving in LUTs compared to the original approach. The downside is that now you do have to actually go through the whole histogram at the end; whereas with the other approaches you could abort the loop early in some cases (if you've found both medians and the current modeCount is larger than the number of data elements remaining) (not implemented in my code). However, worst-case runtime would still be the same.


Each of these works best in certain situations.

- The first one would be fine if you only had, say, 16 bins in the histogram - a 16-way mux is straightforward for HLS. Resource consumption would be similar to any other approach.

- The second approach makes sense in this sort of situation (ie about 256 bins), especially when you do want to be able to abort the final loop early (or, alternatively, you're just not doing the final loop at all).

- The third approach makes a lot of sense for higher bit depths. I've previously done a 16-bit histogram on a 640x480 image. A 16-bit histogram requires 65536 bins. Needless to say, if you ask HLS to build a 65536-input multiplexer (either 8-bit for the first approach or 1-bit for the second), it's not particularly happy. With the third approach there are no multiplexers that depend on the size of the histogram, so it scales much better.