Oli
Registered: ‎03-17-2021

"cache" scheme + pipelining // dataflow-stream

Hi all,

I'm pretty new to the FPGA/HLS world, so please excuse any big mistakes this message may contain...

I'm trying to improve an HLS implementation of a problem that follows this pattern (simplified):

float cont_var[LENGTH];                                                         
float out[LENGTH]; 
int   cont_idx[LENGTH];                                                         
float *sparse_var;                                                              
                                                                                
for(int i=0;i<LENGTH;i++) {                                                     
  int idx = cont_idx[i];                                                        
  float v = sparse_var[idx];                                                    
  out[i] = v * cont_var[i];                                                      
}                                                                               

 

The idea is that cont_var, out and cont_idx are contiguous arrays that can be moved between DDR and BRAM using wide data types and/or bursts. sparse_var, however, is accessed through an indirection and thus cannot benefit from the same access pattern. Note that the indexes into sparse_var may repeat, so the same value of v can be read more than once within a single invocation of this kernel; in addition, there is some spatial/temporal locality in the accesses to sparse_var that can be exploited.
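To make the locality claim concrete, here is a small host-side C++ model (the index values are invented for illustration) counting how many *distinct* sparse_var addresses an index trace actually touches; the gap between that count and LENGTH is the headroom a cache could exploit:

```cpp
#include <set>

// Hypothetical index trace: 8 loads are issued, but only a few
// distinct sparse_var addresses are touched. A cache turns the
// repeats into BRAM hits instead of DDR reads.
int distinct_accesses() {
    const int kLen = 8;
    int cont_idx[kLen] = {3, 3, 7, 3, 7, 42, 3, 7};
    std::set<int> distinct(cont_idx, cont_idx + kLen);
    return (int)distinct.size();  // 3 unique addresses out of 8 loads
}
```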

This loop can be pipelined, and that improves performance. However, we are still loading the items of sparse_var one at a time, paying the penalty of going to DDR and not exploiting the data locality.
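For reference, this is how the baseline loop might look as an HLS function with an explicit PIPELINE pragma (a hedged sketch, not from the post; the function name and the fixed kLen are invented, and each sparse_var read still goes to DDR):

```cpp
// Baseline: pipelined gather-multiply loop. cont_var/cont_idx/out are
// accessed sequentially (burst-friendly); sparse_var is the random-
// access port, read from DDR on every iteration.
void gather_product(const float *cont_var, const int *cont_idx,
                    const float *sparse_var, float *out) {
    const int kLen = 16;  // invented size for this sketch
    for (int i = 0; i < kLen; i++) {
#pragma HLS PIPELINE II=1
        int idx = cont_idx[i];
        out[i] = sparse_var[idx] * cont_var[i];  // indirect DDR read
    }
}

// Small host-side driver for the sketch above.
float demo_gather() {
    float cont_var[16], sparse_var[32], out[16];
    int cont_idx[16];
    for (int i = 0; i < 32; i++) sparse_var[i] = (float)i;
    for (int i = 0; i < 16; i++) { cont_var[i] = 1.0f; cont_idx[i] = (i * 7) % 32; }
    gather_product(cont_var, cont_idx, sparse_var, out);
    return out[3];  // sparse_var[21] * 1.0f
}
```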

My first idea was to add a "cache scheme", simplified:

float cont_var[LENGTH];
float out[LENGTH];
int   cont_idx[LENGTH];
float *sparse_var;

// "cache" structures
float sparse_var_cache[CACHE_SIZE_ELEMS];
int   sparse_var_tag[CACHE_SIZE_TAGS];

for(int i=0;i<LENGTH;i++) {
  int idx = cont_idx[i];
  float v;
  int tag = sparse_var_tag[fx1(idx)];
  if (tag != idx) {
    // miss: fetch a line of ELEMENTS values instead of one
    memcpy(sparse_var_cache + fx2(idx), sparse_var + idx, ELEMENTS * sizeof(float));
    sparse_var_tag[fx1(idx)] = idx;
  }
  // get the value from BRAM, not DDR
  v = sparse_var_cache[fx(idx)];
  out[i] = v * cont_var[i];
}

 

Without pipelining the loop, this implementation works better than the previous one (also without pipelining). The problem is that if I try to pipeline this implementation, it performs worse than the simpler one (with direct memory access).

As far as I've been able to see, it looks like, without any HLS pragmas, Vivado does nothing special in the simple version, but it pipelines the memcpy in the second implementation (so there is the combined benefit of BRAM access plus pipelining inside the memcpy).

If we try to pipeline both versions with HLS pragmas, Vivado can pipeline the first one but, due to inter-iteration dependences on the cache memory, it cannot achieve a good II for the second version. It also loses the pipelined implementation of the memcpy. That yields very poor performance, since the II of the loop is very high.
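For what it's worth, here is a host-side functional model of such a direct-mapped cache (the fx1/fx2/fx mappings and sizes are invented, and it tags by aligned line rather than by exact index, a small deviation from the sketch above so that neighboring indices also hit). In hardware, the read of sparse_var_tag followed by its conditional write in the same iteration is exactly the loop-carried dependence that inflates the II:

```cpp
#include <cstring>

// Hypothetical cache geometry: CACHE_LINES lines of ELEMENTS floats.
const int ELEMENTS    = 4;   // floats fetched per miss
const int CACHE_LINES = 8;   // number of tag entries

static float sparse_var_cache[CACHE_LINES * ELEMENTS];
static int   sparse_var_tag[CACHE_LINES];

static int fx1(int idx) { return (idx / ELEMENTS) % CACHE_LINES; } // tag slot
static int fx2(int idx) { return fx1(idx) * ELEMENTS; }            // line base
static int fx (int idx) { return fx2(idx) + (idx % ELEMENTS); }    // element

// Runs an index trace through the cache; returns the number of DDR
// fetches (misses). out_vals receives the gathered values.
int run_trace(const float *sparse_var, const int *cont_idx, int n,
              float *out_vals) {
    for (int i = 0; i < CACHE_LINES; i++) sparse_var_tag[i] = -1; // invalidate
    int misses = 0;
    for (int i = 0; i < n; i++) {
        int idx  = cont_idx[i];
        int line = idx - (idx % ELEMENTS);       // aligned line address
        if (sparse_var_tag[fx1(idx)] != line) {  // tag read (dependence!)
            std::memcpy(sparse_var_cache + fx2(idx), sparse_var + line,
                        ELEMENTS * sizeof(float));
            sparse_var_tag[fx1(idx)] = line;     // tag write (dependence!)
            misses++;
        }
        out_vals[i] = sparse_var_cache[fx(idx)];
    }
    return misses;
}

// Driver: 8 accesses touching 3 distinct lines -> 3 misses.
int demo_misses() {
    float sparse_var[64];
    for (int i = 0; i < 64; i++) sparse_var[i] = (float)i;
    int cont_idx[8] = {3, 3, 7, 3, 7, 42, 3, 7};
    float out_vals[8];
    int m = run_trace(sparse_var, cont_idx, 8, out_vals);
    for (int i = 0; i < 8; i++)                 // check gathered values
        if (out_vals[i] != (float)cont_idx[i]) return -1;
    return m;
}
```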

It looks like the only way to reduce the II would be to reduce the amount of data moved on each miss, and even then I don't know whether the II could be lowered enough to beat the initial version.

The ideal scenario here would be a pipelined version that stalls the operations behind the current one when it gets a cache miss, and resumes them once the line has been loaded.

I suppose that, in the original version (pipelined, without cache), Vivado cannot guarantee the time needed to load data from DDR (or at least not exactly), so there should already be some kind of stall mechanism applied in that case.

I was wondering whether applying dataflow/streams would make sense here to achieve that effect. Perhaps splitting the memory load and the computation into different functions, enabling parallelism between them with dataflow, and using streams for the communication?

 

void check_cache(double *sparse_values, int *cont_idx, float *sparse_var) {
  // "cache" structures
  float sparse_var_cache[CACHE_SIZE_ELEMS];
  int   sparse_var_tag[CACHE_SIZE_TAGS];

  for(int i=0;i<LENGTH;i++) {
    int idx = cont_idx[i];
    int tag = sparse_var_tag[fx1(idx)];
    if (tag != idx) {
      // miss: fetch a line of ELEMENTS values instead of one
      memcpy(sparse_var_cache + fx2(idx), sparse_var + idx, ELEMENTS * sizeof(float));
      sparse_var_tag[fx1(idx)] = idx;
    }
    sparse_values[i] = sparse_var_cache[fx(idx)];
  }
}

void do_product(double *sparse_values, float *cont_var, float *out) {
  for(int i=0;i<LENGTH;i++) {
#pragma HLS pipeline II=1
    out[i] = sparse_values[i] * cont_var[i];
  }
}

void top_function(...) {
  ...
  float cont_var[LENGTH];
  float out[LENGTH];
  int   cont_idx[LENGTH];
  float *sparse_var;

  double sparse_values[LENGTH];
#pragma HLS dataflow
#pragma HLS stream variable=sparse_values depth=5

  check_cache(sparse_values, cont_idx, sparse_var);
  do_product(sparse_values, cont_var, out);
}
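Functionally, the producer/consumer split can be checked on the host with a plain FIFO standing in for the stream; in Vivado/Vitis HLS the FIFO below would be an hls::stream between the two DATAFLOW functions, and the consumer's blocking read is what provides the stall-on-miss behavior. All names and sizes here are invented for the sketch, and the cache logic is elided:

```cpp
#include <queue>

// Host model of the split: check_cache_model() pushes gathered sparse
// values into a FIFO, do_product_model() pops them. Under DATAFLOW the
// two loops run concurrently, and a stream read blocks (stalls the
// consumer) whenever the producer is still servicing a miss.
const int kLenStream = 4;

void check_cache_model(const float *sparse_var, const int *cont_idx,
                       std::queue<float> &fifo) {
    for (int i = 0; i < kLenStream; i++)
        fifo.push(sparse_var[cont_idx[i]]);   // gather (cache elided here)
}

void do_product_model(std::queue<float> &fifo, const float *cont_var,
                      float *out) {
    for (int i = 0; i < kLenStream; i++) {    // the II=1 loop in HLS
        out[i] = fifo.front() * cont_var[i];  // blocking read in hardware
        fifo.pop();
    }
}

// Driver: 10*2 + 10*3 + 50*1 + 20*0.5 = 110.
float demo_stream() {
    float sparse_var[8] = {0, 10, 20, 30, 40, 50, 60, 70};
    int   cont_idx[kLenStream] = {1, 1, 5, 2};
    float cont_var[kLenStream] = {2, 3, 1, 0.5f};
    float out[kLenStream];
    std::queue<float> fifo;
    check_cache_model(sparse_var, cont_idx, fifo);
    do_product_model(fifo, cont_var, out);
    return out[0] + out[1] + out[2] + out[3];
}
```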

 

Would that make sense? My understanding is that it would allow the pipelined loop in do_product to run at II=1 *but* stall when the cache operations take longer to feed the stream FIFO. However, I'm not really sure this brings any benefit, since it looks like the check_cache function would still be problematic with regard to loop pipelining and other HLS optimizations.

Or, perhaps, there's no way to implement this in HLS?

Any hints will be welcome.

Best,

Oli

 
