agailey
Explorer
891 Views
Registered: 02-08-2018

Xilinx SDK application code has a very long execution time on MicroBlaze


I am launching a MicroBlaze application onto a VC707 FPGA development board; the application implements a machine vision algorithm on image data. One part of the algorithm goes through each pixel of an image and groups the pixels into clusters. This part of the code, shown below, does some dynamic memory allocation, which is why I have it on the MicroBlaze instead of in the FPGA fabric.

The problem is that the execution time for this nested loop is long (127 seconds), while on a regular CPU the execution time is only around 2.57 milliseconds. I know that a regular CPU has a higher clock frequency than the 100 MHz of my MicroBlaze, but that alone does not seem sufficient to explain the several-orders-of-magnitude difference in execution time between a regular CPU and the MicroBlaze C application. Any ideas?

// Clustering
struct cluster *cluster_list[SIZE_ARR];
int num_clusters = 0, num_points_added = 0;
int dx[4] = {1, 0, -1, 1}, dy[4] = {0, 1, 1, 1}; // neighbors
int x, y;
xil_printf("Starting ...\n");
for (y = 1; y < HEIGHT-1; ++y)
{
    for (x = 1; x < WIDTH-1; ++x)
    {
        u8 v0 = threshim.buf[y*threshim.stride + x];
        if (v0 == 127)
            continue;
        u64 rep0 = uf.parent[y*WIDTH+x];

        int nbr;
        for (nbr = 0; nbr < 4; nbr++)
        {
            u8 v1 = threshim.buf[y*threshim.stride + dy[nbr]*threshim.stride + x + dx[nbr]];
            u64 cluster_id1;
            if (v0 + v1 == 255)
            {
                num_points_added++;
                u64 rep1 = uf.parent[y*WIDTH + dy[nbr]*WIDTH + x + dx[nbr]];
                if (rep0 < rep1)
                    cluster_id1 = (rep1 << 32) + rep0;
                else
                    cluster_id1 = (rep0 << 32) + rep1;
                // Add the point to a cluster with cluster_id1 or make new cluster
                int c = 0;
                for (c = 0; c < num_clusters; ++c)
                {
                    if (cluster_id1 == cluster_list[c]->cluster_id)
                        break;
                }

                struct pt p;
                p.x = 2*x + dx[nbr];
                p.y = 2*y + dy[nbr];
                p.gx = dx[nbr]*((int) v1-v0);
                p.gy = dy[nbr]*((int) v1-v0);

                if (c == num_clusters) // add new cluster
                {
                    num_clusters++;
                    cluster_list[c] = (struct cluster *) malloc(sizeof(struct cluster));
                    cluster_list[c]->cluster_id = cluster_id1;
                    cluster_list[c]->num_points = 1;
                    cluster_list[c]->points = p;
                    cluster_list[c]->last_point = &cluster_list[c]->points;
                }
                else
                {
                    cluster_list[c]->last_point->next = (struct pt*) malloc(sizeof(struct pt));
                    memcpy(cluster_list[c]->last_point->next, &p, sizeof(struct pt));
                    cluster_list[c]->num_points++;
                    cluster_list[c]->last_point = cluster_list[c]->last_point->next;
                }
            }
        }
    }
}
xil_printf("Added %d points to %d clusters\n", num_points_added, num_clusters);

7 Replies
yannickl
Xilinx Employee
867 Views
Registered: 11-03-2016

Clock frequency is not the only difference between a MicroBlaze and a recent processor: caches, DMA, multiple ALUs, branch prediction and/or speculation, and many more. Very likely your main memory is also DDR sitting behind a slower bus. There are a few ways forward for improving the performance of your design, but I'll limit this post to answering your question.
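
To put rough numbers on that, using the figures from the original post (and assuming something like a 3 GHz clock for the desktop CPU, which is a guess):

    MicroBlaze:  127 s   x 100 MHz ~= 1.27e10 cycles
    Desktop CPU: 2.57 ms x 3 GHz   ~=  7.7e6  cycles

That is a ratio of roughly 1600x in cycle count against only about 30x in clock frequency, so most of the gap has to come from work done per cycle: caches, the memory system, multiple issue, branch prediction, and so on.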

860 Views
Registered: 07-30-2019

I'm new to embedded systems and need help with this.

yannickl
Xilinx Employee
843 Views
Registered: 11-03-2016

Well obviously, a good start would be revisiting your design. You said that this part of the code is on the MicroBlaze because it does dynamic memory allocation. Can you remove that dynamic memory allocation? Are there other algorithms implemented in the FPGA fabric? How does your image buffer get filled? What is the scale of ARRAY_SIZE? I can't do your homework, but we can debate design decisions.

agailey
Explorer
816 Views
Registered: 02-08-2018

@yannickl I did try removing the dynamic memory allocation, but to do that, this part of the algorithm would need to be rewritten, and the rewritten version requires more computation. The reason this part of the algorithm depends on dynamic memory allocation is that the pixels of the image are being grouped into clusters, and we do not know in advance whether we will have a large number of small clusters or a small number of large clusters. If we assume the maximum number of clusters and the maximum size per cluster, then the memory usage is huge. Furthermore, a single pixel can belong to more than one cluster.

The array size is 240 * 320 pixels, or 76,800. I load the pixel values of the sample image manually, one by one, because I have not found a way to load the image data automatically from a file or a camera feed; also, with this sample image frame I already know exactly what the correct outputs are.

This nested loop is only a part of the machine vision algorithm that I have on the FPGA.  Much of the rest of it is on the hardware fabric rather than the MicroBlaze.

dgisselq
Scholar
800 Views
Registered: 05-21-2015

Let me just ask the easy and simple question: Have you turned on compiler optimizations?  -O3 perhaps?

A more difficult question might be whether or not you are doing anything to make certain your various clusters are located in memory together.  That is, does the data cache help or hurt?

Dan

agailey
Explorer
782 Views
Registered: 02-08-2018

@dgisselq Yes, great questions. I checked the compiler settings and noticed that no compiler optimizations were turned on. After trying -O3, I found that the execution time for this nested loop dropped from 127 seconds to about 35 seconds. A significant improvement, but still slow.

My clustering code uses arrays of pointers, so I do not think that the clusters (or even points within the same cluster) are together in memory. I did not think this would have a detrimental effect on execution time; if anything, I thought memory would be used more efficiently with the clusters divided into smaller segments rather than placed in one large segment. Do you suppose that because the clusters are not all together in memory, the data access time penalty is greater?

Thanks
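
(As an aside on that last question: one way to get both fewer malloc calls and better locality, without giving up the dynamic behaviour, is to hand out points from a growable pool instead of calling malloc once per point. A minimal sketch, reusing struct pt from the code in the first post; the pool type, function name and block size below are made up for illustration:)

#include <stdlib.h>   /* malloc, NULL */

#define POOL_BLOCK 4096   /* points per allocation; illustrative value */

struct pt_pool {
    struct pt *block;   /* current block of POOL_BLOCK points */
    int        used;    /* points already handed out from it  */
};

static struct pt *pool_alloc_pt(struct pt_pool *pool)
{
    if (pool->block == NULL || pool->used == POOL_BLOCK) {
        /* Previous blocks are intentionally not tracked here; a real version
         * would keep a list of them so they can be freed later. */
        pool->block = (struct pt *) malloc(POOL_BLOCK * sizeof(struct pt));
        pool->used = 0;
        if (pool->block == NULL)
            return NULL;    /* out of memory */
    }
    return &pool->block[pool->used++];
}

The clusters can still be linked lists of points exactly as before, but consecutive points land next to each other in memory and malloc is only hit once every POOL_BLOCK points, which addresses both the allocator overhead and the locality question asked above.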

dgisselq
Scholar
765 Views
Registered: 05-21-2015

@agailey,

I'd really need to dig to know.

I did some digging a while back into why "blinky" seems so slow on a MicroBlaze or, for that matter, any other CPU. Along the way, I learned some things:

  1. As fast as I could make a data cache, access still took 2 cycles per operation to access something in the cache. It also took at least one cycle to determine whether a bus operation needed to start--and that would happen on the next cycle after that. For the cache misses, and you are probably dealing with a lot of them, it took not only the two access clocks but also the clocks necessary to load a full cache line--often 28 cycles per miss in my implementations. (A wider bus might well have been faster.)
  2. As fast as I could build a crossbar interconnect, it took 2 cycles on the way in: one to register the address, one to get the grant. The AXI crossbars took two cycles on the way back: one to come from the slave to the return channel, and a second in the return channel's skid buffer.
  3. When building an SDRAM core, the only way I could ever optimize the core was for sequential access.  If multiple sequential accesses were sent to the core, I could read from one bank while activating the next, allowing uninterrupted sequential access--at least until the next REFRESH.  Still, doing this required several cycles of delay that had to be filled by the CPU.
  4. In my own SDRAM cores, unoptimized, accesses required one cycle to ACTIVATE the memory, another cycle to issue the read command, another two cycles to wait for the data, and a third (or fourth) cycle to process the return.  Even before those cycles, it took me at least two clocks to process something coming in from the bus.
  5. Xilinx's SDRAM core requires (roughly) 20 clock cycles to access any RAM.

I guess my point is, there are a lot of variables and ... you may find that the memory bus is more of a bottleneck than you are aware of.
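
A rough sanity check with the numbers from this thread (treat it as an estimate only):

    35 s x 100 MHz            ~= 3.5e9 cycles (with -O3)
    (240-2) x (320-2) pixels  ~= 75,700 pixels visited
    3.5e9 / 75,700            ~= 46,000 cycles per pixel on average

At the 20-30 cycles per external memory access suggested by the points above, that budget corresponds to something like 1,500-2,000 memory transactions per pixel, which is not hard to reach if the instruction fetches, the image and union-find reads, the cluster-list search and the malloc calls are all going out to DDR with little or no caching. So the numbers are at least consistent with the memory bus being the main bottleneck.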

You might also find that Zynq handles its memory accesses much better--at least I'd expect it to; I haven't measured it.

Dan

(This reply was marked as the accepted solution.)
