cancel
Showing results for 
Show  only  | Search instead for 
Did you mean: 
engerd
Visitor
Visitor
174 Views
Registered: ‎01-20-2021

Latency values which are given by Vitis HLS and calculated in real time are incosistent

I used vector adding example as shown in Vitis Flow 101 – Part 3 : Meet the Vector-Add Example:

 

extern "C" {
  void vadd(
    const unsigned int *in1, // Read-Only Vector 1
    const unsigned int *in2, // Read-Only Vector 2
    unsigned int *out,       // Output Result
    int size                 // Size in integer
  )
  {

#pragma HLS INTERFACE m_axi port=in1 bundle=aximm1
#pragma HLS INTERFACE m_axi port=in2 bundle=aximm2
#pragma HLS INTERFACE m_axi port=out bundle=aximm1

    for(int i = 0; i < size; ++i)
    {
      out[i] = in1[i] + in2[i];
    }
  }
}

 

With size is hardcoded as 4096, Vitis HLS given me following synthesis results:

vadd.png

It is clearly seen here that latency is calculated as 212 us

But when I run the kernel on the hardware using the following host program (I only added q.finish() for synchronization and used std::chrono library for time measuring purposes)

/**********
Copyright (c) 2018, Xilinx, Inc.
All rights reserved.
Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice,
this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
3. Neither the name of the copyright holder nor the names of its contributors
may be used to endorse or promote products derived from this software
without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE,
EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
**********/

#define CL_HPP_CL_1_2_DEFAULT_BUILD
#define CL_HPP_TARGET_OPENCL_VERSION 120
#define CL_HPP_MINIMUM_OPENCL_VERSION 120
#define CL_HPP_ENABLE_PROGRAM_CONSTRUCTION_FROM_ARRAY_COMPATIBILITY 1
#define CL_USE_DEPRECATED_OPENCL_1_2_APIS

#define DATA_SIZE 4096

#include <vector>
#include <unistd.h>
#include <iostream>
#include <fstream>
#include <CL/cl2.hpp>

// Forward declaration of utility functions included at the end of this file
std::vector<cl::Device> get_xilinx_devices();
char* read_binary_file(const std::string &xclbin_file_name, unsigned &nb);

// ------------------------------------------------------------------------------------
// Main program
// ------------------------------------------------------------------------------------
int main(int argc, char** argv)
{
// ------------------------------------------------------------------------------------
// Step 1: Initialize the OpenCL environment 
// ------------------------------------------------------------------------------------ 
    cl_int err;
    std::string binaryFile = (argc != 2) ? "vadd.xclbin" : argv[1];
    unsigned fileBufSize;    
    std::vector<cl::Device> devices = get_xilinx_devices();
    devices.resize(1);
    cl::Device device = devices[0];
    cl::Context context(device, NULL, NULL, NULL, &err);
    char* fileBuf = read_binary_file(binaryFile, fileBufSize);
    cl::Program::Binaries bins{{fileBuf, fileBufSize}};
    cl::Program program(context, devices, bins, NULL, &err);
    cl::CommandQueue q(context, device, CL_QUEUE_PROFILING_ENABLE, &err);
    cl::Kernel krnl_vector_add(program,"vadd", &err);
    
    // Variables for measuring runtime
    std::chrono::high_resolution_clock::time_point buffer_start;                    // Süre ölçmek için gerekli değişkenler tanımlanıyor
    std::chrono::high_resolution_clock::time_point buffer_finish;
    
    std::chrono::high_resolution_clock::time_point write_start;
    std::chrono::high_resolution_clock::time_point write_finish;
    
    std::chrono::high_resolution_clock::time_point arg_start;
    std::chrono::high_resolution_clock::time_point arg_finish;

    std::chrono::high_resolution_clock::time_point arg_start_2;
    std::chrono::high_resolution_clock::time_point arg_finish_2;

    std::chrono::high_resolution_clock::time_point map_start;
    std::chrono::high_resolution_clock::time_point map_finish;

    std::chrono::high_resolution_clock::time_point kernel_start;
    std::chrono::high_resolution_clock::time_point kernel_finish;
    
    std::chrono::high_resolution_clock::time_point read_start;
    std::chrono::high_resolution_clock::time_point read_finish;
    
    std::chrono::high_resolution_clock::time_point opencl_start;
    std::chrono::high_resolution_clock::time_point opencl_finish;

    //std::cout << "Initialization is done...\n";

// ------------------------------------------------------------------------------------
// Step 2: Create buffers and initialize test values
// ------------------------------------------------------------------------------------
    // Create the buffers and allocate memory 

    buffer_start = std::chrono::high_resolution_clock::now();

    cl::Buffer in1_buf(context, CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY,  sizeof(int) * DATA_SIZE, NULL, &err);
    cl::Buffer in2_buf(context, CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY,  sizeof(int) * DATA_SIZE, NULL, &err);
    cl::Buffer out_buf(context, CL_MEM_ALLOC_HOST_PTR | CL_MEM_WRITE_ONLY, sizeof(int) * DATA_SIZE, NULL, &err);

    q.finish();

    buffer_finish = std::chrono::high_resolution_clock::now();

    //std::cout << "Buffers are created...\n";

    // Map buffers to kernel arguments, thereby assigning them to specific device memory banks

    arg_start = std::chrono::high_resolution_clock::now();

    krnl_vector_add.setArg(0, in1_buf);
    krnl_vector_add.setArg(1, in2_buf);
    krnl_vector_add.setArg(2, out_buf);

    q.finish();

    arg_finish = std::chrono::high_resolution_clock::now();   

    //std::cout << "Arguments are mapped for the first time...\n";
   
    // Map host-side buffer memory to user-space pointers

    map_start = std::chrono::high_resolution_clock::now();

    int *in1 = (int *)q.enqueueMapBuffer(in1_buf, CL_TRUE, CL_MAP_WRITE, 0, sizeof(int) * DATA_SIZE);
    int *in2 = (int *)q.enqueueMapBuffer(in2_buf, CL_TRUE, CL_MAP_WRITE, 0, sizeof(int) * DATA_SIZE); 
    int *out = (int *)q.enqueueMapBuffer(out_buf, CL_TRUE, CL_MAP_WRITE | CL_MAP_READ, 0, sizeof(int) * DATA_SIZE);

    q.finish();

    //std::cout << "Buffers are mapped...\n";

    map_finish = std::chrono::high_resolution_clock::now();   

    // Initialize the vectors used in the test
    for(int i = 0 ; i < DATA_SIZE ; i++){
        in1[i] = rand() % DATA_SIZE;
        in2[i] = rand() % DATA_SIZE;
        out[i] = 0; 
    }

    //std::cout << "Vectors are initialized...\n";

// ------------------------------------------------------------------------------------
// Step 3: Run the kernel
// ------------------------------------------------------------------------------------
    // Set kernel arguments

    arg_start_2 = std::chrono::high_resolution_clock::now();

    krnl_vector_add.setArg(0, in1_buf);
    krnl_vector_add.setArg(1, in2_buf);
    krnl_vector_add.setArg(2, out_buf);

    q.finish();

    arg_finish_2 = std::chrono::high_resolution_clock::now();  

    //std::cout << "Arguments are mapped for the second time...\n" ;

    //krnl_vector_add.setArg(3, DATA_SIZE);

    // Schedule transfer of inputs to device memory, execution of kernel, and transfer of outputs back to host memory
    
    // Write Data
    write_start = std::chrono::high_resolution_clock::now();
    opencl_start = std::chrono::high_resolution_clock::now();
    
    q.enqueueMigrateMemObjects({in1_buf, in2_buf}, 0 /* 0 means from host*/); 

    q.finish();

    write_finish = std::chrono::high_resolution_clock::now();

    //std::cout << "Data is written\n";
    //std::cout << "Starting kernel\n";
    
    // Run Kernel

    kernel_start = std::chrono::high_resolution_clock::now();

    q.enqueueTask(krnl_vector_add);

    q.finish();

    kernel_finish = std::chrono::high_resolution_clock::now();

    //std::cout << "Kernel is done. Reading data...\n";

    // Read Data

    read_start = std::chrono::high_resolution_clock::now();

    q.enqueueMigrateMemObjects({out_buf}, CL_MIGRATE_MEM_OBJECT_HOST);

    q.finish();

    read_finish = std::chrono::high_resolution_clock::now();
    opencl_finish = std::chrono::high_resolution_clock::now();

    //std::cout << "Finished reading data\n";

    // Wait for all scheduled operations to finish
    //q.finish();
    
    // Write timing results to the console
    std::chrono::duration<double, std::micro> buffer_time;
    std::chrono::duration<double, std::micro> write_time;
    std::chrono::duration<double, std::micro> arg_time;
    std::chrono::duration<double, std::micro> arg_time_2;
    std::chrono::duration<double, std::micro> map_time;
    std::chrono::duration<double, std::micro> kernel_time;
    std::chrono::duration<double, std::micro> read_time;
    std::chrono::duration<double, std::micro> opencl_time;
    
    buffer_time = buffer_finish  - buffer_start;
    write_time  = write_finish   - write_start;
    arg_time    = arg_finish     - arg_start;
    arg_time_2  = arg_finish_2   - arg_start_2;
    map_time    = map_finish     - map_start;
    kernel_time = kernel_finish  - kernel_start;
    read_time   = read_finish    - read_start;
    opencl_time  = opencl_finish - opencl_start;
    
    std::cout << "--- Timing Results are ---\n";
    std::cout << "Buffer time is                    :" << buffer_time.count()   << "\n";     
    std::cout << "Write time is                     :" << write_time.count()    << "\n";     
    std::cout << "First argument mapping time is    :" << arg_time.count()      << "\n";     
    std::cout << "Second argument mapping time is   :" << arg_time_2.count()    << "\n";     
    std::cout << "Mapping time is                   :" << map_time.count()      << "\n";     
    std::cout << "Kernel running time is            :" << kernel_time.count()   << "\n";     
    std::cout << "Read time is                      :" << read_time.count()     << "\n";     
    std::cout << "Total time is                     :" << opencl_time.count()   << "\n";


// ------------------------------------------------------------------------------------
// Step 4: Check Results and Release Allocated Resources
// ------------------------------------------------------------------------------------
    bool match = true;
    for (int i = 0 ; i < DATA_SIZE ; i++){
        int expected = in1[i]+in2[i];
        if (out[i] != expected){
            std::cout << "Error: Result mismatch" << std::endl;
            std::cout << "i = " << i << " CPU result = " << expected << " Device result = " << out[i] << std::endl;
            match = false;
            break;
        }
    }

    delete[] fileBuf;

    std::cout << "TEST " << (match ? "PASSED" : "FAILED") << std::endl; 
    return (match ? EXIT_SUCCESS : EXIT_FAILURE);
}



// ------------------------------------------------------------------------------------
// Utility functions
// ------------------------------------------------------------------------------------
std::vector<cl::Device> get_xilinx_devices() 
{
    size_t i;
    cl_int err;
    std::vector<cl::Platform> platforms;
    err = cl::Platform::get(&platforms);
    cl::Platform platform;
    for (i  = 0 ; i < platforms.size(); i++){
        platform = platforms[i];
        std::string platformName = platform.getInfo<CL_PLATFORM_NAME>(&err);
        if (platformName == "Xilinx"){
            std::cout << "INFO: Found Xilinx Platform" << std::endl;
            break;
        }
    }
    if (i == platforms.size()) {
        std::cout << "ERROR: Failed to find Xilinx platform" << std::endl;
        exit(EXIT_FAILURE);
    }
   
    //Getting ACCELERATOR Devices and selecting 1st such device 
    std::vector<cl::Device> devices;
    err = platform.getDevices(CL_DEVICE_TYPE_ACCELERATOR, &devices);
    return devices;
}
   
char* read_binary_file(const std::string &xclbin_file_name, unsigned &nb) 
{
    if(access(xclbin_file_name.c_str(), R_OK) != 0) {
        printf("ERROR: %s xclbin not available please build\n", xclbin_file_name.c_str());
        exit(EXIT_FAILURE);
    }
    //Loading XCL Bin into char buffer 
    std::cout << "INFO: Loading '" << xclbin_file_name << "'\n";
    std::ifstream bin_file(xclbin_file_name.c_str(), std::ifstream::binary);
    bin_file.seekg (0, bin_file.end);
    nb = bin_file.tellg();
    bin_file.seekg (0, bin_file.beg);
    char *buf = new char [nb];
    bin_file.read(buf, nb);
    return buf;
}

I get the following results:

Buffer time is                                    :525.11
Write time is                                      :670.05
First argument mapping time is        :185.75
Second argument mapping time is   :191.31
Mapping time is                                 :650.77
Kernel running time is                       :604.62
Read time is                                     :429.46
Total time is                                       :1704.36

There is a clear discrepancy between latency values given by Vitis HLS and real runtime values! Real latency is %196 longer than what is shown by Vitis HLS.
 
What is the reason for this? Why this overhead occurs and what can I do to eliminate this.

Also, Write Time and Read Time (Time spent transferring data between host buffer and kernel buffer) seemed too much to me. Is there a more efficient way to transfer data between buffers.
 
Thanks in advance.
0 Kudos
0 Replies