904 Views
Registered: ‎04-04-2019

Large Accuracy Loss on ZCU102 deploying Custom Network


Hello,

I have a custom network that I have deployed on the ZCU102. It passes quantization and compilation without any errors or warnings. The TensorFlow graph has 80% accuracy before quantization, which drops to 70% after quantization. However, when testing on the ZCU102 board, the accuracy decreases to 25%.

Does anyone have any idea why this may be happening? The network has 3 separate inputs, which get convolved and then concatenated before going through some fully connected layers at the output. I'm not sure if this accuracy loss has to do with the multiple network inputs. Although multiple inputs are supported, there are no Xilinx examples of such networks, and it's briefly mentioned on another forum post that there has not been much testing on this. I'm also pretty sure that I'm using the DNNDK API correctly, but just in case, my code looks like this:

"""Attach to DPU driver and prepare for runing"""
n2cube.dpuOpen()

"""Create DPU Kernels"""
kernel = n2cube.dpuLoadKernel(KERNEL_CONV)

"""Create DPU Tasks from DPU Kernel"""
task = n2cube.dpuCreateTask(kernel, 0)
input_len = n2cube.dpuGetInputTensorSize(task, KERNEL_CONV_INPUT0)

"""Get the output tensor size from FC output"""
size = n2cube.dpuGetOutputTensorSize(task, KERNEL_FC_OUTPUT)
    
"""Get the output tensor channel from FC output"""
channel = n2cube.dpuGetOutputTensorChannel(task, KERNEL_FC_OUTPUT)

"""Get output scale of FC"""
outputScale = n2cube.dpuGetOutputTensorScale(task, KERNEL_FC_OUTPUT)

"""Load image to DPU"""
n2cube.dpuSetInputTensorInHWCFP32(task,KERNEL_CONV_INPUT0,img0,input_len)
n2cube.dpuSetInputTensorInHWCFP32(task,KERNEL_CONV_INPUT1,img1,input_len)
n2cube.dpuSetInputTensorInHWCFP32(task,KERNEL_CONV_INPUT2,img2,input_len)

"""Model run on DPU"""
n2cube.dpuRunTask(task)
softmax = np.zeros(size,dtype=float32)
        
"""Get FC result"""
conf = n2cube.dpuGetOutputTensorAddress(task, KERNEL_FC_OUTPUT)
                
"""Run softmax"""
softmax = n2cube.dpuRunSoftmax(conf, channel, size // channel, outputScale)
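
For completeness, the DNNDK samples also release the DPU resources once inference is finished; a minimal teardown sketch using the same handles (not shown in my snippet above):

"""Release DPU resources, as the DNNDK samples do"""
n2cube.dpuDestroyTask(task)      # free the task created with dpuCreateTask
n2cube.dpuDestroyKernel(kernel)  # release the kernel loaded with dpuLoadKernel
n2cube.dpuClose()                # detach from the DPU driver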

 

This code is mainly borrowed from the inception_v1_mt.py file from the Vitis AI DNNDK examples. Any guidance would be greatly appreciated, thank you!

9 Replies
jasonwu
Moderator
866 Views
Registered: ‎03-27-2013

Hi lgarrido@harris.com ,

 

Yes, the multi-input solution is not well tested.

From your code I can see you are using dpuSetInputTensorInHWCFP32.

But in https://github.com/Xilinx/Vitis-AI/blob/master/mpsoc/vitis_ai_dnndk_samples/inception_v1_mt_py/inception_v1_mt.py

dpuSetInputImage2 is used. Could you show me where the reference for that is?

And you are using the same input_len for all 3 input nodes; are they the same size?

Best Regards,
Jason
839 Views
Registered: ‎04-04-2019

Yes, I'm using dpuSetInputTensorInHWCFP32 because I thought I read somewhere that dpuSetInputImage2 wouldn't work. However, I just tested it and it does work, although it doesn't change the output values of the network.

Yes, all three inputs take images that are the same size.

Is it possible that I'm not reading the values correctly from the output? The DNNDK manual, pg. 43, shows something like:

dpuSetInputImage2(taskConv, CONV_INPUT_NODE, image);
dpuRunTask(taskConv);
/* Get FC result and convert from INT8 to FP32 format */
dpuGetOutputTensorInHWCFP32(taskConv, FC_OUTPUT_NODE, FCResult, channel);
CPUCalcSoftmax(FCResult, channel, softmax);

Whereas the code I borrowed from calls dpuGetOutputTensorAddress() instead. Could my issue be there?
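
In case it helps to rule that path out, here is a rough sketch of reading the FC output as FP32 and doing softmax on the CPU, instead of taking the raw tensor address, assuming the Python binding of dpuGetOutputTensorInHWCFP32 behaves like it does in the other DNNDK Python samples:

"""Alternative read path: copy the FC output as FP32, then softmax on the CPU"""
size = n2cube.dpuGetOutputTensorSize(task, KERNEL_FC_OUTPUT)
fc_result = np.array(n2cube.dpuGetOutputTensorInHWCFP32(task, KERNEL_FC_OUTPUT, size), dtype=np.float32)

"""Numerically stable softmax over the class scores"""
exp_scores = np.exp(fc_result - np.max(fc_result))
softmax = exp_scores / np.sum(exp_scores)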

Also, it does seem to recognize some classes pretty well, with 50% accuracy, but in general it thinks most classes look like one specific class. Attached is the confusion matrix of the FPGA output; theoretically it should show ~70% accuracy, mainly along the diagonal.

 

test_fpga_conf_mat.png
jheaton
Xilinx Employee
800 Views
Registered: ‎03-21-2008

Do you have any average pooling layers with non-zero padding? 

797 Views
Registered: ‎04-04-2019

No average pooling. Just using Conv2D with dilation and some with strides, in addition to some Concat, FC, and Flatten layers.

jasonwu
Moderator
756 Views
Registered: ‎03-27-2013

Hi lgarrido@harris.com ,

 

What did you do with img0, img1, and img2 before feeding them to the dpuSetInputTensorInHWCFP32() function?

Since the reference code you point out does not use this API, I would suggest you refer to this example instead: https://github.com/Xilinx/Vitis-AI/blob/master/mpsoc/vitis_ai_dnndk_samples/mini_resnet_py/mini_resnet.py

And did you use TensorFlow to train the model?

If so, you need to double-check the pre-processing both at quantization and at deployment.

The accuracy drop at quantization is similar to the situation where a BGR/RGB data format mismatch occurs, which is covered here:

https://www.xilinx.com/html_docs/vitis_ai/1_1/lnk1576063785776.html

Best Regards,
Jason
743 Views
Registered: ‎04-04-2019

Thanks for the response,

I don't do any preprocessing to the images except casting them from 64-bit floats to 32-bit floats. They are already numpy arrays at this point, and I do this to make sure they comply with the dpuSetInputTensorInHWCFP32() function.

We train in Keras and do preprocessing before training. When running training, I save a subset of the validation images right before they are passed to the network. That is, three sets of 128x128x1 validation images, which also get used for quantization in Vitis. All the preprocessing happens in Keras and doesn't need to be done again once the data is saved for further FPGA/VAI evaluation. The advantage of this is that the exact same images are fed to the different networks, giving us a 1:1 comparison.

Since our network takes images that have only 1 channel, I'm not sure if that BGR/RGB example still applies.

The only difference between my code and the mini ResNet example, as far as I can tell, is that they seem to flatten the input image before setting it on the tensor. I tried that and it did not make a difference.
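
For reference, this is roughly what I tried, a sketch assuming img0 is one of the saved 128x128x1 numpy arrays:

"""Flatten the HWC array before handing it to the DPU, as mini_resnet.py does"""
img0_run = img0.astype(np.float32).reshape(-1)  # 128*128*1 = 16384 values
n2cube.dpuSetInputTensorInHWCFP32(task, KERNEL_CONV_INPUT0, img0_run, input_len)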

I know that there are some debugging tools for evaluating the DPU model, but I wanted to check if anyone had a similar problem and solution before going down that route. Maybe that's the best way forward? I appreciate your help so far.

jasonwu
Moderator
712 Views
Registered: ‎03-27-2013

Hi lgarrido@harris.com ,

 

The pre-processing should be exactly the same during training/quantization/deployment.

These steps are particular to each network, and most accuracy problems are caused by a mismatch in the pre-processing steps.

Here are 2 different examples I have written, one for a Numpy array and one for TF.data.

You can see that the pre-processing is a little different in each:

https://github.com/gewuek/flower_classification_dnndk_v1

https://github.com/gewuek/flower_classification_dnndk_v2

Best Regards,
Jason
687 Views
Registered: ‎04-04-2019

The preprocessing is exactly the same because I'm saving the 64-bit precision float image matrices from Keras training to use again in quantization. This saves me the effort of rewriting functions and adding Python modules to the Vitis AI docker container, and also ensures consistency of the data.

The TF/Vitis quantization dataset is exactly the same as the Keras validation/test dataset. The only difference is during deployment, when I cast from 64-bit to 32-bit precision floats because the DNNDK dpuSetInputTensorInHWCFP32() function only supports 32-bit floats.
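
For reference, the vai_q_tensorflow calibration can hand back slices of those saved arrays directly; a minimal sketch of an input_fn, with hypothetical file names and input-node names:

import numpy as np

"""Calibration data saved from Keras training, already pre-processed
exactly as during training (hypothetical filenames)"""
calib0 = np.load("calib_input0.npy")  # shape (N, 128, 128, 1)
calib1 = np.load("calib_input1.npy")
calib2 = np.load("calib_input2.npy")

def input_fn(iter):
    """Return one calibration batch per call, keyed by the
    graph's input node names (placeholder names here)"""
    return {
        "input_0": calib0[iter:iter + 1],
        "input_1": calib1[iter:iter + 1],
        "input_2": calib2[iter:iter + 1],
    }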

I'm going to retrain the network with 32-bit precision images; this will ensure the exact same dataset all the way through, from Keras to TF to DPU deployment. If deploying a network trained on 32-bit image matrices doesn't work, then I'm guessing there is no way forward other than using the Vitis AI and DNNDK debugging tools to figure out where the accuracy loss comes from.

It would be nice if Xilinx could verify multi-input networks and confirm that they deploy on the DPU with the same accuracy as on the host machine. The Vitis tools allow networks with multiple inputs as an option, but not having any examples or research showing that this feature is fully supported and reliable makes it tough to move forward with the Vitis AI platform when using multiple-input CNN architectures. Thanks for all your advice so far; if I have any issues/breakthroughs I will report back.

 

529 Views
Registered: ‎04-04-2019

Turns out the issue was that the ZCU102 image we were using had the DPU Channel Augmentation feature enabled by default. This was conflicting with the dilated convolution layers we were using in our TensorFlow model.

The solution was to get a patched version of the Vitis AI compiler that applies channel augmentation only to the layers that don't use dilation.

A workaround is to disable the channel augmentation option in your Vivado design, but this could negatively affect your DPU performance.

