
Incorrect inference results from a minimal tensorflow model

Hi,

I have a minimal example of a trivial tensorflow (v1.4) conv net that I train (to overfitting) with only two examples, freeze, convert with mvNCCompile, and then test on a compute stick.

The code and steps are fully described in the GitHub repo movidius_minimal_example.
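
For context, the network and the freeze step are roughly along these lines (a sketch only; the filter count, shapes, and layer names are illustrative, the actual code is in the repo):

import tensorflow as tf
from tensorflow.python.framework import graph_util

# input/output named to match the -in imgs / -on output flags used with mvNCCompile and mvNCCheck
imgs = tf.placeholder(tf.float32, shape=(1, 64, 64, 3), name='imgs')
net = tf.layers.conv2d(imgs, filters=4, kernel_size=3, padding='same', activation=tf.nn.relu)
net = tf.reshape(net, (1, -1))
output = tf.nn.sigmoid(tf.layers.dense(net, 1), name='output')

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # ... overfit on the two training examples here ...
    frozen = graph_util.convert_variables_to_constants(sess, sess.graph_def, ['output'])
    with open('graph.frozen.pb', 'wb') as f:
        f.write(frozen.SerializeToString())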

No steps have warnings or errors; but the inference results I get on the stick are incorrect.
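
The on-stick side of the test is the standard NCSDK 1.x loop, roughly (again a sketch; file names are illustrative and the real script is in the repo):

import numpy as np
from mvnc import mvncapi as mvnc

device = mvnc.Device(mvnc.EnumerateDevices()[0])
device.OpenDevice()
with open('graph', 'rb') as f:                   # blob produced by mvNCCompile
    graph = device.AllocateGraph(f.read())

img = np.zeros((64, 64, 3), dtype=np.float16)    # stand-in for one of the two examples
graph.LoadTensor(img, 'example')
output, _user_object = graph.GetResult()
print(output)

graph.DeallocateGraph()
device.CloseDevice()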

What should be my next debugging step?

Thanks,
Mat

Note: mvNCCheck does fail, but I'm unsure whether that's because of the structure of my minimal example...

$ mvNCCheck graph.frozen.pb -in imgs -on output
mvNCCheck v02.00, Copyright @ Movidius Ltd 2016

/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py:766: DeprecationWarning: builtin type EagerTensor has no __module__ attribute
  EagerTensor = c_api.TFE_Py_InitEagerTensor(_EagerTensorBase)
/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/tf_inspect.py:45: DeprecationWarning: inspect.getargspec() is deprecated, use inspect.signature() instead
  if d.decorator_argspec is not None), _inspect.getargspec(target))
USB: Transferring Data...
USB: Myriad Execution Finished
USB: Myriad Connection Closing.
USB: Myriad Connection Closed.
Result:  (1, 1)
1) 0 0.46216
Expected:  (1, 1)
1) 0 0.79395
------------------------------------------------------------
 Obtained values 
------------------------------------------------------------
 Obtained Min Pixel Accuracy: 41.789668798446655% (max allowed=2%), Fail
 Obtained Average Pixel Accuracy: 41.789668798446655% (max allowed=1%), Fail
 Obtained Percentage of wrong values: 100.0% (max allowed=0%), Fail
 Obtained Pixel-wise L2 error: 41.789667896678964% (max allowed=1%), Fail
 Obtained Global Sum Difference: 0.331787109375
------------------------------------------------------------

Comments

  • Reducing the model by removing the convolutional layers makes it work. So I've got something to work on; it's something about the convolutions...

  • If I use tf.layers.conv2d(model, filters=5, kernel_size=3) or slim.conv2d(model, num_outputs=5, kernel_size=3) I get the same results.

    With padding=SAME they both give incorrect inference results,

    With padding=VALID they both throw an exception...

    Traceback (most recent call last):
      File "./test_inference_on_ncs.py", line 29, in <module>
        output, _user_object = graph.GetResult()
      File "/usr/local/lib/python3.5/dist-packages/mvnc/mvncapi.py", line 264, in GetResult
        raise Exception(Status(status))
    Exception: mvncStatus.MYRIAD_ERROR
    

    I can't see what's fundamentally different between my definition of the conv2d layer and the models defined in the slim model zoo ....
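
    For reference, the two layer definitions being compared are along these lines (model is the incoming feature map, slim is tf.contrib.slim):

    conv_a = tf.layers.conv2d(model, filters=5, kernel_size=3, padding='same')    # wrong results on the stick
    conv_b = slim.conv2d(model, num_outputs=5, kernel_size=3, padding='SAME')     # same wrong results
    # switching either to padding='valid' / 'VALID' raises the MYRIAD_ERROR above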

  • OK.... so a conv layer with >=8 outputs works (as mentioned in this post)

    I agree with this comment though; what is the approach we should use for a fully convolutional image-to-image architecture (e.g. U-Net, pix2pix) where we want the last output layer to be either 1 or 3 channels, representing either a black-and-white or RGB image?

  • The method I'm going to use is to just output 8 channels in the final layer and, for training (and inference), slice off the first channel for the loss calculation. It's unneeded work, but it will get me going.
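
    Roughly (a sketch; layer and tensor names are illustrative):

    # final layer padded out to 8 filters instead of 1 to satisfy the >=8 output constraint
    net = tf.layers.conv2d(net, filters=8, kernel_size=3, padding='same', name='final')
    # training: compute the loss against channel 0 only
    loss = tf.losses.mean_squared_error(labels, net[:, :, :, :1])
    # inference: take channel 0 of the 8-channel tensor that comes back from the stick
    # prediction = ncs_output[..., :1]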

  • @matpalm Can you post your frozen model so I can try to reproduce this issue on my bench?

  • Sure, I'll do that today. I have a bundle of related cases with repro steps; I'll fork the GitHub repo I mentioned above with test cases for each and check the models in too.

  • see this github repo for models / reproduction steps etc https://github.com/matpalm/movidius_bug_reports

    conv_with_8_filters works
    conv_with_6_filters (the same model, but with 6 filters) fails
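
    i.e. the only difference between the two test cases is the filter count of the conv layer (a sketch):

    net = tf.layers.conv2d(imgs, filters=8, kernel_size=3)   # conv_with_8_filters: works
    net = tf.layers.conv2d(imgs, filters=6, kernel_size=3)   # conv_with_6_filters: fails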

  • also added an example of deconv failing with padding='SAME' under deconv_padding_same
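
    The failing layer is essentially a transposed convolution with SAME padding, i.e. something like (a sketch; the exact code is in the repo):

    net = tf.layers.conv2d_transpose(net, filters=8, kernel_size=3, strides=2, padding='same')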

  • also added an example of output shape being wrong after conv -> deconv stack conv_deconv_output_shape_wrong
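
    i.e. a stack of this shape, where on the host the spatial size round-trips back to the input size but the shape reported back from the stick doesn't (a sketch; the exact code is in the repo):

    net = tf.layers.conv2d(imgs, filters=8, kernel_size=3, strides=2, padding='same')            # downsample
    net = tf.layers.conv2d_transpose(net, filters=8, kernel_size=3, strides=2, padding='same')   # upsample back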

  • At this point I can't get any workaround to work; all the combinations I can think of to hack my way around not being able to use num_channels=1 fail. So I'll put this project on hold and either check again next SDK release, or sooner if you have things you'd like me to test further...

  • @matpalm I'll be reviewing this issue today and I will get back to you if I need/find anything. Thanks.

  • Great, thanks! No rush from my end; I'm going to be away for a couple of weeks. I have other cases but didn't get time to make test cases for them; I suspect they might all be the same underlying problem...

  • @matpalm I have been able to reproduce your issues. At the moment, our SDK requires the number of outputs from convolution layers to be >= 8, as you've already seen. A possible workaround is to add a conf file to the current working directory and name it the same as your pb or meta file. For example, if your model's name is "model.meta" then you create a brand new file called "model.conf".

    In the conf file, add the convolution layer(s) (with an output that is less than 8) and in the line right below it, add the line "generic_spatial". This will choose a generic spatial convolution function. See the example below. Make sure to have an additional empty line as the last line in the conf file or else the SDK won't parse it correctly. Let me know if this helps. Thanks.

    conv1
    generic_spatial
    conv2
    generic_spatial
    conv3
    generic_spatial
    
  • Thanks @Tome_at_Intel .

    I got to try this today but it still doesn't work, sorry. The mode of failure is the same as far as I can see: the output from running the frozen network on the host differs from running it on the compute stick...

    e.g.

    host_positive_prediction (1,) [ 1.]
    host_negative_prediction (1,) [  1.38149336e-09]
    ncs_positive_prediction (1,) [ 0.99902344]
    ncs_negative_prediction (1,) [ 1.]
    

    Just to confirm I've done the config correctly though;

    When I include a conf file ....

    e1/Conv2D
    generic_spatial
    

    ... and run ./test.sh conv_with_6_filters, the mvNCCompile output includes

    Spec opt found opt_conv_generic_spatial  1<< 10
    Layer (a) e1/Conv2D use the optimisation mask which is:  0x400
    0 0x80000000
    Layer fully_connected/MatMul use the generic optimisations which is:  0x80000000
    0 0x80000000
    Layer output use the generic optimisations which is:  0x80000000
    

    ( whereas when I include an empty conf file I see just

    0 0x80000000
    Layer e1/Conv2D use the generic optimisations which is:  0x80000000
    0 0x80000000
    Layer fully_connected/MatMul use the generic optimisations which is:  0x80000000
    0 0x80000000
    Layer output use the generic optimisations which is:  0x80000000
    

    )

    So it appears to be picking up the config (?) but it still doesn't work?

  • @matpalm Thanks for reporting this issue. I went back and ran your ./test.sh script multiple times without the conf file. It seems that I am able to get a passing result sometimes from the script using conv_with_6_filters.

    host_positive_prediction (1,) [0.49975544]
    host_negative_prediction (1,) [0.49975544]
    ncs_positive_prediction (1,) [0.5]
    ncs_negative_prediction (1,) [0.5]
    PASS conv_with_6_filters
    

    However sometimes I get failing results when running the test.sh script:

    host_positive_prediction (1,) [0.49975544]
    host_negative_prediction (1,) [0.49975544]
    ncs_positive_prediction (1,) [0.5]
    ncs_negative_prediction (1,) [0.492]
    FAIL conv_with_6_filters
    

    Not sure why this is happening this way when the same network is generated each time.

  • Yeah, apologies on my part; this is a fault of my overly simple reproduction script. To save time I've set things up to do a minimal training run that tries to build a classifier mapping one example to 0.0 and the other to 1.0. The script runs a simple optimiser loop for a very short time, and it's possible the optimisation fails; in those cases you see what you've reported here, with both host_positive_prediction and host_negative_prediction stuck at 0.5. I see this sometimes when I run the script too. The workaround is to rerun the script when this happens, so you get host_positive_prediction 1.0 and host_negative_prediction 0.0. I should fix this; even if the optimisation takes longer, it's better for the reproduction to be reliable...
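
    For reference, the training part of the reproduction is roughly the following (a sketch; positive_example / negative_example and the optimiser settings are illustrative):

    labels = tf.placeholder(tf.float32, shape=(1, 1))
    loss = tf.losses.mean_squared_error(labels, output)
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for _ in range(100):   # deliberately short; occasionally leaves both predictions near 0.5
            sess.run(train_op, feed_dict={imgs: positive_example, labels: [[1.0]]})
            sess.run(train_op, feed_dict={imgs: negative_example, labels: [[0.0]]})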

  • (Let me fix this so it's reproducible every run; I'll ping you again when that's done.)

  • Hey @Tome_at_Intel, after being away for a bit I've revisited this; thanks for waiting.

    I've reduced it to an even more minimal example that, for me, trains every time (made more stable by using a network that does a simpler regression now). Can you take another look? https://github.com/matpalm/movidius_bug_reports
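
    The change is essentially swapping the sigmoid classifier head for a linear regression head with targets 10.0 and 5.0, roughly (a sketch):

    output = tf.layers.dense(net, 1, activation=None, name='output')   # linear output, no sigmoid
    loss = tf.losses.mean_squared_error(labels, output)                # labels are [[10.0]] / [[5.0]]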

    Running ./test.sh conv_with_regression three times I get ...

    expected positive_prediction [10]
    expected negativee_prediction [5]
    host_positive_prediction (1,) [ 10.]
    host_negative_prediction (1,) [ 5.]
    ncs_positive_prediction (1,) [ 4.50390625]
    ncs_negative_prediction (1,) [ 4.19921875]
    
    expected positive_prediction [10]
    expected negativee_prediction [5]
    host_positive_prediction (1,) [ 10.]
    host_negative_prediction (1,) [ 5.]
    ncs_positive_prediction (1,) [ 5.109375]
    ncs_negative_prediction (1,) [ 4.82421875]
    
    expected positive_prediction [10]
    expected negativee_prediction [5]
    host_positive_prediction (1,) [ 10.]
    host_negative_prediction (1,) [ 4.99999952]
    ncs_positive_prediction (1,) [ 5.5]
    ncs_negative_prediction (1,) [ 5.03125]
    