

[TF] NaNs appear after concat

Hello!
I have a MobileNet-like graph where the final operation is a ConcatV2.
All layers before the concat produce results without NaNs, but the final concat layer produces NaNs:
https://drive.google.com/open?id=1WydK5qRbfTTK8dUDBN5qfftP72P2sQWi

mvNCCheck graph_optim.pb -in=image -on TfPoseEstimator/feat_concat
Result: (28, 28, 864)
1) 677375 nan
2) 461212 nan
3) 461240 nan
4) 461239 nan
5) 461238 nan
Expected: (1, 28, 28, 864)
1) 676897 32.22853
2) 653518 32.084774
3) 653723 32.062202
4) 653946 31.285995
5) 654040 29.77494

node {
  name: "TfPoseEstimator/feat_concat"
  op: "ConcatV2"
  input: "TfPoseEstimator/Conv2d_3_pool"
  input: "TfPoseEstimator/MobilenetV1/Conv2d_7_pointwise/Relu"
  input: "TfPoseEstimator/MobilenetV1/Conv2d_11_pointwise/Relu"
  input: "TfPoseEstimator/feat_concat/axis"
  attr {
    key: "N"
    value {
      i: 3
    }
  }
  attr {
    key: "T"
    value {
      type: DT_FLOAT
    }
  }
  attr {
    key: "Tidx"
    value {
      type: DT_INT32
    }
  }
}

I've checked all the inputs:
input: "TfPoseEstimator/Conv2d_3_pool"
input: "TfPoseEstimator/MobilenetV1/Conv2d_7_pointwise/Relu"
input: "TfPoseEstimator/MobilenetV1/Conv2d_11_pointwise/Relu"

and they give OK results (for example, for TfPoseEstimator/Conv2d_3_pool):

mvNCCheck   /home/fast/openpose/tf-pose-estimation/chk2/graph_optim.pb  -in=image  -on TfPoseEstimator/Conv2d_3_pool

Result: (28, 28, 96)
1) 72639 15.2
2) 72653 14.516
3) 72659 13.28
4) 73311 12.97
5) 73695 12.97
Expected: (1, 28, 28, 96)
1) 72639 15.215018
2) 72653 14.516435
3) 72659 13.285161
4) 74559 12.965318
5) 73695 12.963984

(inputs also pass every test)

Comments

  • 8 Comments
  • @jokilokis Thanks for pointing this out. The bug has to do with the last layer being a concat layer. We're working on a release candidate for this issue at the moment.

  • deleted

  • If I add a proxy (for example, a max pool with kernel=1) after 'TfPoseEstimator/MobilenetV1/Conv2d_7_pointwise/Relu:0' and replace the original node in the concat with this proxy, the output becomes correct:

    # `graph` is the tf.Graph loaded from graph_optim.pb
    conc_in_1 = graph.get_tensor_by_name('TfPoseEstimator/Conv2d_3_pool:0')
    conc_in_2 = graph.get_tensor_by_name('TfPoseEstimator/MobilenetV1/Conv2d_7_pointwise/Relu:0')
    conc_in_3 = graph.get_tensor_by_name('TfPoseEstimator/MobilenetV1/Conv2d_11_pointwise/Relu:0')

    # A 1x1 max pool with stride 1 copies its input unchanged, but it gives
    # the compiler an extra node between the Relu and the concat.
    proxy = tf.nn.max_pool(conc_in_2, ksize=[1, 1, 1, 1], strides=[1, 1, 1, 1],
                           padding='SAME', name='proxy')
    test_conc1 = tf.concat([conc_in_1, proxy, conc_in_3], 3, name='test_conc')

    Result:  (28, 28, 864)
    1) 676897 32.22
    2) 653723 32.16
    3) 653946 31.42
    4) 654040 29.88
    5) 653920 29.81
    Expected:  (1, 28, 28, 864)
    1) 676897 32.2097
    2) 653723 32.097595
    3) 653518 32.080677
    4) 653946 31.31227
    5) 654040 29.789398
    

    So it is clearly an annoying bug, but one that can be worked around!

  • Wow, thanks @Tome_at_Intel. Does that mean I need to add a fake layer to be the last layer?

  • When will this bug be fixed?

  • @songyoff Try my approach: add a fake pooling layer after one of the concat's input nodes.
    TfPoseEstimator/feat_concat/Proxy 0.7 651.8 1.985

    As far as I can see, pooling is a very fast operation on the stick, and it helps avoid some compiler optimizations that lead to bugs.

    Here is the proto representation as an example of what I mean:

    node {
      name: "TfPoseEstimator/MobilenetV1/Conv2d_7_pointwise/Relu/Proxy"
      op: "MaxPool"
      input: "TfPoseEstimator/MobilenetV1/Conv2d_7_pointwise/Relu"
      attr {
        key: "T"
        value {
          type: DT_FLOAT
        }
      }
      attr {
        key: "data_format"
        value {
          s: "NHWC"
        }
      }
      attr {
        key: "ksize"
        value {
          list {
            i: 1
            i: 1
            i: 1
            i: 1
          }
        }
      }
      attr {
        key: "padding"
        value {
          s: "SAME"
        }
      }
      attr {
        key: "strides"
        value {
          list {
            i: 1
            i: 1
            i: 1
            i: 1
          }
        }
      }
    }
    
    node {
      name: "TfPoseEstimator/feat_concat"
      op: "ConcatV2"
      input: "TfPoseEstimator/Conv2d_3_pool"
      input: "TfPoseEstimator/MobilenetV1/Conv2d_7_pointwise/Relu/Proxy"
      input: "TfPoseEstimator/MobilenetV1/Conv2d_11_pointwise/Relu"
      input: "TfPoseEstimator/feat_concat/axis"
      attr {
        key: "N"
        value {
          i: 3
        }
      }
      attr {
        key: "T"
        value {
          type: DT_FLOAT
        }
      }
      attr {
        key: "Tidx"
        value {
          type: DT_INT32
        }
      }
    }
    

    Hope this helps while the fix is coming.
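
    To see why a 1x1 max pool with stride 1 is numerically a no-op (it only gives the compiler an extra node to chew on), here is a minimal NumPy sketch; `max_pool_nhwc` is a hypothetical helper written for illustration, not part of TensorFlow or the NCSDK:

    ```python
    import numpy as np

    def max_pool_nhwc(x, ksize, stride):
        # Naive NHWC max pooling over square windows (VALID-style, for illustration).
        n, h, w, c = x.shape
        oh = (h - ksize) // stride + 1
        ow = (w - ksize) // stride + 1
        out = np.empty((n, oh, ow, c), dtype=x.dtype)
        for i in range(oh):
            for j in range(ow):
                win = x[:, i * stride:i * stride + ksize,
                           j * stride:j * stride + ksize, :]
                out[:, i, j, :] = win.max(axis=(1, 2))
        return out

    # With ksize=1 and stride=1 every window holds a single element,
    # so the "pool" returns the input values unchanged.
    x = np.random.rand(1, 28, 28, 96).astype(np.float32)
    assert np.array_equal(max_pool_nhwc(x, 1, 1), x)
    ```

    So the proxy changes nothing in the math; it only prevents the concat from being the layer the compiler mis-optimizes.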

  • @jokilokis Thanks a lot, it's a good workaround until the bug is fixed.

  • @jokilokis I have run into the same problem as you. Adding a proxy gives normal output for 'feat_concat', but the following layer still produces NaNs. Have you successfully run the pose network on the stick?
