์ด ํฌ์ŠคํŠธ๋Š” ์ œ๊ฐ€ ๊ฐœ์ธ์ ์ธ ์šฉ๋„๋กœ ์ •๋ฆฌํ•˜๋Š” ์šฉ๋„์˜ ๊ธ€ ์ž…๋‹ˆ๋‹ค.


์ด ํฌ์ŠคํŠธ๋Š” ์ œ๊ฐ€ ๊ฐœ์ธ์ ์ธ ์šฉ๋„๋กœ ์ •๋ฆฌํ•˜๋Š” ์šฉ๋„์˜ ๊ธ€ ์ž…๋‹ˆ๋‹ค.


Faster RCNN in pytorch

ใ„ด The best-explained post of any I've seen on this topic!

If you'd rather jump straight to the full code instead of walking through the explanation step by step, see the link below.

ใ„ด link



  1. Region Proposal Network (RPN)
  2. RPN Loss Function
  3. ROI Pooling
  4. ROI Loss Function

RPN Network์€ โ€œLocationโ€๊ณผ โ€œObjectnessโ€๋ฅผ ํŒ๋‹จํ•œ๋‹ค!

Then, among the proposals generated by the RPN network, we take the $N$ with the highest scores!

These top-$N$ proposals are fed to the Fast R-CNN network.

The Fast R-CNN network performs “location” regression and “classification”.


VGG16์„ feature extractor๋กœ ์‚ฌ์šฉํ•œ๋‹ค.

VGG16์€ RPN network์™€ Fast R-CNN network ๋ชจ๋‘์—์„œ backbone ๋„คํŠธ์›Œํฌ๋กœ ์‚ฌ์šฉ๋œ๋‹ค!


Anchor Boxes

Passing an 800 x 800 input image through the VGG16 feature extractor reduces it to a 50 x 50 feature map.

Each pixel of this 50 x 50 feature map therefore corresponds to a 16 x 16 pixel patch of the input image!

We then generate anchor boxes centered on each pixel of the 50 x 50 feature map.

์ด๋•Œ, ํฌ๊ธฐ์™€ ์ข…ํšก๋น„๊ฐ€ ๊ฐ๊ฐ 3๊ฐœ ์žˆ์œผ๋ฏ€๋กœ ๊ฐ ํ”ฝ์…€์— 9๊ฐœ ๋ชจ์–‘์˜ anchor box๊ฐ€ ์ƒ์„ฑ๋œ๋‹ค.

And since each anchor box is stored as [y1, x1, y2, x2] values,

every pixel of the feature map ends up with a (9, 4) anchor tensor!

Q. In Faster R-CNN, does the feature extractor have to be designed to produce a 50 x 50 feature map?


โ€œNote that single ground-truth object may assign positive labels to multiple anchors.โ€ -> ๋‹น์—ฐ!

โ€œc) We assign a negative label to a non-positive anchor if its IoU ratio is lower than 0.3 for all ground-truth boxes. d) Anchors that are neither positive nor negative do not contribute to the training objective.โ€

  • argmax_ious: for each anchor box, the index of the gt-box it has the largest IoU with
  • max_ious: the max IoU values extracted from the full IoU matrix using argmax_ious!

max_ious ๊ฐ’์„ ๋ฐ”ํƒ•์œผ๋กœ ๊ฐ anchor box์— positive / negative / none์„ ๋ถ€์—ฌํ•  ์ˆ˜ ์žˆ์Œ!!

As a slightly different scheme, if instead of thresholding we assign positive (1) only to the anchors with the max IoU,

  • gt_argmax_ious: an array storing, regardless of label, the indices of the anchor boxes with the largest IoU for each gt-box!

then we can use this gt_argmax_ious!

(Though in that case, assigning negatives would be a bit tricky…)
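The thresholding rules quoted above, plus the gt_argmax_ious rule, can be sketched on a toy IoU matrix; the 0.3 / 0.7 thresholds follow the paper, everything else here is illustrative:

```python
import numpy as np

# toy (num_anchors, num_gt) IoU matrix: 4 anchors, 2 gt boxes
ious = np.array([
    [0.75, 0.10],
    [0.20, 0.65],
    [0.05, 0.15],
    [0.40, 0.35],
])

labels = np.full(len(ious), -1)        # -1 = "none" (ignored in training)

argmax_ious = ious.argmax(axis=1)                 # best gt for each anchor
max_ious = ious[np.arange(len(ious)), argmax_ious]
gt_argmax_ious = ious.argmax(axis=0)              # best anchor for each gt

labels[max_ious < 0.3] = 0             # negative: IoU < 0.3 for ALL gt boxes
labels[max_ious >= 0.7] = 1            # positive: IoU >= 0.7 for some gt box
labels[gt_argmax_ious] = 1             # positive: the single best anchor per gt

print(labels)   # anchors 0 and 1 positive, anchor 2 negative, anchor 3 ignored
```

Anchor 1 only becomes positive through the gt_argmax_ious rule: its max IoU (0.65) is below 0.7, but it is still the best anchor for gt-box 1.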

Training the RPN

Considering the full set of anchors, there will be far more negative anchors than positive ones. Or, in some cases, the reverse. The Faster R-CNN paper reasons that such a positive-negative anchor imbalance would make RPN training go poorly.

So they cap the number of sampled anchors (e.g. 256) to mitigate this imbalance.

Now we need to randomly sample #(positive samples) from the positive labels and ignore (-1) the remaining ones. In some cases we get fewer than #(positive samples); in that case we randomly sample (#(sample) − #(positive)) negative samples (0) and assign the ignore label to the remaining anchor boxes.
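A sketch of that sampling step, assuming n_sample = 256 with a 0.5 positive ratio; the toy label array and variable names are mine:

```python
import numpy as np

rng = np.random.default_rng(0)
n_sample, pos_ratio = 256, 0.5

# toy labels: 30 positives, 1000 negatives, 500 ignored
labels = np.concatenate([np.ones(30), np.zeros(1000), np.full(500, -1)]).astype(int)

n_pos = int(n_sample * pos_ratio)      # at most 128 positives
pos_idx = np.where(labels == 1)[0]
if len(pos_idx) > n_pos:
    disable = rng.choice(pos_idx, size=len(pos_idx) - n_pos, replace=False)
    labels[disable] = -1               # ignore the surplus positives

# fill the rest of the 256 with negatives; ignore the surplus
n_neg = n_sample - np.sum(labels == 1)
neg_idx = np.where(labels == 0)[0]
if len(neg_idx) > n_neg:
    disable = rng.choice(neg_idx, size=len(neg_idx) - n_neg, replace=False)
    labels[disable] = -1

print(np.sum(labels == 1), np.sum(labels == 0))   # 30 226
```

With only 30 positives available, all of them survive and 226 negatives are kept, for 256 samples total.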


BBox regression

(Aside) np.iinfo() and np.finfo() are functions that report the limits of int and float types, respectively!

For example, np.finfo(np.float32).min, np.finfo(np.float32).max, np.finfo(np.float32).eps, etc. tell us

the ‘minimum’, the ‘maximum’, and the ‘machine epsilon (the spacing of floats around 1)’ of the type!

// ๊ทผ๋ฐ ํฌ์ŠคํŠธ์—์„œ ๋‚˜์˜ค๋Š” ๋ฐฉ์‹์€ ์ข€ ๊ณผ๋ฏผ ๋ฐ˜์‘ ๊ฐ™๊ธฐ๋„?

RPN Network Architecture

  • RPN network(?) -> predicts the location of the box “inside the anchor”.

To generate region proposals, we โ€œslideโ€ a small network over the convolutional feature map output.

This feature is fed into two sibling fully connected layers.

  • A box regression layer
  • A box classification layer


pred_cls_scores and objectness_scores are used as inputs to the proposal layer, which generates a set of proposals that are then consumed by the “RoI network”.

Aha! pred_cls_scores and objectness_scores felt very similar, and now I see why!!

objectness_scores is just the object score part extracted from pred_cls_scores, which holds predictions for both background-ness and object-ness!
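That slicing can be sketched as follows, assuming the RPN classification head's output is reshaped so that the last axis holds the (background, object) pair; the 50 x 50 x 9 shapes follow the setup above:

```python
import numpy as np

# dummy RPN cls-head output: (batch, H, W, 9 anchors * 2 scores)
pred_cls_scores = np.random.rand(1, 50, 50, 9 * 2)

# separate the (background, object) pair, then keep only the object score
scores = pred_cls_scores.reshape(1, 50, 50, 9, 2)
objectness_scores = scores[..., 1].reshape(1, -1)

print(objectness_scores.shape)   # (1, 22500): one score per anchor
```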

Generating proposals to feed Fast R-CNN network

โ€œThe Faster R_CNN says, RPN proposals highly overlap with each other. To reduced redundancy, we adopt non-maximum suppression(NMS) on the proposal regions based on their cls scores. We fix the IoU threshold for NMS at 0.7, which leaves us about 2000 proposal regions per image. After an ablation study, the authors show that NMS does not harm the ultimate detection accuracy, but substantially reduces the number of proposals. After NMS, we use the top-N ranked proposal regions for detection. In the following we training Fast R-CNN using 2000 RPN proposals. During testing they evaluate only 300 proposals, they have tested this with various numbers and obtained this.โ€

์˜คํ™! ํ•™์Šต ๋•Œ์™€ ํ…Œ์ŠคํŠธ ํ•  ๋•Œ์˜ RPN proposal์˜ ์ˆ˜๊ฐ€ ๋‹ค๋ฅด๊ตฌ๋‚˜!!

  1. Convert the loc predictions from the RPN network to bbox [y1, x1, y2, x2] format.
  2. Clip the predicted boxes to the image. // Hmm?! Seems like this could be done later…?
  3. Remove predicted boxes with either height or width < threshold (min_size).
  4. Sort all (proposal, score) pairs by score from highest to lowest.
  5. Take the top pre_nms_topN (e.g. 12000 while training and 6000 while testing).
  6. Apply NMS with an IoU threshold of 0.7.
  7. Take the top post_nms_topN (e.g. 2000 while training and 300 while testing).
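Steps 2-5 can be sketched like this on toy boxes (in [y1, x1, y2, x2] format, 800 x 800 image); the names rois, min_size, and pre_nms_topN are my own shorthand:

```python
import numpy as np

img_size, min_size, pre_nms_topN = 800, 16, 3

rois = np.array([
    [-10.0,  -5.0, 300.0, 300.0],   # sticks out of the image -> clipped
    [ 10.0,  10.0,  20.0,  20.0],   # 10x10, too small -> removed
    [ 50.0,  60.0, 400.0, 500.0],
    [100.0, 100.0, 900.0, 850.0],   # sticks out -> clipped
    [  0.0,   0.0, 799.0, 799.0],
])
scores = np.array([0.9, 0.99, 0.3, 0.8, 0.5])

# 2. clip the predicted boxes to the image
rois = np.clip(rois, 0, img_size)

# 3. drop boxes smaller than min_size in either dimension
hs, ws = rois[:, 2] - rois[:, 0], rois[:, 3] - rois[:, 1]
keep = (hs >= min_size) & (ws >= min_size)
rois, scores = rois[keep], scores[keep]

# 4-5. sort by score and keep the top pre_nms_topN
order = scores.argsort()[::-1][:pre_nms_topN]
rois, scores = rois[order], scores[order]

print(scores)   # the tiny box is gone despite its 0.99 score
```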


์•„๋ž˜์˜ ์ˆ˜์‹์„ ํ†ตํ•ด ๋“œ๋””์–ด RPN์„ ์™„๋ฒฝํžˆ ์ดํ•ดํ–ˆ๋‹ค๋Š” ๋Š๋‚Œ์ด ๋“ค์—ˆ๋‹ค!!

  • x = (w_{a} * ctr_x_{p}) + ctr_x_{a}
  • y = (h_{a} * ctr_y_{p}) + ctr_y_{a}
  • h = np.exp(h_{p}) * h_{a}
  • w = np.exp(w_{p}) * w_{a}

and later convert to [y1, x1, y2, x2] format
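Those four equations, vectorized over anchors; here locs holds the predicted (dy, dx, dh, dw) offsets, and the toy values are only for illustration:

```python
import numpy as np

# one anchor in [y1, x1, y2, x2] and one predicted (dy, dx, dh, dw) offset
anchors = np.array([[0.0, 0.0, 100.0, 200.0]])
locs = np.array([[0.25, -0.25, np.log(2.0), 0.0]])

h_a = anchors[:, 2] - anchors[:, 0]
w_a = anchors[:, 3] - anchors[:, 1]
ctr_y_a = anchors[:, 0] + 0.5 * h_a
ctr_x_a = anchors[:, 1] + 0.5 * w_a

ctr_y = h_a * locs[:, 0] + ctr_y_a     # y = h_a * dy + ctr_y_a
ctr_x = w_a * locs[:, 1] + ctr_x_a     # x = w_a * dx + ctr_x_a
h = np.exp(locs[:, 2]) * h_a           # h = exp(dh) * h_a -> height doubles
w = np.exp(locs[:, 3]) * w_a           # w = exp(dw) * w_a -> width unchanged

# and later convert back to [y1, x1, y2, x2]
proposal = np.stack([ctr_y - h / 2, ctr_x - w / 2,
                     ctr_y + h / 2, ctr_x + w / 2], axis=1)
print(proposal)
```

Note how the offsets are expressed relative to the anchor's own width/height: the same locs produce a different box for every anchor, which is exactly why a proposal "moves together with its anchor".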

In other words, a “Proposal” encodes information relative to the inside of its anchor box!

So a Proposal is something that moves together with its anchor!

Here, predicted_loc is exactly what becomes the Proposal!

The loc_layer produces the Proposal that lives inside the anchor box.

// Or rather, should I understand anchor + predicted_loc as the Proposal?


NMS

  • while order_array.size > 0:
    • take “the first element” in order_array and append it to keep
    • compute the IoU of this box with all the other boxes
    • find the indices of all the boxes that have high overlap with “this box”
    • remove them from order_array
    • iterate until order_array shrinks to zero (the while loop)
  • Output the keep variable, which tells us which indexes to consider.

In the end, NMS computes IoU between proposals to thin them out!!
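The loop above as a minimal numpy NMS, assuming [y1, x1, y2, x2] boxes; the 0.7 threshold is the paper's value, and the toy boxes are mine:

```python
import numpy as np

def nms(boxes, scores, thresh=0.7):
    y1, x1, y2, x2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (y2 - y1) * (x2 - x1)
    order = scores.argsort()[::-1]     # indices sorted by score, highest first
    keep = []
    while order.size > 0:
        i = order[0]                   # take the first element, keep it
        keep.append(int(i))
        # IoU of box i with every remaining box
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        inter = np.maximum(0, yy2 - yy1) * np.maximum(0, xx2 - xx1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # drop the boxes that overlap too much with box i
        order = order[1:][iou <= thresh]
    return keep

boxes = np.array([[0.0, 0.0, 10.0, 10.0],
                  [0.5, 0.5, 10.5, 10.5],    # IoU ~0.82 with box 0 -> suppressed
                  [50.0, 50.0, 60.0, 60.0]]) # no overlap -> kept
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))   # [0, 2]
```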

ROI Pooling layer

Fast R-CNN

The purpose of region of interest pooling (also known as RoI pooling) is to perform max pooling on inputs of non-uniform sizes to obtain fixed-size feature maps (e.g. 7×7). This layer takes two inputs:

โ€ฆ

  1. Dividing the region proposal into โ€œequal-sized sectionsโ€ (the number of which is the same as the dimension of the output)
  2. Finding the largest value in each section
  3. Copying these max values to the output buffer

Note that โ€œthe dimension of the RoI pooling outputโ€ doesnโ€™t actually depend on the size of the input feature map nor on the size of the region proposals. Itโ€™s determined solely by the number of sections we divide the proposal into.

Whatโ€™s the benefit of RoI pooling? One of them is processing speed. If there are multiple object proposals on the frame, we can still use the โ€œsame-size input feature mapโ€ for all of them.

From the previous sections we got gt_roi_locs, gt_roi_labels and sample_rois. We will use the sample_rois as the input to the RoI pooling layer.
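The three steps above as a bare-bones, single-channel RoI pooling sketch; real implementations (e.g. torchvision's roi_pool) handle coordinate scaling and quantization more carefully, and every name here is my own:

```python
import numpy as np

def roi_pool(feature, roi, out_size=2):
    # 1. divide the region into equal-sized sections (out_size x out_size)
    y1, x1, y2, x2 = roi
    region = feature[y1:y2, x1:x2]
    h_step = region.shape[0] // out_size
    w_step = region.shape[1] // out_size
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            section = region[i * h_step:(i + 1) * h_step,
                             j * w_step:(j + 1) * w_step]
            # 2-3. take the largest value in each section into the output
            out[i, j] = section.max()
    return out

feature = np.arange(64, dtype=float).reshape(8, 8)   # toy 8x8 feature map
print(roi_pool(feature, (0, 0, 4, 4)))   # a 4x4 RoI pooled down to 2x2
```

Whatever the RoI's size, the output is always out_size x out_size, which is exactly the property the note above describes.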
