Faster R-CNN
This post is a set of notes I put together for my own reference.
- Of all the posts I have read, this is the one that explains it best!
If you would rather jump straight to the full code than read the explanation step by step, see the link below.
- link
- Region Proposal Network (RPN)
- RPN Loss Function
- ROI Pooling
- ROI Loss Function
The RPN network predicts "location" and "objectness"!
From the proposals generated by the RPN network, we take the top $N$ by score.
These top-$N$ proposals are passed on to the Fast R-CNN network.
The Fast R-CNN network performs "location" (box regression) and "classification".
VGG16 is used as the feature extractor.
VGG16 serves as the backbone for both the RPN network and the Fast R-CNN network!
Anchor Boxes
Using VGG16 as the feature extractor, we reduced the 800 x 800 input image to a 50 x 50 feature map.
Each pixel of the 50 x 50 feature map therefore corresponds to a 16 x 16 pixel region of the input (800 / 50 = 16)!
We generate anchor boxes centered on each pixel of the 50 x 50 feature map.
Since there are 3 scales and 3 aspect ratios, each pixel gets 9 anchor boxes of different shapes.
Each anchor box is described by [y1, x1, y2, x2], so every feature-map pixel is effectively assigned a (9, 4) tensor of anchors!
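To make the anchor layout concrete, here is a minimal NumPy sketch. The ratio values (0.5, 1, 2) and scale values (8, 16, 32) are my assumptions; the notes above only say there are 3 of each. The function name generate_anchors is also just for illustration.

```python
import numpy as np

def generate_anchors(feat_size=50, stride=16,
                     ratios=(0.5, 1, 2), scales=(8, 16, 32)):
    """Return anchors as [y1, x1, y2, x2], shape (feat_size * feat_size * 9, 4)."""
    # Centre of every feature-map cell, expressed in input-image pixels.
    centers = np.arange(feat_size) * stride + stride / 2
    ctr_y, ctr_x = np.meshgrid(centers, centers, indexing="ij")
    ctr_y, ctr_x = ctr_y.ravel(), ctr_x.ravel()

    # Heights and widths of the 9 shapes (3 ratios x 3 scales); values are assumed.
    hs = np.array([stride * s * np.sqrt(r) for r in ratios for s in scales])
    ws = np.array([stride * s * np.sqrt(1.0 / r) for r in ratios for s in scales])

    # Broadcast (n_cells, 1) centres against (9,) shapes -> (n_cells, 9).
    y1 = ctr_y[:, None] - hs / 2
    x1 = ctr_x[:, None] - ws / 2
    y2 = ctr_y[:, None] + hs / 2
    x2 = ctr_x[:, None] + ws / 2
    return np.stack([y1, x1, y2, x2], axis=-1).reshape(-1, 4)

anchors = generate_anchors()
print(anchors.shape)  # (22500, 4), i.e. 50 * 50 * 9 anchors
```

Many of these anchors stick out past the image border; a real pipeline has to decide what to do with them before training.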
Q. In Faster R-CNN, does the feature extractor really have to be designed to produce a 50 x 50 feature map?
"Note that single ground-truth object may assign positive labels to multiple anchors." -> Of course!
"c) We assign a negative label to a non-positive anchor if its IoU ratio is lower than 0.3 for all ground-truth boxes. d) Anchors that are neither positive nor negative do not contribute to the training objective."
argmax_ious: for each anchor box, the index of the gt-box with which it has the largest IoU.
max_ious: the max values pulled out of the full IoU array using argmax_ious!
Based on max_ious, each anchor box can be labeled positive / negative / none!!
A slightly different approach is to skip the thresholds and give a positive label (1) only to the anchors with the max IoU. For that we can use gt_argmax_ious, an array holding the indices of the anchor boxes with the largest IoU, regardless of label!
(In that case, though, assigning the negatives would get a bit tricky...)
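A small NumPy sketch of how these arrays could be computed and turned into labels. The helper names iou_matrix and label_anchors, the 0.7 positive threshold, and the toy boxes are my assumptions; only the 0.3 negative threshold appears in the quote above.

```python
import numpy as np

def iou_matrix(anchors, gt_boxes):
    """IoU of every anchor with every gt box; boxes are [y1, x1, y2, x2]."""
    tl = np.maximum(anchors[:, None, :2], gt_boxes[None, :, :2])  # overlap top-left
    br = np.minimum(anchors[:, None, 2:], gt_boxes[None, :, 2:])  # overlap bottom-right
    inter = np.prod(np.clip(br - tl, 0, None), axis=2)
    area_a = np.prod(anchors[:, 2:] - anchors[:, :2], axis=1)
    area_g = np.prod(gt_boxes[:, 2:] - gt_boxes[:, :2], axis=1)
    return inter / (area_a[:, None] + area_g[None, :] - inter)

def label_anchors(ious, pos_thr=0.7, neg_thr=0.3):
    """1 = positive, 0 = negative, -1 = ignored."""
    argmax_ious = ious.argmax(axis=1)                      # best gt box per anchor
    max_ious = ious[np.arange(len(ious)), argmax_ious]     # that best IoU value
    gt_argmax_ious = np.where(ious == ious.max(axis=0))[0]  # best anchor(s) per gt box

    labels = np.full(len(ious), -1, dtype=np.int32)
    labels[max_ious < neg_thr] = 0
    labels[gt_argmax_ious] = 1          # the highest-IoU anchor of each gt box
    labels[max_ious >= pos_thr] = 1     # anything above the (assumed) threshold
    return labels, argmax_ious, max_ious

toy_anchors = np.array([[0, 0, 10, 10], [0, 0, 10, 5],
                        [20, 20, 30, 30], [0, 0, 10, 8]], dtype=float)
toy_gt = np.array([[0, 0, 10, 10]], dtype=float)
labels, argmax_ious, max_ious = label_anchors(iou_matrix(toy_anchors, toy_gt))
print(labels.tolist())  # [1, -1, 0, 1]
```

The toy IoUs are 1.0, 0.5, 0.0 and 0.8, so the second anchor falls between the two thresholds and is ignored.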
Training the RPN
Across all anchors, negative anchors will usually far outnumber positive ones (the reverse can also happen). The Faster R-CNN paper argues that such an imbalance between positive and negative anchors makes the RPN hard to train.
It therefore caps the number of sampled anchors (e.g. 256) to mitigate the imbalance.
Now we need to randomly sample #(positive samples) from the positive labels and ignore (-1) the remaining ones. In some cases we get fewer than #(positive samples); in that case we randomly sample (#(sample) - #(positive)) negative samples (0) and assign the ignore label to the remaining anchor boxes. This is done using the following code.
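The code from the original post is not reproduced in these notes, so here is my own minimal sketch of the sampling step. The function name sample_labels and the 0.5 positive ratio are assumptions; 256 is the example value mentioned above.

```python
import numpy as np

def sample_labels(labels, n_sample=256, pos_ratio=0.5, rng=None):
    """Keep at most n_sample anchors; the rest are set to -1 (ignored)."""
    rng = np.random.default_rng() if rng is None else rng
    labels = labels.copy()
    n_pos = int(n_sample * pos_ratio)

    # Too many positives: randomly ignore the surplus.
    pos_idx = np.where(labels == 1)[0]
    if len(pos_idx) > n_pos:
        drop = rng.choice(pos_idx, size=len(pos_idx) - n_pos, replace=False)
        labels[drop] = -1

    # Fill the rest of the budget with negatives, ignoring the surplus.
    n_neg = n_sample - int(np.sum(labels == 1))
    neg_idx = np.where(labels == 0)[0]
    if len(neg_idx) > n_neg:
        drop = rng.choice(neg_idx, size=len(neg_idx) - n_neg, replace=False)
        labels[drop] = -1
    return labels

many_pos = np.array([1] * 300 + [0] * 1000 + [-1] * 50)
out = sample_labels(many_pos, rng=np.random.default_rng(0))
print((out == 1).sum(), (out == 0).sum())  # 128 128

few_pos = np.array([1] * 10 + [0] * 1000)
out_few = sample_labels(few_pos, rng=np.random.default_rng(0))
print((out_few == 1).sum(), (out_few == 0).sum())  # 10 246
```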
BBox regression
(Side note) np.iinfo() and np.finfo() return information about int and float types, respectively!
For example, np.finfo(np.float32).min, np.finfo(np.float32).max, and np.finfo(np.float32).eps give the minimum value, the maximum value, and the machine epsilon (the gap between 1.0 and the next representable float)!
// Though the approach used in the post seems a bit overcautious?
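As a sketch of where eps comes in: when computing the regression targets we divide by the anchor height and width and take a log, so clamping them to eps guards against zero-sized anchors. The function name encode_boxes and the (dy, dx, dh, dw) target order are my assumptions.

```python
import numpy as np

print(np.iinfo(np.int32).max)  # 2147483647
eps = np.finfo(np.float32).eps

def encode_boxes(anchors, gt_boxes):
    """Regression targets (dy, dx, dh, dw) for matched anchor / gt pairs."""
    h = anchors[:, 2] - anchors[:, 0]
    w = anchors[:, 3] - anchors[:, 1]
    ctr_y = anchors[:, 0] + 0.5 * h
    ctr_x = anchors[:, 1] + 0.5 * w

    gt_h = gt_boxes[:, 2] - gt_boxes[:, 0]
    gt_w = gt_boxes[:, 3] - gt_boxes[:, 1]
    gt_ctr_y = gt_boxes[:, 0] + 0.5 * gt_h
    gt_ctr_x = gt_boxes[:, 1] + 0.5 * gt_w

    # Guard against zero-sized anchors before dividing and taking the log.
    h = np.maximum(h, eps)
    w = np.maximum(w, eps)

    dy = (gt_ctr_y - ctr_y) / h
    dx = (gt_ctr_x - ctr_x) / w
    dh = np.log(gt_h / h)
    dw = np.log(gt_w / w)
    return np.stack([dy, dx, dh, dw], axis=1)

targets = encode_boxes(np.array([[0.0, 0.0, 10.0, 10.0]]),
                       np.array([[5.0, 5.0, 25.0, 25.0]]))
```

Here the gt box is twice as large as the anchor and shifted by one anchor size, so the targets come out as (1, 1, log 2, log 2).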
RPN Network Architecture
- RPN network(?) -> predict the location of the box "inside the anchor".
To generate region proposals, we "slide" a small network over the convolutional feature map output.
This feature is fed into two sibling fully connected layers.
- A box regression layer
- A box classification layer
pred_cls_scores and objectness_scores are used as inputs to the proposal layer, which generates a set of proposals that are then used by the "RoI network".
Aha! pred_cls_scores and objectness_scores felt very similar, and now I see why!!
objectness_scores is just the object score extracted from pred_cls_scores, which holds the predictions for both background-ness and object-ness!
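A shape-only NumPy sketch of that relationship. The (1, 18, 50, 50) head output and the channel layout (9 anchors x [background, object]) are assumptions on my part, and random numbers stand in for the real network output.

```python
import numpy as np

rng = np.random.default_rng(0)
raw = rng.standard_normal((1, 18, 50, 50))   # stand-in for the classification head output

scores = raw.transpose(0, 2, 3, 1)           # (1, 50, 50, 18): channels last
pred_cls_scores = scores.reshape(1, -1, 2)   # (1, 22500, 2): [background, object] per anchor
objectness_scores = scores.reshape(1, 50, 50, 9, 2)[..., 1].reshape(1, -1)  # (1, 22500)

print(pred_cls_scores.shape, objectness_scores.shape)
# objectness_scores is exactly the "object" column of pred_cls_scores.
print(np.array_equal(objectness_scores, pred_cls_scores[:, :, 1]))  # True
```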
Generating proposals to feed Fast R-CNN network
"The Faster R-CNN paper says that RPN proposals highly overlap with each other. To reduce redundancy, we adopt non-maximum suppression (NMS) on the proposal regions based on their cls scores. We fix the IoU threshold for NMS at 0.7, which leaves us about 2000 proposal regions per image. After an ablation study, the authors show that NMS does not harm the ultimate detection accuracy, but substantially reduces the number of proposals. After NMS, we use the top-N ranked proposal regions for detection. In the following we train Fast R-CNN using 2000 RPN proposals. During testing they evaluate only 300 proposals; they tested various numbers and settled on this."
Oh! So the number of RPN proposals differs between training and testing!!
- convert the loc predictions from the rpn network to bbox [y1, x1, y2, x2] format.
- clip the predicted boxes to the image // Why?! It seems like this could be done later...?
- Remove predicted boxes with either height or width < threshold (min_size).
- Sort all (proposal, score) pairs by score from highest to lowest.
- Take top pre_nms_topN (e.g. 12000 while training and 300 while testing).
- Apply nms threshold > 0.7
- Take top post_nms_topN (e.g. 2000 while training and 300 while testing)
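The steps before NMS (clip, size filter, sort, top-N) can be sketched roughly as follows. The function name pre_nms_filter and the min_size value of 16 are assumptions; boxes are taken to be already decoded into [y1, x1, y2, x2].

```python
import numpy as np

def pre_nms_filter(boxes, scores, img_size=(800, 800), min_size=16,
                   pre_nms_top_n=12000):
    """Clip boxes to the image, drop tiny ones, keep the best-scoring top-N."""
    boxes = boxes.copy()
    boxes[:, 0::2] = np.clip(boxes[:, 0::2], 0, img_size[0])  # y1, y2
    boxes[:, 1::2] = np.clip(boxes[:, 1::2], 0, img_size[1])  # x1, x2

    hs = boxes[:, 2] - boxes[:, 0]
    ws = boxes[:, 3] - boxes[:, 1]
    keep = np.where((hs >= min_size) & (ws >= min_size))[0]
    boxes, scores = boxes[keep], scores[keep]

    order = scores.argsort()[::-1][:pre_nms_top_n]  # highest score first
    return boxes[order], scores[order]

toy_boxes = np.array([[-10, -10, 50, 50], [0, 0, 5, 5], [100, 100, 900, 900]],
                     dtype=float)
toy_scores = np.array([0.2, 0.9, 0.8])
kept_boxes, kept_scores = pre_nms_filter(toy_boxes, toy_scores)
print(kept_boxes.tolist())
# [[100.0, 100.0, 800.0, 800.0], [0.0, 0.0, 50.0, 50.0]]
```

In this toy case the 5 x 5 box is dropped despite having the highest score, because it is smaller than min_size.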
The formulas below finally made me feel that I fully understand the RPN!!
- x = (w_{a} * ctr_x_{p}) + ctr_x_{a}
- y = (h_{a} * ctr_y_{p}) + ctr_y_{a}
- h = np.exp(h_{p}) * h_{a}
- w = np.exp(w_{p}) * w_{a}
and later convert to y1, x1, y2, x2 format
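The same decoding written as a NumPy sketch. The function name decode_boxes and the (dy, dx, dh, dw) ordering of the predicted values are my assumptions.

```python
import numpy as np

def decode_boxes(anchors, locs):
    """Apply predicted (dy, dx, dh, dw) to anchors; returns [y1, x1, y2, x2]."""
    h_a = anchors[:, 2] - anchors[:, 0]
    w_a = anchors[:, 3] - anchors[:, 1]
    ctr_y_a = anchors[:, 0] + 0.5 * h_a
    ctr_x_a = anchors[:, 1] + 0.5 * w_a

    dy, dx, dh, dw = locs[:, 0], locs[:, 1], locs[:, 2], locs[:, 3]
    ctr_y = dy * h_a + ctr_y_a      # y = (h_a * ctr_y_p) + ctr_y_a
    ctr_x = dx * w_a + ctr_x_a      # x = (w_a * ctr_x_p) + ctr_x_a
    h = np.exp(dh) * h_a
    w = np.exp(dw) * w_a

    # Convert centre / size back to corner format.
    return np.stack([ctr_y - 0.5 * h, ctr_x - 0.5 * w,
                     ctr_y + 0.5 * h, ctr_x + 0.5 * w], axis=1)

toy_anchor = np.array([[0.0, 0.0, 10.0, 10.0]])
toy_loc = np.array([[1.0, 0.0, np.log(2.0), 0.0]])
decoded = decode_boxes(toy_anchor, toy_loc)
```

With this toy input the box is shifted down by one anchor height and doubled in height, giving roughly [5, 0, 25, 10]. An all-zero prediction returns the anchor itself.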
In other words, a "proposal" carries information expressed relative to its anchor box!
So a proposal is something that moves together with its anchor!
Here, predicted_loc is what becomes the proposal!
loc_layer produces the proposal that lives inside the anchor box.
// Or should anchor + predicted_loc be understood as the proposal?
NMS
- while order_array.size > 0:
- take "the first element" in order_array and append that to keep
- Find the overlap (IoU) of this box with all other boxes
- Find the index of all the boxes which have high overlap with "this box"
- Remove them from order_array
- Iterate this till order_array.size reaches zero (while loop)
- Output the keep variable which tells what indexes to consider.
In the end, NMS computes the IoU between proposals and uses it to thin out the set of proposals!!
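A minimal NumPy sketch of the loop described above, assuming boxes in [y1, x1, y2, x2] format. The function name nms and the toy boxes are my own; 0.7 is the threshold from the quote earlier.

```python
import numpy as np

def nms(boxes, scores, thresh=0.7):
    """Return the indices of the boxes to keep, best score first."""
    y1, x1, y2, x2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (y2 - y1) * (x2 - x1)
    order = scores.argsort()[::-1]          # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]                        # current best box
        keep.append(int(i))
        rest = order[1:]
        # Intersection of the best box with every remaining box.
        yy1 = np.maximum(y1[i], y1[rest])
        xx1 = np.maximum(x1[i], x1[rest])
        yy2 = np.minimum(y2[i], y2[rest])
        xx2 = np.minimum(x2[i], x2[rest])
        inter = np.maximum(0.0, yy2 - yy1) * np.maximum(0.0, xx2 - xx1)
        iou = inter / (areas[i] + areas[rest] - inter)
        # Drop everything that overlaps the best box too much.
        order = rest[iou <= thresh]
    return keep

toy_boxes = np.array([[0, 0, 10, 10], [0, 0, 10, 9], [20, 20, 30, 30]], dtype=float)
toy_scores = np.array([0.9, 0.8, 0.7])
keep = nms(toy_boxes, toy_scores)
print(keep)  # [0, 2]
```

The second box has an IoU of 0.9 with the first, so it is suppressed. The third does not overlap at all and survives.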
ROI Pooling layer
Region of interest pooling (also known as RoI pooling) performs max pooling on inputs of non-uniform sizes to obtain fixed-size feature maps (e.g. 7×7). This layer takes two inputs
…
- Dividing the region proposal into "equal-sized sections" (the number of which is the same as the dimension of the output)
- Finding the largest value in each section
- Copying these max values to the output buffer
Note that "the dimension of the RoI pooling output" doesn't actually depend on the size of the input feature map nor on the size of the region proposals. It's determined solely by the number of sections we divide the proposal into.
What's the benefit of RoI pooling? One of them is processing speed. If there are multiple object proposals on the frame, we can still use the "same-size input feature map" for all of them.
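A simplified NumPy sketch of the three steps for a single RoI. The function name roi_max_pool is hypothetical, the RoI is assumed to be given in feature-map cells with exclusive end coordinates, and the bins are only approximately equal when the RoI size is not divisible by the output size.

```python
import numpy as np

def roi_max_pool(feature_map, roi, output_size=7):
    """feature_map: (C, H, W); roi: (y1, x1, y2, x2) in feature-map cells."""
    y1, x1, y2, x2 = roi
    region = feature_map[:, y1:y2, x1:x2]
    channels, h, w = region.shape

    # Section boundaries along each axis.
    ys = np.linspace(0, h, output_size + 1)
    xs = np.linspace(0, w, output_size + 1)

    out = np.empty((channels, output_size, output_size), dtype=feature_map.dtype)
    for i in range(output_size):
        ya = int(np.floor(ys[i]))
        yb = max(int(np.ceil(ys[i + 1])), ya + 1)   # at least one cell per section
        for j in range(output_size):
            xa = int(np.floor(xs[j]))
            xb = max(int(np.ceil(xs[j + 1])), xa + 1)
            # Largest value in this section, copied to the output buffer.
            out[:, i, j] = region[:, ya:yb, xa:xb].max(axis=(1, 2))
    return out

fmap = np.arange(2 * 14 * 14, dtype=np.float32).reshape(2, 14, 14)
pooled_full = roi_max_pool(fmap, (0, 0, 14, 14))
pooled_part = roi_max_pool(fmap, (2, 3, 9, 13))
print(pooled_full.shape, pooled_part.shape)  # (2, 7, 7) (2, 7, 7)
```

Both RoIs come out as 7 x 7 even though their sizes differ, which is the point made above about the output dimension depending only on the number of sections.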
From the previous sections we got gt_roi_locs, gt_roi_labels and sample_rois. We will use sample_rois as the input to the RoI pooling layer.