Lec.10: Recognition¶

Semantic Segmentation¶

sliding window: 效率低，作用有限，感受野小
一般使用全连接神经网络
- loss function: per-pixel cross-entropy
  - But convolution on high resolution is expensive
- 提高效率：使用unet
  - 升维使用unpooling：可以使用bed of nails(新增的位置直接取0)，nearest neighbor(和最近的一致)，双线性插值，双三次插值
  - skip connection：在upsampling中直接使用downsampling过程相同大小的feature map
DeepLab: FCN + CRF
CRF(Conditional random field):
- Energy function: $$ E(x)=\sum_{i'}\theta_i(x_i)+\sum_{ij}\theta_{ij}(x_i,x_j) $$
- Unary potential $$ \theta_i(x_i)=-\log P(x_i) $$
- Pairwise potential $$ \begin{aligned}\theta_{ij}(x_{i},x_{j})=&\mu(x_{i},x_{j})\bigg[w_{1}\exp\bigg(-\frac{||p_{i}-p_{j}||^{2}}{2\sigma_{\alpha}^{2}}-\frac{||I_{i}-I_{j}||^{2}}{2\sigma_{\beta}^{2}}\bigg)\newline &+w_{2}\exp\Big(-\frac{||p_{i}-p_{j}||^{2}}{2\sigma_{\gamma}^{2}}\Big)\Big]\end{aligned} $$ where $\mu(x_{i},x_{j})=1$ if $x_i\neq x_{j}$, and zero otherwise
Evaluation metric: Per-pixel Intersection-over-union(IoU)

Object detection¶

output: a set of bounding boxes that denote objects
detect a single object: Treat localization as a regression problem! 分两个网络进行分类和确定box范围
Region proposals：对图像采取过分割，找到所有有可能的box，通常基于启发式。relatively fast to run
- Selective Search gives 2000 region proposals in a few seconds on CPU
R-CNN
- Resize each region to 224x224 and run through CNN
- Predict class scores and bbox transform
- Use scores to select a subset of region proposals to output
- evaluation metric: IoU > 0.5 is decent, > 0.7 is pretty good, > 0.9 is almost perfect
- Non-Max Suppression: Object detectors often output many overlapping detections
  1. Select the highest-scoring box
  2. Eliminate lower-scoring boxes with IoU > threshold 和第一步中分数最高的计算IoU
  3. If any boxes remain, goto 1
Fast R-CNN: 把整个图都跑一遍网络的backbone以提取特征，提高效率

Faster R-CNN: A two-stage object detector
- First stage: run once per image
- Backbone network
- RPN
- Second stage: run once per region
- Crop features: RoI pool / align
- Predict object class
- Predict bbox offset
- 进一步地，使用一个卷积网络训练region proposal
- RPN：预先定义K种可能的anchor box，对每个点都生成对应的这些box，之后跑网络，网络输出对应box是否包含物体，以及4个$\Delta$，表示box在四个方向上要如何调整。最后选取分数最高的若干个
确认哪些box有object后，Single-Stage Detector: Classify each object as one of C categories (or background)
YOLO: single-stage。即识别框内有物体和是什么物体只用跑一次，现在yolo用的最多
two vs. single
- Two-stage is generally more accurate
- Single-stage is faster

Instance segmentation¶

Faster R-CNN + additional head(mask prediction)
deep snake：最早的，通过优化的方式

Beyond instance-segmentation
- Label all pixels in the image (both things and stuff)

Human pose estimation¶

represent the pose of a human by locating a set of keypoints

e.g. 17 keypoints:

Nose

Left / Right eye

Left / Right ear

Left / Right shoulder

Left / Right elbow

Left / Right wrist

Left / Right hip

Left / Right knee

Left / Right ankle

single human
- directly predict joint locations
- represent joint locations as the heatmap
Multiple humans
- Top-down: Detect humans and detect keypoints in each bbox 先用前面的方法分割找到每个人，之后再当single human识别。但效率低，同时两个人重合时可能只会有一个box，不可修正
  - Mask R-CNNs
  - 大部分的时候更准确
- Bottom-up: Detect keypoints and group keypoints to form humans
  - Example: OpenPose: Link parts based on part affinity fields
  - 更快，复杂情况效果可能更好

Other tasks¶

video classification: use 3D CNN

temporal action localization: Given a long untrimmed video sequence, identify frames corresponding to different actions 位置+帧的region propsal
Spatial-temporal detection: Given a long untrimmed video, detect all the people in space and time and classify the activities they are performing
Multi-object trackinig: Identify and track objects belonging to one or more categories without any prior knowledge about the appearance and number of targets.

Lec.10: Recognition¶

Semantic Segmentation¶

Object detection¶

Instance segmentation¶

Human pose estimation¶

Other tasks¶

评论