Lec.10: Recognition

Semantic Segmentation

  1. Sliding window: inefficient and of limited use, because each window only has a small receptive field

  2. The usual approach is a fully convolutional network (FCN)

    • loss function: per-pixel cross-entropy
      • But convolution on high resolution is expensive


    • To improve efficiency: use U-Net


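      • Upsampling uses unpooling. Options: bed of nails (newly created positions are set to 0), nearest neighbor (new positions copy the nearest value), bilinear interpolation, bicubic interpolation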

      • Skip connection: during upsampling, directly reuse the feature map of the same size from the downsampling path
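The two simplest unpooling options listed above can be sketched in a few lines. This is a minimal illustration on nested Python lists, assuming an integer upsampling factor `s`; the function names `bed_of_nails` and `nearest_neighbor` are chosen here for illustration and are not taken from any library.

```python
def bed_of_nails(x, s=2):
    """Upsample a 2D grid by factor s: each input value is placed in the
    top-left corner of its s x s block, every other position is 0."""
    h, w = len(x), len(x[0])
    out = [[0] * (w * s) for _ in range(h * s)]
    for i in range(h):
        for j in range(w):
            out[i * s][j * s] = x[i][j]
    return out


def nearest_neighbor(x, s=2):
    """Upsample a 2D grid by factor s: each output position copies the
    input value of the block it falls into."""
    h, w = len(x), len(x[0])
    return [[x[i // s][j // s] for j in range(w * s)] for i in range(h * s)]


if __name__ == "__main__":
    grid = [[1, 2], [3, 4]]
    for row in bed_of_nails(grid):
        print(row)
    for row in nearest_neighbor(grid):
        print(row)
```

Bilinear and bicubic interpolation follow the same pattern but blend several neighboring input values instead of copying one.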


  3. DeepLab: FCN + CRF

  4. CRF (conditional random field):

    • Energy function: $$ E(x)=\sum_{i}\theta_i(x_i)+\sum_{ij}\theta_{ij}(x_i,x_j) $$

    • Unary potential $$ \theta_i(x_i)=-\log P(x_i) $$

    • Pairwise potential $$ \begin{aligned}\theta_{ij}(x_{i},x_{j})=&\mu(x_{i},x_{j})\bigg[w_{1}\exp\bigg(-\frac{||p_{i}-p_{j}||^{2}}{2\sigma_{\alpha}^{2}}-\frac{||I_{i}-I_{j}||^{2}}{2\sigma_{\beta}^{2}}\bigg)\newline &+w_{2}\exp\bigg(-\frac{||p_{i}-p_{j}||^{2}}{2\sigma_{\gamma}^{2}}\bigg)\bigg]\end{aligned} $$ where \(\mu(x_{i},x_{j})=1\) if \(x_i\neq x_{j}\) and 0 otherwise, \(p_i\) is the position of pixel \(i\), and \(I_i\) is its color
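To make the energy concrete, the sketch below evaluates \(E(x)\) for a given labeling by brute force over all pixel pairs. It is only an illustration under several assumptions: the name `crf_energy`, the default weights and bandwidths, and the choice to count each unordered pair once are all made up for this example, and real dense-CRF inference uses an approximate mean-field update rather than this quadratic loop.

```python
import math


def crf_energy(labels, probs, positions, colors,
               w1=1.0, w2=1.0, s_alpha=1.0, s_beta=1.0, s_gamma=1.0):
    """Energy of a labeling: unary terms plus pairwise terms.

    labels[i]    : label assigned to pixel i
    probs[i][c]  : network probability P(x_i = c)
    positions[i] : (row, col) of pixel i
    colors[i]    : color vector of pixel i
    Weights and bandwidths are illustrative defaults (assumption).
    """
    n = len(labels)
    # Unary potential: -log P(x_i)
    energy = sum(-math.log(probs[i][labels[i]]) for i in range(n))
    # Pairwise potential, each unordered pair counted once (assumption)
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] == labels[j]:
                continue  # mu(x_i, x_j) = 0 when the labels agree
            d_pos = sum((a - b) ** 2 for a, b in zip(positions[i], positions[j]))
            d_col = sum((a - b) ** 2 for a, b in zip(colors[i], colors[j]))
            appearance = w1 * math.exp(-d_pos / (2 * s_alpha ** 2)
                                       - d_col / (2 * s_beta ** 2))
            smoothness = w2 * math.exp(-d_pos / (2 * s_gamma ** 2))
            energy += appearance + smoothness
    return energy


if __name__ == "__main__":
    probs = [[0.5, 0.5], [0.5, 0.5]]
    positions = [(0, 0), (0, 1)]
    colors = [(0.0,), (0.0,)]
    print(crf_energy([0, 0], probs, positions, colors))
    print(crf_energy([0, 1], probs, positions, colors))
```

Two neighboring pixels with the same color pay a penalty only when their labels differ, which is what pushes label boundaries toward image edges.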

  5. Evaluation metric: per-pixel Intersection-over-Union (IoU)
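A small sketch of the per-pixel IoU metric, assuming predictions and ground truth are given as 2D grids of integer class labels. The helper names `per_class_iou` and `mean_iou` are illustrative, and averaging only over classes that actually appear is one common convention, not necessarily the one used in the lecture.

```python
def per_class_iou(pred, target, num_classes):
    """IoU per class: |pred == c AND target == c| / |pred == c OR target == c|.
    Returns None for classes absent from both prediction and ground truth."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p == t:
                inter[p] += 1
                union[p] += 1
            else:
                # The pixel belongs to the union of both classes involved
                union[p] += 1
                union[t] += 1
    return [inter[c] / union[c] if union[c] else None
            for c in range(num_classes)]


def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that appear in pred or target."""
    ious = [v for v in per_class_iou(pred, target, num_classes) if v is not None]
    return sum(ious) / len(ious)


if __name__ == "__main__":
    pred = [[0, 0], [1, 1]]
    target = [[0, 1], [1, 1]]
    print(per_class_iou(pred, target, 2))  # class 0: 1/2, class 1: 2/3
    print(mean_iou(pred, target, 2))
```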


Object detection

  1. output: a set of bounding boxes that denote objects

  2. Detecting a single object: treat localization as a regression problem. Two networks are used, one for classification and one for regressing the extent of the box

  3. Region proposals: over-segment the image to find all boxes that might contain an object, usually with heuristics. Relatively fast to run

    • Selective Search gives 2000 region proposals in a few seconds on CPU
  4. R-CNN

    • Resize each region to 224x224 and run through CNN
    • Predict class scores and bbox transform
    • Use scores to select a subset of region proposals to output


    • evaluation metric: IoU > 0.5 is decent, > 0.7 is pretty good, > 0.9 is almost perfect
    • Non-Max Suppression: Object detectors often output many overlapping detections
      1. Select the highest-scoring box
      2. Eliminate lower-scoring boxes whose IoU with the box selected in step 1 exceeds a threshold
      3. If any boxes remain, goto 1
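The three NMS steps above can be sketched as a short greedy loop. This is a minimal version assuming boxes are `(x1, y1, x2, y2)` tuples and a default threshold of 0.5; the names `box_iou` and `nms` are chosen for this example, and real detectors typically run NMS per class with a vectorized implementation.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, threshold=0.5):
    """Greedy non-max suppression; returns indices of the kept boxes."""
    # Candidate indices sorted by descending score
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:                      # step 3: repeat while boxes remain
        best = order.pop(0)           # step 1: highest-scoring box
        keep.append(best)
        # step 2: drop boxes that overlap the selected box too much
        order = [i for i in order
                 if box_iou(boxes[best], boxes[i]) <= threshold]
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(nms(boxes, scores))  # [0, 2]: the second box overlaps the first
```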
  5. Fast R-CNN: run the backbone once over the whole image to extract features, instead of once per region, which improves efficiency

  1. Faster R-CNN: A two-stage object detector

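    • First stage: run once per image
      • Backbone network
      • RPN (region proposal network)
    • Second stage: run once per region
      • Crop features: RoI pool / align
      • Predict object class
      • Predict bbox offset
    • Going one step further than Fast R-CNN, the region proposals themselves are produced by a trained convolutional network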


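    • RPN: predefine K candidate anchor boxes and place them at every position of the feature map. The network then outputs, for each anchor, whether it contains an object, plus four offsets \(\Delta\) that describe how the box should be adjusted in the four directions. Finally, the highest-scoring boxes are kept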
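The anchor mechanism can be sketched as follows. The notes do not say how the four \(\Delta\) values are parameterized, so this example assumes the common Faster R-CNN convention: a center shift scaled by the anchor size and a log-scale change of width and height. The function names, anchor sizes, and aspect ratios are illustrative assumptions.

```python
import math


def make_anchors(feat_h, feat_w, stride, sizes=(64, 128), ratios=(0.5, 1.0, 2.0)):
    """Place K = len(sizes) * len(ratios) anchors, as (cx, cy, w, h), at every
    feature-map position. Sizes and ratios are illustrative defaults."""
    anchors = []
    for i in range(feat_h):
        for j in range(feat_w):
            # Center of this feature-map cell in image coordinates
            cx, cy = (j + 0.5) * stride, (i + 0.5) * stride
            for s in sizes:
                for r in ratios:
                    # r = h / w, with the area kept at s * s
                    anchors.append((cx, cy, s / math.sqrt(r), s * math.sqrt(r)))
    return anchors


def apply_deltas(anchor, deltas):
    """Adjust an anchor with predicted (dx, dy, dw, dh) offsets
    (assumed parameterization, see the lead-in)."""
    cx, cy, w, h = anchor
    dx, dy, dw, dh = deltas
    return (cx + dx * w, cy + dy * h, w * math.exp(dw), h * math.exp(dh))


def top_k_proposals(anchors, deltas, scores, k):
    """Keep the k anchors with the highest objectness score, after adjustment."""
    order = sorted(range(len(anchors)), key=lambda i: scores[i], reverse=True)[:k]
    return [apply_deltas(anchors[i], deltas[i]) for i in order]


if __name__ == "__main__":
    anchors = make_anchors(feat_h=2, feat_w=3, stride=16)
    print(len(anchors))  # 2 * 3 positions * 6 anchors each = 36
    print(apply_deltas((8.0, 8.0, 64.0, 64.0), (0.25, 0.0, 0.0, 0.0)))
```

In a real RPN the scores and offsets come from a small convolutional head on the backbone features; here they are plain lists so that the bookkeeping is visible.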


  2. Single-stage detector: instead of only deciding which boxes contain an object, classify each box directly as one of C categories (or background)

  3. YOLO: single-stage. Deciding whether a box contains an object and which object it is takes only one pass. YOLO is currently the most widely used

  4. Two-stage vs. single-stage

    • Two-stage is generally more accurate
    • Single-stage is faster

Instance segmentation

  1. Mask R-CNN: Faster R-CNN + an additional head for mask prediction

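  2. Deep Snake: contour-based. The original snake (active contour) approach is the earliest one and fits the contour by optimization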

  1. Beyond instance segmentation: panoptic segmentation

    • Label all pixels in the image (both things and stuff)

Human pose estimation

  1. represent the pose of a human by locating a set of keypoints

e.g. 17 keypoints:

  • Nose
  • Left / Right eye
  • Left / Right ear
  • Left / Right shoulder
  • Left / Right elbow
  • Left / Right wrist
  • Left / Right hip
  • Left / Right knee
  • Left / Right ankle
  1. Single human

    • Directly predict joint locations
    • Represent joint locations as heatmaps
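The heatmap representation can be illustrated with a toy encoder and decoder: the training target for one keypoint is a Gaussian bump centered on the joint, and the predicted location is read off as the argmax. This is a sketch only; the names `gaussian_heatmap` and `decode_keypoint` and the default `sigma` are assumptions, and actual models predict one such map per keypoint (e.g. 17 maps) at reduced resolution.

```python
import math


def gaussian_heatmap(height, width, cx, cy, sigma=1.0):
    """Target heatmap for one keypoint at column cx, row cy:
    value 1 at the joint, decaying as a Gaussian with distance."""
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
             for x in range(width)]
            for y in range(height)]


def decode_keypoint(heatmap):
    """Return (x, y) of the highest response in the heatmap."""
    best_val, best_xy = float("-inf"), None
    for y, row in enumerate(heatmap):
        for x, val in enumerate(row):
            if val > best_val:
                best_val, best_xy = val, (x, y)
    return best_xy


if __name__ == "__main__":
    hm = gaussian_heatmap(height=5, width=7, cx=4, cy=2)
    print(decode_keypoint(hm))  # (4, 2)
```

Compared with regressing coordinates directly, a heatmap keeps the spatial structure of the feature map and can express uncertainty about where the joint is.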
  2. Multiple humans

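    • Top-down: detect humans, then detect keypoints in each bbox. The earlier methods are used to find each person, and each one is then handled as the single-human case. This is less efficient, and when two people overlap there may be only one box, which cannot be corrected afterwards
      • Example: Mask R-CNN
      • More accurate most of the time
    • Bottom-up: detect keypoints, then group keypoints to form humans
      • Example: OpenPose: link parts based on part affinity fields
      • Faster, and may work better in complex scenes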

Other tasks

  1. video classification: use 3D CNN
  1. Temporal action localization: given a long untrimmed video sequence, identify the frames corresponding to different actions (region proposals over position and frames)

  2. Spatial-temporal detection: Given a long untrimmed video, detect all the people in space and time and classify the activities they are performing

  3. Multi-object tracking: identify and track objects belonging to one or more categories without any prior knowledge about the appearance and number of targets.
