# Lec.11: 3D Deep Learning
## Feature matching
- SuperPoint
- CNN-based detectors:
  - train CNNs to detect corners
  - train CNNs to enforce repeatability: warp the image and enforce equivariance (see the sketch below)
    $$ \min_f\frac{1}{n}\sum_{i=1}^n\|f(g(\mathbf{I}))-g(f(\mathbf{I}))\|^2 $$
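A minimal sketch of the equivariance objective above, assuming `f` is a detector network that outputs a heatmap and `warp` is a helper applying the same random transform g to an image or a heatmap (both names are assumptions, not from a specific library):

```python
import torch.nn.functional as F

def repeatability_loss(f, I, warp):
    """Penalize f(g(I)) differing from g(f(I)): detections should move
    with the image when it is warped (equivariance / repeatability)."""
    return F.mse_loss(f(warp(I)), warp(f(I)))
```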
- CNN-based descriptors
  - train descriptors by metric learning
  - triplet loss (see the sketch below):
    $$ L_{tri}=\frac{1}{N}\sum_{i=1}^N\max\left(0,\;m+\|F_I(A)-F_{I'}(P)\|-\|F_I(A)-F_{I'}(N)\|\right)^2 $$
  - where does the training data come from?
    - synthetic data
    - real images reconstructed with MVS
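A PyTorch sketch of the triplet loss above; `anchor`, `positive`, `negative` are assumed to be batches of descriptors $F_I(A)$, $F_{I'}(P)$, $F_{I'}(N)$:

```python
import torch

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Squared hinge on the margin between the matching (anchor-positive)
    and non-matching (anchor-negative) descriptor distances; (N, D) inputs."""
    d_pos = (anchor - positive).norm(dim=1)
    d_neg = (anchor - negative).norm(dim=1)
    return torch.clamp(margin + d_pos - d_neg, min=0).pow(2).mean()
```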
## Object Pose Estimation
- Feature-matching-based methods (see the PnP sketch below)
  - First, reconstruct the object's SfM model from the input multi-view images
  - Then, obtain 2D-3D correspondences by lifting 2D-2D matches to 3D
  - Finally, solve the object pose of the query image with PnP
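A minimal sketch of the final PnP step using OpenCV's `cv2.solvePnPRansac`; the 2D-3D correspondences are synthesized here, whereas in the pipeline above they come from lifting 2D-2D matches into the SfM model:

```python
import cv2
import numpy as np

K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
pts3d = np.random.rand(20, 3)                 # object points (from SfM model)
proj = (pts3d + [0., 0., 5.]) @ K.T           # synthetic pose: t = (0, 0, 5)
pts2d = proj[:, :2] / proj[:, 2:]             # matched query-image keypoints

# RANSAC rejects outlier matches; rvec/tvec is the recovered object pose.
ok, rvec, tvec, inliers = cv2.solvePnPRansac(pts3d, pts2d, K, distCoeffs=None)
print(ok, tvec.ravel())                       # tvec is roughly (0, 0, 5)
```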
- Direct pose regression methods (sketched below)
  - Directly regress the object pose of the query image using a neural network
  - Need to render a large number of images for training
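A toy sketch of direct pose regression (the backbone is a stand-in, not any particular published model): a CNN maps the query image to a unit quaternion plus a translation vector:

```python
import torch.nn as nn
import torch.nn.functional as F

class PoseRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(        # stand-in feature extractor
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.fc = nn.Linear(64, 7)            # 4 quaternion + 3 translation

    def forward(self, img):                   # img: (B, 3, H, W)
        out = self.fc(self.backbone(img))
        q = F.normalize(out[:, :4], dim=1)    # unit quaternion = rotation
        return q, out[:, 4:]                  # rotation, translation
```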
- Keypoint detection methods
  - Use a CNN to detect pre-defined keypoints
  - Need to render a large number of images for training
## Dense Reconstruction
- MVSNet: predict a cost volume from CNN features (see the variance sketch below)
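A sketch of MVSNet's variance-based cost volume, assuming the source-view features have already been homography-warped to the reference view at each depth hypothesis (the warping step is omitted here):

```python
import torch

def variance_cost_volume(ref_feat, warped_src):
    """ref_feat: (B, C, H, W); warped_src: (B, V, C, D, H, W), source features
    warped to the reference view at D depth hypotheses. Returns the
    (B, C, D, H, W) variance over all views, low where features agree."""
    B, V, C, D, H, W = warped_src.shape
    ref = ref_feat.unsqueeze(1).unsqueeze(3).expand(B, 1, C, D, H, W)
    feats = torch.cat([ref, warped_src], dim=1)   # (B, V + 1, C, D, H, W)
    return feats.var(dim=1, unbiased=False)       # variance across views
```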

- Implicit representations: describe the shape with a continuous function, e.g. occupancy or a signed distance function (SDF)
  - Train the implicit representation with a neural network (see the sketch below)
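A minimal sketch of a neural implicit representation: an MLP as the continuous function, here mapping a 3D point to a signed distance (the surface is the zero level set); it would be trained by regressing sampled (point, SDF) pairs:

```python
import torch.nn as nn

class SDFNet(nn.Module):
    """Continuous implicit shape: xyz (B, 3) -> signed distance (B, 1)."""
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, xyz):
        return self.net(xyz)
```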

- Single image to 3D
  - Monocular depth estimation: use a network to predict depth from a single image
  - Scale ambiguity: the same object at different sizes and depths produces the same image
  - Loss function: scale-invariant depth error (sketched below)
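A sketch of the scale-invariant log-depth error (in the style of Eigen et al.): with `lam = 1`, multiplying the predicted depth by any global scale only shifts the log-difference by a constant, which the loss ignores:

```python
import torch

def scale_invariant_loss(pred, gt, lam=1.0):
    """pred, gt: positive depth maps of the same shape."""
    d = torch.log(pred) - torch.log(gt)       # per-pixel log-depth error
    return (d ** 2).mean() - lam * d.mean() ** 2
```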
## Deep learning for 3D understanding
- 3D ConvNets: high space/time complexity of high-resolution voxels, $O(N^3)$
- Sparse ConvNets: exploit the sparsity of 3D shapes
  - Store the sparse surface signals (e.g. in an octree)
  - Constrain the computation to near the surface
  - Sparse convolution: compute inner products only at the active sites (nonzero entries); see the sketch below
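A minimal sketch of a submanifold-style sparse 3x3x3 convolution over a hand-rolled layout (the dict representation is an assumption for illustration, not a real library's format): output is computed only at the active sites:

```python
import numpy as np

def sparse_conv3d(active, weights, bias):
    """active: dict (x, y, z) -> (C_in,) feature; weights: dict of kernel
    offsets (dx, dy, dz) in {-1, 0, 1}^3 -> (C_out, C_in); bias: (C_out,)."""
    out = {}
    for (x, y, z), feat in active.items():    # only visit active sites
        acc = bias.copy()
        for (dx, dy, dz), w in weights.items():
            nbr = active.get((x + dx, y + dy, z + dz))
            if nbr is not None:               # inactive neighbors contribute 0
                acc = acc + w @ nbr
        out[(x, y, z)] = acc
    return out
```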
- Deep learning on point clouds
  - Challenges
    - A point cloud is unrasterized, irregular data
    - Standard convolution cannot be applied directly
- PointNet (sketched below)
  - Representation: N orderless points, each given by a D-dim coordinate
  - Handling orderlessness: max pooling makes the output invariant to the order of the input points
  - Making the network rotation-invariant: estimate the transformation with another network (T-Net)
  - Limitations
    - No local context for each point!

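A minimal PointNet sketch (T-Net omitted): a per-point MLP shared across all points, followed by max pooling, which is what makes the output invariant to the input order:

```python
import torch.nn as nn

class PointNetClassifier(nn.Module):
    def __init__(self, in_dim=3, num_classes=10):
        super().__init__()
        self.mlp = nn.Sequential(             # shared per-point MLP
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, 1024), nn.ReLU(),
        )
        self.head = nn.Linear(1024, num_classes)

    def forward(self, pts):                   # pts: (B, N, in_dim), orderless
        feat = self.mlp(pts)                  # (B, N, 1024)
        global_feat = feat.max(dim=1).values  # symmetric: order-invariant
        return self.head(global_feat)
```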
- PointNet++: multi-scale PointNet
  - Sampling: sample anchor points by Farthest Point Sampling (FPS; sketched below)
  - Grouping: find the neighbourhood of each anchor point
  - Apply PointNet in each neighborhood to mimic convolution
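A sketch of Farthest Point Sampling for the sampling step: greedily pick the point farthest from everything selected so far:

```python
import torch

def farthest_point_sampling(pts, k):
    """pts: (N, 3) points; returns indices (k,) of the sampled anchors."""
    N = pts.shape[0]
    idx = torch.zeros(k, dtype=torch.long)
    idx[0] = torch.randint(N, (1,)).item()    # random seed point
    dist = torch.full((N,), float('inf'))     # distance to selected set
    for i in range(1, k):
        d = ((pts - pts[idx[i - 1]]) ** 2).sum(dim=1)
        dist = torch.minimum(dist, d)         # update nearest-selected dist
        idx[i] = dist.argmax()                # pick the farthest point
    return idx
```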
- 3D semantic segmentation
  - Input: sensor data of a 3D scene (RGB / depth / point cloud …)
  - Output: label each point in the point cloud with a category label
  - Possible solution: fuse 2D segmentation results into 3D, since 2D segmentation already works well (see the sketch below)
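A single-view sketch of the fusion idea (a hypothetical helper that ignores occlusion; a real system would project into many views and vote): project each 3D point into a labeled image and copy the pixel's label back:

```python
import numpy as np

def fuse_2d_labels(points, K, R, t, label_map):
    """points: (N, 3) world coords; K: (3, 3) intrinsics; R, t: world-to-
    camera pose; label_map: (H, W) 2D segmentation. Returns (N,) labels,
    -1 for points not visible in this view."""
    H, W = label_map.shape
    cam = points @ R.T + t                    # world -> camera frame
    pix = cam @ K.T                           # homogeneous pixel coords
    u = (pix[:, 0] / pix[:, 2]).round().astype(int)
    v = (pix[:, 1] / pix[:, 2]).round().astype(int)
    ok = (cam[:, 2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    labels = np.full(len(points), -1)
    labels[ok] = label_map[v[ok], u[ok]]
    return labels
```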
- 3D object detection
  - PointRCNN: RCNN for point clouds
  - Frustum PointNets: use 2D detectors to generate 3D frustum proposals