7 Neural Networks
Deep Learning
- Linear model
  \[ f(x)=w^Tx+b=0 \ \ \ \text{Hyperplane} \newline r=\frac{f(x)}{\|w\|}\ \ \ \text{distance} \]
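  A minimal numpy sketch of the distance formula above; the weights, bias, and point are made-up values for illustration:

  ```python
  import numpy as np

  # Hyperplane f(x) = w^T x + b = 0; the signed distance of a point x
  # to this hyperplane is r = f(x) / ||w||. Values below are arbitrary.
  w = np.array([2.0, -1.0])
  b = 0.5
  x = np.array([1.0, 3.0])

  f_x = w @ x + b                  # f(x) = w^T x + b
  r = f_x / np.linalg.norm(w)      # signed distance to the hyperplane
  print(f_x, r)                    # -0.5, roughly -0.224
  ```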
- Perceptron
  - Single layer: when an example x is misclassified, update \(w_t=w_{t-1}+xy\) (sketched in code below).
  - Multi-layer
  - If an example can be correctly predicted, there is no penalty.
    \[ \begin{aligned} J(w)&=-\sum_{i\in I_M}w^Tx_iy_i \newline \nabla J&=\sum_{i\in I_M}-x_iy_i \newline \text{gradient descent}\quad &w(k+1)=w(k)+\eta(k)\sum_{i\in I^k_M}x_iy_i \end{aligned} \]
  - Batch learning: all samples are available at once.
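  A minimal numpy sketch of the single-layer perceptron update on toy linearly separable data (labels in {-1, +1}; the data, learning rate, and epoch count are made up):

  ```python
  import numpy as np

  def perceptron_train(X, y, epochs=10, eta=1.0):
      """Single-layer perceptron: for each misclassified x_i, w <- w + eta * y_i * x_i."""
      w = np.zeros(X.shape[1])
      for _ in range(epochs):
          for x_i, y_i in zip(X, y):
              if y_i * (w @ x_i) <= 0:      # misclassified (or on the boundary)
                  w += eta * y_i * x_i      # update only on mistakes: no penalty otherwise
      return w

  # Toy linearly separable data (bias folded in as a constant feature).
  X = np.array([[1.0, 2.0, 1.0], [2.0, 1.0, 1.0], [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]])
  y = np.array([1, 1, -1, -1])
  w = perceptron_train(X, y)
  print(w, np.sign(X @ w))   # all signs match y once converged
  ```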
- Online learning / mini-batch learning: the learning algorithm only sees the training data one example (or one small batch) at a time (see the sketch after this list).
  - Fast training and low memory consumption.
  - Works in real time and can quickly adapt to changes in the characteristics of new data.
  - Requires some model design and optimization to keep the algorithm efficient and accurate.
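  A sketch of a mini-batch version of the same perceptron-style update, where each step only looks at a small batch of examples; the batch size, learning rate, and shuffling scheme are illustrative choices, not fixed by the notes:

  ```python
  import numpy as np

  def minibatch_perceptron(X, y, batch_size=2, epochs=5, eta=0.1, seed=0):
      """Mini-batch learning: each update only looks at a small batch of samples."""
      rng = np.random.default_rng(seed)
      w = np.zeros(X.shape[1])
      n = len(X)
      for _ in range(epochs):
          order = rng.permutation(n)                   # stream the data in random order
          for start in range(0, n, batch_size):
              idx = order[start:start + batch_size]
              Xb, yb = X[idx], y[idx]
              mis = yb * (Xb @ w) <= 0                 # misclassified samples in this batch
              w += eta * (yb[mis, None] * Xb[mis]).sum(axis=0)
      return w
  ```

  With batch_size=1 this reduces to pure online learning; with batch_size=len(X) it recovers the batch update above.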
- Mistake bound theorem
- Bias-variance Decomposition
- Any continuous function from input to output can be implemented in a three-layer net, given a sufficient number of hidden units \(n_H\), proper nonlinearities, and weights.
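  A minimal PyTorch sketch of such a three-layer net (input, one hidden layer with \(n_H\) units and a nonlinearity, output); the layer sizes here are arbitrary:

  ```python
  import torch
  import torch.nn as nn

  n_in, n_H, n_out = 2, 64, 1   # input dim, number of hidden units, output dim (arbitrary)

  # Input -> hidden (nonlinearity) -> output: with enough hidden units this
  # family can approximate any continuous function on a compact domain.
  net = nn.Sequential(
      nn.Linear(n_in, n_H),
      nn.Tanh(),
      nn.Linear(n_H, n_out),
  )

  x = torch.randn(8, n_in)      # a batch of 8 random inputs
  print(net(x).shape)           # torch.Size([8, 1])
  ```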
- Requirements on the activation function: nonlinear, bounded above and below, and both the function and its derivative should be continuous and smooth.
  - For the parameters: centered at 0; odd activation functions lead to faster learning.
- Dropout: in each training pass, every node is only kept active with a certain probability, which prevents overfitting; at test time, all nodes are activated.
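  A sketch of inverted dropout, assuming the common convention where p_drop is the probability of dropping a unit and the survivors are rescaled during training so that test time needs no correction:

  ```python
  import numpy as np

  def dropout(h, p_drop=0.5, train=True, rng=None):
      """Inverted dropout: during training each unit survives with probability
      1 - p_drop and the survivors are rescaled; at test time nothing is dropped."""
      if not train or p_drop == 0.0:
          return h                          # test time: all nodes are active
      rng = rng or np.random.default_rng()
      mask = rng.random(h.shape) >= p_drop  # keep mask, drawn independently per unit
      return h * mask / (1.0 - p_drop)      # rescale so the expected activation is unchanged

  h = np.ones((2, 4))
  print(dropout(h, p_drop=0.5, train=True))   # some entries zeroed, survivors scaled to 2.0
  print(dropout(h, p_drop=0.5, train=False))  # unchanged
  ```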
- CNN: convolution and pooling; layers closer to the output can represent more complex features, because their receptive fields are larger.
  - Overall, a big stride lets us reach a large receptive field faster, but it also discards some information (see the sketch below).
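  A small sketch of how the receptive field grows with depth, using the standard recurrence (the receptive field grows by (k - 1) times the current jump, and the jump is multiplied by the stride); the kernel sizes and strides are made up:

  ```python
  def receptive_fields(layers):
      """layers: list of (kernel_size, stride). Returns the receptive field after each layer."""
      rf, jump = 1, 1           # start from a single input pixel
      out = []
      for k, s in layers:
          rf += (k - 1) * jump  # each layer sees (k-1)*jump more input pixels
          jump *= s             # the stride multiplies the step between outputs
          out.append(rf)
      return out

  # Same kernels, different strides: larger strides grow the receptive field
  # faster, but skip over (discard) intermediate positions.
  print(receptive_fields([(3, 1), (3, 1), (3, 1)]))  # [3, 5, 7]
  print(receptive_fields([(3, 2), (3, 2), (3, 2)]))  # [3, 7, 15]
  ```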
- ResNet: composed of multiple residual blocks; each block outputs \(\text{ReLU}(F(x)+x)\), where \(F(x)\) is the block's learned residual mapping.
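  A minimal PyTorch sketch of one residual block of this form (the channel count, kernel size, and two-convolution structure are illustrative choices, not a specific ResNet variant):

  ```python
  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class ResidualBlock(nn.Module):
      """Basic residual block: output = ReLU(F(x) + x), where F is two conv layers."""
      def __init__(self, channels):
          super().__init__()
          self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
          self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

      def forward(self, x):
          out = F.relu(self.conv1(x))
          out = self.conv2(out)
          return F.relu(out + x)   # the skip connection adds the input back

  block = ResidualBlock(16)
  x = torch.randn(1, 16, 8, 8)
  print(block(x).shape)            # torch.Size([1, 16, 8, 8])
  ```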
- Language Modeling: given a sequence of words, compute the probability distribution over the next word:
  $$ P(x^{(t+1)}|x^{(t)},\dots,x^{(1)}) $$
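  A toy sketch of this definition using a bigram (Markov) approximation, i.e. conditioning only on the previous word instead of the full history; the corpus is made up:

  ```python
  from collections import Counter, defaultdict

  corpus = "the cat sat on the mat the cat ate".split()   # made-up toy corpus

  # Count bigrams and estimate P(next | prev) by normalized counts.
  bigram = defaultdict(Counter)
  for prev, nxt in zip(corpus, corpus[1:]):
      bigram[prev][nxt] += 1

  def next_word_dist(prev):
      counts = bigram[prev]
      total = sum(counts.values())
      return {w: c / total for w, c in counts.items()}

  print(next_word_dist("the"))   # {'cat': 0.666..., 'mat': 0.333...}
  ```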
- RNN
  - Softmax turns a vector into probabilities (normalization).
  - The model size does not grow with the input length, because the same weights are used at every timestep, so processing is symmetric across positions (see the sketch below).
  - Drawbacks: the recurrent computation is slow, and because of the vanishing gradient problem it is hard in practice to use information from many timesteps back.
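  A minimal numpy sketch of an RNN language model's forward pass: the same weight matrices are reused at every timestep, and softmax turns the output into a next-word distribution. The names, sizes, and initializations here are illustrative:

  ```python
  import numpy as np

  def softmax(z):
      z = z - z.max()                     # subtract the max for numerical stability
      e = np.exp(z)
      return e / e.sum()

  rng = np.random.default_rng(0)
  d_h, vocab = 8, 5                       # hidden size, vocabulary size (arbitrary)
  W_h = rng.normal(scale=0.1, size=(d_h, d_h))    # hidden-to-hidden, shared across timesteps
  W_x = rng.normal(scale=0.1, size=(d_h, vocab))  # input-to-hidden, shared across timesteps
  W_o = rng.normal(scale=0.1, size=(vocab, d_h))  # hidden-to-output, shared across timesteps

  h = np.zeros(d_h)
  tokens = [0, 3, 1]                      # a toy input sequence of word ids
  for t in tokens:
      x = np.eye(vocab)[t]                # one-hot input
      h = np.tanh(W_h @ h + W_x @ x)      # recurrence: the same weights at every step
      p_next = softmax(W_o @ h)           # distribution over the next word

  print(p_next, p_next.sum())             # sums to 1
  ```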
- Vanishing Gradient Problem: gradient contributions from "far away" steps become zero, and the state at those steps doesn't contribute to what you are learning, so information from far back cannot be used (see the sketch after this list).
  - A good initialization of the weight matrices reduces the effect of vanishing gradients.
  - Use ReLU, so gradients are more likely to survive.
  - Use an LSTM or GRU (or nowadays just go straight to LLaMA?).
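  A quick numpy illustration of the problem itself: backpropagating through many timesteps multiplies the gradient by the recurrent weight matrix over and over, and with small weights its norm decays geometrically (the matrix scale here is arbitrary; with large weights the same product would explode instead):

  ```python
  import numpy as np

  rng = np.random.default_rng(0)
  d = 16
  W = rng.normal(scale=0.1, size=(d, d))    # small recurrent weight matrix (arbitrary scale)
  grad = np.ones(d)                         # pretend gradient arriving at the last timestep

  for step in range(1, 31):
      grad = W.T @ grad                     # one step of backprop through time (nonlinearity ignored)
      if step % 10 == 0:
          print(step, np.linalg.norm(grad)) # the norm shrinks geometrically with distance
  ```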
- LSTM: if the forget gate is set to 1 for a cell dimension and the input gate is set to 0, then the information in that cell is preserved indefinitely.
  - LSTM doesn't guarantee that there is no vanishing/exploding gradient, but it does provide an easier way for the model to learn long-distance dependencies.
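  A minimal numpy sketch of one LSTM cell step; forcing the forget gate to 1 and the input gate to 0, as in the statement above, leaves the cell state unchanged (weights, sizes, and inputs are made up):

  ```python
  import numpy as np

  def sigmoid(z):
      return 1.0 / (1.0 + np.exp(-z))

  def lstm_step(x, h, c, W, b, force_preserve=False):
      """One LSTM step. W maps [x; h] to the four gate pre-activations (i, f, o, g)."""
      z = W @ np.concatenate([x, h]) + b
      i, f, o, g = np.split(z, 4)
      i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
      if force_preserve:        # forget gate = 1, input gate = 0: the cell is copied through
          f, i = np.ones_like(f), np.zeros_like(i)
      c_new = f * c + i * g
      h_new = o * np.tanh(c_new)
      return h_new, c_new

  rng = np.random.default_rng(0)
  d_x, d_h = 3, 4
  W = rng.normal(scale=0.1, size=(4 * d_h, d_x + d_h))
  b = np.zeros(4 * d_h)
  x, h, c = rng.normal(size=d_x), np.zeros(d_h), rng.normal(size=d_h)

  _, c1 = lstm_step(x, h, c, W, b, force_preserve=True)
  print(np.allclose(c1, c))   # True: the cell state is preserved
  ```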