On Activation Regularization


Activation Regularization - serp.ai
A Gentle Introduction to Activation Regularization

\[\alpha L_{2}\left(m \circ h_{t}\right)\]

Here, \(m\) refers to the dropout mask used later in the model, \(\alpha\) is a scaling coefficient, and \(h_{t}\) is the output of the RNN at timestep \(t\). The \(L_{2}\) norm is used to calculate the magnitude of the activations, and the result is scaled by \(\alpha\). This encourages small activations, ultimately leading to better performance and generalization in the model.
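As a concrete reading of the formula, here is a minimal PyTorch-style sketch (not from any particular codebase; the layer sizes, `alpha`, and the helper `loss_with_ar` are illustrative assumptions). It adds \(\alpha L_{2}\left(m \circ h_{t}\right)\) to a language-model loss, reusing the dropped-out RNN output so the penalty sees the same mask \(m\), and follows the common practice of taking the mean of squared activations as the \(L_{2}\) term:

```python
import torch
import torch.nn as nn

alpha = 2.0                      # AR scaling coefficient (hypothetical value)
rnn = nn.LSTM(input_size=128, hidden_size=256, batch_first=True)
dropout = nn.Dropout(p=0.5)      # supplies the dropout mask m

def loss_with_ar(inputs, targets, decoder, criterion):
    h, _ = rnn(inputs)                 # h stacks the RNN outputs h_t over all timesteps
    h_dropped = dropout(h)             # m ∘ h_t: the same dropped output is fed onward
    logits = decoder(h_dropped)
    ce = criterion(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    ar = alpha * h_dropped.pow(2).mean()   # penalty on the magnitude of the masked activations
    return ce + ar
```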

A Gentle Introduction to Activation Regularization equates an activation with a value in a feature. At first I did not understand this, until I read the passage below:

These internal representations are tangible things. The output of a hidden layer within the network represents the features learned by the model at that point in the network.

My earlier understanding was too narrow: from the point of view of the next layer, a single post-activation value is indeed just an input; but if you take all the activation values a layer outputs as a whole, and view them with respect to the current layer, they are exactly that layer's feature map. Seen this way, Activation Regularization limits the magnitude of the values in the feature map.

Other names for activation regularization

activation regularization, activity regularization, representation regularization, sparse feature learning
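Keras, for instance, exposes this idea under the second name via the `activity_regularizer` argument, which penalizes a layer's output (its activations) rather than its weights (`kernel_regularizer`); the layer size and coefficient below are arbitrary:

```python
import tensorflow as tf

# Penalize the layer's *output* (its activations), not its weights.
layer = tf.keras.layers.Dense(
    64,
    activation="relu",
    activity_regularizer=tf.keras.regularizers.l1(1e-4),  # encourages sparse activations
)
```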

Why?

https://stats.stackexchange.com/questions/45643/why-l1-norm-for-sparse-models covers this from a numerical standpoint.

And from the derivative:
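A quick sketch of the derivative argument (my own summary, assuming a single activation \(a\) and penalty weight \(\lambda\)):

\[\frac{\partial}{\partial a}\,\lambda\lvert a\rvert = \lambda\operatorname{sign}(a), \qquad \frac{\partial}{\partial a}\,\lambda a^{2} = 2\lambda a\]

The \(L_{1}\) gradient has constant magnitude \(\lambda\), so the penalty keeps pushing an activation until it reaches exactly zero; the \(L_{2}\) gradient shrinks in proportion to \(a\), so small activations feel almost no pressure and end up small but rarely exactly zero. This is why \(L_{1}\) is associated with sparse activations and \(L_{2}\) with merely small ones.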

Since the latent code is itself an activation, is that why autoencoders tend to use activity regularization?
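This is at least how the classic sparse autoencoder is built: a sparsity penalty (L1, or a KL term toward a target activation rate) on the latent activations. A minimal L1-flavoured sketch, with layer sizes and the coefficient `beta` chosen purely for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU())
decoder = nn.Sequential(nn.Linear(64, 784), nn.Sigmoid())
beta = 1e-3                                # sparsity weight (hypothetical)

def sparse_ae_loss(x):
    z = encoder(x)                         # latent code = the bottleneck's activations
    x_hat = decoder(z)
    recon = F.mse_loss(x_hat, x)
    sparsity = beta * z.abs().mean()       # L1 activity regularization on the code
    return recon + sparsity
```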

What is the advantage of using AR?

For compression, a sparse matrix (many zero entries) helps the compression ratio in DCT-based coding, so if the latent code were entropy-coded with traditional methods, adding AR (which increases the sparsity of the activations, i.e. of the latent code) would make sense. But autoencoders are not used only for compression, so what is the reason for using AR there? And why does DCVC's loss function include no regularization term?
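On the compression point, a toy check of the intuition (entirely my own illustration, not from DCVC or any real codec): a quantized code with many zeros has lower empirical entropy, so an entropy coder needs fewer bits per symbol for it.

```python
import numpy as np

def entropy_bits_per_symbol(codes):
    """Empirical Shannon entropy of a discrete code, in bits per symbol."""
    _, counts = np.unique(codes, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
dense = np.round(rng.normal(0, 3, size=10_000)).astype(int)   # quantized code, few zeros
sparse = dense * (rng.random(10_000) < 0.2)                   # ~80% of entries zeroed
print(entropy_bits_per_symbol(dense))    # higher: more bits per symbol
print(entropy_bits_per_symbol(sparse))   # lower: sparsity pays off under entropy coding
```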