Numerical Stability

Why: when a computation involves real numbers, such as the circle constant $\pi$, the fractional part is infinite and a computer cannot represent it exactly, so an approximate value is substituted instead; in that case the error relatively...

June 25, 2023 · 2100 words

Bias Variance Decomposition

Introduction: we denote the training set by $\mathcal{D}$ and draw from it a sample $\boldsymbol{x}$, whose training-set label is $y_{\mathca...

June 21, 2023 · 991 words

Noise Contrastive Estimation

The unbearable weight: text generation is a typical class of NLP task. Denote the parameters by $\boldsymbol{\theta}$ and the given context by $\boldsymbol{c}$...

May 29, 2023 · 4178 words

Fast Greedy MAP Inference for DPP

Problem: first, some terminology. Let $\mathcal{S}$ denote the set of selected elements and $\mathcal{R}$ the set of unselected elements; $\mathbf{L}...

May 16, 2023 · 4188 words

Determinantal Point Process

In machine learning we often face the following problem: given a set $\mathbf{S}$, find $k$ samples forming a subset $\mathbf{V}$ so that the sub...

April 21, 2023 · 2889 words

Generalized Linear Models

Definition: a distribution that can be written in the following form is said to be a member of the exponential family: $$ \begin{equation} p(y; \eta) = b(y)\exp(\eta^{\mathrm{T}}T(y) - a(\eta)) \end{equation} $$ where $\eta$ is called the natural parameter of the distribution (n...

February 17, 2023 · 3664 words
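As a standard worked example of the definition above (not taken from the post itself), the Bernoulli distribution with mean $\phi$ can be rewritten in exponential-family form:

```latex
\begin{align}
p(y; \phi) &= \phi^{y}(1-\phi)^{1-y} \\
           &= \exp\!\big( y \log\tfrac{\phi}{1-\phi} + \log(1-\phi) \big)
\end{align}
```

Matching terms against $p(y;\eta) = b(y)\exp(\eta^{\mathrm{T}}T(y) - a(\eta))$ gives $\eta = \log\frac{\phi}{1-\phi}$, $T(y) = y$, $a(\eta) = -\log(1-\phi) = \log(1+e^{\eta})$, and $b(y) = 1$.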

Diving in distributed training in PyTorch

Many tutorials on this topic are vague, miss the underlying principles, and ship code that is hard to run, so I spent a few days digging into the relevant principles and implementation. Feedback and corrections are welcome! The code is open-sourced here: DL-Tools Cache effective tools for deep...

November 20, 2022 · 5172 words

Going Deeper into Back-Propagation

1. Gradient descent optimization Gradient-based methods use gradient information to adjust the parameters. Among them, gradient descent may be the simplest: it moves the parameters a small step in the direction of the negative gradient. $$ \boldsymbol{w}^{\tau + 1} = \boldsymbol{w}^{\tau} - \eta \nabla_{\boldsymbol{w}^{\tau}} E \tag{1.1} $$ where $\eta, \tau, E$ denote the learning rate ($\eta > 0$), the iteration step, and the loss function. Wait!...

September 7, 2022 · 1051 words
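The update rule (1.1) from the excerpt above can be sketched in a few lines of Python. This is a minimal illustration, not code from the post; the quadratic loss, learning rate, and step count are placeholder choices.

```python
import numpy as np

def gradient_descent(grad, w0, eta=0.1, steps=100):
    """Repeatedly apply update (1.1): w <- w - eta * grad(w)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

# Example: minimize E(w) = ||w||^2, whose gradient is 2w.
# Each step shrinks w by a factor (1 - 2*eta), driving it toward the origin.
w_star = gradient_descent(lambda w: 2 * w, w0=[3.0, -4.0], eta=0.1, steps=100)
print(w_star)  # very close to [0, 0]
```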