## Generalized Linear Models

Definition: A distribution is said to be a member of the Exponential Family if it can be expressed in the following form: $$ p(y; \eta ) = b(y)\exp(\eta^{\mathbf{T}}T(y) - a(\eta )) $$ where $\eta$ is called the natural parameter of the distribution...
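As a quick check of the definition, the Bernoulli distribution can be rewritten in this form (a standard worked example, not from the excerpt above):

$$
p(y; \phi) = \phi^{y}(1-\phi)^{1-y}
= \exp\!\Big( y \log\frac{\phi}{1-\phi} + \log(1-\phi) \Big),
$$

which matches the exponential-family form with
$$
b(y) = 1, \quad T(y) = y, \quad \eta = \log\frac{\phi}{1-\phi}, \quad a(\eta) = -\log(1-\phi) = \log(1 + e^{\eta}).
$$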

## 新的主题

This is the first post under the new Hugo theme. The site was previously rendered with Jekyll, so the Hello World post has been updated as well. Why the change? Why move from Jekyll to Hugo? There are a few reasons: the old theme...

## Diving into Distributed Training in PyTorch

Many tutorials on this topic are vague, miss the underlying principles, and ship code that is hard to run, so I spent a few days digging into the relevant principles and implementations. Criticism and corrections are welcome! The code for this part can be found...

## Going Deeper into Back-propagation

1. Gradient descent optimization Gradient-based methods use gradient information to adjust the parameters. Among them, gradient descent may be the simplest: it moves the parameters a small step in the direction of the negative gradient. $$ \mathbf{w}^{\tau + 1} = \mathbf{w}^{\tau} - \eta \nabla_{\mathbf{w}^{\tau}} E \tag{1.1} $$ where \(\eta, \tau, E\) denote the learning rate (\(\eta > 0\)), the iteration step, and the loss function....
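The update rule (1.1) can be sketched in a few lines of Python. The quadratic loss \(E(w) = (w - 3)^2\), the learning rate, and the step count below are illustrative choices of mine, not from the post:

```python
def grad_E(w):
    # Gradient of the illustrative loss E(w) = (w - 3)^2 with respect to w.
    return 2.0 * (w - 3.0)

def gradient_descent(w0, eta=0.1, steps=100):
    # Iterate Eq. (1.1): w^{tau+1} = w^{tau} - eta * grad E(w^{tau}).
    w = w0
    for _ in range(steps):
        w = w - eta * grad_E(w)  # small step against the gradient
    return w

w_star = gradient_descent(w0=0.0)  # converges toward the minimizer w = 3
```

With this convex loss the iterate contracts toward the minimizer by a constant factor per step; for non-convex losses (as in neural networks) the same rule only guarantees movement toward a local decrease of \(E\).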

## Tips for Training Neural Networks

Recently, I read a blog post about training neural networks (abbreviated as NN in the rest of this post), and it is really amazing. In this post I am going to add my own experience while summarizing the interesting parts of that blog. Nowadays, it seems that training NNs is extremely easy, since there are plenty of free frameworks that are simple to use (e.g. PyTorch, Numpy, Tensorflow). Well, training NNs is easy when you are copying others’ work (e....

## Quotes of Mathematicians

Life is complex, and it has both real and imaginary parts. — Someone

Basically, I’m not interested in doing research and I never have been… I’m interested in understanding, which is quite a different thing. And often to understand something you have to work it out yourself because no one else has done it. — David Blackwell

To not know maths is a severe limitation to understanding the world. — Richard Feynman...

## Retrieval-Enhanced Transformer

Problems to solve: scale down the model size while maintaining performance, and incorporate external memory retrieval into large language model modeling. How? Data construction: \(\text{MassiveText}\) is used for both training and retrieval data (contains 5 trillion tokens); SentencePiece with a vocabulary of \(128K\) tokens; during training, \(600B\) tokens are retrieved from the training data; the evaluation set contains \(1.75T\) tokens. Test set leakage: due to the huge retrieval database, the test set may have appeared in the training set....