PyTorch Study Notes

Tips and code snippets picked up while learning deep learning with PyTorch.

Tips

Loss (NaN)

The loss suddenly becomes very large, or NaN, during training.
Adding torch.cuda.amp.GradScaler() can fix the loss going NaN/Inf (a common issue with mixed-precision training).
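
A minimal sketch of the mixed-precision loop GradScaler is used in (the model, data, and hyperparameters below are made up for illustration):

import torch
from torch import nn

model = nn.Linear(10, 1).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
scaler = torch.cuda.amp.GradScaler()

for _ in range(100):
    x = torch.randn(32, 10, device='cuda')
    y = torch.randn(32, 1, device='cuda')
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():   # run the forward pass in mixed precision
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()     # scale the loss so fp16 grads don't underflow
    scaler.step(optimizer)            # unscales grads; skips the step if they are Inf/NaN
    scaler.update()                   # adjust the scale factor for the next iteration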

Environment Setup

Windows

Reference: 关于国内conda安装cuda11.6+pytorch的那些事 – 王大神 (dashen.wang) (on installing CUDA 11.6 + PyTorch via conda from mirrors in China)

Create a virtual environment with Miniconda

  • conda create -n mine python=3.8
  • conda activate mine

Install CUDA-enabled PyTorch

Switch to mirror channels:
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/menpo/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/bioconda/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/msys2/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
conda config --set show_channel_urls true

Install:
conda install pytorch torchvision torchaudio pytorch-cuda=11.6

This can fail with package conflicts:

Found conflicts:
Package pytorch conflicts for:
torchaudio -> pytorch[version='1.10.0|1.10.1|1.10.2|1.11.0|1.12.0|1.12.1|1.13.0|1.13.1|2.0.0|2.0.1|1.9.1|1.9.0|1.8.1|1.8.0|1.7.1|1.7.0|1.6.0']
torchvision -> pytorch[version='1.10.0|1.10.1|1.10.2|2.0.1|2.0.0|1.13.1|1.13.0|1.12.1|1.12.0|1.11.0|1.9.1|1.9.0|1.8.1|1.8.0|1.7.1|1.7.0|1.6.0|1.5.1']
...

In that case, install with the following command:
> conda install -c gpytorch gpytorch

Install a CUDA-enabled torch with pip:
pip install torch==2.0.0+cu118 torchvision==0.15.1+cu118 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu118 --user
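
A quick sanity check after installation (outputs depend on your setup):

import torch
print(torch.__version__)          # e.g. 2.0.0+cu118
print(torch.version.cuda)         # CUDA version the wheel was built against
print(torch.cuda.is_available())  # True if this build can see a GPU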

GPU

NeuS:
torch.set_default_tensor_type('torch.cuda.FloatTensor')  # newly created tensors default to CUDA floats
parser.add_argument('--gpu', type=int, default=0)
torch.cuda.set_device(args.gpu)

self.device = torch.device('cuda')
network = Network(**self.conf['model.nerf']).to(self.device)

#################################################################
NeRF:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = NeRF().to(device)
render_poses = torch.Tensor(render_poses).to(device)

torch.device('cpu'), torch.device('cuda'), torch.device('cuda:1')
With multiple GPUs, `torch.device(f'cuda:{i}')` refers to the i-th GPU (i starts at 0); `cuda:0` and `cuda` are equivalent.

Query the number of GPUs:
torch.cuda.device_count()

Query the device a tensor lives on:
x = torch.tensor([1, 2, 3])
x.device  # device(type='cpu'); tensors are created on the CPU by default

Two tensors must live on the same device before they can be used in one operation, e.g. Z = X.cuda(1) copies X to GPU 1.
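
A minimal sketch of that rule, assuming a machine with two GPUs:

import torch

X = torch.ones(2, 3, device='cuda:0')  # lives on GPU 0
Y = torch.rand(2, 3, device='cuda:1')  # lives on GPU 1
# X + Y would raise an error: the operands sit on different devices
Z = X.cuda(1)                          # copy X over to GPU 1
print(Y + Z)                           # fine: both operands are on cuda:1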


Assign a device to a network:
net = nn.Sequential(nn.Linear(3, 1))
net = net.to(device=try_gpu())
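
try_gpu() is not a PyTorch built-in; a minimal sketch of such a helper (in the spirit of the d2l book):

import torch

def try_gpu(i=0):
    """Return torch.device('cuda:i') if it exists, otherwise the CPU."""
    if torch.cuda.device_count() >= i + 1:
        return torch.device(f'cuda:{i}')
    return torch.device('cpu')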

==As long as all of the data and parameters live on the same device, the model can be trained efficiently.==

Autograd

PyTorch 的 Autograd - 知乎 (zhihu.com)
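
As a quick refresher, a minimal autograd example (values made up):

import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()   # y = x0^2 + x1^2
y.backward()         # reverse-mode autodiff fills in x.grad
print(x.grad)        # tensor([4., 6.]), i.e. dy/dx = 2x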

Optimizer

Pass the parameters of several models to a single Adam optimizer, then update the lr.

Adam in NeuS (params_to_train is a list):

params_to_train = []
self.nerf_outside = NeRF(**self.conf['model.nerf']).to(self.device)  # build the outer NeRF network
self.sdf_network = SDFNetwork(**self.conf['model.sdf_network']).to(self.device)  # build the SDF network
self.deviation_network = SingleVarianceNetwork(**self.conf['model.variance_network']).to(self.device)
self.color_network = RenderingNetwork(**self.conf['model.rendering_network']).to(self.device)
params_to_train += list(self.nerf_outside.parameters())
params_to_train += list(self.sdf_network.parameters())
params_to_train += list(self.deviation_network.parameters())
params_to_train += list(self.color_network.parameters())

self.optimizer = torch.optim.Adam(params_to_train, lr=self.learning_rate)

Then update the learning rate; each element of self.optimizer.param_groups (g = self.optimizer.param_groups[index]) is a dict whose 'lr' key can be assigned:

for g in self.optimizer.param_groups:
    g['lr'] = self.learning_rate * learning_factor

Taken from NeuS:

def update_learning_rate(self):
    if self.iter_step < self.warm_up_end:
        learning_factor = self.iter_step / self.warm_up_end
    else:
        alpha = self.learning_rate_alpha
        progress = (self.iter_step - self.warm_up_end) / (self.end_iter - self.warm_up_end)
        learning_factor = (np.cos(np.pi * progress) + 1.0) * 0.5 * (1 - alpha) + alpha

    for g in self.optimizer.param_groups:
        g['lr'] = self.learning_rate * learning_factor
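
The same warmup + cosine decay can also be expressed as a LambdaLR; a sketch with made-up values for warm_up_end, end_iter, and alpha (LambdaLR multiplies the initial lr by the returned factor):

import numpy as np
import torch

def lr_lambda(step, warm_up_end=5000, end_iter=300000, alpha=0.05):
    if step < warm_up_end:
        return step / warm_up_end                 # linear warmup
    progress = (step - warm_up_end) / (end_iter - warm_up_end)
    return (np.cos(np.pi * progress) + 1.0) * 0.5 * (1 - alpha) + alpha  # cosine decay to alpha

model = torch.nn.Linear(3, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# call scheduler.step() once per iteration, after optimizer.step()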

Learning rate (lr)

lr scheduler介绍和可视化 - 知乎 (zhihu.com)

lr_scheduler.<name>LR

| Scheduler | Brief |
| --- | --- |
| ConstantLR | lr is init_lr * factor for the first total_iters iterations |
| CosineAnnealingLR | follows a cosine curve with period 2*T_max over the lr range [eta_min, init_lr]; the curve is shifted left by last_epoch iterations |
| CyclicLR | three modes: triangular (constant amplitude), triangular2 (amplitude halved each cycle), exp_range (amplitude scaled to $\gamma^{iteration}$ each cycle iteration) |
| ExponentialLR | decays lr exponentially: $\gamma^{iter}$ |
| LambdaLR | sets lr via a user-defined lambda |
| StepLR | multiplies lr by gamma every step_size steps |
| MultiStepLR | multiplies lr by gamma at each milestone, e.g. milestones=[30, 80] |
| OneCycleLR | not chainable; lr first rises to max_lr, then decreases; the peak is at step total_steps * pct_start = epochs * steps_per_epoch * pct_start |
| LinearLR | grows linearly from init_lr * start_factor to init_lr * end_factor over total_iters steps |
| MultiplicativeLR | decays lr non-linearly from init_lr via lr_lambda = lambda step: factor, i.e. $lr = init\_lr \cdot factor^{step}$ |

Composing multiple schedulers

| Scheduler | Brief |
| --- | --- |
| SequentialLR | runs scheduler1 before the milestone and scheduler2 after it |
| ChainedScheduler | stacks several chainable schedulers, applying all of them at each step |

Tips

When composing schedulers with SequentialLR, each scheduler starts from its own step 0 at its milestone, so Step_scheduler's milestones has to be set to milestones=[1]. A side effect of this setting: when Exp_scheduler finishes, the lr first jumps back up to ori_lr and then steps down to ori_lr * 0.4.

Con_scheduler = optim.lr_scheduler.ConstantLR(optimizer, factor=1.0, total_iters=total_iters)
Exp_scheduler = optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)
Step_scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[1], gamma=0.4)
scheduler = optim.lr_scheduler.SequentialLR(optimizer, schedulers=[Con_scheduler, Exp_scheduler, Step_scheduler], milestones=[total_iters, total_iters + exp_iters])
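
A runnable version of the snippet above with made-up values for total_iters, exp_iters, and gamma; printing the lr trajectory shows the jump described in the tip:

import torch
from torch import nn, optim

model = nn.Linear(3, 1)
optimizer = optim.SGD(model.parameters(), lr=0.1)

total_iters, exp_iters, gamma = 5, 10, 0.9
Con_scheduler = optim.lr_scheduler.ConstantLR(optimizer, factor=1.0, total_iters=total_iters)
Exp_scheduler = optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)
Step_scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[1], gamma=0.4)
scheduler = optim.lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[Con_scheduler, Exp_scheduler, Step_scheduler],
    milestones=[total_iters, total_iters + exp_iters])

for step in range(20):
    optimizer.step()   # a real loop would compute a loss and backprop first
    scheduler.step()
    print(step, optimizer.param_groups[0]['lr'])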

Private members

Dunder (double-underscore) methods

| Function | Brief description |
| --- | --- |
| nn.Module.__repr__ | runs when you call print(model) |
| __del__ | runs when the object is deleted (del obj) |
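
A small sketch of both dunders in action (the class and messages are made up):

import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(3, 1)

    def __repr__(self):
        # runs on print(model) / repr(model)
        return 'TinyNet(fc: 3 -> 1)'

    def __del__(self):
        # runs when the object is about to be destroyed
        print('TinyNet deleted')

model = TinyNet()
print(model)   # TinyNet(fc: 3 -> 1)
del model      # prints 'TinyNet deleted' once the last reference is gone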

torch.cuda

Timing code with CUDA events

iter_start = torch.cuda.Event(enable_timing=True)
iter_end = torch.cuda.Event(enable_timing=True)
iter_start.record()
# iter 1 code
iter_end.record()
torch.cuda.synchronize()  # wait until the recorded events have completed

print(f'iter time: {iter_start.elapsed_time(iter_end)}')  # in milliseconds

e.g.:

import torch

iter_start = torch.cuda.Event(enable_timing=True)
iter_end = torch.cuda.Event(enable_timing=True)
iter_start.record()

a = torch.tensor([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).cuda()

iter_end.record()
torch.cuda.synchronize()  # make sure both events have completed before reading

timestamp = iter_start.elapsed_time(iter_end)  # milliseconds
print(f'iter time: {timestamp:.3f}')
