Install the driver:
    # Purge every other installed version of the NVIDIA driver
    # (the pattern is quoted so the shell does not expand it against local files)
    $ sudo apt-get purge 'nvidia-*'

    $ sudo reboot

    # nvidia-smi
    $ sudo apt install nvidia-utils-470

    # Driver
    $ sudo apt install nvidia-driver-470

    # CUDA 11.3
    $ sudo apt install nvidia-cuda-toolkit

    $ sudo apt-get update

    # Some driver packages may have updates; apply them, or the driver may still misbehave
    $ sudo apt-get dist-upgrade

    $ sudo apt-get autoremove

    # Reboot; otherwise parts of the driver may not work correctly
    $ sudo reboot
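After the final reboot it can help to confirm that the driver is actually loaded before moving on. The following is a minimal Python sketch, not part of the original steps. It assumes nvidia-smi is on PATH (installed above through nvidia-utils-470) and relies on its --query-gpu CSV output; the helper names parse_gpu_line and query_gpus are made up for illustration.

```python
import shutil
import subprocess

# Fields requested from nvidia-smi; the order must match parse_gpu_line below.
QUERY_FIELDS = "name,driver_version,memory.total"


def parse_gpu_line(line):
    """Split one CSV line of nvidia-smi output into a small dict."""
    # rsplit keeps the GPU name intact even if it were to contain a comma.
    name, driver, memory = [part.strip() for part in line.rsplit(",", 2)]
    return {"name": name, "driver_version": driver, "memory_total": memory}


def query_gpus():
    """Return one dict per GPU, or an empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # A non-zero exit usually means the driver cannot be reached.
        return []
    return [parse_gpu_line(row) for row in result.stdout.splitlines() if row.strip()]


if __name__ == "__main__":
    gpus = query_gpus()
    if not gpus:
        print("No GPU reported; the driver may not be installed or loaded.")
    for gpu in gpus:
        print(gpu)
```

An empty result after the reboot suggests repeating the purge and install steps above.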
Create an isolated build environment in Anaconda, then run the build:
    # wget https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh

    # Download from a mirror in China
    $ wget https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-2021.11-Linux-x86_64.sh

    $ bash Anaconda3-*-Linux-x86_64.sh

    # Update conda to the latest version
    $ conda update -n base -c defaults conda
To speed up downloads, see "Switching Anaconda conda to a China mirror".
Build and configure StyleGAN3
    $ sudo apt-get install git

    $ git clone git@github.com:NVlabs/stylegan3.git

    $ cd stylegan3

    $ conda env create -f environment.yml

    $ conda activate stylegan3

    $ pip install psutil

    # cuDNN acceleration
    $ conda install cudnn

    # Tested on an RTX 3060 12GB: a batch of 2 is recommended; larger values run out of memory (OOM)
    # When the batch is below 4, --mbstd-group=2 must be passed as well
    $ python train.py --outdir=~/training-runs --cfg=stylegan3-t --data=~/datasets/metfaces-1024x1024.zip --gpus=1 --batch=2 --mbstd-group=2 --gamma=8.2 --mirror=1 --metrics=none
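The relationship between --batch and --mbstd-group in the training command is easy to forget when experimenting with different batch sizes. As a rough sketch only, the helper below assembles the same train.py arguments and applies the rule from the note above (below batch 4, also pass --mbstd-group=2). The function name build_train_command is hypothetical, and the rule encodes only what was observed in this post, not a general StyleGAN3 constraint. Paths are expanded in Python so the command does not depend on how the shell treats ~.

```python
import os
import shlex


def build_train_command(outdir, data, batch, cfg="stylegan3-t", gamma=8.2):
    """Assemble the train.py command line used in this post as a list of arguments."""
    args = [
        "python", "train.py",
        f"--outdir={os.path.expanduser(outdir)}",
        f"--cfg={cfg}",
        f"--data={os.path.expanduser(data)}",
        "--gpus=1",
        f"--batch={batch}",
    ]
    if batch < 4:
        # Rule taken from the note in this post: below batch 4, add --mbstd-group=2.
        args.append("--mbstd-group=2")
    args += [f"--gamma={gamma}", "--mirror=1", "--metrics=none"]
    return args


if __name__ == "__main__":
    cmd = build_train_command(
        "~/training-runs", "~/datasets/metfaces-1024x1024.zip", batch=2
    )
    # Print a copy-pasteable command instead of launching a multi-day run.
    print(shlex.join(cmd))
```

Printing the command rather than running it keeps the sketch side-effect free; the list can be passed to subprocess.run from inside the stylegan3 directory if launching from Python is preferred.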
If the following error is reported:
    Constructing networks...
    Setting up PyTorch plugin "bias_act_plugin"... Failed!
    Traceback (most recent call last):
      File "~/source/stylegan3/train.py", line 286, in <module>
        main() # pylint: disable=no-value-for-parameter
      File "~/anaconda3/envs/stylegan3/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
        return self.main(*args, **kwargs)
      File "~/anaconda3/envs/stylegan3/lib/python3.9/site-packages/click/core.py", line 1053, in main
        rv = self.invoke(ctx)
      File "~/anaconda3/envs/stylegan3/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "~/anaconda3/envs/stylegan3/lib/python3.9/site-packages/click/core.py", line 754, in invoke
        return __callback(*args, **kwargs)
      File "~/source/stylegan3/train.py", line 281, in main
        launch_training(c=c, desc=desc, outdir=opts.outdir, dry_run=opts.dry_run)
      File "~/source/stylegan3/train.py", line 96, in launch_training
        subprocess_fn(rank=0, c=c, temp_dir=temp_dir)
      File "~/source/stylegan3/train.py", line 47, in subprocess_fn
        training_loop.training_loop(rank=rank, **c)
      File "~/source/stylegan3/training/training_loop.py", line 168, in training_loop
        img = misc.print_module_summary(G, [z, c])
      File "~/source/stylegan3/torch_utils/misc.py", line 216, in print_module_summary
        outputs = module(*inputs)
      File "~/anaconda3/envs/stylegan3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1071, in _call_impl
        result = forward_call(*input, **kwargs)
      File "~/source/stylegan3/training/networks_stylegan3.py", line 511, in forward
        ws = self.mapping(z, c, truncation_psi=truncation_psi, truncation_cutoff=truncation_cutoff, update_emas=update_emas)
      File "~/anaconda3/envs/stylegan3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1071, in _call_impl
        result = forward_call(*input, **kwargs)
      File "~/source/stylegan3/training/networks_stylegan3.py", line 151, in forward
        x = getattr(self, f'fc{idx}')(x)
      File "~/anaconda3/envs/stylegan3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1071, in _call_impl
        result = forward_call(*input, **kwargs)
      File "~/source/stylegan3/training/networks_stylegan3.py", line 100, in forward
        x = bias_act.bias_act(x, b, act=self.activation)
      File "~/source/stylegan3/torch_utils/ops/bias_act.py", line 84, in bias_act
        if impl == 'cuda' and x.device.type == 'cuda' and _init():
      File "~/source/stylegan3/torch_utils/ops/bias_act.py", line 41, in _init
        _plugin = custom_ops.get_plugin(
      File "~/source/stylegan3/torch_utils/custom_ops.py", line 136, in get_plugin
        torch.utils.cpp_extension.load(name=module_name, build_directory=cached_build_dir,
      File "~/anaconda3/envs/stylegan3/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1080, in load
        return _jit_compile(
      File "~/anaconda3/envs/stylegan3/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1318, in _jit_compile
        return _import_module_from_library(name, build_directory, is_python_module)
      File "~/anaconda3/envs/stylegan3/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1701, in _import_module_from_library
        module = importlib.util.module_from_spec(spec)
      File "<frozen importlib._bootstrap>", line 565, in module_from_spec
      File "<frozen importlib._bootstrap_external>", line 1173, in create_module
      File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
    ImportError: ~/anaconda3/envs/stylegan3/lib/python3.9/site-packages/torch/lib/../../../../libstdc++.so.6: version `GLIBCXX_3.4.29' not found (required by ~/.cache/torch_extensions/bias_act_plugin/3cb576a0039689487cfba59279dd6d46-nvidia-geforce-rtx-3060/bias_act_plugin.so)
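The error above occurs because the plugin was compiled under Anaconda against a newer libstdc++.so, while at run time the older libstdc++.so inside the Anaconda environment is loaded. That older library does not provide the GLIBCXX_3.4.29 version the compiled plugin requires.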
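One way to confirm this diagnosis is to list the GLIBCXX version tags that a given libstdc++.so.6 contains. The sketch below is a heuristic in the spirit of running strings on the library and filtering for GLIBCXX: it scans the raw bytes for version tags rather than parsing the ELF symbol tables. The function names glibcxx_versions and provides are made up for illustration.

```python
import re

# Version tags look like GLIBCXX_3.4.29 and are stored as plain ASCII in the library.
GLIBCXX_TAG = re.compile(rb"GLIBCXX_(\d+(?:\.\d+)+)")


def glibcxx_versions(path):
    """Return the sorted GLIBCXX version tuples found in the file at `path`."""
    with open(path, "rb") as handle:
        data = handle.read()
    found = set()
    for match in GLIBCXX_TAG.finditer(data):
        text = match.group(1).decode("ascii")
        found.add(tuple(int(number) for number in text.split(".")))
    return sorted(found)


def provides(path, wanted):
    """Check whether the file at `path` mentions the wanted version, e.g. '3.4.29'."""
    target = tuple(int(number) for number in wanted.split("."))
    return target in glibcxx_versions(path)
```

For the case in this post, calling provides on the environment's copy (the path in the error message resolves to ~/anaconda3/envs/stylegan3/lib/libstdc++.so.6, expanded with os.path.expanduser) with "3.4.29" would be expected to return False before the fix and True afterwards.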
Once the cause is understood, the fix is fairly simple: manually upgrade the libstdc++.so shared library inside the Anaconda environment, as follows:
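    $ conda activate stylegan3

    $ conda install cmake

    $ conda install make

    # The key upgrade command: updates the libstdc++.so used by this environment
    $ conda install -c conda-forge libstdcxx-ng

    # Remove the build cache left over from the failed attempt
    # (only the torch_extensions directory named in the error message, not all of ~/.cache)
    $ rm -rf ~/.cache/torch_extensions

    # Tested on an RTX 3060 12GB: a batch of 2 is recommended; larger values run out of memory (OOM)
    # With batch=4, OOM was reported on day 11 of training
    # When the batch is below 4, --mbstd-group=2 must be passed as well
    $ python train.py --outdir=~/training-runs --cfg=stylegan3-t --data=~/datasets/metfaces-1024x1024.zip --gpus=1 --batch=2 --mbstd-group=2 --gamma=8.2 --mirror=1 --metrics=none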
In testing so far, training with batch=4 reported OOM on day 11, as shown below:
    tick 444 kimg 1776.0 time 11d 17h 14m sec/tick 2292.6 sec/kimg 573.16 maintenance 0.2 cpumem 5.40 gpumem 7.69 reserved 10.03 augment 0.344
    Traceback (most recent call last):
      File "~/source/stylegan3/train.py", line 286, in <module>
        main() # pylint: disable=no-value-for-parameter
      File "~/anaconda3/envs/stylegan3/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
        return self.main(*args, **kwargs)
      File "~/anaconda3/envs/stylegan3/lib/python3.9/site-packages/click/core.py", line 1053, in main
        rv = self.invoke(ctx)
      File "~/anaconda3/envs/stylegan3/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "~/anaconda3/envs/stylegan3/lib/python3.9/site-packages/click/core.py", line 754, in invoke
        return __callback(*args, **kwargs)
      File "~/source/stylegan3/train.py", line 281, in main
        launch_training(c=c, desc=desc, outdir=opts.outdir, dry_run=opts.dry_run)
      File "~/source/stylegan3/train.py", line 96, in launch_training
        subprocess_fn(rank=0, c=c, temp_dir=temp_dir)
      File "~/source/stylegan3/train.py", line 47, in subprocess_fn
        training_loop.training_loop(rank=rank, **c)
      File "~/source/stylegan3/training/training_loop.py", line 278, in training_loop
        loss.accumulate_gradients(phase=phase.name, real_img=real_img, real_c=real_c, gen_z=gen_z, gen_c=gen_c, gain=phase.interval, cur_nimg=cur_nimg)
      File "~/source/stylegan3/training/loss.py", line 81, in accumulate_gradients
        loss_Gmain.mean().mul(gain).backward()
      File "~/anaconda3/envs/stylegan3/lib/python3.9/site-packages/torch/_tensor.py", line 255, in backward
        torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
      File "~/anaconda3/envs/stylegan3/lib/python3.9/site-packages/torch/autograd/__init__.py", line 147, in backward
        Variable._execution_engine.run_backward(
      File "~/anaconda3/envs/stylegan3/lib/python3.9/site-packages/torch/autograd/function.py", line 87, in apply
        return self._forward_cls.backward(self, *args) # type: ignore[attr-defined]
      File "~/source/stylegan3/torch_utils/ops/grid_sample_gradfix.py", line 50, in backward
        grad_input, grad_grid = _GridSample2dBackward.apply(grad_output, input, grid)
      File "~/source/stylegan3/torch_utils/ops/grid_sample_gradfix.py", line 59, in forward
        grad_input, grad_grid = op(grad_output, input, grid, 0, 0, False)
    RuntimeError: CUDA out of memory. Tried to allocate 1.39 GiB (GPU 0; 11.76 GiB total capacity; 7.06 GiB already allocated; 443.88 MiB free; 10.02 GiB reserved in total by PyTorch)
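Because the failure only appeared after many days, it may be worth tracking the memory figures in the progress lines over time instead of waiting for the crash. The sketch below parses one progress line in the format shown in the log above and pulls out a few numeric fields. The function name parse_tick_line is hypothetical, and the field list is limited to names that appear in that line.

```python
import re

# Matches "name value" pairs; the lookbehind skips rate fields such as "sec/tick".
FIELD = re.compile(
    r"(?<![\w/])(tick|kimg|cpumem|gpumem|reserved)\s+([0-9]+(?:\.[0-9]+)?)"
)


def parse_tick_line(line):
    """Extract selected numeric fields from one training progress line."""
    return {name: float(value) for name, value in FIELD.findall(line)}


if __name__ == "__main__":
    sample = (
        "tick 444 kimg 1776.0 time 11d 17h 14m sec/tick 2292.6 sec/kimg 573.16 "
        "maintenance 0.2 cpumem 5.40 gpumem 7.69 reserved 10.03 augment 0.344"
    )
    stats = parse_tick_line(sample)
    print(stats)
    # Gap between what the allocator holds and what is logged as in use.
    print(round(stats["reserved"] - stats["gpumem"], 2))
```

Applied to every progress line of a run, the gap between reserved and gpumem gives a rough view of how much memory the allocator is holding beyond what is in use, which is one possible reading of the 10.02 GiB reserved versus 7.06 GiB allocated figures in the error message.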