GPU Memory Analysis (to be expanded)

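Background: while working on the Relighting project, training failed every time the image resolution was raised to 512. The error said GPU memory could not be allocated, even though the free memory reported in the same console error was more than enough. To train on higher-resolution Ground Truth images, I had no choice but to learn how to analyze GPU memory in more detail.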

References:

https://zhuanlan.zhihu.com/p/424512257
Official PyTorch profiler tutorial: https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html
Online profile visualization: edge://tracing/

I. Use PyTorch's built-in functions instead of nvidia-smi

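When analyzing GPU usage, people usually reach for NVIDIA's official command-line tool, nvidia-smi. For deep learning, however, PyTorch has its own GPU memory allocation mechanism: it manages memory through a caching allocator (because this is faster). Under the caching allocator, even after a Tensor is freed, the process does not return the freed memory to the GPU; it keeps the space and waits for the next Tensor to fill it. (In other words, once a Tensor object is no longer used, PyTorch automatically reclaims its memory but keeps holding it as a cache, so the memory shown in nvidia-smi/gpustat does not decrease.) As a result, nvidia-smi cannot accurately show in real time how much memory the tensors actually occupy, and PyTorch's built-in functions are needed instead: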

torch.cuda.memory_allocated()
torch.cuda.memory_reserved() (formerly torch.cuda.memory_cached(), which is now deprecated)
torch.cuda.memory_summary()
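Note that the first two functions return their results in bytes; to get MB or GB you have to convert the values yourself. torch.cuda.memory_summary() returns a formatted text report instead.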
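
As a minimal sketch of the conversion mentioned above: the helper names format_bytes and report_cuda_memory are made up for illustration, and the guarded import is only there so the snippet also loads on a machine without PyTorch or without a GPU.

```python
try:
    import torch
except ImportError:  # lets the snippet load where PyTorch is not installed
    torch = None

MB = 1024 ** 2
GB = 1024 ** 3


def format_bytes(num_bytes):
    """Convert a byte count into a human-readable MB or GB string."""
    if num_bytes >= GB:
        return f"{num_bytes / GB:.2f} GB"
    return f"{num_bytes / MB:.2f} MB"


def report_cuda_memory(device=0):
    """Print allocated vs. reserved memory for one device; returns the raw byte counts."""
    if torch is None or not torch.cuda.is_available():
        print("CUDA is not available.")
        return None
    allocated = torch.cuda.memory_allocated(device)  # bytes held by live tensors
    reserved = torch.cuda.memory_reserved(device)    # bytes held by the caching allocator
    print(f"allocated: {format_bytes(allocated)}, reserved: {format_bytes(reserved)}")
    return allocated, reserved


report_cuda_memory()
```

The gap between the two numbers is memory that PyTorch has cached but that no tensor currently uses, which is the part nvidia-smi cannot distinguish.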

II. PyTorch's hierarchical memory allocation:

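In PyTorch, GPU memory is allocated in page-like units rather than byte by byte, possibly because of a constraint of the CUDA device. Even if we only request 4 bytes, PyTorch first requests 2MB from the CUDA device into its own cache, and then hands us 512 or 1024 bytes out of that cache. Both numbers can be observed: torch.cuda.memory_allocated() shows the 512 bytes, and torch.cuda.memory_reserved() (torch.cuda.memory_cached() in older versions) shows the 2MB requested from CUDA.
PyTorch's allocation logic: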
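
The 4-byte case can be checked with a small experiment. This is only a sketch: measure_small_allocation is a hypothetical helper name, and the exact numbers may differ between PyTorch versions or if the cache already holds free blocks, so no expected output is given.

```python
try:
    import torch
except ImportError:  # lets the snippet load where PyTorch is not installed
    torch = None


def measure_small_allocation(device=0):
    """Allocate one float32 (4 bytes) and return the growth in (allocated, reserved) bytes."""
    if torch is None or not torch.cuda.is_available():
        return None
    torch.cuda.empty_cache()  # drop unused cached blocks so the growth is visible
    allocated_before = torch.cuda.memory_allocated(device)
    reserved_before = torch.cuda.memory_reserved(device)
    x = torch.zeros(1, dtype=torch.float32, device=f"cuda:{device}")
    allocated_delta = torch.cuda.memory_allocated(device) - allocated_before
    reserved_delta = torch.cuda.memory_reserved(device) - reserved_before
    del x
    return allocated_delta, reserved_delta


result = measure_small_allocation()
if result is None:
    print("CUDA is not available; nothing to measure.")
else:
    print(f"allocated grew by {result[0]} bytes, reserved grew by {result[1]} bytes")
```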

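1. Reserved memory in PyTorch exists in the form of blocks.
1. After the memory of an allocation is freed, the block it lived in can be allocated again.
1. The allocator tries to find the smallest cached block that satisfies the requested size. If that block is larger than the requested size, it can be split. If no suitable block is left, the allocator calls cudaMalloc to request memory from the CUDA device.
1. If cudaMalloc fails, the allocator first tries to free one cached block that is large enough and has not been split, then retries the allocation.
1. Allocations larger than 1MB and allocations of at most 1MB are stored in separate pools. Small requests are packed into 2MB buffers; large requests first try the smallest available free block, or else get a new block through cudaMalloc.
1. To reduce fragmentation, when no available block is large enough, an allocation between 1MB and 10MB makes the allocator request a 20MB block and split it. To reduce fragmentation further, blocks larger than 200MB cannot be split; such oversize cached blocks can still serve requests whose size is within 20MB of the block size.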
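
The size rules in the list above can be modeled in a few lines of plain Python. This is a simplified sketch, not the allocator itself: the constants come from the numbers quoted above, the rounding of requests of 10MB and more to a multiple of 2MB is an assumption, and the function names are chosen here for illustration. Block reuse, splitting and the 200MB rule are left out.

```python
KB = 1024
MB = 1024 * KB

MIN_BLOCK_SIZE = 512        # every allocation is rounded up to a multiple of 512 bytes
SMALL_SIZE = 1 * MB         # requests up to 1 MB go to the small pool
SMALL_BUFFER = 2 * MB       # the small pool grows in 2 MB buffers
MIN_LARGE_ALLOC = 10 * MB   # large requests below 10 MB get a 20 MB block
LARGE_BUFFER = 20 * MB
ROUND_LARGE = 2 * MB        # assumption: bigger requests round up to a multiple of 2 MB


def round_size(requested):
    """Size actually handed to the tensor (what memory_allocated would count)."""
    if requested < MIN_BLOCK_SIZE:
        return MIN_BLOCK_SIZE
    return MIN_BLOCK_SIZE * ((requested + MIN_BLOCK_SIZE - 1) // MIN_BLOCK_SIZE)


def segment_size(requested):
    """Size requested from cudaMalloc when no cached block fits."""
    size = round_size(requested)
    if size <= SMALL_SIZE:
        return SMALL_BUFFER
    if size < MIN_LARGE_ALLOC:
        return LARGE_BUFFER
    return ROUND_LARGE * ((size + ROUND_LARGE - 1) // ROUND_LARGE)


for requested in (4, 600, 1 * MB, 5 * MB, 33 * MB):
    print(f"request {requested:>9} B -> block {round_size(requested):>9} B, "
          f"cudaMalloc {segment_size(requested) / MB:.0f} MB")
```

Under this model a 4-byte request occupies a 512-byte block inside a 2MB buffer, which matches the observation described earlier.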

III. Difference between .to(device) and cuda() when copying a variable to GPU memory

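In practice I found that moving the same Tensor to the GPU with .to(device) versus cuda() seemed to result in slightly different memory usage.
Sorry, that was just me being dumb: calling x.to(device) on its own of course does not work. .to() returns a new tensor instead of moving x in place, and since that result is never used afterwards, PyTorch frees it automatically. Changing the line to x = x.to(device) fixes the problem.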
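
A small sketch of the difference, falling back to the CPU when no GPU is present (in which case both reads show cpu). The function name move_demo is only illustrative.

```python
try:
    import torch
except ImportError:  # lets the snippet load where PyTorch is not installed
    torch = None


def move_demo():
    """Return the device type of x after a discarded .to() and after a rebinding .to()."""
    if torch is None:
        return None
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    x = torch.zeros(4)            # created on the CPU
    x.to(device)                  # result is discarded, x itself is unchanged
    before = x.device.type
    x = x.to(device)              # rebinding x keeps the moved tensor alive
    after = x.device.type
    return before, after


print(move_demo())
```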

IV. Trying Nsight for visual analysis

First I tried calling nsys from the console, using the command from this post: https://dev-discuss.pytorch.org/t/using-nsight-systems-to-profile-gpu-workload/59
nsys profile -w true -t cuda,nvtx,osrt,cudnn,cublas -s cpu --capture-range=cudaProfilerApi --stop-on-range-end=true --cudabacktrace=true -x true -o my_profile python main.py
I then changed the command according to the errors reported on my machine:

unrecognised option '--cudabacktrace=true': removed this command-line argument.
--sample=cpu requires administrative privileges.: opened VS Code as administrator.
Illegal --trace argument 'osrt', Possible --trace values are one or more of 'cuda', 'nvtx', 'opengl', 'opengl-annotations', 'vulkan', 'vulkan-annotations', 'dx11', 'dx11-annotations', 'dx12', 'dx12-annotations', 'wddm' or 'none': removed everything that is not among the possible values.
Program not found: python: I do not know why this happens; my guess is that it is related to how Anaconda wraps its separate Python environments, so I call Python through its absolute path instead.
So the final console command became:
nsys profile -w true -t cuda,nvtx -s cpu --capture-range=cudaProfilerApi --stop-on-range-end=true -x true -o my_profile C:\Users\face\anaconda3\envs\py39\python.exe train_rnr_analysis.py
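
Instead of typing the interpreter path by hand, the path of the currently active environment can be read from sys.executable. The sketch below only assembles the command from the flags used above and prints it; build_nsys_command is a hypothetical helper, and it has to be run from inside the target environment to pick up the right interpreter.

```python
import sys


def build_nsys_command(script, output="my_profile", python_path=None):
    """Assemble the nsys command line, using the running interpreter by default."""
    python_path = python_path or sys.executable
    return [
        "nsys", "profile",
        "-w", "true",
        "-t", "cuda,nvtx",
        "-s", "cpu",
        "--capture-range=cudaProfilerApi",
        "--stop-on-range-end=true",
        "-x", "true",
        "-o", output,
        python_path, script,
    ]


# Printed rather than executed; paths containing spaces would need quoting.
print(" ".join(build_nsys_command("train_rnr_analysis.py")))
```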

V. Administrator privileges and mapped network drives:

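But then a new problem appeared: on Windows 10, a mapped network drive cannot be accessed correctly with administrator privileges. The likely reason is that UAC isolation on Windows 10 is stricter: a network drive mapped under normal user privileges is not accessible under administrator privileges.
After asking a classmate, I learned that this Windows server logs into an account named face by default at startup, while the real administrator account, Administrator, is disabled by default. One option is to enable that account in Computer Management and remount the mapped network drive; another is to open a console with administrator privileges and remap the drive with the command net use Y: \\192.168.16.73\exchange.
Now nsys is finally running (orz).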

VI. The application terminated before the collection started.

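Yet another problem: the program runs normally, but after it finishes the output is: The application terminated before the collection started. No report was generated.
I decided to first solve this with the example from https://dev-discuss.pytorch.org/t/using-nsight-systems-to-profile-gpu-workload/59 and only then run it on the training program.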
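
One possible cause, given the --capture-range=cudaProfilerApi flag in the command above: with that flag nsys only starts collecting once the program calls cudaProfilerStart, so a script that never makes that call would exit without any collection. A minimal sketch of marking the capture range from PyTorch follows; run_with_capture_range, step_fn and warmup_steps are hypothetical names, and without CUDA the loop simply runs unprofiled.

```python
try:
    import torch
except ImportError:  # lets the snippet load where PyTorch is not installed
    torch = None


def run_with_capture_range(step_fn, num_steps=5, warmup_steps=2):
    """Run step_fn(step) num_steps times, opening the capture range after warm-up."""
    use_cuda = torch is not None and torch.cuda.is_available()
    started = False
    for step in range(num_steps):
        if use_cuda and step == warmup_steps:
            torch.cuda.profiler.start()               # calls cudaProfilerStart
            started = True
        if use_cuda:
            torch.cuda.nvtx.range_push(f"step_{step}")  # named range on the timeline
        step_fn(step)
        if use_cuda:
            torch.cuda.nvtx.range_pop()
    if started:
        torch.cuda.synchronize()                      # let pending kernels finish first
        torch.cuda.profiler.stop()                    # calls cudaProfilerStop


run_with_capture_range(lambda step: None)  # replace the lambda with one training iteration
```

The NVTX ranges also give the timeline readable labels per iteration, which may help when the displayed calls are otherwise very low-level.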

VII. Switching to the Nsight Systems GUI

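The Nsight Compute GUI kept failing, so I finally switched to the Nsight Systems GUI, version 2022.1.3. The launch command is simply python train_rnr.py, with Nsight running locally on the server. Sadly, the problem with Nsight is that the function calls shown on the timeline are too low-level, so I could only match the iteration count to the repeating units on the timeline and could not do any finer-grained analysis.
However, Nsight can plot how memory usage changes over time, which gives a fairly intuitive picture of memory consumption.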

VIII. Writing a memory class for modular memory-usage output

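In the end I used Nsight to visualize the overall change in memory usage, and my own memory class, which calls PyTorch's built-in functions, to print GPU usage at the level of individual analysis steps. The only problem: if a step peaks at 4 GB but holds only 2 GB once everything has been freed, and that peak is not large enough to raise PyTorch's max_allocated, the 4 GB peak cannot be recovered accurately. To get that information, the step has to be split further and analyzed piece by piece.
Code for the memory class: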

import torch

# torch.cuda.empty_cache()

class MemoryInfoGPU:
    def __init__(self, device_idx = 0):
        torch.cuda.empty_cache()
        self.device = device_idx
        self.last_allocated = 0
        self.last_cached = 0
        self.last_max_allocated = 0
        self.last_max_cached = 0

        self.now_allocated = 0
        self.now_cached = 0
        self.max_allocated = 0
        self.max_cached = 0
        
        self.queryGPU()
        
    def queryGPU(self):
        """Move the current readings into last_* and refresh them from PyTorch (all in bytes)."""
        self.last_allocated = self.now_allocated
        self.last_cached = self.now_cached
        self.last_max_allocated = self.max_allocated
        self.last_max_cached = self.max_cached
        
        self.now_allocated = torch.cuda.memory_allocated(device=self.device)
        self.now_cached = torch.cuda.memory_reserved(device=self.device)
        self.max_allocated = torch.cuda.max_memory_allocated(device=self.device)
        # max_memory_cached is deprecated; max_memory_reserved is its replacement
        self.max_cached = torch.cuda.max_memory_reserved(device=self.device)

    def printMemoryInfo(self):
        print(str(self.device) + " GPU Memory Info: ")
        print("\tNow Allocated:\t", self.now_allocated/1024.0/1024.0, " MB",
              "\tNow Cached:\t", self.now_cached/1024.0/1024.0, " MB")
        print("\tMax Allocated:\t", self.max_allocated/1024/1024.0, " MB",
              "\tMax Cached:\t", self.max_cached/1024.0/1024.0, " MB")
        return
    
    def printConsume(self, message):
        print(message)
        print("Consume:\t", (self.now_allocated - self.last_allocated)/1024.0/1024.0, " MB")
        if self.max_allocated > self.last_max_allocated:
            print("But need:\t", (self.max_allocated - self.last_allocated)/1024.0/1024.0, " MB")
        # If an earlier peak was very large and this step did not exceed it,
        # the exact peak allocation of this step cannot be determined.
        
    def _before(self):
        self.queryGPU()
        
    def _after(self, message = ""):
        self.queryGPU()
        self.printMemoryInfo()
        self.printConsume(message)
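
One possible way around the peak limitation described above is to reset PyTorch's peak counters before each step with torch.cuda.reset_peak_memory_stats(), so that max_memory_allocated() reflects only that step. The context manager below is a sketch under that assumption, not part of the class above; step_peak and the keys of the stats dictionary are names chosen here for illustration.

```python
from contextlib import contextmanager

try:
    import torch
except ImportError:  # lets the snippet load where PyTorch is not installed
    torch = None

MB = 1024 ** 2


@contextmanager
def step_peak(label, device=0):
    """Measure net and peak memory growth of the enclosed step, in MB."""
    use_cuda = torch is not None and torch.cuda.is_available()
    start = 0
    if use_cuda:
        torch.cuda.synchronize(device)
        torch.cuda.reset_peak_memory_stats(device)  # peak now starts from the current usage
        start = torch.cuda.memory_allocated(device)
    stats = {"label": label, "consumed_mb": 0.0, "peak_extra_mb": 0.0}
    try:
        yield stats
    finally:
        if use_cuda:
            torch.cuda.synchronize(device)
            end = torch.cuda.memory_allocated(device)
            peak = torch.cuda.max_memory_allocated(device)
            stats["consumed_mb"] = (end - start) / MB      # what the step still holds
            stats["peak_extra_mb"] = (peak - start) / MB   # the most it needed at once
        print(f"{label}: consumed {stats['consumed_mb']:.2f} MB, "
              f"peak extra {stats['peak_extra_mb']:.2f} MB")


with step_peak("example step") as stats:
    pass  # e.g. one forward or backward pass
```

Resetting the counters also clears the global maximum, so this is best kept to analysis runs where the overall peak is tracked separately.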