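Column: GiantPandaCV

Focused on sharing technical content across machine learning, deep learning, computer vision, image processing, and related areas. The team is a group of people who love technology and enjoy sharing it. We publish only original work, one or two original technical posts a day. We hope that spreading and sharing knowledge can also inspire you, so that we all improve together (･ω<)☆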

High-Performance Programming in PyTorch

GiantPandaCV · WeChat Official Account · 3D · 2024-03-23 22:13

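Author: 东尼大佬
Source: https://zhuanlan.zhihu.com/p/673671771
Editor: GiantPandaCV

1. Use _all_gather_base instead of all_gather wherever possible

    # xxx is a placeholder for the process group
    output = torch.empty(input.numel() * world_size, dtype=input.dtype, device=input.device)
    torch.distributed._all_gather_base(output, input, group=xxx)

vs.

    output_list = [
        torch.empty(input.numel(), dtype=input.dtype, device=input.device)
        for _ in range(world_size)
    ]
    torch.distributed.all_gather(output_list, input, group=xxx)
    output = torch.cat(output_list, dim=0)

The first version gathers directly into one preallocated tensor, so it needs no per-rank allocations and no final torch.cat. That means less memory fragmentation and fewer operations, with gains in both performance and memory. (In recent PyTorch releases the public name for this collective is torch.distributed.all_gather_into_tensor; _all_gather_base is the older private name.)

2. Use specialized operators instead of generic ones

For example, F.embedding vs. index-select. The Megatron-LM master implementation uses the index-select operator. Index-select involves host-side (CPU) logic such as index expansion and memory reuse, which makes it less efficient.

3. For long-lived tensors, share one contiguous buffer

    data = torch.zeros(global_size, dtype=xx, device=xx)
    start_idx = 0
    for i in range(len(item_list)):
        item_list[i] = data[start_idx:start_idx+item_list[i].nume ………………………………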
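Tips 2 and 3 can be tried on a single device without any distributed setup. The sketch below is illustrative only: the table size, the token ids, and the helper name pack_into_contiguous_buffer are assumptions, not from the original post. Since the original snippet for tip 3 is cut off, the packing helper shows one common way to realize the idea, not necessarily the author's code. It first checks that F.embedding and index_select return the same rows, then copies several tensors into one flat buffer and hands back views that share its storage.

```python
import torch
import torch.nn.functional as F

# Tip 2: F.embedding and index_select look up the same rows of a weight table.
torch.manual_seed(0)
weight = torch.randn(10, 4)                  # hypothetical table: 10 rows, dim 4
token_ids = torch.tensor([[1, 3], [7, 7]])   # hypothetical indices, shape (2, 2)

via_embedding = F.embedding(token_ids, weight)  # shape (2, 2, 4)
# index_select needs a 1-D index, so flatten first and restore the shape after.
via_index_select = weight.index_select(0, token_ids.reshape(-1)).view(2, 2, 4)
same_lookup = torch.equal(via_embedding, via_index_select)


# Tip 3 (assumed illustration): back several long-lived tensors with one buffer.
def pack_into_contiguous_buffer(tensors):
    """Copy tensors into a single flat buffer and return (buffer, views).

    Each returned view has the shape of the corresponding input tensor and
    shares storage with the buffer. Assumes all inputs share dtype and device.
    """
    total = sum(t.numel() for t in tensors)
    buffer = torch.zeros(total, dtype=tensors[0].dtype, device=tensors[0].device)
    views = []
    offset = 0
    for t in tensors:
        n = t.numel()
        view = buffer[offset:offset + n].view(t.shape)  # no new allocation
        view.copy_(t)                                   # keep the original values
        views.append(view)
        offset += n
    return buffer, views
```

After packing, the caller would replace its references to the original tensors with the returned views, so that only the single buffer stays alive for the long term.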
