MFU

Model FLOPs Utilization (MFU) is the fraction of the GPU's theoretical peak compute that the model actually achieves during training.

Counting model parameters

Python
def get_num_params(self, non_embedding=True):
    """
    Return the number of parameters in the model.
    For non-embedding count (default), the position embeddings get subtracted.
    The token embeddings would too, except due to the parameter sharing these
    params are actually used as weights in the final layer, so we include them.
    """
    # Count every parameter in the model
    n_params = sum(p.numel() for p in self.parameters())
    if non_embedding:
        # Drop the position embedding table from the count
        n_params -= self.transformer.wpe.weight.numel()
    return n_params

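Why subtract the position embedding?

wpe is the position embedding. It is only used for a table lookup when the input is embedded and takes no part in any large matrix multiplication, whereas MFU is concerned with the parameters that actually consume FLOPs. The token embedding is kept in the count because it is shared with the final output layer, where it is used in a matrix multiplication.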
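To get a feel for how much the subtraction matters, the count can be reproduced by hand. The sketch below is a hypothetical helper, `count_gpt2_params`, that is not part of the original code. It assumes a GPT-2 style layout (fused q/k/v projection, 4x MLP expansion, biases everywhere, token embedding tied to the output layer) and plugs in the GPT-2 small configuration as an illustrative example.

```python
def count_gpt2_params(n_layer, n_embd, vocab_size, block_size, bias=True):
    """Analytic parameter count for an assumed GPT-2 style model with tied embeddings.

    Returns (total parameters including wpe, parameters in wpe alone).
    """
    b = 1 if bias else 0
    wte = vocab_size * n_embd                      # token embedding, shared with the output layer
    wpe = block_size * n_embd                      # position embedding
    ln = n_embd * (1 + b)                          # LayerNorm weight (+ bias)
    attn = n_embd * 3 * n_embd + b * 3 * n_embd    # fused q, k, v projection
    attn += n_embd * n_embd + b * n_embd           # attention output projection
    mlp = n_embd * 4 * n_embd + b * 4 * n_embd     # MLP up-projection
    mlp += 4 * n_embd * n_embd + b * n_embd        # MLP down-projection
    per_layer = 2 * ln + attn + mlp                # two LayerNorms per block
    total = wte + wpe + n_layer * per_layer + ln   # plus the final LayerNorm
    return total, wpe


# GPT-2 small configuration (assumed here purely as an example)
total, wpe = count_gpt2_params(n_layer=12, n_embd=768, vocab_size=50257, block_size=1024)
non_embedding = total - wpe

print(total)          # 124439808
print(wpe)            # 786432
print(non_embedding)  # 123653376
```

For this configuration the position embedding is well under 1% of the total, so subtracting it changes the MFU estimate only slightly.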

MFU FLOPs calculation

Model structure parameters

Python

# Symbols used in the formula:
#   L = cfg.n_layer                # number of Transformer layers
#   H = cfg.n_head                 # number of attention heads
#   Q = cfg.n_embd // cfg.n_head   # dimension of each head
#   T = cfg.block_size             # sequence length
#   N = total number of parameters

def estimate_mfu(self, fwdbwd_per_iter, dt):
    """ estimate model flops utilization (MFU) in units of A100 bfloat16 peak FLOPS """
    # first estimate the number of flops we do per iteration.
    # see PaLM paper Appendix B as ref: https://arxiv.org/abs/2204.02311
    N = self.get_num_params()
    cfg = self.config
    L, H, Q, T = cfg.n_layer, cfg.n_head, cfg.n_embd//cfg.n_head, cfg.block_size
    flops_per_token = 6*N + 12*L*H*Q*T
    flops_per_fwdbwd = flops_per_token * T
    flops_per_iter = flops_per_fwdbwd * fwdbwd_per_iter
    # express our flops throughput as ratio of A100 bfloat16 peak flops
    flops_achieved = flops_per_iter * (1.0/dt) # per second
    flops_promised = 312e12 # A100 GPU bfloat16 peak flops is 312 TFLOPS
    mfu = flops_achieved / flops_promised
    return mfu
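The arithmetic in estimate_mfu can be checked with a standalone calculation. Roughly, the 6*N term covers the forward and backward passes through the weight matrices, and the 12*L*H*Q*T term covers the attention score and weighted-sum computations, which do not show up in the parameter count. The sketch below uses hypothetical free functions (`flops_per_token`, `mfu`) rather than the method above. The model numbers assume a GPT-2 small sized configuration, and the values for `fwdbwd_per_iter` and `dt` are made up for illustration, not measurements.

```python
A100_BF16_PEAK_FLOPS = 312e12  # same peak value that estimate_mfu uses


def flops_per_token(n_params, n_layer, n_head, head_dim, seq_len):
    """FLOPs per token for one forward+backward pass: 6*N + 12*L*H*Q*T."""
    return 6 * n_params + 12 * n_layer * n_head * head_dim * seq_len


def mfu(n_params, n_layer, n_head, head_dim, seq_len,
        fwdbwd_per_iter, dt, peak_flops=A100_BF16_PEAK_FLOPS):
    """Achieved FLOPs per second divided by the hardware peak."""
    per_fwdbwd = flops_per_token(n_params, n_layer, n_head, head_dim, seq_len) * seq_len
    per_iter = per_fwdbwd * fwdbwd_per_iter   # sequences processed in one iteration
    return per_iter / dt / peak_flops         # dt is the iteration time in seconds


# Assumed GPT-2 small sized model: N excludes the position embedding
per_token = flops_per_token(n_params=123_653_376, n_layer=12, n_head=12,
                            head_dim=64, seq_len=1024)
print(per_token)  # 855166464

# Hypothetical run: 480 sequences per iteration, 3.0 seconds per iteration
example = mfu(n_params=123_653_376, n_layer=12, n_head=12, head_dim=64,
              seq_len=1024, fwdbwd_per_iter=480, dt=3.0)
print(f"{example:.1%}")  # about 45% with these made-up timings
```

Because the peak is hard-coded to the A100 bfloat16 figure, the result on any other GPU is only a relative number; passing a different `peak_flops` is one way to adapt the sketch.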