Archive

2024-06-19: Parallel Reduction Optimization with CUDA Reduce (reduction) is a classic problem in parallel computing. It works as follows: given an array, compute its sum, min, max, or mean. This operation is a basic data parallel …

2024-05-30: Some Thoughts on W4A8KV4 TL;DR KV4 W4A8 Introduction Paper Walkthrough SqueezeLLM Sensitivity-Based Non-uniform Quantization …

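2024-05-06: Quantization Evaluation Proposal Current Status Evaluation Metric Selection Follow-up Evaluation TODO Current Status crossing currently supports two main types of evaluation. One is the prefill-like tasks based on lm-evaluation-harness, …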

2024-05-05: LLM Speculative Sampling Background Algorithm Overview Mathematical Derivation Reference Background LLM inference is mainly carried out using AutoRegressive Sampling (ArS), …

2024-04-25: SmoothQuant Optimization Notes SmoothQuant Principles Recap Current Problems Some Fixups Greedy Search FC-FC Smooth Reverse Greedy Search …

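2024-04-18: At work, when we write a CUDA kernel, we need to compute its theoretical utilization and theoretical bandwidth so that we know how much room for optimization the kernel still has. Let us get two definitions straight: tflops …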

2024-04-17: Triton Tutorial #2 This is the third blogpost of the Triton tutorial series. In this tutorial, we will …