First, write a Python script named test.py, save it, and run it. Alternatively, type python at the command line, press Enter, and paste in the code below:
import torch
import flashinfer

kv_len = 2048
num_kv_heads = 32
head_dim = 128
k = torch.randn(kv_len, num_kv_heads, head_dim).half().to('cuda')
v = torch.randn(kv_len, num_kv_heads, head_dim).half().to('cuda')

# CUDA decoding for a single request
q = torch.randn(num_kv_heads, head_dim).half().to('cuda')
o = flashinfer.single_decode_with_kv_cache(q, k, v)
print("FlashInfer seems ok.")
If running the script fails with an error about the TORCH_CUDA_ARCH_LIST variable not being found, it is because the build was not restricted to specific CUDA compute capabilities. Leaving the CUDA architecture numbers (CUDA compute capability) unspecified makes the compiled binary package very large; to shrink the binary, see https://www.wkwkk.com/articles/4c51566535e88f71.html
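As a minimal sketch, the variable can also be set from inside Python before any FlashInfer kernels are compiled. The value "8.0" (an A100-class GPU) is an assumption here; substitute your own GPU's compute capability, which you can query with torch.cuda.get_device_capability():

```python
import os

# Restrict compilation to one CUDA compute capability to keep binaries small.
# "8.0" is an assumed example value (A100); replace it with the compute
# capability of your own GPU.
os.environ["TORCH_CUDA_ARCH_LIST"] = "8.0"

# Confirm the variable is visible to subsequent build/JIT steps.
print(os.environ["TORCH_CUDA_ARCH_LIST"])
```

This must run before the first import that triggers kernel compilation; setting it afterwards has no effect on already-built binaries.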