
LLM Quantization in Practice (GPTQ, GGUF): Accuracy and Inference Performance Benchmarks

2025-10-12 22:53:55  Source: 機器學習與Python社區 (Machine Learning and Python Community)


TL;DR: judged by OTPS (output tokens per second), llama.cpp has the best single-user performance, but under high concurrency vllm+GPTQ > vllm+GGUF.

1. Environment Setup

Hardware:

RTX 4090 24GB x 1
Windows 11 + WSL2
Driver Version: 581.29

Software setup (requires conda: https://conda-forge.org/download/)

    # Mainland China mirror (optional): export HF_ENDPOINT=https://hf-mirror.com
    conda create -n llm-speedup python==3.12
    conda activate llm-speedup
    pip install "vllm==0.10.2" "sglang==0.5.2" "evalscope[perf]==1.0.1" langdetect immutabledict
    # Assumes an llm-compressor source checkout in the current directory
    cd llm-compressor
    pip install -e ./
    # Pin datasets below 4.0.0 to fix an evalscope dataset-loading failure
    pip install "datasets<4.0.0"

2. Quantization

2.1 GPTQ quantization with llm-compressor

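We use GPTQ w4a16g128 quantization (4-bit weights, 16-bit activations, group size 128) of the Qwen/Qwen3-4B-Instruct-2507 model as the example. For other quantization methods (AWQ, etc.), see the llm-compressor documentation.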

    # Generate the calibration dataset from high-quality Chinese and English SFT data
    python calib_data.py
    # Run GPTQ quantization (layer by layer, takes roughly 10-20 minutes)
    python qwen3_dense_instruct_w4a16.py

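The calibration dataset consists of 1024 high-quality, mixed Chinese and English conversational SFT samples.
Based on various evaluations and practical experience, we recommend GPTQ w8a16/w4a16 quantization, which loses the least accuracy.
When quantizing MoE models, the Gate (router) layers must additionally be excluded from quantization to avoid excessive quantization error.
If the quantization loss is too large, you can exclude the first N layers from quantization.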
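To make the "w4a16g128" label concrete, here is a minimal sketch of group-wise 4-bit weight storage. It uses plain round-to-nearest, not GPTQ itself (GPTQ additionally uses the calibration data to compensate for rounding error), and it assumes a symmetric scheme with one 16-bit scale per group of 128 weights. All function names are hypothetical and are not part of llm-compressor.

```python
import random


def quantize_group_w4(weights):
    """Symmetric round-to-nearest 4-bit quantization of one weight group.

    Returns (codes, scale): integer codes in [-8, 7] and one shared scale.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def quantize_row(weights, group_size=128):
    """Quantize a weight row group by group (the 'g128' part)."""
    groups = []
    for start in range(0, len(weights), group_size):
        groups.append(quantize_group_w4(weights[start:start + group_size]))
    return groups


def dequantize_row(groups):
    """Rebuild approximate weights: code * scale for every group."""
    return [code * scale for codes, scale in groups for code in codes]


def bits_per_weight(weight_bits=4, scale_bits=16, group_size=128):
    """Storage cost: the weight codes plus one scale shared by each group."""
    return weight_bits + scale_bits / group_size


random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(512)]  # toy weights, not a real layer
groups = quantize_row(row)
restored = dequantize_row(groups)
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(len(groups), bits_per_weight(), max_err)
```

Under these assumptions a 512-weight row is stored as 4 groups at 4.125 bits per weight, and the per-weight rounding error never exceeds half of that group's scale. Activations stay in 16-bit ("a16"), so only the weights shrink.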

2.2 Accuracy before and after GPTQ quantization

    # Start the bf16 inference server
    vllm serve Qwen/Qwen3-4B-Instruct-2507 --max-model-len 8192 --served-model-name Qwen3-4B-Instruct-2507 --port 8080

    # Evaluate (in a separate terminal) Math500 (math), IFEval (instruction following), IQuiz (Chinese comprehension)
    evalscope eval \
      --model Qwen3-4B-Instruct-2507 \
      --api-url http://127.0.0.1:8080/v1 \
      --api-key EMPTY \
      --eval-type openai_api \
      --datasets math_500 ifeval iquiz \
      --eval-batch-size 100

    +------------------------+-----------+--------------------------+----------+-------+---------+---------+
    | Model                  | Dataset   | Metric                   | Subset   |   Num |   Score | Cat.0   |
    +========================+===========+==========================+==========+=======+=========+=========+
    | Qwen3-4B-Instruct-2507 | ifeval    | mean_prompt_level_strict | default  |   541 |  0.8299 | default |
    | Qwen3-4B-Instruct-2507 | ifeval    | mean_inst_level_strict   | default  |   541 |  0.8882 | default |
    | Qwen3-4B-Instruct-2507 | iquiz     | mean_acc                 | OVERALL  |   120 |  0.525  | -       |
    | Qwen3-4B-Instruct-2507 | math_500  | mean_acc                 | OVERALL  |   500 |  0.776  | -       |
    +------------------------+-----------+--------------------------+----------+-------+---------+---------+

    # Start the w4a16 inference server
    vllm serve Qwen3-4B-Instruct-2507-W4A16-G128 --max-model-len 8192 --served-model-name Qwen3-4B-Instruct-2507-W4A16-G128 --port 8080

    # Evaluate
    evalscope eval \
      --model Qwen3-4B-Instruct-2507-W4A16-G128 \
      --api-url http://127.0.0.1:8080/v1 \
      --api-key EMPTY \
      --eval-type openai_api \
      --datasets math_500 ifeval iquiz \
      --eval-batch-size 100

    +-----------------------------------+-----------+--------------------------+----------+-------+---------+---------+
    | Model                             | Dataset   | Metric                   | Subset   |   Num |   Score | Cat.0   |
    +===================================+===========+==========================+==========+=======+=========+=========+
    | Qwen3-4B-Instruct-2507-W4A16-G128 | ifeval    | mean_prompt_level_strict | default  |   541 |  0.8355 | default |
    | Qwen3-4B-Instruct-2507-W4A16-G128 | ifeval    | mean_inst_level_strict   | default  |   541 |  0.8879 | default |
    | Qwen3-4B-Instruct-2507-W4A16-G128 | iquiz     | mean_acc                 | OVERALL  |   120 |  0.5333 | -       |
    | Qwen3-4B-Instruct-2507-W4A16-G128 | math_500  | mean_acc                 | OVERALL  |   500 |  0.782  | -       |
    +-----------------------------------+-----------+--------------------------+----------+-------+---------+---------+

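Finding: after quantization, three of the four metrics are slightly higher than those of the unquantized model, and the fourth (IFEval mean_inst_level_strict, 0.8879 vs 0.8882) is essentially unchanged. We attribute this to our calibration dataset being high-quality SFT data and consider it normal. The gaps are small (the IQuiz gain of 0.0083 is a single question out of 120), so the safe reading is that w4a16 quantization causes no measurable loss on these benchmarks.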
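To put the score differences in perspective, the sketch below converts each delta into an approximate number of test items. The scores are copied from the two evalscope tables above; the helper name is hypothetical. Note that the Num column for IFEval counts prompts, so the item estimate for the instruction-level metric is only a rough indication.

```python
# Scores copied from the two evalscope tables above: (num_samples, bf16, w4a16).
RESULTS = {
    "ifeval/mean_prompt_level_strict": (541, 0.8299, 0.8355),
    "ifeval/mean_inst_level_strict": (541, 0.8882, 0.8879),
    "iquiz/mean_acc": (120, 0.525, 0.5333),
    "math_500/mean_acc": (500, 0.776, 0.782),
}


def score_deltas(results):
    """Return {metric: (delta, approx_items)} for quantized minus baseline."""
    deltas = {}
    for metric, (num, baseline, quantized) in results.items():
        delta = quantized - baseline
        # delta * num approximates how many test items changed outcome.
        deltas[metric] = (round(delta, 4), round(delta * num))
    return deltas


for metric, (delta, items) in score_deltas(RESULTS).items():
    print(f"{metric:34s} delta={delta:+.4f}  ~{items:+d} items")
```

Every difference corresponds to three items or fewer, which is why we treat the quantized and unquantized models as equivalent on these benchmarks rather than claiming a real improvement.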

2.3 vLLM inference performance before and after GPTQ quantization

    vllm serve Qwen/Qwen3-4B-Instruct-2507 --max-model-len 8192 --served-model-name Qwen3-4B-Instruct-2507 --port 8080

    evalscope perf \
      --parallel 1 10 20 50 100 \
      --number 10 30 50 100 200 \
      --model Qwen3-4B-Instruct-2507 \
      --url http://127.0.0.1:8080/v1/chat/completions \
      --api openai \
      --dataset random \
      --max-tokens 1024 \
      --min-tokens 1024 \
      --prefix-length 0 \
      --min-prompt-length 1024 \
      --max-prompt-length 1024 \
      --tokenizer-path Qwen3-4B-Instruct-2507 \
      --extra-args '{"ignore_eos": true}'

    ┏━━━━━━┳━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┓
    ┃      ┃      ┃      Avg ┃      P99 ┃    Gen. ┃      Avg ┃     P99 ┃      Avg ┃     P99 ┃   Success┃
    ┃Conc. ┃  RPS ┃  Lat.(s) ┃  Lat.(s) ┃  toks/s ┃  TTFT(s) ┃ TTFT(s) ┃  TPOT(s) ┃ TPOT(s) ┃      Rate┃
    ┡━━━━━━╇━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━┩
    │    1 │ 0.09 │   11.530 │   11.588 │   88.81 │    0.050 │   0.065 │    0.011 │   0.011 │    100.0%│
    │   10 │ 0.65 │   15.284 │   15.711 │  669.34 │    0.288 │   0.628 │    0.015 │   0.015 │    100.0%│
    │   20 │ 0.93 │   18.492 │   20.202 │  954.49 │    0.467 │   1.304 │    0.018 │   0.019 │    100.0%│
    │   50 │ 1.52 │   30.359 │   38.295 │ 1555.54 │    1.214 │   3.216 │    0.029 │   0.034 │    100.0%│
    │  100 │ 1.54 │   54.048 │   75.195 │ 1579.02 │   13.821 │  39.359 │    0.039 │   0.066 │    100.0%│
    └──────┴──────┴──────────┴──────────┴─────────┴──────────┴─────────┴──────────┴─────────┴──────────┘

    vllm serve Qwen3-4B-Instruct-2507-W4A16-G128 --max-model-len 8192 --served-model-name Qwen3-4B-Instruct-2507-W4A16-G128 --port 8080

    evalscope perf \
      --parallel 1 10 20 50 100 \
      --number 10 30 50 100 200 \
      --model Qwen3-4B-Instruct-2507-W4A16-G128 \
      --url http://127.0.0.1:8080/v1/chat/completions \
      --api openai \
      --dataset random \
      --max-tokens 1024 \
      --min-tokens 1024 \
      --prefix-length 0 \
      --min-prompt-length 1024 \
      --max-prompt-length 1024 \
      --tokenizer-path Qwen3-4B-Instruct-2507-W4A16-G128 \
      --extra-args '{"ignore_eos": true}'

    ┏━━━━━━┳━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┓
    ┃      ┃      ┃      Avg ┃      P99 ┃    Gen. ┃      Avg ┃     P99 ┃      Avg ┃     P99 ┃   Success┃
    ┃Conc. ┃  RPS ┃  Lat.(s) ┃  Lat.(s) ┃  toks/s ┃  TTFT(s) ┃ TTFT(s) ┃  TPOT(s) ┃ TPOT(s) ┃      Rate┃
    ┡━━━━━━╇━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━┩
    │    1 │ 0.16 │    6.150 │    9.323 │  166.50 │    0.059 │   0.068 │    0.006 │   0.009 │    100.0%│
    │   10 │ 1.03 │    9.666 │   10.177 │ 1058.72 │    0.386 │   0.807 │    0.009 │   0.009 │    100.0%│
    │   20 │ 1.29 │   13.762 │   15.793 │ 1316.59 │    0.528 │   1.476 │    0.013 │   0.014 │    100.0%│
    │   50 │ 1.77 │   28.100 │   31.295 │ 1816.37 │    1.165 │   3.533 │    0.026 │   0.027 │    100.0%│
    │  100 │ 1.76 │   50.314 │   83.056 │ 1805.55 │    7.330 │  28.528 │    0.042 │   0.074 │    100.0%│
    └──────┴──────┴──────────┴──────────┴─────────┴──────────┴─────────┴──────────┴─────────┴──────────┘

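Finding: after quantization, single-user OTPS (output tokens per second, the "Gen. toks/s" column) rises by about 87% (88.81 to 166.50), but peak OTPS improves much less, by about 15% (1579.02 to 1816.37).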
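The percentages can be recomputed directly from the "Gen. toks/s" columns of the two tables above. This is a small sketch; the helper names are hypothetical.

```python
# "Gen. toks/s" by concurrency, copied from the two evalscope perf tables above.
BF16 = {1: 88.81, 10: 669.34, 20: 954.49, 50: 1555.54, 100: 1579.02}
W4A16 = {1: 166.50, 10: 1058.72, 20: 1316.59, 50: 1816.37, 100: 1805.55}


def gain_percent(baseline, candidate):
    """Relative throughput gain of candidate over baseline, in percent."""
    return (candidate / baseline - 1.0) * 100.0


def summarize(baseline, candidate):
    """Per-concurrency gains plus the gain in peak throughput."""
    per_level = {c: gain_percent(baseline[c], candidate[c]) for c in baseline}
    peak = gain_percent(max(baseline.values()), max(candidate.values()))
    return per_level, peak


per_level, peak = summarize(BF16, W4A16)
for conc, gain in per_level.items():
    print(f"concurrency {conc:>3}: {gain:+.1f}%")
print(f"peak OTPS: {peak:+.1f}%")
```

The gain shrinks steadily as concurrency rises. A common explanation, which this benchmark does not test, is that single-stream decoding is limited by reading weights from GPU memory (where 4-bit weights help most), while large batches shift the bottleneck toward compute.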

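2.4 GGUF imatrix quantization

For an overview of the GGUF quantization types, see https://huggingface.co/docs/hub/en/gguf

We use IQ4_XS, a 4-bit imatrix quantization (an approach similar to GPTQ in that it relies on calibration data).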

    git clone https://github.com/ggml-org/llama.cpp.git
    # Install the CUDA Toolkit first: https://developer.nvidia.com/cuda-toolkit-archive
    # Install build dependencies
    sudo apt-get install cmake curl libssl-dev libcurl4-openssl-dev
    # Put CUDA on the search paths; adjust the version to match your installation
    export PATH=/usr/local/cuda-12.6/bin${PATH:+:${PATH}}
    export LD_LIBRARY_PATH=/usr/local/cuda-12.6/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
    # Build the GPU version of llama.cpp (run from the directory that contains the
    # llama.cpp checkout, so the ./llama.cpp/build/bin paths below resolve)
    cmake -S llama.cpp -B llama.cpp/build -DGGML_CUDA=ON
    cmake --build llama.cpp/build --config Release -j16
    # Download the model locally
    hf download "Qwen/Qwen3-4B-Instruct-2507" --local-dir "Qwen3-4B-Instruct-2507"
    # Convert to fp16 GGUF
    python llama.cpp/convert_hf_to_gguf.py "Qwen3-4B-Instruct-2507" --outtype f16 --outfile Qwen3-4B-Instruct-2507-f16.gguf
    # Generate imatrix.dat from the calibration text
    ./llama.cpp/build/bin/llama-imatrix -m Qwen3-4B-Instruct-2507-f16.gguf -f calibration.txt -ngl 99 --output-frequency 10 -o imatrix.dat --parse-special
    # Quantize with calibration (imatrix)
    ./llama.cpp/build/bin/llama-quantize --leave-output-tensor --imatrix imatrix.dat Qwen3-4B-Instruct-2507-f16.gguf Qwen3-4B-Instruct-2507-iq4_xs.gguf IQ4_XS
    # Quantize without calibration
    ./llama.cpp/build/bin/llama-quantize --leave-output-tensor Qwen3-4B-Instruct-2507-f16.gguf Qwen3-4B-Instruct-2507-q4_k_m.gguf Q4_K_M
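Roughly, the importance matrix records how strongly each input channel is activated on the calibration text, and the quantizer then weights rounding errors by that importance instead of treating every weight equally. The toy sketch below illustrates only that idea with a grid search over the scale of a single block. It is not llama.cpp's implementation, the real IQ4_XS format differs in its details, and the importance values and function names are made up for illustration.

```python
import random


def quantize(values, scale, qmax=7):
    """Round-to-nearest onto the integer grid [-qmax-1, qmax] at a given scale."""
    return [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]


def weighted_error(values, importance, scale):
    """Sum of importance-weighted squared reconstruction errors."""
    codes = quantize(values, scale)
    return sum(w * (v - c * scale) ** 2
               for v, c, w in zip(values, codes, importance))


def best_scale(values, importance, steps=64):
    """Grid-search the scale that minimizes the weighted error."""
    max_abs = max(abs(v) for v in values)
    candidates = [max_abs / 7 * (0.5 + i / steps) for i in range(steps + 1)]
    return min(candidates, key=lambda s: weighted_error(values, importance, s))


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(32)]  # one toy block of weights
# Hypothetical importance: pretend a few input channels see large activations.
importance = [100.0 if i % 8 == 0 else 1.0 for i in range(32)]
uniform = [1.0] * 32

plain = best_scale(weights, uniform)       # ignores calibration data
guided = best_scale(weights, importance)   # uses the importance weights
print(weighted_error(weights, importance, plain),
      weighted_error(weights, importance, guided))
```

Because both searches use the same candidate scales, the importance-guided scale can never do worse than the plain one on the importance-weighted error. That is the sense in which calibrated quantization spends its limited precision where the calibration data says it matters.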

GGUF quantization accuracy evaluation

We measure each model's PPL (perplexity, lower is better) on the wiki.test dataset.

    # Perplexity (PPL); the result of each run is shown as a comment
    ./llama.cpp/scripts/get-wikitext-2.sh
    ./llama.cpp/build/bin/llama-perplexity -m Qwen3-4B-Instruct-2507-f16.gguf -f wikitext-2-raw/wiki.test.raw -ngl 99
    # PPL = 10.5498 +/- 0.08436
    ./llama.cpp/build/bin/llama-perplexity -m Qwen3-4B-Instruct-2507-iq4_xs.gguf -f wikitext-2-raw/wiki.test.raw -ngl 99
    # PPL = 10.7011 +/- 0.08542
    ./llama.cpp/build/bin/llama-perplexity -m Qwen3-4B-Instruct-2507-q4_k_m.gguf -f wikitext-2-raw/wiki.test.raw -ngl 99
    # PPL = 10.7434 +/- 0.08562

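iq4_xs is not only smaller but also scores slightly better than q4_k_m (PPL 10.7011 vs 10.7434, about 1.4% and 1.8% above the f16 baseline of 10.5498), although the gap between the two quantized models is within the reported error bars.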
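For reference, perplexity is the exponential of the average negative log-likelihood per token. The sketch below shows that definition and recomputes the relative increase of each quantized model over f16 from the PPL values above. The function names are hypothetical, and this is not how llama-perplexity is implemented internally (it also handles chunking and context windows).

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


def relative_increase(baseline_ppl, quantized_ppl):
    """Percent increase in perplexity caused by quantization."""
    return (quantized_ppl / baseline_ppl - 1.0) * 100.0


# A model that assigns probability 1/4 to every token has perplexity 4
# (up to floating-point rounding).
print(perplexity([math.log(0.25)] * 10))
# PPL values reported by llama-perplexity above.
print(relative_increase(10.5498, 10.7011))  # iq4_xs vs f16
print(relative_increase(10.5498, 10.7434))  # q4_k_m vs f16
```

A perplexity of about 10.5 can therefore be read as the model being, on average, as uncertain as if it chose uniformly among roughly 10.5 tokens at each step.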

Next, we evaluate the model's accuracy on real inference tasks.

    # Served with vLLM because, as shown later, its concurrent performance is better than llama.cpp's
    vllm serve ./Qwen3-4B-Instruct-2507-iq4_xs.gguf --served-model-name Qwen3-4B-Instruct-2507-iq4_xs --max-model-len 8192 --port 8080 --tokenizer Qwen3-4B-Instruct-2507

    evalscope eval \
      --model Qwen3-4B-Instruct-2507-iq4_xs \
      --api-url http://127.0.0.1:8080/v1 \
      --api-key EMPTY \
      --eval-type openai_api \
      --datasets math_500 ifeval iquiz \
      --eval-batch-size 100

    +-------------------------------+-----------+--------------------------+----------+-------+---------+---------+
    | Model                         | Dataset   | Metric                   | Subset   |   Num |   Score | Cat.0   |
    +===============================+===========+==========================+==========+=======+=========+=========+
    | Qwen3-4B-Instruct-2507-iq4_xs | ifeval    | mean_prompt_level_strict | default  |   541 |  0.8262 | default |
    | Qwen3-4B-Instruct-2507-iq4_xs | ifeval    | mean_inst_level_strict   | default  |   541 |  0.8851 | default |
    | Qwen3-4B-Instruct-2507-iq4_xs | iquiz     | mean_acc                 | OVERALL  |   120 |  0.5    | -       |
    | Qwen3-4B-Instruct-2507-iq4_xs | math_500  | mean_acc                 | OVERALL  |   500 |  0.758  | -       |
    +-------------------------------+-----------+--------------------------+----------+-------+---------+---------+

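Finding: slightly weaker than the GPTQ-quantized model (for example, Math500 0.758 vs 0.782 and IQuiz 0.5 vs 0.5333), but the overall degradation relative to bf16 (Math500 0.776, IQuiz 0.525) is small.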

GGUF quantization performance evaluation

Inference with vllm + gguf iq4.

    vllm serve ./Qwen3-4B-Instruct-2507-iq4_xs.gguf --served-model-name Qwen3-4B-Instruct-2507-iq4_xs --max-model-len 8192 --port 8080 --tokenizer Qwen3-4B-Instruct-2507

    evalscope perf \
      --parallel 1 10 20 50 100 \
      --number 10 30 50 100 200 \
      --model Qwen3-4B-Instruct-2507-iq4_xs \
      --url http://127.0.0.1:8080/v1/chat/completions \
      --api openai \
      --dataset random \
      --max-tokens 1024 \
      --min-tokens 1024 \
      --prefix-length 0 \
      --min-prompt-length 1024 \
      --max-prompt-length 1024 \
      --tokenizer-path Qwen3-4B-Instruct-2507/ \
      --extra-args '{"ignore_eos": true}'

    ┏━━━━━━┳━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┓
    ┃      ┃      ┃      Avg ┃      P99 ┃    Gen. ┃      Avg ┃     P99 ┃      Avg ┃     P99 ┃   Success┃
    ┃Conc. ┃  RPS ┃  Lat.(s) ┃  Lat.(s) ┃  toks/s ┃  TTFT(s) ┃ TTFT(s) ┃  TPOT(s) ┃ TPOT(s) ┃      Rate┃
    ┡━━━━━━╇━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━┩
    │    1 │ 0.17 │    5.884 │    5.945 │  174.02 │    0.044 │   0.087 │    0.006 │   0.006 │    100.0%│
    │   10 │ 0.40 │   24.839 │   25.406 │  412.00 │    0.449 │   1.034 │    0.024 │   0.024 │    100.0%│
    │   20 │ 0.66 │   25.413 │   26.805 │  677.62 │    0.658 │   1.838 │    0.024 │   0.025 │    100.0%│
    │   50 │ 1.17 │   42.447 │   46.481 │ 1201.77 │    1.444 │   4.483 │    0.040 │   0.041 │    100.0%│
    │  100 │ 1.20 │   72.823 │  118.206 │ 1225.47 │    8.692 │  37.972 │    0.063 │   0.106 │    100.0%│
    └──────┴──────┴──────────┴──────────┴─────────┴──────────┴─────────┴──────────┴─────────┴──────────┘

Inference with llama.cpp + gguf iq4.

    # -c 4096 sets the context size (prompt plus generated tokens); -n 4096 caps the number of generated tokens
    ./llama.cpp/build/bin/llama-server -m Qwen3-4B-Instruct-2507-iq4_xs.gguf -c 4096 -n 4096 -ngl 99

    # Test
    curl -X POST http://127.0.0.1:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "Qwen3-4B-Instruct-2507-iq4_xs",
        "messages": [
          {"role": "user", "content": "你好"}
        ],
        "stream": true
      }'

    # Note: on the first run, let the benchmark run briefly and press Ctrl+C (warm-up), then run it again
    evalscope perf \
      --parallel 1 10 20 50 100 \
      --number 10 30 50 100 200 \
      --model Qwen3-4B-Instruct-2507-iq4_xs \
      --url http://127.0.0.1:8080/v1/chat/completions \
      --api openai \
      --dataset random \
      --max-tokens 1024 \
      --min-tokens 1024 \
      --prefix-length 0 \
      --min-prompt-length 1024 \
      --max-prompt-length 1024 \
      --tokenizer-path Qwen3-4B-Instruct-2507 \
      --extra-args '{"ignore_eos": true}'

    ┏━━━━━━┳━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┓
    ┃      ┃      ┃      Avg ┃      P99 ┃    Gen. ┃      Avg ┃     P99 ┃      Avg ┃     P99 ┃   Success┃
    ┃Conc. ┃  RPS ┃  Lat.(s) ┃  Lat.(s) ┃  toks/s ┃  TTFT(s) ┃ TTFT(s) ┃  TPOT(s) ┃ TPOT(s) ┃      Rate┃
    ┡━━━━━━╇━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━┩
    │    1 │ 0.21 │    4.812 │    4.816 │  212.76 │    0.061 │   0.063 │    0.005 │   0.005 │    100.0%│
    │   10 │ 0.20 │   41.531 │   48.982 │  209.89 │   36.711 │  44.152 │    0.005 │   0.005 │    100.0%│
    │   20 │ 0.20 │   80.076 │   99.156 │  207.84 │   75.205 │  94.257 │    0.005 │   0.005 │    100.0%│
    │   50 │ 0.20 │  189.758 │  251.990 │  204.79 │  184.814 │ 247.020 │    0.005 │   0.005 │    100.0%│
    │  100 │ 0.20 │  378.942 │  504.018 │  204.04 │  373.980 │ 499.034 │    0.005 │   0.005 │    100.0%│
    └──────┴──────┴──────────┴──────────┴─────────┴──────────┴─────────┴──────────┴─────────┴──────────┘
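One reading of the llama.cpp table, which the measurements suggest but this article does not verify: TPOT stays at 0.005 s and Gen. toks/s stays near 205 at every concurrency level, while TTFT grows in proportion to concurrency. That is the pattern you would expect if the server processed one request at a time and queued the rest. The sketch below simulates such a single-slot FIFO server (a hypothetical model, not llama.cpp code) and compares its predicted average latency with the measured values.

```python
def simulate_single_slot(concurrency, total_requests, service_time):
    """Closed-loop load test against a server that handles one request at a time.

    `concurrency` clients each send a request, wait for the reply, and
    immediately send the next one until `total_requests` have been issued.
    Returns the latency of every request in completion order.
    """
    latencies = []
    clock = 0.0
    # Every client submits its first request at t = 0.
    queue = [0.0] * min(concurrency, total_requests)
    issued = len(queue)
    while queue:
        submitted_at = queue.pop(0)      # FIFO: the oldest request is served next
        clock += service_time            # the server is never idle in this model
        latencies.append(clock - submitted_at)
        if issued < total_requests:      # the freed client sends another request
            queue.append(clock)
            issued += 1
    return latencies


TOKENS_PER_REQUEST = 1024  # --min-tokens / --max-tokens in the benchmark
# (concurrency, total requests, measured Gen. toks/s, measured avg latency in s)
RUNS = [(10, 30, 209.89, 41.531), (20, 50, 207.84, 80.076),
        (50, 100, 204.79, 189.758), (100, 200, 204.04, 378.942)]

for conc, number, toks_per_s, measured in RUNS:
    service = TOKENS_PER_REQUEST / toks_per_s  # seconds per request when served alone
    lat = simulate_single_slot(conc, number, service)
    predicted = sum(lat) / len(lat)
    print(f"conc={conc:>3} predicted avg={predicted:7.1f}s measured={measured:7.1f}s")
```

The predicted averages land within about 1% of the measured ones, so the flat throughput is consistent with sequential processing in this configuration. llama-server does expose a `--parallel` option for serving several slots at once; it was not set in the command above, and whether it would change the ranking was not tested here.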

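Conclusion: judged by OTPS, llama.cpp has the best single-user performance (212.76 tok/s), but under high concurrency vllm+GPTQ > vllm+GGUF (peak 1816.37 vs 1225.47 tok/s), while llama.cpp as configured here stays at about 205 tok/s.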
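The ranking can be reproduced from the "Gen. toks/s" columns of the four evalscope perf tables. This is a small sketch; the setup labels and the helper name are ours.

```python
# "Gen. toks/s" by concurrency level, copied from the evalscope perf tables above.
OTPS = {
    "vllm + bf16": {1: 88.81, 10: 669.34, 20: 954.49, 50: 1555.54, 100: 1579.02},
    "vllm + GPTQ w4a16": {1: 166.50, 10: 1058.72, 20: 1316.59, 50: 1816.37, 100: 1805.55},
    "vllm + GGUF iq4_xs": {1: 174.02, 10: 412.00, 20: 677.62, 50: 1201.77, 100: 1225.47},
    "llama.cpp + GGUF iq4_xs": {1: 212.76, 10: 209.89, 20: 207.84, 50: 204.79, 100: 204.04},
}


def rank(otps, key):
    """Sort setups from fastest to slowest according to key(series)."""
    return sorted(otps, key=lambda name: key(otps[name]), reverse=True)


single_user = rank(OTPS, key=lambda series: series[1])
peak = rank(OTPS, key=lambda series: max(series.values()))

print("single user:", " > ".join(single_user))
print("peak:       ", " > ".join(peak))
```

One detail the ranking makes visible: in these runs the unquantized bf16 model served by vLLM reaches a higher peak OTPS than the GGUF model served by vLLM, so in this setup the GGUF format saves memory but does not improve high-concurrency throughput.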

Original article: https://ninehills.tech/articles/143.html
Code: https://github.com/ninehills/llm-speedup
