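On April 7, the Intelligent Computing team at Alibaba's Tongyi Lab announced a new algorithm, FIPO (Future-KL Influenced Policy Optimization). The algorithm introduces a Future-KL mechanism that rewards key tokens, targeting the "reasoning length stagnation" problem in pure reinforcement learning (Pure RL) training. According to the team, in a pure-RL setting at the 32B scale, FIPO is the first to surpass both o1-mini and the same-scale DeepSeek-Zero-MATH in performance.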