The first application of DeepSeek-style RLVR to an omni-modal LLM, video included!
In the blink of an eye, Liefeng Bo's team at Alibaba's Tongyi Lab has stepped up again, and this time it is open source: R1-Omni is here.
Also based in Hangzhou, is this an open-source face-off between the two Fengs?
What have they done?
DeepSeek-R1 has brought reinforcement learning with verifiable rewards (RLVR) into the spotlight. A previous team applied RLVR to an image-text multimodal LLM, showing that it performs well on tasks such as geometric reasoning and visual counting.
However, combining RLVR with multimodal LLMs containing audio and dynamic visual content has not yet been explored.
This is the first time Liefeng Bo's team has combined RLVR with an omni-modal LLM, focusing on the task of emotion recognition, where both the visual and audio modalities play a key role.
The team's experiments revealed significant model improvements in three key areas.
The introduction of RLVR not only improves the model's overall performance on in-distribution data, but also yields stronger robustness on out-of-distribution datasets.
More importantly, the improved reasoning capability makes it possible to clearly analyze the roles played by the different modalities in the emotion recognition process.
R1-Omni also attracted a lot of attention on X. One user commented: "Very interesting paper, and I can immediately foresee its potential for sentiment analysis in marketing and advertising."
Other users said that interpretable multimodal learning is where the next generation of AI is headed.
Let's take a closer look at R1-Omni.
What does R1-Omni look like?
In terms of research methodology, the paper begins by introducing RLVR and GRPO, the techniques popularized by DeepSeek.
RLVR is a new training paradigm whose central idea is to evaluate outputs directly with a verification function, eliminating the need for a separate reward model trained on human preferences, as in traditional reinforcement learning from human feedback (RLHF).
Given an input question q, the policy model generates a response o, which is then evaluated with a verifiable reward function R(q, o); the optimization objective is to maximize this verified reward minus a KL-divergence-based regularization term.
RLVR simplifies the reward mechanism while ensuring its consistency with the task’s intrinsic correctness criteria.
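In other words, for each sampled response the training signal is the verifiable reward R(q, o) minus a KL penalty toward a frozen reference policy. The sketch below is a minimal, schematic rendering of that per-sample signal; the beta coefficient and the function names are illustrative placeholders, not the paper's implementation.

```python
# Minimal sketch of the per-sample RLVR training signal, assuming the common
# "verifiable reward minus KL penalty" formulation. Names and the beta value
# are placeholders for illustration only.

def rlvr_objective(verifiable_reward: float, kl_to_reference: float, beta: float = 0.04) -> float:
    """verifiable_reward: R(q, o) produced by a rule-based check of the response.
    kl_to_reference: KL divergence between the current policy and a frozen
    reference model on this response.
    Returns the regularized reward that the policy is trained to maximize."""
    return verifiable_reward - beta * kl_to_reference
```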
GRPO is a reinforcement learning method that differs from traditional approaches such as PPO. Where PPO relies on a critic model to evaluate candidate responses, GRPO simplifies training by directly comparing groups of generated responses, avoiding the need for an additional critic model.
Using a normalized scoring mechanism, GRPO encourages the model to prioritize responses with higher reward values within a group, enhancing the model’s ability to effectively differentiate between high-quality and low-quality outputs.
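As a rough illustration of that group-normalized scoring, the following sketch computes group-relative advantages by standardizing each response's reward against its group's mean and standard deviation; this follows the commonly published GRPO formulation and is not taken from the R1-Omni code.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-4):
    """Group-normalized scoring in the GRPO style (illustrative sketch).
    All entries in `rewards` come from responses sampled for the same prompt;
    each response is scored relative to the group mean, so no separate critic
    model is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four responses to one prompt, rewarded 1.0 (verified correct) or 0.0.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # correct responses get positive advantages
```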
Following the approach presented in DeepSeek-R1, the team combined GRPO with RLVR.
To construct the R1-Omni model, the team used a cold-start strategy inspired by the DeepSeek-R1 training approach.
HumanOmni-0.5B, an open-source omni-modal model designed for human scene understanding, was fine-tuned on a combined dataset containing 232 samples from the explainable multimodal emotion reasoning dataset EMER and 348 samples from a manually labeled HumanOmni dataset, giving the model an initial ability to reason about how visual and audio cues contribute to emotion recognition.
Afterwards, the model was optimized with RLVR training, using a reward function consisting of an accuracy reward, which assesses how well the predicted emotion matches the ground-truth emotion, and a format reward, which ensures the model output conforms to the specified tag format.
The model output is expected to contain two parts: a reasoning process, wrapped in <think></think> tags, explaining how the model integrates visual and audio cues to arrive at a prediction; and a final emotion label, wrapped in <answer></answer> tags, representing the predicted emotion.
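To make the two-part reward concrete, here is a minimal sketch of what an accuracy-plus-format reward could look like, assuming exact matching of the emotion label and a simple regex check of the <think>/<answer> structure; the matching rules and equal weighting are assumptions for illustration, not the paper's exact implementation.

```python
import re

# Expected output shape: "<think>...reasoning...</think><answer>emotion</answer>"
FORMAT_PATTERN = re.compile(r"^<think>.*?</think>\s*<answer>.*?</answer>$", re.DOTALL)

def format_reward(output: str) -> float:
    """1.0 if the output follows the <think>/<answer> tag format, else 0.0."""
    return 1.0 if FORMAT_PATTERN.match(output.strip()) else 0.0

def accuracy_reward(output: str, ground_truth: str) -> float:
    """1.0 if the emotion inside <answer> matches the label, else 0.0.
    Exact, case-insensitive matching is an assumption for illustration."""
    m = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
    if m is None:
        return 0.0
    return 1.0 if m.group(1).strip().lower() == ground_truth.strip().lower() else 0.0

def total_reward(output: str, ground_truth: str) -> float:
    # Equal weighting of the two terms is a placeholder choice.
    return accuracy_reward(output, ground_truth) + format_reward(output)

print(total_reward("<think>The trembling voice and frown suggest distress.</think><answer>sad</answer>", "sad"))
```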
Reasoning, understanding, generalization: three enhancements
In the experimental evaluation, the researchers compared R1-Omni with three baseline models: the original HumanOmni-0.5B; EMER-SFT, a model supervised fine-tuned on the EMER dataset; and MAFW-DFEW-SFT, a model based on HumanOmni-0.5B and supervised fine-tuned directly on the MAFW and DFEW training sets.
Evaluation metrics include unweighted average recall (UAR) and weighted average recall (WAR), which measure the model's ability to accurately classify emotions across different emotion classes.
Importantly, all evaluations are performed under the Open Vocabulary Emotion Test (OV-emotion) protocol. In this setting, the models are not given predefined emotion categories but must generate emotion labels directly from the input data, which increases both the difficulty and the practical value of the evaluation.
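For reference, UAR averages per-class recall so that every emotion class counts equally, while WAR weights each class by its sample count, which makes it equivalent to overall accuracy. A minimal sketch of both metrics (illustrative, not the evaluation code used in the paper):

```python
from collections import defaultdict

def uar_war(labels, predictions):
    """Unweighted average recall (UAR) treats every emotion class equally;
    weighted average recall (WAR) weights classes by their sample counts,
    which makes it equal to overall accuracy."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for y, p in zip(labels, predictions):
        total[y] += 1
        correct[y] += int(y == p)
    recalls = [correct[c] / total[c] for c in total]
    uar = sum(recalls) / len(recalls)
    war = sum(correct.values()) / sum(total.values())
    return uar, war

# Example with three classes and imbalanced counts:
labels = ["happy", "happy", "happy", "sad", "angry"]
preds  = ["happy", "happy", "sad",   "sad", "sad"]
print(uar_war(labels, preds))  # UAR averages per-class recall; WAR equals accuracy.
```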
The experimental results show that R1-Omni outperforms the three comparison models in three key aspects: enhanced reasoning ability, improved understanding ability, and better generalization ability.
The researchers presented a series of visualized examples comparing the output of R1-Omni with that of the other three models; R1-Omni provides a more coherent, accurate, and interpretable reasoning process.
In contrast, the original HumanOmni-0.5B and MAFW-DFEW-SFT models showed limited reasoning ability, while EMER-SFT had some reasoning ability but its reasoning process was less coherent and prone to hallucination.
On both the MAFW and DFEW datasets, R1-Omni outperformed the other models on both the UAR and WAR metrics.
For example, on the DFEW dataset, R1-Omni achieves a UAR of 65.83% and a WAR of 56.27%, significantly better than MAFW-DFEW-SFT's 60.23% UAR and 44.39% WAR.
To evaluate the generalization ability of the model, the researchers conducted experiments on the RAVDESS dataset, which serves as an out-of-distribution (OOD) test set.
Unlike the MAFW and DFEW datasets, which consist primarily of movie clips, the RAVDESS dataset features professional actors delivering lexically-matched statements in a neutral North American accent, a significant difference in data distribution that makes RAVDESS an ideal benchmark for assessing the model’s ability to generalize to unseen scenes.
R1-Omni showed significant improvement over the MAFW-DFEW-SFT model on the RAVDESS dataset, achieving a UAR of 43.00% and a WAR of 44.69%.
The base model HumanOmni-0.5B, the cold-start model EMER-SFT, the MAFW-DFEW-SFT model, and the final R1-Omni model are now fully open source.