<p align="center">
💻 <a href="https://github.com/infly-ai/INF-MLLM">GitHub</a> |
📊 <a href="https://huggingface.co/datasets/infly/Infinity-Doc2-5M">Dataset</a> |
📄 <a>Paper (coming soon...)</a> |
🚀 <a href="https://huggingface.co/spaces/infly/Infinity-Parser2-Demo">Demo</a>
</p>

## News

- [2026-05-09] Released our flagship document parsing models: [Infinity-Parser2-Pro](https://huggingface.co/infly/Infinity-Parser2-Pro), [Infinity-Parser2-Flash](https://huggingface.co/infly/Infinity-Parser2-Flash), and the dataset [Infinity-Doc2-5M](https://huggingface.co/datasets/infly/Infinity-Doc2-5M). Infinity-Parser2 achieves SOTA results on olmOCR-bench and ParseBench.

## Introduction

We are excited to release Infinity-Parser2, our latest flagship document understanding model, which comes in two variants to address different deployment constraints. Infinity-Parser2-Pro is optimized for maximum accuracy in precision-critical tasks and achieves state-of-the-art results on olmOCR-Bench (87.6%) and ParseBench (74.3%), surpassing frontier models including DeepSeek-OCR-2, PaddleOCR-VL-1.5, and MinerU-2.5. Infinity-Parser2-Flash is engineered for low-latency inference and delivers a 3.68x speedup over our previous Infinity-Parser-7B model. With significant upgrades to both our data engine and our multi-task reinforcement learning approach, the model consolidates robust multimodal parsing capabilities into a unified architecture, unlocking new zero-shot capabilities across a wide range of real-world business scenarios.

### Key Features

- **Upgraded Data Engine**: We have comprehensively enhanced our synthetic data engine to support both fixed-layout and flexible-layout document formats. By curating nearly 5 million diverse document parsing samples across a wide range of layouts, combined with a dynamic adaptive sampling strategy, we ensure highly balanced and robust multi-task learning across various document types.
- **Multi-Task Reinforcement Learning**: We designed a novel verifiable reward system to support Joint Reinforcement Learning (RL), enabling seamless and simultaneous co-optimization of multiple complex tasks, including document parsing, element parsing, chart parsing, chemical formula parsing, document VQA, and general multimodal understanding.
- **Breakthrough Parsing Performance**: Infinity-Parser2-Pro substantially outperforms our previous 7B model, achieving 87.6% on olmOCR-Bench and 74.3% on ParseBench and surpassing frontier models such as DeepSeek-OCR-2, PaddleOCR-VL, and MinerU-2.5.
- **Inference Acceleration**: Infinity-Parser2-Flash delivers significantly higher efficiency than Infinity-Parser-7B, with inference throughput increased by 3.68x (from 441 to 1,624 tokens/sec), reducing both deployment latency and costs.
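
The bullets above describe the verifiable reward system only at a high level. As a toy illustration (not the authors' actual reward), a verifiable reward for the document parsing task can be as simple as a string-similarity score between the model's predicted Markdown and a reference parse:

```python
import difflib

def parse_reward(prediction: str, reference: str) -> float:
    """Toy verifiable reward: similarity of predicted vs. reference Markdown, in [0, 1]."""
    return difflib.SequenceMatcher(None, prediction, reference).ratio()

# An exact parse scores 1.0; unrelated output scores near 0.
print(parse_reward("| a | b |", "| a | b |"))   # 1.0
print(parse_reward("| a | b |", "lorem ipsum"))
```

In practice, each task (tables, formulas, charts, VQA) is scored with its own task-specific metric, which is what makes joint multi-task RL over heterogeneous outputs feasible.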

## Performance

<p align="left">
<img src="https://raw.githubusercontent.com/infly-ai/INF-MLLM/main/Infinity-Parser2/assets/olmocr_bench_perf.png" width="1200"/>
</p>

<p align="left">
<img src="https://raw.githubusercontent.com/infly-ai/INF-MLLM/main/Infinity-Parser2/assets/parsebench_perf.png" width="1200"/>
</p>

<table align="center">
<thead>
<tr>
<th>Task</th>
<th>Infinity-Parser2-Pro</th>
<th>Infinity-Parser2-Flash</th>
<th>PaddleOCR-VL-1.5</th>
<th>DeepSeek-OCR-2</th>
<th>MinerU-2.5</th>
<th>Gemini-3-Pro</th>
</tr>
</thead>
<tbody>
<tr><td colspan="7"><b>Document Parsing</b></td></tr>
<tr><td>olmOCR-bench</td><td><b>87.6</b></td><td>86.0</td><td>80.0†</td><td>76.3</td><td>75.2</td><td>-</td></tr>
<tr><td>ParseBench</td><td><b>74.3</b></td><td>72.2</td><td>40.9†</td><td>41.2</td><td>45.9</td><td>69.1‡</td></tr>
<tr><td>OmniDocBench-v1.6</td><td>93.95</td><td>91.98</td><td><b>94.87</b></td><td>90.17</td><td>92.98</td><td>92.85</td></tr>
<tr><td colspan="7"><b>Layout Analysis (mIoU)</b></td></tr>
<tr><td>DocLayNet</td><td>64.93*</td><td>64.97*</td><td><b>71.05*</b></td><td>45.62*</td><td>67.74*</td><td>-</td></tr>
<tr><td>D4LA</td><td><b>52.41*</b></td><td>46.05*</td><td>50.21*</td><td>33.03*</td><td>51.62*</td><td>-</td></tr>
<tr><td>OmniDocBench-v1.5-Layout</td><td>74.56*</td><td>73.07*</td><td>74.80*</td><td>55.28*</td><td><b>76.28*</b></td><td>-</td></tr>
<tr><td colspan="7"><b>Element Parsing</b></td></tr>
<tr><td>OmniDocBench-v1.5-TextBlock</td><td>93.66</td><td>93.53</td><td><b>94.97*</b></td><td>84.13*</td><td>86.00</td><td>-</td></tr>
<tr><td>PubTabNet (val)</td><td><b>94.76</b></td><td>92.41</td><td>84.60</td><td>89.53*</td><td>89.07</td><td>91.40</td></tr>
<tr><td>UniMERNet</td><td><b>97.7</b></td><td>96.5</td><td>95.8*</td><td>79.8*</td><td>96.5</td><td>96.4</td></tr>
<tr><td colspan="7"><b>Chart Parsing</b></td></tr>
<tr><td>Chart2Table</td><td>80.45</td><td>80.49</td><td><b>86.2*</b></td><td>-</td><td>-</td><td>-</td></tr>
<tr><td>Chart2Json</td><td><b>73.69</b></td><td>67.66</td><td>-</td><td>-</td><td>-</td><td>-</td></tr>
<tr><td colspan="7"><b>Chemical Formula Parsing</b></td></tr>
<tr><td>CoSyn_Chemical</td><td><b>71.48</b></td><td>62.08</td><td>-</td><td>52.16*</td><td>-</td><td>-</td></tr>
<tr><td colspan="7"><b>Document VQA</b></td></tr>
<tr><td>DocVQA (val)</td><td><b>96.43</b></td><td>93.16</td><td>-</td><td>43.42*</td><td>-</td><td>93.68*</td></tr>
<tr><td>InfoVQA (val)</td><td><b>86.26</b></td><td>75.94</td><td>-</td><td>22.07*</td><td>-</td><td>85.24*</td></tr>
<tr><td colspan="7"><b>General Multimodal Understanding</b></td></tr>
<tr><td>AI2D</td><td>88.89</td><td>79.53</td><td>-</td><td>37.66*</td><td>-</td><td><b>91.87*</b></td></tr>
<tr><td>MathVista (testmini)</td><td>71.4</td><td>59.5</td><td>-</td><td>-</td><td>-</td><td><b>81.8*</b></td></tr>
<tr><td>MMBench-EN (dev)</td><td>87.54</td><td>77.92</td><td>-</td><td>-</td><td>-</td><td><b>90.29*</b></td></tr>
<tr><td>MMBench-CN (dev)</td><td>86.43</td><td>75.77</td><td>-</td><td>-</td><td>-</td><td><b>90.98*</b></td></tr>
<tr><td>MMMU (val)</td><td><b>61.89</b></td><td>45.89</td><td>-</td><td>-</td><td>-</td><td>56.00*</td></tr>
<tr><td>MMStar</td><td>69.66</td><td>57.13</td><td>-</td><td>-</td><td>-</td><td><b>83.78*</b></td></tr>
<tr><td>OCRBench</td><td>86.20</td><td>81.60</td><td>-</td><td>47.20*</td><td>-</td><td><b>89.30*</b></td></tr>
</tbody>
</table>

Note: '*' denotes results evaluated using our internal evaluation tools; results marked with '†' are from PaddleOCR-VL; '‡' denotes results from Gemini-3.1-Pro.

## Quick Start

### 1. Minimal "Hello World" (Native Transformers)
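
The Quick Start also covers serving the model with vLLM (`vllm serve infly/Infinity-Parser2-Pro`), which exposes an OpenAI-compatible endpoint. The sketch below builds a chat-completions request payload for such an endpoint using only the Python standard library; the prompt wording and the PNG data-URL encoding are illustrative assumptions, not the model's documented interface — see the official guide for exact usage.

```python
import base64
import json

def build_parse_request(image_bytes: bytes,
                        model: str = "infly/Infinity-Parser2-Pro",
                        prompt: str = "Parse this document page to Markdown.") -> dict:
    """Build an OpenAI-compatible /v1/chat/completions payload for a vLLM server.

    The prompt text is a placeholder; the model's recommended parsing prompt
    is documented in the official guide.
    """
    image_b64 = base64.b64encode(image_bytes).decode("utf-8")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text", "text": prompt},
            ],
        }],
        "temperature": 0.0,  # deterministic decoding for parsing
    }

# Placeholder bytes stand in for a real page image read from disk.
payload = build_parse_request(b"\x89PNG placeholder")
print(json.dumps(payload)[:80])
```

The resulting dict can be POSTed to `http://localhost:8000/v1/chat/completions` with any HTTP client once the server is up.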

For more details, please refer to the [official guide](https://github.com/infly-ai/INF-MLLM/blob/main/Infinity-Parser2).

## Limitations

Infinity-Parser2 has several known limitations. It primarily supports English and Chinese documents, and performance degrades on multilingual content. Accuracy may also drop when parsing charts with complex layouts, as well as documents containing multi-oriented elements such as tables rotated at varying angles. Additionally, the model does not capture fine-grained text formatting (e.g., bold, italic, strikethrough) and exhibits suboptimal multimodal instruction following, meaning it may not always reliably execute complex multi-step visual instructions.

## Acknowledgments

We would like to thank [Qwen3.5](https://github.com/QwenLM/Qwen3.5), [ms-swift](https://github.com/modelscope/ms-swift), [VeRL](https://github.com/verl-project/verl), [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval), [olmocr](https://huggingface.co/datasets/allenai/olmOCR-bench), [PaddleOCR-VL](https://github.com/PaddlePaddle/PaddleOCR), [MinerU](https://github.com/opendatalab/MinerU), [dots.ocr](https://github.com/rednote-hilab/dots.ocr), and [Chandra-OCR-2](https://github.com/datalab-to/chandra) for providing datasets, code, and models.

# License

This model is licensed under Apache-2.0.