| # | Model | Size | Score |
|---|---|---|---|
| 🥇 | datalab-to/lift | 9B | 88.5 |
| 🥈 | numind/NuExtract3 | 4B | 84.5 |
| 🥉 | MiniMaxAI/MiniMax-M3 | MoE | 82.4 |
| 4 | meta-llama/Llama-4-Scout-17B-16E | 109B | 81.9 |
| 5 | Qwen/Qwen2.5-VL-72B | 72B | 71.3 |
| 6 | Qwen/Qwen3-VL-235B-A22B | 235B | 65.0 |
| 7 | Qwen/Qwen3-VL-8B | 8B | 64.7 |
| 8 | google/gemma-3-4b-it | 4B | 45.8 |

There is no single best model. The top two, lift (9B) and NuExtract (4B), are purpose-built schema-native extractors (marked in green), and both beat much larger general-purpose VLMs on these cards. Size decouples from quality: Qwen3-VL-8B (64.7) effectively matches Qwen3-VL-235B-A22B (65.0); use the size filter to see this. Every model here is also open-weight and self-hostable, so an institution can run extraction on its own hardware, and sensitive or licensed collections never leave the building.
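
The size-versus-score claim can be checked against the table itself. The following is a minimal sketch, not part of the benchmark code: it computes a Spearman rank correlation between parameter count and score for the seven rows that list a numeric size. MiniMaxAI/MiniMax-M3 is left out because its size is given only as "MoE", and the helper names `average_ranks` and `spearman` are illustrative. With only seven points, read the result as a rough indication rather than a statistical finding.

```python
from math import sqrt

# (model, size in billions of parameters, score), copied from the table.
# MiniMaxAI/MiniMax-M3 is omitted: its size is listed only as "MoE".
ROWS = [
    ("datalab-to/lift", 9, 88.5),
    ("numind/NuExtract3", 4, 84.5),
    ("meta-llama/Llama-4-Scout-17B-16E", 109, 81.9),
    ("Qwen/Qwen2.5-VL-72B", 72, 71.3),
    ("Qwen/Qwen3-VL-235B-A22B", 235, 65.0),
    ("Qwen/Qwen3-VL-8B", 8, 64.7),
    ("google/gemma-3-4b-it", 4, 45.8),
]


def average_ranks(values):
    """Rank values ascending from 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


sizes = [size for _, size, _ in ROWS]
scores = [score for _, _, score in ROWS]
rho = spearman(sizes, scores)
print(f"Spearman rho (size vs score): {rho:.2f}")
```

On these seven rows the coefficient comes out at about 0.13, close to zero, which is consistent with size telling you little about score on this small sample.
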

Illustrative proof of concept: 7 typed items with silver labels (NuExtract-3 plus one reviewer agent; not human-verified, so NuExtract is advantaged). Models run via schema-native Spaces, the HF Inference Providers router, and HF jobs-serving. This is not a ranking but a demonstration of the shape. Source: github.com/davanstrien/glam-extract-bench