๐Ÿ“ŠLeaderboard GLAM extractionPOC ยท silver Task: index cards + forms ยท score = typed-KIE F1 ร—100
Parameters Size
#ModelSizeScore
๐Ÿฅ‡ datalab-to/lift 9B 88.5
๐Ÿฅˆ numind/NuExtract3 4B 84.5
๐Ÿฅ‰ MiniMaxAI/MiniMax-M3 MoE 82.4
4 meta-llama/Llama-4-Scout-17B-16E 109B 81.9
5 Qwen/Qwen2.5-VL-72B 72B 71.3
6 Qwen/Qwen3-VL-235B-A22B 235B 65.0
7 Qwen/Qwen3-VL-8B 8B 64.7
8 google/gemma-3-4b-it 4B 45.8

There is no best model. The top two (โ— green = purpose-built schema-native extractors, lift 9B and NuExtract 4B) beat much larger general VLMs on these cards. Size decouples from quality โ€” an 8B matches a 235B. And every model here is open-weight and self-hostable, so an institution can run extraction on its own hardware and sensitive or licensed collections never leave the building. Use the size filter to see.

Illustrative POC โ€” 7 typed items, silver labels (NuExtract-3 + one reviewer agent; not human-verified, so NuExtract is advantaged). Models run via schema-native Spaces, the HF Inference Providers router, and HF jobs-serving. Not a ranking โ€” a demonstration of the shape. github.com/davanstrien/glam-extract-bench