LoRA Fine-Tunes 270M-8B Models for Merchant Extraction

June 9, 2026 | By LDS Team

Relevance Score: 5.9

The arXiv paper 2606.08051 (submitted June 6, 2026) evaluates LoRA fine-tuning across 24 model variants for merchant information extraction from noisy bank transaction strings. The paper reports that the authors' production LoRA-fine-tuned LLaMA 3.1-8B achieves 96.95% F1, and that a LoRA rank-8 reproduction of that fine-tune yields 96.75% F1, a 0.20-point gap to the rank-32 baseline. The authors compare Gemma 3 (270M, 1B, 4B), Qwen 3.5 (0.8B, 2B, 4B), Aya 3.35B, and LLaMA 3.1-8B, and report that Qwen 3.5 4B with JSON-only prompting reaches 96.60% F1, while Qwen 3.5 0.8B achieves 94.75% F1. The paper also reports deploying 14 fine-tuned sub-8B models as Databricks Model Serving endpoints, observing an average F1 change of 0.8 points in production, with Aya 3.35B showing a 3-5 point decline under serving conditions.

What happened

The arXiv paper 2606.08051, submitted June 6, 2026, reports a deployment-focused study of LoRA fine-tuning for merchant information extraction from noisy bank transaction strings. Per the paper, the authors evaluated 24 model variants across four model families: Gemma 3 (270M, 1B, 4B), Qwen 3.5 (0.8B, 2B, 4B), Aya (3.35B), and LLaMA 3.1-8B. The paper reports that the authors' production LoRA-fine-tuned LLaMA 3.1-8B achieves 96.95% F1 on the task.
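
To make the metric concrete, here is a minimal sketch of how a field-level F1 score could be computed over JSON outputs. The scoring protocol, the field names (`merchant`, `city`), and the example strings are illustrative assumptions and are not taken from the paper, which may define a match differently.

```python
import json


def field_f1(predictions, references):
    """Micro-averaged F1 over (field, value) pairs, using exact match
    after lowercasing and trimming. This matching rule is an assumption."""
    tp = fp = fn = 0
    for raw, gold in zip(predictions, references):
        try:
            pred = json.loads(raw)
        except json.JSONDecodeError:
            pred = {}  # unparseable output contributes no predicted fields
        if not isinstance(pred, dict):
            pred = {}
        pred_pairs = {(k, str(v).strip().lower()) for k, v in pred.items()}
        gold_pairs = {(k, str(v).strip().lower()) for k, v in gold.items()}
        tp += len(pred_pairs & gold_pairs)
        fp += len(pred_pairs - gold_pairs)
        fn += len(gold_pairs - pred_pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical model outputs; the second one is truncated mid-object.
outputs = [
    '{"merchant": "Starbucks", "city": "Seattle"}',
    '{"merchant": "Amazon"',
    '{"merchant": "Uber", "city": "Austin"}',
]
# Hypothetical gold labels for the same three transactions.
gold = [
    {"merchant": "Starbucks", "city": "Seattle"},
    {"merchant": "Amazon", "city": "Seattle"},
    {"merchant": "Uber Eats", "city": "Austin"},
]
score = field_f1(outputs, gold)
print(f"field-level F1: {score:.2f}")
```

In this sketch a truncated or malformed output contributes only misses, so output-format failures lower recall directly.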

Technical details

Per the paper, the study measures accuracy, inference throughput, training cost, and hardware behavior to assess production suitability. The paper finds that reproducing the LLaMA 3.1-8B fine-tune with a LoRA rank of 8 achieves 96.75% F1, 0.20 points below the rank-32 baseline. The authors report Qwen 3.5 4B with JSON-only prompting reaches 96.60% F1, within 0.35 points of the 8B baseline, and that Qwen 3.5 0.8B achieves 94.75% F1, matching models 2.5-4x larger in accuracy while offering a favorable latency-accuracy trade-off. The paper also reports that chain-of-thought fine-tuning improves F1 by 0.3-1.8 points across most models, but that Qwen 3.5 4B performs best with direct JSON-only prompting. The authors further state that Qwen 3.5 Think and Nothink training templates produced nearly identical results (F1 differences < 0.004) for this structured extraction task.
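
As a rough illustration of why a lower rank reduces fine-tuning cost: LoRA adds two low-rank matrices to each adapted weight, so a d_out x d_in matrix gains rank * (d_in + d_out) trainable parameters. The sketch below assumes a single square 4096 x 4096 projection, chosen only for illustration; which layers the authors adapted is not stated here.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix:
    a (rank x d_in) matrix plus a (d_out x rank) matrix."""
    return rank * (d_in + d_out)


HIDDEN = 4096  # assumed layer width, for illustration only
full = HIDDEN * HIDDEN  # parameters in the frozen base matrix

for rank in (8, 32):
    added = lora_params(HIDDEN, HIDDEN, rank)
    print(f"rank {rank:>2}: {added:,} trainable params "
          f"({100 * added / full:.2f}% of the base matrix)")

ratio = lora_params(HIDDEN, HIDDEN, 32) / lora_params(HIDDEN, HIDDEN, 8)
print(f"rank 32 trains {ratio:.0f}x the adapter parameters of rank 8")
```

Adapter size scales linearly with rank, so moving from rank 32 to rank 8 cuts the trainable adapter parameters per matrix by a factor of four, whatever the layer width.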

Deployment results

According to the paper, the authors deployed all 14 fine-tuned sub-8B models as Databricks Model Serving endpoints and observed that benchmark performance transfers reliably to production, with an average F1 change of 0.8 points. The paper notes Aya 3.35B as an exception, exhibiting a 3-5 point decline under serving conditions.
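
A transfer check of this kind can be sketched as a per-model comparison of offline and served F1. The model names and scores below are placeholders, not values from the paper, and treating the "average F1 change" as a mean absolute difference is an assumption. The 3-point flagging threshold is also hypothetical.

```python
from statistics import mean


def transfer_report(offline_f1, served_f1, threshold=3.0):
    """Compare offline and served F1 (in points) for each model.

    Returns the per-model deltas, the mean absolute change, and the
    models whose F1 dropped by at least `threshold` points."""
    deltas = {name: served_f1[name] - offline_f1[name] for name in offline_f1}
    mean_abs_change = mean(abs(d) for d in deltas.values())
    flagged = sorted(name for name, d in deltas.items() if d <= -threshold)
    return deltas, mean_abs_change, flagged


# Placeholder scores for three hypothetical endpoints.
offline = {"model_a": 95.0, "model_b": 94.0, "model_c": 92.0}
served = {"model_a": 94.5, "model_b": 94.5, "model_c": 88.0}

deltas, mean_abs_change, flagged = transfer_report(offline, served)
for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f} points")
print(f"mean absolute change: {mean_abs_change:.2f} points")
print(f"flagged regressions: {flagged}")
```

Reporting per-model deltas alongside the average keeps an outlier from being hidden by an otherwise tight mean.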

Editorial analysis

For practitioners, the paper documents a concrete, end-to-end evaluation that connects fine-tune choices to latency and serving behavior. Teams comparing model size versus cost can use the reported F1 trade-offs and the production transfer numbers as empirical reference points rather than relying solely on offline benchmarks. The consistent performance of Qwen 3.5 0.8B and 4B variants in this task illustrates a broader industry pattern where smaller, well-tuned models can approach the accuracy of much larger models on narrow extraction tasks while reducing inference cost.

What to watch

Observers should track:

• replication of these results on other transaction datasets and languages
• whether JSON-only prompting remains robust across noise levels and input truncation
• hardware-specific serving regressions similar to the Aya 3.35B decline reported by the paper

Future evaluations that report latency-per-F1 and end-to-end cost-per-query will make these trade-offs easier to operationalize.
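
As one example of such a metric, cost-per-query can be estimated from an endpoint's hourly price and its sustained throughput. The figures below are placeholders rather than numbers from the paper or from any pricing page, and the formula assumes the endpoint is fully utilized.

```python
def cost_per_query(hourly_cost_usd, queries_per_second):
    """Serving cost per query, assuming the endpoint runs at the given
    sustained throughput for the whole hour (full utilization)."""
    if queries_per_second <= 0:
        raise ValueError("queries_per_second must be positive")
    return hourly_cost_usd / (queries_per_second * 3600)


# Placeholder inputs: a 4.00 USD/hour endpoint sustaining 50 queries/second.
per_query = cost_per_query(4.0, 50)
print(f"cost per 1,000 queries: {1000 * per_query:.4f} USD")
```

Real endpoints rarely run at full utilization, so an estimate like this is a lower bound on actual cost per query.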

Key Points

1. LoRA rank-8 reproductions can reach within 0.2 F1 points of rank-32 baselines, lowering fine-tune costs for deployment.
2. Small Qwen 3.5 models (0.8B) can match larger models' accuracy for structured extraction, improving latency-cost trade-offs.
3. Benchmark-to-production transfer was tight (average F1 change 0.8 points), but architecture-specific serving drops occurred for Aya 3.35B.

Scoring Rationale

A single arXiv preprint benchmarking LoRA fine-tuning across 24 model variants (270M to 8B parameters) for merchant information extraction, with concrete production-serving observations (Databricks endpoints, rank-8 vs rank-32 near-parity). Practically useful to applied NLP/MLOps teams, but a narrow vertical study, which keeps it in the upper-solid range.

Sources

Primary source and supporting public references used for this report.

Primary source: arxiv.org, [2606.08051] How Small Can You Go? LoRA Fine-Tuning 270M-8B Models for Merchant Information Extraction in Financial Transactions
