Rule-Based vs. LLM Document Extraction: A Hands-On Comparison for B2B Orders

Introduction

Extracting structured data from business documents—such as purchase orders, invoices, or delivery receipts—is a common yet challenging task in B2B workflows. Traditional rule-based systems have long been the default choice, but the rise of large language models (LLMs) offers a new, more flexible alternative. This article presents a practical comparison between a rule-based PDF extractor built with pytesseract and an LLM-based solution powered by Ollama and LLaMA 3. Both were applied to the same realistic B2B order scenario to evaluate their strengths and weaknesses.

Source: towardsdatascience.com

The B2B Order Scenario

The test dataset consisted of scanned PDF purchase orders containing fields such as order number, vendor name, line items (quantities, part numbers, descriptions), pricing, and totals. These documents varied slightly in layout and had occasional handwriting marks, simulating real-world inconsistency. The goal was to extract all relevant fields accurately and quickly—without manual intervention.

Rule-Based Extraction with Pytesseract

Implementation

For the rule-based approach, I used pytesseract, a Python wrapper for Google's Tesseract OCR engine. The workflow was:

  1. Preprocess the PDF pages (convert to grayscale, apply thresholding, and deskew).
  2. Run OCR to extract raw text and bounding boxes.
  3. Apply handcrafted regular expressions and layout heuristics to locate and parse fields (e.g., "Order Number:" followed by alphanumeric characters).
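The three steps above can be sketched as follows. The regex patterns and field names here are illustrative, not the actual rules used in the experiment; the OCR step is kept behind a function because it requires pytesseract, pdf2image, and the Tesseract binary, while the parsing step runs standalone.

```python
import re

# Hypothetical field rules -- the real system used many more patterns
# plus layout heuristics based on OCR bounding boxes.
FIELD_PATTERNS = {
    "order_number": re.compile(r"Order\s+Number:\s*([A-Z0-9-]+)", re.IGNORECASE),
    "vendor": re.compile(r"Vendor:\s*(.+)", re.IGNORECASE),
    "total": re.compile(r"Total:\s*\$?([\d,]+\.\d{2})", re.IGNORECASE),
}

def parse_fields(ocr_text: str) -> dict:
    """Apply the handcrafted regex rules to raw OCR text."""
    results = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(ocr_text)
        if match:
            results[name] = match.group(1).strip()
    return results

def extract_from_pdf(pdf_path: str) -> dict:
    """Full pipeline: render pages, preprocess, OCR, then parse."""
    # Requires pdf2image and pytesseract (and the Tesseract engine).
    from pdf2image import convert_from_path
    import pytesseract
    pages_text = []
    for page in convert_from_path(pdf_path):
        gray = page.convert("L")  # grayscale; thresholding/deskew omitted here
        pages_text.append(pytesseract.image_to_string(gray))
    return parse_fields("\n".join(pages_text))

# Parsing step demonstrated on canned OCR output:
sample = "Order Number: PO-2024-0193\nVendor: Acme Industrial\nTotal: $1,240.50"
print(parse_fields(sample))
# → {'order_number': 'PO-2024-0193', 'vendor': 'Acme Industrial', 'total': '1,240.50'}
```

The brittleness discussed below lives almost entirely in `FIELD_PATTERNS`: any layout that phrases a label differently needs a new rule.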

Strengths

  - Fast and deterministic: well under a second per page, with identical output on every run.
  - Runs on modest hardware with no model serving required.

Weaknesses

  - Brittle: the handcrafted rules broke on roughly 20% of documents when the layout shifted.
  - Expensive to build and maintain: about 3 days of rule tweaking for this scenario alone.

LLM-Based Extraction with Ollama and LLaMA 3

Implementation

For the LLM approach, I used Ollama to serve the locally hosted LLaMA 3 model (8B parameters). The pipeline was:

  1. Convert PDF pages to images (as before).
  2. Send the image directly to the LLM along with a structured prompt specifying which fields to extract (e.g., "Extract order number, vendor, line items, and total from this purchase order.").
  3. Parse the JSON object returned by the model into the extracted fields.
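A minimal sketch of this pipeline is below. The model tag, the `images` field, and the prompt wording are assumptions for illustration; sending page images this way also presumes your Ollama instance serves a vision-capable model. The network call is kept behind a function (it needs the `ollama` package and a running server), while the response-parsing step runs standalone.

```python
import json

PROMPT = (
    "Extract order number, vendor, line items, and total from this "
    "purchase order. Respond with a single JSON object using the keys "
    "order_number, vendor, line_items, total."
)

def parse_response(raw: str) -> dict:
    """Parse the model's reply; fail loudly if it is not the JSON we asked for."""
    data = json.loads(raw)
    missing = {"order_number", "vendor", "line_items", "total"} - data.keys()
    if missing:
        raise ValueError(f"model omitted fields: {sorted(missing)}")
    return data

def extract_with_llm(image_path: str) -> dict:
    """Send one page image to a locally served model via Ollama."""
    # Requires the `ollama` Python package and a running Ollama server;
    # the model tag is a placeholder for whatever your instance exposes.
    import ollama
    response = ollama.chat(
        model="llama3",
        messages=[{"role": "user", "content": PROMPT, "images": [image_path]}],
        format="json",  # ask Ollama to constrain the reply to valid JSON
    )
    return parse_response(response["message"]["content"])

# Offline check of the parsing step with a canned reply:
reply = '{"order_number": "PO-2024-0193", "vendor": "Acme Industrial", "line_items": [], "total": "1240.50"}'
print(parse_response(reply))
```

Validating the reply in `parse_response` matters in practice: unlike a regex rule, the model can silently drop a field, and the pipeline should notice.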

Strengths

  - Accurate and robust: it handled every layout variation in the test set.
  - Minimal set-up: roughly 30 minutes of prompt engineering replaced days of rule writing.

Weaknesses

  - Slow: about 9.2 seconds per page, versus 0.4 seconds for the rule-based pipeline.
  - Needs capable hardware to serve the 8B model locally, and its output is not deterministic.

Head-to-Head Comparison

I evaluated both systems on 50 documents drawn from the same B2B order scenario. Key metrics were:

| Metric                           | Rule-Based (pytesseract)   | LLM (Ollama + LLaMA 3)           |
|----------------------------------|----------------------------|----------------------------------|
| Accuracy (field-level F1)        | 0.85                       | 0.93                             |
| Average processing time per page | 0.4 seconds                | 9.2 seconds                      |
| Set-up effort                    | 3 days of rule tweaking    | 30 minutes of prompt engineering |
| Robustness to layout change      | Low (broke on 20% of docs) | High (handled all variations)    |

When to use each approach

Use the rule-based pipeline when documents follow a small number of stable, well-known layouts and per-page latency matters; use the LLM when layouts vary, fields are complex, or set-up time is the constraint. In many B2B pipelines the two combine well, with rules handling the simple fields and the LLM serving as a fallback.

Conclusion

Building the same B2B document extractor twice revealed clear trade-offs. The rule-based system with pytesseract offered speed and determinism but required constant maintenance. The LLM approach with Ollama and LLaMA 3 provided superior flexibility and accuracy at the cost of latency and hardware requirements. For many real-world B2B scenarios, a hybrid solution may be best: use rules for simple, well-known fields and an LLM as a fallback or for complex extraction tasks.
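The hybrid idea above can be sketched as a simple dispatcher. The function names, the required-field set, and the merge policy are hypothetical, not from the article; the point is only that the slow LLM call is skipped whenever the cheap rules already produced a complete result.

```python
# Illustrative required-field set for this B2B order scenario.
REQUIRED_FIELDS = {"order_number", "vendor", "total"}

def hybrid_extract(doc, rule_extractor, llm_extractor) -> dict:
    """Try the cheap, deterministic rule-based extractor first;
    fall back to the LLM only when required fields are missing."""
    fields = rule_extractor(doc)
    if REQUIRED_FIELDS <= fields.keys():
        return fields  # rules succeeded; skip the slow LLM call entirely
    llm_fields = llm_extractor(doc)
    # Merge policy (an assumption): keep rule-based values where present,
    # let the LLM fill in whatever the rules missed.
    return {**llm_fields, **fields}

# Demo with stub extractors standing in for the real pipelines:
rules = lambda d: {"order_number": "PO-1", "vendor": "Acme"}  # missing "total"
llm = lambda d: {"order_number": "PO-1", "vendor": "Acme Corp", "total": "99.00"}
print(hybrid_extract("doc.pdf", rules, llm))
# → {'order_number': 'PO-1', 'vendor': 'Acme Corp', 'total': '99.00'} minus rule overrides,
#   i.e. {'order_number': 'PO-1', 'vendor': 'Acme', 'total': '99.00'}
```

With the measured numbers above, a hybrid that resolves most pages via rules would keep average latency close to 0.4 seconds while recovering accuracy on the layouts the rules cannot handle.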

This article is based on practical experiments and was first published on Towards Data Science.
