Rule-Based vs. LLM Document Extraction: A Hands-On Comparison for B2B Orders

Introduction

Extracting structured data from business documents—such as purchase orders, invoices, or delivery receipts—is a common yet challenging task in B2B workflows. Traditional rule-based systems have long been the default choice, but the rise of large language models (LLMs) offers a new, more flexible alternative. This article presents a practical comparison between a rule-based PDF extractor built with pytesseract and an LLM-based solution powered by Ollama and LLaMA 3. Both were applied to the same realistic B2B order scenario to evaluate their strengths and weaknesses.

Source: towardsdatascience.com

The B2B Order Scenario

The test dataset consisted of scanned PDF purchase orders containing fields such as order number, vendor name, line items (quantities, part numbers, descriptions), pricing, and totals. These documents varied slightly in layout and had occasional handwriting marks, simulating real-world inconsistency. The goal was to extract all relevant fields accurately and quickly—without manual intervention.

Rule-Based Extraction with Pytesseract

Implementation

For the rule-based approach, I used pytesseract, a Python wrapper for Google's Tesseract OCR engine. The workflow was:

  1. Preprocess the PDF pages (convert to grayscale, apply thresholding, and deskew).
  2. Run OCR to extract raw text and bounding boxes.
  3. Apply handcrafted regular expressions and layout heuristics to locate and parse fields (e.g., "Order Number:" followed by alphanumeric characters).
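The three steps above can be sketched as follows. The regex patterns and field names here are illustrative, not the actual rules used in the experiment; the OCR step is kept behind a function because it requires pytesseract, pdf2image, and the Tesseract binary, while the parsing step runs standalone.

```python
import re

# Hypothetical field rules -- the real system used many more patterns
# plus layout heuristics based on OCR bounding boxes.
FIELD_PATTERNS = {
    "order_number": re.compile(r"Order\s+Number:\s*([A-Z0-9-]+)", re.IGNORECASE),
    "vendor": re.compile(r"Vendor:\s*(.+)", re.IGNORECASE),
    "total": re.compile(r"Total:\s*\$?([\d,]+\.\d{2})", re.IGNORECASE),
}

def parse_fields(ocr_text: str) -> dict:
    """Apply the handcrafted regex rules to raw OCR text."""
    results = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(ocr_text)
        if match:
            results[name] = match.group(1).strip()
    return results

def extract_from_pdf(pdf_path: str) -> dict:
    """Full pipeline: render pages, preprocess, OCR, then parse."""
    # Requires pdf2image and pytesseract (and the Tesseract engine).
    from pdf2image import convert_from_path
    import pytesseract
    pages_text = []
    for page in convert_from_path(pdf_path):
        gray = page.convert("L")  # grayscale; thresholding/deskew omitted here
        pages_text.append(pytesseract.image_to_string(gray))
    return parse_fields("\n".join(pages_text))

# Parsing step demonstrated on canned OCR output:
sample = "Order Number: PO-2024-0193\nVendor: Acme Industrial\nTotal: $1,240.50"
print(parse_fields(sample))
# → {'order_number': 'PO-2024-0193', 'vendor': 'Acme Industrial', 'total': '1,240.50'}
```

The brittleness discussed below lives almost entirely in `FIELD_PATTERNS`: any layout that phrases a label differently needs a new rule.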

Strengths

  - Fast and deterministic: well under a second per page, with identical output on every run.
  - Runs on modest hardware with no model serving required.

Weaknesses

  - Brittle: the handcrafted rules broke on roughly 20% of documents when the layout shifted.
  - Expensive to build and maintain: about 3 days of rule tweaking for this scenario alone.

LLM-Based Extraction with Ollama and LLaMA 3

Implementation

For the LLM approach, I used Ollama to serve the locally hosted LLaMA 3 model (8B parameters). The pipeline was:

  1. Convert PDF pages to images (as before).
  2. Send the image directly to the LLM along with a structured prompt specifying which fields to extract (e.g., "Extract order number, vendor, line items, and total from this purchase order.").
  3. Parse the JSON object returned by the model into the extracted fields.
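A minimal sketch of this pipeline is below. The model tag, the `images` field, and the prompt wording are assumptions for illustration; sending page images this way also presumes your Ollama instance serves a vision-capable model. The network call is kept behind a function (it needs the `ollama` package and a running server), while the response-parsing step runs standalone.

```python
import json

PROMPT = (
    "Extract order number, vendor, line items, and total from this "
    "purchase order. Respond with a single JSON object using the keys "
    "order_number, vendor, line_items, total."
)

def parse_response(raw: str) -> dict:
    """Parse the model's reply; fail loudly if it is not the JSON we asked for."""
    data = json.loads(raw)
    missing = {"order_number", "vendor", "line_items", "total"} - data.keys()
    if missing:
        raise ValueError(f"model omitted fields: {sorted(missing)}")
    return data

def extract_with_llm(image_path: str) -> dict:
    """Send one page image to a locally served model via Ollama."""
    # Requires the `ollama` Python package and a running Ollama server;
    # the model tag is a placeholder for whatever your instance exposes.
    import ollama
    response = ollama.chat(
        model="llama3",
        messages=[{"role": "user", "content": PROMPT, "images": [image_path]}],
        format="json",  # ask Ollama to constrain the reply to valid JSON
    )
    return parse_response(response["message"]["content"])

# Offline check of the parsing step with a canned reply:
reply = '{"order_number": "PO-2024-0193", "vendor": "Acme Industrial", "line_items": [], "total": "1240.50"}'
print(parse_response(reply))
```

Validating the reply in `parse_response` matters in practice: unlike a regex rule, the model can silently drop a field, and the pipeline should notice.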

Strengths

  - Accurate and robust: it handled every layout variation in the test set.
  - Minimal set-up: roughly 30 minutes of prompt engineering replaced days of rule writing.

Weaknesses

  - Slow: about 9.2 seconds per page, versus 0.4 seconds for the rule-based pipeline.
  - Needs capable hardware to serve the 8B model locally, and its output is not deterministic.

Head-to-Head Comparison

I evaluated both systems on 50 documents drawn from the same B2B order scenario. Key metrics were:

| Metric                           | Rule-Based (pytesseract)   | LLM (Ollama + LLaMA 3)           |
|----------------------------------|----------------------------|----------------------------------|
| Accuracy (field-level F1)        | 0.85                       | 0.93                             |
| Average processing time per page | 0.4 seconds                | 9.2 seconds                      |
| Set-up effort                    | 3 days of rule tweaking    | 30 minutes of prompt engineering |
| Robustness to layout change      | Low (broke on 20% of docs) | High (handled all variations)    |

When to use each approach

Use the rule-based pipeline when documents follow a small number of stable, well-known layouts and per-page latency matters; use the LLM when layouts vary, fields are complex, or set-up time is the constraint. In many B2B pipelines the two combine well, with rules handling the simple fields and the LLM serving as a fallback.

Conclusion

Building the same B2B document extractor twice revealed clear trade-offs. The rule-based system with pytesseract offered speed and determinism but required constant maintenance. The LLM approach with Ollama and LLaMA 3 provided superior flexibility and accuracy at the cost of latency and hardware requirements. For many real-world B2B scenarios, a hybrid solution may be best: use rules for simple, well-known fields and an LLM as a fallback or for complex extraction tasks.
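The hybrid idea above can be sketched as a simple dispatcher. The function names, the required-field set, and the merge policy are hypothetical, not from the article; the point is only that the slow LLM call is skipped whenever the cheap rules already produced a complete result.

```python
# Illustrative required-field set for this B2B order scenario.
REQUIRED_FIELDS = {"order_number", "vendor", "total"}

def hybrid_extract(doc, rule_extractor, llm_extractor) -> dict:
    """Try the cheap, deterministic rule-based extractor first;
    fall back to the LLM only when required fields are missing."""
    fields = rule_extractor(doc)
    if REQUIRED_FIELDS <= fields.keys():
        return fields  # rules succeeded; skip the slow LLM call entirely
    llm_fields = llm_extractor(doc)
    # Merge policy (an assumption): keep rule-based values where present,
    # let the LLM fill in whatever the rules missed.
    return {**llm_fields, **fields}

# Demo with stub extractors standing in for the real pipelines:
rules = lambda d: {"order_number": "PO-1", "vendor": "Acme"}  # missing "total"
llm = lambda d: {"order_number": "PO-1", "vendor": "Acme Corp", "total": "99.00"}
print(hybrid_extract("doc.pdf", rules, llm))
# → {'order_number': 'PO-1', 'vendor': 'Acme Corp', 'total': '99.00'} minus rule overrides,
#   i.e. {'order_number': 'PO-1', 'vendor': 'Acme', 'total': '99.00'}
```

With the measured numbers above, a hybrid that resolves most pages via rules would keep average latency close to 0.4 seconds while recovering accuracy on the layouts the rules cannot handle.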

This article is based on practical experiments and was first published on Towards Data Science.
