How to Build a B2B Document Extractor with Both Rules and LLM: A Step-by-Step Comparison


Introduction

Extracting structured data from B2B PDF invoices, purchase orders, and receipts is a common challenge. Many developers turn to rule-based approaches using OCR (like Tesseract) or explore modern LLMs (like LLaMA 3) for more flexible extraction. This guide walks you through building the same extractor twice — once with pytesseract rules and once with Ollama + LLaMA 3 — so you can compare performance, accuracy, and maintenance on a realistic B2B order scenario.

Source: towardsdatascience.com

What You Need

  • Python 3.8+ installed on your system
  • pytesseract and the Tesseract OCR engine (follow the installation instructions for your OS)
  • Ollama (install from ollama.ai) with LLaMA 3 model pulled (ollama pull llama3)
  • A sample B2B PDF invoice or order document (use a real but anonymized one)
  • Python libraries: pdf2image and Pillow (re and json ship with the standard library)
  • Text editor or IDE

Step-by-Step Guide

Step 1: Set Up the Environment and Sample Document

First, create a project folder and install the Python dependencies:

pip install pytesseract pdf2image Pillow ollama

Note that pip installs only the Python wrappers: pdf2image needs the Poppler utilities, and pytesseract needs the Tesseract binary, each installed separately and available on your PATH.

Place your sample B2B PDF in the folder. For this guide, we assume a purchase order containing fields such as Order ID, Supplier Name, Line Items, and Total Amount.
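For concreteness, here is the JSON shape both extractors will aim to produce. The field names and sample values are this guide's convention, not any standard:

```python
import json

# Target schema both extractors should produce. Field names and
# sample values are illustrative conventions for this guide.
expected = {
    "order_id": "PO-10421",
    "supplier_name": "Acme Industrial Supply",
    "line_items": [
        {"description": "M8 hex bolts (box of 100)", "quantity": 5, "unit_price": 12.50},
    ],
    "total": 62.50,
}

print(json.dumps(expected, indent=2))
```

Agreeing on this shape up front makes the Step 4 comparison a simple field-by-field diff.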

Step 2: Build the Rule-Based Extractor with pytesseract

Create a Python script rule_extractor.py. Use pdf2image to convert PDF pages to images, then apply Tesseract OCR:

from pdf2image import convert_from_path
import pytesseract

# Render each PDF page as an image, then OCR the first page
images = convert_from_path('order.pdf')
text = pytesseract.image_to_string(images[0])

Now define rules using regex and keyword matching. For example:

  • Extract the Order ID by matching a pattern like Order\s*#?:\s*([\w-]+) (note that \w alone won't match hyphenated IDs such as PO-10421)
  • Find the Supplier Name after the keyword Supplier or Vendor
  • Parse line items by assuming a tabular layout (fixed column positions or a delimiter)
  • Grab the total via Total:\s*\$?(\d+\.\d{2})

Test with your PDF and adjust regex patterns. This approach works well for consistent layouts but fails if the format changes.
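Putting those rules together, a minimal sketch of the extractor might look like this. The regex patterns are illustrative and will need adjusting to your document's actual layout:

```python
import re

def extract_fields(text: str) -> dict:
    """Rule-based extraction from OCR text. Patterns assume a
    fixed layout; adjust them to match your documents."""
    fields = {}

    # [\w-]+ so hyphenated IDs like PO-10421 are captured whole
    m = re.search(r'Order\s*#?:\s*([\w-]+)', text)
    fields["order_id"] = m.group(1) if m else None

    m = re.search(r'(?:Supplier|Vendor):\s*(.+)', text)
    fields["supplier_name"] = m.group(1).strip() if m else None

    m = re.search(r'Total:\s*\$?(\d+\.\d{2})', text)
    fields["total"] = float(m.group(1)) if m else None

    return fields

# Simplified stand-in for real OCR output
sample = """Order #: PO-10421
Supplier: Acme Industrial Supply
Total: $62.50"""

print(extract_fields(sample))
# {'order_id': 'PO-10421', 'supplier_name': 'Acme Industrial Supply', 'total': 62.5}
```

Returning None for missing fields (rather than raising) makes it easy to score completeness later in the hybrid pipeline.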

Step 3: Build the LLM-Based Extractor with Ollama and LLaMA 3

Create llm_extractor.py. Read the PDF text as before (or use OCR output). Then pass it to Ollama:

import json
import ollama

# Reuse the OCR text produced in Step 2
prompt = """You are a B2B document parser. Extract fields: Order ID, Supplier Name, Line Items (as list), Total. Output only JSON.
Document:
{ocr_text}
""".format(ocr_text=text)

response = ollama.chat(model='llama3', messages=[{'role': 'user', 'content': prompt}])
result = json.loads(response['message']['content'])

This method is layout-agnostic and handles formatting variations naturally. However, it requires running a local LLM and is typically much slower than regex matching. You can also tweak the prompt to enforce a stricter schema.
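In practice, even when asked for "only JSON", the model may wrap its answer in markdown fences or add prose, which makes a bare json.loads fail. A defensive parsing helper is worth the few extra lines; this is a sketch of our own, not part of the Ollama API:

```python
import json
import re

def parse_llm_json(raw: str) -> dict:
    """Pull the first JSON object out of an LLM reply, tolerating
    markdown code fences and surrounding prose."""
    # Prefer a ```json ... ``` fenced block if one is present
    fenced = re.search(r'```(?:json)?\s*(\{.*?\})\s*```', raw, re.DOTALL)
    candidate = fenced.group(1) if fenced else raw
    # Otherwise fall back to the outermost braces
    start, end = candidate.find('{'), candidate.rfind('}')
    if start == -1 or end == -1:
        raise ValueError("No JSON object found in LLM output")
    return json.loads(candidate[start:end + 1])

reply = 'Sure! Here is the result:\n```json\n{"order_id": "PO-10421", "total": 62.5}\n```'
print(parse_llm_json(reply))
# {'order_id': 'PO-10421', 'total': 62.5}
```

Swap this in for the raw json.loads call whenever you see intermittent parse errors.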


Step 4: Compare Outputs and Handle Failures

Run both scripts on the same document. Compare extracted JSON:

  • Rule-based may miss fields if layout shifts or OCR introduces noise
  • LLM-based may hallucinate or misinterpret ambiguous text

For failures, enhance the rules with fallback patterns, or improve the LLM prompt by adding a few worked examples (few-shot prompting). Consider a hybrid pipeline in which the LLM acts as a backup for documents the rules cannot handle.
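A simple way to compare the two outputs programmatically is a field-by-field diff. The helper below is a minimal sketch:

```python
def compare_extractions(rule_out: dict, llm_out: dict, fields: list) -> dict:
    """Report, per field, whether the two extractors agree."""
    report = {}
    for f in fields:
        a, b = rule_out.get(f), llm_out.get(f)
        report[f] = "match" if a == b else f"mismatch (rules={a!r}, llm={b!r})"
    return report

# Example outputs from the two extractors
rules = {"order_id": "PO-10421", "total": 62.5}
llm = {"order_id": "PO-10421", "total": 62.0}

print(compare_extractions(rules, llm, ["order_id", "total"]))
# {'order_id': 'match', 'total': 'mismatch (rules=62.5, llm=62.0)'}
```

Run this over a batch of documents and tally the mismatches to see which extractor drifts on which fields.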

Step 5: Optimize for Your Use Case

For production, measure accuracy, speed, and maintenance overhead. Rule-based extraction is fast and cheap but brittle; LLM-based extraction offers flexibility but benefits from a GPU and requires careful prompt engineering.

You can also combine them: run the rules first, then fall back to the LLM whenever rule-based confidence drops below a threshold (say, 90%).
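One way to sketch that routing, using the fraction of required fields found as a crude stand-in for confidence (a heuristic of this guide, not a built-in metric):

```python
REQUIRED = ["order_id", "supplier_name", "total"]

def rule_confidence(result: dict) -> float:
    """Crude confidence proxy: fraction of required fields found."""
    found = sum(1 for f in REQUIRED if result.get(f) is not None)
    return found / len(REQUIRED)

def hybrid_extract(text, rule_fn, llm_fn, threshold=0.9):
    """Run the cheap rules first; fall back to the LLM when the
    rules leave too many required fields empty."""
    result = rule_fn(text)
    if rule_confidence(result) >= threshold:
        return result, "rules"
    return llm_fn(text), "llm"

# Demo with stub extractors standing in for the real ones
good_rules = lambda t: {"order_id": "PO-1", "supplier_name": "Acme", "total": 10.0}
bad_rules = lambda t: {"order_id": None, "supplier_name": None, "total": 10.0}
llm_stub = lambda t: {"order_id": "PO-1", "supplier_name": "Acme", "total": 10.0}

print(hybrid_extract("...", good_rules, llm_stub)[1])  # rules
print(hybrid_extract("...", bad_rules, llm_stub)[1])   # llm
```

A real confidence score could also weight fields by importance or check regex match quality, but field completeness is a cheap starting point.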

Tips for Success

  • Preprocess images before OCR: crop, deskew, convert to grayscale, increase contrast.
  • Use structured output with LLMs: ask for JSON and validate with Pydantic.
  • Test on multiple documents with varying layouts to see where each approach shines.
  • Monitor costs: local LLM via Ollama has no API costs but uses compute; rules need no GPU.
  • Version control both extraction scripts and sample documents to reproduce comparisons.
  • Consider a hybrid system as the best of both worlds: rules for speed, LLM for edge cases.
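The Pydantic validation tip above can be sketched as follows. This assumes Pydantic v2 is installed (pip install pydantic), and the schema mirrors the fields used throughout this guide:

```python
from typing import List
from pydantic import BaseModel, ValidationError

class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float

class PurchaseOrder(BaseModel):
    order_id: str
    supplier_name: str
    line_items: List[LineItem]
    total: float

# Dict as parsed from the LLM's JSON output
raw = {
    "order_id": "PO-10421",
    "supplier_name": "Acme Industrial Supply",
    "line_items": [{"description": "M8 bolts", "quantity": 5, "unit_price": 12.5}],
    "total": 62.5,
}

try:
    po = PurchaseOrder.model_validate(raw)
    print(po.total)
except ValidationError as e:
    print("LLM output failed validation:", e)
```

Validation catches hallucinated or missing fields early, turning silent data corruption into a loud, loggable failure.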

By building the same extractor twice, you gain practical insight into trade-offs and can make an informed choice for your B2B document processing needs.
