Data engineering · Legal

Judicial Scraper

LegalTech · AI for lawyers

Large-scale extraction of Brazilian court records — 23 million cases from four courts, built for a LegalTech AI platform.

Delivered in one month for a client training legal AI. The courts offer no public APIs, only manual portals with pagination, CAPTCHAs, and anti-bot defenses.

Python · Selenium · SQLite · Machine Learning · Librosa · Web Scraping · OCR

2024

23M

Total cases

All four courts

4

Courts covered

TJSP · TJBA · TJRJ · TRF3

5M

TJSP volume

Largest single court

1 month

Build time

End-to-end delivery

The Problem

The client was building an AI assistant for lawyers and needed a massive dataset of court records — case numbers, parties, subjects, rulings, procedural history, and attached documents. This data lives on court websites with no public API: only manual queries, one case at a time, behind pagination, CAPTCHAs, and anti-bot measures. Collecting 23 million records by hand was not feasible.

The Solution

We built a distributed, resilient scraping system that extracts court data at scale from the public portals of four Brazilian courts: TJSP (5 million cases), TJBA, TJRJ, and TRF3. Up to 12 headless browser instances run in parallel, each querying a range of OAB (Brazilian Bar Association) registration numbers and persisting results to SQLite in real time, so a failure never loses progress. The pipeline also opens individual case pages to extract status, procedural history, rulings, and PDF attachments, which are stored as BLOBs for direct ingestion by the client's AI stack.

Engineering Highlights

23M

Cases extracted

4

Courts

5M · TJSP

Largest court

1 month

Delivery

Key deliverables

23 million structured court cases across TJSP, TJBA, TJRJ, and TRF3
Automatic audio CAPTCHA solving without external services
PDF documents stored for offline and AI pipeline ingestion
Parallel headless browser farm with real-time SQLite checkpoints

Stack

Python: Orchestration and scraping logic

Selenium + ChromeDriver: Headless browser automation

SQLite: Resilient local persistence

Librosa + SciPy + NumPy: Audio CAPTCHA (MFCC features)

Requests + BeautifulSoup: HTTP and HTML parsing

Threading: Parallel throughput

OpenCV + Tesseract: Visual CAPTCHA OCR

Parallel collection without data loss

The system runs multiple simultaneous threads, each querying different OAB ranges. Results flush to SQLite continuously so a crashed worker or blocked session does not wipe hours of progress.
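
The checkpointing idea can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the production code: the table names, the `fetch_cases_for_oab` stub (which stands in for the real browser query), and the choice to serialize writes with a lock are all assumptions made for the example.

```python
import sqlite3
import threading
from concurrent.futures import ThreadPoolExecutor

DB_LOCK = threading.Lock()  # SQLite allows one writer at a time, so serialize access


def init_db(path):
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS cases (case_number TEXT PRIMARY KEY, oab INTEGER)")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS done_ranges ("
        "range_start INTEGER, range_stop INTEGER, PRIMARY KEY (range_start, range_stop))"
    )
    conn.commit()
    conn.close()


def fetch_cases_for_oab(oab):
    """Hypothetical stand-in for the headless-browser query; returns fake case numbers."""
    return [f"OAB{oab}-{i}" for i in range(2)]


def scrape_range(path, start, stop):
    conn = sqlite3.connect(path)  # one connection per worker thread
    try:
        with DB_LOCK:
            done = conn.execute(
                "SELECT 1 FROM done_ranges WHERE range_start = ? AND range_stop = ?",
                (start, stop),
            ).fetchone()
        if done:
            return 0  # range was checkpointed by an earlier run
        saved = 0
        for oab in range(start, stop):
            rows = [(case, oab) for case in fetch_cases_for_oab(oab)]
            with DB_LOCK:
                # Commit after every OAB number so a crash loses at most one query.
                conn.executemany("INSERT OR IGNORE INTO cases VALUES (?, ?)", rows)
                conn.commit()
            saved += len(rows)
        with DB_LOCK:
            conn.execute("INSERT OR IGNORE INTO done_ranges VALUES (?, ?)", (start, stop))
            conn.commit()
        return saved
    finally:
        conn.close()


def run(path, ranges, workers=12):
    """Scrape all ranges in parallel and return the number of rows fetched in this run."""
    init_db(path)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda r: scrape_range(path, *r), ranges))
```

Calling `run("cases.db", [(1, 500), (500, 1000)])` a second time after an interruption skips every range already recorded in `done_ranges`, and `INSERT OR IGNORE` makes a half-finished range safe to repeat.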

Audio CAPTCHA cracking with ML

TJRJ uses audio CAPTCHAs — five spoken digits. We built an in-house decoder: amplitude envelope segmentation, MFCC feature extraction per digit, and cosine-similarity classification against a labeled sample bank — no paid third-party CAPTCHA APIs.
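
The decoding steps above (segment by amplitude envelope, extract features per digit, pick the closest labeled sample by cosine similarity) can be illustrated with a small dependency-free sketch. The frame size, threshold, and function names are assumptions for the example, and feature extraction is passed in as a callable: in a setup like the one described it would compute MFCCs (for instance by wrapping `librosa.feature.mfcc`), which is left out here so the sketch runs on the standard library alone.

```python
import math


def segment_by_envelope(samples, frame=4, threshold=0.1):
    """Return (start, stop) spans where the mean absolute amplitude per frame exceeds threshold."""
    spans, start = [], None
    for i in range(0, len(samples), frame):
        chunk = samples[i:i + frame]
        loud = sum(abs(x) for x in chunk) / len(chunk) > threshold
        if loud and start is None:
            start = i  # a spoken digit begins
        elif not loud and start is not None:
            spans.append((start, i))  # silence closes the digit
            start = None
    if start is not None:
        spans.append((start, len(samples)))
    return spans


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify(features, bank):
    """Return the label of the bank entry most similar to features.

    bank is a list of (label, feature_vector) pairs built from labeled samples.
    """
    return max(bank, key=lambda item: cosine_similarity(features, item[1]))[0]


def decode(samples, extract_features, bank):
    """Segment the audio, classify each segment, and join the labels into one string."""
    return "".join(
        classify(extract_features(samples[a:b]), bank)
        for a, b in segment_by_envelope(samples)
    )
```

Real recordings would need a frame size and threshold tuned to the sample rate and noise floor, and a sanity check that exactly five segments were found before submitting an answer.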

Verdict and document extraction

Beyond metadata, crawlers navigate into each case to pull status, procedural history, and attached PDFs (petitions, decisions, judgments), stored as BLOBs in per-case databases for offline querying and AI ingestion.
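
Storing PDFs as BLOBs needs nothing beyond Python's built-in `sqlite3` module. The sketch below is illustrative only: the `documents` schema, the `kind` column, and deduplication by SHA-256 hash are assumptions, not the project's actual layout.

```python
import hashlib
import sqlite3


def open_case_db(path):
    """Open (or create) a database that holds the documents attached to a case."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "sha256 TEXT PRIMARY KEY, case_number TEXT NOT NULL, "
        "kind TEXT NOT NULL, pdf BLOB NOT NULL)"
    )
    conn.commit()
    return conn


def store_pdf(conn, case_number, kind, pdf_bytes):
    """Insert the raw PDF bytes; a repeated download of the same file is ignored."""
    digest = hashlib.sha256(pdf_bytes).hexdigest()
    conn.execute(
        "INSERT OR IGNORE INTO documents VALUES (?, ?, ?, ?)",
        (digest, case_number, kind, pdf_bytes),
    )
    conn.commit()
    return digest


def load_pdfs(conn, case_number):
    """Return (kind, bytes) pairs for a case, ready to hand to a downstream parser."""
    rows = conn.execute(
        "SELECT kind, pdf FROM documents WHERE case_number = ? ORDER BY kind",
        (case_number,),
    )
    return [(kind, bytes(pdf)) for kind, pdf in rows]
```

Keeping the bytes inside the database means one file carries both the metadata and the documents, which makes it simple to ship a case to an offline consumer.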

Four court portals, one pipeline

Each court portal has different HTML flows and defenses. The architecture isolates court-specific adapters while sharing persistence, CAPTCHA, and document-download primitives.
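
One way to picture this separation is an abstract adapter per court plus a shared collection step that receives fetching and persistence as injected functions. Everything below is a hypothetical sketch: the class names, the `.example` URLs, and the HTML patterns are invented for illustration and do not describe any real court portal.

```python
import re
from abc import ABC, abstractmethod


class CourtAdapter(ABC):
    """Court-specific logic: how to build a query and how to read the result page."""

    name = ""

    @abstractmethod
    def search_url(self, oab):
        """Return the URL that lists cases for one OAB registration number."""

    @abstractmethod
    def parse_case_numbers(self, html):
        """Extract case numbers from a result page."""


class CourtA(CourtAdapter):  # hypothetical portal that renders results as table cells
    name = "court-a"

    def search_url(self, oab):
        return f"https://court-a.example/search?oab={oab}"

    def parse_case_numbers(self, html):
        return re.findall(r'<td class="case">([^<]+)</td>', html)


class CourtB(CourtAdapter):  # hypothetical portal that exposes results as data attributes
    name = "court-b"

    def search_url(self, oab):
        return f"https://court-b.example/cases/{oab}"

    def parse_case_numbers(self, html):
        return re.findall(r'data-case="([^"]+)"', html)


def collect(adapter, oab, fetch, save):
    """Shared pipeline step: fetch the page, parse it with the adapter, persist each case.

    fetch(url) -> html and save(court, case_number) are injected, so the same
    function works for every court and is easy to test without a browser.
    """
    cases = adapter.parse_case_numbers(fetch(adapter.search_url(oab)))
    for case in cases:
        save(adapter.name, case)
    return len(cases)
```

Adding a fifth court under this layout would mean writing one more adapter class, with no change to `collect` or to the persistence layer.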
