Data engineering · Legal

Judicial Scraper

LegalTech · AI for lawyers

Large-scale extraction of Brazilian court records — 23 million cases from four courts, built for a LegalTech AI platform.

Delivered in one month for a client training legal AI. No public court APIs — only manual portals with pagination, CAPTCHAs, and anti-bot defenses.

Python · Selenium · SQLite · Machine Learning · Librosa · Web Scraping · OCR
2024
23M total cases (all four courts)
4 courts covered (TJSP · TJBA · TJRJ · TRF3)
5M TJSP volume (largest single court)
1 month build time (end-to-end delivery)
The Problem

The client was building an AI assistant for lawyers and needed a massive dataset of court records — case numbers, parties, subjects, rulings, procedural history, and attached documents. This data lives on court websites with no public API: only manual queries, one case at a time, behind pagination, CAPTCHAs, and anti-bot measures. Collecting 23 million records by hand was not feasible.

The Solution

We built a distributed, resilient scraping system that extracts court data at scale from the public portals of four Brazilian courts: TJSP (5 million cases), TJBA, TJRJ, and TRF3. Up to 12 headless browser instances run in parallel, each querying its own range of OAB (Brazilian Bar Association) registration numbers and persisting results to SQLite in real time, so a failure does not discard completed work. The pipeline also opens individual case pages to extract status, procedural history, rulings, and PDF attachments, which are stored as BLOBs for direct ingestion by the client's AI stack.

Engineering Highlights
Key deliverables
  • 23 million structured court cases across TJSP, TJBA, TJRJ, and TRF3
  • Automatic audio CAPTCHA solving without external services
  • PDF documents stored for offline and AI pipeline ingestion
  • Parallel headless browser farm with real-time SQLite checkpoints
Stack
  • Python: orchestration and scraping logic
  • Selenium + ChromeDriver: headless browser automation
  • SQLite: resilient local persistence
  • Librosa + SciPy + NumPy: audio CAPTCHA (MFCC features)
  • Requests + BeautifulSoup: HTTP and HTML parsing
  • Threading: parallel throughput
  • OpenCV + Tesseract: visual CAPTCHA OCR

Parallel collection without data loss

The system runs multiple simultaneous threads, each querying different OAB ranges. Results flush to SQLite continuously so a crashed worker or blocked session does not wipe hours of progress.
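
The checkpointing idea can be sketched with the standard library alone. This is a minimal illustration, not the production code: `fetch_cases` is a hypothetical stand-in for the real browser query, and the `cases` and `done` tables are assumed names. Each worker commits after every OAB number, so a restart skips anything already recorded.

```python
import sqlite3
import threading
from concurrent.futures import ThreadPoolExecutor

DB_LOCK = threading.Lock()  # serialize access to the shared connection


def init_db(path):
    # check_same_thread=False lets worker threads share one connection.
    conn = sqlite3.connect(path, check_same_thread=False)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cases (case_number TEXT PRIMARY KEY, oab INTEGER)"
    )
    conn.execute("CREATE TABLE IF NOT EXISTS done (oab INTEGER PRIMARY KEY)")
    conn.commit()
    return conn


def fetch_cases(oab):
    """Hypothetical stand-in for a portal query; returns fake case numbers."""
    return [f"{oab:06d}-{i}" for i in range(2)]


def scrape_range(conn, start, stop, fetch=fetch_cases):
    for oab in range(start, stop):
        with DB_LOCK:
            seen = conn.execute("SELECT 1 FROM done WHERE oab = ?", (oab,)).fetchone()
        if seen:
            continue  # already checkpointed by an earlier run
        rows = [(number, oab) for number in fetch(oab)]
        with DB_LOCK:
            conn.executemany("INSERT OR IGNORE INTO cases VALUES (?, ?)", rows)
            conn.execute("INSERT OR IGNORE INTO done VALUES (?)", (oab,))
            conn.commit()  # checkpoint after every OAB number


def run(path, ranges, workers=12):
    conn = init_db(path)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(scrape_range, conn, a, b) for a, b in ranges]
        for future in futures:
            future.result()  # surface any worker exception
    return conn
```

Calling `run("cases.db", [(1, 5000), (5000, 10000)])` a second time after a crash would re-query only the OAB numbers missing from `done`.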

Audio CAPTCHA cracking with ML

TJRJ uses audio CAPTCHAs — five spoken digits. We built an in-house decoder: amplitude envelope segmentation, MFCC feature extraction per digit, and cosine-similarity classification against a labeled sample bank — no paid third-party CAPTCHA APIs.
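
The three stages above can be sketched in plain Python. This is an illustrative outline under stated assumptions, not the decoder that was shipped: the window and threshold values are arbitrary, and `extract` is a pluggable callable that in practice would return an MFCC vector per segment (for example the time-averaged output of `librosa.feature.mfcc`).

```python
import math


def envelope(samples, window):
    """Moving average of absolute amplitude."""
    env, acc = [], 0.0
    for i, sample in enumerate(samples):
        acc += abs(sample)
        if i >= window:
            acc -= abs(samples[i - window])
        env.append(acc / min(i + 1, window))
    return env


def segment(samples, window=4, threshold=0.1, min_len=3):
    """Return (start, end) spans where the envelope stays above threshold."""
    spans, start = [], None
    for i, level in enumerate(envelope(samples, window)):
        if level > threshold and start is None:
            start = i
        elif level <= threshold and start is not None:
            if i - start >= min_len:
                spans.append((start, i))
            start = None
    if start is not None and len(samples) - start >= min_len:
        spans.append((start, len(samples)))
    return spans


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(features, bank):
    """Pick the label whose reference vector is most similar."""
    return max(bank, key=lambda label: cosine(features, bank[label]))


def decode(samples, extract, bank):
    """Segment the clip, featurize each span, and join the predicted digits."""
    return "".join(
        classify(extract(samples[a:b]), bank) for a, b in segment(samples)
    )
```

Here `bank` maps each digit label to one reference feature vector; a bank with several labeled samples per digit could be handled by averaging them or by taking the best match across all samples.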

Verdict and document extraction

Beyond metadata, the crawlers navigate into each case to pull status, procedural history, and attached PDFs (petitions, decisions, judgments). Documents are stored as BLOBs in per-case databases for offline querying and AI ingestion.
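
Storing a PDF as a BLOB needs little more than the standard `sqlite3` module. The sketch below assumes a hypothetical `documents` table and uses a SHA-256 digest as the primary key so that re-downloading the same file does not create a duplicate row; the real schema may differ.

```python
import hashlib
import sqlite3


def open_case_db(path):
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS documents (
               sha256      TEXT PRIMARY KEY,
               case_number TEXT NOT NULL,
               kind        TEXT NOT NULL,
               pdf         BLOB NOT NULL
           )"""
    )
    return conn


def save_pdf(conn, case_number, kind, pdf_bytes):
    """Insert the raw PDF bytes; identical content is stored only once."""
    digest = hashlib.sha256(pdf_bytes).hexdigest()
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO documents VALUES (?, ?, ?, ?)",
            (digest, case_number, kind, pdf_bytes),
        )
    return digest


def load_pdfs(conn, case_number):
    """Return (kind, bytes) pairs for one case, ready for offline processing."""
    rows = conn.execute(
        "SELECT kind, pdf FROM documents WHERE case_number = ? ORDER BY kind",
        (case_number,),
    )
    return [(kind, bytes(pdf)) for kind, pdf in rows]
```

A downstream consumer can then read documents straight from the database file without touching the court portal again.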

Four court portals, one pipeline

Each court portal has different HTML flows and defenses. The architecture isolates court-specific adapters while sharing persistence, CAPTCHA, and document-download primitives.
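
One way to picture the adapter split is an abstract base class per court plus a shared pipeline. Everything named here is illustrative: the class names, the placeholder URLs, and the made-up HTML patterns are assumptions, and the real adapters drive a browser rather than parsing strings with regular expressions.

```python
import re
from abc import ABC, abstractmethod


class CourtAdapter(ABC):
    """Court-specific behavior: how to build a query and parse its response."""

    code = ""

    @abstractmethod
    def search_url(self, oab):
        ...

    @abstractmethod
    def parse_results(self, html):
        ...


class TJSPAdapter(CourtAdapter):
    code = "TJSP"

    def search_url(self, oab):
        return f"https://tjsp.example.invalid/search?oab={oab}"  # placeholder URL

    def parse_results(self, html):
        # Made-up markup: one <li class="case"> per result.
        return re.findall(r'<li class="case">([^<]+)</li>', html)


class TJRJAdapter(CourtAdapter):
    code = "TJRJ"

    def search_url(self, oab):
        return f"https://tjrj.example.invalid/consulta/{oab}"  # placeholder URL

    def parse_results(self, html):
        # Made-up markup: results arrive as table cells.
        return re.findall(r'<td data-case="([^"]+)">', html)


class Pipeline:
    """Shared steps: fetch a page, hand it to the adapter, persist the rows."""

    def __init__(self, fetch, store):
        self.fetch = fetch  # callable(url) -> html
        self.store = store  # callable(court_code, case_number)

    def collect(self, adapter, oab):
        html = self.fetch(adapter.search_url(oab))
        cases = adapter.parse_results(html)
        for case_number in cases:
            self.store(adapter.code, case_number)
        return len(cases)
```

Adding a court then means writing one new adapter class; fetching, persistence, and the other shared primitives stay unchanged.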
