Document Intelligence · New

Read every document. See every entity.

Structure-aware sensitive data detection across PDFs, scans, images, and emails, with confidence scores, custom entities, and role awareness.

claim-2047-hoefer.pdf · page 3 of 14 · Detecting
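Schadensanzeige (claim notice) · Hanseatic Versicherung AG, Formular H-47
Versicherungsnehmer (policyholder): Hermann Höfer
Policenummer (policy number): HV-2020-33891
Schadensdatum (date of loss): 14. März 2026
Schadensort (place of loss): Fährkanalstraße 12, 21073 Hamburg
Gegenstand (subject): Rückfassade · Wasserschaden durch Starkregen (rear facade, water damage from heavy rain)
Schadenssumme (claim amount): € 18.420,00
Zeuge (witness): Peter Weber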
Detection inspector
POLICYHOLDER_A·S7FT9CD
Hermann Höfer
Confidence: 0.98
Source: NER + structural
Role: Policyholder (field label)
Alts: Herman Höfer (0.12)
POLICY_A·33891
HV-2020-33891
Confidence: 0.94
Source: Custom rule · HV-\d{4}-\d{5}
Role: Policy reference
Alts: none
ADDRESS_A·KM3R4J
Fährkanalstraße 12, 21073 Hamburg
Confidence: 0.96
Source: NER + structural
Role: Damage location
Alts: Faehrkanalstr. 12 (0.08)
WITNESS_A·48LK2P
Peter Weber
Confidence: 0.91
Source: NER + structural
Role: Witness (field label)
Alts: P. Weber (0.22)
18 entities · 4 types · mean confidence 0.94 · Parsed in 138 ms · structure-aware
01 · Structure-aware

Reads layout, not just text.

Field labels, tables, form hierarchy, legal roles. Null understands what a document is before it looks at the words, so a name in a witness field never gets mistaken for a plaintiff.

02 · Confidence-scored

Every detection, with its doubt.

Each entity carries a confidence score, alternatives, and provenance: NER, structural signal, or custom rule. Thresholds are tunable per entity type and per workspace.
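
To make per-type thresholds concrete, here is a minimal Python sketch. It is not Null's API: the `Detection` record, the threshold values, and the idea of routing below-threshold detections to a review queue are all assumptions for illustration. The sample values come from the inspector above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    entity_type: str   # e.g. "POLICYHOLDER"
    text: str          # the detected span
    confidence: float  # score between 0 and 1
    source: str        # provenance, e.g. "NER + structural"


# Hypothetical workspace settings: one threshold per entity type,
# with a fallback for types that have no explicit entry.
THRESHOLDS = {"POLICYHOLDER": 0.90, "WITNESS": 0.95}
DEFAULT_THRESHOLD = 0.85


def triage(detections, thresholds=THRESHOLDS, default=DEFAULT_THRESHOLD):
    """Split detections into (accepted, needs_review) by per-type threshold."""
    accepted, needs_review = [], []
    for d in detections:
        limit = thresholds.get(d.entity_type, default)
        (accepted if d.confidence >= limit else needs_review).append(d)
    return accepted, needs_review


detections = [
    Detection("POLICYHOLDER", "Hermann Höfer", 0.98, "NER + structural"),
    Detection("POLICY", "HV-2020-33891", 0.94, "custom rule"),
    Detection("WITNESS", "Peter Weber", 0.91, "NER + structural"),
]
accepted, needs_review = triage(detections)
```

With these assumed settings, the witness detection (0.91) falls below its stricter 0.95 threshold and lands in `needs_review`, while the other two are accepted.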

03 · Plain-language rules

Describe what to hide, in words.

Type “Internal project codes starting with PRJ-” and the engine compiles a detector. No regex gymnastics. No ML pipeline to wire up. Your compliance team writes the rule.

Custom entities

Write it in English. Get a detector.

Your legal or compliance team already knows what needs hiding: internal project codes, case IDs, product SKUs, claim numbers. Null's compiler turns plain-language descriptions into typed, versioned, threshold-tunable detectors, with no regex syntax to learn and no pipeline to maintain.

Versioned per workspace
Every custom entity has an owner, a changelog, and an approval gate. Treat detectors like code.
Confidence-aware by default
Plain-language descriptions compile into hybrid rules (regex plus NER hints), so you get coverage without over-matching.
Hot-reloadable
Edit a detector, and every future document sees the new rule. Past detections are flagged for re-review, never silently revised.
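
The "flag, never silently revise" behavior can be sketched in a few lines of Python. This is an illustration under assumptions, not the product's implementation: `PastDetection`, `publish`, and the integer version numbers are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class PastDetection:
    doc_id: str
    entity: str
    detector_version: int  # version of the detector that produced this hit
    value: str
    needs_review: bool = False


def publish(entity: str, new_version: int, history: list) -> int:
    """Flag detections made by older versions of `entity`.

    The stored values are left untouched; only the review flag changes.
    Returns the number of newly flagged detections.
    """
    flagged = 0
    for det in history:
        stale = det.entity == entity and det.detector_version < new_version
        if stale and not det.needs_review:
            det.needs_review = True
            flagged += 1
    return flagged


history = [
    PastDetection("doc-1", "INTERNAL_CODE", 1, "PRJ-1234"),
    PastDetection("doc-2", "INTERNAL_CODE", 2, "PRJ-5678"),
    PastDetection("doc-3", "POLICY", 1, "HV-2020-33891"),
]
flagged = publish("INTERNAL_CODE", 2, history)
```

Only the detection produced by version 1 of `INTERNAL_CODE` is flagged; its stored value stays exactly as it was recorded.
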
Compiler · plain text → detector · Ready
You describe
Internal project codes that look like PRJ-1234
Null compiles
Entity: INTERNAL_CODE
Pattern: /PRJ-\d{4}/i
Hints: project reference, internal code
Threshold: 0.85
Scope: workspace-wide
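
For readers who think in code, the compiled detector above might be represented roughly as follows. The field values are taken from the compiler output; the `CompiledDetector` class and the toy scoring rule (a bare pattern match scores 0.80 and each hint phrase found in the text adds 0.10) are assumptions made for this sketch, not Null's actual scoring.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class CompiledDetector:
    entity: str
    pattern: re.Pattern
    hints: tuple
    threshold: float
    scope: str


INTERNAL_CODE = CompiledDetector(
    entity="INTERNAL_CODE",
    pattern=re.compile(r"PRJ-\d{4}", re.IGNORECASE),  # /PRJ-\d{4}/i
    hints=("project reference", "internal code"),
    threshold=0.85,
    scope="workspace-wide",
)


def score(detector: CompiledDetector, text: str) -> float:
    """Toy hybrid score (assumption): pattern match plus context hints."""
    context = text.lower()
    bonus = 0.10 * sum(1 for hint in detector.hints if hint in context)
    return min(1.0, 0.80 + bonus)


def detect(detector: CompiledDetector, text: str) -> list:
    """Return (match, confidence) pairs that clear the detector's threshold."""
    confidence = score(detector, text)
    if confidence < detector.threshold:
        return []
    return [(m.group(0), round(confidence, 2)) for m in detector.pattern.finditer(text)]


with_hint = detect(INTERNAL_CODE, "See internal code prj-1234 for details.")
without_hint = detect(INTERNAL_CODE, "Order PRJ-9999 shipped.")
```

In this sketch the hinted sentence yields a detection and the unhinted one does not, which is one way a hybrid rule can avoid over-matching. Note that the pattern as shown would also match the first four digits of a longer code such as PRJ-12345; word boundaries would tighten it.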

Role awareness

Same name. Different role, different token.

In a complaint, Mara Hoffmann is the plaintiff. In the next paragraph, as “Hoffmann”, she's still the plaintiff, with the same token. But in a different document where she's a witness, she gets a different token entirely. Structure-aware coreference keeps the semantics intact, so the model reasons correctly.

Coreference resolution
"Hoffmann", "Mara", "the plaintiff" all collapse to the same token when the structure says so.
Role classification
Plaintiff, defendant, claimant, adjuster, witness, policyholder: all inferred from document layout and surrounding language.
Cross-document stable
Same person, same workspace, same token, even across different case files. Vectors, joins and audit all stay aligned.
civil-complaint.pdf · excerpt · Structure resolved
Mara Hoffmann [PLAINTIFF_A] filed a claim against Hanseatic Versicherung AG [DEFENDANT_A] on 14 March 2026 [DATE_X]. The defendant, Hanseatic AG [DEFENDANT_A], disputed the estimated damage. Hoffmann's [PLAINTIFF_A] attorney Peter Weber [ATTORNEY_A] filed a motion on 18 March 2026 [DATE_X].
3 roles · 7 mentions · 4 resolved coreferences · ✓ roles preserved
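
One way to picture role-scoped tokens is a registry keyed by person and role. The Python sketch below is illustrative only: `TokenRegistry`, `pseudonymize`, and the person IDs are hypothetical, and it assumes coreference resolution has already mapped each surface form to a person and role.

```python
import re
from collections import defaultdict
from string import ascii_uppercase


class TokenRegistry:
    """Issues one stable token per (person, role) within a workspace."""

    def __init__(self):
        self._tokens = {}                # (person_id, role) -> token
        self._counts = defaultdict(int)  # role -> tokens issued so far

    def token_for(self, person_id: str, role: str) -> str:
        key = (person_id, role)
        if key not in self._tokens:
            # Sketch limitation: at most 26 distinct people per role.
            suffix = ascii_uppercase[self._counts[role]]
            self._counts[role] += 1
            self._tokens[key] = f"{role.upper()}_{suffix}"
        return self._tokens[key]


def pseudonymize(text: str, mentions: list, registry: TokenRegistry) -> str:
    """Replace each resolved mention (surface, person_id, role) with its token."""
    # Longest surfaces first, so "Mara Hoffmann" is handled before "Hoffmann".
    for surface, person_id, role in sorted(mentions, key=lambda m: -len(m[0])):
        token = registry.token_for(person_id, role)
        text = re.sub(rf"\b{re.escape(surface)}\b", token, text)
    return text


registry = TokenRegistry()
mentions = [
    ("Mara Hoffmann", "p1", "plaintiff"),
    ("Hoffmann", "p1", "plaintiff"),
    ("Hanseatic Versicherung AG", "o1", "defendant"),
    ("Peter Weber", "p2", "attorney"),
]
source = (
    "Mara Hoffmann filed a claim against Hanseatic Versicherung AG. "
    "Hoffmann's attorney Peter Weber filed a motion."
)
redacted = pseudonymize(source, mentions, registry)
```

Both surface forms of the plaintiff collapse to `PLAINTIFF_A`, while asking the same registry for `("p1", "witness")` would issue a separate `WITNESS_A` token.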

Formats

Whatever your team actually uses.

Native parsing where possible, OCR where needed. Structure awareness applies the same, regardless of format.

Native PDF: Text, layout, tables, headers.
Scanned PDF: OCR fallback with layout reconstruction.
DOCX · RTF: Full field and style parsing.
PNG · JPG · TIFF: OCR plus vision-language backstop.
Email · EML · MSG: Header-aware; attachments parsed in turn.
Plain text · Markdown: Fast path, no layout analysis.
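
The routing described in this list can be sketched as a small lookup in Python. The strategy names, the `choose_strategy` function, and the `has_text_layer` flag are assumptions for illustration; how the engine actually decides between native parsing and OCR is not specified here.

```python
from pathlib import Path

# Hypothetical strategy names, one per row of the format list.
BY_EXTENSION = {
    ".docx": "native-fields-and-styles",
    ".rtf": "native-fields-and-styles",
    ".png": "ocr-plus-vision-language",
    ".jpg": "ocr-plus-vision-language",
    ".jpeg": "ocr-plus-vision-language",
    ".tiff": "ocr-plus-vision-language",
    ".tif": "ocr-plus-vision-language",
    ".eml": "email-headers-then-attachments",
    ".msg": "email-headers-then-attachments",
    ".txt": "fast-path-text",
    ".md": "fast-path-text",
}


def choose_strategy(filename: str, has_text_layer: bool = False) -> str:
    """Pick a parsing strategy from the file extension.

    For PDFs, `has_text_layer` (assumed to be supplied by the caller)
    separates native PDFs from scans that need OCR.
    """
    ext = Path(filename).suffix.lower()
    if ext == ".pdf":
        return "native-pdf" if has_text_layer else "ocr-with-layout-reconstruction"
    try:
        return BY_EXTENSION[ext]
    except KeyError:
        raise ValueError(f"unsupported format: {ext or filename}") from None
```

For example, `choose_strategy("claim-2047-hoefer.pdf", has_text_layer=True)` takes the native path, while the same file without a text layer falls back to OCR with layout reconstruction.
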

Engine provenance

Trained on regulated corpora. Auditable end-to-end.

We don't train on your documents. The detectors are pretrained on 2.4M documents from regulated domains: insurance, legal, healthcare, and financial. Your data stays in your vault; the engine just reads it.

Training data: 2.4M regulated docs (pretrained, not your docs)
Median latency: 138 ms (per page, structure-aware)
Entity types: 12 + custom (out of the box)
Deployment: All three (Cloud · EU · on-prem)

Bring your hardest document. We'll parse it with you.

A real claim file, a contract, a discharge summary, whatever keeps your DPO up at night. Run it through the inspector. See every detection. Talk to the engineer who trained the detector.