What Virza Extracts

Virza’s document processing pipeline extracts structured knowledge from your research papers. This page explains each type of extraction, how it works, and how to use the results.

Metadata

Metadata is extracted from every document on all plans.

Field	Source	Accuracy
Title	Document header + CrossRef/arXiv API lookup	High for published papers; may need manual correction for preprints
Authors	Document header + CrossRef/arXiv	High for standard formats
Abstract	Document content or API	High
DOI	Regex extraction from text + CrossRef resolution	Very high when present
arXiv ID	Regex extraction	Very high
Journal	CrossRef lookup	High for indexed journals
Publication year	Document + CrossRef/arXiv	High
Document type	AI classification (research paper, review article, technical report, etc.)	Good

All metadata fields are editable. Click any metadata field in the document view to correct it. This is especially useful for preprints, working papers, and non-standard formats.
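As an illustration of the regex-based rows in the table above, here is a minimal Python sketch of pulling DOIs and arXiv IDs out of document text. The patterns and the function names (`find_dois`, `find_arxiv_ids`) are assumptions made for this example, not Virza’s actual expressions, and the arXiv pattern only covers new-style IDs that carry an `arXiv:` prefix.

```python
import re

# Assumed patterns for illustration only; Virza's real expressions are not documented here.
# DOI: "10." + registrant code (4-9 digits) + "/" + suffix without whitespace.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+", re.IGNORECASE)
# arXiv: new-style identifiers such as arXiv:2101.01234 or arXiv:2101.01234v2.
ARXIV_RE = re.compile(r"\barXiv:\s*(\d{4}\.\d{4,5}(?:v\d+)?)", re.IGNORECASE)


def find_dois(text: str) -> list[str]:
    """Return unique DOIs in order of appearance (case-insensitive de-duplication)."""
    found: list[str] = []
    for match in DOI_RE.finditer(text):
        # Sentence punctuation often sticks to the end of a DOI in running text.
        doi = match.group(0).rstrip(".,;:)]}")
        if doi.lower() not in (d.lower() for d in found):
            found.append(doi)
    return found


def find_arxiv_ids(text: str) -> list[str]:
    """Return arXiv identifiers (without the 'arXiv:' prefix) in order of appearance."""
    return [m.group(1) for m in ARXIV_RE.finditer(text)]
```

A matched DOI would then be sent to CrossRef for resolution, which is why accuracy is described as very high only when a DOI is actually present in the text.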

Tables

Tables are detected and extracted using Docling’s TableFormer model (97.9% accuracy on standard academic tables).

What you get:

Bounding box overlays in the PDF viewer showing where each table is located
Extracted headers and row data as structured content
Natural language summary describing what the table contains
Confidence score indicating extraction reliability

On Pro and higher plans (Pro, Team, and Enterprise, written as Pro+ on this page), tables become fully structured data. You can query them, include them in extraction tables, and reference them in AI conversations.

Quality notes:

Tables with standard row/column layouts extract best
Complex merged cells or multi-page tables may extract partially
Tables with fewer than 2 rows or extreme aspect ratios are filtered as false positives
Very wide tables (20+ columns) are flagged for manual review
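The filtering rules in the quality notes above can be sketched as a small triage function. This is only an illustration: the 2-row minimum and the 20-column review threshold come from the notes, while `MAX_ASPECT_RATIO`, the `TableCandidate` type, and the status strings are hypothetical, since the page does not say what counts as an extreme aspect ratio.

```python
from dataclasses import dataclass

MIN_ROWS = 2              # from the quality notes
WIDE_COLUMNS = 20         # from the quality notes
MAX_ASPECT_RATIO = 15.0   # assumed value; the real threshold is not documented


@dataclass
class TableCandidate:
    """Hypothetical record for a detected table region."""
    rows: int
    columns: int
    width: float   # bounding box width in page units
    height: float  # bounding box height in page units


def triage_table(table: TableCandidate) -> str:
    """Return 'filtered', 'manual_review', or 'accepted' for a candidate."""
    if table.rows < MIN_ROWS:
        return "filtered"
    if table.width <= 0 or table.height <= 0:
        return "filtered"  # degenerate box, treated as a false positive here
    # Use the larger of the two ratios so very tall and very wide boxes both count.
    ratio = max(table.width / table.height, table.height / table.width)
    if ratio > MAX_ASPECT_RATIO:
        return "filtered"
    if table.columns >= WIDE_COLUMNS:
        return "manual_review"
    return "accepted"
```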

Figures

Figures are detected, cropped, and saved as individual previews.

What you get:

Cropped figure images viewable in the document sidebar
Bounding box overlays in the PDF viewer
Perceptual deduplication: repeated logos, headers, and watermarks are automatically filtered out (using average hash with Hamming distance ≤ 5)
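To make the deduplication step concrete, here is a minimal sketch of average hashing with a Hamming distance cutoff of 5. It assumes the image has already been converted to grayscale and shrunk to a small grid (8x8 is a common choice), so the input is just a nested list of brightness values. The function names are illustrative, not Virza’s API.

```python
def average_hash(gray: list[list[int]]) -> int:
    """Average hash: one bit per pixel, set when the pixel is brighter than the mean."""
    pixels = [value for row in gray for value in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for value in pixels:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def is_duplicate(hash_a: int, hash_b: int, max_distance: int = 5) -> bool:
    """Treat two images as the same when their hashes differ in at most 5 bits."""
    return hamming_distance(hash_a, hash_b) <= max_distance
```

Because the hash depends only on coarse brightness structure, a logo that is re-rendered at a slightly different resolution or compression level on each page tends to land within a few bits of the original and is filtered as a repeat.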

On Pro+ plans, figures also receive AI-generated descriptions from a vision model, explaining what the chart, plot, or diagram shows.

Quality notes:

Clear, high-resolution figures extract best
Figures smaller than 2% of page area are filtered as noise
Figures in header/footer margins (top/bottom 5%) are excluded
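The size and margin rules above amount to two geometric checks on each figure’s bounding box. The sketch below assumes page coordinates with the origin at the top-left and y increasing downward, and it assumes a figure is excluded only when it lies entirely inside a margin band. Both are assumptions for the example, as are the `Box` type and the `keep_figure` name.

```python
from dataclasses import dataclass

MIN_AREA_FRACTION = 0.02  # figures under 2% of page area are treated as noise
MARGIN_FRACTION = 0.05    # top and bottom 5% of the page count as header/footer


@dataclass
class Box:
    """Hypothetical bounding box; origin at top-left, y grows downward."""
    x0: float
    y0: float
    x1: float
    y1: float


def keep_figure(box: Box, page_width: float, page_height: float) -> bool:
    """Return True when the figure passes both the size and the margin check."""
    area = (box.x1 - box.x0) * (box.y1 - box.y0)
    if area < MIN_AREA_FRACTION * page_width * page_height:
        return False
    top_limit = MARGIN_FRACTION * page_height
    bottom_limit = (1 - MARGIN_FRACTION) * page_height
    # Assumed rule: exclude only boxes that sit wholly within a margin band.
    if box.y1 <= top_limit or box.y0 >= bottom_limit:
        return False
    return True
```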

Equations

Equations are extracted as LaTeX from mathematical content using Docling’s formula recognition.

What you get:

LaTeX source code for each equation
Rendered text representation
Page location and display mode (inline vs. block)

Citations

References are extracted using GROBID’s full-text document analysis, which identifies the actual bibliography section and parses each reference into structured fields.

What you get:

Structured citation records: title, authors, year, journal, DOI
DOI resolution via CrossRef for richer bibliographic data
Library linking: citations that match documents already in your workspace are automatically linked, building a citation graph across your library
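Library linking can be pictured as matching each parsed reference against the documents already in a workspace. The sketch below matches on normalized DOIs only, which is an assumption: the page does not say which fields Virza compares, and the function names and data shapes are hypothetical.

```python
def normalize_doi(doi: str) -> str:
    """Lowercase a DOI and strip common URL or 'doi:' prefixes."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi.strip()


def link_citations(
    cited_dois: dict[str, list[str]],
    library_dois: dict[str, str],
) -> dict[str, list[str]]:
    """Build citation-graph edges between documents in the same library.

    cited_dois maps a document ID to the DOIs found in its reference list.
    library_dois maps each document ID in the workspace to its own DOI.
    The result maps a document ID to the library documents it cites.
    """
    by_doi = {normalize_doi(doi): doc_id for doc_id, doi in library_dois.items()}
    graph: dict[str, list[str]] = {}
    for doc_id, dois in cited_dois.items():
        targets = []
        for doi in dois:
            target = by_doi.get(normalize_doi(doi))
            if target is not None and target != doc_id and target not in targets:
                targets.append(target)
        graph[doc_id] = targets
    return graph
```

References whose DOI does not match any library document simply produce no edge, which corresponds to the external citations described on this page.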

Quality validation:

Chemistry noise filtering prevents NMR, IR, and synthesis data from being misclassified as references
Structural completeness scoring ranks each citation’s reliability (DOI match = highest, title + authors + year = good, title only = lower)
Citations linked to a document in your library show a link icon; external citations show a document icon
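The completeness ranking above can be read as a simple tiering rule. Here is a minimal sketch of it; the `Citation` fields, the `doi_resolved` flag, and the extra `"unscored"` tier for records with no title are assumptions added for the example, and only the three named tiers come from the description above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Citation:
    """Hypothetical parsed reference record."""
    title: str = ""
    authors: list[str] = field(default_factory=list)
    year: Optional[int] = None
    doi: str = ""
    doi_resolved: bool = False  # assumed flag: True when the DOI resolved via CrossRef


def reliability_tier(citation: Citation) -> str:
    """Rank a citation by how structurally complete it is."""
    if citation.doi and citation.doi_resolved:
        return "highest"
    if citation.title and citation.authors and citation.year:
        return "good"
    if citation.title:
        return "lower"
    return "unscored"  # assumed fallback for records with no usable fields
```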

Summaries

AI-generated executive summaries are produced for every document.

On Free and Starter plans: A standard AI model produces a concise summary.

On Pro+ plans: An advanced model (Claude) generates multi-level summaries:

TLDR: one-sentence takeaway
Structured breakdown: Objective, Methods, Results, Conclusions
Detailed narrative: comprehensive summary covering all major findings

Claims (Enterprise only)

Statistical and evidence claims are extracted from results sections.

What you get:

Typed claims (statistical, causal, comparative, methodological)
Associated p-values, effect sizes, confidence intervals
Variable tracking (independent, dependent, control)
Evidence direction (supports, contradicts, neutral)
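One way to picture the shape of an extracted claim is as a typed record. The sketch below is a hypothetical data model built from the list above. The class and field names are assumptions, not Virza’s export schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class ClaimType(Enum):
    STATISTICAL = "statistical"
    CAUSAL = "causal"
    COMPARATIVE = "comparative"
    METHODOLOGICAL = "methodological"


class EvidenceDirection(Enum):
    SUPPORTS = "supports"
    CONTRADICTS = "contradicts"
    NEUTRAL = "neutral"


@dataclass
class Claim:
    """Hypothetical record for one claim extracted from a results section."""
    text: str
    claim_type: ClaimType
    direction: EvidenceDirection
    p_value: Optional[float] = None
    effect_size: Optional[float] = None
    confidence_interval: Optional[tuple[float, float]] = None
    independent_variables: list[str] = field(default_factory=list)
    dependent_variables: list[str] = field(default_factory=list)
    control_variables: list[str] = field(default_factory=list)

    def has_statistics(self) -> bool:
        """True when at least one quantitative measure is attached."""
        return any(
            value is not None
            for value in (self.p_value, self.effect_size, self.confidence_interval)
        )
```

Keeping the statistical fields optional reflects that not every claim type carries a p-value or effect size; a methodological claim, for example, may have none.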

Methodology analysis (Enterprise only)

What you get:

Study design classification (meta-analysis, RCT, cohort, case-control, etc.)
Methodology quality scores across dimensions: study design rigor, statistical methodology, bias risk
Automated quality assessment comparable to systematic review screening

What plan unlocks what

Extraction	Free	Starter	Pro	Team	Enterprise
Text, metadata, sections	✅	✅	✅	✅	✅
Table detection + previews	✅	✅	✅	✅	✅
Figure detection + previews	✅	✅	✅	✅	✅
Equation extraction	✅	✅	✅	✅	✅
Citation parsing (GROBID)	✅	✅	✅	✅	✅
Executive summary	✅	✅	✅	✅	✅
Artifact descriptions	✅	✅	✅	✅	✅
Search embeddings	✅	✅	✅	✅	✅
Vision descriptions	❌	❌	✅	✅	✅
Structured tables	❌	❌	✅	✅	✅
Multi-level summaries	❌	❌	✅	✅	✅
QA pairs	❌	❌	✅	✅	✅
Academic embeddings	❌	❌	✅	✅	✅
Claims extraction	❌	❌	❌	❌	✅
Methodology scoring	❌	❌	❌	❌	✅
Citation verification	❌	❌	❌	❌	✅

Features marked with ❌ for your plan will show “Not triggered” in the artifact status panel and display an upgrade prompt explaining what the feature does.
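Because every row in the table above unlocks at one plan and stays unlocked on all higher plans, the table can be summarized as a minimum plan per feature. The sketch below encodes it that way. The names `PLAN_ORDER`, `MINIMUM_PLAN`, and `is_unlocked` are illustrative and not part of any Virza API.

```python
PLAN_ORDER = ["Free", "Starter", "Pro", "Team", "Enterprise"]

# Lowest plan on which each extraction is available, read off the table above.
MINIMUM_PLAN = {
    "Text, metadata, sections": "Free",
    "Table detection + previews": "Free",
    "Figure detection + previews": "Free",
    "Equation extraction": "Free",
    "Citation parsing (GROBID)": "Free",
    "Executive summary": "Free",
    "Artifact descriptions": "Free",
    "Search embeddings": "Free",
    "Vision descriptions": "Pro",
    "Structured tables": "Pro",
    "Multi-level summaries": "Pro",
    "QA pairs": "Pro",
    "Academic embeddings": "Pro",
    "Claims extraction": "Enterprise",
    "Methodology scoring": "Enterprise",
    "Citation verification": "Enterprise",
}


def is_unlocked(feature: str, plan: str) -> bool:
    """Return True when the given plan includes the feature."""
    required = MINIMUM_PLAN[feature]
    return PLAN_ORDER.index(plan) >= PLAN_ORDER.index(required)


def locked_features(plan: str) -> list[str]:
    """List the features that would show an upgrade prompt on this plan."""
    return [name for name in MINIMUM_PLAN if not is_unlocked(name, plan)]
```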