What Virza Extracts

Virza’s document processing pipeline extracts structured knowledge from your research papers. This page explains each type of extraction, how it works, and how to use the results.

Metadata

Extracted from every document on all plans.

Field | Source | Accuracy
Title | Document header + CrossRef/arXiv API lookup | High for published papers; may need manual correction for preprints
Authors | Document header + CrossRef/arXiv | High for standard formats
Abstract | Document content or API | High
DOI | Regex extraction from text + CrossRef resolution | Very high when present
arXiv ID | Regex extraction | Very high
Journal | CrossRef lookup | High for indexed journals
Publication year | Document + CrossRef/arXiv | High
Document type | AI classification (research paper, review article, technical report, etc.) | Good

All metadata fields are editable. Click any metadata field in the document view to correct it. This is especially useful for preprints, working papers, and non-standard formats.
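
To make the regex-based DOI and arXiv ID extraction concrete, here is a minimal sketch in Python. The patterns and the function names `find_doi` and `find_arxiv_id` are assumptions for illustration, not Virza's actual implementation; the DOI pattern is a commonly used heuristic for modern DOIs, and the arXiv pattern covers only new-style identifiers written with an `arXiv:` prefix.

```python
import re

# Assumed pattern: "10." + 4-9 digit registrant code + "/" + suffix.
# This is a heuristic for modern DOIs, not the pattern Virza uses.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")

# Assumed pattern: new-style arXiv IDs (YYMM.NNNN or YYMM.NNNNN),
# with an optional version suffix such as "v2".
ARXIV_RE = re.compile(r"\barXiv:\s*(\d{4}\.\d{4,5})(v\d+)?", re.IGNORECASE)


def find_doi(text):
    """Return the first DOI-like string in text, or None."""
    match = DOI_RE.search(text)
    if match is None:
        return None
    # Trailing punctuation usually belongs to the sentence, not the DOI.
    return match.group(0).rstrip(".,;)")


def find_arxiv_id(text):
    """Return the first new-style arXiv ID in text, or None."""
    match = ARXIV_RE.search(text)
    if match is None:
        return None
    return match.group(1) + (match.group(2) or "")
```

A candidate found this way would still need resolution against CrossRef or arXiv before being trusted, which is why the table above lists both steps for DOIs.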

Tables

Tables are detected and extracted using Docling’s TableFormer model (97.9% accuracy on standard academic tables).

What you get:

  • Bounding box overlays in the PDF viewer showing where each table is located
  • Extracted headers and row data as structured content
  • Natural language summary describing what the table contains
  • Confidence score indicating extraction reliability

On Pro+ plans, tables become fully structured data. You can query them, include them in extraction tables, and reference them in AI conversations.

Quality notes:

  • Tables with standard row/column layouts extract best
  • Complex merged cells or multi-page tables may extract partially
  • Tables with fewer than 2 rows or extreme aspect ratios are filtered as false positives
  • Very wide tables (20+ columns) are flagged for manual review
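
The quality notes above amount to a small triage rule, which can be sketched as follows. The 2-row and 20-column thresholds come from the notes; the aspect-ratio cutoff `MAX_ASPECT_RATIO`, the `DetectedTable` type, and the status strings are hypothetical, since the notes say only that "extreme" aspect ratios are filtered.

```python
from dataclasses import dataclass

MIN_ROWS = 2             # from the quality notes
WIDE_TABLE_COLUMNS = 20  # from the quality notes
MAX_ASPECT_RATIO = 15.0  # assumed value; the notes say only "extreme"


@dataclass
class DetectedTable:
    rows: int
    cols: int
    width: float   # bounding box width, any consistent unit
    height: float  # bounding box height, same unit


def triage_table(table):
    """Classify a detection as 'filtered', 'needs_review', or 'accepted'."""
    if table.rows < MIN_ROWS:
        return "filtered"
    if table.width <= 0 or table.height <= 0:
        return "filtered"
    # Treat very wide and very tall boxes symmetrically.
    ratio = max(table.width / table.height, table.height / table.width)
    if ratio > MAX_ASPECT_RATIO:
        return "filtered"
    if table.cols >= WIDE_TABLE_COLUMNS:
        return "needs_review"
    return "accepted"
```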

Figures

Figures are detected, cropped, and saved as individual previews.

What you get:

  • Cropped figure images viewable in the document sidebar
  • Bounding box overlays in the PDF viewer
  • Perceptual deduplication: repeated logos, headers, and watermarks are automatically filtered out (using average hash with Hamming distance ≤ 5)
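
For readers unfamiliar with average hashing, the deduplication idea can be sketched in plain Python. This is an illustration under assumptions, not Virza's code: it assumes each figure has already been downscaled to an 8x8 grayscale grid by an image library, and it assumes the first copy of a repeated image is kept while later near-duplicates are dropped. Only the Hamming distance threshold of 5 comes from the text above.

```python
MAX_DISTANCE = 5  # from the text: Hamming distance <= 5 counts as a duplicate


def average_hash(pixels):
    """Hash an 8x8 grid of grayscale values into a 64-bit integer.

    Each bit is 1 when the pixel is brighter than the grid's mean.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming(a, b):
    """Count the bit positions where two hashes differ."""
    return bin(a ^ b).count("1")


def deduplicate(hashes):
    """Return indices of hashes to keep, dropping later near-duplicates."""
    kept = []
    for index, candidate in enumerate(hashes):
        if all(hamming(candidate, hashes[k]) > MAX_DISTANCE for k in kept):
            kept.append(index)
    return kept
```

Because the hash depends only on which pixels are brighter than the mean, small differences in compression or rendering flip just a few bits, which is why a distance threshold is used rather than exact equality.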

On Pro+ plans, figures also receive AI-generated descriptions from a vision model, explaining what the chart, plot, or diagram shows.

Quality notes:

  • Clear, high-resolution figures extract best
  • Figures smaller than 2% of page area are filtered as noise
  • Figures in header/footer margins (top/bottom 5%) are excluded
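
The size and margin rules above can be sketched as a single predicate. The 2% and 5% thresholds come from the notes; the function name `keep_figure`, the top-left coordinate origin, and the choice to exclude only figures lying entirely inside a margin band are assumptions.

```python
MIN_AREA_FRACTION = 0.02  # from the notes: smaller figures are noise
MARGIN_FRACTION = 0.05    # from the notes: top/bottom 5% of the page


def keep_figure(box, page_width, page_height):
    """Return True if a figure box passes the size and margin filters.

    box is (x0, y0, x1, y1) with the origin at the top-left of the page
    (an assumed convention).
    """
    x0, y0, x1, y1 = box
    area = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    if area < MIN_AREA_FRACTION * page_width * page_height:
        return False
    top_band_end = MARGIN_FRACTION * page_height
    bottom_band_start = page_height - top_band_end
    # Assumption: only figures entirely inside a margin band are excluded.
    if y1 <= top_band_end or y0 >= bottom_band_start:
        return False
    return True
```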

Equations

LaTeX equations are extracted from mathematical content using Docling’s formula recognition.

What you get:

  • LaTeX source code for each equation
  • Rendered text representation
  • Page location and display mode (inline vs. block)

Citations

References are extracted using GROBID’s full-text document analysis, which identifies the actual bibliography section and parses each reference into structured fields.

What you get:

  • Structured citation records: title, authors, year, journal, DOI
  • DOI resolution via CrossRef for richer bibliographic data
  • Library linking: citations that match documents already in your workspace are automatically linked, building a citation graph across your library

Quality validation:

  • Chemistry noise filtering prevents NMR, IR, and synthesis data from being misclassified as references
  • Structural completeness scoring ranks each citation’s reliability (DOI match = highest, title + authors + year = good, title only = lower)
  • Cross-referenced citations show a link icon; external citations show a document icon
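
The structural completeness ranking can be pictured as a simple tiering function. This is a sketch under assumptions: the `Citation` type, its field names, and the `"unscored"` fallback are hypothetical, and only the three tiers (DOI match, title + authors + year, title only) come from the list above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Citation:
    title: str = ""
    authors: List[str] = field(default_factory=list)
    year: Optional[int] = None
    doi_resolved: bool = False  # True when the DOI matched a CrossRef record


def completeness_tier(citation):
    """Rank a parsed reference by how much structure it carries."""
    if citation.doi_resolved:
        return "highest"
    if citation.title and citation.authors and citation.year is not None:
        return "good"
    if citation.title:
        return "lower"
    # Assumed fallback for records without even a title.
    return "unscored"
```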

Summaries

AI-generated executive summaries are produced for every document.

On Free and Starter plans: a standard AI model produces a concise summary.

On Pro+ plans: an advanced model (Claude) generates multi-level summaries:

  • TLDR: one-sentence takeaway
  • Structured breakdown: Objective, Methods, Results, Conclusions
  • Detailed narrative: comprehensive summary covering all major findings

Claims (Enterprise only)

Statistical and evidence claims are extracted from results sections.

What you get:

  • Typed claims (statistical, causal, comparative, methodological)
  • Associated p-values, effect sizes, confidence intervals
  • Variable tracking (independent, dependent, control)
  • Evidence direction (supports, contradicts, neutral)
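
One way to picture an extracted claim is as a typed record. The sketch below is hypothetical: the class and field names are invented for illustration and do not describe Virza's export format. Only the claim types, the statistical fields, the variable roles, and the evidence directions come from the list above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class ClaimType(Enum):
    STATISTICAL = "statistical"
    CAUSAL = "causal"
    COMPARATIVE = "comparative"
    METHODOLOGICAL = "methodological"


class Direction(Enum):
    SUPPORTS = "supports"
    CONTRADICTS = "contradicts"
    NEUTRAL = "neutral"


@dataclass
class Claim:
    text: str
    claim_type: ClaimType
    direction: Direction
    p_value: Optional[float] = None
    effect_size: Optional[float] = None
    confidence_interval: Optional[Tuple[float, float]] = None
    independent_vars: List[str] = field(default_factory=list)
    dependent_vars: List[str] = field(default_factory=list)
    control_vars: List[str] = field(default_factory=list)


def by_direction(claims, direction):
    """Return the claims whose evidence direction matches."""
    return [claim for claim in claims if claim.direction == direction]
```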

Methodology analysis (Enterprise only)

What you get:

  • Study design classification (meta-analysis, RCT, cohort, case-control, etc.)
  • Methodology quality scores across dimensions: study design rigor, statistical methodology, bias risk
  • Automated quality assessment comparable to systematic review screening

What plan unlocks what

Extraction | Free | Starter | Pro | Team | Enterprise
Text, metadata, sections
Table detection + previews
Figure detection + previews
Equation extraction
Citation parsing (GROBID)
Executive summary
Artifact descriptions
Search embeddings
Vision descriptions
Structured tables
Multi-level summaries
QA pairs
Academic embeddings
Claims extraction
Methodology scoring
Citation verification

Features marked with ❌ for your plan will show “Not triggered” in the artifact status panel and display an upgrade prompt explaining what the feature does.
