What Virza Extracts
Virza’s document processing pipeline extracts structured knowledge from your research papers. This page explains each type of extraction, how it works, and how to use the results.
Metadata
Metadata is extracted from every document on all plans.
| Field | Source | Accuracy |
|---|---|---|
| Title | Document header + CrossRef/arXiv API lookup | High for published papers; may need manual correction for preprints |
| Authors | Document header + CrossRef/arXiv | High for standard formats |
| Abstract | Document content or API | High |
| DOI | Regex extraction from text + CrossRef resolution | Very high when present |
| arXiv ID | Regex extraction | Very high |
| Journal | CrossRef lookup | High for indexed journals |
| Publication year | Document + CrossRef/arXiv | High |
| Document type | AI classification (research paper, review article, technical report, etc.) | Good |
All metadata fields are editable. Click any metadata field in the document view to correct it. This is especially useful for preprints, working papers, and non-standard formats.
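The regex-based DOI and arXiv ID extraction listed in the table can be pictured with a short Python sketch. The patterns below are assumptions for illustration, not Virza's actual expressions: the DOI pattern follows the common modern form (`10.`, a 4 to 9 digit registrant, a slash, then a suffix), and the arXiv pattern only covers new-style IDs written with an `arXiv:` prefix. The helper names `find_doi` and `find_arxiv_id` are hypothetical.

```python
import re

# Assumed pattern for modern DOIs: "10." + 4-9 digit registrant + "/" + suffix.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")

# Assumed pattern for new-style arXiv IDs (YYMM.NNNN or YYMM.NNNNN, optional version).
ARXIV_RE = re.compile(r"\barXiv:\s*(\d{4}\.\d{4,5})(v\d+)?", re.IGNORECASE)


def find_doi(text):
    """Return the first DOI-looking string in text, or None."""
    match = DOI_RE.search(text)
    if match is None:
        return None
    # Trim punctuation that usually belongs to the sentence, not the DOI.
    # This can clip a DOI that legitimately ends in ")", so treat it as a heuristic.
    return match.group(0).rstrip(".,;)")


def find_arxiv_id(text):
    """Return the first new-style arXiv ID (with version if present), or None."""
    match = ARXIV_RE.search(text)
    if match is None:
        return None
    return match.group(1) + (match.group(2) or "")


print(find_doi("Available at https://doi.org/10.1000/xyz123."))
print(find_arxiv_id("arXiv:2301.12345v2 [cs.CL] 5 Jan 2023"))
```

A candidate found this way would still need to be resolved against CrossRef or arXiv before it is trusted, which is why the table lists the lookup alongside the regex step.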
Tables
Tables are detected and extracted using Docling’s TableFormer model (97.9% accuracy on standard academic tables).
What you get:
- Bounding box overlays in the PDF viewer showing where each table is located
- Extracted headers and row data as structured content
- Natural language summary describing what the table contains
- Confidence score indicating extraction reliability
On Pro+ plans, tables become fully structured data. You can query them, include them in extraction tables, and reference them in AI conversations.
Quality notes:
- Tables with standard row/column layouts extract best
- Complex merged cells or multi-page tables may extract partially
- Tables with fewer than 2 rows or extreme aspect ratios are filtered as false positives
- Very wide tables (20+ columns) are flagged for manual review
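The filtering rules in the quality notes above can be sketched as a small triage function. This is an illustration under stated assumptions, not Virza's implementation: the 2-row minimum and the 20-column review threshold come from this page, while `MAX_ASPECT_RATIO`, the `TableCandidate` type, and the status strings are hypothetical, since the page does not say what counts as an extreme aspect ratio.

```python
from dataclasses import dataclass

MIN_ROWS = 2             # from this page: fewer than 2 rows is a false positive
WIDE_COLUMNS = 20        # from this page: 20+ columns is flagged for review
MAX_ASPECT_RATIO = 15.0  # hypothetical: the page does not state the cutoff


@dataclass
class TableCandidate:
    rows: int
    columns: int
    width: float   # bounding box width, any consistent unit
    height: float  # bounding box height, same unit


def triage_table(table):
    """Return 'rejected', 'needs_review', or 'accepted' for a detected table."""
    if table.rows < MIN_ROWS:
        return "rejected"
    if table.width <= 0 or table.height <= 0:
        return "rejected"
    # Treat very tall and very wide boxes symmetrically.
    ratio = max(table.width / table.height, table.height / table.width)
    if ratio > MAX_ASPECT_RATIO:
        return "rejected"
    if table.columns >= WIDE_COLUMNS:
        return "needs_review"
    return "accepted"


print(triage_table(TableCandidate(rows=10, columns=4, width=400, height=200)))
print(triage_table(TableCandidate(rows=10, columns=22, width=500, height=200)))
```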
Figures
Figures are detected, cropped, and saved as individual previews.
What you get:
- Cropped figure images viewable in the document sidebar
- Bounding box overlays in the PDF viewer
- Perceptual deduplication: repeated logos, headers, and watermarks are automatically filtered out (using average hash with Hamming distance ≤ 5)
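For readers curious how average hash deduplication works in general, here is a minimal Python sketch. It assumes each image has already been converted to grayscale and downsampled to a small grid (8x8 is the usual choice), and it uses the Hamming distance threshold of 5 quoted above. The function names are hypothetical and the code is not taken from Virza's pipeline.

```python
def average_hash(gray):
    """Hash a small grayscale grid: one bit per pixel, set if brighter than the mean."""
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a, b):
    """Number of bit positions where two hashes differ."""
    return bin(a ^ b).count("1")


def is_duplicate(hash_a, hash_b, max_distance=5):
    """Two images count as duplicates when their hashes differ in few bits."""
    return hamming(hash_a, hash_b) <= max_distance


# A logo, a slightly altered copy of it, and an unrelated (inverted) image.
logo = [[0] * 4 + [255] * 4 for _ in range(8)]
logo_copy = [row[:] for row in logo]
logo_copy[0][0] = 255
other = [[255] * 4 + [0] * 4 for _ in range(8)]

print(is_duplicate(average_hash(logo), average_hash(logo_copy)))
print(is_duplicate(average_hash(logo), average_hash(other)))
```

Because the hash depends only on which pixels are brighter than the image mean, small changes in compression or brightness leave most bits unchanged, which is what makes it suitable for catching repeated logos and watermarks.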
On Pro+ plans, figures also receive AI-generated descriptions from a vision model, explaining what the chart, plot, or diagram shows.
Quality notes:
- Clear, high-resolution figures extract best
- Figures smaller than 2% of page area are filtered as noise
- Figures in header/footer margins (top/bottom 5%) are excluded
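The two geometric filters above can be expressed as a short check on each figure's bounding box. The sketch below is an assumption-laden illustration: the 2% area and 5% margin values come from this page, but the `Box` type, the top-left coordinate origin, and the choice to exclude only figures that lie entirely inside a margin band are hypothetical.

```python
from dataclasses import dataclass

MIN_AREA_FRACTION = 0.02  # from this page: smaller figures are noise
MARGIN_FRACTION = 0.05    # from this page: top/bottom 5% are header/footer


@dataclass
class Box:
    x0: float  # left
    y0: float  # top (assumes origin at the top-left of the page)
    x1: float  # right
    y1: float  # bottom


def keep_figure(box, page_width, page_height):
    """Return True if the figure passes the size and margin filters."""
    area = max(0.0, box.x1 - box.x0) * max(0.0, box.y1 - box.y0)
    if area < MIN_AREA_FRACTION * page_width * page_height:
        return False
    top_limit = MARGIN_FRACTION * page_height
    bottom_limit = (1 - MARGIN_FRACTION) * page_height
    # Assumption: a figure is excluded only if it sits fully inside a margin band.
    if box.y1 <= top_limit or box.y0 >= bottom_limit:
        return False
    return True


print(keep_figure(Box(100, 200, 400, 500), page_width=600, page_height=800))
print(keep_figure(Box(0, 0, 600, 30), page_width=600, page_height=800))
```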
Equations
Equations in mathematical content are extracted as LaTeX using Docling’s formula recognition.
What you get:
- LaTeX source code for each equation
- Rendered text representation
- Page location and display mode (inline vs. block)
Citations
References are extracted using GROBID’s full-text document analysis, which identifies the actual bibliography section and parses each reference into structured fields.
What you get:
- Structured citation records: title, authors, year, journal, DOI
- DOI resolution via CrossRef for richer bibliographic data
- Library linking: citations that match documents already in your workspace are automatically linked, building a citation graph across your library
Quality validation:
- Chemistry noise filtering prevents NMR, IR, and synthesis data from being misclassified as references
- Structural completeness scoring ranks each citation’s reliability (DOI match = highest, title + authors + year = good, title only = lower)
- Citations linked to a document in your library show a link icon; external citations show a document icon
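The structural completeness ranking described above can be sketched as a simple tiering function. The tier names mirror the wording on this page; the dictionary keys, the `unscored` tier, and the handling of partial records (for example, title plus year but no authors) are assumptions made for the example, not documented behavior.

```python
def completeness_tier(citation):
    """Rank a parsed citation by how much structure it carries.

    `citation` is a dict with hypothetical keys: 'doi_resolved' (bool),
    'title' (str), 'authors' (list of str), 'year' (int or None).
    """
    if citation.get("doi_resolved"):
        return "highest"
    has_title = bool(citation.get("title"))
    has_authors = bool(citation.get("authors"))
    has_year = bool(citation.get("year"))
    if has_title and has_authors and has_year:
        return "good"
    if has_title:
        # Assumption: partial records fall into the same tier as title-only ones.
        return "lower"
    return "unscored"  # hypothetical tier for records with no usable fields


print(completeness_tier({"title": "Deep Learning", "authors": ["LeCun", "Bengio", "Hinton"], "year": 2015}))
print(completeness_tier({"title": "Untitled working paper"}))
```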
Summaries
AI-generated executive summaries are produced for every document.
On Free and Starter plans, a standard AI model produces a concise summary.
On Pro+ plans, an advanced model (Claude) generates multi-level summaries:
- TLDR: one-sentence takeaway
- Structured breakdown: Objective, Methods, Results, Conclusions
- Detailed narrative: comprehensive summary covering all major findings
Claims (Enterprise only)
Statistical and evidence claims are extracted from results sections.
What you get:
- Typed claims (statistical, causal, comparative, methodological)
- Associated p-values, effect sizes, confidence intervals
- Variable tracking (independent, dependent, control)
- Evidence direction (supports, contradicts, neutral)
Methodology analysis (Enterprise only)
What you get:
- Study design classification (meta-analysis, RCT, cohort, case-control, etc.)
- Methodology quality scores across dimensions: study design rigor, statistical methodology, bias risk
- Automated quality assessment comparable to systematic review screening
What plan unlocks what
| Extraction | Free | Starter | Pro | Team | Enterprise |
|---|---|---|---|---|---|
| Text, metadata, sections | ✅ | ✅ | ✅ | ✅ | ✅ |
| Table detection + previews | ✅ | ✅ | ✅ | ✅ | ✅ |
| Figure detection + previews | ✅ | ✅ | ✅ | ✅ | ✅ |
| Equation extraction | ✅ | ✅ | ✅ | ✅ | ✅ |
| Citation parsing (GROBID) | ✅ | ✅ | ✅ | ✅ | ✅ |
| Executive summary | ✅ | ✅ | ✅ | ✅ | ✅ |
| Artifact descriptions | ✅ | ✅ | ✅ | ✅ | ✅ |
| Search embeddings | ✅ | ✅ | ✅ | ✅ | ✅ |
| Vision descriptions | ❌ | ❌ | ✅ | ✅ | ✅ |
| Structured tables | ❌ | ❌ | ✅ | ✅ | ✅ |
| Multi-level summaries | ❌ | ❌ | ✅ | ✅ | ✅ |
| QA pairs | ❌ | ❌ | ✅ | ✅ | ✅ |
| Academic embeddings | ❌ | ❌ | ✅ | ✅ | ✅ |
| Claims extraction | ❌ | ❌ | ❌ | ❌ | ✅ |
| Methodology scoring | ❌ | ❌ | ❌ | ❌ | ✅ |
| Citation verification | ❌ | ❌ | ❌ | ❌ | ✅ |
Features marked ❌ for your plan show “Not triggered” in the artifact status panel, along with an upgrade prompt that explains what the feature does.
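Because every row in the table above unlocks at one plan and stays unlocked on all higher plans, the matrix can be condensed to a minimum plan per feature. The Python sketch below encodes the table that way; `MIN_PLAN`, `is_unlocked`, and `locked_features` are hypothetical names for illustration and do not correspond to a Virza API.

```python
# Plans in ascending order, as they appear in the table columns.
PLANS = ["Free", "Starter", "Pro", "Team", "Enterprise"]

# Lowest plan on which each extraction is marked as available in the table.
MIN_PLAN = {
    "Text, metadata, sections": "Free",
    "Table detection + previews": "Free",
    "Figure detection + previews": "Free",
    "Equation extraction": "Free",
    "Citation parsing (GROBID)": "Free",
    "Executive summary": "Free",
    "Artifact descriptions": "Free",
    "Search embeddings": "Free",
    "Vision descriptions": "Pro",
    "Structured tables": "Pro",
    "Multi-level summaries": "Pro",
    "QA pairs": "Pro",
    "Academic embeddings": "Pro",
    "Claims extraction": "Enterprise",
    "Methodology scoring": "Enterprise",
    "Citation verification": "Enterprise",
}


def is_unlocked(feature, plan):
    """True if the feature is available on the given plan."""
    return PLANS.index(plan) >= PLANS.index(MIN_PLAN[feature])


def locked_features(plan):
    """Features that would show "Not triggered" on the given plan."""
    return [f for f in MIN_PLAN if not is_unlocked(f, plan)]


print(is_unlocked("Vision descriptions", "Starter"))
print(locked_features("Team"))
```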