How Document Processing Works

When you upload a document to Virza, it doesn’t just get stored. It goes through a multi-stage AI pipeline that transforms a static file into a research-ready artifact. Each stage extracts structured knowledge that powers search, chat, citations, and evidence analysis.

The processing lifecycle

Every document passes through four phases. You can use the document as soon as Phase 2 completes. Later phases add progressively deeper analysis.

Upload and security scan (instant) → Core extraction (15–30s) → AI enrichment (30–90s) → Ready

Phase 1: Upload and security (instant)

Your file is uploaded directly to secure storage via a presigned URL. The file data never passes through our API server. Before any processing begins, the file enters a quarantine zone where it is scanned for malware.

  • File integrity verified (SHA-256 hash)
  • File type validated (PDF, DOCX, or TXT only)
  • Virus scan completed
  • File size and page count checked against your plan limits

If the virus scan detects a threat, the document is immediately quarantined and deleted. You will see an Infected status. If file validation fails (wrong format, too large, too many pages), you will see a specific error explaining what went wrong. See Failure Modes for all possible errors.
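
The checks above can be pictured with a minimal sketch. This is an illustration only, not Virza's implementation: the function names, the error strings, and the limit parameters are hypothetical, and a real validator would likely inspect file contents rather than just the extension.

```python
import hashlib
from pathlib import Path

# Formats listed in the checklist above.
ALLOWED_EXTENSIONS = {".pdf", ".docx", ".txt"}


def sha256_of(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the raw file bytes."""
    return hashlib.sha256(data).hexdigest()


def validate_upload(filename, data, expected_sha256, max_bytes, page_count, max_pages):
    """Return a list of validation errors; an empty list means the file passed.

    max_bytes and max_pages stand in for plan limits (hypothetical parameters).
    """
    errors = []
    if Path(filename).suffix.lower() not in ALLOWED_EXTENSIONS:
        errors.append("unsupported file type")
    if sha256_of(data) != expected_sha256:
        errors.append("integrity check failed")
    if len(data) > max_bytes:
        errors.append("size limit exceeded")
    if page_count > max_pages:
        errors.append("page limit exceeded")
    return errors
```

Returning every failed check at once, rather than stopping at the first, is one way to produce the specific error messages described above.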

Phase 2: Core extraction (15–30 seconds)

The document enters the extraction pipeline. Your document status changes to Processing.

| What’s extracted | How it works |
| --- | --- |
| Full text | Docling AI parser extracts text, headers, captions, and footnotes with document structure preserved. Falls back to PyMuPDF for compatibility. |
| Metadata | Title, authors, abstract, DOI, journal, and publication year are identified from the document and enriched via CrossRef and arXiv APIs |
| Sections | The document is segmented into semantic sections (Abstract, Introduction, Methods, Results, Discussion, etc.) |
| Tables | Tables are detected with bounding boxes, confidence scores, headers, and row data |
| Figures | Figures are cropped, deduplicated (perceptual hashing prevents repeated logos/headers), and saved as previews |
| Equations | LaTeX equations are extracted from mathematical content |

After this phase, your document status changes to Available: you can read it, search for it, and see its structure.
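
To give a feel for how perceptual hashing can filter out repeated logos and headers, here is a minimal sketch using an average hash (aHash), one of the simplest perceptual hashes. The documentation does not say which algorithm Virza uses, so treat the technique, the function names, and the distance threshold as assumptions. The sketch also assumes each figure has already been downscaled to a small grayscale grid of identical size.

```python
def average_hash(pixels):
    """Average hash: one bit per pixel, set when the pixel is brighter than the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def dedupe(figures, max_distance=2):
    """Return indices of figures to keep, dropping near-duplicates of earlier ones.

    max_distance is a hypothetical threshold, not a documented setting.
    """
    kept_hashes = []
    kept_indices = []
    for i, fig in enumerate(figures):
        h = average_hash(fig)
        if all(hamming(h, k) > max_distance for k in kept_hashes):
            kept_hashes.append(h)
            kept_indices.append(i)
    return kept_indices
```

Because the hash depends on relative brightness rather than exact bytes, a logo that is re-rendered slightly lighter or darker on each page still hashes to the same value.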

Phase 3: AI enrichment (30–90 seconds)

Additional AI stages run in parallel to deepen the extraction. Your document status shows Enriching: you can already use it while enrichment completes.

| What’s produced | Description | Available on |
| --- | --- | --- |
| Executive summary | AI-generated overview of key findings | All plans |
| Citations | References parsed via GROBID, matched to your library, DOIs resolved via CrossRef | All plans |
| Artifact descriptions | AI-generated captions for tables and figures | All plans |
| Search embeddings | Vector representations enabling semantic search | All plans |
| Context enrichment | Links the document to related content in your workspace | All plans |
| Vision descriptions | AI vision model analyzes charts, plots, and diagrams | Pro+ |
| Structured tables | Tables become queryable structured data (not just images) | Pro+ |
| Document structure | Semantic structure analysis: sections, arguments, evidence flow | Pro+ |
| Academic embeddings | Discipline-aware embeddings for academic search | Pro+ |
| Multi-level summaries | TLDR + structured breakdown + detailed narrative | Pro+ |
| QA pairs | Pre-generated question-answer pairs for rapid comprehension | Pro+ |
| Claims extraction | Typed claims with p-values, effect sizes, and variable tracking | Enterprise |
| Methodology scoring | Automated quality assessment of research methods | Enterprise |
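
The plan gating in the table can be expressed as a small lookup. This is a sketch for illustration: the tier keys ("base", "pro", "enterprise") are hypothetical labels, and it assumes that "Pro+" means Pro and every higher plan.

```python
# Hypothetical tier labels; "base" stands for artifacts marked "All plans".
TIER_RANK = {"base": 0, "pro": 1, "enterprise": 2}

# Minimum tier per enrichment artifact, copied from the table above.
ARTIFACT_MIN_TIER = {
    "Executive summary": "base",
    "Citations": "base",
    "Artifact descriptions": "base",
    "Search embeddings": "base",
    "Context enrichment": "base",
    "Vision descriptions": "pro",
    "Structured tables": "pro",
    "Document structure": "pro",
    "Academic embeddings": "pro",
    "Multi-level summaries": "pro",
    "QA pairs": "pro",
    "Claims extraction": "enterprise",
    "Methodology scoring": "enterprise",
}


def artifacts_for(plan):
    """List the enrichment artifacts a given tier would receive."""
    rank = TIER_RANK[plan]
    return [name for name, tier in ARTIFACT_MIN_TIER.items() if TIER_RANK[tier] <= rank]
```

For example, `artifacts_for("pro")` returns the five artifacts available on all plans plus the six Pro+ artifacts, but not claims extraction or methodology scoring.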

Phase 4: Ready

All stages complete. Your document is fully searchable, chat-ready, and all extracted artifacts are available.

If any non-critical stage failed (for example, an embedding provider was temporarily unavailable), the document is marked Ready with warnings: core functionality works, but some enrichment artifacts may be missing.
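
The status flow described across the four phases can be summarized in a short sketch. The status labels come from this page; the enum, the helper names, and the shape of the stage results are hypothetical and only illustrate the rule that a failed non-critical stage downgrades Ready to Ready with warnings.

```python
from enum import Enum


class Status(Enum):
    PROCESSING = "Processing"
    AVAILABLE = "Available"
    ENRICHING = "Enriching"
    READY = "Ready"
    READY_WITH_WARNINGS = "Ready with warnings"
    INFECTED = "Infected"


# Statuses in which the document can already be read and searched.
USABLE = {Status.AVAILABLE, Status.ENRICHING, Status.READY, Status.READY_WITH_WARNINGS}


def is_usable(status):
    """True once core extraction (Phase 2) has completed."""
    return status in USABLE


def final_status(enrichment_ok):
    """Map non-critical enrichment stage results (stage name -> succeeded) to a final status."""
    if all(enrichment_ok.values()):
        return Status.READY
    return Status.READY_WITH_WARNINGS
```

A client that polls for status could stop blocking as soon as `is_usable` returns true and treat the later transitions as background updates.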

Processing time

| Document type | Typical time | Notes |
| --- | --- | --- |
| Standard academic paper (5–20 pages) | 15–60 seconds | Most papers complete in under 30 seconds |
| Long report (50–100 pages) | 1–3 minutes | More figures and tables mean more extraction time |
| Book-length document (100+ pages) | 3–10 minutes | Enterprise plans support up to 5,000 pages |

Pro and higher plans get priority processing: their documents are processed ahead of the standard queue.
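
One common way to implement this kind of scheduling is a priority queue that keeps submission order within each tier. The sketch below is an assumption about the general technique, not a description of Virza's scheduler; the class and method names are hypothetical.

```python
import heapq
from itertools import count


class ProcessingQueue:
    """Priority documents first; first-in, first-out within each tier."""

    def __init__(self):
        self._heap = []
        self._seq = count()  # tie-breaker that preserves submission order

    def submit(self, doc_id, priority):
        # Lower tuples pop first: tier 0 (priority) before tier 1 (standard).
        tier = 0 if priority else 1
        heapq.heappush(self._heap, (tier, next(self._seq), doc_id))

    def next_document(self):
        """Remove and return the next document ID to process."""
        return heapq.heappop(self._heap)[2]
```

The sequence counter matters: without it, two documents in the same tier would be ordered by comparing their IDs rather than by arrival time.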

What affects processing quality

  • Text-based PDFs produce the best results. Publisher PDFs from journals and arXiv have consistent layouts that the parser is trained on.
  • Scanned PDFs (image-only) trigger OCR, which works well for clean scans but may produce lower-quality text from poor images.
  • Password-protected PDFs cannot be processed. Remove the password before uploading.
  • DOCX files have good text extraction but limited metadata and citation parsing compared to PDF.
  • TXT files have basic support with no metadata extraction.

Duplicate detection

If you upload the same file twice (identical bytes), Virza detects the duplicate via SHA-256 hash matching and links to the existing processed version, so no additional processing is needed.

If you upload the same paper from a different source (for example, an arXiv preprint and the published version), Virza detects the match via DOI and notifies you of the existing copy.
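
The two-step check described here (exact bytes first, then DOI) can be sketched as follows. The `StoredDoc` record, the return labels, and the DOI normalization are hypothetical; they illustrate the order of the checks rather than Virza's actual matching logic.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class StoredDoc:
    doc_id: str
    sha256: str
    doi: Optional[str]


def normalize_doi(doi):
    """Lowercase the DOI and strip common prefixes (DOIs are case-insensitive)."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def find_duplicate(data: bytes, doi: Optional[str],
                   library: List[StoredDoc]) -> Optional[Tuple[str, str]]:
    """Return ("exact", id) for identical bytes, ("same_paper", id) for a DOI match, else None."""
    digest = hashlib.sha256(data).hexdigest()
    for doc in library:
        if doc.sha256 == digest:
            return ("exact", doc.doc_id)
    if doi:
        wanted = normalize_doi(doi)
        for doc in library:
            if doc.doi and normalize_doi(doc.doi) == wanted:
                return ("same_paper", doc.doc_id)
    return None
```

Checking the hash first means an exact re-upload is caught without relying on metadata, while the DOI pass handles files whose bytes differ.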

Metadata is editable

If Virza’s automatic detection gets something wrong, click on any metadata field to correct it manually. This is common for preprints, working papers, and non-standard formats.

Troubleshooting

| Issue | What happened | What to do |
| --- | --- | --- |
| Size limit exceeded | File is larger than your plan allows | Compress the PDF or upgrade your plan |
| Page limit exceeded | Document has more pages than your plan allows | Split the document or upgrade |
| Encryption detected | PDF is password-protected | Remove the password and re-upload |
| Parse timeout | Document structure was too complex for the parser (300s limit) | Try re-uploading; contact support@serenoty.com if it persists |
| OCR failed | Scanned PDF couldn’t be reliably read | Use a cleaner scan or the text-based version if available |
| Stuck in “Processing” | A pipeline stage may have timed out | Try re-uploading. If the issue persists, contact support@serenoty.com |

For the complete reference of all error states, see Failure Modes & Partial Results.
