Known Limitations

Virza is designed to be transparent about what it can and cannot do. This page documents current limitations so you can make informed decisions about using AI outputs in your research.

Document processing

Scanned PDFs: OCR quality varies significantly with scan quality. High-resolution, clean scans work well. Text extracted from low-resolution, skewed, or handwritten documents may be poor.
Complex layouts: In documents with multi-column layouts, text boxes overlapping figures, or unusual formatting, section boundaries may be detected incorrectly.
Non-English documents: Text extraction works for most Latin-script languages. Support for CJK (Chinese, Japanese, Korean) and right-to-left scripts (Arabic, Hebrew) is limited and may produce lower-quality results.
Very large tables: Tables spanning multiple pages may be extracted as separate segments rather than a single unified table.
Equations in figures: Equations embedded inside figure images are not extracted; only standalone equation blocks in the text are recognized.

Search

Semantic search requires embeddings: If a document’s embedding stage failed (shown as “Ready with warnings”), semantic search won’t find it by concept. Keyword search still works.
New documents: Newly uploaded documents become keyword-searchable within seconds, but may take up to 30 seconds to become semantically searchable while their embeddings are generated.
Cross-language search: Searching in one language will not reliably find documents written in a different language.
External database coverage: The Discover tab searches major academic databases (Semantic Scholar, OpenAlex, PubMed, arXiv, Crossref), but some niche journals, conference proceedings, or very recent publications may not be indexed yet.

AI chat

Hallucination risk: Like all large language models, Virza’s AI can generate plausible-sounding statements that are not grounded in your documents. Always check the knowledge source badge and confidence meter.
Context window limits: Very long conversations may cause earlier context to be summarized or dropped. For important questions, start a fresh conversation.
Complex multi-document reasoning: Cross-document synthesis works best with 2–10 related documents. With hundreds of documents in scope, the system prioritizes the most relevant ones rather than reasoning across all of them.
Domain specificity: The AI performs best on topics well-represented in its training data. Highly specialized or emerging subfields may produce lower-quality general knowledge responses.
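
To illustrate the multi-document limitation above, here is a minimal Python sketch of relevance-based prioritization. It is not Virza's implementation: the `ScoredDocument` type, the `relevance` score, and the cutoff of 10 (borrowed from the "2–10 related documents" guidance) are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class ScoredDocument:
    doc_id: str
    relevance: float  # hypothetical score; higher means more relevant to the question


def select_for_synthesis(docs, max_docs=10):
    """Keep only the highest-scoring documents (max_docs is an assumed cutoff)."""
    ranked = sorted(docs, key=lambda d: d.relevance, reverse=True)
    return ranked[:max_docs]


# A scope of 300 documents with made-up, distinct relevance scores.
library = [ScoredDocument(f"doc-{i}", relevance=i / 300) for i in range(300)]
selected = select_for_synthesis(library)

print(len(selected), "of", len(library), "documents are passed on for synthesis")
```

Under this kind of scheme, documents that fall below the cutoff do not influence the answer at all, which is why narrowing the scope to a small, related set tends to give better synthesis.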

Citations

Extraction accuracy: Citation parsing is highly accurate for standard academic reference formats but may struggle with:
- Non-standard citation styles (legal citations, patent references)
- References embedded in footnotes rather than a bibliography section
- Chemistry-heavy papers where reaction data can be misclassified as references (Virza includes specific filters for this, but edge cases exist)
Library linking: Citation-to-library matching relies on DOI and title similarity. Papers without DOIs or with significantly different titles between preprint and published versions may not be automatically linked.
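
The library-linking behavior can be pictured with the following Python sketch. It only illustrates the general idea of "match on DOI first, then fall back to title similarity". The function names, the 0.9 threshold, the use of `difflib.SequenceMatcher`, and the sample DOI are all assumptions, not Virza's actual matching logic.

```python
from difflib import SequenceMatcher

TITLE_THRESHOLD = 0.9  # hypothetical cutoff; the real value is not documented


def normalize_doi(doi):
    """Lowercase a DOI and strip common prefixes so equivalent forms compare equal."""
    if not doi:
        return None
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi or None


def title_similarity(a, b):
    """Return a 0..1 similarity ratio between two titles, ignoring case."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def match_citation(citation, library):
    """Return the matching library paper, or None if nothing is close enough."""
    cited_doi = normalize_doi(citation.get("doi"))
    if cited_doi:
        for paper in library:
            if normalize_doi(paper.get("doi")) == cited_doi:
                return paper
    # No DOI match: fall back to the most similar title above the threshold.
    best, best_score = None, 0.0
    for paper in library:
        score = title_similarity(citation["title"], paper["title"])
        if score > best_score:
            best, best_score = paper, score
    return best if best_score >= TITLE_THRESHOLD else None


# Made-up sample data for illustration.
library = [
    {"doi": "10.1000/xyz123", "title": "Attention Mechanisms in Protein Folding"},
    {"doi": None, "title": "A Survey of Graph Neural Networks"},
]

by_doi = match_citation({"doi": "https://doi.org/10.1000/XYZ123", "title": "whatever"}, library)
by_title = match_citation({"doi": None, "title": "A survey of graph neural networks"}, library)
unmatched = match_citation({"doi": None, "title": "Scaling Laws for Sparse Models"}, library)
```

In this sketch a citation with no DOI and a substantially retitled published version falls below the threshold and stays unlinked, which mirrors the preprint case described above.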

Collaboration

Real-time editing: Notes support auto-save but not real-time collaborative editing (simultaneous editing by multiple users). If two people edit the same note at the same time, the last save wins and overwrites the earlier one.
Workspace isolation: Documents, collections, and notes are strictly workspace-scoped. Individual documents cannot be shared directly between workspaces; instead, use share links or invite users to the workspace.
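
The "last save wins" behavior for notes, described earlier in this section, can be illustrated with a small Python sketch. The `NoteStore` class and the note ID are hypothetical stand-ins. The sketch shows the general pattern of unconditional overwrites, not how Virza stores notes internally.

```python
class NoteStore:
    """Hypothetical in-memory store where every save overwrites the whole note."""

    def __init__(self):
        self._notes = {}

    def load(self, note_id):
        return self._notes.get(note_id, "")

    def save(self, note_id, text):
        # No version check and no merge: the most recent save replaces the content.
        self._notes[note_id] = text


store = NoteStore()
store.save("note-1", "Shared draft.")

# Two users open the same note and each starts from the same text.
alice_copy = store.load("note-1")
bob_copy = store.load("note-1")

# Alice saves first, then Bob saves his copy, which never included Alice's change.
store.save("note-1", alice_copy + " Alice's comment.")
store.save("note-1", bob_copy + " Bob's comment.")

final_text = store.load("note-1")
print(final_text)
```

Because Bob's copy was loaded before Alice saved, his save silently discards her comment. If several people need to edit the same note, taking turns avoids this.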

Privacy

AI model providers: Virza uses third-party AI model providers (OpenAI, Anthropic) for some AI features. Your document content is sent to these providers for processing but is never stored by them or used for training. See Privacy for details.
Search analytics: Search queries are logged for system performance monitoring. They are not associated with individual users and are purged regularly.