Supported Formats
Virza handles the most common academic document formats.
Fully supported
| Format | Extension | Notes |
|---|---|---|
| PDF | .pdf | Best support. Academic papers, preprints, reports. Full text extraction, metadata detection, and citation parsing. |
| Microsoft Word | .docx | Good support. Text and basic formatting extracted. |
| Plain text | .txt | Basic support. No metadata extraction. |
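
If you want to check files before uploading them, the table above can be expressed as a small lookup. This is only a minimal sketch: the `SUPPORT_LEVELS` dictionary and the `support_level` function are hypothetical names invented for illustration and are not part of any Virza API. It only inspects the file extension, not the file contents.

```python
from pathlib import Path

# Support levels copied from the table above. The names here are
# hypothetical and chosen for illustration; they are not a Virza API.
SUPPORT_LEVELS = {
    ".pdf": "best",
    ".docx": "good",
    ".txt": "basic",
}


def support_level(filename: str) -> str:
    """Return the support level for a file name, or "unsupported"."""
    # Compare extensions case-insensitively so "paper.PDF" matches ".pdf".
    suffix = Path(filename).suffix.lower()
    return SUPPORT_LEVELS.get(suffix, "unsupported")


if __name__ == "__main__":
    for name in ["paper.PDF", "thesis.docx", "notes.txt", "book.epub"]:
        print(f"{name}: {support_level(name)}")
```

An extension check like this cannot tell a text-based PDF from a scanned one, so it is best treated as a first filter only.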
Best results with PDF
PDF documents from academic publishers and preprint servers (arXiv, bioRxiv, SSRN) give the best results because they follow standard layouts that our parser is trained to understand.
For best parsing quality:
- Text-based PDFs: Use these rather than scanned images. If you can select the text in your PDF reader, the file is text-based.
- Publisher PDFs: Direct downloads from journals produce better results than screenshots or re-saved files.
- arXiv papers: Excellent parsing results due to consistent formatting
Scanned PDFs (image-only) have limited support. Virza attempts OCR, but results may vary. For best results, use the text-based version if available.
Planned formats
We’re working on support for:
- EPUB: E-book format
- HTML: Web pages and articles
- Markdown: .md files
- BibTeX: .bib citation files for bulk import
Have a format request? Let us know at support@serenoty.com.