From 8cb96bfd65e4e1334de5a457961b801991a204fb Mon Sep 17 00:00:00 2001
From: Vik Paruchuri
Date: Mon, 12 Jan 2026 17:59:28 -0500
Subject: [PATCH] README

---
 README.md | 161 ++++++++++++++++++++++++++---------------------------
 1 file changed, 76 insertions(+), 85 deletions(-)

diff --git a/README.md b/README.md
index 650275f..5f38c57 100644
--- a/README.md
+++ b/README.md
@@ -1,85 +1,106 @@
 # Chandra
 
-Chandra is a highly accurate OCR model that converts images and PDFs into structured HTML/Markdown/JSON while preserving layout information.
+[![Discord](https://img.shields.io/badge/Discord-Join%20us-5865F2?logo=discord&logoColor=white)](https://discord.gg/KuZwXNGnfH)
 
-## Features
+An OCR model for complex documents: handwriting, tables, math equations, and messy forms.
 
-- Convert documents to markdown, html, or json with detailed layout information
-- Good handwriting support
-- Reconstructs forms accurately, including checkboxes
-- Good support for tables, math, and complex layouts
-- Extracts images and diagrams, with captions and structured data
-- Support for 40+ languages
-- Two inference modes: local (HuggingFace) and remote (vLLM server)
+## Benchmarks
+
+Overall scores on the [olmocr bench](https://github.com/allenai/olmocr):
 
 ## Hosted API
 
-- We have a hosted API for Chandra [here](https://www.datalab.to/), which also includes other accuracy improvements and document workflows.
-- There is a free playground [here](https://www.datalab.to/playground) if you want to try it out without installing.
+A hosted API with additional accuracy improvements is available at [datalab.to](https://www.datalab.to/). Try the [free playground](https://www.datalab.to/playground) without installing anything.
 
-## Quickstart
+## Community
 
-The easiest way to start is with the CLI tools:
+Join the [Discord](https://discord.gg/KuZwXNGnfH) to discuss development and get help.
+
+## Quick Start
 
 ```shell
 pip install chandra-ocr
 
-# With VLLM
+# Start vLLM server, then run OCR
 chandra_vllm
 chandra input.pdf ./output
 
-# With HuggingFace
+# Or use HuggingFace locally
 chandra input.pdf ./output --method hf
 
-# Interactive streamlit app
+# Interactive web app
 chandra_app
 ```
 
-## Benchmarks
-
-These are overall scores on the olmocr bench.
-
-See full scores [below](#benchmark-table).
+**Python:**
+
+```python
+from chandra.model import InferenceManager
+from chandra.input import load_pdf_images
+
+manager = InferenceManager(method="hf")
+images = load_pdf_images("document.pdf")
+results = manager.generate(images)
+print(results[0].markdown)
+```
+
+## How It Works
+
+- **Two inference modes**: Run locally via HuggingFace Transformers, or deploy a vLLM server for production throughput
+- **Layout-aware output**: Every text block, table, and image comes with bounding box coordinates (see the sketch below)
+- **Structured formats**: Output as Markdown, HTML, or JSON with full layout metadata
+- **40+ languages** supported
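+
+As a rough illustration of that layout metadata, here is a minimal sketch. `results[0].markdown` is real (see the Quick Start above), but the JSON payload shape and the `blocks`/`bbox` key names are illustrative placeholders, not the documented API:
+
+```python
+from chandra.model import InferenceManager
+from chandra.input import load_pdf_images
+
+manager = InferenceManager(method="hf")
+results = manager.generate(load_pdf_images("document.pdf"))
+page = results[0]
+
+print(page.markdown)  # documented in the Quick Start
+
+# Hypothetical: inspect the per-block layout payload. The "blocks" and
+# "bbox" names are placeholders; check the result object for the real names.
+layout = getattr(page, "json", None)
+if isinstance(layout, dict):
+    for block in layout.get("blocks", []):
+        print(block.get("type"), block.get("bbox"))
+```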
+
+## What It Handles
+
+**Handwriting:** Doctor notes, filled forms, homework. Chandra reads cursive and messy print that trips up traditional OCR.
+
+**Tables:** Preserves structure, including merged cells (colspan/rowspan); see the sketch below. Works on financial filings, invoices, and data tables.
+
+**Math:** Inline and block equations rendered as LaTeX. Handles textbooks, worksheets, and research papers.
+
+**Forms:** Reconstructs checkboxes, radio buttons, and form fields with their values.
+
+**Complex Layouts:** Multi-column documents, newspapers, and textbooks with figures and captions.
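+
+Because tables are emitted as standard HTML, merged cells survive as `colspan`/`rowspan` attributes. A quick way to verify that on a converted document (a hedged sketch: the file path follows the output layout shown under Usage below, and `beautifulsoup4` is an extra dependency used only for this example):
+
+```python
+from bs4 import BeautifulSoup  # pip install beautifulsoup4
+
+# Read the HTML emitted for one document (see "Output structure" below).
+html = open("output/filename/filename.html").read()
+soup = BeautifulSoup(html, "html.parser")
+
+# Print every merged cell with its spans and text content.
+for cell in soup.select("td[colspan], td[rowspan]"):
+    print(cell.get("colspan"), cell.get("rowspan"), cell.get_text(strip=True))
+```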
 
 ## Examples
 
 | Type | Name | Link |
 |------|------|------|
-| Tables | Water Damage Form | [View](https://github.com/datalab-to/chandra/blob/master/assets/examples/tables/water_damage.png) |
 | Tables | 10K Filing | [View](https://github.com/datalab-to/chandra/blob/master/assets/examples/tables/10k.png) |
-| Forms | Handwritten Form | [View](https://github.com/datalab-to/chandra/blob/master/assets/examples/forms/handwritten_form.png) |
 | Forms | Lease Agreement | [View](https://github.com/datalab-to/chandra/blob/master/assets/examples/forms/lease.png) |
-| Handwriting | Doctor Note | [View](https://github.com/datalab-to/chandra/blob/master/assets/examples/handwriting/doctor_note.png) |
 | Handwriting | Math Homework | [View](https://github.com/datalab-to/chandra/blob/master/assets/examples/handwriting/math_hw.png) |
 | Books | Geography Textbook | [View](https://github.com/datalab-to/chandra/blob/master/assets/examples/books/geo_textbook_page.png) |
 | Books | Exercise Problems | [View](https://github.com/datalab-to/chandra/blob/master/assets/examples/books/exercises.png) |
 | Math | Attention Diagram | [View](https://github.com/datalab-to/chandra/blob/master/assets/examples/math/attn_all.png) |
 | Math | Worksheet | [View](https://github.com/datalab-to/chandra/blob/master/assets/examples/math/worksheet.png) |
-| Math | EGA Page | [View](https://github.com/datalab-to/chandra/blob/master/assets/examples/math/ega.png) |
-| Newspapers | New York Times | [View](https://github.com/datalab-to/chandra/blob/master/assets/examples/newspapers/nyt.png) |
 | Newspapers | LA Times | [View](https://github.com/datalab-to/chandra/blob/master/assets/examples/newspapers/la_times.png) |
 | Other | Transcript | [View](https://github.com/datalab-to/chandra/blob/master/assets/examples/other/transcript.png) |
 | Other | Flowchart | [View](https://github.com/datalab-to/chandra/blob/master/assets/examples/other/flowchart.png) |
 
-## Community
-
-[Discord](https://discord.gg//KuZwXNGnfH) is where we discuss future development.
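+
+To try one of the examples above locally, the repository's example images can be fed straight to the Python API. A sketch under one assumption: that `generate` accepts PIL images, i.e. the same objects `load_pdf_images` produces for PDF pages:
+
+```python
+from PIL import Image
+from chandra.model import InferenceManager
+
+manager = InferenceManager(method="hf")
+
+# Any example PNG from the table above works as input.
+# Assumption: generate() takes a list of PIL images.
+image = Image.open("assets/examples/handwriting/math_hw.png")
+results = manager.generate([image])
+print(results[0].markdown)
+```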
 ## Installation
 
 ```bash
 pip install chandra-ocr
 ```
 
-If you're going to use the huggingface method, we also recommend installing [flash attention](https://github.com/Dao-AILab/flash-attention).
+For HuggingFace inference, we recommend installing [flash attention](https://github.com/Dao-AILab/flash-attention) for better performance.
 
-### From Source
+**From source:**
 
 ```bash
 git clone https://github.com/datalab-to/chandra.git
 cd chandra
 uv venv
 source .venv/bin/activate
 ```
@@ -92,17 +113,15 @@
 ### CLI
 
-Process single files or entire directories:
-
 ```bash
-# Single file, with vllm server (see below for how to launch vllm)
+# Single file with vLLM server
 chandra input.pdf ./output --method vllm
 
-# Process all files in a directory with local model
+# Directory with local model
 chandra ./documents ./output --method hf
 ```
 
-**CLI Options:**
+**Options:**
 - `--method [hf|vllm]`: Inference method (default: vllm)
 - `--page-range TEXT`: Page range for PDFs (e.g., "1-5,7,9-12")
 - `--max-output-tokens INTEGER`: Max tokens per page
@@ -111,77 +130,49 @@ chandra ./documents ./output --method hf
 - `--include-headers-footers/--no-headers-footers`: Include page headers/footers (default: exclude)
 - `--batch-size INTEGER`: Pages per batch (default: 1)
 
-**Output Structure:**
+**Output structure:**
 
-Each processed file creates a subdirectory with:
-- `.md` - Markdown output
-- `.html` - HTML output
-- `_metadata.json` - Metadata (page info, token count, etc.)
-- `images/` - Extracted images from the document
-
-### Streamlit Web App
-
-Launch the interactive demo for single-page processing:
-
-```bash
-chandra_app
+```
+output/
+└── filename/
+    ├── filename.md             # Markdown
+    ├── filename.html           # HTML with bounding boxes
+    ├── filename_metadata.json  # Metadata (page info, token count, etc.)
+    └── images/                 # Extracted images
 ```
 
-### vLLM Server (Optional)
+### vLLM Server
 
-For production deployments or batch processing, use the vLLM server:
+For production or batch processing:
 
 ```bash
 chandra_vllm
 ```
 
-This launches a Docker container with optimized inference settings. Configure via environment variables:
+This launches a Docker container with optimized inference settings. Configure it via environment variables:
 
 - `VLLM_API_BASE`: Server URL (default: `http://localhost:8000/v1`)
-- `VLLM_MODEL_NAME`: Model name for the server (default: `chandra`)
+- `VLLM_MODEL_NAME`: Model name (default: `chandra`)
 - `VLLM_GPUS`: GPU device IDs (default: `0`)
 
-You can also start your own vllm server with the `datalab-to/chandra` model.
 
 ### Configuration
 
-Settings can be configured via environment variables or a `local.env` file:
+Configure settings via environment variables or a `local.env` file:
 
 ```bash
-# Model settings
 MODEL_CHECKPOINT=datalab-to/chandra
 MAX_OUTPUT_TOKENS=8192
-
-# vLLM settings
 VLLM_API_BASE=http://localhost:8000/v1
-VLLM_MODEL_NAME=chandra
 VLLM_GPUS=0
 ```
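+
+If the server is running somewhere non-default, the same settings drive the Python API. A sketch with two assumptions: that `method="vllm"` mirrors the CLI's `--method vllm` flag, and that the variables must be in the environment (or `local.env`) before the manager is created:
+
+```python
+import os
+
+# Assumption: set these before importing/creating the manager so the
+# settings are picked up; local.env (see Configuration above) also works.
+os.environ["VLLM_API_BASE"] = "http://localhost:8000/v1"
+os.environ["VLLM_MODEL_NAME"] = "chandra"
+
+from chandra.model import InferenceManager
+from chandra.input import load_pdf_images
+
+manager = InferenceManager(method="vllm")
+results = manager.generate(load_pdf_images("document.pdf"))
+print(results[0].markdown)
+```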
 
-# Commercial usage
+## Commercial Usage
 
-This code is Apache 2.0, and our model weights use a modified OpenRAIL-M license (free for research, personal use, and startups under $2M funding/revenue, cannot be used competitively with our API). To remove the OpenRAIL license requirements, or for broader commercial licensing, visit our pricing page [here](https://www.datalab.to/pricing?utm_source=gh-chandra).
+The code is Apache 2.0. The model weights use a modified OpenRAIL-M license: free for research, personal use, and startups under $2M in funding or revenue, but they cannot be used to compete with our API. For broader commercial licensing, see [pricing](https://www.datalab.to/pricing?utm_source=gh-chandra).
 
-# Benchmark table
-
-| **Model** | ArXiv | Old Scans Math | Tables | Old Scans | Headers and Footers | Multi column | Long tiny text | Base | Overall | Source |
-|:--------------------------|:--------:|:--------------:|:--------:|:---------:|:-------------------:|:------------:|:--------------:|:----:|:--------------:|:------:|
-| Datalab Chandra v0.1.0 | 82.2 | **80.3** | **88.0** | **50.4** | 90.8 | 81.2 | **92.3** | **99.9** | **83.1 ± 0.9** | Own benchmarks |
-| Datalab Marker v1.10.0 | **83.8** | 69.7 | 74.8 | 32.3 | 86.6 | 79.4 | 85.7 | 99.6 | 76.5 ± 1.0 | Own benchmarks |
-| Mistral OCR API | 77.2 | 67.5 | 60.6 | 29.3 | 93.6 | 71.3 | 77.1 | 99.4 | 72.0 ± 1.1 | olmocr repo |
-| Deepseek OCR | 75.2 | 72.3 | 79.7 | 33.3 | 96.1 | 66.7 | 80.1 | 99.7 | 75.4 ± 1.0 | Own benchmarks |
-| GPT-4o (Anchored) | 53.5 | 74.5 | 70.0 | 40.7 | 93.8 | 69.3 | 60.6 | 96.8 | 69.9 ± 1.1 | olmocr repo |
-| Gemini Flash 2 (Anchored) | 54.5 | 56.1 | 72.1 | 34.2 | 64.7 | 61.5 | 71.5 | 95.6 | 63.8 ± 1.2 | olmocr repo |
-| Qwen 3 VL 8B | 70.2 | 75.1 | 45.6 | 37.5 | 89.1 | 62.1 | 43.0 | 94.3 | 64.6 ± 1.1 | Own benchmarks |
-| olmOCR v0.3.0 | 78.6 | 79.9 | 72.9 | 43.9 | **95.1** | 77.3 | 81.2 | 98.9 | 78.5 ± 1.1 | olmocr repo |
-| dots.ocr | 82.1 | 64.2 | 88.3 | 40.9 | 94.1 | **82.4** | 81.2 | 99.5 | 79.1 ± 1.0 | dots.ocr repo |
-
-# Credits
-
-Thank you to the following open source projects:
+## Credits
 
 - [Huggingface Transformers](https://github.com/huggingface/transformers)
-- [VLLM](https://github.com/vllm-project/vllm)
-- [olmocr](github.com/allenai/olmocr)
-- [Qwen 3 VL](https://github.com/QwenLM/Qwen3)
\ No newline at end of file
+- [vLLM](https://github.com/vllm-project/vllm)
+- [olmocr](https://github.com/allenai/olmocr)
+- [Qwen3 VL](https://github.com/QwenLM/Qwen3)
\ No newline at end of file