# Chandra

Chandra is a highly accurate OCR model that converts images and PDFs into structured HTML/Markdown/JSON while preserving layout information.

[Discord](https://discord.gg/KuZwXNGnfH)

## Features

An OCR model for complex documents — handwriting, tables, math equations, and messy forms.

- Convert documents to Markdown, HTML, or JSON with detailed layout information
- Good handwriting support
- Reconstructs forms accurately, including checkboxes
- Good support for tables, math, and complex layouts
- Extracts images and diagrams, with captions and structured data
- Support for 40+ languages
- Two inference modes: local (HuggingFace) and remote (vLLM server)

<img src="assets/examples/forms/handwritten_form.png" width="700px"/>

## Benchmarks

Overall scores on the [olmocr bench](https://github.com/allenai/olmocr):

<img src="assets/benchmarks/bench.png" width="600px"/>

See full scores [below](#benchmark-table).

## Hosted API

A hosted API with additional accuracy improvements and document workflows is available at [datalab.to](https://www.datalab.to/). Try the [free playground](https://www.datalab.to/playground) without installing anything.

## Community

Join [Discord](https://discord.gg/KuZwXNGnfH) to discuss development and get help.

## Quick Start

The easiest way to start is with the CLI tools:

```shell
pip install chandra-ocr

# Start vLLM server, then run OCR
chandra_vllm
chandra input.pdf ./output

# Or use HuggingFace locally
chandra input.pdf ./output --method hf

# Interactive Streamlit web app (single-page demo)
chandra_app
```

**Python:**

```python
from chandra.model import InferenceManager
from chandra.input import load_pdf_images

manager = InferenceManager(method="hf")
images = load_pdf_images("document.pdf")
results = manager.generate(images)
print(results[0].markdown)
```
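
If the vLLM server is running (via `chandra_vllm`), the same flow should also work against it; presumably `InferenceManager(method="vllm")` mirrors the CLI's `--method vllm` option.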

## How It Works

- **Two inference modes**: Run locally via HuggingFace Transformers, or deploy a vLLM server for production throughput
- **Layout-aware output**: Every text block, table, and image comes with bounding box coordinates
- **Structured formats**: Output as Markdown, HTML, or JSON with full layout metadata
- **40+ languages** supported
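
As a quick sketch (using a hypothetical `report.pdf`), one CLI run produces the Markdown, HTML, and metadata outputs together:

```bash
# Convert a document, then inspect the generated outputs
chandra report.pdf ./output --method hf
ls ./output/report/
# report.md  report.html  report_metadata.json  images/
```

The exact layout is documented under Output structure below.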

## What It Handles

**Handwriting** — Doctor notes, filled forms, homework. Chandra reads cursive and messy print that trips up traditional OCR.

**Tables** — Preserves structure, including merged cells (colspan/rowspan). Works on financial filings, invoices, and data tables.

**Math** — Inline and block equations rendered as LaTeX. Handles textbooks, worksheets, and research papers.

**Forms** — Reconstructs checkboxes, radio buttons, and form fields with their values.

**Complex Layouts** — Multi-column documents, newspapers, textbooks with figures and captions.

## Examples

| | |
|---|---|
| <img src="assets/examples/handwriting/doctor_note.png" width="350px"/><br/>**Handwriting** | <img src="assets/examples/tables/water_damage.png" width="350px"/><br/>**Tables** |
| <img src="assets/examples/math/ega.png" width="350px"/><br/>**Math** | <img src="assets/examples/newspapers/nyt.png" width="350px"/><br/>**Newspapers** |

<details>
<summary>More examples</summary>

| Type | Name | Link |
|------|------|------|
| Tables | Water Damage Form | [View](https://github.com/datalab-to/chandra/blob/master/assets/examples/tables/water_damage.png) |
| Tables | 10K Filing | [View](https://github.com/datalab-to/chandra/blob/master/assets/examples/tables/10k.png) |
| Forms | Handwritten Form | [View](https://github.com/datalab-to/chandra/blob/master/assets/examples/forms/handwritten_form.png) |
| Forms | Lease Agreement | [View](https://github.com/datalab-to/chandra/blob/master/assets/examples/forms/lease.png) |
| Handwriting | Doctor Note | [View](https://github.com/datalab-to/chandra/blob/master/assets/examples/handwriting/doctor_note.png) |
| Handwriting | Math Homework | [View](https://github.com/datalab-to/chandra/blob/master/assets/examples/handwriting/math_hw.png) |
| Books | Geography Textbook | [View](https://github.com/datalab-to/chandra/blob/master/assets/examples/books/geo_textbook_page.png) |
| Books | Exercise Problems | [View](https://github.com/datalab-to/chandra/blob/master/assets/examples/books/exercises.png) |
| Math | Attention Diagram | [View](https://github.com/datalab-to/chandra/blob/master/assets/examples/math/attn_all.png) |
| Math | Worksheet | [View](https://github.com/datalab-to/chandra/blob/master/assets/examples/math/worksheet.png) |
| Math | EGA Page | [View](https://github.com/datalab-to/chandra/blob/master/assets/examples/math/ega.png) |
| Newspapers | New York Times | [View](https://github.com/datalab-to/chandra/blob/master/assets/examples/newspapers/nyt.png) |
| Newspapers | LA Times | [View](https://github.com/datalab-to/chandra/blob/master/assets/examples/newspapers/la_times.png) |
| Other | Transcript | [View](https://github.com/datalab-to/chandra/blob/master/assets/examples/other/transcript.png) |
| Other | Flowchart | [View](https://github.com/datalab-to/chandra/blob/master/assets/examples/other/flowchart.png) |

</details>

## Installation

### Package

```bash
pip install chandra-ocr
```

For HuggingFace inference, we recommend installing [flash attention](https://github.com/Dao-AILab/flash-attention) for better performance.
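
A typical install, following the flash-attention project's instructions (build requirements vary by CUDA setup):

```bash
pip install flash-attn --no-build-isolation
```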

### From Source

```bash
git clone https://github.com/datalab-to/chandra.git
cd chandra
python -m venv .venv
source .venv/bin/activate
pip install -e .
```

### CLI

Process single files or entire directories:

```bash
# Single file with vLLM server
chandra input.pdf ./output --method vllm

# Directory with local model
chandra ./documents ./output --method hf
```

**Options:**

- `--method [hf|vllm]`: Inference method (default: vllm)
- `--page-range TEXT`: Page range for PDFs (e.g., "1-5,7,9-12")
- `--max-output-tokens INTEGER`: Max tokens per page
- `--include-headers-footers/--no-headers-footers`: Include page headers/footers (default: exclude)
- `--batch-size INTEGER`: Pages per batch (default: 1)
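
Options can be combined; for instance, with a hypothetical `scanned.pdf`:

```bash
# OCR pages 1-5 and 9 via the vLLM server, four pages per batch,
# keeping page headers and footers in the output
chandra scanned.pdf ./output --method vllm --page-range "1-5,9" \
  --batch-size 4 --include-headers-footers
```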

**Output structure:**

Each processed file creates a subdirectory:

```
output/
└── filename/
    ├── filename.md              # Markdown
    ├── filename.html            # HTML with bounding boxes
    ├── filename_metadata.json   # Metadata (page info, token count, etc.)
    └── images/                  # Extracted images
```

### vLLM Server

For production or batch processing:

```bash
chandra_vllm
```

This launches a Docker container with optimized inference settings. Configure via environment variables:

- `VLLM_API_BASE`: Server URL (default: `http://localhost:8000/v1`)
- `VLLM_MODEL_NAME`: Model name for the server (default: `chandra`)
- `VLLM_GPUS`: GPU device IDs (default: `0`)

You can also start your own vLLM server with the `datalab-to/chandra` model.
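
As a sketch, using vLLM's own CLI (assumes vLLM is installed locally and a GPU is available):

```bash
# Serve the model yourself instead of using the bundled Docker container
vllm serve datalab-to/chandra --served-model-name chandra --port 8000

# Then point Chandra at it via the variables above
export VLLM_API_BASE=http://localhost:8000/v1
export VLLM_MODEL_NAME=chandra
chandra input.pdf ./output --method vllm
```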

### Configuration

Settings can be configured via environment variables or a `local.env` file:

```bash
# Model settings
MODEL_CHECKPOINT=datalab-to/chandra
MAX_OUTPUT_TOKENS=8192

# vLLM settings
VLLM_API_BASE=http://localhost:8000/v1
VLLM_MODEL_NAME=chandra
VLLM_GPUS=0
```

## Commercial Usage

The code is Apache 2.0. Model weights use a modified OpenRAIL-M license: free for research, personal use, and startups under $2M in funding or revenue, but they cannot be used to compete with our API. To remove the OpenRAIL requirements or arrange broader commercial licensing, see our [pricing page](https://www.datalab.to/pricing?utm_source=gh-chandra).

## Benchmark Table

| **Model** | ArXiv | Old Scans Math | Tables | Old Scans | Headers and Footers | Multi column | Long tiny text | Base | Overall | Source |
|:--------------------------|:--------:|:--------------:|:--------:|:---------:|:-------------------:|:------------:|:--------------:|:----:|:--------------:|:------:|
| Datalab Chandra v0.1.0 | 82.2 | **80.3** | **88.0** | **50.4** | 90.8 | 81.2 | **92.3** | **99.9** | **83.1 ± 0.9** | Own benchmarks |
| Datalab Marker v1.10.0 | **83.8** | 69.7 | 74.8 | 32.3 | 86.6 | 79.4 | 85.7 | 99.6 | 76.5 ± 1.0 | Own benchmarks |
| Mistral OCR API | 77.2 | 67.5 | 60.6 | 29.3 | 93.6 | 71.3 | 77.1 | 99.4 | 72.0 ± 1.1 | olmocr repo |
| Deepseek OCR | 75.2 | 72.3 | 79.7 | 33.3 | 96.1 | 66.7 | 80.1 | 99.7 | 75.4 ± 1.0 | Own benchmarks |
| GPT-4o (Anchored) | 53.5 | 74.5 | 70.0 | 40.7 | 93.8 | 69.3 | 60.6 | 96.8 | 69.9 ± 1.1 | olmocr repo |
| Gemini Flash 2 (Anchored) | 54.5 | 56.1 | 72.1 | 34.2 | 64.7 | 61.5 | 71.5 | 95.6 | 63.8 ± 1.2 | olmocr repo |
| Qwen 3 VL 8B | 70.2 | 75.1 | 45.6 | 37.5 | 89.1 | 62.1 | 43.0 | 94.3 | 64.6 ± 1.1 | Own benchmarks |
| olmOCR v0.3.0 | 78.6 | 79.9 | 72.9 | 43.9 | **95.1** | 77.3 | 81.2 | 98.9 | 78.5 ± 1.1 | olmocr repo |
| dots.ocr | 82.1 | 64.2 | 88.3 | 40.9 | 94.1 | **82.4** | 81.2 | 99.5 | 79.1 ± 1.0 | dots.ocr repo |

## Credits

Thank you to the following open source projects:

- [Huggingface Transformers](https://github.com/huggingface/transformers)
- [vLLM](https://github.com/vllm-project/vllm)
- [olmocr](https://github.com/allenai/olmocr)
- [Qwen3 VL](https://github.com/QwenLM/Qwen3)