30 Commits
v0.1.2 ... dev

Author SHA1 Message Date
Vik Paruchuri  c049e7524f  Shift import  2025-11-19 12:18:57 -05:00
Vik Paruchuri  b96eb84094  Hide imports  2025-11-19 12:06:07 -05:00
Vik Paruchuri  0f5f3d485c  Enable piping through params  2025-11-12 18:06:05 -05:00
Vik Paruchuri  1bab4bf73a  fix issue with pop  2025-11-12 17:33:50 -05:00
Vik Paruchuri  34f825351c  Enable passing bbox scale  2025-11-12 17:17:15 -05:00
Vik Paruchuri  068db0311e  Add a small sleep  2025-11-12 16:06:02 -05:00
Vik Paruchuri  aafbb70ce8  Merge pull request #36 from datalab-to/vik/bbox (Fix retry settings)  2025-11-12 16:03:10 -05:00
Vik Paruchuri  22639087e7  Fix retry settings  2025-11-12 16:01:43 -05:00
Vik Paruchuri  910bcf100f  Merge pull request #31 from datalab-to/vik/bbox (Vik/bbox)  2025-11-10 11:36:50 -05:00
Vik Paruchuri  3958707a80  Support multiple formats  2025-11-10 11:12:00 -05:00
Vik Paruchuri  fe28f26fc2  Adjust bbox format  2025-11-07 13:18:38 -05:00
Vik Paruchuri  4470243560  Merge remote-tracking branch 'origin/dev' into dev  2025-11-05 13:46:20 -05:00
Vik Paruchuri  a3889b12fb  bbox scale  2025-11-05 13:45:59 -05:00
Vik Paruchuri  d69d18d6e8  Merge pull request #24 from datalab-to/tokens (fix: respect max output tokens)  2025-11-04 13:19:06 -05:00
Zach Nussbaum  d1cde9b608  fix: respect max output tokens  2025-11-04 13:16:57 -05:00
Vik Paruchuri  aabfed2ed3  Fix max repeats  2025-11-03 17:11:51 -05:00
Vik Paruchuri  4b01146865  Support different bbox format  2025-10-30 20:06:45 -04:00
Vik Paruchuri  7cf96f3911  Enable passing custom headers  2025-10-30 10:21:11 -04:00
Vik Paruchuri  607205211a  Improve robustness  2025-10-29 18:16:40 -04:00
Vik Paruchuri  358358134e  Fix lanczos  2025-10-26 10:38:04 -04:00
Vik Paruchuri  2d2d7ab331  Change image rendering  2025-10-26 10:27:49 -04:00
Vik Paruchuri  528b58c16f  Track errors properly  2025-10-23 16:55:16 -04:00
Vik Paruchuri  5acfd8dc6a  Patch image behavior  2025-10-23 12:19:41 -04:00
Vik Paruchuri  17d49eec2e  Flatten in annotation  2025-10-22 09:16:12 -04:00
Vik Paruchuri  0fde883a52  Add model license  2025-10-21 13:35:24 -04:00
Vik Paruchuri  47bd444f20  Code cleanup  2025-10-21 12:11:37 -04:00
Vik Paruchuri  2151833414  Fix file output dir  2025-10-21 11:54:05 -04:00
Vik Paruchuri  8c1bfe277f  Set proper batch sizes  2025-10-21 11:43:09 -04:00
Vik Paruchuri  ad6508fbc3  Fix vllm token  2025-10-21 11:33:56 -04:00
Vik Paruchuri  2e455aeb2c  Fix attn impl  2025-10-21 11:15:29 -04:00
17 changed files with 401 additions and 189 deletions

.gitignore vendored

@@ -1,6 +1,7 @@
local.env
experiments
.claude
.DS_Store
# Byte-compiled / optimized / DLL files
__pycache__/

MODEL_LICENSE Normal file

@@ -0,0 +1,59 @@
AI PUBS OPEN RAIL-M LICENSE (MODIFIED)
Version 0.1, March 2, 2023 (Modified)
http://licenses.ai/
PLEASE READ THESE TERMS CAREFULLY BEFORE USING THE MODEL OR A DERIVATIVE WORK OF THE MODEL MADE AVAILABLE IN CONNECTION WITH THESE TERMS. BY DOWNLOADING, REPRODUCING, DISTRIBUTING OR USING THE MODEL OR A DERIVATIVE WORK OF THE MODEL IN ANY MANNER, YOU (“YOU”) AGREE TO BE BOUND BY THESE TERMS (THE “AGREEMENT”) TO THE EXCLUSION OF ALL OTHER TERMS. YOU REPRESENT AND WARRANT THAT YOU HAVE THE AUTHORITY TO ENTER INTO THIS AGREEMENT; IF YOU ARE ENTERING INTO THIS AGREEMENT ON BEHALF OF AN ORGANIZATION OR ENTITY, REFERENCES TO “YOU” IN THIS AGREEMENT REFER TO THAT ORGANIZATION OR ENTITY. IF YOU DO NOT AGREE TO ALL OF THE FOLLOWING, YOU MAY NOT DOWNLOAD, REPRODUCE, DISTRIBUTE OR USE THE MODEL OR A DERIVATIVE WORK OF THE MODEL IN ANY MANNER.
Section I: PREAMBLE
This OpenRAIL-M License, as modified, is generally applicable to any machine-learning Model.
The “Open” nomenclature indicates that the licensed Model is freely accessible to downstream and other users. The “RAIL” nomenclature indicates that there are use restrictions prohibiting certain uses of the Model. These restrictions are intended to avoid potential misuse. This License specifies that the use restrictions in the original License must apply to such derivatives.
NOW THEREFORE, You and Licensor agree as follows:
1. Definitions
(a) “Complementary Material” means the applicable source code and scripts used to define, run, load, benchmark or evaluate the Model, and used to prepare data for training or evaluation, if any. This includes any accompanying documentation, tutorials, examples, and any related information, if any. Complementary Material is not licensed under this License.
(b) "Contribution" means any work, including the original version of the Model and any modifications or additions to that Model or Derivatives of the Model thereof, that is intentionally submitted to Licensor for inclusion in the Model by the rights owner or by an individual or legal entity authorized to submit on behalf of the rights owner. For the purposes of this definition, “submitted” means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Model, but excluding communication that is conspicuously marked or otherwise designated in writing by the rights owner as "Not a Contribution."
(c) "Contributor" means Licensor and any individual or legal entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Model.
(d) “Data” means a collection of information and/or content extracted from the dataset used with the Model, including to train, pretrain, or otherwise evaluate the Model. The Data is not licensed under this License.
(e) “Derivatives of the Model” means all modifications to the Model, works based on the Model, or any other model which is created or initialized by transfer of patterns of the weights, parameters, activations or output of the Model, to the other model, in order to cause the other model to perform similarly to the Model, including - but not limited to - distillation methods entailing the use of intermediate data representations or methods based on the generation of synthetic data by the Model for training the other model.
(f) “Distribution” means any transmission, reproduction, publication, distribution, or other sharing of the Model or Derivatives of the Model to a third party, including providing the Model as a hosted service made available by electronic or other remote means, including but not limited to API-based or web access.
(g) “Harm” includes but is not limited to physical, mental, psychological, financial and reputational damage, pain, or loss.
(h) "License" means the terms and conditions for use, reproduction, and Distribution as defined in this document.
(i) “Licensor” means the rights owner or entity authorized by the rights owner that is granting the License, including the persons or entities that may have rights in the Model and/or distributing the Model.
(j) “Model” means any accompanying machine-learning based assemblies (including checkpoints), consisting of learnt weights, parameters (including optimizer states), corresponding to the model architecture as embodied in the Complementary Material, that have been trained or tuned, in whole or in part on the Data, using the Complementary Material.
(k) “Output” means the results of operating a Model as embodied in informational content resulting therefrom.
(l) “Third Parties” means individuals or legal entities that are not under common control with Licensor or You.
(m) "You" (or "Your") means an individual or legal entity exercising permissions granted by this License and/or making use of the Model for whichever purpose and in any field of use, including usage of the Model in an end-use application, including but not limited to a chatbot, translator, or image generator.
Section II: INTELLECTUAL PROPERTY RIGHTS
Both copyright and patent grants may apply to the Model and Derivatives of the Model. The Model and Derivatives of the Model are subject to additional terms as described in Section III, which shall govern the use of the Model and Derivatives of the Model even in the event Section II is held unenforceable.
2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare, publicly display, publicly perform, sublicense, and distribute the Model and Derivatives of the Model.
3. Grant of Patent License. Subject to the terms and conditions of this License and where and as applicable, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this paragraph) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Model and/or Derivatives of the Model where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Model or Derivatives of the Model to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Model or Derivative of the Model and/or a Contribution incorporated within the Model or Derivative of the Model constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for the Model and/or Derivative of the Model shall terminate as of the date such litigation is asserted or filed.
Section III: CONDITIONS OF USAGE, DISTRIBUTION AND REDISTRIBUTION
4. Distribution and Redistribution. You may host the Model or Derivatives of the Model for remote access by Third Parties, including but not limited to software-as-a-service, reproduce, or Distribute copies of the Model or Derivatives of the Model thereof in any medium, with or without modifications, provided that You meet the conditions in this Section III:
(a) Use-based restrictions in paragraph 5 MUST be included as an enforceable provision by You in any type of legal agreement (for example, a license) governing the use and/or distribution of the Model or Derivatives of the Model, and You shall give notice to subsequent users You Distribute to, that the Model and Derivatives of the Model are subject to paragraph 5;
(b) You must give any Third Party recipients of the Model or Derivatives of the Model a copy of this License;
(c) You must cause any modified files to carry prominent notices stating that You changed the files; and
(d) You must retain all copyright, patent, trademark, and attribution notices excluding those notices that do not pertain to any part of the Model or Derivatives of the Model.
You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions, consistent with paragraph 4.a., for use, reproduction, or Distribution of Your modifications, or for any such Derivatives of the Model as a whole, provided Your use, reproduction, and Distribution of the Model otherwise complies with the conditions stated in this License.
5. Use-based restrictions. The restrictions set forth in Attachment A are considered Use-based restrictions. Accordingly, You cannot use the Model or the Derivatives of the Model in violation of such restrictions. You may use the Model subject to this License, including only for lawful purposes and in accordance with the License. Use may include creating any content with, fine-tuning, updating, running, training, evaluating and/or re-parametrizing the Model. You shall require all of Your users who use the Model or a Derivative of the Model to comply with the terms of this paragraph 5.
6. The Output You Generate. Except as set forth herein, Licensor claims no rights in the Output You generate using the Model. You are solely responsible for the Output you generate and its subsequent uses. No use of the Output can contravene any provision as stated in the License.
7. Attribution. In connection with any Output, or use or Distribution of any Model or Derivatives of the Model, You agree to give appropriate credit and attribution to Licensor, provide a link to the original Model or Derivatives of the Model, provide a copy of this License, and identify any changes You have made to the Model or Derivatives of the Model (collectively, the “Attribution”). The Attribution must not suggest endorsement by any Licensor.
8. Share-a-Like. As a condition to the license and authorizations herein, You agree to apply this License (to the exclusion of all others) to any and all copies of the Model, Derivatives of the Model, any changes or improvements to the Model or Derivatives of the Model, and to the Output and any derivatives, changes or improvements to or of the Output.
Section IV: OTHER PROVISIONS
9. Updates and Runtime Restrictions. To the maximum extent permitted by law, Licensor reserves the right to restrict (remotely or otherwise) usage of the Model in violation of this License, update the Model through electronic means, or cause modification to the Output resulting from updates to the Model.
10. Trademarks and related. Nothing in this License permits You to make use of Licensor's trademarks, trade names, logos or to otherwise suggest endorsement or misrepresent the relationship between the parties; and any rights not expressly granted herein are reserved by the Licensors.
11. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Model (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Model and Derivatives of the Model, and assume any risks associated with Your exercise of permissions under this License.
12. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Model (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages.
13. Accepting Warranty or Additional Liability. While Distributing the Model or Derivatives of the Model, You may choose to charge a fee in exchange for support, warranty, indemnity, or other obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor or Licensor, and only if You agree to indemnify, defend, and hold each Contributor and the Licensor harmless for any liability incurred by, or claims asserted against, such Contributor or Licensor by reason of your accepting any such warranty or additional liability.
14. If any provision of this License is held to be invalid, illegal or unenforceable, the remaining provisions shall be unaffected thereby and remain valid as if such provision had not been set forth herein.
END OF TERMS AND CONDITIONS
Attachment A
USE RESTRICTIONS
As conditions to the Licenses set forth in this Agreement, You agree not to use, reproduce, modify, create or Distribute the Model, Derivatives of the Model, or Output (collectively, “Use”) in any of the following ways:
1. Legal:
(a) In any way that violates any applicable national, federal, state, local or international law or regulation; or
(b) to directly or indirectly infringe or misappropriate any third party intellectual property rights (including those of Licensor or any Contributor)
2. Commercial:
(a) for any purpose if You (your employer, or the entity you are affiliated with) generated more than two million US Dollars ($2,000,000) in gross revenue in the prior year, except where Your Use is limited to personal use or research purposes;
(b) for any purpose if You (your employer, or the entity you are affiliated with) have raised more than two million US dollars ($2,000,000) in total equity or debt funding from any source, except where Your Use is limited to personal use or research purposes; or
(c) for any purpose if You (your employer, or the entity you are affiliated with) provide or otherwise make available any product or service that competes with any product or service offered by or made available by Licensor or any of its affiliates.
Commercial and broader use licenses may be available from Licensor at the following URL: https://www.datalab.to/


@@ -1,6 +1,6 @@
# Chandra
Chandra is an OCR model that converts images and PDFs into structured HTML/Markdown/JSON while preserving layout information.
Chandra is a highly accurate OCR model that converts images and PDFs into structured HTML/Markdown/JSON while preserving layout information.
## Features
@@ -65,6 +65,10 @@ See full scores [below](#benchmark-table).
| Other | Transcript | [View](https://github.com/datalab-to/chandra/blob/master/assets/examples/other/transcript.png) |
| Other | Flowchart | [View](https://github.com/datalab-to/chandra/blob/master/assets/examples/other/flowchart.png) |
## Community
[Discord](https://discord.gg//KuZwXNGnfH) is where we discuss future development.
## Installation
### Package
@@ -73,6 +77,8 @@ See full scores [below](#benchmark-table).
pip install chandra-ocr
```
If you're going to use the huggingface method, we also recommend installing [flash attention](https://github.com/Dao-AILab/flash-attention).
### From Source
```bash
@@ -152,24 +158,25 @@ VLLM_MODEL_NAME=chandra
VLLM_GPUS=0
```
## Benchmark table
| **Model** | ArXiv | Old Scans Math | Tables | Old Scans | Headers and Footers | Multi column | Long tiny text | Base | Overall | Source |
|:----------|:--------:|:--------------:|:--------:|:---------:|:-------------------:|:------------:|:--------------:|:----:|:--------------:|:------:|
| Datalab Chandra v0.1.0 | 82.2 | **80.3** | **88.0** | **50.4** | 90.8 | 81.2 | **92.3** | **99.9** | **83.1 ± 0.9** | Own benchmarks |
| Datalab Marker v1.10.0 | **83.8** | 69.7 | 74.8 | 32.3 | 86.6 | 79.4 | 85.7 | 99.6 | 76.5 ± 1.0 | Own benchmarks |
| Mistral OCR API | 77.2 | 67.5 | 60.6 | 29.3 | 93.6 | 71.3 | 77.1 | 99.4 | 72.0 ± 1.1 | olmocr repo |
| Deepseek OCR | 75.2 | 72.3 | 79.7 | 33.3 | 96.1 | 66.7 | 80.1 | 99.7 | 75.4 ± 1.0 | Own benchmarks |
| GPT-4o (Anchored) | 53.5 | 74.5 | 70.0 | 40.7 | 93.8 | 69.3 | 60.6 | 96.8 | 69.9 ± 1.1 | olmocr repo |
| Gemini Flash 2 (Anchored) | 54.5 | 56.1 | 72.1 | 34.2 | 64.7 | 61.5 | 71.5 | 95.6 | 63.8 ± 1.2 | olmocr repo |
| Qwen 3 VL | 70.2 | 75.1 | 45.6 | 37.5 | 89.1 | 62.1 | 43.0 | 94.3 | 64.6 ± 1.1 | Own benchmarks |
| olmOCR v0.3.0 | 78.6 | 79.9 | 72.9 | 43.9 | **95.1** | 77.3 | 81.2 | 98.9 | 78.5 ± 1.1 | olmocr repo |
| dots.ocr | 82.1 | 64.2 | 88.3 | 40.9 | 94.1 | **82.4** | 81.2 | 99.5 | 79.1 ± 1.0 | dots.ocr repo |
# Commercial usage
This code is Apache 2.0, and our model weights use a modified OpenRAIL-M license (free for research, personal use, and startups under $2M funding/revenue, cannot be used competitively with our API). To remove the OpenRAIL license requirements, or for broader commercial licensing, visit our pricing page [here](https://www.datalab.to/pricing?utm_source=gh-chandra).
# Benchmark table
| **Model** | ArXiv | Old Scans Math | Tables | Old Scans | Headers and Footers | Multi column | Long tiny text | Base | Overall | Source |
|:--------------------------|:--------:|:--------------:|:--------:|:---------:|:-------------------:|:------------:|:--------------:|:----:|:--------------:|:------:|
| Datalab Chandra v0.1.0 | 82.2 | **80.3** | **88.0** | **50.4** | 90.8 | 81.2 | **92.3** | **99.9** | **83.1 ± 0.9** | Own benchmarks |
| Datalab Marker v1.10.0 | **83.8** | 69.7 | 74.8 | 32.3 | 86.6 | 79.4 | 85.7 | 99.6 | 76.5 ± 1.0 | Own benchmarks |
| Mistral OCR API | 77.2 | 67.5 | 60.6 | 29.3 | 93.6 | 71.3 | 77.1 | 99.4 | 72.0 ± 1.1 | olmocr repo |
| Deepseek OCR | 75.2 | 72.3 | 79.7 | 33.3 | 96.1 | 66.7 | 80.1 | 99.7 | 75.4 ± 1.0 | Own benchmarks |
| GPT-4o (Anchored) | 53.5 | 74.5 | 70.0 | 40.7 | 93.8 | 69.3 | 60.6 | 96.8 | 69.9 ± 1.1 | olmocr repo |
| Gemini Flash 2 (Anchored) | 54.5 | 56.1 | 72.1 | 34.2 | 64.7 | 61.5 | 71.5 | 95.6 | 63.8 ± 1.2 | olmocr repo |
| Qwen 3 VL 8B | 70.2 | 75.1 | 45.6 | 37.5 | 89.1 | 62.1 | 43.0 | 94.3 | 64.6 ± 1.1 | Own benchmarks |
| olmOCR v0.3.0 | 78.6 | 79.9 | 72.9 | 43.9 | **95.1** | 77.3 | 81.2 | 98.9 | 78.5 ± 1.1 | olmocr repo |
| dots.ocr | 82.1 | 64.2 | 88.3 | 40.9 | 94.1 | **82.4** | 81.2 | 99.5 | 79.1 ± 1.0 | dots.ocr repo |
# Credits
Thank you to the following open source projects:


@@ -2,20 +2,48 @@ from typing import List
import filetype
from PIL import Image
import pypdfium2 as pdfium
import pypdfium2.raw as pdfium_c
from chandra.settings import settings
def load_pdf_images(filepath: str, page_range: List[int]):
def flatten(page, flag=pdfium_c.FLAT_NORMALDISPLAY):
rc = pdfium_c.FPDFPage_Flatten(page, flag)
if rc == pdfium_c.FLATTEN_FAIL:
print(f"Failed to flatten annotations / form fields on page {page}.")
def load_image(
filepath: str, min_image_dim: int = settings.MIN_IMAGE_DIM
) -> Image.Image:
image = Image.open(filepath).convert("RGB")
if image.width < min_image_dim or image.height < min_image_dim:
scale = min_image_dim / min(image.width, image.height)
new_size = (int(image.width * scale), int(image.height * scale))
image = image.resize(new_size, Image.Resampling.LANCZOS)
return image
def load_pdf_images(
filepath: str,
page_range: List[int],
image_dpi: int = settings.IMAGE_DPI,
min_pdf_image_dim: int = settings.MIN_PDF_IMAGE_DIM,
) -> List[Image.Image]:
doc = pdfium.PdfDocument(filepath)
doc.init_forms()
images = []
for page in range(len(doc)):
if not page_range or page in page_range:
page_obj = doc[page]
min_page_dim = min(page_obj.get_width(), page_obj.get_height())
scale_dpi = (settings.MIN_IMAGE_DIM / min_page_dim) * 72
scale_dpi = max(scale_dpi, settings.IMAGE_DPI)
pil_image = doc[page].render(scale=scale_dpi / 72).to_pil().convert("RGB")
scale_dpi = (min_pdf_image_dim / min_page_dim) * 72
scale_dpi = max(scale_dpi, image_dpi)
page_obj = doc[page]
flatten(page_obj)
page_obj = doc[page]
pil_image = page_obj.render(scale=scale_dpi / 72).to_pil().convert("RGB")
images.append(pil_image)
doc.close()
@@ -44,5 +72,5 @@ def load_file(filepath: str, config: dict):
if input_type and input_type.extension == "pdf":
images = load_pdf_images(filepath, page_range)
else:
images = [Image.open(filepath).convert("RGB")]
return images
images = [load_image(filepath)]
return images
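The DPI logic in the diff above picks a render scale so a PDF page's shorter side comes out at least `min_pdf_image_dim` pixels, while never dropping below the configured DPI. A standalone sketch of that calculation; the default values here are illustrative assumptions, not the project's actual settings:

```python
def compute_render_scale(page_width_pts: float, page_height_pts: float,
                         min_pdf_image_dim: int = 1024,
                         image_dpi: int = 192) -> float:
    """Pick a render scale so the page's shorter side is at least
    min_pdf_image_dim pixels.

    PDF page sizes are in points (72 per inch); rendering at scale s
    produces s * 72 DPI, so the shorter side becomes min_page_dim * s pixels.
    """
    min_page_dim = min(page_width_pts, page_height_pts)
    scale_dpi = (min_pdf_image_dim / min_page_dim) * 72  # DPI that exactly hits the floor
    scale_dpi = max(scale_dpi, image_dpi)                # never render below the base DPI
    return scale_dpi / 72  # pypdfium2's render() takes a scale, not a DPI
```

For a US Letter page (612x792 points) the floor DPI is about 120, so the base DPI wins; for a very small page the floor DPI dominates instead.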


@@ -4,6 +4,7 @@ from chandra.model.hf import load_model, generate_hf
from chandra.model.schema import BatchInputItem, BatchOutputItem
from chandra.model.vllm import generate_vllm
from chandra.output import parse_markdown, parse_html, parse_chunks, extract_images
from chandra.settings import settings
class InferenceManager:
@@ -26,19 +27,29 @@ class InferenceManager:
output_kwargs["include_headers_footers"] = kwargs.pop(
"include_headers_footers"
)
bbox_scale = kwargs.pop("bbox_scale", settings.BBOX_SCALE)
vllm_api_base = kwargs.pop("vllm_api_base", settings.VLLM_API_BASE)
if self.method == "vllm":
results = generate_vllm(
batch, max_output_tokens=max_output_tokens, **kwargs
batch,
max_output_tokens=max_output_tokens,
bbox_scale=bbox_scale,
vllm_api_base=vllm_api_base,
**kwargs,
)
else:
results = generate_hf(
batch, self.model, max_output_tokens=max_output_tokens, **kwargs
batch,
self.model,
max_output_tokens=max_output_tokens,
bbox_scale=bbox_scale,
**kwargs,
)
output = []
for result, input_item in zip(results, batch):
chunks = parse_chunks(result.raw, input_item.image)
chunks = parse_chunks(result.raw, input_item.image, bbox_scale=bbox_scale)
output.append(
BatchOutputItem(
markdown=parse_markdown(result.raw, **output_kwargs),
@@ -48,6 +59,7 @@ class InferenceManager:
page_box=[0, 0, input_item.image.width, input_item.image.height],
token_count=result.token_count,
images=extract_images(result.raw, chunks, input_item.image),
error=result.error,
)
)
return output


@@ -1,8 +1,5 @@
from typing import List
from qwen_vl_utils import process_vision_info
from transformers import Qwen3VLForConditionalGeneration, Qwen3VLProcessor
from chandra.model.schema import BatchInputItem, GenerationResult
from chandra.model.util import scale_to_fit
from chandra.prompts import PROMPT_MAPPING
@@ -10,12 +7,20 @@ from chandra.settings import settings
def generate_hf(
batch: List[BatchInputItem], model, max_output_tokens=None, **kwargs
batch: List[BatchInputItem],
model,
max_output_tokens=None,
bbox_scale: int = settings.BBOX_SCALE,
**kwargs,
) -> List[GenerationResult]:
from qwen_vl_utils import process_vision_info
if max_output_tokens is None:
max_output_tokens = settings.MAX_OUTPUT_TOKENS
messages = [process_batch_element(item, model.processor) for item in batch]
messages = [
process_batch_element(item, model.processor, bbox_scale) for item in batch
]
text = model.processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
@@ -48,12 +53,12 @@ def generate_hf(
return results
def process_batch_element(item: BatchInputItem, processor):
def process_batch_element(item: BatchInputItem, processor, bbox_scale: int):
prompt = item.prompt
prompt_type = item.prompt_type
if not prompt:
prompt = PROMPT_MAPPING[prompt_type]
prompt = PROMPT_MAPPING[prompt_type].replace("{bbox_scale}", str(bbox_scale))
content = []
image = scale_to_fit(item.image) # Guarantee max size
@@ -65,14 +70,22 @@ def process_batch_element(item: BatchInputItem, processor):
def load_model():
import torch
from transformers import Qwen3VLForConditionalGeneration, Qwen3VLProcessor
device_map = "auto"
if settings.TORCH_DEVICE:
device_map = {"": settings.TORCH_DEVICE}
kwargs = {
"dtype": torch.bfloat16,
"device_map": device_map,
}
if settings.TORCH_ATTN:
kwargs["attn_implementation"] = settings.TORCH_ATTN
model = Qwen3VLForConditionalGeneration.from_pretrained(
settings.MODEL_CHECKPOINT,
dtype=settings.TORCH_DTYPE,
device_map=device_map,
attn_implementation=settings.TORCH_ATTN_IMPLEMENTATION,
settings.MODEL_CHECKPOINT, **kwargs
)
model = model.eval()
processor = Qwen3VLProcessor.from_pretrained(settings.MODEL_CHECKPOINT)


@@ -27,3 +27,4 @@ class BatchOutputItem:
page_box: List[int]
token_count: int
images: dict
error: bool


@@ -43,7 +43,11 @@ def scale_to_fit(
def detect_repeat_token(
predicted_tokens: str, max_repeats: int = 4, window_size: int = 500, cut_from_end: int = 0
predicted_tokens: str,
base_max_repeats: int = 4,
window_size: int = 500,
cut_from_end: int = 0,
scaling_factor: float = 3.0,
):
try:
predicted_tokens = parse_markdown(predicted_tokens)
@@ -54,11 +58,13 @@ def detect_repeat_token(
if cut_from_end > 0:
predicted_tokens = predicted_tokens[:-cut_from_end]
# Try different sequence lengths (1 to window_size//2)
for seq_len in range(1, window_size // 2 + 1):
# Extract the potential repeating sequence from the end
candidate_seq = predicted_tokens[-seq_len:]
# Inverse scaling: shorter sequences need more repeats
max_repeats = int(base_max_repeats * (1 + scaling_factor / seq_len))
# Count how many times this sequence appears consecutively at the end
repeat_count = 0
pos = len(predicted_tokens) - seq_len
@@ -72,12 +78,7 @@ def detect_repeat_token(
else:
break
# If we found more than max_repeats consecutive occurrences
if repeat_count > max_repeats:
return True
return False
def layout_failed(predicted_tokens: str, image: Image.Image):
pass
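The repeat detection in the diff above scans candidate trailing sequences and scales the allowed repeat count inversely with sequence length, so a single repeated character needs many more occurrences to trip the check than a long repeated phrase. A self-contained sketch of the same idea (function name hypothetical, without the markdown-parsing preamble):

```python
def ends_with_repeats(text: str, base_max_repeats: int = 4,
                      window_size: int = 500,
                      scaling_factor: float = 3.0) -> bool:
    """Return True if text ends with a sequence repeated suspiciously often."""
    for seq_len in range(1, window_size // 2 + 1):
        if seq_len > len(text):
            break
        candidate = text[-seq_len:]
        # Inverse scaling: the threshold grows as the candidate shrinks
        max_repeats = int(base_max_repeats * (1 + scaling_factor / seq_len))
        # Count consecutive occurrences of the candidate at the end
        count, pos = 0, len(text) - seq_len
        while pos >= 0 and text[pos:pos + seq_len] == candidate:
            count += 1
            pos -= seq_len
        if count > max_repeats:
            return True
    return False
```

A degenerate tail like `"la"` repeated fifty times trips the check, while ordinary prose does not.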


@@ -1,5 +1,6 @@
import base64
import io
import time
from concurrent.futures import ThreadPoolExecutor
from itertools import repeat
from typing import List
@@ -25,10 +26,15 @@ def generate_vllm(
max_output_tokens: int = None,
max_retries: int = None,
max_workers: int | None = None,
custom_headers: dict | None = None,
max_failure_retries: int | None = None,
bbox_scale: int = settings.BBOX_SCALE,
vllm_api_base: str = settings.VLLM_API_BASE,
) -> List[GenerationResult]:
client = OpenAI(
api_key=settings.VLLM_API_KEY,
base_url=settings.VLLM_API_BASE,
base_url=vllm_api_base,
default_headers=custom_headers,
)
model_name = settings.VLLM_MODEL_NAME
@@ -50,7 +56,9 @@ def generate_vllm(
) -> GenerationResult:
prompt = item.prompt
if not prompt:
prompt = PROMPT_MAPPING[item.prompt_type]
prompt = PROMPT_MAPPING[item.prompt_type].replace(
"{bbox_scale}", str(bbox_scale)
)
content = []
image = scale_to_fit(item.image)
@@ -68,41 +76,68 @@ def generate_vllm(
completion = client.chat.completions.create(
model=model_name,
messages=[{"role": "user", "content": content}],
max_tokens=settings.MAX_OUTPUT_TOKENS,
max_tokens=max_output_tokens,
temperature=temperature,
top_p=top_p,
)
raw = completion.choices[0].message.content
result = GenerationResult(
raw=raw,
token_count=completion.usage.completion_tokens,
error=False,
)
except Exception as e:
print(f"Error during VLLM generation: {e}")
return GenerationResult(raw="", token_count=0, error=True)
return GenerationResult(
raw=completion.choices[0].message.content,
token_count=completion.usage.completion_tokens,
error=False,
)
return result
def process_item(item, max_retries):
def process_item(item, max_retries, max_failure_retries=None):
result = _generate(item)
retries = 0
while retries < max_retries and (
detect_repeat_token(result.raw)
or (
len(result.raw) > 50
and detect_repeat_token(result.raw, cut_from_end=50)
)
or result.error
):
print(
f"Detected repeat token or error, retrying generation (attempt {retries + 1})..."
)
while _should_retry(result, retries, max_retries, max_failure_retries):
result = _generate(item, temperature=0.3, top_p=0.95)
retries += 1
return result
def _should_retry(result, retries, max_retries, max_failure_retries):
has_repeat = detect_repeat_token(result.raw) or (
len(result.raw) > 50 and detect_repeat_token(result.raw, cut_from_end=50)
)
if retries < max_retries and has_repeat:
print(
f"Detected repeat token, retrying generation (attempt {retries + 1})..."
)
return True
if retries < max_retries and result.error:
print(
f"Detected vllm error, retrying generation (attempt {retries + 1})..."
)
time.sleep(2 * (retries + 1)) # Sleeping can help under load
return True
if (
result.error
and max_failure_retries is not None
and retries < max_failure_retries
):
print(
f"Detected vllm error, retrying generation (attempt {retries + 1})..."
)
time.sleep(2 * (retries + 1)) # Sleeping can help under load
return True
return False
with ThreadPoolExecutor(max_workers=max_workers) as executor:
results = list(executor.map(process_item, batch, repeat(max_retries)))
results = list(
executor.map(
process_item, batch, repeat(max_retries), repeat(max_failure_retries)
)
)
return results
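The sleep-between-retries pattern in `_should_retry` above (2s, then 4s, then 6s, since a brief pause can help an overloaded server) generalizes to a small helper. This is a sketch, and `call_with_backoff` is a hypothetical name, not part of the codebase:

```python
import time

def call_with_backoff(fn, max_retries: int = 3, base_delay: float = 2.0):
    """Retry a flaky call with linearly increasing sleeps between attempts.

    Re-raises the last exception if every attempt fails.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise
            # Sleeping can help when the server is under load
            time.sleep(base_delay * (attempt + 1))
```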


@@ -6,9 +6,11 @@ from functools import lru_cache
import six
from PIL import Image
from bs4 import BeautifulSoup, NavigableString
from bs4 import BeautifulSoup
from markdownify import MarkdownConverter, re_whitespace
from chandra.settings import settings
@lru_cache
def _hash_html(html: str):
@@ -30,7 +32,11 @@ def extract_images(html: str, chunks: dict, image: Image.Image):
if not img:
continue
bbox = chunk["bbox"]
block_image = image.crop(bbox)
try:
block_image = image.crop(bbox)
except ValueError:
# Happens when bbox coordinates are invalid
continue
img_name = get_image_name(html, div_idx)
images[img_name] = block_image
return images
@@ -67,44 +73,22 @@ def parse_html(
else:
img = BeautifulSoup(f"<img src='{img_src}'/>", "html.parser")
div.append(img)
# Wrap text content in <p> tags if no inner HTML tags exist
if label in ["Text"] and not re.search(
"<.+>", str(div.decode_contents()).strip()
):
# Add inner p tags if missing for text blocks
text_content = str(div.decode_contents()).strip()
text_content = f"<p>{text_content}</p>"
div.clear()
div.append(BeautifulSoup(text_content, "html.parser"))
content = str(div.decode_contents())
out_html += content
return out_html
def escape_dollars(text):
return text.replace("$", r"\$")
def get_formatted_table_text(element):
text = []
for content in element.contents:
if content is None:
continue
if isinstance(content, NavigableString):
stripped = content.strip()
if stripped:
text.append(escape_dollars(stripped))
elif content.name == "br":
text.append("<br>")
elif content.name == "math":
text.append("$" + content.text + "$")
else:
content_str = escape_dollars(str(content))
text.append(content_str)
full_text = ""
for i, t in enumerate(text):
if t == "<br>":
full_text += t
elif i > 0 and text[i - 1] != "<br>":
full_text += " " + t
else:
full_text += t
return full_text
class Markdownify(MarkdownConverter):
def __init__(
self,
@@ -204,19 +188,25 @@ class LayoutBlock:
content: str
def parse_layout(html: str, image: Image.Image):
def parse_layout(html: str, image: Image.Image, bbox_scale=settings.BBOX_SCALE):
soup = BeautifulSoup(html, "html.parser")
top_level_divs = soup.find_all("div", recursive=False)
width, height = image.size
width_scaler = width / 1024
height_scaler = height / 1024
width_scaler = width / bbox_scale
height_scaler = height / bbox_scale
layout_blocks = []
for div in top_level_divs:
bbox = div.get("data-bbox")
try:
bbox = json.loads(bbox)
assert len(bbox) == 4, "Invalid bbox length"
except Exception:
bbox = [0, 0, 1, 1] # Fallback to a default bbox if parsing fails
try:
bbox = bbox.split(" ")
assert len(bbox) == 4, "Invalid bbox length"
except Exception:
bbox = [0, 0, 1, 1]
bbox = list(map(int, bbox))
# Normalize bbox
@@ -232,7 +222,7 @@ def parse_layout(html: str, image: Image.Image):
return layout_blocks
def parse_chunks(html: str, image: Image.Image):
layout = parse_layout(html, image)
def parse_chunks(html: str, image: Image.Image, bbox_scale=settings.BBOX_SCALE):
layout = parse_layout(html, image, bbox_scale=bbox_scale)
chunks = [asdict(block) for block in layout]
return chunks
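The two bbox formats accepted above (a JSON list like `[x0, y0, x1, y1]`, or a space-separated fallback) and the `bbox_scale`-based rescaling can be sketched as standalone helpers. These are simplified assumptions modeled on the diff, not the repo's exact functions:

```python
import json

def parse_bbox(raw):
    # First try JSON ("[0, 0, 512, 1024]"), then a space-separated
    # string ("0 0 512 1024"), then fall back to a default box.
    try:
        bbox = json.loads(raw)
        assert len(bbox) == 4, "Invalid bbox length"
    except Exception:
        try:
            bbox = raw.split(" ")
            assert len(bbox) == 4, "Invalid bbox length"
        except Exception:
            bbox = [0, 0, 1, 1]  # fallback when parsing fails
    return list(map(int, bbox))

def scale_bbox(bbox, width, height, bbox_scale=1024):
    # Model bboxes are normalized to 0..bbox_scale; rescale to pixels.
    wx = width / bbox_scale
    hy = height / bbox_scale
    x0, y0, x1, y1 = bbox
    return [int(x0 * wx), int(y0 * hy), int(x1 * wx), int(y1 * hy)]

print(scale_bbox([0, 0, 512, 1024], 2048, 1024))  # → [0, 0, 1024, 1024]
```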

View File

@@ -65,7 +65,7 @@ Guidelines:
""".strip()
OCR_LAYOUT_PROMPT = f"""
OCR this image to HTML, arranged as layout blocks. Each layout block should be a div with the data-bbox attribute representing the bounding box of the block in [x0, y0, x1, y1] format. Bboxes are normalized 0-1024. The data-label attribute is the label for the block.
OCR this image to HTML, arranged as layout blocks. Each layout block should be a div with the data-bbox attribute representing the bounding box of the block in [x0, y0, x1, y1] format. Bboxes are normalized 0-{{bbox_scale}}. The data-label attribute is the label for the block.
Use the following labels:
- Caption
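The doubled braces in the new prompt line survive the outer f-string, leaving a literal `{bbox_scale}` placeholder that can be filled in later with `str.format`. A minimal sketch of that two-stage substitution:

```python
# Stage 1: the f-string renders {{bbox_scale}} down to {bbox_scale}.
template = f"Bboxes are normalized 0-{{bbox_scale}}."

# Stage 2: the placeholder is filled when the prompt is actually used.
print(template.format(bbox_scale=1024))  # → Bboxes are normalized 0-1024.
```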

View File

@@ -87,7 +87,7 @@ def save_merged_output(
# Save extracted images if requested
if save_images and result.images:
images_dir = file_output_dir / "images"
images_dir = file_output_dir
images_dir.mkdir(exist_ok=True)
for img_name, pil_image in result.images.items():
@@ -172,7 +172,7 @@ def save_merged_output(
@click.option(
"--batch-size",
type=int,
default=1,
default=None,
help="Number of pages to process in a batch.",
)
@click.option(
@@ -194,6 +194,16 @@ def main(
batch_size: int,
paginate_output: bool,
):
if method == "hf":
click.echo(
"When using '--method hf', ensure that the batch size is set correctly. We will default to batch size of 1."
)
if batch_size is None:
batch_size = 1
elif method == "vllm":
if batch_size is None:
batch_size = 28
click.echo("Chandra CLI - Starting OCR processing")
click.echo(f"Input: {input_path}")
click.echo(f"Output: {output_path}")

View File

@@ -143,6 +143,7 @@ def process():
"image_height": img_height,
"blocks": blocks_data,
"html": html_with_images,
"markdown": result.markdown,
}
)

View File

@@ -64,6 +64,20 @@
cursor: not-allowed;
}
.controls label {
display: flex;
align-items: center;
gap: 8px;
color: white;
font-size: 14px;
cursor: pointer;
user-select: none;
}
.controls input[type="checkbox"] {
cursor: pointer;
}
.loading {
display: none;
color: #f39c12;
@@ -75,6 +89,11 @@
font-weight: bold;
}
.success {
color: #27ae60;
font-weight: bold;
}
.screenshot-container {
display: none;
margin-top: 60px;
@@ -88,8 +107,18 @@
display: flex;
}
.left-panel, .right-panel {
flex: 1;
.left-panel {
flex: 0 0 40%;
display: flex;
flex-direction: column;
background: white;
border-radius: 8px;
overflow: hidden;
box-shadow: 0 4px 12px rgba(0,0,0,0.3);
}
.right-panel {
flex: 0 0 60%;
display: flex;
flex-direction: column;
background: white;
@@ -137,6 +166,7 @@
padding: 30px;
line-height: 1.6;
color: #333;
font-size: 24px;
}
.markdown-content h1, .markdown-content h2, .markdown-content h3 {
@@ -215,8 +245,14 @@
<input type="text" id="filePath" placeholder="Enter file path (e.g., /path/to/document.pdf)">
<input type="number" id="pageNumber" placeholder="Page" value="0" min="0">
<button id="processBtn" onclick="processFile()">Process</button>
<label>
<input type="checkbox" id="showLayoutBoxes" checked onchange="toggleLayoutBoxes()">
Show Layout Boxes
</label>
<button id="copyMarkdownBtn" onclick="copyMarkdown()" style="display: none;">Copy Markdown</button>
<span class="loading" id="loading">Processing...</span>
<span class="error" id="error"></span>
<span class="success" id="success"></span>
</div>
<div class="screenshot-container" id="container">
@@ -242,6 +278,11 @@
<script src="https://cdn.jsdelivr.net/npm/marked/marked.min.js"></script>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/github-markdown-css/5.8.1/github-markdown.min.css" integrity="sha512-BrOPA520KmDMqieeM7XFe6a3u3Sb3F1JBaQnrIAmWg3EYrciJ+Qqe6ZcKCdfPv26rGcgTrJnZ/IdQEct8h3Zhw==" crossorigin="anonymous" referrerpolicy="no-referrer" />
<script>
// Global state to store markdown and canvas data
let currentMarkdown = null;
let currentData = null;
let currentImageSrc = null;
async function processFile() {
const filePath = document.getElementById('filePath').value;
const pageNumber = parseInt(document.getElementById('pageNumber').value) || 0;
@@ -285,6 +326,10 @@
}
function renderResults(data) {
// Store data for toggle functionality
currentData = data;
currentImageSrc = data.image_base64;
const canvas = document.getElementById('layoutCanvas');
const ctx = canvas.getContext('2d');
const markdownContent = document.getElementById('markdownContent');
@@ -292,51 +337,14 @@
// Draw image with layout overlays
const img = new Image();
img.onload = function() {
canvas.width = data.image_width;
canvas.height = data.image_height;
// Draw image
ctx.drawImage(img, 0, 0, data.image_width, data.image_height);
// Draw layout blocks
ctx.lineWidth = 3;
ctx.font = 'bold 14px -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif';
const labelCounts = {};
data.blocks.forEach((block) => {
const [x1, y1, x2, y2] = block.bbox;
const width = x2 - x1;
const height = y2 - y1;
// Draw rectangle with semi-transparent fill
ctx.strokeStyle = block.color;
ctx.fillStyle = block.color + '33';
ctx.fillRect(x1, y1, width, height);
ctx.strokeRect(x1, y1, width, height);
// Count labels for unique identification
labelCounts[block.label] = (labelCounts[block.label] || 0) + 1;
const labelWithCount = `${block.label} #${labelCounts[block.label]}`;
// Draw label with background
const textMetrics = ctx.measureText(labelWithCount);
const textWidth = textMetrics.width;
const textHeight = 16;
const padding = 6;
const labelX = x1;
const labelY = Math.max(y1 - textHeight - padding, textHeight);
ctx.fillStyle = block.color;
ctx.fillRect(labelX, labelY - textHeight, textWidth + padding * 2, textHeight + padding);
ctx.fillStyle = 'white';
ctx.textBaseline = 'top';
ctx.fillText(labelWithCount, labelX + padding, labelY - textHeight + padding/2);
});
drawCanvas(img, data, ctx);
};
img.src = data.image_base64;
// Store markdown and show copy button
currentMarkdown = data.markdown;
document.getElementById('copyMarkdownBtn').style.display = 'inline-block';
// Render HTML directly (with images embedded)
markdownContent.innerHTML = data.html;
@@ -362,6 +370,85 @@
});
}
function drawCanvas(img, data, ctx) {
const canvas = document.getElementById('layoutCanvas');
canvas.width = data.image_width;
canvas.height = data.image_height;
// Draw image
ctx.drawImage(img, 0, 0, data.image_width, data.image_height);
// Check if layout boxes should be shown
const showBoxes = document.getElementById('showLayoutBoxes').checked;
if (!showBoxes) return;
// Draw layout blocks
ctx.lineWidth = 3;
ctx.font = 'bold 14px -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif';
const labelCounts = {};
data.blocks.forEach((block) => {
const [x1, y1, x2, y2] = block.bbox;
const width = x2 - x1;
const height = y2 - y1;
// Draw rectangle with semi-transparent fill
ctx.strokeStyle = block.color;
ctx.fillStyle = block.color + '33';
ctx.fillRect(x1, y1, width, height);
ctx.strokeRect(x1, y1, width, height);
// Count labels for unique identification
labelCounts[block.label] = (labelCounts[block.label] || 0) + 1;
const labelWithCount = `${block.label} #${labelCounts[block.label]}`;
// Draw label with background
const textMetrics = ctx.measureText(labelWithCount);
const textWidth = textMetrics.width;
const textHeight = 16;
const padding = 6;
const labelX = x1;
const labelY = Math.max(y1 - textHeight - padding, textHeight);
ctx.fillStyle = block.color;
ctx.fillRect(labelX, labelY - textHeight, textWidth + padding * 2, textHeight + padding);
ctx.fillStyle = 'white';
ctx.textBaseline = 'top';
ctx.fillText(labelWithCount, labelX + padding, labelY - textHeight + padding/2);
});
}
function toggleLayoutBoxes() {
if (!currentData || !currentImageSrc) return;
const canvas = document.getElementById('layoutCanvas');
const ctx = canvas.getContext('2d');
const img = new Image();
img.onload = function() {
drawCanvas(img, currentData, ctx);
};
img.src = currentImageSrc;
}
function copyMarkdown() {
if (!currentMarkdown) {
document.getElementById('error').textContent = 'No markdown to copy';
return;
}
navigator.clipboard.writeText(currentMarkdown).then(() => {
const success = document.getElementById('success');
success.textContent = 'Markdown copied!';
setTimeout(() => {
success.textContent = '';
}, 2000);
}).catch((err) => {
document.getElementById('error').textContent = 'Failed to copy: ' + err.message;
});
}
// Allow Enter key to trigger processing
document.getElementById('filePath').addEventListener('keypress', function(e) {
if (e.key === 'Enter') processFile();

View File

@@ -17,8 +17,6 @@ def main():
"-v",
f"{os.path.expanduser('~')}/.cache/huggingface:/root/.cache/huggingface",
"--env",
f"HUGGING_FACE_HUB_TOKEN={os.getenv('HF_TOKEN')}",
"--env",
"VLLM_ATTENTION_BACKEND=TORCH_SDPA",
"-p",
"8000:8000",

View File

@@ -1,7 +1,5 @@
from dotenv import find_dotenv
from pydantic import computed_field
from pydantic_settings import BaseSettings
import torch
import os
@@ -9,11 +7,13 @@ class Settings(BaseSettings):
# Paths
BASE_DIR: str = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
IMAGE_DPI: int = 192
MIN_IMAGE_DIM: int = 1024
MIN_PDF_IMAGE_DIM: int = 1024
MIN_IMAGE_DIM: int = 1536
MODEL_CHECKPOINT: str = "datalab-to/chandra"
TORCH_DEVICE: str | None = None
MAX_OUTPUT_TOKENS: int = 8192
MAX_OUTPUT_TOKENS: int = 12384
TORCH_ATTN: str | None = None
BBOX_SCALE: int = 1024
# vLLM server settings
VLLM_API_KEY: str = "EMPTY"
@@ -22,37 +22,6 @@ class Settings(BaseSettings):
VLLM_GPUS: str = "0"
MAX_VLLM_RETRIES: int = 6
# Transformers settings
@computed_field
@property
def TORCH_DEVICE_MODEL(self) -> str:
if self.TORCH_DEVICE is not None:
return self.TORCH_DEVICE
if torch.cuda.is_available():
return "cuda"
if torch.backends.mps.is_available():
return "mps"
return "cpu"
@computed_field
@property
def TORCH_DTYPE(self) -> torch.dtype:
return torch.bfloat16
@computed_field
@property
def TORCH_ATTN_IMPLEMENTATION(self) -> str:
if self.TORCH_ATTN is not None:
return self.TORCH_ATTN
if self.TORCH_DEVICE_MODEL == "cuda":
return "flash_attention_2"
else:
return "sdpa"
class Config:
env_file = find_dotenv("local.env")
extra = "ignore"

View File

@@ -1,6 +1,6 @@
[project]
name = "chandra-ocr"
version = "0.1.1"
version = "0.1.9"
description = "OCR model that converts documents to markdown, HTML, or JSON."
readme = "README.md"
requires-python = ">=3.10"