Assamese Book Digitization — Complete Workflow

From a fragile printed page to a press-ready PDF: the complete technical workflow for digitizing and reprinting Assamese books using DRISTI OCR and Rupantarak.

What Book Digitization Actually Requires

Digitizing an Assamese book is not a single-step operation. It is a pipeline with five distinct phases (scanning, OCR, proofreading, encoding conversion, and layout), each with its own technical requirements and quality gates. Understanding each phase prevents the most common failure mode: a high-quality scan that still yields a poor final document because an intermediate step failed.

The two core tools for this workflow are DRISTI OCR (for text extraction from scanned pages) and Rupantarak (for encoding conversion between Unicode and Geetanjali when the print pipeline requires it). Together they handle what was previously the most labor-intensive part of the process: converting scanned images to correct, editable Assamese text.

Phase 1: Scanning Setup

The scanner is the input device for everything that follows. Errors introduced at scanning cannot be corrected in OCR — they are baked into the pixel data. Choose the correct DPI for your source material:

  • 300 DPI: Clean post-1990 books printed in digital fonts (Geetanjali, Unicode)
  • 400 DPI: Newsprint, pre-1990 books, moderate fading or yellowing
  • 600 DPI: Letterpress printing (pre-1975), Sanchipat manuscripts, heavily damaged material
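The DPI guidance above can be sketched as a small decision function. This is a minimal illustration only: the Source fields and the recommended_dpi name are hypothetical, and the thresholds simply restate the list.

```python
from dataclasses import dataclass


@dataclass
class Source:
    """Hypothetical description of the material being scanned."""
    year: int                    # year of printing
    letterpress: bool = False    # printed from metal type
    manuscript: bool = False     # e.g. a Sanchipat manuscript
    newsprint: bool = False
    faded: bool = False          # moderate fading or yellowing
    heavy_damage: bool = False


def recommended_dpi(src: Source) -> int:
    """Return the scan resolution suggested by the list above."""
    # 600 DPI: pre-1975 letterpress, manuscripts, heavily damaged material
    if src.manuscript or src.heavy_damage or (src.letterpress and src.year < 1975):
        return 600
    # 400 DPI: newsprint, pre-1990 books, moderate fading or yellowing
    if src.newsprint or src.faded or src.year < 1990:
        return 400
    # 300 DPI: clean post-1990 books set in digital fonts
    return 300


print(recommended_dpi(Source(year=1968, letterpress=True)))
```

When a book falls between categories, choosing the higher resolution is the safer default, since detail lost at scan time cannot be recovered later.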

Scan in grayscale TIFF format. Do not scan to JPEG for archive work — JPEG compression degrades the fine detail of Assamese vowel marks and hasanta subscripts at the pixel level. Do not scan directly to binary (1-bit black and white) — preserve grayscale so that adaptive binarization can be applied in the OCR step.

Handling Bound Books

For sewn-binding books, opening the book flat on a flatbed scanner is generally safe. For glued or perfect-bound spines, pressing the book fully flat risks spine cracking and page loss. Use a V-shaped book cradle or book scanning stand to allow partial opening, and apply DRISTI's automatic deskew to correct the resulting page tilt before recognition.

Phase 2: OCR with DRISTI

Load the scanned grayscale TIFF images into DRISTI. For book-length projects, use batch mode: point DRISTI at the folder containing all page images and configure the output destination. DRISTI processes approximately 250 pages in under 10 minutes.
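Before starting a batch run, it can help to confirm that the page folder is complete. The sketch below is independent of DRISTI and assumes a hypothetical naming scheme of page_0001.tif, page_0002.tif, and so on. It reports stray files (for example a JPEG that slipped in), duplicate page numbers, and gaps in the sequence.

```python
import re
from pathlib import Path

# Assumed naming scheme: page_0001.tif, page_0002.tiff, ...
PAGE_PATTERN = re.compile(r"page_(\d+)\.tiff?", re.IGNORECASE)


def preflight(folder: Path) -> tuple[list[Path], list[str]]:
    """Return page files in page order plus a list of human-readable problems."""
    pages: dict[int, Path] = {}
    problems: list[str] = []
    for path in sorted(folder.iterdir()):
        if not path.is_file():
            continue
        match = PAGE_PATTERN.fullmatch(path.name)
        if match is None:
            problems.append(f"unexpected file (not a page TIFF): {path.name}")
            continue
        number = int(match.group(1))
        if number in pages:
            problems.append(f"duplicate page number {number}: {path.name}")
            continue
        pages[number] = path
    if pages:
        # Any number between the first and last page with no file is a gap.
        missing = [n for n in range(min(pages), max(pages) + 1) if n not in pages]
        if missing:
            problems.append(f"missing page numbers: {missing}")
    return [pages[n] for n in sorted(pages)], problems
```

A typical use is to call preflight(Path("scans/my_book")) and start the OCR batch only when the returned problem list is empty.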

Key settings to configure for book digitization:

  • Language: Assamese — ensures U+09F0 (ৰ) and U+09F1 (ৱ) are recognized as distinct Assamese characters, not Bangla equivalents
  • Document model: Historical if material is pre-1990 letterpress; Standard for digital-font books
  • Output format: Unicode plain text (one file per page, or consolidated) plus searchable PDF
  • Preprocessing: Enable automatic deskew and adaptive binarization for all but the cleanest material

Phase 3: Proofreading the OCR Output

Even at 98% character accuracy, a 300-page book with 2,000 characters per page (600,000 characters in total) contains approximately 12,000 character-level errors. Proofreading is mandatory for reprint-quality output.
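The error estimate is simple arithmetic: total characters multiplied by the error rate. A short sketch (the function name is illustrative) makes it easy to rerun for a different book length or accuracy level:

```python
def expected_errors(pages: int, chars_per_page: int, accuracy: float) -> int:
    """Expected number of misrecognized characters at a given character accuracy."""
    total_chars = pages * chars_per_page
    return round(total_chars * (1.0 - accuracy))


for accuracy in (0.98, 0.99, 0.995):
    print(f"{accuracy:.1%}: {expected_errors(300, 2000, accuracy):,} errors")
```

For the 300-page example, 98% accuracy works out to about 40 errors per page, which is why proofreading time should be budgeted per page rather than per book.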

Priority error patterns to check in Assamese OCR output:

  • Hasanta (্) drops: When the virama subscript is not recognized, conjuncts in the output appear as separate consonants — the most semantically significant error type
  • Confusable matras: ি (short i, U+09BF) vs ী (long i, U+09C0); ু (short u, U+09C1) vs ূ (long u, U+09C2); ে (e, U+09C7) vs ৈ (ai, U+09C8). These pairs appear nearly identical at degraded scan quality
  • ৰ vs র distinction: Assamese ৰ (U+09F0) vs Bangla র (U+09B0) — critical for Assamese authenticity
  • Conjunct classification at chapter headings: Display fonts used for chapter headings often produce lower accuracy than body text — check heading text carefully
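Of the patterns above, the ৰ vs র check is the easiest to automate, because any U+09B0 in Assamese text is suspect. The following sketch scans proofread text and reports each occurrence for manual review. The function name is illustrative, and the other patterns (dropped hasanta, confusable matras) generally still need a human reader or a dictionary.

```python
BANGLA_RA = "\u09B0"    # র, normally expected to be ৰ (U+09F0) in Assamese text


def find_bangla_ra(text: str) -> list[tuple[int, int, str]]:
    """Return (line number, column, line text) for every U+09B0 in the text.

    Columns count Unicode code points, not visual glyph clusters.
    """
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch == BANGLA_RA:
                hits.append((line_no, col, line))
    return hits


sample = "\u09F0\u09BE\u09AE\n\u0995\u09B0"   # line 1 uses ৰ, line 2 contains র
for line_no, col, line in find_bangla_ra(sample):
    print(f"line {line_no}, column {col}: {line}")
```

Reviewing the hits by hand rather than replacing blindly is the safer choice, since a book may legitimately quote Bangla text.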

Use a Unicode-aware text editor (Visual Studio Code, Notepad++) with an Assamese keyboard for corrections. Do not edit in a DTP application at this stage.

Phase 4: Encoding Conversion (If Required)

If your print pipeline is PageMaker 6.5 with Geetanjali, you need to convert the proofread Unicode text to Geetanjali encoding before import. Use Rupantarak for this conversion — it processes book-length text in seconds and handles all Juktakkhor conjunct mappings correctly.

If your print pipeline is InDesign with a Unicode font (Noto Serif Bengali or similar), skip this step — InDesign works directly with Unicode.

Always retain the Unicode master file regardless of which DTP pipeline you use. The Unicode file is your archival record. The Geetanjali conversion is a derived format for a specific output target.

Phase 5: Layout and Press Export

Import the final text (Unicode for InDesign, Geetanjali for PageMaker) into your DTP application. Apply paragraph styles consistently: body text, headings, subheadings, running headers. Verify that line spacing (leading) is sufficient for Assamese script — the ascenders and descenders of Assamese vowel marks require more vertical clearance than Latin text at equivalent point sizes.

Export as PDF with all fonts embedded. For press output, use PDF/X-4. For digital archival, use PDF/A. Verify the exported PDF by searching for a known Assamese word to confirm text layer correctness.

For the complete DTP software landscape for Assamese publishing, see the Assamese DTP software guide. For newspaper archive digitization projects rather than book projects, see the Assamese newspaper OCR guide. For understanding why scan quality matters so much for Assamese OCR specifically, read the OCR image preprocessing guide and the technical breakdown of Assamese OCR challenges. For typing new content rather than digitizing old content, see the Jahnabi Pro Keyboard.

Frequently Asked Questions

What is the complete workflow to digitize an Assamese book for reprinting?

The full workflow is: (1) scan pages at 300–600 DPI in grayscale depending on print era and condition, (2) preprocess for skew correction and adaptive binarization, (3) run DRISTI OCR to extract Unicode text, (4) proofread the OCR output for character-level errors, focusing on matras, hasanta, and conjuncts, (5) if reprinting via a legacy PageMaker pipeline, convert Unicode to Geetanjali using Rupantarak, (6) import into InDesign or PageMaker for layout, (7) export a press-ready PDF with fonts embedded.

How accurate is Assamese OCR for old printed books?

For clean post-1990 books scanned at 300 DPI, DRISTI achieves 97–99% character accuracy. For pre-1990 letterpress printing scanned at 400–600 DPI, accuracy typically ranges from 92–97%. For heavily faded or damaged material, accuracy depends strongly on scan quality — a 600 DPI scan of faded material can achieve 90%+ where a 300 DPI scan of the same page may only reach 75%. Post-OCR proofreading is essential for all print-quality reprinting work.

Can Assamese OCR handle Sanchipat (Sanchi bark) manuscripts?

DRISTI includes a historical document model trained on pre-digital Assamese material. Sanchipat manuscripts require special scanning preparation: scan in color at 600 DPI, convert to grayscale using a channel mix that enhances contrast between the dark resin ink and the golden bark surface, then apply CLAHE contrast enhancement before binarization. Recognition accuracy on Sanchipat is lower than on printed books, so plan for more intensive proofreading on manuscript material.
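The channel-mix step can be illustrated in a few lines. A golden writing surface is bright in the red channel and dim in the blue channel, while dark ink is dim in all three, so weighting red heavily tends to separate ink from background better than a standard luminance conversion. The weights and pixel values below are illustrative assumptions, not settings taken from DRISTI, and the CLAHE step is not shown because it needs an image-processing library.

```python
def channel_mix(pixels, weights=(0.7, 0.3, 0.0)):
    """Convert rows of (R, G, B) tuples to grayscale using custom channel weights."""
    wr, wg, wb = weights
    return [
        [min(255, max(0, round(wr * r + wg * g + wb * b))) for (r, g, b) in row]
        for row in pixels
    ]


# One golden background pixel and one dark ink pixel (made-up sample values).
row = [[(200, 160, 80), (60, 40, 30)]]

red_weighted = channel_mix(row)[0]
standard_luma = channel_mix(row, weights=(0.299, 0.587, 0.114))[0]

print("red-weighted contrast:", red_weighted[0] - red_weighted[1])
print("standard luma contrast:", standard_luma[0] - standard_luma[1])
```

In practice the weights are best tuned per manuscript by inspecting a few sample pages, since surface color and ink density vary with age and storage conditions.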

Is it better to digitize books in Unicode or Geetanjali?

Always digitize and store in Unicode. Unicode is the portable, font-independent, web-compatible format. Store your master digital text file in Unicode regardless of what DTP pipeline you use for print. If your print pipeline requires Geetanjali (for PageMaker), use Rupantarak to convert Unicode to Geetanjali as the final step before press layout. Never use Geetanjali as your master storage format — it is encoding-locked and will break on web, in modern DTP, and in any future migration.
