What is the full workflow to digitize and reprint an Assamese book?

The complete workflow is: (1) Scan at 300–600 DPI in grayscale, (2) preprocess images for skew and contrast, (3) run DRISTI OCR to extract Unicode text, (4) proofread and correct OCR output, (5) if targeting legacy DTP, convert Unicode to Geetanjali with Rupantarak, (6) import into PageMaker or InDesign for layout, (7) export press-ready PDF. For digital publication, skip steps 5–6 and export directly to EPUB or HTML.

How long does it take to digitize a 300-page Assamese book?

Scanning a 300-page book on a flatbed scanner takes roughly 2.5–5 hours (at 400 DPI, 30–60 seconds per page). DRISTI OCR processes 300 pages in approximately 10–15 minutes. Proofreading, the most time-intensive phase, takes 8–20 hours depending on scan quality and the proofreader's familiarity with the text. Layout and design in InDesign adds another 10–20 hours for a book-length project.

Does DRISTI output Unicode or Geetanjali text?

DRISTI outputs Unicode Assamese text as its primary format. This is the correct format for editing, digital publication, and modern DTP. If your print pipeline requires Geetanjali (for PageMaker), use Rupantarak to convert the Unicode DRISTI output to Geetanjali encoding as a final step before print layout.

Can old Assamese books printed in letterpress be digitized with OCR?

Yes, but letterpress Assamese printing (pre-1980) requires higher resolution scanning (400–600 DPI) and benefits from DRISTI's historical document model. Letterpress character forms differ from digital font designs in stroke termination and glyph proportions. Accuracy will be lower than for modern printing but significantly better than manual transcription. Post-OCR proofreading is essential for letterpress material.

How to Digitize an Assamese Book: Complete End-to-End Workflow

The Two Goals of Assamese Book Digitization

Before starting any digitization project, define which output you are targeting. These are not the same workflow:

Goal A: Digital archive / searchable PDF — Preserve the text in searchable, accessible form. Output is Unicode text and/or a searchable PDF.

Goal B: Reprint / republication — Produce a new printed edition. Output is a press-ready PDF from InDesign or PageMaker, requiring Geetanjali encoding if using a legacy DTP pipeline.

The workflow below covers both paths. The split occurs after Step 5: Step 6A follows the digital archive path and Step 6B follows the reprint path.

The Complete Workflow

Step 1: Assess the Source Material

Before scanning, assess what you have:

Print era: Post-1990 (digital font), 1975–1990 (phototypesetting), pre-1975 (letterpress)
Paper condition: Clean, yellowed, foxed, water-damaged, brittle
Binding: Sewn binding (can be safely opened flat), glued spine (risk of spine damage when opened flat), stapled

For brittle or valuable originals, consider a book scanning stand or V-shaped cradle rather than pressing the book flat on a flatbed scanner. Some books must be scanned without full opening to avoid spine damage.

Step 2: Scan at the Correct Resolution

Source Material	Recommended DPI	Format
Clean post-1990 printed book	300 DPI	Grayscale TIFF
Newsprint or pre-1990 book	400 DPI	Grayscale TIFF
Letterpress or damaged material	600 DPI	Grayscale TIFF
Sanchipat palm-leaf manuscript	600 DPI	Color TIFF
Phone camera capture	Maximum resolution, controlled lighting	JPEG (suboptimal)

Scan in grayscale to TIFF format. Do not use JPEG for archival scans — JPEG compression introduces artifacts in the fine features of Assamese letterforms.
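
The table above can be kept as a small lookup so that every operator on a project uses the same settings. This is a minimal sketch: the dictionary keys and the function name are hypothetical labels chosen for illustration, and the values are copied from the table.

```python
# Scan settings copied from the table above. The keys are hypothetical
# labels invented for this sketch, not names used by any scanner software.
SCAN_SETTINGS = {
    "clean_post_1990": (300, "grayscale", "TIFF"),
    "newsprint_or_pre_1990": (400, "grayscale", "TIFF"),
    "letterpress_or_damaged": (600, "grayscale", "TIFF"),
    "sanchipat_manuscript": (600, "color", "TIFF"),
}


def scan_settings(material: str) -> dict:
    """Return DPI, color mode, and file format for a source-material label."""
    if material not in SCAN_SETTINGS:
        raise KeyError(
            f"Unknown material {material!r}; expected one of {sorted(SCAN_SETTINGS)}"
        )
    dpi, mode, fmt = SCAN_SETTINGS[material]
    return {"dpi": dpi, "mode": mode, "format": fmt}


print(scan_settings("letterpress_or_damaged"))
```

Phone camera capture is left out of the lookup on purpose, since the table treats it as a fallback rather than a recommended setting.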

Step 3: Preprocess the Images

Before loading into DRISTI OCR, apply preprocessing:

Deskew: Correct page tilt to within ±0.5 degrees of horizontal
Despeckle: Remove noise pixels with minimum connected-component size 4–6 pixels at 300 DPI
Adaptive binarization: Apply Sauvola binarization for faded or uneven pages
Contrast enhancement: For Sanchipat, apply CLAHE before binarization

DRISTI includes automatic preprocessing on import that handles most cases. Manual preprocessing is recommended for letterpress material or severely degraded pages.
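
For manual preprocessing, the despeckle step can be sketched in plain Python as a connected-component filter. This is an illustration of the technique under stated assumptions, not DRISTI's internal method: the page is assumed to be already binarized into a list of rows where 1 is ink and 0 is background, and the function name `despeckle` is hypothetical.

```python
from collections import deque


def despeckle(image, min_size=4):
    """Remove ink components smaller than min_size pixels.

    image is a list of rows; 1 = ink, 0 = background.
    8-connectivity is used so that diagonal strokes stay joined.
    The input is not modified; a cleaned copy is returned.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    cleaned = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Collect one connected component with a breadth-first search.
            component = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Components below the threshold are treated as noise.
            if len(component) < min_size:
                for cy, cx in component:
                    cleaned[cy][cx] = 0
    return cleaned
```

Keep the threshold small. Letters such as য়, ড়, and ঢ় carry a dot that is itself a small component, so an aggressive setting can erase real parts of the text.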

Step 4: Run DRISTI OCR

Load the preprocessed images into DRISTI. For a book-length project:

Point DRISTI to the folder containing page images
Select language: Assamese (includes Assamese-specific characters ৰ, ৱ)
Enable batch processing mode
If material is pre-1990, select the historical document model in settings
Start recognition — DRISTI processes approximately 250 pages in under 10 minutes
Export output as plain text or Word document, with one file per page or one consolidated document
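
If you export one file per page and later want a single consolidated document, the merge can be done outside DRISTI. The sketch below assumes the per-page files are UTF-8 plain text with zero-padded names such as `page_0001.txt`; that naming pattern and the function name are assumptions for illustration, not DRISTI's documented export format.

```python
from pathlib import Path


def consolidate_pages(page_dir, output_file, pattern="page_*.txt"):
    """Join per-page UTF-8 text files into one document, in filename order.

    Filename order equals page order only when the numbers are zero-padded.
    Returns the number of pages merged.
    """
    pages = sorted(Path(page_dir).glob(pattern))
    parts = []
    for number, path in enumerate(pages, start=1):
        text = path.read_text(encoding="utf-8").strip()
        # A visible marker lets proofreaders match text back to the scan.
        parts.append(f"[[page {number}: {path.name}]]\n{text}")
    Path(output_file).write_text("\n\n".join(parts) + "\n", encoding="utf-8")
    return len(pages)
```

The page markers are easy to strip with a search-and-replace once proofreading is finished.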

Step 5: Proofread and Correct OCR Output

This is the most labor-intensive step and cannot be eliminated. Even at 98% character accuracy, a 300-page book with 2000 characters per page contains approximately 12,000 character errors — enough to require careful reading of every page.
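
The error estimate above is simple arithmetic, and it is worth rerunning with your own page count and measured accuracy when budgeting proofreading time. A minimal sketch, with a hypothetical function name:

```python
def expected_character_errors(pages, chars_per_page, accuracy):
    """Estimate how many characters OCR gets wrong across a whole book."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be a fraction between 0 and 1")
    return round(pages * chars_per_page * (1.0 - accuracy))


# The example from the text: 300 pages, 2000 characters per page, 98% accuracy.
print(expected_character_errors(300, 2000, 0.98))  # 12000
print(expected_character_errors(1, 2000, 0.98))    # 40 errors on a single page
```

At about 40 errors per page, no page can be skipped on trust.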

Priority error types to check:

Hasanta (্) drops — critical because they change consonant sequences entirely
Matra confusion: ি (short i) vs ী (long i), ু vs ূ, ে vs ৈ
Conjunct misclassification at font boundaries (chapter headings vs body text)
Assamese numerals (০–৯) vs Latin numerals (0–9) in context
Punctuation from the original vs OCR-introduced punctuation
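
Some of these checks can be partly automated before the human read-through. The sketch below covers only the numeral check: it lists lines that contain Latin digits so a proofreader can decide, line by line, whether they should be Assamese. The function names are hypothetical, and the conversion is meant for lines that have already been reviewed, since Latin digits are sometimes correct (in ISBNs or quoted English text, for example).

```python
import re

# Assamese digits ০ through ৯ occupy U+09E6 to U+09EF.
ASSAMESE_DIGITS = "".join(chr(0x09E6 + i) for i in range(10))
# An explicit [0-9] class is used because \d also matches Assamese digits
# when the pattern is a Python str.
LATIN_DIGIT = re.compile(r"[0-9]")
LATIN_TO_ASSAMESE = str.maketrans("0123456789", ASSAMESE_DIGITS)


def find_latin_digits(text):
    """Return (line_number, line) pairs for lines containing Latin digits."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if LATIN_DIGIT.search(line)
    ]


def to_assamese_digits(line):
    """Convert the Latin digits in one reviewed line to Assamese digits."""
    return line.translate(LATIN_TO_ASSAMESE)
```

Running `find_latin_digits` on the consolidated text gives a short review list instead of a hunt through every page.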

Use Unicode-aware text editors (Notepad++, Visual Studio Code) with Assamese keyboard input for corrections. Do not use PageMaker or legacy DTP software for this step — editing in Unicode text editors preserves the Unicode encoding integrity of the output.

Step 6A (Digital Archive Path): Finalize Unicode and Export

For digital publication or searchable archival:

Final proofreading pass in the Unicode text file
Apply paragraph structure (headings, chapters) in a Unicode-aware word processor
Export to PDF/A (archival PDF) or EPUB for digital distribution
Generate searchable PDF from InDesign with embedded Unicode Assamese font

Step 6B (Reprint Path): Convert to Geetanjali with Rupantarak

For reprint via a PageMaker pipeline:

Take the proofread Unicode text from Step 5
Open Rupantarak and run Unicode → Geetanjali conversion
Rupantarak processes the entire book in seconds
Verify a sample of conjunct characters in the converted output
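
To choose which conjuncts to verify, it helps to list every conjunct that actually occurs in the book rather than sampling by eye. The sketch below builds that inventory from the proofread Unicode text. It does not call Rupantarak, and the function name is hypothetical. A conjunct is taken here to be two or more consonants joined by hasanta (U+09CD).

```python
import re
from collections import Counter

# Consonants: U+0995 to U+09B9, the precomposed nukta letters
# U+09DC, U+09DD, U+09DF, and Assamese U+09F0 (ra) and U+09F1 (wa).
# An optional nukta (U+09BC) is allowed so decomposed forms also match.
CONSONANT = "(?:[\u0995-\u09B9\u09DC\u09DD\u09DF\u09F0\u09F1]\u09BC?)"
HASANTA = "\u09CD"
CONJUNCT = re.compile(f"{CONSONANT}(?:{HASANTA}{CONSONANT})+")


def conjunct_inventory(text):
    """Count each distinct conjunct found in Unicode Assamese text."""
    return Counter(CONJUNCT.findall(text))


def verification_list(text, limit=50):
    """Return the most frequent conjuncts, most common first."""
    return [conjunct for conjunct, _ in conjunct_inventory(text).most_common(limit)]
```

Check the most frequent conjuncts first, then spot-check the rare ones, which are the most likely to be missing from a legacy font's glyph set.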

Step 7: DTP Layout

Import the converted text into your DTP application:

PageMaker 6.5 (Geetanjali): Import converted Geetanjali text; set font to Geetanjali; adjust columns, heading styles, chapter breaks
InDesign (Unicode): Place Unicode text directly; apply paragraph styles with Noto Serif Bengali or a similar Unicode font; enable the Adobe World-Ready Paragraph Composer so that conjuncts and matras shape correctly

Apply consistent paragraph styles throughout. Set proper leading (line spacing) — Assamese text requires more vertical space than Latin text at equivalent point size because vowel marks extend above and below the baseline.

Step 8: Design and Final Proofing

Design elements specific to Assamese books:

Chapter opening pages: Traditional Assamese publications use specific ornamental conventions
Running headers: Ensure correct font and encoding in the master page
Page numbers: Assamese numerals (০, ১, ২…) vs Arabic; choose based on the original edition

Print a full physical proof before generating the press PDF. Assamese conjunct rendering errors that look fine on screen sometimes appear different in print output, particularly in grayscale printing where tonal compression can obscure fine matra distinctions.

Step 9: Export Press-Ready PDF

Export at 300 DPI with fonts embedded (for PageMaker: Export as PDF with Geetanjali embedded; for InDesign: PDF/X-4 with Unicode fonts embedded)
Include bleed marks if the reprint format differs from the original page size
Deliver to press with font license documentation

Positioning DRISTI and Rupantarak as Core Tools

No other combination of tools handles the complete Assamese book digitization workflow as directly. DRISTI OCR provides the text extraction with Assamese-specific recognition. Rupantarak handles the encoding bridge between the Unicode editing environment and legacy print pipelines. Together, they reduce a project that once required weeks of manual retyping to a matter of hours of supervised processing and proofreading.

For publishers, libraries, or government digitization programs, this workflow scales linearly — a team running DRISTI in batch mode can process hundreds of books per month with appropriate scanner capacity. The bottleneck is proofreading, which requires Assamese-literate reviewers but does not require any technical expertise in encoding systems.

Related Guides and Tools

Assamese newspaper OCR guide — for digitizing newspaper archives rather than books
OCR image preprocessing guide — scan quality and binarization techniques
Why Assamese OCR is harder than English OCR — technical context for understanding error rates
How OCR makes book reprinting viable at scale — the business case for OCR-driven reprinting
Assamese DTP software guide — complete DTP toolkit overview
Assamese typing guide for DTP professionals — for the layout and manual correction phases
Unicode vs Geetanjali comparison — why encoding conversion is necessary for PageMaker pipelines
