What DPI should I use to scan Assamese books for OCR?

300 DPI is the minimum for clean post-1990 printed books. Use 400 DPI for newspapers and pre-1990 material. Use 600 DPI for faded letterpress prints, Sanchipat manuscripts, or heavily degraded documents. Scanning above the resolution a document needs yields little or no extra accuracy while significantly increasing file size and processing time.

Should I scan Assamese documents in grayscale or black and white?

Scan in grayscale (8-bit) and let the OCR software or a preprocessing step perform binarization. Scanning directly to binary (1-bit black and white) bakes a single global threshold into the image, which loses detail on uneven pages. Grayscale preserves this information and allows adaptive binarization to recover faint strokes.

How do I fix a skewed scan before running Assamese OCR?

Use deskewing software: DRISTI includes automatic deskew, and ScanTailor or a manual rotation in GIMP can also straighten a page. Even a 1–2 degree tilt causes the Shirorekha to angle across columns, breaking character segmentation. Pages rotated more than 5 degrees from horizontal see significant accuracy drops in Assamese OCR.

Can I use a phone camera instead of a flatbed scanner for Assamese OCR?

Phone cameras can produce usable results if lighting is controlled (no shadows, no reflections), the document is completely flat, and the camera is held perpendicular to the page surface. Use the highest resolution setting and ensure even, diffuse lighting. For books, a book scanner stand prevents binding shadow. For archive-quality work, a flatbed scanner is strongly preferred.

Assamese OCR Image Preprocessing: Scan Quality Guide for Maximum Accuracy

Why Preprocessing Is Not Optional

Most OCR accuracy problems are set before recognition begins — they are failures of image quality, not failures of the recognition engine. An Assamese OCR engine can only work with the pixel information provided. If the image has skew, noise, or insufficient resolution, the recognition engine has no way to recover what was never captured.

The gap between a poorly prepared and well-prepared scan of the same document can exceed 15 percentage points in character accuracy for Assamese script — because features like the hasanta (্) subscript, the lower loop of ৰ, and the distinction between ো and ৌ (split matras) rely on fine detail that only high-quality scans preserve.

For the structural reasons why Assamese OCR is more sensitive to scan quality than English OCR, see the technical breakdown of Assamese OCR challenges. For the complete scan-to-text workflow including what happens after preprocessing, see the Assamese image to text guide.

Resolution: The Single Most Important Variable

DPI Setting	Best For	Expected Accuracy Impact
150 DPI	Nothing — insufficient for Assamese	Very poor; fine features lost
200 DPI	Large-format display prints only	Poor; matra confusion frequent
300 DPI	Clean post-1990 books, modern newspapers	Good baseline for clean material
400 DPI	Pre-1990 books, newsprint, moderate fading	Significant improvement over 300 for older material
600 DPI	Sanchipat manuscripts, letterpress, heavy fading	Maximum practical benefit; diminishing returns above this
1200 DPI	Archival master copies only; not for direct OCR	File sizes impractical; resample down to 600 for OCR

Practical rule: When in doubt, scan at 400 DPI. The increased file size is manageable, and the accuracy improvement on borderline material is consistently meaningful.
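The file-size side of this tradeoff is simple arithmetic: pixel count grows with the square of the DPI, so a 600 DPI scan holds four times the pixels of a 300 DPI scan. The sketch below assumes an A4 page and uncompressed 8-bit grayscale. Both are assumptions for illustration, the function name is hypothetical, and real files are usually smaller once compressed.

```python
def uncompressed_gray_bytes(width_in, height_in, dpi):
    """Bytes for an uncompressed 8-bit grayscale scan (1 byte per pixel)."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px


# A4 is roughly 8.27 x 11.69 inches (assumed page size, not from the guide).
for dpi in (300, 400, 600, 1200):
    size_mb = uncompressed_gray_bytes(8.27, 11.69, dpi) / 1_000_000
    print(f"{dpi:>4} DPI: {size_mb:7.1f} MB uncompressed")
```

Running it shows why 400 DPI is a comfortable default while 1200 DPI is reserved for archival masters.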

Grayscale vs Binary Scanning

Flatbed scanners offer a scan mode choice: grayscale (8 bits per pixel, 256 shades) or binary (1 bit per pixel, pure black or white). Always choose grayscale.

Binary scanning applies a global binarization threshold — a single cutoff value below which all pixels are black and above which all are white. On a clean, evenly lit page with fresh ink, this works. On:

Pages with uneven paper yellowing
Documents where ink faded differently across the page
Sanchipat manuscripts with natural fiber texture

…a global threshold destroys information. Where the threshold is set to recover the faint ink, the background noise becomes black. Where it is set to eliminate background, faint strokes disappear.

Adaptive binarization — where the threshold varies locally across the image based on neighborhood pixel statistics — recovers faint strokes while suppressing background. DRISTI OCR applies Sauvola adaptive binarization internally; ScanTailor Advanced provides it as a preprocessing step for batch scanning pipelines.
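To make the idea of a locally varying threshold concrete, here is a minimal NumPy sketch of the Sauvola rule, where the threshold at each pixel is `mean * (1 + k * (std / r - 1))` computed over a square window. It is an illustration only, not DRISTI's or ScanTailor's actual implementation; the function name and the defaults (`window=25`, `k=0.2`, `r=128`) are assumed, commonly used starting values.

```python
import numpy as np


def sauvola_binarize(gray, window=25, k=0.2, r=128.0):
    """Return a boolean mask that is True where a pixel is classified as ink.

    gray: 2-D array of 8-bit grayscale values. window must be odd.
    """
    img = np.asarray(gray, dtype=np.float64)
    h, w = img.shape
    pad = window // 2
    padded = np.pad(img, pad, mode="reflect")

    def integral(a):
        # Summed-area table with a leading row and column of zeros.
        return np.pad(a, ((1, 0), (1, 0)), mode="constant").cumsum(0).cumsum(1)

    def window_sum(s):
        # Sum over the window centered on every original pixel.
        return (s[window:window + h, window:window + w]
                - s[:h, window:window + w]
                - s[window:window + h, :w]
                + s[:h, :w])

    n = window * window
    mean = window_sum(integral(padded)) / n
    variance = np.maximum(window_sum(integral(padded ** 2)) / n - mean ** 2, 0.0)
    threshold = mean * (1.0 + k * (np.sqrt(variance) / r - 1.0))
    return img < threshold
```

Because the threshold follows the local mean, a faint stroke on a dark, yellowed region and a faint stroke on a bright region can both be kept, which a single global cutoff cannot do.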

Deskewing: The Shirorekha Alignment Problem

The Assamese Shirorekha (the horizontal top bar running across all characters in a word) is a geometric reference line that OCR engines use for character segmentation and zone detection. When a page is scanned at an angle, the Shirorekha tilts, disrupting:

Word boundary detection
Character baseline alignment
Matra positioning relative to the host consonant
Column detection in multi-column newspaper layouts

A 2-degree skew, barely visible to the eye, shifts a text line vertically by about 5 mm across a 15-cm page width (roughly 62 pixels at 300 DPI), so the Shirorekha at the left margin sits many pixel rows away from the Shirorekha at the right margin. Segmentation algorithms that assume horizontal text lines fail at this scale.
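The displacement is basic trigonometry: vertical drift equals page width times the tangent of the skew angle. A small sketch, assuming a 300 DPI scan (the function name is illustrative):

```python
import math


def skew_drift_px(page_width_cm, skew_deg, dpi):
    """Vertical drift, in pixels, of a text line across the page width."""
    width_px = page_width_cm / 2.54 * dpi  # 2.54 cm per inch
    return width_px * math.tan(math.radians(skew_deg))


drift = skew_drift_px(page_width_cm=15, skew_deg=2, dpi=300)
print(f"{drift:.1f} px")  # about 61.9 px
```

At higher scan resolutions the same angle produces proportionally more pixel drift, so deskewing matters more, not less, as DPI increases.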

Deskewing tools and methods:

DRISTI automatic deskew — Applied automatically on import; handles up to approximately 10-degree tilt
ScanTailor — Free, excellent for batch book scanning; detects page orientation and straightens
GIMP/ImageMagick rotation — Manual measurement of skew angle using the measure tool; rotate by exact degrees

For severely distorted pages (books not pressed flat during scanning), geometric correction is more appropriate than simple rotation.
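For readers who want to measure skew programmatically rather than with a measure tool, one widely used approach is the projection-profile method: try a range of candidate angles, shear the ink pixels by each one, and keep the angle that makes the row histogram sharpest. The sketch below is a minimal NumPy version under that assumption. It is not a description of how DRISTI or ScanTailor detect skew, and the function name and parameters are hypothetical.

```python
import numpy as np


def estimate_skew_deg(ink_mask, max_deg=5.0, step_deg=0.1):
    """Estimate text-line skew in degrees from a boolean ink mask.

    Positive result: lines descend toward the right in image coordinates.
    """
    ys, xs = np.nonzero(ink_mask)
    if ys.size == 0:
        return 0.0  # blank page: nothing to measure
    best_angle, best_score = 0.0, -1.0
    for angle in np.arange(-max_deg, max_deg + step_deg / 2, step_deg):
        # Undo a hypothetical skew of `angle` by shifting each pixel's row.
        shifted = ys - xs * np.tan(np.radians(angle))
        rows = np.round(shifted).astype(int)
        rows -= rows.min()
        profile = np.bincount(rows).astype(np.float64)
        # Aligned text lines pile ink into few rows, so the sum of
        # squared row counts peaks at the true skew angle.
        score = float(np.sum(profile ** 2))
        if score > best_score:
            best_angle, best_score = float(angle), score
    return best_angle
```

The long horizontal Shirorekha strokes contribute strongly to the row histogram, which is one reason this kind of estimate tends to suit headline-bar scripts. Rotate the page by the negative of the estimated angle in the tool of your choice.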

Despeckling: The Tradeoff Problem

Newsprint scanning produces speckle noise — small isolated black pixels from paper fiber, ink bleed, and scanning artifacts. Despeckling removes these isolated pixels below a minimum connected-component size.

The danger for Assamese text: the hasanta (virama, ্) is a small subscript mark. Aggressive despeckling with a minimum size larger than the hasanta’s pixel footprint at 300 DPI will erase every hasanta on the page, turning all conjuncts into ambiguous consonant sequences.

Safe despeckling parameters at 300 DPI:

Minimum connected component: 4–6 pixels
Do not apply despeckling before deskewing (skewed text creates artificial small components)
Despeckle margins and other non-text regions separately from the main text zone; do not apply one setting uniformly to the entire page
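The size threshold above can be illustrated with a small connected-component filter. This is a sketch using NumPy and a breadth-first search, with a hypothetical function name and an assumed 8-connected neighborhood; production tools use their own, faster implementations.

```python
from collections import deque

import numpy as np


def despeckle(ink_mask, min_pixels=5):
    """Remove 8-connected ink components smaller than min_pixels."""
    mask = np.asarray(ink_mask, dtype=bool)
    h, w = mask.shape
    seen = np.zeros_like(mask)
    out = mask.copy()
    for sy, sx in zip(*np.nonzero(mask)):
        if seen[sy, sx]:
            continue
        # Flood-fill one component, collecting its pixel coordinates.
        seen[sy, sx] = True
        queue = deque([(sy, sx)])
        component = []
        while queue:
            y, x = queue.popleft()
            component.append((y, x))
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < h and 0 <= nx < w
                            and mask[ny, nx] and not seen[ny, nx]):
                        seen[ny, nx] = True
                        queue.append((ny, nx))
        if len(component) < min_pixels:
            for y, x in component:
                out[y, x] = False
    return out
```

With `min_pixels=5`, single-pixel specks disappear while a six-pixel mark survives. Raising the value to 10 removes that six-pixel mark as well, which is exactly how an overly aggressive setting erases the hasanta.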

Contrast Enhancement for Faded Material

Sanchipat manuscripts, written on prepared bark of the Sanchi (agarwood) tree and often centuries old, present a distinctive contrast problem. The ink is dark resin on a golden-tan bark surface. The contrast exists but is compressed into a narrow tonal range, so a direct scan produces a low-contrast image on which automatic thresholds fail.

Workflow for Sanchipat:

Scan at 600 DPI in color (not grayscale) to capture the spectral difference between the ink and the bark surface
Convert to grayscale using a channel mix that emphasizes the red channel (the ink and the bark surface respond differently in red than in blue)
Apply CLAHE (Contrast Limited Adaptive Histogram Equalization) to stretch local contrast
Apply Sauvola binarization
Run through DRISTI OCR with the historical document model selected
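The channel-mix step in this workflow can be sketched in a few lines of NumPy. The weights below are assumed starting values for illustration, not calibrated figures from this guide, and the function name is hypothetical; tune the weights per manuscript by inspecting the result. The input is assumed to be in RGB order (OpenCV, for example, loads images as BGR).

```python
import numpy as np


def weighted_gray(rgb, weights=(0.7, 0.2, 0.1)):
    """Mix R, G, B into a single 8-bit channel using the given weights."""
    pixels = np.asarray(rgb, dtype=np.float64)
    w = np.asarray(weights, dtype=np.float64)
    w = w / w.sum()  # normalize so the output stays within 0..255
    gray = pixels[..., :3] @ w
    return np.clip(np.round(gray), 0, 255).astype(np.uint8)
```

The resulting grayscale image is what the CLAHE and Sauvola steps would then operate on.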

For detailed comparison of OCR software that handles historical material, see the best Assamese OCR software comparison.

Phone Camera vs Flatbed Scanner

Factor	Flatbed Scanner	Phone Camera
Resolution consistency	Consistent, calibrated	Variable; depends on distance and lens
Geometric distortion	Minimal	Book spine causes binding shadow and curve
Lighting	Internal even illumination	Ambient light; shadow risk
DPI equivalent (typical)	Specified exactly	200–250 effective DPI at 30 cm
Speed	Slow (30–60 seconds per page)	Fast (2–3 seconds per page)
Batch workflow	Automatic document feeder available	Manual; requires physical handling
Best use case	Archive-quality digitization	Field capture, document verification

For production digitization of books or newspaper archives, a flatbed scanner is the correct tool. Phone cameras are appropriate for field capture of individual documents where archival quality is not the goal, or when the document cannot be physically moved to a scanning station.
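The "DPI equivalent" of a phone capture can be estimated by dividing the pixels across the frame by the physical width the frame covers. The numbers in this sketch are illustrative assumptions, not measurements from this guide, and the function name is hypothetical.

```python
def effective_dpi(pixels_across, captured_width_cm):
    """Pixels per inch when a frame spans captured_width_cm of the scene."""
    return pixels_across / (captured_width_cm / 2.54)


# Example: a 4000-pixel-wide frame covering 40 cm of desk surface.
print(round(effective_dpi(4000, 40)))  # 254
```

Treat the result as an upper bound: lens softness, focus error, and image compression usually reduce the detail actually resolved. Moving closer so the page fills the frame raises the effective DPI.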

The single highest return-on-investment preprocessing step is scanning at adequate resolution. If you are currently scanning at 200 DPI and experiencing accuracy problems, increasing to 300 or 400 DPI will improve results more than any downstream software processing.

Next Steps After Preprocessing

Once your images are correctly preprocessed, the next step is running them through DRISTI OCR for text extraction. The complete workflow for book-length projects is documented in the Assamese book digitization guide. For newspaper archives with multi-column layouts and mixed fonts, the Assamese newspaper OCR guide covers the newspaper-specific challenges.

After OCR, the output is Unicode text. If your DTP pipeline requires Geetanjali encoding (for PageMaker), convert the Unicode text with Rupantarak. For new content creation rather than digitization, the Jahnabi Pro Keyboard provides professional Unicode Assamese input without any scanning step.

Further Reading

Why Assamese Unicode Looks Wrong on Windows: Rendering Problems Explained

How to Digitize an Assamese Book: Complete End-to-End Workflow

PageMaker to InDesign Migration for Assamese Publishing: A Practical Guide