Assamese OCR Image Preprocessing: Scan Quality Guide for Maximum Accuracy
A technical guide to preparing scanned images for Assamese OCR — covering DPI selection, binarization, deskewing, despeckling, contrast enhancement for faded manuscripts, and phone camera vs flatbed scanner tradeoffs.
Why Preprocessing Is Not Optional
Most OCR accuracy problems are determined before recognition begins: they are failures of image quality, not failures of the recognition engine. An Assamese OCR engine can only work with the pixel information it is given. If the image is skewed, noisy, or too low in resolution, the engine cannot recover detail that was never captured.
The gap between a poorly prepared and well-prepared scan of the same document can exceed 15 percentage points in character accuracy for Assamese script — because features like the hasanta (্) subscript, the lower loop of ৰ, and the distinction between ো and ৌ (split matras) rely on fine detail that only high-quality scans preserve.
For the structural reasons why Assamese OCR is more sensitive to scan quality than English OCR, see the technical breakdown of Assamese OCR challenges. For the complete scan-to-text workflow including what happens after preprocessing, see the Assamese image to text guide.
Resolution: The Single Most Important Variable
| DPI Setting | Best For | Expected Accuracy Impact |
|---|---|---|
| 150 DPI | Nothing — insufficient for Assamese | Very poor; fine features lost |
| 200 DPI | Large-format display prints only | Poor; matra confusion frequent |
| 300 DPI | Clean post-1990 books, modern newspapers | Good baseline for clean material |
| 400 DPI | Pre-1990 books, newsprint, moderate fading | Significant improvement over 300 for older material |
| 600 DPI | Sanchipat manuscripts, letterpress, heavy fading | Maximum practical benefit; diminishing returns above this |
| 1200 DPI | Archival master copies only; not for direct OCR | File sizes impractical; resample down to 600 for OCR |
Practical rule: When in doubt, scan at 400 DPI. The increased file size is manageable, and the accuracy improvement on borderline material is consistently meaningful.
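To see why the lower settings in the table fail, it helps to convert a physical stroke size into pixels. The sketch below does only that conversion. The 0.3 mm feature size is an illustrative assumption, not a measured value for any particular typeface, and `feature_pixels` is a hypothetical helper name.

```python
MM_PER_INCH = 25.4

def feature_pixels(feature_mm: float, dpi: int) -> float:
    """Return how many pixels a feature of the given physical size spans at a scan resolution."""
    return feature_mm / MM_PER_INCH * dpi

# Assumed size of a small mark such as a hasanta stroke (hypothetical value).
HASANTA_MM = 0.3

for dpi in (150, 200, 300, 400, 600):
    print(f"{dpi} DPI: {feature_pixels(HASANTA_MM, dpi):.1f} px")
```

Under that assumption the mark spans fewer than two pixels at 150 DPI and about seven at 600 DPI, which is the difference between a feature the engine can classify and one that is indistinguishable from noise.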
Grayscale vs Binary Scanning
Flatbed scanners offer a scan mode choice: grayscale (8 bits per pixel, 256 shades) or binary (1 bit per pixel, pure black or white). Always choose grayscale.
Binary scanning applies a global binarization threshold: a single cutoff value below which all pixels become black and above which all become white. On a clean, evenly lit page with fresh ink, this works. It fails on:
- Pages with uneven paper yellowing
- Documents where ink faded differently across the page
- Sanchipat manuscripts with natural fiber texture
On such pages a global threshold destroys information. If the threshold is set high enough to recover the faint ink, the background noise becomes black. If it is set low enough to eliminate the background, faint strokes disappear.
Adaptive binarization — where the threshold varies locally across the image based on neighborhood pixel statistics — recovers faint strokes while suppressing background. DRISTI OCR applies Sauvola adaptive binarization internally; ScanTailor Advanced provides it as a preprocessing step for batch scanning pipelines.
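The difference between the two approaches can be shown on a tiny synthetic page. The sketch below implements Sauvola's rule, where the local threshold is T = m(1 + k(s/R - 1)) with m and s the mean and standard deviation of the neighborhood. The parameter values (`window=3`, `k=0.2`, `r=128`) and the pixel values are assumptions chosen for the demonstration. They are not DRISTI's or ScanTailor's actual settings, and real pipelines use much larger windows and an image library rather than plain lists.

```python
from statistics import mean, pstdev

def global_binarize(gray, threshold):
    """Binarize with one cutoff for the whole image: 0 is ink, 255 is background."""
    return [[0 if v < threshold else 255 for v in row] for row in gray]

def sauvola_binarize(gray, window=3, k=0.2, r=128.0):
    """Binarize a grayscale image (list of rows, values 0-255) with a local Sauvola threshold."""
    height, width = len(gray), len(gray[0])
    half = window // 2
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the neighborhood, clipped at the image border.
            patch = [
                gray[j][i]
                for j in range(max(0, y - half), min(height, y + half + 1))
                for i in range(max(0, x - half), min(width, x + half + 1))
            ]
            m, s = mean(patch), pstdev(patch)
            threshold = m * (1 + k * (s / r - 1))
            row.append(0 if gray[y][x] < threshold else 255)
        out.append(row)
    return out

# Background darkens from 220 to 130 left to right (uneven yellowing).
# Column 2 holds a faint stroke (140); column 7 holds a dark stroke (50).
line = [220, 210, 140, 190, 180, 170, 160, 50, 140, 130]
page = [line[:] for _ in range(3)]

print(global_binarize(page, 150)[1])  # keeps the faint stroke, blackens the right margin
print(global_binarize(page, 100)[1])  # clean margin, but the faint stroke is gone
print(sauvola_binarize(page)[1])      # both strokes kept, background stays white
```

No single global cutoff works here because the faint stroke on the left is as bright as the darkened background on the right. The local threshold follows the background instead.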
Deskewing: The Shirorekha Alignment Problem
The Assamese Shirorekha (the horizontal top bar running across all characters in a word) is a geometric reference line that OCR engines use for character segmentation and zone detection. When a page is scanned at an angle, the Shirorekha tilts, disrupting:
- Word boundary detection
- Character baseline alignment
- Matra positioning relative to the host consonant
- Column detection in multi-column newspaper layouts
A 2-degree skew, barely visible to the eye, displaces a horizontal line by about 5 mm across a 15-cm page width, which is roughly 62 pixels at 300 DPI. The Shirorekha at the left margin therefore sits dozens of pixel rows away from the Shirorekha at the right margin, and segmentation algorithms that assume horizontal text lines fail at this scale.
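The size of that displacement follows from basic trigonometry: the drift equals the page width multiplied by the tangent of the skew angle. A minimal sketch of the arithmetic, with `skew_drift_pixels` as a hypothetical helper name:

```python
import math

def skew_drift_pixels(page_width_cm: float, skew_degrees: float, dpi: int) -> float:
    """Vertical drift of a nominally horizontal line across the page width, in pixels."""
    width_px = page_width_cm / 2.54 * dpi
    return width_px * math.tan(math.radians(skew_degrees))

drift = skew_drift_pixels(15, 2, 300)
print(f"{drift:.0f} px of drift across the page")
```

Because the drift scales with resolution, the same 2-degree tilt costs twice as many pixel rows at 600 DPI, so deskewing matters more, not less, on high-resolution scans.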
Deskewing tools and methods:
- DRISTI automatic deskew — Applied automatically on import; handles up to approximately 10-degree tilt
- ScanTailor — Free, excellent for batch book scanning; detects page orientation and straightens
- GIMP/ImageMagick rotation — Manual measurement of skew angle using the measure tool; rotate by exact degrees
For severely distorted pages (books not pressed flat during scanning), geometric correction is more appropriate than simple rotation.
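For the manual route, the skew angle can be computed from two points picked on the same Shirorekha, one near each margin. The sketch below assumes pixel coordinates with y increasing downward, as most image editors report them. The coordinates are made up for illustration.

```python
import math

def skew_angle_degrees(left_xy, right_xy):
    """Angle of the line through two points, in degrees.

    Points are (x, y) pixel coordinates with y increasing downward.
    A positive result means the line drops toward the right.
    """
    dx = right_xy[0] - left_xy[0]
    dy = right_xy[1] - left_xy[1]
    return math.degrees(math.atan2(dy, dx))

# Two hypothetical points read off the same Shirorekha.
angle = skew_angle_degrees((120, 400), (1890, 462))
print(f"measured skew: {angle:.2f} degrees")
```

Rotate the image by the opposite of the measured angle. Tools differ on whether a positive angle means clockwise or counterclockwise, so confirm the direction on one page before processing a batch.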
Despeckling: The Tradeoff Problem
Newsprint scanning produces speckle noise — small isolated black pixels from paper fiber, ink bleed, and scanning artifacts. Despeckling removes these isolated pixels below a minimum connected-component size.
The danger for Assamese text: the hasanta (virama, ্) is a small subscript mark. Aggressive despeckling with a minimum size larger than the hasanta’s pixel footprint at 300 DPI will erase every hasanta on the page, turning all conjuncts into ambiguous consonant sequences.
Safe despeckling parameters at 300 DPI:
- Minimum connected component: 4–6 pixels
- Do not apply despeckling before deskewing (skewed text creates artificial small components)
- Treat the main text zone separately from margins and gutters: stronger cleanup is acceptable outside the text, but do not apply one aggressive setting uniformly to the entire page
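The effect of the minimum component size can be sketched with a small connected-component filter. This is an illustrative implementation on a binary image stored as nested lists, not the algorithm of any particular tool, and the 6-pixel mark standing in for a hasanta is an assumed footprint.

```python
from collections import deque

def despeckle(binary, min_size):
    """Remove 4-connected ink components smaller than min_size pixels.

    `binary` is a list of rows where 1 is ink and 0 is background.
    Returns a new image; the input is left unchanged.
    """
    height, width = len(binary), len(binary[0])
    out = [row[:] for row in binary]
    seen = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if not binary[y][x] or seen[y][x]:
                continue
            # Breadth-first search collects one connected component.
            component, queue = [], deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and binary[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(component) < min_size:
                for cy, cx in component:
                    out[cy][cx] = 0
    return out

page = [
    [1, 0, 0, 0, 0, 0, 0],  # isolated 1-pixel speck
    [0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 1, 1, 1, 0],  # 6-pixel mark standing in for a hasanta
    [0, 0, 0, 0, 0, 0, 0],
]
safe = despeckle(page, min_size=5)         # speck removed, mark kept
aggressive = despeckle(page, min_size=10)  # speck and mark both removed
```

With `min_size=5` only the speck disappears. With `min_size=10` the small mark is erased as well, which is exactly the failure described above.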
Contrast Enhancement for Faded Material
Sanchipat manuscripts, written on folios prepared from the bark of the Sanchi (agarwood) tree and often centuries old, present a distinctive contrast problem. The ink is dark, on a golden-tan bark surface. The contrast exists but is compressed into a narrow tonal range. A direct scan produces a low-contrast image where automatic thresholds fail.
Workflow for Sanchipat:
- Scan at 600 DPI in color (not grayscale) to capture the spectral difference between the ink and the bark surface
- Convert to grayscale using a channel mix that emphasizes the red channel (the ink and the bark surface absorb red and blue light differently)
- Apply CLAHE (Contrast Limited Adaptive Histogram Equalization) to stretch local contrast
- Apply Sauvola binarization
- Run through DRISTI OCR with the historical document model selected
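The channel-mix step in this workflow can be sketched as a weighted sum of the color channels. The weights and the two sample pixel values below are assumptions for illustration only. Whether emphasizing red actually helps depends on the ink and the writing surface, so inspect the individual channels of a test scan before fixing the weights. The CLAHE and Sauvola steps would normally come from an image-processing library and are not shown.

```python
def channel_mix_gray(rgb_rows, weights):
    """Convert rows of (R, G, B) tuples to 8-bit grayscale with the given channel weights."""
    wr, wg, wb = weights
    return [
        [min(255, round(wr * r + wg * g + wb * b)) for r, g, b in row]
        for row in rgb_rows
    ]

STANDARD_LUMA = (0.299, 0.587, 0.114)  # common default grayscale conversion
RED_WEIGHTED = (0.7, 0.2, 0.1)         # hypothetical red-heavy mix

# One surface pixel and one ink pixel (made-up values).
sample = [[(200, 160, 90), (70, 50, 40)]]

standard = channel_mix_gray(sample, STANDARD_LUMA)[0]
red_mix = channel_mix_gray(sample, RED_WEIGHTED)[0]
print("standard contrast:", standard[0] - standard[1])
print("red-weighted contrast:", red_mix[0] - red_mix[1])
```

On these sample values the red-weighted mix separates ink from surface slightly more than the default conversion, which gives the later contrast-stretching step more to work with.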
For detailed comparison of OCR software that handles historical material, see the best Assamese OCR software comparison.
Phone Camera vs Flatbed Scanner
| Factor | Flatbed Scanner | Phone Camera |
|---|---|---|
| Resolution consistency | Consistent, calibrated | Variable; depends on distance and lens |
| Geometric distortion | Minimal | Perspective tilt; page curl and binding shadow near the book spine |
| Lighting | Internal even illumination | Ambient light; shadow risk |
| DPI equivalent (typical) | Specified exactly | 200–250 effective DPI at 30 cm |
| Speed | Slow (30–60 seconds per page) | Fast (2–3 seconds per page) |
| Batch workflow | Automatic document feeder available | Manual; requires physical handling |
| Best use case | Archive-quality digitization | Field capture, document verification |
For production digitization of books or newspaper archives, a flatbed scanner is the correct tool. Phone cameras are appropriate for field capture of individual documents where archival quality is not the goal, or when the document cannot be physically moved to a scanning station.
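The effective resolution of a phone capture can be estimated from how much of the scene the sensor's pixels are spread across. The sketch below uses assumed numbers: a 4000-pixel-wide image covering a 420 mm wide field of view. Lens blur, focus error, and image compression lower the real figure further.

```python
def effective_dpi(pixels_across: int, field_width_mm: float) -> float:
    """Effective resolution when `pixels_across` image pixels cover `field_width_mm` of the scene."""
    return pixels_across / (field_width_mm / 25.4)

# Hypothetical capture: 4000 px across a 420 mm wide field of view.
print(round(effective_dpi(4000, 420)))
```

Moving the phone closer so the page fills the frame raises the effective DPI, but it also increases perspective distortion and makes even lighting harder to maintain.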
The single highest return-on-investment preprocessing step is scanning at adequate resolution. If you are currently scanning at 200 DPI and experiencing accuracy problems, increasing to 300 or 400 DPI will improve results more than any downstream software processing.
Next Steps After Preprocessing
Once your images are correctly preprocessed, the next step is running them through DRISTI OCR for text extraction. The complete workflow for book-length projects is documented in the Assamese book digitization guide. For newspaper archives with multi-column layouts and mixed fonts, the Assamese newspaper OCR guide covers the newspaper-specific challenges.
After OCR, the output is Unicode text. If your DTP pipeline requires Geetanjali encoding (for PageMaker), convert it with Rupantarak. For new content creation rather than digitization, the Jahnabi Pro Keyboard provides professional Unicode Assamese input without the scanning step entirely.
Frequently Asked Questions
What DPI should I use to scan Assamese books for OCR?
300 DPI is the minimum for clean post-1990 printed books. Use 400 DPI for newspapers and pre-1990 material. Use 600 DPI for faded letterpress prints, Sanchipat manuscripts, or heavily degraded documents. Scanning at a higher DPI than the material needs does not improve accuracy further, but it significantly increases file size and processing time.
Should I scan Assamese documents in grayscale or black and white?
Scan in grayscale (8-bit) and let the OCR software or a preprocessing step perform binarization. Scanning directly to binary (1-bit black and white) bakes a single global threshold into the image, which loses detail on uneven pages. Grayscale preserves this information and allows adaptive binarization to recover faint strokes.
How do I fix a skewed scan before running Assamese OCR?
Use deskewing software: DRISTI includes automatic deskew, ScanTailor straightens pages in batch, and GIMP or ImageMagick can rotate a page by a measured angle. Even a 1–2 degree tilt causes the Shirorekha to slant across columns, breaking character segmentation. Pages rotated more than 5 degrees from horizontal see significant accuracy drops in Assamese OCR.
Can I use a phone camera instead of a flatbed scanner for Assamese OCR?
Phone cameras can produce usable results if lighting is controlled (no shadows, no reflections), the document is completely flat, and the camera is held perpendicular to the page surface. Use the highest resolution setting and ensure even, diffuse lighting. For books, a book scanner stand prevents binding shadow. For archive-quality work, a flatbed scanner is strongly preferred.