This study introduces a novel approach for segmenting lines of text in handwritten documents using a vision transformer model. Specifically, we adapt the DEtection TRansformer (DETR) model to detect text lines in images of handwritten documents. To adapt DETR to the line segmentation task, we apply a pre-processing step that divides each line into fixed-size image patches, followed by the addition of positional encodings. We use a DETR model with a ResNet-101 backbone pretrained on the Common Objects in Context (COCO) object detection dataset and re-train it on our novel, complex line segmentation dataset of 1,610 handwritten forms. For comparison, we implement another line segmentation method, Bangla Document Recognition through Instance-level Segmentation of Handwritten Text Images (BN-DRISHTI), which relies on the You Only Look Once (YOLO) object detection model. Both object detection-based methods involve a learning phase in which the model is trained or fine-tuned on the dataset. To obtain a diverse set of baseline methods, we also implement two learning-free algorithms: the A* search algorithm and the genetic algorithm (GA). Experimental results based on the Intersection over Union (IoU) metric demonstrate that the proposed method outperforms all other methods in terms of detection rate, recognition accuracy, and the Text Line Detection Metric (TLDM). The quantitative results also indicate that the two learning-free algorithms fail to segment highly skewed lines in the dataset. The A* algorithm achieves a recognition accuracy of 0.734, compared to 0.498 for GA and 0.689 for BN-DRISHTI. Our proposed approach achieves the highest recognition accuracy of 0.872, outperforming all other methods. We show that the DETR model, which requires only a single fine-tuning phase to adapt to the line segmentation task, not only simplifies training and implementation but also improves accuracy and efficiency in detecting and segmenting handwritten text lines. DETR's use of the transformer's global attention mechanism allows it to capture the context of the entire image rather than relying solely on local features. This is particularly beneficial for handling the diverse and complex patterns found in handwritten text, where traditional models may struggle with issues such as overlapping text lines or varied handwriting styles.
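To make the fine-tuning setup and the IoU-based evaluation concrete, the following is a minimal sketch of adapting a COCO-pretrained DETR with a ResNet-101 backbone to single-class text-line detection using the Hugging Face `transformers` library. The checkpoint name, the single "text line" label, and the training hyperparameters are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch (not the authors' exact pipeline): fine-tune a
# COCO-pretrained DETR (ResNet-101 backbone) to detect one class,
# "text line", mirroring the single fine-tuning phase described in
# the abstract. Checkpoint, label set, and learning rate are
# assumptions for illustration.
import torch
from transformers import DetrForObjectDetection, DetrImageProcessor

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-101")
model = DetrForObjectDetection.from_pretrained(
    "facebook/detr-resnet-101",
    num_labels=1,                  # one class: "text line"
    ignore_mismatched_sizes=True,  # replace COCO's 91-way head with a 1-way head
)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()

def fine_tune_step(image, line_boxes):
    """One gradient step on a page image and its ground-truth line boxes.

    `line_boxes` is an (N, 4) tensor in normalized (cx, cy, w, h) format,
    as expected by DETR's bipartite-matching loss.
    """
    inputs = processor(images=image, return_tensors="pt")
    labels = [{
        "class_labels": torch.zeros(len(line_boxes), dtype=torch.long),
        "boxes": line_boxes,
    }]
    outputs = model(pixel_values=inputs["pixel_values"], labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()

def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) pixel coordinates,
    as used for the evaluation metric reported in the abstract."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)
```

In this sketch, each predicted line box would be matched against ground truth with `box_iou` and counted as a correct detection above a chosen IoU threshold; the abstract's detection rate and recognition accuracy figures are derived from such IoU-based matching.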
Vision Transformers, Handwritten Text Segmentation, Object Detection, Optical Character Recognition, Deep Learning, Document Analysis
This study does not require ethics committee permission or any special permission.
Primary Language | English
---|---
Subjects | Image Processing, Pattern Recognition
Journal Section | Research Articles
Authors |
Early Pub Date | April 25, 2025
Publication Date |
Submission Date | April 21, 2024
Acceptance Date | January 27, 2025
Published in Issue | Year 2025, Volume 9, Issue 1
The works published in Journal of Innovative Science and Engineering (JISE) are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.