
Abstract

Code-mixing of Bangla and English words is common, especially on social media. This phenomenon significantly hampers the learning and preservation of the Bengali language among future generations. This paper proposes a model to recognize Bangla and English words in Bengali texts. In addition, this study converts the detected English words into standard Bangla words. In this work, we adapt the BERT-base-NER model to our training dataset. BERT is chosen for its strong contextual representation capabilities, which are well suited to noisy and informal text. BERT-base-NER provides strong contextual embeddings but classifies token labels independently, without explicit modeling of label dependencies. The modified BERT-base-NER model proposed in this paper achieves state-of-the-art performance on the named entity recognition (NER) task. We modify the baseline BERT-base-NER model by integrating a BiLSTM layer on top of BERT's contextual embeddings. This modification allows the model to capture sequential patterns and dependencies within the input text, combining BERT's deep contextual understanding with explicit sequence modeling to identify named entities more accurately. We use a holdout validation procedure, allocating 80% of the data for training and 20% for testing. We convert the recognized English words into standard Bangla words using the Google Translate API (application programming interface). We develop a word-level annotated corpus of 1,742 Bengali-English code-mixed sentences from social media and measure inter-annotator agreement (IAA) with Cohen's Kappa. We evaluate the modified BERT-base-NER model against baseline machine learning (ML) and deep learning (DL) approaches on this corpus.
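The BiLSTM-on-BERT modification described above can be sketched as a small PyTorch module. This is a minimal illustration, not the paper's actual implementation: the layer sizes, label count (here, Bangla vs. English word), and module name are assumptions, and the BERT embeddings are represented by a placeholder tensor of the shape a Hugging Face BertModel would produce.

```python
import torch
import torch.nn as nn

class BiLstmNerHead(nn.Module):
    """Illustrative BiLSTM + token classifier stacked on BERT embeddings.

    A sketch of the modification described above; sizes and the
    two-label scheme (Bangla vs. English word) are assumptions.
    """
    def __init__(self, bert_hidden: int = 768, lstm_hidden: int = 128,
                 num_labels: int = 2):
        super().__init__()
        # BiLSTM adds explicit sequence modeling over the contextual
        # embeddings, capturing dependencies between adjacent labels.
        self.bilstm = nn.LSTM(bert_hidden, lstm_hidden,
                              batch_first=True, bidirectional=True)
        # 2 * lstm_hidden: forward and backward states are concatenated.
        self.classifier = nn.Linear(2 * lstm_hidden, num_labels)

    def forward(self, bert_embeddings: torch.Tensor) -> torch.Tensor:
        # bert_embeddings: (batch, seq_len, bert_hidden), e.g. the
        # last_hidden_state of a pretrained BERT encoder.
        lstm_out, _ = self.bilstm(bert_embeddings)
        return self.classifier(lstm_out)  # (batch, seq_len, num_labels)

head = BiLstmNerHead()
logits = head(torch.randn(1, 10, 768))  # per-token logits, shape (1, 10, 2)
```

At inference time, an argmax over the last dimension would yield one language label per token.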
The ML baselines include a support vector machine (SVM) and Naive Bayes (NB); the DL baselines include a convolutional neural network (CNN), long short-term memory (LSTM), and bidirectional LSTM (BiLSTM). Experimental results demonstrate that the modified BERT-base-NER model recognizes Bangla and English words accurately and efficiently, achieving the best result among the compared approaches with an accuracy of 95%.
