
Abstract

Code-mixing of Bangla and English words is common, especially on social media. This phenomenon significantly hampers the learning and preservation of the Bengali language among future generations. This paper proposes a model to recognize Bangla and English words in Bengali texts. In addition, this study converts the detected English words into standard Bangla words. In this work, we adapt the BERT-base-NER model to our training dataset. BERT is chosen for its strong contextual representation capabilities, which are well suited to noisy and informal text. BERT-base-NER provides strong contextual embeddings but classifies token labels independently, without explicit modeling of label dependencies. The modified BERT-base-NER model proposed in this paper achieves state-of-the-art performance on the named entity recognition (NER) task. We modify the baseline BERT-base-NER model by integrating a BiLSTM layer on top of BERT's contextual embeddings. This modification allows the model to capture sequential patterns and dependencies within the input text, combining BERT's deep contextual understanding with explicit sequence modeling to identify named entities more accurately. We use a holdout validation procedure, allocating 80% of the data for training and 20% for testing. We convert the recognized English words into standard Bangla words using the Google Translate API (application programming interface). We develop a word-level annotated corpus of 1,742 Bengali-English code-mixed sentences from social media and measure inter-annotator agreement (IAA) with Cohen's Kappa. We evaluate the modified BERT-base-NER model against baseline machine learning (ML) and deep learning (DL) approaches on this corpus.
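The BiLSTM-on-BERT modification described above can be sketched as a small PyTorch module. This is a minimal illustration, not the paper's actual implementation: the layer sizes, label count (here, Bangla vs. English word), and module name are assumptions, and the BERT embeddings are represented by a placeholder tensor of the shape a Hugging Face BertModel would produce.

```python
import torch
import torch.nn as nn

class BiLstmNerHead(nn.Module):
    """Illustrative BiLSTM + token classifier stacked on BERT embeddings.

    A sketch of the modification described above; sizes and the
    two-label scheme (Bangla vs. English word) are assumptions.
    """
    def __init__(self, bert_hidden: int = 768, lstm_hidden: int = 128,
                 num_labels: int = 2):
        super().__init__()
        # BiLSTM adds explicit sequence modeling over the contextual
        # embeddings, capturing dependencies between adjacent labels.
        self.bilstm = nn.LSTM(bert_hidden, lstm_hidden,
                              batch_first=True, bidirectional=True)
        # 2 * lstm_hidden: forward and backward states are concatenated.
        self.classifier = nn.Linear(2 * lstm_hidden, num_labels)

    def forward(self, bert_embeddings: torch.Tensor) -> torch.Tensor:
        # bert_embeddings: (batch, seq_len, bert_hidden), e.g. the
        # last_hidden_state of a pretrained BERT encoder.
        lstm_out, _ = self.bilstm(bert_embeddings)
        return self.classifier(lstm_out)  # (batch, seq_len, num_labels)

head = BiLstmNerHead()
logits = head(torch.randn(1, 10, 768))  # per-token logits, shape (1, 10, 2)
```

At inference time, an argmax over the last dimension would yield one language label per token.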
The ML baselines include a support vector machine (SVM) and Naive Bayes (NB); the DL baselines include a convolutional neural network (CNN), long short-term memory (LSTM), and bidirectional LSTM (BiLSTM). Experimental results demonstrate that the modified BERT-base-NER model recognizes Bangla and English words accurately and efficiently, achieving the best result among the compared approaches with an accuracy of 95%.
