Researching on Analysis and creating Corpus from Primary level Sindhi language Book for Sindhi

Authors

  • Naveen Talpur Student
  • Mir Jahanzeb Talpur
  • Timotheous Samar

Keywords:

Sindhi corpus, UOPS, Sentiment analysis, Document term matrix

Abstract

Sindhi is a rich language with a large body of literary and non-literary works. Although books, newspapers, magazines, and internet resources are available for developing Sindhi text corpora, no suitable and effective text corpus has yet been built and made accessible for research on language characteristics, semantic analysis, and information retrieval systems. This shortage of resources for computational linguistics research and NLP applications for Sindhi currently creates difficulties. To address it, we have built Sindhi text corpora to provide computational linguists, NLP specialists, and academics with text resources. Primary school textbooks from the Sindh Text Book Board are used to create the Sindhi text corpus. A Sindhi sentiment text dataset is produced and evaluated using bigrams (the 2-gram case of the n-gram model), the Document Term Matrix, and TF-IDF. The dataset may be useful for further linguistic research, topic detection, and aspect-based sentiment classification.
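
The pipeline named in the abstract (bigrams, a Document Term Matrix, TF-IDF weighting) can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenizer, the function names, the placeholder tokens `w1` to `w4`, and the unsmoothed weighting tf × log(N / df) are all assumptions made for illustration, and real Sindhi text would need script-aware tokenization.

```python
import math
from collections import Counter


def bigrams(text):
    """Split on whitespace (an assumed tokenizer) and return adjacent token pairs."""
    tokens = text.split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def document_term_matrix(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts bigram j in document i."""
    counts = [Counter(bigrams(doc)) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[count[term] for term in vocab] for count in counts]
    return vocab, matrix


def tf_idf(matrix):
    """Weight each count by relative term frequency times log(N / document frequency)."""
    n_docs = len(matrix)
    n_terms = len(matrix[0]) if matrix else 0
    doc_freq = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    weighted = []
    for row in matrix:
        total = sum(row) or 1  # avoid division by zero for documents with no bigrams
        weighted.append(
            [(row[j] / total) * math.log(n_docs / doc_freq[j]) for j in range(n_terms)]
        )
    return weighted


# Hypothetical two-document corpus; w1..w4 stand in for Sindhi words.
docs = ["w1 w2 w3", "w1 w2 w4"]
vocab, matrix = document_term_matrix(docs)
weights = tf_idf(matrix)
```

In this toy corpus the bigram shared by both documents receives a weight of zero, while a bigram that occurs in only one document keeps a positive weight, which is the property that makes TF-IDF useful for separating documents by topic or sentiment.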

Published

2023-03-10

How to Cite

Talpur, N., Talpur, M. J., & Samar, T. (2023). Researching on Analysis and creating Corpus from Primary level Sindhi language Book for Sindhi. Repertus: Journal of Linguistics, Language Planning and Policy, 2(1), 37–48. Retrieved from https://rjllp.muet.edu.pk/index.php/repertus/article/view/24