Pashto Sentiment Analysis Dataset

@kaggle.khairullahsahil_pashto_sentiment_dataset

High-Quality Pashto Sentiment Dataset for Machine Learning and Natural Language

Dataset Description

Overview

The Pashto Sentiment Dataset is a large-scale, manually reviewed corpus developed for sentiment analysis and text classification in Pashto.

The dataset contains 10,229 Pashto sentences annotated with one of three sentiment labels:

  • Positive
  • Negative
  • Neutral

Pashto is a low-resource language with few publicly available NLP resources. This dataset aims to support research, education, and industrial applications in sentiment analysis, natural language understanding, and machine learning for Pashto.


Why This Dataset?

Despite being spoken by tens of millions of people, Pashto remains underrepresented in modern NLP research. The lack of high-quality annotated datasets is a significant obstacle to building robust language technologies.

This dataset was created to help researchers and developers:

  • Build sentiment classification systems
  • Fine-tune transformer models
  • Benchmark NLP algorithms
  • Develop Pashto language technologies
  • Support low-resource language research

Key Features

✅ 10,229 manually reviewed samples

✅ Three sentiment categories

✅ Relatively balanced class distribution

✅ Real-world Pashto language content

✅ Suitable for machine learning and deep learning

✅ Ready-to-use CSV format

✅ Designed for academic and industrial research


Dataset Statistics

| Metric | Value |
|---|---|
| Total Samples | 10,229 |
| Language | Pashto |
| Task | Sentiment Analysis |
| Classes | 3 |
| Format | CSV |
| Version | 1.0 |

Class Distribution

| Sentiment | Count |
|---|---|
| Positive | ~3,549 |
| Negative | ~3,324 |
| Neutral | ~3,355 |

The dataset provides a relatively balanced distribution across sentiment classes, making it suitable for supervised learning experiments.
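As a rough check on the balance claim, the approximate counts listed above can be turned into class shares. This is only a sketch based on the rounded figures in this description, not on the CSV itself; the listed counts sum to 10,228, one short of the stated total, which is consistent with the "~" markers.

```python
# Approximate per-class counts as listed in this description
# (the "~" in the source means these are not exact figures).
approx_counts = {"positive": 3549, "negative": 3324, "neutral": 3355}

total = sum(approx_counts.values())
shares = {label: count / total for label, count in approx_counts.items()}

# Ratio of the largest class to the smallest; 1.0 would be perfectly balanced.
imbalance_ratio = max(approx_counts.values()) / min(approx_counts.values())

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"largest/smallest class ratio: {imbalance_ratio:.2f}")
```

Each class ends up close to one third of the data, so plain accuracy is a reasonable first metric, although per-class metrics remain worth reporting.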

Dataset Schema

| Column | Description |
|---|---|
| sentence | Pashto sentence or text sample |
| sentiment | Sentiment label (positive, negative, neutral) |
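The schema above can be checked mechanically before training. The following is a minimal sketch using only the standard library. It assumes the header row is exactly `sentence,sentiment` and that labels are stored in lowercase, as the schema suggests; the function name `find_schema_problems` is hypothetical and not part of the dataset.

```python
import csv
import io

# Assumptions taken from the schema description, not verified against the file.
EXPECTED_COLUMNS = ["sentence", "sentiment"]
ALLOWED_LABELS = {"positive", "negative", "neutral"}

def find_schema_problems(csv_file):
    """Return a list of human-readable schema problems found in csv_file."""
    reader = csv.DictReader(csv_file)
    problems = []
    if reader.fieldnames != EXPECTED_COLUMNS:
        problems.append(f"unexpected header: {reader.fieldnames}")
        return problems
    for row in reader:
        line_no = reader.line_num  # physical line number of the row just read
        if not (row["sentence"] or "").strip():
            problems.append(f"line {line_no}: empty sentence")
        if row["sentiment"] not in ALLOWED_LABELS:
            problems.append(f"line {line_no}: unknown label {row['sentiment']!r}")
    return problems

# Tiny in-memory sample; the second data row has a wrongly capitalized label.
sample = io.StringIO(
    "sentence,sentiment\n"
    "نن په کابل کې د هوا درجه معتدله وه.,neutral\n"
    "افغانستان د کرکټ لوبډلې مهمه بریا ترلاسه کړه.,Positive\n"
)
problems = find_schema_problems(sample)
print(problems)
```

To run the same check on the real file, open it with `open("pashto_sentiment_dataset.csv", encoding="utf-8", newline="")` and pass the file object to the function.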

Example Records

| Sentence | Label |
|---|---|
| افغانستان د کرکټ لوبډلې مهمه بریا ترلاسه کړه. | positive |
| د امنیتي پېښې له امله ګڼ شمېر خلک اغېزمن شول. | negative |
| نن په کابل کې د هوا درجه معتدله وه. | neutral |

Data Collection

The dataset was compiled from publicly available Pashto-language content covering multiple domains, including:

  • Politics
  • International Affairs
  • Economy
  • Sports
  • Society
  • Technology
  • Culture
  • Health

The diverse coverage helps improve the dataset's usefulness across different NLP applications.


Annotation Process

The annotation workflow consisted of:

Step 1 — Data Collection

Pashto text samples were gathered from publicly available sources (BBC Pashto, VOA Pashto, Shamshad News, ...).

Step 2 — Initial Labeling

Each sentence was assigned a sentiment category based on its overall semantic meaning.

Step 3 — Manual Review

All labels were reviewed and validated by the dataset creator.

Step 4 — Quality Assurance

The final dataset underwent cleaning and validation procedures before release.


Data Quality

Several quality-control measures were applied:

  • Duplicate removal
  • Missing-value removal
  • Text normalization
  • Formatting consistency checks
  • Label verification
  • Manual review

These steps were performed to improve dataset reliability and usability.
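For readers who want to apply comparable checks to their own Pashto data, the first few measures can be sketched as follows. This is not the author's actual pipeline, which is not published here. It assumes that "text normalization" means Unicode NFC normalization plus whitespace collapsing, and that labels are compared in lowercase; the function names are hypothetical.

```python
import unicodedata

def normalize_text(text):
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())

def clean_rows(rows):
    """Drop rows with missing values and exact duplicate sentences.

    rows is an iterable of dicts with "sentence" and "sentiment" keys.
    When a sentence appears more than once, only the first copy is kept.
    """
    seen = set()
    cleaned = []
    for row in rows:
        sentence = normalize_text(row.get("sentence") or "")
        label = (row.get("sentiment") or "").strip().lower()
        if not sentence or not label:
            continue  # missing-value removal
        if sentence in seen:
            continue  # duplicate removal (after normalization)
        seen.add(sentence)
        cleaned.append({"sentence": sentence, "sentiment": label})
    return cleaned
```

Deduplicating after normalization catches copies that differ only in spacing or in how combining characters are encoded. Label verification and manual review cannot be automated this way and still need a human reader.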


Potential Applications

Natural Language Processing

  • Sentiment Analysis
  • Opinion Mining
  • Text Classification
  • Language Understanding
  • Content Analysis

Machine Learning

  • Multi-Class Classification
  • Deep Learning
  • Transformer Fine-Tuning
  • Transfer Learning

Research

  • Low-Resource NLP
  • Cross-Lingual Learning
  • Pashto Language Technology
  • Benchmark Evaluation

Getting Started

Load the Dataset

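import pandas as pd

# Load the CSV; it is expected to contain "sentence" and "sentiment" columns
df = pd.read_csv("pashto_sentiment_dataset.csv")

# Preview the first five rows
print(df.head())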

Class Distribution

# Count the number of samples per sentiment label
print(df["sentiment"].value_counts())
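A common next step is to hold out a test set while keeping the class proportions the same in both parts. The sketch below does this with the standard library only. The function name `stratified_split`, the 20% test fraction, and the seed are illustrative assumptions, and the dataset does not ship with an official split.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Split row indices into train/test, preserving per-label proportions."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)  # local generator so the split is reproducible
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels standing in for the "sentiment" column.
toy_labels = ["positive"] * 10 + ["negative"] * 10 + ["neutral"] * 10
train_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(test_idx))
```

With the DataFrame loaded above, the same function could be called as `stratified_split(df["sentiment"].tolist())`, and the resulting index lists passed to `df.iloc[...]` to obtain the two subsets.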

Recommended Models

Researchers may use this dataset with:

  • BERT
  • RoBERTa
  • XLM-RoBERTa
  • mBERT
  • DistilBERT
  • LLaMA-based classifiers
  • Traditional machine learning models
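Whichever model is chosen, results are easier to interpret next to a trivial baseline and a class-averaged metric. The sketch below implements macro-averaged F1 and a majority-class baseline in plain Python. It is an illustrative evaluation helper under those assumptions, not a benchmark protocol defined by this dataset, and the function names are hypothetical.

```python
from collections import Counter

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)

def majority_baseline(train_labels, n_predictions):
    """Predict the most frequent training label for every test item."""
    most_common_label = Counter(train_labels).most_common(1)[0][0]
    return [most_common_label] * n_predictions

# Toy example: the baseline always predicts "positive".
y_true = ["positive", "negative", "neutral", "positive"]
y_pred = majority_baseline(["positive", "positive", "negative"], len(y_true))
print(f"baseline macro-F1: {macro_f1(y_true, y_pred):.3f}")
```

On three roughly equal classes, a majority-class predictor reaches about one third accuracy and a much lower macro-F1, which gives a floor that any trained model should clearly exceed.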

Dataset Characteristics

  • The dataset is written in Standard Pashto.
  • The corpus is derived from professionally written Pashto-language content.
  • The dataset covers a diverse range of topics, including politics, economics, sports, technology, health, society, culture, and international affairs.
  • The language follows commonly accepted Pashto writing conventions and editorial standards.
  • The dataset is suitable for sentiment analysis, text classification, transfer learning, and Pashto NLP research.
  • The corpus provides broad linguistic coverage across multiple news domains.

Intended Use

This dataset is intended for:

  • Research
  • Education
  • Benchmarking
  • Model Development
  • Academic Projects
  • Industrial NLP Applications

Citation

If you use this dataset in your research, publications, or projects, please cite:

@dataset{khail2026pashto,
  author       = {Khairullah Ibrahim Khail},
  title        = {Pashto Sentiment Analysis Dataset},
  year         = {2026},
  publisher    = {Kaggle},
  url          = {https://www.kaggle.com/datasets/khairullahsahil/pashto-sentiment-dataset},
  version      = {1.0}
}

Khairullah Ibrahim Khail. Pashto Sentiment Analysis Dataset. Version 1.0. Kaggle, 2026.

Author

Khairullah Ibrahim Khail


Version History

Version 1.0

  • Initial public release
  • 10,229 manually reviewed Pashto sentences
  • Three sentiment classes
  • CSV format

Acknowledgments

This dataset was created to support the growth of Pashto Natural Language Processing and encourage further research on low-resource languages.

Contributions, feedback, and research collaborations are welcome.

Khairullah Ibrahim Khail

