High-Quality Pashto Sentiment Dataset for Machine Learning and Natural Language

Pashto Sentiment Analysis Dataset

Overview

The Pashto Sentiment Dataset is a large-scale manually reviewed corpus developed for sentiment analysis and text classification in the Pashto language.

The dataset contains 10,229 Pashto sentences annotated with one of three sentiment labels:

Positive
Negative
Neutral

Pashto is a low-resource language with few publicly available NLP resources. This dataset aims to support research, education, and industrial applications in sentiment analysis, natural language understanding, and machine learning for Pashto.

Why This Dataset?

Although it is spoken by tens of millions of people, Pashto remains underrepresented in modern NLP research. The lack of high-quality annotated datasets is a significant obstacle to building robust language technologies.

This dataset was created to help researchers and developers:

Build sentiment classification systems
Fine-tune transformer models
Benchmark NLP algorithms
Develop Pashto language technologies
Support low-resource language research

Key Features

✅ 10,229 manually reviewed samples

✅ Three sentiment categories

✅ Approximately balanced class distribution

✅ Real-world Pashto language content

✅ Suitable for machine learning and deep learning

✅ Ready-to-use CSV format

✅ Designed for academic and industrial research

Dataset Statistics

Metric	Value
Total Samples	10,229
Language	Pashto
Task	Sentiment Analysis
Classes	3
Format	CSV
Version	1.0

Class Distribution

Sentiment	Count
Positive	~3,549
Negative	~3,324
Neutral	~3,355

The dataset provides a relatively balanced distribution across sentiment classes, making it suitable for supervised learning experiments.
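
To put a number on "relatively balanced", the short sketch below turns the approximate counts from the table into class shares and a largest-to-smallest ratio. It uses only the figures listed above; note that these approximate counts sum to 10,228, one fewer than the stated total of 10,229.

```python
# Approximate per-class counts, copied from the Class Distribution table.
counts = {"positive": 3549, "negative": 3324, "neutral": 3355}

total = sum(counts.values())
# Share of each class relative to the summed counts.
shares = {label: n / total for label, n in counts.items()}
# Ratio of the largest class to the smallest; 1.0 would be perfectly balanced.
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"largest/smallest class ratio: {imbalance_ratio:.2f}")
```

Each class holds roughly a third of the samples, so plain accuracy is a reasonable first metric, although per-class precision and recall are still worth reporting.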

Dataset Schema

Column	Description
sentence	Pashto sentence or text sample
sentiment	Sentiment label (positive, negative, neutral)

Example Records

Sentence	Label
افغانستان د کرکټ لوبډلې مهمه بریا ترلاسه کړه.	positive
د امنیتي پېښې له امله ګڼ شمېر خلک اغېزمن شول.	negative
نن په کابل کې د هوا درجه معتدله وه.	neutral
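
The schema and example records above can be checked programmatically before training. The following is a minimal sketch, not part of the dataset's own tooling: the `validate_schema` helper and its messages are hypothetical, and the sample frame simply reuses the three example records.

```python
import pandas as pd

EXPECTED_COLUMNS = ["sentence", "sentiment"]
ALLOWED_LABELS = {"positive", "negative", "neutral"}

def validate_schema(df):
    """Return a list of problems found; an empty list means the frame looks fine."""
    problems = []
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        # Without the expected columns, the remaining checks cannot run.
        return [f"missing columns: {missing}"]
    if df[EXPECTED_COLUMNS].isna().any().any():
        problems.append("missing values found")
    unexpected = set(df["sentiment"].dropna().unique()) - ALLOWED_LABELS
    if unexpected:
        problems.append(f"unexpected labels: {sorted(unexpected)}")
    return problems

# The three example records from the table above.
sample = pd.DataFrame({
    "sentence": [
        "افغانستان د کرکټ لوبډلې مهمه بریا ترلاسه کړه.",
        "د امنیتي پېښې له امله ګڼ شمېر خلک اغېزمن شول.",
        "نن په کابل کې د هوا درجه معتدله وه.",
    ],
    "sentiment": ["positive", "negative", "neutral"],
})
print(validate_schema(sample))  # prints [] for these three records
```

The same function can be applied to the full frame returned by `pd.read_csv`.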

Data Collection

The dataset was compiled from publicly available Pashto-language content covering multiple domains, including:

Politics
International Affairs
Economy
Sports
Society
Technology
Culture
Health

The diverse coverage helps improve the dataset's usefulness across different NLP applications.

Annotation Process

The annotation workflow consisted of:

Step 1 — Data Collection

Pashto text samples were gathered from publicly available sources such as BBC Pashto, VOA Pashto, and Shamshad News, among others.

Step 2 — Initial Labeling

Each sentence was assigned a sentiment category based on its overall semantic meaning.

Step 3 — Manual Review

All labels were reviewed and validated by the dataset creator.

Step 4 — Quality Assurance

The final dataset underwent cleaning and validation procedures before release.

Data Quality

Several quality-control measures were applied:

Duplicate removal
Missing-value removal
Text normalization
Formatting consistency checks
Label verification
Manual review

These steps were performed to improve dataset reliability and usability.
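
The exact cleaning procedure is not documented here, so the following is only a sketch of how the first three measures (missing-value removal, text normalization, duplicate removal) might look in pandas. The `normalize_text` and `clean` helpers, the choice of Unicode NFC, and the whitespace handling are all assumptions made for illustration.

```python
import unicodedata
import pandas as pd

def normalize_text(text):
    # Assumed normalization: Unicode NFC plus collapsing runs of whitespace.
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())

def clean(df):
    # Drop rows where either column is missing.
    out = df.dropna(subset=["sentence", "sentiment"]).copy()
    out["sentence"] = out["sentence"].map(normalize_text)
    out["sentiment"] = out["sentiment"].str.strip().str.lower()
    # Remove sentences that became empty, then exact duplicates.
    out = out[out["sentence"] != ""]
    return out.drop_duplicates(subset="sentence").reset_index(drop=True)

# Toy input built from one example sentence: a clean copy,
# an untidy copy of the same sentence, and a row with no text.
s = "نن په کابل کې د هوا درجه معتدله وه."
raw = pd.DataFrame({
    "sentence": [s, "  " + s.replace(" ", "   ", 1) + " ", None],
    "sentiment": ["neutral", "Neutral ", "positive"],
})
cleaned = clean(raw)
print(len(raw), "->", len(cleaned))
```

Normalizing before deduplicating matters: the two copies of the sentence differ only in spacing and would otherwise both survive.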

Potential Applications

Natural Language Processing

Sentiment Analysis
Opinion Mining
Text Classification
Language Understanding
Content Analysis

Machine Learning

Multi-Class Classification
Deep Learning
Transformer Fine-Tuning
Transfer Learning

Research

Low-Resource NLP
Cross-Lingual Learning
Pashto Language Technology
Benchmark Evaluation

Getting Started

Load the Dataset

import pandas as pd

df = pd.read_csv("pashto_sentiment_dataset.csv")

print(df.head())

Class Distribution

print(df["sentiment"].value_counts())
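
Once the class counts look as expected, a stratified holdout keeps the class proportions similar in the training and test sets. This is a minimal sketch assuming the `sentence` and `sentiment` columns from the schema; the `stratified_split` helper, the 80/20 ratio, and the placeholder rows are illustrative and not part of the dataset.

```python
import pandas as pd

def stratified_split(df, label_col="sentiment", test_frac=0.2, seed=42):
    # Sample the same fraction from every class for the test set.
    test = df.groupby(label_col).sample(frac=test_frac, random_state=seed)
    # Everything that was not sampled stays in the training set.
    train = df.drop(test.index)
    return train, test

# Placeholder frame standing in for the real CSV (30 rows, 10 per class).
demo = pd.DataFrame({
    "sentence": [f"placeholder sentence {i}" for i in range(30)],
    "sentiment": ["positive", "negative", "neutral"] * 10,
})
train_df, test_df = stratified_split(demo)
print(train_df["sentiment"].value_counts())
print(test_df["sentiment"].value_counts())
```

With the real data, replace `demo` with the `df` loaded above. The helper relies on the index being unique, which holds for a frame freshly read with `pd.read_csv`.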

Recommended Models

Researchers may use this dataset with:

BERT
RoBERTa
XLM-RoBERTa
mBERT
DistilBERT
LLaMA-based classifiers
Traditional machine learning models
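
As a starting point for the "traditional machine learning" option, the sketch below wires up a character n-gram TF-IDF model with logistic regression in scikit-learn. The feature choice and hyperparameters are assumptions rather than recommendations tested on this dataset, and the three training rows are just the example records, used only to show the calls.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def build_baseline():
    # Character n-grams avoid the need for a Pashto-specific tokenizer.
    return make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
        LogisticRegression(max_iter=1000),
    )

# Tiny stand-in for the full dataset: the three example records.
train = pd.DataFrame({
    "sentence": [
        "افغانستان د کرکټ لوبډلې مهمه بریا ترلاسه کړه.",
        "د امنیتي پېښې له امله ګڼ شمېر خلک اغېزمن شول.",
        "نن په کابل کې د هوا درجه معتدله وه.",
    ],
    "sentiment": ["positive", "negative", "neutral"],
})

model = build_baseline()
model.fit(train["sentence"], train["sentiment"])
predictions = model.predict(train["sentence"])
print(list(predictions))
```

On the full dataset, fit on a training split and evaluate on held-out data; a baseline like this gives a reference score against which fine-tuned transformer models can be compared.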

Dataset Characteristics

The dataset is written in Standard Pashto.
The corpus is derived from professionally written Pashto-language content.
The dataset covers a diverse range of topics, including politics, economics, sports, technology, health, society, culture, and international affairs.
The language follows commonly accepted Pashto writing conventions and editorial standards.
The dataset is suitable for sentiment analysis, text classification, transfer learning, and Pashto NLP research.
The corpus provides broad linguistic coverage across multiple news domains.

Intended Use

This dataset is intended for:

Research
Education
Benchmarking
Model Development
Academic Projects
Industrial NLP Applications

Citation

If you use this dataset in your research, publications, or projects, please cite:

@dataset{khail2026pashto,
  author       = {Khairullah Ibrahim Khail},
  title        = {Pashto Sentiment Analysis Dataset},
  year         = {2026},
  publisher    = {Kaggle},
  url          = {https://www.kaggle.com/datasets/khairullahsahil/pashto-sentiment-dataset},
  version      = {1.0}
}

Khairullah Ibrahim Khail. Pashto Sentiment Analysis Dataset. Version 1.0. Kaggle, 2026.

Author

Khairullah Ibrahim Khail

Version History

Version 1.0

Initial public release
10,229 manually reviewed Pashto sentences
Three sentiment classes
CSV format

Acknowledgments

This dataset was created to support the growth of Pashto Natural Language Processing and encourage further research on low-resource languages.

Contributions, feedback, and research collaborations are welcome.

Khairullah Ibrahim Khail
