High-Quality Pashto Sentiment Dataset for Machine Learning and Natural Language
Dataset Description
Pashto Sentiment Analysis Dataset
Overview
The Pashto Sentiment Dataset is a large-scale, manually reviewed corpus developed for sentiment analysis and text classification in Pashto.
The dataset contains 10,229 Pashto sentences annotated with one of three sentiment labels:
- Positive
- Negative
- Neutral
Pashto is a low-resource language with few publicly available NLP resources. This dataset aims to support research, education, and industrial applications in sentiment analysis, natural language understanding, and machine learning for Pashto.
Why This Dataset?
Despite being spoken by tens of millions of people, Pashto remains underrepresented in modern NLP research. The lack of high-quality annotated datasets is a significant obstacle to building robust language technologies.
This dataset was created to help researchers and developers:
- Build sentiment classification systems
- Fine-tune transformer models
- Benchmark NLP algorithms
- Develop Pashto language technologies
- Support low-resource language research
Key Features
✅ 10,229 manually reviewed samples
✅ Three sentiment categories
✅ Roughly balanced class distribution
✅ Real-world Pashto language content
✅ Suitable for machine learning and deep learning
✅ Ready-to-use CSV format
✅ Designed for academic and industrial research
Dataset Statistics
| Metric | Value |
|---|---|
| Total Samples | 10,229 |
| Language | Pashto |
| Task | Sentiment Analysis |
| Classes | 3 |
| Format | CSV |
| Version | 1.0 |
Class Distribution
| Sentiment | Count |
|---|---|
| Positive | ~3,549 |
| Negative | ~3,324 |
| Neutral | ~3,355 |
The three classes are roughly balanced, each accounting for about one third of the samples, which makes the dataset suitable for supervised learning experiments.
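As a quick check on the balance claim, the sketch below computes each class's share and the largest-to-smallest class ratio from the approximate counts in the table above. The helper names `class_shares` and `imbalance_ratio` are illustrative, and because the counts are approximate, the results are too.

```python
# Approximate per-class counts as reported in the class distribution table.
reported_counts = {"positive": 3549, "negative": 3324, "neutral": 3355}


def class_shares(counts):
    """Return each class's share of the total as a fraction between 0 and 1."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def imbalance_ratio(counts):
    """Ratio of the largest class to the smallest; 1.0 means perfectly balanced."""
    return max(counts.values()) / min(counts.values())


shares = class_shares(reported_counts)
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"imbalance ratio: {imbalance_ratio(reported_counts):.3f}")
```

A ratio close to 1.0 suggests that plain accuracy is a reasonable first metric, although per-class precision and recall are still worth reporting.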
Dataset Schema
| Column | Description |
|---|---|
| sentence | Pashto sentence or text sample |
| sentiment | Sentiment label (positive, negative, neutral) |
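A loaded file can be checked against this schema before training. The following is a minimal sketch, assuming the labels are stored as the lowercase strings shown in the table. `validate_schema` is a hypothetical helper, not something shipped with the dataset, and the sample rows are taken from the example records in this card.

```python
import pandas as pd

EXPECTED_COLUMNS = ["sentence", "sentiment"]
EXPECTED_LABELS = {"positive", "negative", "neutral"}  # assumed to be lowercase


def validate_schema(df: pd.DataFrame) -> list:
    """Return a list of problems found; an empty list means the frame looks valid."""
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        # Without the expected columns, the remaining checks cannot run.
        return [f"missing columns: {missing}"]

    problems = []
    if df["sentence"].isna().any():
        problems.append("null values in 'sentence'")
    if df["sentiment"].isna().any():
        problems.append("null values in 'sentiment'")
    unexpected = set(df["sentiment"].dropna().unique()) - EXPECTED_LABELS
    if unexpected:
        problems.append(f"unexpected labels: {sorted(map(str, unexpected))}")
    return problems


sample = pd.DataFrame(
    {
        "sentence": [
            "افغانستان د کرکټ لوبډلې مهمه بریا ترلاسه کړه.",
            "نن په کابل کې د هوا درجه معتدله وه.",
        ],
        "sentiment": ["positive", "neutral"],
    }
)
print(validate_schema(sample))  # an empty list means no problems were found
```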
Example Records
| Sentence | Label |
|---|---|
| افغانستان د کرکټ لوبډلې مهمه بریا ترلاسه کړه. | positive |
| د امنیتي پېښې له امله ګڼ شمېر خلک اغېزمن شول. | negative |
| نن په کابل کې د هوا درجه معتدله وه. | neutral |
Data Collection
The dataset was compiled from publicly available Pashto-language content covering multiple domains, including:
- Politics
- International Affairs
- Economy
- Sports
- Society
- Technology
- Culture
- Health
The diverse coverage helps improve the dataset's usefulness across different NLP applications.
Annotation Process
The annotation workflow consisted of:
Step 1 — Data Collection
Pashto text samples were gathered from publicly available sources such as BBC Pashto, VOA Pashto, and Shamshad News.
Step 2 — Initial Labeling
Each sentence was assigned a sentiment category based on its overall semantic meaning.
Step 3 — Manual Review
All labels were reviewed and validated by the dataset creator.
Step 4 — Quality Assurance
The final dataset underwent cleaning and validation procedures before release.
Data Quality
Several quality-control measures were applied:
- Duplicate removal
- Missing-value removal
- Text normalization
- Formatting consistency checks
- Label verification
- Manual review
These steps were performed to improve dataset reliability and usability.
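The card does not describe how these steps were implemented, so the following is only an illustrative sketch of how such checks could look, not the author's actual pipeline. It assumes that normalization means Unicode NFC plus whitespace collapsing and that duplicates are exact matches after normalization. The names `normalize_text` and `clean_records` are hypothetical.

```python
import unicodedata

VALID_LABELS = {"positive", "negative", "neutral"}


def normalize_text(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())


def clean_records(records):
    """Filter (sentence, label) pairs: drop missing values, bad labels, and duplicates."""
    seen = set()
    cleaned = []
    for sentence, label in records:
        if not sentence or not label:
            continue  # missing-value removal
        sentence = normalize_text(sentence)
        label = label.strip().lower()  # formatting consistency
        if not sentence or label not in VALID_LABELS:
            continue  # label verification
        if sentence in seen:
            continue  # duplicate removal (exact match after normalization)
        seen.add(sentence)
        cleaned.append((sentence, label))
    return cleaned


example = "نن په کابل کې د هوا درجه معتدله وه."
raw = [
    ("  " + example + "  ", "Neutral"),  # stray whitespace and inconsistent casing
    (example, "neutral"),                # duplicate once normalized
    (None, "positive"),                  # missing sentence
    ("abc", "happy"),                    # label outside the three classes
]
print(clean_records(raw))
```

Letter-level normalization is deliberately left out of this sketch: Pashto distinguishes several forms of the letter ye, so folding characters together without care can change the text.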
Potential Applications
Natural Language Processing
- Sentiment Analysis
- Opinion Mining
- Text Classification
- Language Understanding
- Content Analysis
Machine Learning
- Multi-Class Classification
- Deep Learning
- Transformer Fine-Tuning
- Transfer Learning
Research
- Low-Resource NLP
- Cross-Lingual Learning
- Pashto Language Technology
- Benchmark Evaluation
Getting Started
Load the Dataset
import pandas as pd

# Load the CSV from the current working directory.
df = pd.read_csv("pashto_sentiment_dataset.csv")
print(df.head())  # preview the first five rows
Class Distribution
print(df["sentiment"].value_counts())
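Before training, it is common to hold out a test set that keeps the same class proportions as the full data. The sketch below shows one way to do this with pandas alone. The dataset does not ship with an official split, so the 80/20 ratio, the seed, and the helper name `stratified_split` are all assumptions, and a toy frame stands in for the real CSV.

```python
import pandas as pd


def stratified_split(df, label_col="sentiment", test_frac=0.2, seed=42):
    """Sample test_frac of each class for the test set; the rest becomes the train set.

    Assumes df has a unique index, as a freshly loaded CSV does.
    """
    test = df.groupby(label_col).sample(frac=test_frac, random_state=seed)
    train = df.drop(test.index)
    return train, test


# Toy frame standing in for the real CSV: 10 placeholder rows per class.
toy = pd.DataFrame(
    {
        "sentence": [f"sentence {i}" for i in range(30)],
        "sentiment": ["positive", "negative", "neutral"] * 10,
    }
)
train_df, test_df = stratified_split(toy)
print(len(train_df), len(test_df))
print(test_df["sentiment"].value_counts())
```

Fixing the seed keeps the split reproducible, which matters when results from different models are compared on the same held-out rows.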
Recommended Models
Researchers may use this dataset with:
- BERT
- RoBERTa
- XLM-RoBERTa
- mBERT
- DistilBERT
- LLaMA-based classifiers
- Traditional machine learning models
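Most of the models listed above expect integer class ids rather than string labels. A minimal sketch of that mapping follows. The id order is an arbitrary assumption, and the names `label2id` and `id2label` follow a common convention rather than anything defined by the dataset.

```python
LABELS = ["negative", "neutral", "positive"]  # id order chosen arbitrarily

label2id = {label: idx for idx, label in enumerate(LABELS)}
id2label = {idx: label for label, idx in label2id.items()}


def encode_labels(labels):
    """Convert string sentiment labels to integer ids, failing loudly on unknown labels."""
    unknown = sorted(set(labels) - set(label2id))
    if unknown:
        raise ValueError(f"unknown labels: {unknown}")
    return [label2id[label] for label in labels]


def decode_ids(ids):
    """Convert integer ids (for example, model predictions) back to string labels."""
    return [id2label[i] for i in ids]


encoded = encode_labels(["positive", "neutral", "negative"])
print(encoded)
print(decode_ids(encoded))
```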
Dataset Characteristics
- The dataset is written in Standard Pashto.
- The corpus is derived from professionally written Pashto-language content.
- The dataset covers a diverse range of topics, including politics, economics, sports, technology, health, society, culture, and international affairs.
- The language follows commonly accepted Pashto writing conventions and editorial standards.
- The dataset is suitable for sentiment analysis, text classification, transfer learning, and Pashto NLP research.
- The corpus provides broad linguistic coverage across multiple news domains.
Intended Use
This dataset is intended for:
- Research
- Education
- Benchmarking
- Model Development
- Academic Projects
- Industrial NLP Applications
Citation
If you use this dataset in your research, publications, or projects, please cite:
@dataset{khail2026pashto,
author = {Khairullah Ibrahim Khail},
title = {Pashto Sentiment Analysis Dataset},
year = {2026},
publisher = {Kaggle},
url = {https://www.kaggle.com/datasets/khairullahsahil/pashto-sentiment-dataset},
version = {1.0}
}
Khairullah Ibrahim Khail
Pashto Sentiment Analysis Dataset
Version 1.0. Kaggle, 2026.
Author
Khairullah Ibrahim Khail
Version History
Version 1.0
- Initial public release
- 10,229 manually reviewed Pashto sentences
- Three sentiment classes
- CSV format
Acknowledgments
This dataset was created to support the growth of Pashto Natural Language Processing and encourage further research on low-resource languages.
Contributions, feedback, and research collaborations are welcome.