Churn prediction dataset with synthetic feedback and uplift labels

Overview

This dataset is an enhanced telecom customer churn benchmark designed for machine learning, business analytics, customer retention modelling, uplift modelling, and customer feedback analysis.

The original churn prediction task is extended with additional synthetic business layers, including customer feedback, retention campaign simulation, uplift labels, and estimated business cost indicators.

The goal is to make the dataset more useful for realistic customer analytics projects, where the objective is not only to predict churn, but also to understand customer behaviour, identify retention opportunities, and estimate the business impact of churn.

Dataset Objective

The main objective of this dataset is to support practical machine learning and business intelligence tasks related to customer churn and retention strategy.

Instead of focusing only on the question:

Will this customer churn?

this enhanced dataset also helps answer broader business questions such as:

Why might a customer churn?
Which customers should receive a retention offer?
Which customers are likely to respond to a campaign?
What is the expected business loss if a customer leaves?
Which customers are worth prioritizing for retention?

This makes the dataset suitable for both beginner and intermediate data science projects.

What the Dataset Contains

This dataset contains telecom customer information enriched with synthetic business data: customer usage behaviour, churn labels, customer feedback, marketing campaign information, uplift-related labels, and estimated business cost indicators.

The dataset is designed to support multiple types of analysis, including:

Churn prediction
Customer segmentation
Customer feedback analysis
Retention campaign analysis
Uplift modelling
Business cost estimation
Customer lifetime value analysis

Detailed file descriptions and variable explanations are provided separately.

Data Processing Summary

The dataset was prepared by cleaning and organizing the original telecom churn data, preserving the original training and testing structure, and creating a unified customer-level view.

Additional features were engineered to better represent customer behaviour and business value. These include usage-based indicators, customer value segments, churn risk levels, and support-related signals.
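As a rough illustration of how such engineered features could be derived, the sketch below assigns a churn risk level and a value segment from a few inputs. The function names, input fields, and thresholds are all assumptions made for this example; the dataset's actual rules are not documented here.

```python
def churn_risk_level(service_calls: int, has_international_plan: bool) -> str:
    """Map simple behaviour signals to a coarse risk level (hypothetical rule)."""
    score = 0
    # Assumed thresholds: repeated support contact raises the risk score.
    if service_calls >= 4:
        score += 2
    elif service_calls >= 2:
        score += 1
    # Assumed signal: holding an international plan adds one point.
    if has_international_plan:
        score += 1

    if score >= 2:
        return "high"
    if score == 1:
        return "medium"
    return "low"


def value_segment(monthly_charge: float) -> str:
    """Bucket a customer by monthly charge (hypothetical cut-offs)."""
    if monthly_charge >= 75.0:
        return "high_value"
    if monthly_charge >= 45.0:
        return "mid_value"
    return "low_value"
```

Keeping each indicator in its own small function makes the assumed thresholds easy to inspect and change when comparing them against the columns actually shipped with the dataset.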

Synthetic customer feedback was generated with rule-based logic driven by customer behaviour patterns such as service calls, international plan usage, pricing concerns, and churn status.
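To make the idea of rule-based feedback generation concrete, here is a minimal sketch. The field names (`service_calls`, `international_plan`, `monthly_charge`, `churned`), the thresholds, and the template sentences are hypothetical and are not taken from the dataset.

```python
def generate_feedback(customer: dict) -> str:
    """Return a templated feedback sentence from simple behaviour rules.

    All keys, thresholds, and sentences are illustrative assumptions.
    Rules are checked in order; the first match wins.
    """
    if customer["service_calls"] >= 4:
        return "I had to call support several times and my issue was not resolved."
    if customer["international_plan"] and customer["churned"]:
        return "The international plan was not worth what I paid for it."
    if customer["monthly_charge"] > 70.0:
        return "My monthly bill is higher than I expected."
    if customer["churned"]:
        return "I found a better offer elsewhere."
    return "The service works fine for me."


example = {
    "service_calls": 5,
    "international_plan": False,
    "monthly_charge": 40.0,
    "churned": True,
}
print(generate_feedback(example))
```

Because text produced this way is a deterministic function of the tabular features, an NLP model trained on it may largely relearn those rules, which is worth keeping in mind when interpreting sentiment or text-classification scores.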

A synthetic retention campaign layer was also created to simulate treatment and control groups, campaign channels, offer types, customer responses, post-campaign churn behaviour, and uplift categories.
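One way to picture how uplift categories can be assigned in a simulation is the sketch below. It assumes the common four-way taxonomy from the uplift modelling literature (persuadable, sure thing, lost cause, sleeping dog) and assumes that both outcomes, with and without the offer, are simulated for each customer. The category names used in this dataset may differ.

```python
def uplift_category(churns_if_treated: bool, churns_if_untreated: bool) -> str:
    """Classify a customer from two simulated outcomes (assumed taxonomy).

    In real campaigns only one of the two outcomes is ever observed;
    a simulation can generate both, which is what makes labels possible.
    """
    if churns_if_untreated and not churns_if_treated:
        return "persuadable"    # stays only because of the offer
    if not churns_if_untreated and churns_if_treated:
        return "sleeping_dog"   # the offer itself triggers churn
    if churns_if_untreated and churns_if_treated:
        return "lost_cause"     # churns regardless of the offer
    return "sure_thing"         # stays regardless of the offer


print(uplift_category(churns_if_treated=False, churns_if_untreated=True))
```

Under this taxonomy, retention budgets are usually best spent on persuadable customers, since the other three groups either do not need the offer or do not respond to it positively.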

Finally, business-oriented indicators were added to estimate customer value, retention cost, expected loss, and potential gain if the customer is retained.
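The sketch below shows one simple way such indicators could be computed. The formulas are assumptions chosen for illustration (expected loss as churn probability times customer value, potential gain as customer value minus retention cost), and the parameter names are hypothetical; the dataset may use different definitions.

```python
def expected_loss(churn_probability: float, customer_value: float) -> float:
    """Assumed formula: value at risk weighted by the chance of churn."""
    if not 0.0 <= churn_probability <= 1.0:
        raise ValueError("churn_probability must be between 0 and 1")
    return churn_probability * customer_value


def potential_gain(customer_value: float, retention_cost: float) -> float:
    """Assumed formula: value kept if the customer is retained, net of the offer cost."""
    return customer_value - retention_cost


# Hypothetical customer worth 1000 with a 50% churn probability and a 50 offer cost.
print(expected_loss(0.5, 1000.0))
print(potential_gain(1000.0, 50.0))
```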

Possible Use Cases

This dataset can be used for many machine learning and analytics projects, such as:

Building a customer churn prediction model
Comparing classification models such as Logistic Regression, Random Forest, XGBoost, LightGBM, or Neural Networks
Performing exploratory data analysis on churn behaviour
Studying the effect of customer service calls on churn
Analysing synthetic customer feedback using NLP techniques
Building sentiment analysis models
Creating customer retention dashboards
Testing uplift modelling approaches
Estimating business value and churn-related losses
Designing customer targeting strategies for retention campaigns

The dataset is especially useful for projects that combine tabular data, text data, and business decision-making.
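As a starting point for the uplift-related use cases, the average effect of a campaign can be estimated as the difference in post-campaign churn rates between the control and treatment groups. The sketch below assumes each record has a `treated` flag and a `churned_after` flag; both names are hypothetical.

```python
def churn_rate(rows: list) -> float:
    """Share of rows whose (hypothetical) 'churned_after' flag is true."""
    return sum(1 for r in rows if r["churned_after"]) / len(rows)


def estimated_uplift(rows: list) -> float:
    """Control churn rate minus treatment churn rate.

    A positive value means the treated group churned less often.
    """
    treated = [r for r in rows if r["treated"]]
    control = [r for r in rows if not r["treated"]]
    if not treated or not control:
        raise ValueError("both treatment and control groups must be non-empty")
    return churn_rate(control) - churn_rate(treated)


# Tiny made-up sample: 1 of 4 treated customers churn, 2 of 4 controls churn.
sample = (
    [{"treated": True, "churned_after": c} for c in (True, False, False, False)]
    + [{"treated": False, "churned_after": c} for c in (True, True, False, False)]
)
print(estimated_uplift(sample))
```

This difference is only meaningful as a causal estimate when group assignment is effectively random, so with simulated campaign data it is best read as a check on the simulation rather than as evidence about real campaigns.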

Important Notes

The customer feedback, campaign information, uplift labels, and business cost indicators are synthetically generated.

They are created for educational, benchmarking, and portfolio-building purposes. They should not be interpreted as real customer feedback, real campaign results, or real financial values.

The dataset is suitable for learning and experimentation, but it should not be used to make real business decisions without additional validation using real company data.

The uplift labels are intended to help users practice uplift modelling concepts in a simple and accessible way.

Limitations

This dataset has some important limitations.

i. Part of the dataset is synthetic, especially the customer feedback, retention campaign information, uplift labels, and business cost indicators.

ii. The campaign outcomes are simulated using rule-based assumptions, not real randomized marketing experiments.

iii. The customer feedback texts are generated from customer behaviour patterns, so they may not fully capture the complexity of real customer language.

iv. The business cost values are estimated and simplified. They are useful for learning, but they do not represent real telecom financial calculations.

v. The dataset should be treated as an educational and benchmarking dataset, not as a production-ready business dataset.

Suggested Applications

This dataset is recommended for:

Machine learning practice
Churn prediction projects
Customer analytics dashboards
NLP + tabular data projects
Uplift modelling experiments
Business intelligence portfolios
Data science case studies
End-to-end ML pipeline demonstrations
Educational projects in marketing analytics

It can also be used to build a complete data science project, from data exploration and preprocessing to modelling, evaluation, interpretation, and business recommendations.
