Customer Churn, Uplift & Feedback Dataset

Kaggle

@kaggle.harrachimustapha_customer_churn_uplift_and_feedback_dataset

Churn prediction dataset with synthetic feedback and uplift labels

Dataset Description

Overview

This dataset is an enhanced telecom customer churn benchmark designed for machine learning, business analytics, customer retention modelling, uplift modelling, and customer feedback analysis.

The original churn prediction task is extended with additional synthetic business layers, including customer feedback, retention campaign simulation, uplift labels, and estimated business cost indicators.

The goal is to make the dataset more useful for realistic customer analytics projects, where the objective is not only to predict churn, but also to understand customer behaviour, identify retention opportunities, and estimate the business impact of churn.

Dataset Objective

The main objective of this dataset is to support practical machine learning and business intelligence tasks related to customer churn and retention strategy.

Instead of focusing only on the question "Will this customer churn?", this enhanced dataset also helps answer broader business questions such as:

  • Why might a customer churn?
  • Which customers should receive a retention offer?
  • Which customers are likely to respond to a campaign?
  • What is the expected business loss if a customer leaves?
  • Which customers are worth prioritizing for retention?

This makes the dataset suitable for both beginner and intermediate data science projects.


What the Dataset Contains

This dataset contains telecom customer information enriched with additional synthetic business data.

It includes customer usage behaviour, churn labels, customer feedback, marketing campaign information, uplift-related labels, and estimated business cost indicators.

The dataset is designed to support multiple types of analysis, including:

  • Churn prediction
  • Customer segmentation
  • Customer feedback analysis
  • Retention campaign analysis
  • Uplift modelling
  • Business cost estimation
  • Customer lifetime value analysis

Detailed file descriptions and variable explanations are provided separately.


Data Processing Summary

The dataset was prepared by cleaning and organizing the original telecom churn data, preserving the original training and testing structure, and creating a unified customer-level view.

Additional features were engineered to better represent customer behaviour and business value. These include usage-based indicators, customer value segments, churn risk levels, and support-related signals.

Synthetic customer feedback was generated using rule-based logic based on customer behaviour patterns such as service calls, international plan usage, pricing concerns, and churn status.
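To make the idea of rule-based feedback generation concrete, here is a minimal sketch in Python. It is not the generator used to build this dataset: the field names, thresholds, and feedback sentences are all hypothetical, chosen only to show how behaviour signals such as service calls, international plan usage, pricing, and churn status could be mapped to a text.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    # All field names are hypothetical; the real column names may differ.
    service_calls: int
    has_international_plan: bool
    monthly_charge: float
    churned: bool


def generate_feedback(c: Customer) -> str:
    """Return a canned feedback sentence chosen by simple, ordered rules."""
    # The thresholds (4 calls, a charge of 70.0) are illustrative assumptions.
    if c.service_calls >= 4:
        return "I had to call support several times and my issue was not resolved."
    if c.has_international_plan and c.churned:
        return "The international plan did not give me good value."
    if c.monthly_charge > 70.0:
        return "The monthly price is too high for what I use."
    if c.churned:
        return "I found a better offer elsewhere."
    return "The service works fine for me."


if __name__ == "__main__":
    print(generate_feedback(Customer(5, False, 30.0, False)))
```

Because the rules are checked in order, the first matching signal decides the text. Feedback produced this way is strongly correlated with the features that triggered it, which is worth keeping in mind when evaluating NLP models on the text.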

A synthetic retention campaign layer was also created to simulate treatment and control groups, campaign channels, offer types, customer responses, post-campaign churn behaviour, and uplift categories.
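As an illustration of how treatment and control groups relate to uplift, the sketch below estimates the campaign effect as the difference in post-campaign churn rate between the two groups. It also assigns the four category names conventionally used in uplift modelling. The keys `group` and `churned_after`, the sample rows, and the category names are assumptions for this example; the dataset's actual columns and labels may differ.

```python
def churn_rate(rows, group):
    """Share of customers in `group` who churned after the campaign."""
    members = [r for r in rows if r["group"] == group]
    if not members:
        raise ValueError(f"no customers in group {group!r}")
    return sum(r["churned_after"] for r in members) / len(members)


def estimated_uplift(rows):
    """Churn reduction attributed to the campaign (control minus treatment)."""
    return churn_rate(rows, "control") - churn_rate(rows, "treatment")


def uplift_category(churns_if_untreated, churns_if_treated):
    """Conventional four-way uplift label.

    Both outcomes are only known in a simulation; with real data a customer
    is observed under one condition only.
    """
    if churns_if_untreated and not churns_if_treated:
        return "persuadable"
    if churns_if_untreated and churns_if_treated:
        return "lost cause"
    if not churns_if_untreated and churns_if_treated:
        return "sleeping dog"
    return "sure thing"


# Hypothetical toy data: 1 means the customer churned after the campaign.
sample = [
    {"group": "control", "churned_after": 1},
    {"group": "control", "churned_after": 1},
    {"group": "control", "churned_after": 0},
    {"group": "control", "churned_after": 0},
    {"group": "treatment", "churned_after": 1},
    {"group": "treatment", "churned_after": 0},
    {"group": "treatment", "churned_after": 0},
    {"group": "treatment", "churned_after": 0},
]

if __name__ == "__main__":
    print(estimated_uplift(sample))
```

A positive value means the treated group churned less than the control group. Since the campaign here is simulated by rules rather than a randomized experiment, such an estimate mostly recovers the assumptions built into the simulation.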

Finally, business-oriented indicators were added to estimate customer value, retention cost, expected loss, and potential gain if the customer is retained.
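One simple way such indicators can be related is sketched below. The formulas are assumptions for illustration, not the calculations used to build this dataset: expected loss is taken as churn probability times customer value, and potential gain as customer value minus retention cost. All function and parameter names are hypothetical.

```python
def expected_loss(churn_probability: float, customer_value: float) -> float:
    """Value at risk: probability of churn times the customer's value."""
    if not 0.0 <= churn_probability <= 1.0:
        raise ValueError("churn_probability must be between 0 and 1")
    return churn_probability * customer_value


def potential_gain(customer_value: float, retention_cost: float) -> float:
    """Net value kept if the customer is retained, after paying for the offer."""
    return customer_value - retention_cost


if __name__ == "__main__":
    # Hypothetical customer: 25% churn risk, value 800.0, offer cost 50.0.
    print(expected_loss(0.25, 800.0))
    print(potential_gain(800.0, 50.0))
```

Under these assumptions, comparing the two figures per customer gives a rough basis for prioritizing retention offers: customers with a high expected loss and a positive potential gain would be the first candidates.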


Possible Use Cases

This dataset can be used for many machine learning and analytics projects, such as:

  • Building a customer churn prediction model
  • Comparing classification models such as Logistic Regression, Random Forest, XGBoost, LightGBM, or Neural Networks
  • Performing exploratory data analysis on churn behaviour
  • Studying the effect of customer service calls on churn
  • Analysing synthetic customer feedback using NLP techniques
  • Building sentiment analysis models
  • Creating customer retention dashboards
  • Testing uplift modelling approaches
  • Estimating business value and churn-related losses
  • Designing customer targeting strategies for retention campaigns

The dataset is especially useful for projects that combine tabular data, text data, and business decision-making.


Important Notes

The customer feedback, campaign information, uplift labels, and business cost indicators are synthetically generated.

They are created for educational, benchmarking, and portfolio-building purposes. They should not be interpreted as real customer feedback, real campaign results, or real financial values.

The dataset is suitable for learning and experimentation, but it should not be used to make real business decisions without additional validation using real company data.

The uplift labels in particular are intended to help users practice uplift modelling concepts in a simple and accessible way.


Limitations

This dataset has some important limitations.

i. Part of the dataset is synthetic, especially the customer feedback, retention campaign information, uplift labels, and business cost indicators.

ii. The campaign outcomes are simulated using rule-based assumptions, not real randomized marketing experiments.

iii. The customer feedback texts are generated from customer behaviour patterns, so they may not fully capture the complexity of real customer language.

iv. The business cost values are estimated and simplified. They are useful for learning, but they do not represent real telecom financial calculations.

v. The dataset should be treated as an educational and benchmarking dataset, not as a production-ready business dataset.


Suggested Applications

This dataset is recommended for:

  • Machine learning practice
  • Churn prediction projects
  • Customer analytics dashboards
  • NLP + tabular data projects
  • Uplift modelling experiments
  • Business intelligence portfolios
  • Data science case studies
  • End-to-end ML pipeline demonstrations
  • Educational projects in marketing analytics

It can also be used to build a complete data science project, from data exploration and preprocessing to modelling, evaluation, interpretation, and business recommendations.

