E-commerce Sales Prediction Dataset
This repository contains a comprehensive and clean dataset for predicting e-commerce sales, tailored for data scientists, machine learning enthusiasts, and researchers. The dataset is crafted to analyze sales trends, optimize pricing strategies, and develop predictive models for sales forecasting.
Dataset Overview
The dataset includes 1,000 records across the following features:
| Column Name | Description |
|---|---|
| Date | The date of the sale (01-01-2023 onward). |
| Product_Category | Category of the product (e.g., Electronics, Sports, Other). |
| Price | Price of the product (numerical). |
| Discount | Discount applied to the product (numerical). |
| Customer_Segment | Buyer segment (e.g., Regular, Occasional, Other). |
| Marketing_Spend | Marketing budget allocated for sales (numerical). |
| Units_Sold | Number of units sold per transaction (numerical). |
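As a quick check of the schema above, the following is a minimal sketch of a loader that verifies the documented columns and parses `Date`. The helper name `load_sales`, the inline two-row sample, and the month-first date format (inferred from the `12-31-2023` end date in this README) are assumptions for illustration, not part of the dataset itself.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the documented layout; real values will differ.
SAMPLE_CSV = """Date,Product_Category,Price,Discount,Customer_Segment,Marketing_Spend,Units_Sold
01-01-2023,Electronics,505.10,24.90,Regular,4910.00,30
01-02-2023,Sports,113.07,5.00,Occasional,2410.00,24
"""

EXPECTED_COLUMNS = [
    "Date", "Product_Category", "Price", "Discount",
    "Customer_Segment", "Marketing_Spend", "Units_Sold",
]


def load_sales(source, date_format="%m-%d-%Y"):
    """Read the CSV, check the documented columns, and parse Date."""
    df = pd.read_csv(source)
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    # Assumption: month-first dates; use "%d-%m-%Y" if the file is day-first.
    df["Date"] = pd.to_datetime(df["Date"], format=date_format)
    return df


df = load_sales(io.StringIO(SAMPLE_CSV))
print(df.dtypes)
```

To use it on the real file, pass the CSV path instead of the in-memory sample, for example `load_sales('ecommerce_sales.csv')`.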
Data Summary
General Properties
Date:
- Range: 01-01-2023 to 12-31-2023.
- Contains 1,000 unique values without missing data.
Product_Category:
- Categories: Electronics (21%), Sports (21%), Other (58%).
- Most common named category: Electronics (21%).
Price:
- Range: From 14.59 to 999.
- Mean: 505, Standard Deviation: 290.
- Most common price range: 14.59 - 113.07.
Discount:
- Range: From 0.01% to 49.92%.
- Mean: 24.9%, Standard Deviation: 14.4%.
- Most common discount range: 0.01 - 5.00%.
Customer_Segment:
- Segments: Regular (35%), Occasional (34%), Other (31%).
- Most common segment: Regular.
Marketing_Spend:
- Range: From 2.41k to 10k.
- Mean: 4.91k, Standard Deviation: 2.84k.
Units_Sold:
- Range: From 5 to 57.
- Mean: 29.6, Standard Deviation: 7.26.
- Most common range: 24 - 34 units sold.
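The figures above can be recomputed with a few pandas calls. The sketch below runs on a hypothetical four-row sample rather than the real file; the helper names and the bin count used for the "most common range" are assumptions, so the bin edges it reports may not match the ones quoted in this README.

```python
import pandas as pd

# Hypothetical mini-sample in the documented layout; not rows from the real file.
df = pd.DataFrame({
    "Product_Category": ["Electronics", "Sports", "Electronics", "Other"],
    "Price": [100.0, 300.0, 500.0, 900.0],
    "Units_Sold": [20, 30, 25, 35],
})


def summarize_numeric(frame, columns):
    """Min, max, mean and sample standard deviation, one row per column."""
    return frame[columns].agg(["min", "max", "mean", "std"]).T


def category_shares(frame, column):
    """Percentage share of each category, largest first."""
    return (frame[column].value_counts(normalize=True) * 100).round(1)


def most_common_bin(frame, column, bins=10):
    """Equal-width bin holding the most rows (the bin count is an assumption)."""
    return pd.cut(frame[column], bins=bins).value_counts().idxmax()


numeric_summary = summarize_numeric(df, ["Price", "Units_Sold"])
shares = category_shares(df, "Product_Category")
print(numeric_summary)
print(shares)
print(most_common_bin(df, "Price", bins=2))
```

The same helpers apply unchanged to `Discount`, `Marketing_Spend`, and `Customer_Segment` once the full CSV is loaded.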
Data Visualizations
The dataset is suitable for creating the following visualizations:
1. Price Distribution: Histogram to show the spread of prices.
2. Discount Distribution: Histogram to analyze promotional offers.
3. Marketing Spend Distribution: Histogram to understand marketing investment patterns.
4. Customer Segment Distribution: Bar plot of customer segments.
5. Price vs Units Sold: Scatter plot to show pricing effects on sales.
6. Discount vs Units Sold: Scatter plot to explore the impact of discounts.
7. Marketing Spend vs Units Sold: Scatter plot for marketing effectiveness.
8. Correlation Heatmap: Identify relationships between features.
9. Pairplot: Visualize pairwise feature interactions.
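A minimal sketch of three of the plots listed above (price histogram, discount vs units sold scatter, correlation heatmap) using Matplotlib. The stand-in data is randomly generated with the documented column names, and the function name `plot_overview` and all styling choices are assumptions; swap in the loaded CSV for real use.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; drop this line for interactive use

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in data with the documented column names.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Price": rng.uniform(15, 999, 200),
    "Discount": rng.uniform(0, 50, 200),
    "Marketing_Spend": rng.uniform(2410, 10000, 200),
    "Units_Sold": rng.integers(5, 58, 200),
})


def plot_overview(frame):
    """Draw a histogram, a scatter plot, and a correlation heatmap side by side."""
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    # Price distribution
    axes[0].hist(frame["Price"], bins=20)
    axes[0].set(title="Price Distribution", xlabel="Price", ylabel="Count")

    # Discount vs units sold
    axes[1].scatter(frame["Discount"], frame["Units_Sold"], s=10)
    axes[1].set(title="Discount vs Units Sold",
                xlabel="Discount (%)", ylabel="Units_Sold")

    # Correlation heatmap over the numeric columns
    corr = frame.corr()
    image = axes[2].imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
    axes[2].set_xticks(range(len(corr.columns)))
    axes[2].set_xticklabels(corr.columns, rotation=45, ha="right")
    axes[2].set_yticks(range(len(corr.columns)))
    axes[2].set_yticklabels(corr.columns)
    axes[2].set_title("Correlation Heatmap")
    fig.colorbar(image, ax=axes[2])

    fig.tight_layout()
    return fig


fig = plot_overview(df)
```

Calling `fig.savefig("overview.png")` writes the figure to disk. The remaining plots in the list follow the same pattern with a different column or plot type.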
How the Data Was Created
The dataset is synthetically generated to mimic realistic e-commerce sales trends. Below are the steps taken for data generation:
- Feature Engineering:
  - Identified key attributes such as product category, price, discount, and marketing spend, typically observed in e-commerce data.
  - Generated dependent features like units sold based on logical relationships.
- Data Simulation:
  - Python Libraries: Used NumPy and Pandas to generate and distribute values.
  - Statistical Modeling: Ensured feature distributions aligned with real-world sales data patterns.
- Validation:
  - Verified data consistency with no missing or invalid values.
  - Ensured logical correlations (e.g., higher discounts → increased units sold).
Note: The dataset is synthetic and not sourced from any real-world e-commerce platform.
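The generation script itself is not included here, so the following is only an illustrative sketch of how such a dataset could be simulated with NumPy and Pandas. Every distribution, coefficient, and the function name `generate_sales` is an assumption chosen to give a positive discount effect on units sold; it is not the procedure actually used for this dataset, and its output will not reproduce the published statistics.

```python
import numpy as np
import pandas as pd


def generate_sales(n_rows=1000, seed=42):
    """Illustrative generator; all distributions and coefficients are assumptions."""
    rng = np.random.default_rng(seed)

    price = rng.uniform(15, 1000, n_rows).round(2)
    discount = rng.uniform(0, 50, n_rows).round(2)
    marketing = rng.uniform(100, 10_000, n_rows).round(2)

    # Assumed relationship: discounts and marketing lift demand, price lowers it.
    expected_units = 25 + 0.2 * discount + 0.0005 * marketing - 0.005 * price
    units = rng.poisson(np.clip(expected_units, 1, None))

    return pd.DataFrame({
        # Assumption: one row per consecutive day starting 2023-01-01.
        "Date": pd.date_range("2023-01-01", periods=n_rows, freq="D"),
        "Product_Category": rng.choice(
            ["Electronics", "Sports", "Other"], n_rows, p=[0.21, 0.21, 0.58]),
        "Price": price,
        "Discount": discount,
        "Customer_Segment": rng.choice(
            ["Regular", "Occasional", "Other"], n_rows, p=[0.35, 0.34, 0.31]),
        "Marketing_Spend": marketing,
        "Units_Sold": units,
    })


synthetic = generate_sales()

# Validation in the spirit of the steps above: no missing values,
# and a positive discount/units correlation.
print(synthetic.isna().sum())
print(synthetic[["Discount", "Units_Sold"]].corr())
```

Fixing the seed keeps the output reproducible, which makes it easier to compare model results across runs.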
Example Usage: Sales Prediction Model
Here's an example of building a predictive model using Linear Regression:
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load the dataset
df = pd.read_csv('ecommerce_sales.csv')

# Feature selection
X = df[['Price', 'Discount', 'Marketing_Spend']]
y = df['Units_Sold']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model training
model = LinearRegression()
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse:.2f}')
print(f'R-squared: {r2:.2f}')
```