Name: E-commerce Sales Prediction Dataset
Creator: Kaggle
License: https://creativecommons.org/publicdomain/zero/1.0/

About this Dataset

E-commerce Sales Prediction Dataset

This repository contains a clean, synthetic dataset for predicting e-commerce sales, aimed at data scientists, machine learning enthusiasts, and researchers. It is designed for analyzing sales trends, optimizing pricing strategies, and developing sales forecasting models.

📂 Dataset Overview

The dataset contains 1,000 records with the following seven columns:

Column Name	Description
Date	The date of the sale (01-01-2023 onward).
Product_Category	Category of the product (e.g., Electronics, Sports, Other).
Price	Price of the product (numerical).
Discount	Discount applied to the product (numerical).
Customer_Segment	Buyer segment (e.g., Regular, Occasional, Other).
Marketing_Spend	Marketing budget allocated for sales (numerical).
Units_Sold	Number of units sold per transaction (numerical).
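
As a minimal sketch, the columns above could be loaded with explicit types as follows. The `load_sales` helper and the two sample rows are hypothetical, and the `%m-%d-%Y` date format is an assumption based on the dates shown in this card. For the real data, pass the path to the CSV file instead of the in-memory sample.

```python
import io

import pandas as pd

# Two made-up rows in the documented column layout, for illustration only.
SAMPLE_CSV = """Date,Product_Category,Price,Discount,Customer_Segment,Marketing_Spend,Units_Sold
01-01-2023,Electronics,499.99,10.5,Regular,2500.0,30
01-02-2023,Sports,120.00,25.0,Occasional,4800.5,27
"""


def load_sales(source):
    """Read the sales CSV and give each column an explicit dtype."""
    df = pd.read_csv(source)
    # Assumption: dates are stored as MM-DD-YYYY strings.
    df["Date"] = pd.to_datetime(df["Date"], format="%m-%d-%Y")
    # The two text columns hold a small set of labels, so store them as categories.
    for col in ("Product_Category", "Customer_Segment"):
        df[col] = df[col].astype("category")
    return df


df = load_sales(io.StringIO(SAMPLE_CSV))
print(df.dtypes)
```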

📊 Data Summary

General Properties

Date:

Range: 01-01-2023 to 12-31-2023.
Contains 1,000 unique values without missing data.

Product_Category:

Categories: Electronics (21%), Sports (21%), Other (58%).
Most common named category: Electronics (21%), tied with Sports; Other (58%) is the largest group overall.

Price:

Range: From 244 to 999.
Mean: 505, Standard Deviation: 290.
Most common price range: 14.59 - 113.07.

Discount:

Range: From 0.01% to 49.92%.
Mean: 24.9%, Standard Deviation: 14.4%.
Most common discount range: 0.01 - 5.00%.

Customer_Segment:

Segments: Regular (35%), Occasional (34%), Other (31%).
Most common segment: Regular.

Marketing_Spend:

Range: From 2.41k to 10k.
Mean: 4.91k, Standard Deviation: 2.84k.

Units_Sold:

Range: From 5 to 57.
Mean: 29.6, Standard Deviation: 7.26.
Most common range: 24 - 34 units sold.
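
The figures in this summary can be recomputed directly from the data. The sketch below assumes the column names listed in the overview. `summarize` is a hypothetical helper and the four rows are made up for illustration, so the printed values will not match the summary above.

```python
import pandas as pd


def summarize(df):
    """Return numeric summary statistics and category shares in percent."""
    numeric = df[["Price", "Discount", "Marketing_Spend", "Units_Sold"]].agg(
        ["min", "max", "mean", "std"]
    )
    shares = {
        col: (df[col].value_counts(normalize=True) * 100).round(1)
        for col in ("Product_Category", "Customer_Segment")
    }
    return numeric, shares


# Made-up rows for illustration; for the real data use
# df = pd.read_csv('ecommerce_sales.csv')
df = pd.DataFrame(
    {
        "Price": [100.0, 200.0, 300.0, 400.0],
        "Discount": [5.0, 10.0, 15.0, 20.0],
        "Marketing_Spend": [1000.0, 2000.0, 3000.0, 4000.0],
        "Units_Sold": [20, 25, 30, 35],
        "Product_Category": ["Electronics", "Sports", "Electronics", "Other"],
        "Customer_Segment": ["Regular", "Occasional", "Regular", "Other"],
    }
)

numeric, shares = summarize(df)
print(numeric)
print(shares["Product_Category"])
```

Note that pandas reports the sample standard deviation (`ddof=1`) by default.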

📈 Data Visualizations

The dataset is suitable for creating the following visualizations:

1. Price Distribution: Histogram to show the spread of prices.
2. Discount Distribution: Histogram to analyze promotional offers.
3. Marketing Spend Distribution: Histogram to understand marketing investment patterns.
4. Customer Segment Distribution: Bar plot of customer segments.
5. Price vs Units Sold: Scatter plot to show pricing effects on sales.
6. Discount vs Units Sold: Scatter plot to explore the impact of discounts.
7. Marketing Spend vs Units Sold: Scatter plot for marketing effectiveness.
8. Correlation Heatmap: Identify relationships between features.
9. Pairplot: Visualize pairwise feature interactions.
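
A few of these plots could be drawn with Matplotlib as sketched below. `plot_overview` is a hypothetical helper, and the data frame is filled with random placeholder values rather than the real dataset. The heatmap is built from `DataFrame.corr()` and `imshow`, which is one of several ways to draw it.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; drop this line in a notebook

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

NUMERIC_COLS = ["Price", "Discount", "Marketing_Spend", "Units_Sold"]


def plot_overview(df):
    """Draw a price histogram, a discount scatter plot, and a correlation heatmap."""
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    axes[0].hist(df["Price"], bins=20)
    axes[0].set(title="Price Distribution", xlabel="Price", ylabel="Count")

    axes[1].scatter(df["Discount"], df["Units_Sold"], alpha=0.6)
    axes[1].set(title="Discount vs Units Sold", xlabel="Discount", ylabel="Units_Sold")

    corr = df[NUMERIC_COLS].corr()
    image = axes[2].imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
    axes[2].set_xticks(range(len(NUMERIC_COLS)))
    axes[2].set_xticklabels(NUMERIC_COLS, rotation=45, ha="right")
    axes[2].set_yticks(range(len(NUMERIC_COLS)))
    axes[2].set_yticklabels(NUMERIC_COLS)
    axes[2].set_title("Correlation Heatmap")
    fig.colorbar(image, ax=axes[2])

    fig.tight_layout()
    return fig, corr


# Random placeholder data; for the real data use pd.read_csv('ecommerce_sales.csv').
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    {
        "Price": rng.uniform(10, 1000, 50),
        "Discount": rng.uniform(0, 50, 50),
        "Marketing_Spend": rng.uniform(100, 10_000, 50),
        "Units_Sold": rng.integers(5, 58, 50),
    }
)
fig, corr = plot_overview(demo)
# fig.savefig("overview.png") would write the figure to disk.
```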

💡 How the Data Was Created

The dataset is synthetically generated to mimic realistic e-commerce sales trends. Below are the steps taken for data generation:

Feature Engineering:
- Identified key attributes such as product category, price, discount, and marketing spend, typically observed in e-commerce data.
- Generated dependent features like units sold based on logical relationships.
Data Simulation:
- Python Libraries: Used NumPy and Pandas to generate and distribute values.
- Statistical Modeling: Ensured feature distributions aligned with real-world sales data patterns.
Validation:
- Verified data consistency with no missing or invalid values.
- Ensured logical correlations (e.g., higher discounts → increased units sold).

Note: The dataset is synthetic and not sourced from any real-world e-commerce platform.
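
The steps above could be sketched roughly as follows. This is not the generator used to build the dataset: the `generate_sales` function, the value ranges, the coefficients, and the noise level are all assumptions chosen only to illustrate how NumPy and Pandas can produce a dependent `Units_Sold` column that rises with the discount.

```python
import numpy as np
import pandas as pd


def generate_sales(n_rows=1000, seed=42):
    """Generate a synthetic sales table (illustrative ranges and coefficients)."""
    rng = np.random.default_rng(seed)

    # Independent features, drawn uniformly from assumed ranges.
    price = rng.uniform(10, 1000, n_rows)
    discount = rng.uniform(0, 50, n_rows)
    marketing = rng.uniform(100, 10_000, n_rows)

    # Dependent feature: hypothetical linear relationship plus Gaussian noise.
    units = (
        25
        + 0.2 * discount
        + 0.0005 * marketing
        - 0.005 * price
        + rng.normal(0, 5, n_rows)
    )

    dates = pd.date_range("2023-01-01", periods=n_rows, freq="D")
    return pd.DataFrame(
        {
            "Date": dates.strftime("%m-%d-%Y"),
            "Product_Category": rng.choice(["Electronics", "Sports", "Other"], n_rows),
            "Price": price.round(2),
            "Discount": discount.round(2),
            "Customer_Segment": rng.choice(["Regular", "Occasional", "Other"], n_rows),
            "Marketing_Spend": marketing.round(2),
            # Round to whole units and keep at least one unit per row.
            "Units_Sold": np.clip(units.round(), 1, None).astype(int),
        }
    )


synthetic = generate_sales()

# Validation: no missing values, and discounts correlate positively with units sold.
print(synthetic.isna().sum().sum())
print(synthetic["Discount"].corr(synthetic["Units_Sold"]))
```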

🛠 Example Usage: Sales Prediction Model

Here’s an example of building a predictive model in Python with scikit-learn's LinearRegression:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load the dataset
df = pd.read_csv('ecommerce_sales.csv')

# Feature selection: numeric predictors and the target column
X = df[['Price', 'Discount', 'Marketing_Spend']]
y = df['Units_Sold']

# Train-test split: hold out 20% of the rows, with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model training
model = LinearRegression()
model.fit(X_train, y_train)

# Predictions on the held-out rows
y_pred = model.predict(X_test)

# Evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse:.2f}')
print(f'R-squared: {r2:.2f}')
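
The example above uses only the numeric columns. One possible extension, sketched here, is to one-hot encode `Product_Category` and `Customer_Segment` with `pandas.get_dummies` before fitting. The `fit_with_categories` helper is hypothetical, and the data frame is random placeholder data, so the printed score says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

CATEGORICAL = ["Product_Category", "Customer_Segment"]
NUMERIC = ["Price", "Discount", "Marketing_Spend"]


def fit_with_categories(df):
    """Fit a linear model on numeric plus one-hot encoded categorical features."""
    # drop_first avoids a redundant indicator column per categorical feature.
    X = pd.get_dummies(
        df[NUMERIC + CATEGORICAL], columns=CATEGORICAL, drop_first=True, dtype=float
    )
    y = df["Units_Sold"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    model = LinearRegression().fit(X_train, y_train)
    return model, r2_score(y_test, model.predict(X_test)), list(X.columns)


# Random placeholder data; for the real data use pd.read_csv('ecommerce_sales.csv').
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame(
    {
        "Price": rng.uniform(10, 1000, n),
        "Discount": rng.uniform(0, 50, n),
        "Marketing_Spend": rng.uniform(100, 10_000, n),
        "Product_Category": rng.choice(["Electronics", "Sports", "Other"], n),
        "Customer_Segment": rng.choice(["Regular", "Occasional", "Other"], n),
    }
)
demo["Units_Sold"] = (
    (20 + 0.2 * demo["Discount"] + rng.normal(0, 3, n)).round().astype(int)
)

model, r2, feature_names = fit_with_categories(demo)
print(feature_names)
print(f"R-squared: {r2:.2f}")
```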

Tables

Ecommerce Sales Prediction Dataset

@kaggle.nevildhinoja_e_commerce_sales_prediction_dataset.ecommerce_sales_prediction_dataset

30.45 KB
1000 rows
7 columns


CREATE TABLE ecommerce_sales_prediction_dataset (
  "date" VARCHAR,
  "product_category" VARCHAR,
  "price" DOUBLE,
  "discount" DOUBLE,
  "customer_segment" VARCHAR,
  "marketing_spend" DOUBLE,
  "units_sold" BIGINT
);
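
As a rough illustration of querying this table, the sketch below recreates the schema in an in-memory SQLite database using Python's standard `sqlite3` module and aggregates units sold per category. SQLite is only a stand-in here, since the SQL engine behind the hosted table is not stated, and the three inserted rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Same schema as above; SQLite accepts these type names.
conn.execute(
    """
    CREATE TABLE ecommerce_sales_prediction_dataset (
      "date" VARCHAR,
      "product_category" VARCHAR,
      "price" DOUBLE,
      "discount" DOUBLE,
      "customer_segment" VARCHAR,
      "marketing_spend" DOUBLE,
      "units_sold" BIGINT
    )
    """
)

# Made-up rows for illustration only.
rows = [
    ("01-01-2023", "Electronics", 499.99, 10.5, "Regular", 2500.0, 30),
    ("01-02-2023", "Sports", 120.0, 25.0, "Occasional", 4800.5, 27),
    ("01-03-2023", "Electronics", 250.0, 5.0, "Other", 1200.0, 20),
]
conn.executemany(
    "INSERT INTO ecommerce_sales_prediction_dataset VALUES (?, ?, ?, ?, ?, ?, ?)",
    rows,
)

# Number of sales and average units sold per product category.
query = """
    SELECT product_category, COUNT(*) AS n_sales, AVG(units_sold) AS avg_units
    FROM ecommerce_sales_prediction_dataset
    GROUP BY product_category
    ORDER BY product_category
"""
result = conn.execute(query).fetchall()
print(result)
```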