E-commerce Sales Prediction Dataset

Synthetic E-commerce Dataset for Sales Forecasting and Trend Analysis

@kaggle.nevildhinoja_e_commerce_sales_prediction_dataset

About this Dataset

This repository contains a clean, synthetic dataset for predicting e-commerce sales, aimed at data scientists, machine learning enthusiasts, and researchers. It is designed for analyzing sales trends, optimizing pricing strategies, and developing predictive models for sales forecasting.

📂 Dataset Overview

The dataset includes 1,000 records across the following features:

Column Name        Description
Date               The date of the sale (01-01-2023 onward).
Product_Category   Category of the product (e.g., Electronics, Sports, Other).
Price              Price of the product (numerical).
Discount           Discount applied to the product (numerical).
Customer_Segment   Buyer segment (e.g., Regular, Occasional, Other).
Marketing_Spend    Marketing budget allocated for sales (numerical).
Units_Sold         Number of units sold per transaction (numerical).
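
A quick schema check can catch a renamed or missing column before any analysis starts. The sketch below is a minimal example that uses only the standard library. The sample row is invented for illustration (it is not taken from the dataset), and the helper name `check_columns` is hypothetical.

```python
import csv
import io

# Column names as listed in the table above.
EXPECTED_COLUMNS = [
    "Date", "Product_Category", "Price", "Discount",
    "Customer_Segment", "Marketing_Spend", "Units_Sold",
]


def check_columns(csv_file):
    """Return (missing, unexpected) column names for an open CSV file."""
    reader = csv.DictReader(csv_file)
    found = reader.fieldnames or []
    missing = [c for c in EXPECTED_COLUMNS if c not in found]
    unexpected = [c for c in found if c not in EXPECTED_COLUMNS]
    return missing, unexpected


# Illustrative in-memory sample; the values are invented, not real rows.
sample = io.StringIO(
    "Date,Product_Category,Price,Discount,Customer_Segment,Marketing_Spend,Units_Sold\n"
    "01-01-2023,Sports,100.0,10.0,Regular,5000.0,30\n"
)
missing, unexpected = check_columns(sample)
print("missing:", missing, "unexpected:", unexpected)
```

For a file on disk, the same helper can be called inside `with open(path, newline="") as f:`.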

📊 Data Summary

General Properties

Date:

  • Range: 01-01-2023 to 12-31-2023.
  • Contains 1,000 unique values without missing data.

Product_Category:

  • Categories: Electronics (21%), Sports (21%), Other (58%).
  • Most common named category: Electronics (21%), level with Sports at this precision; "Other" groups the remaining categories.

Price:

  • Range: From 14.59 to 999.
  • Mean: 505, Standard Deviation: 290.
  • Most common price range: 14.59 - 113.07.

Discount:

  • Range: From 0.01% to 49.92%.
  • Mean: 24.9%, Standard Deviation: 14.4%.
  • Most common discount range: 0.01 - 5.00%.
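
The reported mean and standard deviation are close to what a uniform distribution over the stated range would give. The source does not say which distribution was used, so treat the uniform model below as an assumption. The check only shows that the summary numbers are mutually consistent under it.

```python
import math

# Reported discount range in percent (from the summary above).
low, high = 0.01, 49.92

# For a uniform distribution on [low, high]:
#   mean = (low + high) / 2
#   std  = (high - low) / sqrt(12)
uniform_mean = (low + high) / 2
uniform_std = (high - low) / math.sqrt(12)

print(f"uniform mean: {uniform_mean:.2f}")  # compare with the reported 24.9
print(f"uniform std:  {uniform_std:.2f}")   # compare with the reported 14.4
```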

Customer_Segment:

  • Segments: Regular (35%), Occasional (34%), Other (31%).
  • Most common segment: Regular.

Marketing_Spend:

  • Range: From 2.41k to 10k.
  • Mean: 4.91k, Standard Deviation: 2.84k.

Units_Sold:

  • Range: From 5 to 57.
  • Mean: 29.6, Standard Deviation: 7.26.
  • Most common range: 24 - 34 units sold.

📈 Data Visualizations

The dataset is suitable for creating the following visualizations:

  1. Price Distribution: Histogram to show the spread of prices.
  2. Discount Distribution: Histogram to analyze promotional offers.
  3. Marketing Spend Distribution: Histogram to understand marketing investment patterns.
  4. Customer Segment Distribution: Bar plot of customer segments.
  5. Price vs Units Sold: Scatter plot to show pricing effects on sales.
  6. Discount vs Units Sold: Scatter plot to explore the impact of discounts.
  7. Marketing Spend vs Units Sold: Scatter plot for marketing effectiveness.
  8. Correlation Heatmap: Identify relationships between features.
  9. Pairplot: Visualize pairwise feature interactions.
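
As a sketch of the numbers behind the correlation heatmap, pandas can compute pairwise Pearson correlations for the numeric columns. The five rows below are invented for illustration. With the real file, `df` would come from `pd.read_csv` instead.

```python
import pandas as pd

# Invented sample rows, for illustration only.
df = pd.DataFrame({
    "Price": [120.0, 340.5, 560.0, 780.25, 950.0],
    "Discount": [5.0, 15.0, 25.0, 35.0, 45.0],
    "Marketing_Spend": [1500.0, 3200.0, 4800.0, 7100.0, 9300.0],
    "Units_Sold": [22, 26, 30, 33, 38],
})

numeric_cols = ["Price", "Discount", "Marketing_Spend", "Units_Sold"]
corr = df[numeric_cols].corr()  # Pearson correlation is the default method
print(corr.round(2))
```

Passing `corr` to a plotting function such as seaborn's `heatmap` or matplotlib's `imshow` renders the matrix as a heatmap.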

💡 How the Data Was Created

The dataset is synthetically generated to mimic realistic e-commerce sales trends. Below are the steps taken for data generation:

  1. Feature Engineering:

    • Identified key attributes such as product category, price, discount, and marketing spend, typically observed in e-commerce data.
    • Generated dependent features like units sold based on logical relationships.
  2. Data Simulation:

    • Python Libraries: Used NumPy and Pandas to generate and distribute values.
    • Statistical Modeling: Ensured feature distributions aligned with real-world sales data patterns.
  3. Validation:

    • Verified data consistency with no missing or invalid values.
    • Ensured logical correlations (e.g., higher discounts → increased units sold).
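
The steps above could be sketched roughly as follows. This is not the author's generator: the source names only NumPy and Pandas, so the distributions, value ranges, coefficients, the one-row-per-day dates, and the use of "Other" as a literal label are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed so the sketch is reproducible
n_rows = 1000

# Independent features. Illustrative ranges, not fitted to the summary above.
price = rng.uniform(10, 1000, n_rows)
discount = rng.uniform(0, 50, n_rows)  # percent
marketing_spend = rng.uniform(100, 10_000, n_rows)

# Dependent feature: expected units rise with discount and marketing spend.
# The coefficients are hypothetical, chosen only to give plausible counts.
expected_units = 20 + 0.3 * discount + 0.0005 * marketing_spend
units_sold = rng.poisson(expected_units)

df = pd.DataFrame({
    # Assumption: one row per day starting 01-01-2023.
    "Date": pd.date_range("2023-01-01", periods=n_rows, freq="D"),
    "Product_Category": rng.choice(
        ["Electronics", "Sports", "Other"], size=n_rows, p=[0.21, 0.21, 0.58]
    ),
    "Price": price.round(2),
    "Discount": discount.round(2),
    "Customer_Segment": rng.choice(
        ["Regular", "Occasional", "Other"], size=n_rows, p=[0.35, 0.34, 0.31]
    ),
    "Marketing_Spend": marketing_spend.round(2),
    "Units_Sold": units_sold,
})

# Validation in the spirit of step 3: no missing values, and discount
# should correlate positively with units sold.
has_missing = df.isna().any().any()
discount_corr = df["Discount"].corr(df["Units_Sold"])
print(f"missing values: {has_missing}, discount/units correlation: {discount_corr:.2f}")
```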

Note: The dataset is synthetic and not sourced from any real-world e-commerce platform.


🛠 Example Usage: Sales Prediction Model

Here's an example of building a predictive model using Linear Regression:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load the dataset
df = pd.read_csv('ecommerce_sales.csv')

# Feature selection
X = df[['Price', 'Discount', 'Marketing_Spend']]
y = df['Units_Sold']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model training
model = LinearRegression()
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse:.2f}')
print(f'R-squared: {r2:.2f}')
```
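
To help read the two metrics together: R-squared compares the model's squared error with that of a baseline that always predicts the mean of the true values. The sketch below works through that relationship with the standard library only. The numbers are made up for illustration and are not output from the model above; `mse_of` is a hypothetical helper.

```python
from statistics import fmean


def mse_of(y_true, y_hat):
    """Mean squared error between two equal-length sequences."""
    return fmean((t - p) ** 2 for t, p in zip(y_true, y_hat))


# Made-up values for illustration; not produced by the model above.
y_true = [25, 31, 28, 35, 30]
y_hat = [27, 30, 29, 33, 31]

# Baseline: always predict the mean of the true values.
baseline = [fmean(y_true)] * len(y_true)

mse_model = mse_of(y_true, y_hat)
mse_baseline = mse_of(y_true, baseline)

# R-squared is the fraction of the baseline error that the model removes.
r2_manual = 1 - mse_model / mse_baseline

print(f"model MSE: {mse_model:.2f}, baseline MSE: {mse_baseline:.2f}")
print(f"R-squared: {r2_manual:.2f}")
```

An R-squared near 0 means the model does little better than predicting the mean, and a negative value means it does worse.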
