Baselight

Dirty E-Commerce Data [80,000+ Products]

Practice Cleaning Dirty Data - 20+ category files and 80,000+ products

@kaggle.oleksiimartusiuk_e_commerce_data_shein

About this Dataset

E-commerce Product Dataset - Clean and Enhance Your Data Analysis Skills or Check Out The Cleaned File Below!

This dataset offers a comprehensive collection of product information from an e-commerce store, spread across 20+ CSV files and covering more than 80,000 products. It is a good opportunity to test and refine your data cleaning and wrangling skills.

What's Included:

A variety of product categories, including:

  • Apparel & Accessories
  • Electronics
  • Home & Kitchen
  • Beauty & Health
  • Toys & Games
  • Men's Clothes
  • Women's Clothes
  • Pet Supplies
  • Sports & Outdoor
  • (and more!)

Each product record contains details such as:

  • Product Title
  • Category
  • Price
  • Discount information
  • (and other attributes)

Challenges and Opportunities:

Data Cleaning: The dataset is "dirty": it contains missing values, inconsistent formatting, and potential errors. This gives you a chance to practice data-cleaning techniques such as:

  • Identifying and handling missing values
  • Standardizing data formats
  • Correcting inconsistencies
  • Dealing with duplicate entries
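
The four cleaning steps listed above can be sketched with nothing but the Python standard library. This is a minimal illustration, not the dataset author's method: the field names (title, category, price, discount), the sample values, and the string formats are all assumptions, since the description does not document the actual CSV columns.

```python
import re

# Hypothetical records: the real column names and value formats are not
# documented in the dataset description, so everything here is an assumption.
RAW_ROWS = [
    {"title": "  Floral Summer Dress ", "category": "Women's Clothes", "price": "$12.99", "discount": "-20%"},
    {"title": "floral summer dress", "category": "women's clothes", "price": "12.99", "discount": "20%"},
    {"title": "USB-C Cable", "category": "Electronics", "price": "", "discount": None},
    {"title": "", "category": "Toys & Games", "price": "$5.00", "discount": "-10%"},
]


def parse_number(text):
    """Return the first number found in text as a float, or None.

    Commas are assumed to be thousands separators and are removed.
    """
    if not text:
        return None
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    return float(match.group()) if match else None


def clean_rows(rows):
    """Normalize formats, drop rows without a title, and remove duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        # Collapse runs of whitespace and trim the ends of the title.
        title = " ".join((row.get("title") or "").split())
        if not title:
            continue  # a product with no title is treated as unusable here
        record = {
            "title": title,
            "category": (row.get("category") or "unknown").strip().lower(),
            "price": parse_number(row.get("price")),  # None marks a missing price
            "discount_pct": parse_number(row.get("discount")),
        }
        key = (record["title"].lower(), record["category"])
        if key in seen:
            continue  # keep the first occurrence of each (title, category) pair
        seen.add(key)
        cleaned.append(record)
    return cleaned


cleaned = clean_rows(RAW_ROWS)
```

Missing prices are kept as None rather than dropped or imputed, so that the choice of strategy (drop, fill with a category median, and so on) stays explicit in later steps.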

Feature Engineering: After cleaning, you can explore opportunities to create new features from the existing data, such as:

  • Extracting keywords from product titles and descriptions
  • Deriving price categories
  • Calculating average discounts
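
As a rough sketch of the three feature ideas above, the following standard-library Python derives title keywords, a price category, and an average discount. The record layout, the stopword list, and the price cut-offs of 20 and 100 are hypothetical choices made for illustration; none of them come from the dataset itself.

```python
import re
from collections import Counter
from statistics import mean

# Hypothetical cleaned records; field names and values are assumptions.
PRODUCTS = [
    {"title": "Floral Summer Dress", "category": "women's clothes", "price": 12.99, "discount_pct": 20.0},
    {"title": "Striped Summer Shirt", "category": "men's clothes", "price": 35.00, "discount_pct": 10.0},
    {"title": "Wireless Earbuds with Case", "category": "electronics", "price": 120.00, "discount_pct": None},
]

# A tiny illustrative stopword list, not a complete one.
STOPWORDS = {"with", "and", "for", "the", "of"}


def title_keywords(title):
    """Lowercase word tokens of at least three letters, minus stopwords."""
    return [w for w in re.findall(r"[a-z]{3,}", title.lower()) if w not in STOPWORDS]


def price_band(price):
    """Bucket a price; the 20 and 100 cut-offs are arbitrary thresholds."""
    if price is None:
        return "unknown"
    if price < 20:
        return "budget"
    if price < 100:
        return "mid"
    return "premium"


def average_discount(products):
    """Mean discount over products that have one, or None if none do."""
    values = [p["discount_pct"] for p in products if p["discount_pct"] is not None]
    return mean(values) if values else None


keyword_counts = Counter(w for p in PRODUCTS for w in title_keywords(p["title"]))
bands = [price_band(p["price"]) for p in PRODUCTS]
avg_discount = average_discount(PRODUCTS)
```

Products without a discount are excluded from the average rather than counted as zero; treating them as zero would be an equally defensible choice, but it answers a different question.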

Who can benefit from this dataset?

  • Data analysts and scientists looking to practice data cleaning and wrangling skills on a real-world e-commerce dataset
  • Machine learning enthusiasts interested in building models for product recommendation, price prediction, or other e-commerce tasks
  • Anyone interested in exploring and understanding the structure and organization of product data in an e-commerce setting

By contributing to this dataset and sharing your cleaning and feature engineering approaches, you can help create a valuable resource for the Kaggle community!
