Synthetic dataset of 1M+ vehicles for high-accuracy price prediction.

Overview

This dataset contains 1,000,000 synthetic used-vehicle records, designed for training price prediction models. The data was generated by a Python script that builds realistic correlations between a vehicle's attributes and its market price. It covers 25 of the most common car brands across a wide range of models and specifications.

How the Data Was Generated

The dataset was created programmatically. The script's logic ensures realistic data distributions and relationships, such as:

Depreciation: Vehicle age is the primary factor in price calculation, following an exponential decay curve.
Wear and Tear: Mileage is correlated with age and negatively impacts the final price.
Performance: Higher engine horsepower contributes positively to the vehicle's value.
Brand Value: The base price for each brand is different, reflecting real-world market positioning.
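
The four relationships above can be sketched in a few lines of Python. This is a minimal illustration and not the original generation script, which is not shown here. The brand names, base prices, rate constants, column names, and the helpers `expected_price`, `generate_vehicle`, and `generate_dataset` are all assumptions chosen for the example.

```python
import math
import random

# Hypothetical brand base prices in USD (assumed values, not taken from the dataset).
BRAND_BASE_PRICE = {"Toyota": 28_000, "Ford": 30_000, "BMW": 52_000}

DECAY_RATE = 0.12     # assumed annual depreciation rate for the exponential decay
MILEAGE_RATE = 0.01   # assumed additional decay per 10,000 km driven
PRICE_PER_HP = 40.0   # assumed value added per unit of horsepower


def expected_price(brand: str, age_years: float, mileage_km: float, horsepower: float) -> float:
    """Noise-free price implied by the four relationships described above."""
    new_value = BRAND_BASE_PRICE[brand] + PRICE_PER_HP * horsepower  # brand value + performance
    age_factor = math.exp(-DECAY_RATE * age_years)                   # exponential depreciation
    mileage_factor = math.exp(-MILEAGE_RATE * mileage_km / 10_000)   # wear and tear
    return new_value * age_factor * mileage_factor


def generate_vehicle(rng: random.Random) -> dict:
    """Draw one synthetic vehicle record."""
    brand = rng.choice(sorted(BRAND_BASE_PRICE))
    age = rng.randint(0, 20)
    # Mileage is correlated with age: roughly 15,000 km per year plus noise.
    mileage = max(0.0, age * 15_000 + rng.gauss(0, 5_000))
    horsepower = rng.randint(70, 400)
    # A small multiplicative noise term keeps prices from being perfectly deterministic.
    price = expected_price(brand, age, mileage, horsepower) * rng.uniform(0.95, 1.05)
    return {
        "brand": brand,
        "age": age,
        "mileage": round(mileage),
        "horsepower": horsepower,
        "price": round(price, 2),
    }


def generate_dataset(n_rows: int, seed: int = 0) -> list:
    """Generate n_rows records reproducibly from a fixed seed."""
    rng = random.Random(seed)
    return [generate_vehicle(rng) for _ in range(n_rows)]


rows = generate_dataset(1_000, seed=42)
```

Multiplying the factors rather than subtracting a mileage penalty keeps every generated price positive, which is one reasonable design choice among several.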

Potential Use Cases

This dataset is ideal for a variety of machine learning tasks, including:

Regression: Training a model to predict the price column.
Feature Engineering: Exploring new ways to combine features to improve model performance.
Exploratory Data Analysis (EDA): Practicing data visualization and uncovering patterns in automotive data.
Educational Purposes: Giving students and data scientists a large, clean, and realistic dataset to practice on.
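
As a starting point for the regression and feature engineering tasks, the sketch below fits a simple baseline. Because depreciation is described as exponential, taking the logarithm of the price makes its relationship with age linear, so ordinary least squares applies. The `ages` and `prices` lists are toy values computed from a formula inside the example, not rows from the dataset, and the helper names `fit_log_linear` and `predict_price` are hypothetical.

```python
import math


def fit_log_linear(ages, prices):
    """Ordinary least squares fit of ln(price) = intercept + slope * age."""
    log_prices = [math.log(p) for p in prices]
    n = len(ages)
    mean_x = sum(ages) / n
    mean_y = sum(log_prices) / n
    sxx = sum((x - mean_x) ** 2 for x in ages)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, log_prices))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_price(age, slope, intercept):
    """Map a prediction in log space back to a price."""
    return math.exp(intercept + slope * age)


# Toy data from an assumed noise-free decay curve: price = 30000 * exp(-0.12 * age).
ages = list(range(0, 11))
prices = [30_000 * math.exp(-0.12 * a) for a in ages]

slope, intercept = fit_log_linear(ages, prices)
predicted = predict_price(3, slope, intercept)
```

On real rows the fit will not be exact, and a stronger model would also use mileage, horsepower, and brand as features.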

Related Datasets

Cars Prices Prediction Dataset (@kaggle)
Yahoo Finance Historical Prices And Ticker Fundamentals (@yahoo)
FRED Academic Data (@fred)
FRED Prices (@fred)
Producer Prices In Industry (@owid)
2020 PREDICT Dataset (deprecated) (@ecjrc)
