Baselight

Red Wine DataSet

wine-quality-measure

@kaggle.soorajgupta7_red_wine_dataset

About this Dataset

Dataset Description:

The datasets under discussion pertain to the red and white variants of Portuguese "Vinho Verde" wine. Detailed information is available in the reference by Cortez et al. (2009). These datasets encompass physicochemical variables as inputs and sensory variables as outputs. Notably, specifics regarding grape types, wine brand, and selling prices are absent due to privacy and logistical concerns.

Classification and Regression Tasks:
These datasets can be treated as either classification or regression problems. The classes are ordered but imbalanced: the dataset contains far more normal wines than excellent or poor ones.
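Before modeling, it is worth checking this imbalance directly. A minimal sketch with pandas, using made-up quality scores purely for illustration (the real column comes from the Kaggle CSV):

```python
import pandas as pd

# Hypothetical scores illustrating the imbalance: most wines score 5-6 ("normal").
df = pd.DataFrame({"quality": [5, 5, 6, 6, 6, 5, 7, 4, 5, 6, 6, 8, 5, 6, 3]})

# Count how many wines fall into each quality score.
print(df["quality"].value_counts().sort_index())
```

With the real data, the same `value_counts()` call reveals how few wines sit at the extremes, which motivates stratified splitting later on.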

Dataset Contents:
For a comprehensive understanding, readers are encouraged to review the work by Cortez et al. (2009). The input variables, derived from physicochemical tests, include:

  1. Fixed acidity
  2. Volatile acidity
  3. Citric acid
  4. Residual sugar
  5. Chlorides
  6. Free sulfur dioxide
  7. Total sulfur dioxide
  8. Density
  9. pH
  10. Sulphates
  11. Alcohol

The output variable, based on sensory data, is:

  12. Quality (score ranging from 0 to 10)

Usage Tips:
A practical suggestion involves setting a threshold for the dependent variable, defining wines with a quality score of 7 or higher as 'good/1' and the rest as 'not good/0.' This facilitates meaningful experimentation with hyperparameter tuning using decision tree algorithms and analyzing ROC curves and AUC values.

Operational Workflow:
To efficiently utilize the dataset, the following steps are recommended:

  1. Connect a File Reader (for the CSV) to a Linear Correlation node and an Interactive Histogram node for basic Exploratory Data Analysis (EDA).
  2. Feed the File Reader output into a Rule Engine node to transform the 10-point scale into a dichotomous variable: 'good wine' vs. the rest.
  3. Connect the Rule Engine node output to a Column Filter node to filter out the original 10-point feature, thus preventing data leakage.
  4. Connect the Column Filter node output to a Partitioning node for a standard train/test split (e.g., 75%/25%, choosing 'random' or 'stratified' sampling).
  5. Feed the Partitioning node's training split into a Decision Tree Learner node.
  6. Connect the Partitioning node's test split to a Decision Tree Predictor node.
  7. Link the Decision Tree Learner node's model output to the model input of the Decision Tree Predictor node.
  8. Finally, connect the Decision Tree Predictor output to a ROC Curve node to evaluate the model by its AUC value.
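For readers who prefer code over the KNIME GUI, the same workflow can be sketched with scikit-learn. The feature names and the synthetic data below are assumptions made so the example is self-contained; in practice you would load the actual CSV (file name and path depend on your download):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the wine table; replace with the real CSV, e.g.
#   df = pd.read_csv("winequality-red.csv")  # file name is an assumption
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(400, 3)),
                  columns=["alcohol", "sulphates", "volatile_acidity"])
df["quality"] = np.where(df["alcohol"] + rng.normal(scale=0.5, size=400) > 0.8, 8, 5)

# Rule Engine step: binarize quality (>= 7 -> good/1, else 0).
y = (df["quality"] >= 7).astype(int)
# Column Filter step: drop the original 10-point score to avoid leakage.
X = df.drop(columns=["quality"])

# Partitioning step: stratified 75%/25% train/test split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1)

# Learner + Predictor steps: fit the tree, score the held-out split.
clf = DecisionTreeClassifier(max_depth=4, random_state=1).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

# ROC step: evaluate the model by its AUC value.
print(f"AUC: {roc_auc_score(y_te, proba):.3f}")
```

Each comment maps back to the corresponding KNIME node in the numbered steps above; `max_depth` is one of the hyperparameters worth tuning in the experiment the Usage Tips describe.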

Tools and Acknowledgments:
For an efficient analysis, consider using KNIME, a graphical user interface (GUI) workflow tool. The dataset is also available in the UCI Machine Learning Repository, and proper acknowledgment and citation of the dataset source, Cortez et al. (2009), are required for use.
