Baselight

Red Wine DataSet

wine-quality-measure

@kaggle.soorajgupta7_red_wine_dataset

About this Dataset

Dataset Description:

The datasets under discussion pertain to the red and white variants of Portuguese "Vinho Verde" wine. Detailed information is available in the reference by Cortez et al. (2009). These datasets encompass physicochemical variables as inputs and sensory variables as outputs. Notably, specifics regarding grape types, wine brand, and selling prices are absent due to privacy and logistical concerns.

Classification and Regression Tasks:
These datasets can be treated as either classification or regression problems. The classes are ordered but imbalanced: the dataset contains far more normal wines than excellent or poor ones.
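Before modeling, it is worth checking this imbalance directly. A minimal sketch with pandas, using made-up quality scores purely for illustration (the real column comes from the Kaggle CSV):

```python
import pandas as pd

# Hypothetical scores illustrating the imbalance: most wines score 5-6 ("normal").
df = pd.DataFrame({"quality": [5, 5, 6, 6, 6, 5, 7, 4, 5, 6, 6, 8, 5, 6, 3]})

# Count how many wines fall into each quality score.
print(df["quality"].value_counts().sort_index())
```

With the real data, the same `value_counts()` call reveals how few wines sit at the extremes, which motivates stratified splitting later on.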

Dataset Contents:
For a comprehensive understanding, readers are encouraged to review the work by Cortez et al. (2009). The input variables, derived from physicochemical tests, include:

  1. Fixed acidity
  2. Volatile acidity
  3. Citric acid
  4. Residual sugar
  5. Chlorides
  6. Free sulfur dioxide
  7. Total sulfur dioxide
  8. Density
  9. pH
  10. Sulphates
  11. Alcohol

The output variable, based on sensory data, is:

  12. Quality (score ranging from 0 to 10)

Usage Tips:
A practical suggestion involves setting a threshold for the dependent variable, defining wines with a quality score of 7 or higher as 'good/1' and the rest as 'not good/0.' This facilitates meaningful experimentation with hyperparameter tuning using decision tree algorithms and analyzing ROC curves and AUC values.

Operational Workflow:
To efficiently utilize the dataset, the following steps are recommended:

  1. Connect a File Reader (for the CSV) to a Linear Correlation node and an Interactive Histogram node for basic Exploratory Data Analysis (EDA).
  2. Feed the File Reader output into a Rule Engine node to transform the 10-point scale into a dichotomous variable: 'good wine' vs. the rest.
  3. Connect the Rule Engine node output to a Column Filter node to filter out the original 10-point feature, thus preventing data leakage.
  4. Connect the Column Filter node output to a Partitioning node for a standard train/test split (e.g., 75%/25%, choosing 'random' or 'stratified' sampling).
  5. Feed the Partitioning node's training split into a Decision Tree Learner node.
  6. Connect the Partitioning node's test split to a Decision Tree Predictor node.
  7. Link the Decision Tree Learner node's model output to the model input of the Decision Tree Predictor node.
  8. Finally, connect the Decision Tree Predictor output to a ROC Curve node to evaluate the model by its AUC value.
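For readers who prefer code over the KNIME GUI, the same workflow can be sketched with scikit-learn. The feature names and the synthetic data below are assumptions made so the example is self-contained; in practice you would load the actual CSV (file name and path depend on your download):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the wine table; replace with the real CSV, e.g.
#   df = pd.read_csv("winequality-red.csv")  # file name is an assumption
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(400, 3)),
                  columns=["alcohol", "sulphates", "volatile_acidity"])
df["quality"] = np.where(df["alcohol"] + rng.normal(scale=0.5, size=400) > 0.8, 8, 5)

# Rule Engine step: binarize quality (>= 7 -> good/1, else 0).
y = (df["quality"] >= 7).astype(int)
# Column Filter step: drop the original 10-point score to avoid leakage.
X = df.drop(columns=["quality"])

# Partitioning step: stratified 75%/25% train/test split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1)

# Learner + Predictor steps: fit the tree, score the held-out split.
clf = DecisionTreeClassifier(max_depth=4, random_state=1).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

# ROC step: evaluate the model by its AUC value.
print(f"AUC: {roc_auc_score(y_te, proba):.3f}")
```

Each comment maps back to the corresponding KNIME node in the numbered steps above; `max_depth` is one of the hyperparameters worth tuning in the experiment the Usage Tips describe.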

Tools and Acknowledgments:
For an efficient analysis, consider using KNIME, a graphical user interface (GUI) workflow tool. The dataset is also available in the UCI Machine Learning Repository, and proper acknowledgment and citation of the dataset source, Cortez et al. (2009), are required for use.
