Context

This is a Glass Identification Data Set from UCI. It contains 10 attributes including id. The response is glass type(discrete 7 values)

Content

Attribute Information:

Id number: 1 to 214 (removed from CSV file)
RI: refractive index
Na: Sodium (unit measurement: weight percent in corresponding oxide, as are attributes 4-10)
Mg: Magnesium
Al: Aluminum
Si: Silicon
K: Potassium
Ca: Calcium
Ba: Barium
Fe: Iron
Type of glass: (class attribute)
-- 1 building_windows_float_processed
-- 2 building_windows_non_float_processed
-- 3 vehicle_windows_float_processed
-- 4 vehicle_windows_non_float_processed (none in this database)
-- 5 containers
-- 6 tableware
-- 7 headlamps

Acknowledgements

https://archive.ics.uci.edu/ml/datasets/Glass+Identification
Source:

Creator:
B. German
Central Research Establishment
Home Office Forensic Science Service
Aldermaston, Reading, Berkshire RG7 4PN

Donor:
Vina Spiehler, Ph.D., DABFT
Diagnostic Products Corporation
(213) 776-0180 (ext 3014)

Inspiration

Data exploration of this dataset reveals two important characteristics :

The variables are highly corelated with each other including the response variables:
So which kind of ML algorithm is most suitable for this dataset Random Forest , KNN or other? Also since dataset is too small is there any chance of applying PCA or it should be completely avoided?
Highly Skewed Data:
Is scaling sufficient or are there any other techniques which should be applied to normalize data? Like BOX-COX Power transformation?

Tables

Glass

@kaggle.uciml_glass.glass

13.41 KB
214 rows
10 columns


CREATE TABLE glass (
  "ri" DOUBLE,
  "na" DOUBLE,
  "mg" DOUBLE,
  "al" DOUBLE,
  "si" DOUBLE,
  "k" DOUBLE,
  "ca" DOUBLE,
  "ba" DOUBLE,
  "fe" DOUBLE,
  "type" BIGINT
);