Context
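The goal is to find the predictors that best indicate a home run, with model performance reported as log loss.
Log loss measures the quality of a classifier's predicted probabilities by penalizing confident predictions that turn out to be wrong.
Lower is better. Minimizing log loss is not the same as maximizing accuracy: two classifiers with identical accuracy can have very different log loss, depending on how well calibrated their probabilities are.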
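To make the metric concrete, here is a minimal sketch of binary log loss in plain Python. The function name `log_loss`, the clipping constant `eps`, and the toy labels and probabilities are all assumptions made for illustration; they are not taken from the competition code or data.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary log loss; y_true holds 0/1 labels, y_prob holds P(label == 1)."""
    if not y_true or len(y_true) != len(y_prob):
        raise ValueError("y_true and y_prob must be non-empty and the same length")
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip so log(0) never occurs
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Made-up labels and probabilities, for illustration only.
labels = [1, 0, 0, 0]
cautious = log_loss(labels, [0.6, 0.4, 0.4, 0.4])        # all correct at a 0.5 cutoff
sharp = log_loss(labels, [0.9, 0.1, 0.1, 0.1])           # all correct, more confident
overconfident = log_loss(labels, [0.9, 0.1, 0.1, 0.99])  # one confident miss

print(cautious, sharp, overconfident)
```

The first two sets of predictions have the same accuracy at a 0.5 cutoff, yet the sharper one scores a lower log loss, while a single confident miss makes the third score the worst of the three.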
Content
Data set adapted from the SLICED competition, Season 01 Episode 09: https://www.kaggle.com/c/sliced-s01e09-playoffs-1
Those competitions last only 2 hours, so time is limited. Here we can use the same data set and take more time for analysis.
Acknowledgements
Adapted largely from David Robinson's videos on YouTube. His modeling techniques were a great help in navigating the tidyverse.
Inspiration
Which predictors are most indicative of a home run, as measured by log loss?
Numeric predictors, categorical predictors, or a hybrid of both: how low can we go?
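One way to put "how low can we go?" in perspective is to compare against a model that predicts the same probability for every ball in play. The sketch below assumes a hypothetical helper `constant_log_loss` and a made-up label list with 5% positives; the real home-run rate in train.csv is not stated here and would need to be computed from the file.

```python
import math

def constant_log_loss(labels, p):
    """Log loss when every row receives the same predicted probability p (0 < p < 1)."""
    n = len(labels)
    positives = sum(labels)
    return -(positives * math.log(p) + (n - positives) * math.log(1.0 - p)) / n

# Made-up labels for illustration only (not drawn from train.csv).
toy_labels = [1] + [0] * 19
base_rate = sum(toy_labels) / len(toy_labels)

baseline = constant_log_loss(toy_labels, base_rate)
coin_flip = constant_log_loss(toy_labels, 0.5)

print(baseline, coin_flip)
```

Among constant predictions, the observed base rate gives the lowest log loss, so any model built on the predictors should beat that figure to be worth keeping.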
File descriptions
- train.csv - the training set (from 2020)
- test.csv - the test set (from 2021)
- sample_submission.csv - a sample submission file in the correct format
- park_dimensions.csv - various park details and dimensions (OPTIONAL)
Data dictionary
train.csv
- bip_id: unique identifier of ball in play
- game_date: date of game (YYYY-MM-DD)
- home_team: home team abbreviation
- away_team: away team abbreviation
- batter_team: batter's team abbreviation
- batter_name: batter's name
- pitcher_name: pitcher's name
- batter_id: batter's unique identifier
- pitcher_id: pitcher's unique identifier
- is_batter_lefty: binary encoding of left-handed batters
- is_pitcher_lefty: binary encoding of left-handed pitchers
- bb_type: batted ball type classification
- bearing: horizontal direction classification of ball leaving the bat (e.g. 'left' means the ball is traveling to the left side of the field)
- pitch_name: name of pitch type thrown
- park: unique identifier of park venue
- inning: inning number within game
- outs_when_up: current number of outs
- balls: current number of balls
- strikes: current number of strikes
- plate_x: horizontal ball position, left (-) or right (+) of the center of home plate (feet)
- plate_z: ball height above home plate (feet)
- pitch_mph: speed of pitched ball (miles per hour)
- launch_speed: speed of ball leaving the bat (miles per hour)
- launch_angle: vertical angle of ball leaving the bat (degrees relative to horizontal)
- is_home_run: binary encoding of home runs
test.csv
Same columns as train.csv, but without the target variable is_home_run.
park_dimensions.csv
- park: unique identifier of park venue
- NAME: park name
- Cover: designation of stadiums with retractable roof or fixed dome
- LF_Dim: distance to left field wall (feet)
- CF_Dim: distance to center field wall (feet)
- RF_Dim: distance to right field wall (feet)
- LF_W: height of left field wall (feet)
- CF_W: height of center field wall (feet)
- RF_W: height of right field wall (feet)
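Because train.csv and park_dimensions.csv share the `park` column, the park details can be attached to each ball in play with a left join. The following is a rough sketch using only the standard library; the function name `attach_park_dimensions` is hypothetical, and the two in-memory "files" contain made-up values standing in for the real CSVs.

```python
import csv
import io

def attach_park_dimensions(bip_rows, park_rows):
    """Left-join park details onto ball-in-play rows using the shared `park` key."""
    parks = {row["park"]: row for row in park_rows}
    joined = []
    for row in bip_rows:
        merged = dict(row)
        # Rows whose park has no match keep only their original columns.
        for key, value in parks.get(row["park"], {}).items():
            if key != "park":
                merged[key] = value
        joined.append(merged)
    return joined

# Tiny stand-ins with made-up values; real use would open train.csv and park_dimensions.csv.
train_text = "bip_id,park,launch_speed\n1,0,101.5\n2,7,88.0\n"
park_text = "park,NAME,LF_Dim,LF_W\n0,Example Park,330,8\n"

bip_rows = list(csv.DictReader(io.StringIO(train_text)))
park_rows = list(csv.DictReader(io.StringIO(park_text)))
joined = attach_park_dimensions(bip_rows, park_rows)

print(joined[0])
```

Note that `csv.DictReader` returns every value as a string, so numeric columns such as LF_Dim would still need converting before modeling.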