Baselight

Baseball

MLB Predict Home Runs

@kaggle.jcraggy_baseball

About this Dataset

Context

Find the predictors most indicative of a home run, with model performance reported as log loss.

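Log loss quantifies the quality of a classifier's predicted probabilities by penalizing confident wrong predictions more heavily than hesitant ones.
Minimizing log loss rewards well-calibrated probabilities. Lower log loss usually goes along with higher accuracy, but the two are not equivalent.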
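
As a minimal sketch of the metric: for binary labels y_i and predicted probabilities p_i, log loss is the mean of -[y_i log(p_i) + (1 - y_i) log(1 - p_i)]. The helper below illustrates that formula and is not the competition's official scorer; the function name `binary_log_loss` and the clipping constant `eps` are assumptions made for this example.

```python
import math


def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    if len(y_true) != len(p_pred):
        raise ValueError("y_true and p_pred must have the same length")
    if not y_true:
        raise ValueError("at least one observation is required")
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Clip to avoid log(0) when a prediction is exactly 0 or 1.
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


# Illustrative values only. Both prediction sets classify every row
# correctly at a 0.5 threshold, yet their log loss differs.
labels = [1, 0, 0, 1]
confident = [0.9, 0.1, 0.2, 0.8]
hesitant = [0.6, 0.4, 0.45, 0.55]
print(binary_log_loss(labels, confident))
print(binary_log_loss(labels, hesitant))
```

The two prediction sets have identical accuracy, so the difference in their scores comes entirely from how confident the probabilities are.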

Content

Data set adapted from the SLICED competition, Season 01 Episode 09: https://www.kaggle.com/c/sliced-s01e09-playoffs-1

The competitions last only two hours, so time is limited. Here we can use the data set and take more time for analysis.

Acknowledgements

Adapted largely from David Robinson's work on YouTube. His modeling techniques were a great help in navigating the tidyverse.

Inspiration

Which predictors work best: numeric, categorical, or a hybrid of both? How low can the log loss go?

File descriptions

  • train.csv - the training set (from 2020)
  • test.csv - the test set (from 2021)
  • sample_submission.csv - a sample submission file in the correct format
  • park_dimensions.csv - various park details and dimensions (OPTIONAL)

Data dictionary

train.csv

  • bip_id: unique identifier of ball in play
  • game_date: date of game (YYYY-MM-DD)
  • home_team: home team abbreviation
  • away_team: away team abbreviation
  • batter_team: batter's team abbreviation
  • batter_name: batter's name
  • pitcher_name: pitcher's name
  • batter_id: batter's unique identifier
  • pitcher_id: pitcher's unique identifier
  • is_batter_lefty: binary encoding of left-handed batters
  • is_pitcher_lefty: binary encoding of left-handed pitchers
  • bb_type: batted ball type classification
  • bearing: horizontal direction classification of ball leaving the bat (e.g. 'left' means the ball is traveling to the left side of the field)
  • pitch_name: name of pitch type thrown
  • park: unique identifier of park venue
  • inning: inning number within game
  • outs_when_up: current number of outs
  • balls: current number of balls
  • strikes: current number of strikes
  • plate_x: ball position left (-) or right (+) of the center of home plate (feet)
  • plate_z: ball position above home plate (feet)
  • pitch_mph: speed of pitched ball (miles per hour)
  • launch_speed: speed of ball leaving the bat (miles per hour)
  • launch_angle: vertical angle of ball leaving the bat (degrees relative to horizontal)
  • is_home_run: binary encoding of home runs
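
The dictionary above can be turned into a small typed reader. This is a sketch using only the standard library, under several assumptions: the grouping of columns into integers and floats is inferred from the descriptions rather than confirmed against the files, identifiers are left as strings, and missing values are assumed to appear as an empty string or `NA`. The sample rows are made up for illustration and cover only a few columns.

```python
import csv
import io
from datetime import date

# Assumed types, inferred from the data dictionary (not verified against the files).
INT_COLUMNS = ("is_batter_lefty", "is_pitcher_lefty", "inning",
               "outs_when_up", "balls", "strikes", "is_home_run")
FLOAT_COLUMNS = ("plate_x", "plate_z", "pitch_mph", "launch_speed", "launch_angle")
MISSING = {"", "NA"}  # assumption about how missing values are written


def parse_row(raw):
    """Convert one csv.DictReader row (all strings) into typed values."""
    row = dict(raw)
    for name in INT_COLUMNS:
        if name in row:
            row[name] = None if row[name] in MISSING else int(row[name])
    for name in FLOAT_COLUMNS:
        if name in row:
            row[name] = None if row[name] in MISSING else float(row[name])
    if "game_date" in row:
        # The dictionary documents game_date as YYYY-MM-DD.
        row["game_date"] = date.fromisoformat(row["game_date"])
    return row


# Made-up rows with a subset of the columns, for illustration only.
sample = (
    "bip_id,game_date,inning,launch_speed,launch_angle,is_home_run\n"
    "1,2020-08-01,3,101.5,27,1\n"
    "2,2020-08-01,4,NA,NA,0\n"
)
rows = [parse_row(r) for r in csv.DictReader(io.StringIO(sample))]
print(rows[0])
```

With the real file, the same function could be applied to `csv.DictReader(open("train.csv", newline=""))`. Columns missing from a row, such as `is_home_run` in test.csv, are simply skipped.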

test.csv

Same as train.csv, but without the target variable is_home_run.
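
A quick sanity check of that relationship can be sketched as follows. The function name `headers_consistent` is hypothetical, and the headers shown are shortened, made-up stand-ins for the real ones.

```python
import csv
import io

TARGET = "is_home_run"


def read_header(stream):
    """Return the first line of a CSV stream as a list of column names."""
    return next(csv.reader(stream))


def headers_consistent(train_header, test_header, target=TARGET):
    """True when test has exactly the train columns minus the target."""
    return (target in train_header
            and set(test_header) == set(train_header) - {target})


# Shortened, made-up headers for illustration. With the real files, pass
# open("train.csv", newline="") and open("test.csv", newline="") instead.
train_header = read_header(io.StringIO("bip_id,launch_speed,is_home_run\n"))
test_header = read_header(io.StringIO("bip_id,launch_speed\n"))
print(headers_consistent(train_header, test_header))
```

The comparison uses sets, so it ignores column order. That is a deliberate choice for this sketch, not something the source specifies.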

park_dimensions.csv

  • park: unique identifier of park venue
  • NAME: park name
  • Cover: designation of stadiums with retractable roof or fixed dome
  • LF_Dim: distance to left field wall (feet)
  • CF_Dim: distance to center field wall (feet)
  • RF_Dim: distance to right field wall (feet)
  • LF_W: height of left field wall (feet)
  • CF_W: height of center field wall (feet)
  • RF_W: height of right field wall (feet)
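
Because both files share the `park` column, the park details can be attached to each ball in play with a left join. The sketch below uses only the standard library. The helper names `index_parks` and `add_park_dimensions` are hypothetical, the rows are made up for illustration, and values stay as strings exactly as `csv.DictReader` returns them.

```python
import csv
import io

DIMENSION_COLUMNS = ("LF_Dim", "CF_Dim", "RF_Dim", "LF_W", "CF_W", "RF_W")


def index_parks(park_rows):
    """Map each park identifier to its row of park details."""
    return {row["park"]: row for row in park_rows}


def add_park_dimensions(bip_rows, parks_by_id, columns=DIMENSION_COLUMNS):
    """Left join: keep every ball in play; unknown parks get None values."""
    joined = []
    for row in bip_rows:
        park = parks_by_id.get(row["park"], {})
        merged = dict(row)
        for col in columns:
            merged[col] = park.get(col)
        joined.append(merged)
    return joined


# Made-up rows for illustration only; real values come from the CSV files.
parks_csv = (
    "park,NAME,Cover,LF_Dim,CF_Dim,RF_Dim,LF_W,CF_W,RF_W\n"
    "0,Example Park,Outdoor,330,400,325,8,8,10\n"
)
bip_csv = "bip_id,park,launch_speed\n1,0,101.5\n2,99,88.0\n"

parks_by_id = index_parks(csv.DictReader(io.StringIO(parks_csv)))
joined = add_park_dimensions(list(csv.DictReader(io.StringIO(bip_csv))), parks_by_id)
print(joined[0]["LF_Dim"], joined[1]["LF_Dim"])
```

A left join keeps rows whose park has no match, which matters here because park_dimensions.csv is described as optional.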
