Baselight

KBO Player Performance Dataset 2018 - 2024

Player Performance Data for Korean Baseball Organization teams

@kaggle.clementmsika_kbo_player_performance_dataset_2018_2024

About this Dataset

What is the KBO League?

KBO stands for Korea Baseball Organization, which runs the KBO League, the top-level baseball league in South Korea.
The league is often compared to MLB (Major League Baseball) in the United States.

There are currently 10 teams competing in the KBO:

  • Samsung Lions
  • Kia Tigers
  • LG Twins
  • SSG Landers (formerly SK Wyverns)
  • NC Dinos
  • Lotte Giants
  • KT Wiz
  • Kiwoom Heroes (formerly Nexen Heroes)
  • Hanwha Eagles
  • Doosan Bears

Context

I created this dataset as part of a computer science project at Georgia Tech. You can read the research paper here.

I crawled the KBO statistics data from the Sports Reference website baseball-reference.com in October and November 2024 using Python. In particular, for each of the 10 KBO teams, I extracted the player batting statistics and characteristics between 2018 and 2024.

Content

This dataset contains 1,984 KBO player records.

I used the following columns to implement a Bayesian regression model:

  • Year

This column is the KBO League season. The KBO League is the top baseball league in South Korea and can be compared to MLB in North America.

  • Team

This column is the player's team. The dataset covers the 10 KBO teams that competed between 2018 and 2024.

  • Name

This column is the batter's name. The dataset is limited to batters and their batting statistics.

  • Age

I use the batter’s age as a feature in the Bayesian model.

  • Handedness

This feature refers to the player’s batting side: ‘L’ for left-handed, ‘R’ for right-handed and ‘S’ for switch hitters.

  • Height

I enriched the Bayesian model with the player’s height, which is shown in the dataset in feet and inches.

  • Weight

I also included the batter’s weight, in pounds, in the model.

  • Home Run Rate

This column is the model target. The batter’s home run rate for a given season is defined as the ratio: number of home runs / number of plate appearances.

This KPI captures a significant part of a batter’s overall performance during a baseball season.

A home run occurs when ‘a batter hits a fair ball and scores on the play without being put out or without the benefit of an error’.
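
As a minimal sketch of the home run rate definition above, the ratio can be computed as follows. The function name and the choice to return NaN when there are no plate appearances are my assumptions for illustration; the dataset itself may handle that case differently.

```python
def home_run_rate(home_runs: int, plate_appearances: int) -> float:
    """Return home runs per plate appearance.

    Assumption: a season with no plate appearances yields NaN rather than an error.
    """
    if plate_appearances <= 0:
        return float("nan")
    return home_runs / plate_appearances


# Hypothetical season line, not taken from the dataset.
example_rate = home_run_rate(home_runs=20, plate_appearances=500)
print(f"{example_rate:.3f}")  # 0.040
```

In terms of the table columns, this corresponds to hr divided by pa.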

Detailed information about each dataset column can be found on the website baseball-reference.com.
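
Because height is stored as text in feet and inches, it usually has to be converted to a number before it can serve as a model feature. The sketch below shows one possible conversion. The exact text format in the dataset is not described here, so the parser simply assumes the value contains two whole numbers (feet, then inches), for example "6-2"; the function names are hypothetical.

```python
import re

CM_PER_INCH = 2.54


def height_to_inches(height_text: str) -> int:
    """Convert a feet-and-inches string to total inches.

    Assumption: the text contains exactly two whole numbers, feet then inches.
    """
    numbers = re.findall(r"\d+", height_text)
    if len(numbers) != 2:
        raise ValueError(f"Unexpected height format: {height_text!r}")
    feet, inches = (int(n) for n in numbers)
    return feet * 12 + inches


def height_to_cm(height_text: str) -> float:
    """Convert a feet-and-inches string to centimeters."""
    return height_to_inches(height_text) * CM_PER_INCH


# Hypothetical value, not taken from the dataset.
print(height_to_inches("6-2"))        # 74
print(f"{height_to_cm('6-2'):.2f}")   # 187.96
```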

Inspiration

The Jupyter notebook that implements the Bayesian regression model with this dataset is available at www.kaggle.com/code/clementmsika/predicting-kbo-player-performance-with-pymc

Tables

KBO Dataset 2018–2024

@kaggle.clementmsika_kbo_player_performance_dataset_2018_2024.kbo_dataset_2018_2024
  • 119.78 KB
  • 1984 rows
  • 33 columns

CREATE TABLE kbo_dataset_2018_2024 (
  "player_name" VARCHAR,
  "team" VARCHAR,
  "year" BIGINT,
  "home_run_rate" DOUBLE,
  "label" VARCHAR,
  "batting_side" VARCHAR,
  "throwing_hand" VARCHAR,
  "height" VARCHAR,
  "weight" BIGINT,
  "age" BIGINT,
  "g" BIGINT,
  "pa" BIGINT,
  "ab" BIGINT,
  "r" BIGINT,
  "h" BIGINT,
  "n_2b" BIGINT,
  "n_3b" BIGINT,
  "hr" BIGINT,
  "rbi" BIGINT,
  "sb" BIGINT,
  "cs" BIGINT,
  "bb" BIGINT,
  "so" BIGINT,
  "batting_avg" DOUBLE,
  "onbase_perc" DOUBLE,
  "slugging_perc" DOUBLE,
  "ibb" BIGINT,
  "onbase_plus_slugging" DOUBLE,
  "tb" BIGINT,
  "gidp" BIGINT,
  "hbp" BIGINT,
  "sh" BIGINT,
  "sf" BIGINT
);
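
To illustrate how the table above might be queried, here is a small sketch that uses Python's built-in sqlite3 module with an in-memory database. It recreates only a subset of the columns, and every row inserted is a made-up placeholder (the player and team names are hypothetical), so the numbers say nothing about real KBO players. The query aggregates home runs and plate appearances per team for one season.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Subset of the schema above, enough to recompute a home run rate.
conn.execute(
    """
    CREATE TABLE kbo_dataset_2018_2024 (
      "player_name" VARCHAR,
      "team" VARCHAR,
      "year" BIGINT,
      "pa" BIGINT,
      "hr" BIGINT
    )
    """
)

# Placeholder rows for illustration only, not real data.
sample_rows = [
    ("Player A", "Team X", 2024, 500, 20),
    ("Player B", "Team X", 2024, 500, 10),
    ("Player C", "Team Y", 2024, 250, 5),
    ("Player C", "Team Y", 2023, 400, 12),
]
conn.executemany(
    "INSERT INTO kbo_dataset_2018_2024 VALUES (?, ?, ?, ?, ?)", sample_rows
)

# Team-level home run rate for one season: total hr / total pa.
# CAST avoids integer division; NULLIF guards against a zero denominator.
query = """
    SELECT team,
           SUM(hr) AS home_runs,
           SUM(pa) AS plate_appearances,
           CAST(SUM(hr) AS REAL) / NULLIF(SUM(pa), 0) AS team_home_run_rate
    FROM kbo_dataset_2018_2024
    WHERE year = 2024
    GROUP BY team
    ORDER BY team_home_run_rate DESC
"""
team_rates = conn.execute(query).fetchall()
for team, home_runs, plate_appearances, rate in team_rates:
    print(team, home_runs, plate_appearances, round(rate, 3))

conn.close()
```

The same SELECT statement should carry over to the full table with little change, although the SQL dialect of the hosting platform may differ from SQLite in details such as type names.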
