Baselight

Horse Racing In HK

Data on thoroughbred racing in Hong Kong for fun and machine learning

@kaggle.gdaley_hkracing

Loading...
Loading...

About this Dataset

Horse Racing In HK

Can you beat the market?

Horse racing has always intrigued me - not so much from the point of view as a sport, but more from the view of it as a money market. Inspired by the pioneers of computerised horse betting, I'm sharing this dataset in the hope of finding more data scientists willing to take up the challenge and find new ways of exploiting it!

As always, the goal for most of us is to find information in the data that can be used to generate profit, usually by finding information that has not already been considered by the other players in the game. But I'm always interested in finding new uses for the data, whatever they may be.

Horse racing is a huge business in Hong Kong, which has two race tracks in a city that is only 1,104 square km. The betting pools are bigger than all US racetracks combined, which means that the opportunity is unlimited for those who are successful.

So are you up for it?

Content

The data was obtained from various free sources and is presented in CSV format. Personally-identifiable information, such as horse and jockey names, has not been included. However these should have no relevance to the purpose of this dataset, which is purely for experimental use.

There are two files:

races.csv

Each line describes the condition of an individual race.

race_id - unique identifier for the race

date - date of the race, in YYYY-MM-DD format (note that the dates given have been obscured and are not the real ones, although the durations between each race should be correct)

venue - a 2-character string, representing which of the 2 race courses this race took place at: ST = Shatin, HV = Happy Valley

race_no - race number of the race in the day's meeting

config - race track configuration, mostly related to the position of the inside rail. For more details, see the HKJC website.

surface - a number representing the type of race track surface: 1 = dirt, 0 = turf

distance - distance of the race, in metres

going - track condition. For more details, see the HKJC website.

horse_ratings - the range of horse ratings that may participate in this race

prize - the winning prize, in HK Dollars

race_class - a number representing the class of the race

sec_time1 - time taken by the leader of the race to reach the end of the end of the 1st sectional point (sec)

sec_time2 - time taken by the leader of the race to reach the end of the 2nd sectional point (sec)

sec_time3 - time taken by the leader of the race to reach the end of the 3rd sectional point (sec)

sec_time4 - time taken by the leader of the race to reach the end of the 4th sectional point, if any (sec)

sec_time5 - time taken by the leader of the race to reach the end of the 5th sectional point, if any (sec)

sec_time6 - time taken by the leader of the race to reach the end of the fourth sectional point, if any (sec)

sec_time7 - time taken by the leader of the race to reach the end of the fourth sectional point, if any (sec)

time1 - time taken by the leader of the race in the 1st section only (sec)

time2 - time taken by the leader of the race in the 2nd section only (sec)

time3 - time taken by the leader of the race in the 3rd section only (sec)

time4 - time taken by the leader of the race in the 4th section only, if any (sec)

time5 - time taken by the leader of the race in the 5th section only, if any (sec)

time6 - time taken by the leader of the race in the 6th section only, if any (sec)

time7 - time taken by the leader of the race in the 7th section only, if any (sec)

place_combination1 - placing horse no (1st)

place_combination2 - placing horse no (2nd)

place_combination3 - placing horse no (3rd)

place_combination4 - placing horse no (4th)

place_dividend1 - placing dividend paid (for place_combination1)

place_dividend2 - placing dividend paid (for place_combination2)

place_dividend3 - placing dividend paid (for place_combination2)

place_dividend4 - placing dividend paid (for place_combination2)

win_combination1 - winning horse no

win_dividend1 - winning dividend paid (for win_combination1)

win_combination2 - joint winning horse no, if any

win_dividend2 - winning dividend paid (for win_combination2, if any)

runs.csv

Each line describes the characteristics of one horse run, in one of the races given in races.csv.

race_id - unique identifier for the race

horse_no - the number assigned to this horse, in the race

horse_id - unique identifier for this horse

result - finishing position of this horse in the race

won - whether horse won (1) or otherwise (0)

lengths_behind - finishing position, as the number of horse lengths behind the winner

horse_age - current age of this horse at the time of the race

horse_country - country of origin of this horse

horse_type - sex of the horse, e.g. 'Gelding', 'Mare', 'Horse', 'Rig', 'Colt', 'Filly'

horse_rating - rating number assigned by HKJC to this horse at the time of the race

horse_gear - string representing the gear carried by the horse in the race. An explanation of the codes used may be found on the HKJC website.

declared_weight - declared weight of the horse and jockey, in lbs

actual_weight - actual weight carried by the horse, in lbs

draw - post position number of the horse in this race

position_sec1 - position of this horse (ranking) in section 1 of the race

position_sec2 - position of this horse (ranking) in section 2 of the race

position_sec3 - position of this horse (ranking) in section 3 of the race

position_sec4 - position of this horse (ranking) in section 4 of the race, if any

position_sec5 - position of this horse (ranking) in section 5 of the race, if any

position_sec6 - position of this horse (ranking) in section 6 of the race, if any

behind_sec1 - position of this horse (lengths behind leader) in section 1 of the race

behind_sec2 - position of this horse (lengths behind leader) in section 2 of the race

behind_sec3 - position of this horse (lengths behind leader) in section 3 of the race

behind_sec4 - position of this horse (lengths behind leader) in section 4 of the race, if any

behind_sec5 - position of this horse (lengths behind leader) in section 5 of the race, if any

behind_sec6 - position of this horse (lengths behind leader) in section 6 of the race, if any

time1 - time taken by the horse to pass through the 1st section of the race (sec)

time2 - time taken by the horse to pass through the 2nd section of the race (sec)

time3 - time taken by the horse to pass through the 3rd section of the race (sec)

time4 - time taken by the horse to pass through the 4th section of the race, if any (sec)

time5 - time taken by the horse to pass through the 5th section of the race, if any (sec)

time6 - time taken by the horse to pass through the 6th section of the race, if any (sec)

finish_time - finishing time of the horse in this race (sec)

win_odds - win odds for this horse at start of race

place_odds - place (finishing in 1st, 2nd or 3rd position) odds for this horse at start of race

trainer_id - unique identifier of the horse's trainer at the time of the race

jockey_id - unique identifier of the jockey riding the horse in this race

Acknowledgements

None of this research would have even started without me standing on the shoulders of giants such as William Benter, Ruth Bolton and Randall Chapman and many others who have published the results of their research.

Inspiration

It is probably not going to be enough to just take this dataset and feed it into Google Cloud Machine Learning, Azure MI, etc... but let me know if you find otherwise!

Questions that need to be answered include:

  • Feature engineering - what features are needed and how best to estimate them from the data given?
  • Modelling - what kind of model works best? Maybe more than one model?
  • Other data - is there any other data needed, apart from that given in this data set?

Tables

Races

@kaggle.gdaley_hkracing.races
  • 232.12 KB
  • 6349 rows
  • 37 columns
Loading...

CREATE TABLE races (
  "race_id" BIGINT,
  "date" TIMESTAMP,
  "venue" VARCHAR,
  "race_no" BIGINT,
  "config" VARCHAR,
  "surface" BIGINT,
  "distance" BIGINT,
  "going" VARCHAR,
  "horse_ratings" VARCHAR,
  "prize" DOUBLE,
  "race_class" BIGINT,
  "sec_time1" DOUBLE,
  "sec_time2" DOUBLE,
  "sec_time3" DOUBLE,
  "sec_time4" DOUBLE,
  "sec_time5" DOUBLE,
  "sec_time6" DOUBLE,
  "sec_time7" VARCHAR,
  "time1" DOUBLE,
  "time2" DOUBLE,
  "time3" DOUBLE,
  "time4" DOUBLE,
  "time5" DOUBLE,
  "time6" DOUBLE,
  "time7" VARCHAR,
  "place_combination1" BIGINT,
  "place_combination2" BIGINT,
  "place_combination3" DOUBLE,
  "place_combination4" DOUBLE,
  "place_dividend1" DOUBLE,
  "place_dividend2" DOUBLE,
  "place_dividend3" DOUBLE,
  "place_dividend4" DOUBLE,
  "win_combination1" BIGINT,
  "win_dividend1" DOUBLE,
  "win_combination2" DOUBLE,
  "win_dividend2" DOUBLE
);

Runs

@kaggle.gdaley_hkracing.runs
  • 1.93 MB
  • 79447 rows
  • 37 columns
Loading...

CREATE TABLE runs (
  "race_id" BIGINT,
  "horse_no" BIGINT,
  "horse_id" BIGINT,
  "result" BIGINT,
  "won" DOUBLE,
  "lengths_behind" DOUBLE,
  "horse_age" BIGINT,
  "horse_country" VARCHAR,
  "horse_type" VARCHAR,
  "horse_rating" BIGINT,
  "horse_gear" VARCHAR,
  "declared_weight" DOUBLE,
  "actual_weight" BIGINT,
  "draw" BIGINT,
  "position_sec1" BIGINT,
  "position_sec2" BIGINT,
  "position_sec3" BIGINT,
  "position_sec4" DOUBLE,
  "position_sec5" DOUBLE,
  "position_sec6" DOUBLE,
  "behind_sec1" DOUBLE,
  "behind_sec2" DOUBLE,
  "behind_sec3" DOUBLE,
  "behind_sec4" DOUBLE,
  "behind_sec5" DOUBLE,
  "behind_sec6" DOUBLE,
  "time1" DOUBLE,
  "time2" DOUBLE,
  "time3" DOUBLE,
  "time4" DOUBLE,
  "time5" DOUBLE,
  "time6" DOUBLE,
  "finish_time" DOUBLE,
  "win_odds" DOUBLE,
  "place_odds" DOUBLE,
  "trainer_id" BIGINT,
  "jockey_id" BIGINT
);

Share link

Anyone who has the link will be able to view this.