Tennis ATP Tour Australian Open Final 2019
A tribute to Novak Djokovic and Rafael Nadal
@kaggle.robseidl_tennis_atp_tour_australian_open_final_2019
Nowadays, in most sports either tracking or event data is available for sports data scientists to analyse leagues, teams, games or players. For example, in soccer, event-based data is available for all major leagues from professional data providers like Opta, StatsBomb or Wyscout. For tennis this is different. Even though camera-based tracking with Hawk-Eye is possible, this data is not available to the public, and only the largest courts are equipped with the system.
When I think about the latest breakthroughs in machine learning in image classification, detection, NLP (deepl.com) and audio recognition (Siri, Alexa), it is evident that all of these areas provide a huge amount of easily accessible data.
Personally, I expect that there would be far more research in tennis if a large amount of match data were freely available.
Statistics for all matches played on the ATP Tour are available from different sources. For example, Jeff Sackmann's GitHub repository is a great place to start. He also runs a match charting project in which point-by-point data is collected.
But when I think about tennis, it is about the movement of the players, their tactics, etc. It is the ball movement, the actual rallies and shots I want to be able to see and analyse.
Event data makes it possible to capture positional, temporal and stroke information.
As a proof of concept, and a tribute to Novak Djokovic and Rafael Nadal, two of the greatest tennis players of all time, I manually annotated each rally and stroke of their Australian Open final 2019. Fortunately for me, it lasted only three sets.
The data consists of all points played in the match. It is built hierarchically from events, to rallies, to actual points.
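To make the hierarchy concrete, here is a minimal sketch of how events, rallies and points could nest. All class and field names are hypothetical and are not the dataset's actual columns; the values in the example are placeholders, not taken from the match. That a point may contain more than one rally (for instance a faulted first serve followed by a second-serve rally) is also an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Event:
    """One annotated stroke (hypothetical fields)."""
    hitter: str   # player who hits the ball
    stroke: str   # e.g. "serve", "forehand", "backhand"
    x: float      # court x position
    y: float      # court y position
    time: float   # seconds since the start of the rally


@dataclass
class Rally:
    """A sequence of events that starts with a serve."""
    server: str
    events: List[Event] = field(default_factory=list)

    def stroke_count(self) -> int:
        return len(self.events)


@dataclass
class Point:
    """A scored point, made up of one or more rallies (assumption)."""
    winner: str
    rallies: List[Rally] = field(default_factory=list)

    def total_strokes(self) -> int:
        return sum(r.stroke_count() for r in self.rallies)


# Placeholder values for illustration only.
example_rally = Rally(
    server="Djokovic",
    events=[
        Event("Djokovic", "serve", 1.0, 0.0, 0.0),
        Event("Nadal", "forehand", 5.0, 23.0, 0.9),
        Event("Djokovic", "backhand", 2.0, 1.0, 2.1),
    ],
)
example_point = Point(winner="Djokovic", rallies=[example_rally])
```

With a structure like this, per-point summaries such as rally length follow directly from the nesting, for example `example_point.total_strokes()`.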
I have already done the hard part of data cleaning, and the dataset is hopefully easy to understand and ready to use.
Positions
The x, y positions are with respect to the court coordinate system shown in Figure 1. They were calculated from the pixel coordinates through a direct linear transformation at the beginning of the match. (As the camera angle changed a bit during the match, some of the positions are off.)
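As an illustration of how pixel coordinates can be mapped to court coordinates, the sketch below fits a planar homography from four point correspondences, a common four-point form of the direct linear transformation. It is not necessarily the exact procedure used for this dataset. The pixel coordinates are invented, and placing the origin at one corner of the doubles court (10.97 m by 23.77 m) is an assumption that may differ from the coordinate system in Figure 1.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_homography(pixel_pts, court_pts):
    """Estimate a 3x3 homography (h33 fixed to 1) from exactly 4 point pairs."""
    if len(pixel_pts) != 4 or len(court_pts) != 4:
        raise ValueError("this sketch expects exactly 4 correspondences")
    a, b = [], []
    for (u, v), (x, y) in zip(pixel_pts, court_pts):
        # x = (h11*u + h12*v + h13) / (h31*u + h32*v + 1), rearranged to be linear in h
        a.append([u, v, 1.0, 0.0, 0.0, 0.0, -u * x, -v * x])
        b.append(x)
        # y = (h21*u + h22*v + h23) / (h31*u + h32*v + 1)
        a.append([0.0, 0.0, 0.0, u, v, 1.0, -u * y, -v * y])
        b.append(y)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def pixel_to_court(hmat, u, v):
    """Map one pixel position to court coordinates in metres."""
    w = hmat[2][0] * u + hmat[2][1] * v + hmat[2][2]
    x = (hmat[0][0] * u + hmat[0][1] * v + hmat[0][2]) / w
    y = (hmat[1][0] * u + hmat[1][1] * v + hmat[1][2]) / w
    return x, y


# Hypothetical pixel positions of the four court corners in a broadcast frame.
PIXEL_CORNERS = [(400.0, 200.0), (880.0, 200.0), (1100.0, 650.0), (180.0, 650.0)]
# Matching court corners in metres, origin at one corner (assumption).
COURT_CORNERS = [(0.0, 0.0), (10.97, 0.0), (10.97, 23.77), (0.0, 23.77)]

H = fit_homography(PIXEL_CORNERS, COURT_CORNERS)
```

Because the transformation is fitted once from a single frame, any later change in camera angle shifts the mapped positions, which is consistent with the caveat above. Refitting the corners on later frames would be one way to check how large the drift is.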
Look into the data, see what you can find. Is there information about the game in positional, temporal and stroke information that can tell you more about the players and the match than simple match sheet statistics like the number of break points or first serves in?
You can use the dataset however you want, but here are some things you could start with.
To get you started, I have created a sample kernel. Find it here.