Baselight

Tennis ATP Tour Australian Open Final 2019

A tribute to Novak Djokovic and Rafael Nadal

@kaggle.robseidl_tennis_atp_tour_australian_open_final_2019

About this Dataset

Context

Nowadays, in most sports either tracking or event data is available for sports data scientists to analyse leagues, teams, games or players. For example, in soccer, event-based data is available for all major leagues from professional data providers like Opta, StatsBomb or Wyscout. For tennis this is different. Even though camera-based tracking with Hawk-Eye is possible, this data is not available to the public, and only the largest courts are equipped with the system.
When I think about the latest breakthroughs in machine learning in image classification, detection, NLP (deepl.com) and audio recognition (Siri, Alexa), it is evident that all of these areas have a huge amount of easily accessible data.
Personally, I expect that there would be far more research in tennis if a large amount of match data were freely available.
Statistics for all matches played on the ATP Tour are available from different sources. For example, Jeff Sackmann's GitHub repository is a great place to start. He also runs a match charting project in which point-by-point data is collected.
But when I think about tennis, I think about the movement of the players, their tactics, etc. It is the ball movement, the actual rallies and shots that I want to be able to see and analyse.

Event data makes it possible to capture positional, temporal and stroke information.
As a proof of concept, and a tribute to Novak Djokovic and Rafael Nadal, two of the greatest tennis players of all time, I manually annotated each rally and stroke of their Australian Open final 2019. Fortunately for me, it only went to three sets.

Content

The data consists of all points played in the match. It is built hierarchically, from events, to rallies, to actual points.

  • Points: A list of all points played in the final, with information about the server, receiver, point type, number of strokes, time of rally, and new score of the game.
  • Rallies: A list of all rallies with server, returner, etc.
  • Events: Each time a player hit the ball, the stroke type, position of the player, and position of the opponent were recorded.
  • Serves: For each successful serve (i.e. one that was not a fault), the position of the serve in the service box was recorded (whenever possible).
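
The hierarchy above can be illustrated with a small pandas sketch. The tables below are tiny made-up stand-ins, and the column names (rallyid, strokeid, hitter, server, winner) are assumptions chosen for illustration; check them against the actual files before reusing the code.

```python
import pandas as pd

# Toy stand-in for the events table: one row per stroke.
# Column names are hypothetical, not taken from the dataset.
events = pd.DataFrame({
    "rallyid": [1, 1, 1, 2, 3, 3],
    "strokeid": [1, 2, 3, 1, 1, 2],
    "hitter": ["Nadal", "Djokovic", "Nadal", "Nadal", "Djokovic", "Nadal"],
})

# Toy stand-in for the points table: one row per point.
points = pd.DataFrame({
    "rallyid": [1, 2, 3],
    "server": ["Nadal", "Nadal", "Djokovic"],
    "winner": ["Djokovic", "Nadal", "Djokovic"],
})

# Roll events up to the rally level: number of strokes per rally.
strokes_per_rally = (
    events.groupby("rallyid").size().rename("strokes").reset_index()
)

# Attach the rally-level summary to the point-level table.
points_with_strokes = points.merge(strokes_per_rally, on="rallyid", how="left")

# Share of service points won by each server.
serve_points_won = (
    (points["server"] == points["winner"]).groupby(points["server"]).mean()
)

print(points_with_strokes)
print(serve_points_won)
```

Going from events to rallies to points is then a matter of grouping on the shared rally identifier and merging the result one level up.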

I have already done the hard part of data cleaning, and the dataset is hopefully easy to understand and ready to use.

Positions
The x, y positions are with respect to the court coordinate system shown in Figure 1. They were calculated from the pixel coordinates through a direct linear transformation at the beginning of the match. (As the camera angle changed a bit during the match, some of the positions are off.)
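
For readers unfamiliar with the technique, here is a minimal sketch of how a direct linear transformation can map pixel coordinates to court coordinates. It is not the code used to build the dataset. The pixel corners are invented, and the court coordinate system (origin at one corner of the 10.97 m by 23.77 m doubles court, units in metres) is an assumption that may differ from the one in Figure 1.

```python
import numpy as np

def fit_homography(pixel_pts, court_pts):
    """Estimate the 3x3 homography H (court ~ H @ pixel) with the DLT."""
    rows = []
    for (u, v), (x, y) in zip(pixel_pts, court_pts):
        # Each correspondence gives two linear equations in the entries of H.
        rows.append([-u, -v, -1, 0, 0, 0, x * u, x * v, x])
        rows.append([0, 0, 0, -u, -v, -1, y * u, y * v, y])
    A = np.asarray(rows, dtype=float)
    # The solution is the right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]

def pixel_to_court(H, pixel_pts):
    """Apply H to an array of (u, v) pixel points and dehomogenise."""
    pts = np.asarray(pixel_pts, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = homog @ H.T
    return mapped[:, :2] / mapped[:, 2:3]

# Hypothetical pixel positions of the four court corners in one video frame.
pixel_corners = [(420, 200), (860, 200), (1080, 620), (200, 620)]
# Assumed court coordinates of the same corners, in metres.
court_corners = [(0.0, 0.0), (10.97, 0.0), (10.97, 23.77), (0.0, 23.77)]

H = fit_homography(pixel_corners, court_corners)
recovered = pixel_to_court(H, pixel_corners)
print(np.round(recovered, 2))
```

Because the mapping is fitted once from fixed reference points, any later change in camera angle shifts the true correspondence and introduces position errors, which is the effect described above.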

Inspiration

Look into the data, see what you can find. Is there information about the game in positional, temporal and stroke information that can tell you more about the players and the match than simple match sheet statistics like the number of break points or first serves in?
You can use the dataset however you want, but here are some things you could start with.

  • It is a great way to practise pandas by generating general statistics like points played, serve percentages, games won, break points etc. and comparing them with the statistics from other websites.
  • You can visualize the spatial positioning of the players on the court. For example, is there a difference between the return positions of Nadal and Djokovic?
  • You can calculate movement statistics like distance covered.
  • You can calculate the percentage of forehands and backhands, or of shot types like slice and topspin, for each player.
  • You can find out where the players are serving to. (Do not forget that Nadal is a lefty.)
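
As a starting point for two of the ideas above, the following sketch computes distance covered and forehand/backhand shares per player. The events table is a tiny made-up example, and all column names (rallyid, hitter, receiver, stroke, hitter_x, hitter_y, receiver_x, receiver_y) are assumptions for illustration rather than the dataset's documented schema.

```python
import numpy as np
import pandas as pd

# Toy events table; values and column names are hypothetical.
events = pd.DataFrame({
    "rallyid":    [1, 1, 1, 2, 2],
    "hitter":     ["Djokovic", "Nadal", "Djokovic", "Nadal", "Djokovic"],
    "receiver":   ["Nadal", "Djokovic", "Nadal", "Djokovic", "Nadal"],
    "stroke":     ["forehand", "backhand", "forehand", "forehand", "backhand"],
    "hitter_x":   [5.0, 4.0, 8.0, 6.0, 2.0],
    "hitter_y":   [0.0, 24.0, 0.0, 24.0, 0.0],
    "receiver_x": [4.0, 5.0, 4.0, 2.0, 6.0],
    "receiver_y": [24.0, 0.0, 20.0, 0.0, 24.0],
})

def player_track(events, player):
    """Position of one player at every event, whether hitting or receiving."""
    is_hitter = events["hitter"] == player
    return pd.DataFrame({
        "rallyid": events["rallyid"],
        "x": np.where(is_hitter, events["hitter_x"], events["receiver_x"]),
        "y": np.where(is_hitter, events["hitter_y"], events["receiver_y"]),
    })

def distance_covered(events, player):
    """Sum of straight-line distances between consecutive events in a rally."""
    track = player_track(events, player)
    # diff() within each rally, so movement between rallies is not counted.
    steps = track.groupby("rallyid")[["x", "y"]].diff()
    return float(np.hypot(steps["x"], steps["y"]).sum())

# Share of each stroke type per player.
stroke_share = (
    events.groupby("hitter")["stroke"]
    .value_counts(normalize=True)
    .unstack(fill_value=0.0)
)

print(distance_covered(events, "Djokovic"), distance_covered(events, "Nadal"))
print(stroke_share)
```

Straight-line distance between annotated events is only a lower bound on the real distance run, since the players' paths between strokes are not recorded.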

To get you started, I have created a sample kernel. Find it here.
