Cleaning and Transforming Data for Player Recruitment: My Personal Approach to Data Cleaning
Nowadays, data has become an essential tool for identifying talent and improving team performance. It serves as a complement to video scouting, which, even in this era of AI and technology, remains the primary method for gathering critical player insights. However, data adds value by helping you answer “where and who to look at,” narrowing the scope of analysis. The real challenge lies in collecting it properly and unlocking its true potential: cleaning and transforming it for meaningful analysis. In this article, I’ll share my personal approach to cleaning and preparing football data from FBRef and Transfermarkt as part of my scouting project.
You can read about the process I went through to obtain this data in my previous articles:
The goal was to turn raw data from the Championship, League One and League Two into a structured dataset ready for clustering.
It is well known that Pandas (https://pandas.pydata.org/) is a great library for data cleansing tasks. There are also alternatives like Polars (https://pola.rs/), which is reportedly faster than Pandas. In the end, you can use whichever tool you prefer, as long as you feel comfortable with it and know how to proceed.
I used Pandas and the approach I took is the following:
- Standardizing column names with the ‘janitor’ package first: its clean_names() function converts all column names to lower-case snake_case automatically.
- Checking data types for consistency.
- Checking for null values.
- Mapping the players’ position abbreviations to full names: ‘MF’: ‘Midfielder’, ‘DF’: ‘Defender’, ‘FW’: ‘Forward’, ‘GK’: ‘Goalkeeper’.
- Checking and flattening the column index and subindexes so the DataFrame has only one level.
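The checks above can be sketched roughly like this. It is a minimal illustration on a toy two-level header with made-up column names; the real FBRef tables have far more columns, and I use plain Pandas string methods here in place of janitor’s clean_names() so the snippet has no extra dependencies:

```python
import pandas as pd

# Hypothetical sample resembling an FBRef standard-stats table with a
# two-level header; the real scraped data has many more columns.
raw = pd.DataFrame(
    {("Unnamed", "Player"): ["Alice", "Bob"],
     ("Unnamed", "Pos"): ["MF,DF", "FW"],
     ("Performance", "Gls"): [3, 7]}
)
raw.columns = pd.MultiIndex.from_tuples(raw.columns)

POSITION_MAP = {"MF": "Midfielder", "DF": "Defender",
                "FW": "Forward", "GK": "Goalkeeper"}

def initial_clean(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Flatten the two-level column index: ('Performance', 'Gls') -> 'Performance_Gls',
    # dropping FBRef's placeholder 'Unnamed' level
    out.columns = ["_".join(p for p in col if not p.startswith("Unnamed"))
                   for col in out.columns]
    # Lower-case snake_case names (janitor's clean_names() does this for you)
    out.columns = out.columns.str.strip().str.lower().str.replace(" ", "_")
    # Inspect dtypes and null counts
    print(out.dtypes)
    print(out.isna().sum())
    # Keep only the first listed position and map it to a full name
    out["pos"] = out["pos"].str.split(",").str[0].map(POSITION_MAP)
    return out

df = initial_clean(raw)
print(df)
```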
After checking all of that, I applied the following transformations:
- Filtered by positions: midfielders and center forwards only.
- Filtered players based on playtime: players who played over 400 minutes.
- Normalized metrics per 90 minutes: it ensures a fair comparison of player performance by accounting for differences in playing time. This way all the data would be normalized and ready to be scaled before using the clustering algorithm.
- Generating new features: the per-90 normalization itself creates new variables, and feature engineering is a central part of any machine learning project. Compared to projects where you generate thousands of features, I created just a few new metrics on top of the ones I already had, which helps narrow the scope of what I actually want to model (more on these variables in my next article about the clustering process).
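A minimal sketch of the filtering and per-90 steps above, on toy data. The column names are my assumptions; the thresholds (the 400-minute cutoff and the midfielder/forward filter) come from the text:

```python
import pandas as pd

# Hypothetical cleaned frame; column names are illustrative, not the
# exact ones from the real dataset.
df = pd.DataFrame({
    "player": ["A", "B", "C", "D"],
    "pos": ["Midfielder", "Forward", "Defender", "Midfielder"],
    "minutes": [900, 350, 1200, 450],
    "goals": [4, 2, 1, 3],
    "shots": [20, 8, 5, 12],
})

MIN_MINUTES = 400
KEEP_POSITIONS = {"Midfielder", "Forward"}
COUNT_COLS = ["goals", "shots"]

# Position and minutes filters
filtered = df[df["pos"].isin(KEEP_POSITIONS) & (df["minutes"] > MIN_MINUTES)].copy()

# Per-90 normalization: counting stats divided by 90-minute units played,
# creating new *_per90 features for the clustering step
for col in COUNT_COLS:
    filtered[f"{col}_per90"] = filtered[col] / (filtered["minutes"] / 90)

print(filtered[["player", "goals_per90", "shots_per90"]])
```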
On a side note: I had created a second positions column named ‘position_2_stats_standard’ because FBRef sometimes lists two positions for a player, like [MF, DF] (midfielder and defender). Since I only needed the first position (a MF, DF player stays as MF), I dropped that column and kept the original.
I ended up creating a function in a Python file to make it extensible to the three leagues, but when I was prototyping in a Jupyter notebook I played around with a way of cleaning data in Pandas that I really love, called “chaining”. In a nutshell, method chaining is a style of Pandas code that strings together multiple operations on a DataFrame or Series. It’s like building a recipe step by step: each method adds a new layer of transformation, keeping your workflow clean and readable. I got acquainted with this thanks to Matt Harrison’s excellent book Effective Pandas: Patterns for Data Manipulation (highly recommended if you want to improve your Pandas skills).
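To illustrate the chaining style on toy data (the column names are assumptions, not the real dataset’s), the filtering and per-90 steps could be written as one chain:

```python
import pandas as pd

# Toy frame to show the chaining style; the real pipeline has far more
# steps and FBRef-specific columns.
raw = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "Pos": ["MF", "FW", "GK"],
    "Min": [900, 450, 1200],
    "Gls": [4, 3, 0],
})

cleaned = (
    raw
    .rename(columns=str.lower)                               # tidy column names
    .loc[lambda d: d["pos"].isin(["MF", "FW"])]              # positions of interest
    .loc[lambda d: d["min"] > 400]                           # minimum-minutes filter
    .assign(gls_per90=lambda d: d["gls"] / (d["min"] / 90))  # new per-90 feature
    .reset_index(drop=True)
)
print(cleaned)
```

Each step returns a new DataFrame, so the whole pipeline reads top to bottom like a recipe and nothing mutates the original `raw` frame.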
The result would be a DataFrame like the following:
So far, I’ve described the process of cleaning the FBRef data (let’s call it the combined leagues data), but since my goal is to merge it with the data scraped from Transfermarkt, I needed to clean the latter before doing the merge. The Transfermarkt data looks different from the FBRef data: what I have here is the players’ general information, like team, age, season, country, position and value.
From these variables, I only needed ‘value’ to preprocess and merge with the FBRef combined leagues data, since the players’ name, age, country, position and team were already there, so it made sense to include only the ‘value’ column. In the end, what I want is to shortlist midfielders and forwards who perform well and are affordable for the club.
And again, I used a function that executes a Pandas chain by:
- Dropping null values from the ‘value’ column.
- Removing the euro symbol.
- Replacing the abbreviated amounts in ‘value’ with scientific notation according to the unit (thousands and millions).
- Converting the resulting strings into numeric values.
- And normalizing player names so they are consistent between the FBRef combined leagues data and Transfermarkt for the merge.
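A sketch of that chain and the merge, under stated assumptions: the value strings (‘€1.5m’, ‘€500k’), the name-normalization helper, and all column names here are mine for illustration, not the article’s exact code:

```python
import unicodedata

import pandas as pd

def normalize_name(name: str) -> str:
    # Strip accents and lower-case so 'José Silva' vs 'Jose Silva'
    # still matches across the two sources
    ascii_name = (unicodedata.normalize("NFKD", name)
                  .encode("ascii", "ignore").decode("ascii"))
    return ascii_name.lower().strip()

# Hypothetical Transfermarkt-style rows
tm = pd.DataFrame({
    "player": ["José Silva", "John Smith", "No Value"],
    "value": ["€1.5m", "€500k", None],
})

tm_clean = (
    tm
    .dropna(subset=["value"])                        # drop players with no valuation
    .assign(
        value_eur=lambda d: (
            d["value"]
            .str.replace("€", "", regex=False)       # strip the euro symbol
            .str.replace("m", "e6", regex=False)     # millions -> scientific notation
            .str.replace("k", "e3", regex=False)     # thousands -> scientific notation
            .astype(float)                           # now parseable as numbers
        ),
        player_key=lambda d: d["player"].map(normalize_name),
    )
)

# Hypothetical slice of the FBRef combined leagues data
fbref = pd.DataFrame({
    "player": ["Jose Silva", "John Smith"],
    "goals_per90": [0.4, 0.6],
})

df_training = (
    fbref
    .assign(player_key=lambda d: d["player"].map(normalize_name))
    .merge(tm_clean[["player_key", "value_eur"]], on="player_key")
)
print(df_training)
```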
The final output is a merged DataFrame which I called ‘df_training’, ready to use for K-Means clustering, a process I’ll go through in detail in my next article.
Up to here, I’ve shown, at a high level, the process I usually go through for cleaning data and generating new features. I want to emphasize the importance of cleaning your data, or at least trying to as a first step, before any analysis. It saves you headaches and gives you a better model and better results, especially when analyzing player performance.
Thanks for reading and staying till the end, I hope you gain valuable insights and I’ll see you in my next article.
Always happy to chat!
You can follow me at:
X/Twitter account 👇🏻
Bluesky: https://bsky.app/profile/thenetpass.bsky.social
👉🏻 LinkedIn profile: https://www.linkedin.com/in/ricardo-heredia-9b97b0181/