# Predicting high-potential FIFA players using individual performance data
Merete Lutz, Jake Barnabe, Simon Frew, Waleed Mahmood

DSCI 522, Group 17

In [None]:
import pandas as pd
from myst_nb import glue

model_cross_val_scores_df = pd.read_csv("../results/tables/model_cross_val_scores.csv", index_col=0)
# index_col=0 above already uses the first CSV column as the index; the previous
# set_index call discarded its result and had no effect, so it is dropped here.
glue("model_cross_val_scores_df", model_cross_val_scores_df, display=False)

hyperparameter_rankings_df = pd.read_csv("../results/tables/hyperparameter_rankings.csv", index_col=0).round(3)
hyperparameter_rankings_df.index.names = ['Rank test score']
hyperparameter_rankings_df.rename(columns = {
    'mean_test_score': 'Mean test score:',
    'mean_train_score': 'Mean train score:',
    'param_svc__C': 'Value of C:',
    'param_svc__gamma': 'Value of gamma:',
    'mean_fit_time': 'Mean fit time:'
}, inplace=True)
glue("hyperparameter_rankings_df", hyperparameter_rankings_df, display=False)

hyperparameter_best_test_score = hyperparameter_rankings_df.iloc[0,0]
glue("hyperparameter_best_test_score", hyperparameter_best_test_score, display=False)

best_c = hyperparameter_rankings_df.iloc[0, 2]
glue("best_c", best_c, display=False)
best_gamma = hyperparameter_rankings_df.iloc[0, 3]
glue("best_gamma", best_gamma, display=False)

test_score_df = pd.read_csv("../results/tables/test_score.csv", index_col=0).T.round(3)
glue("test_score_df", test_score_df, display=False)

test_score = test_score_df.iloc[0,0]
glue("test_score", test_score, display=False)


## Summary

We build a classification model using a support vector machine with a radial basis function kernel (RBF SVM) that uses FIFA22 player attribute ratings to classify players' potential into four target classes: "Low", "Medium", "Good", and "Great". The classes are split on the quartiles of the distribution of the FIFA22 potential ratings. Our model performed reasonably well on the test data, with an accuracy score of {glue:text}`test_score:.3f` using the hyperparameters C: {glue:text}`best_c:.1f` and gamma: {glue:text}`best_gamma:.3f`. However, we believe there is still significant room for improvement before the model is ready to be used by soccer clubs and coaching staff to predict the potential of players on the field instead of on the screen.

## Introduction

One of the most challenging jobs for sports coaches is deciding which players will make a positive addition to the team {cite}`how_to_evaluate_soccer_players`. A key step in evaluating which players to add to a team is predicting how their skill level will change over time. We can think of this in terms of their potential. FIFA22 by EA Sports is the world's leading soccer video game. For each year's release, EA Sports rates players' skill levels in various aspects of the game, such as shooting, passing, and defending, and gives each player an overall rating as well as a potential rating.

Here we ask whether a machine learning model can classify players by their potential given their attribute ratings. To evaluate player talent, we binned the continuous potential variable into four classes: "Low", "Medium", "Good", and "Great". Answering this question matters because a model that accurately predicts the potential of players in FIFA22 could later be applied to the evaluation of soccer players in real life, where coaches and scouts could use it to help clubs decide which players to add to the team and which to let go.
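As a minimal sketch of the binning step, the snippet below splits a small set of made-up potential ratings on their quartiles with `pandas.qcut`. The ratings are hypothetical, and `qcut` is only one way to perform this split; the binning code used in the actual analysis is not shown in this report.

```python
import pandas as pd

# Hypothetical potential ratings; the real values come from the FIFA22 dataset.
potential = pd.Series([62, 65, 68, 70, 71, 73, 76, 80], name="potential")

# Split on the quartiles of the distribution into four ordered classes.
potential_class = pd.qcut(
    potential,
    q=4,
    labels=["Low", "Medium", "Good", "Great"],
)

# Each quartile-based class should hold roughly a quarter of the players.
print(potential_class.value_counts().sort_index())
```

Because the classes are defined by quartiles, they are approximately balanced by construction, which makes accuracy a reasonable first metric.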

## Methods
### Data
The data used in this analysis are from the video game FIFA22 by EA Sports. The data were downloaded with authentication from [Kaggle](https://www.kaggle.com/datasets/stefanoleone992/fifa-22-complete-player-dataset) and without authentication from [Sports-Statistics.com](https://sports-statistics.com/sports-data/fifa-2022-dataset-csvs/). According to the dataset documentation, the data were scraped from a publicly available website (https://sofifa.com/) with a permissive `robots.txt`.


Each row of the dataset corresponds to a single player and contains biometric information, ratings for various skills (such as shooting accuracy, passing, and dribbling), and the player's wages and market value.

### Analysis
A Support Vector Machine (SVM) with a Radial Basis Function (RBF) kernel was used to build a classification model that predicts a player's potential class (derived from the `potential` column of the dataset). The features used to fit the model were selected from the player statistics in the dataset, including `speed`, `dribbling`, `shooting`, etc. The hyperparameters `gamma` and `C` were chosen using `RandomizedSearchCV`, the randomized hyperparameter search from `scikit-learn`. The Python programming language {cite}`van2009introduction` and the following Python packages were used to perform the analysis: NumPy {cite}`harris2020array`, pandas {cite}`mckinney2010data`, Altair {cite}`vanderplas2018altair`, scikit-learn {cite}`pedregosa2011scikit` and SciPy {cite}`virtanen2020scipy`. The code used to perform the analysis and create this report can be found here: <https://github.com/UBC-MDS/2023_dsci522-group17/>.
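The following is a minimal sketch of this kind of setup, not the project's actual script. It assumes a pipeline with a `StandardScaler` followed by an `SVC` step named `svc` (consistent with the `param_svc__C` column name in the results tables), uses synthetic data from `make_classification` in place of the player attributes, and picks illustrative log-uniform search ranges.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic four-class data standing in for the player attributes (assumption).
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=5,
    n_classes=4, n_clusters_per_class=1, random_state=522,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=522
)

# make_pipeline names the steps "standardscaler" and "svc",
# so the tunable parameters are svc__C and svc__gamma.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Illustrative search ranges; the ranges used in the analysis are not stated here.
param_distributions = {
    "svc__C": loguniform(1e-2, 1e2),
    "svc__gamma": loguniform(1e-3, 1e1),
}

search = RandomizedSearchCV(
    pipe, param_distributions, n_iter=10, cv=5,
    return_train_score=True, random_state=522,
)
search.fit(X_train, y_train)

print(search.best_params_)
print(round(search.best_score_, 3))             # mean cross-validation accuracy
print(round(search.score(X_test, y_test), 3))   # accuracy on held-out data
```

Putting the scaler inside the pipeline means it is refit on each training fold, so the validation folds do not leak into the scaling statistics.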

## Results and Discussion
To assess whether each predictor might be useful for predicting the class of the target variable `potential`, we plotted the distribution of each predictor in the training data set and coloured it by class (Low: blue, Medium: orange, Good: red, Great: green). The distributions were drawn after scaling all of the features in the training data set. For most of the features we kept, the class distributions overlap but differ in spread and center. The exceptions are `height_cm` and `weight_kg`, which overlap almost completely across the classes of `potential`, and `value_eur`, which shows no visible distribution for any class other than `Great`. We chose not to omit these features from our model, as they could still prove informative through interactions.



```{figure} ../results/figures/eda_plots.png
---
width: 900px
name: feature_bars_by_class
---
Comparison of the distributions of numeric predictors in the training set between the 4 levels of potential.
```

Through our model selection process, we determined that the best model for this task was an RBF SVM. To find the hyperparameter values that give the best estimator for predicting the class of `potential`, we used the hyperparameter optimization method `RandomizedSearchCV` with 5-fold cross-validation.

```{glue:figure} model_cross_val_scores_df
:figwidth: 750px
:name: "model_cross_val_scores_df"

Time to fit and score as well as test and train scores for each model we looked at.
```
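A comparison table of this shape can be produced with `cross_validate`, which returns fit time, score time, and train and test scores per fold. The sketch below is illustrative only: the candidate models (a dummy baseline, logistic regression, and an RBF SVM) and the synthetic data are assumptions, since the models compared in the actual analysis are listed only in the table itself.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic four-class data standing in for the training set (assumption).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5,
    n_classes=4, n_clusters_per_class=1, random_state=522,
)

# Hypothetical candidate models for illustration.
models = {
    "dummy": DummyClassifier(strategy="most_frequent"),
    "logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "RBF SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}

# Average the per-fold results of 5-fold cross-validation for each model.
rows = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=5, return_train_score=True)
    rows[name] = pd.DataFrame(scores).mean()

# One row per model; columns: fit_time, score_time, test_score, train_score.
comparison_df = pd.DataFrame(rows).T
print(comparison_df.round(3))
```

Note that `test_score` here is the mean validation score across folds, not the score on the held-out test set.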


We observe that the optimal hyperparameter values are C: {glue:text}`best_c:.1f` and gamma: {glue:text}`best_gamma:.3f`. The cross-validation accuracy obtained with these hyperparameters is {glue:text}`hyperparameter_best_test_score:.3f`.

```{glue:figure} hyperparameter_rankings_df
:figwidth: 750px
:name: "hyperparameter_rankings_df"

The top 5 performing models from hyperparameter optimization.
```
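A ranking table like the one above can be derived from the `cv_results_` attribute of a fitted `RandomizedSearchCV`. The sketch below assumes the same column names that appear in the results file (`mean_test_score`, `mean_train_score`, `param_svc__C`, `param_svc__gamma`, `mean_fit_time`); the data and search ranges are synthetic placeholders rather than the ones used in the analysis.

```python
import pandas as pd
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic four-class data standing in for the training set (assumption).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5,
    n_classes=4, n_clusters_per_class=1, random_state=522,
)

search = RandomizedSearchCV(
    make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    {"svc__C": loguniform(1e-2, 1e2), "svc__gamma": loguniform(1e-3, 1e1)},
    n_iter=8, cv=5, return_train_score=True, random_state=522,
)
search.fit(X, y)

columns = [
    "mean_test_score",
    "mean_train_score",
    "param_svc__C",
    "param_svc__gamma",
    "mean_fit_time",
]

# Index by rank (1 = best mean validation score) and keep the top five candidates.
rankings_df = (
    pd.DataFrame(search.cv_results_)
    .set_index("rank_test_score")
    .sort_index()
    .loc[:, columns]
    .head(5)
)
print(rankings_df)
```

The `mean_train_score` column is only present when the search is created with `return_train_score=True`; comparing it with `mean_test_score` gives a quick view of overfitting for each candidate.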

The accuracy of our model on the test data is {glue:text}`test_score:.3f`, which means it predicted quite well on unseen data and showed little overfitting. However, before the model can be used by coaches and scouts to evaluate players on the field, there are still some improvements to be made. These are explored in the next section.

```{glue:figure} test_score_df
:figwidth: 300px
:name: "test_score_df"

The final score of our model on the test data.
```

## Further Improvements
There are a few improvements that could help the model better predict the potential of a player. First, we could include the growth of a player over the years, based on their performance in games. This would give us something closer to a time-series dataset, from which we could create a feature that captures a player's growth over time. Second, we could include the effort that the player puts into their training, which could improve the predictive power of our model. Finally, we could report the probability estimates for each class of `potential`, so that a scout knows how certain the model is when it classifies a player as, for example, `Great`.
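As a rough sketch of the last suggestion, scikit-learn's `SVC` can return class probabilities when created with `probability=True`. This is one possible approach and not something the current analysis implements; the data below are synthetic and the class labels are assigned arbitrarily for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic four-class data; labels are mapped to the report's class names (assumption).
X, y_codes = make_classification(
    n_samples=300, n_features=8, n_informative=5,
    n_classes=4, n_clusters_per_class=1, random_state=522,
)
labels = np.array(["Low", "Medium", "Good", "Great"])
y = labels[y_codes]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=522
)

# probability=True fits an extra internal calibration step,
# which makes training slower than a plain SVC.
model = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", probability=True, random_state=522),
)
model.fit(X_train, y_train)

# One row per player, one probability per class, in the order of model.classes_.
proba = model.predict_proba(X_test[:3])
for row in proba:
    print(dict(zip(model.classes_, row.round(3))))
```

Because the probabilities come from a separate calibration step, the class with the highest probability can occasionally differ from the label returned by `predict`, so the two outputs are best reported together.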

## References
```{bibliography}
:style: unsrtalpha
```