Football Player Ranking with Classification and Streamlit App

Ugur Selim Ozen
İstanbul Data Science Academy
5 min read · Mar 7, 2022


0. Introduction

Hi everyone! The third project of the Data Science Bootcamp run in partnership by Istanbul Data Science Academy and Hepsiburada has been completed with presentations, and so far it has been a very instructive experience. Below, I explain this third project, which my teammate Furkan Kutuk and I completed during that period: football player ranking with classification, using Python together with NumPy, pandas, Matplotlib, seaborn, scikit-learn, XGBoost, LightGBM, CatBoost, Streamlit, Tableau, and SQLite.

First of all, you can visit the project's GitHub repository from here and the Streamlit app's GitHub repository from here.

1. Problem Statement

We are a sports analytics consulting team serving the football domain. Our client has asked us to build a classification model that identifies the most talented football players.

For this purpose, we built a variety of classification models using data read from an SQLite database.

2. Methodology

We defined the following roadmap as our project methodology:

  • Data Reading from DB with sqlite3
  • Data Cleaning and Transformation
  • Exploratory Data Analysis (EDA) with Tableau
  • Feature Selection and Modeling
  • Model Prediction
  • Interpreting Results
  • Football Player Ranking Streamlit App

3. Data Reading from DB with SQLite3

Using the following functions that we developed, we read all the tables that had previously been written to the database via sqlite3.

Figure 1: Data Reading from DB Process

getTableNames()

Using this function, we can get all table names from the SQLite DB.

getAllTables()

Using this function, we can get all table records from the SQLite DB.

mergeAllData()

Using this function, we can get the necessary table records from the DB and merge them.

cleanAndgetCSV()

Using this function, we can write the merged and cleaned data to a CSV file for use in model building in the next steps.

So, if we call the cleanAndgetCSV() function after defining all four functions, it writes the data to a CSV file, which we can then read with pandas. Our data overview follows:

Figure 2: Read Data Overview
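Since the original code for these helpers is shown only as an image, here is a minimal sketch of how the four functions might look; the database path, the table names (Player, Player_Attributes), and the join key player_api_id are illustrative assumptions rather than the project's actual schema:

```python
import sqlite3
import pandas as pd

DB_PATH = "database.sqlite"  # hypothetical path to the SQLite file


def getTableNames(db_path=DB_PATH):
    """Return the names of all tables in the SQLite database."""
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table'"
        ).fetchall()
    return [r[0] for r in rows]


def getAllTables(db_path=DB_PATH):
    """Read every table into a pandas DataFrame, keyed by table name."""
    with sqlite3.connect(db_path) as conn:
        return {
            name: pd.read_sql_query(f"SELECT * FROM {name}", conn)
            for name in getTableNames(db_path)
        }


def mergeAllData(db_path=DB_PATH):
    """Merge the player tables on their shared key (assumed here)."""
    tables = getAllTables(db_path)
    return tables["Player"].merge(
        tables["Player_Attributes"], on="player_api_id", how="inner"
    )


def cleanAndgetCSV(db_path=DB_PATH, out_path="players.csv"):
    """Drop fully empty rows and persist the merged data as CSV."""
    df = mergeAllData(db_path).dropna(how="all")
    df.to_csv(out_path, index=False)
    return df
```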

4. Data Cleaning and Transformation

After reading the data into a dataframe, we applied some transformations to the columns, such as extracting specific date parts from DateTime values and typecasting certain columns with the split and astype functions, as follows:
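A minimal sketch of such transformations on a toy dataframe (the column names are assumptions, not the project's actual schema):

```python
import pandas as pd

# Toy frame mimicking the player data; values and columns are illustrative
df = pd.DataFrame({
    "date": ["2015-09-21 00:00:00", "2016-02-18 00:00:00"],
    "birthday": ["1987-06-24 00:00:00", "1985-02-05 00:00:00"],
    "height": ["170.18", "187.96"],
})

# Extract the date part of a DateTime string with split
df["date"] = df["date"].str.split(" ").str[0]

# Extract the birth year, then typecast it to int
df["birth_year"] = df["birthday"].str.split("-").str[0].astype(int)

# Typecast a numeric column stored as text
df["height"] = df["height"].astype(float)
```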

Finally, we applied the following methods to handle the “Unknown” values:

  • Filling with mode
  • Filling with mean

As an example of the “Unknown” value handling code:
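A hedged sketch of both methods on toy columns (the column names are assumptions): a categorical “Unknown” is replaced with the mode, and a numeric gap is filled with the mean.

```python
import pandas as pd

# Toy columns with placeholder values; names are illustrative
df = pd.DataFrame({
    "preferred_foot": ["right", "Unknown", "left", "right"],
    "vision": [70.0, 65.0, None, 80.0],
})

# Categorical column: replace "Unknown" with the mode of the known values
mode_value = df.loc[df["preferred_foot"] != "Unknown", "preferred_foot"].mode()[0]
df["preferred_foot"] = df["preferred_foot"].replace("Unknown", mode_value)

# Numeric column: fill missing values with the mean
df["vision"] = df["vision"].fillna(df["vision"].mean())
```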

5. Exploratory Data Analysis (EDA)

Using Tableau, we visualized some features against the target variable, OVERALL_RATING, to see whether any informative patterns emerge:

Figure 3: Overall_rating vs Vision
Figure 4: Overall_rating vs Finishing
Figure 5: Overall_rating vs Potential
Figure 6: Overall_rating vs Reactions
Figure 7: Overall_rating vs Age vs Date

As the bar and time-series charts above show, these five features have positive or negative informative relationships with the target variable OVERALL_RATING. You can also see the correlation heatmap below:

Figure 8: Correlation Heatmap

6. Feature Selection and Modeling

We built our model with 19 features in total and used the following approach in the model-building process:

Figure 9: Train-Validation-Test Split
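Such a split can be sketched with scikit-learn's train_test_split, here on synthetic data standing in for the 19-feature matrix; the 60/20/20 proportions are an assumption for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 19-feature player matrix
rng = np.random.default_rng(42)
X = rng.random((100, 19))
y = rng.integers(0, 2, 100)

# First carve out the test set, then split the rest into train/validation
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)  # 0.25 * 0.8 = 0.2
```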

Our data is imbalanced, with far more 0 values than 1s. Therefore, we applied oversampling to the training dataset with the SMOTE function, as given below:

Figure 10: Oversampling in Imbalanced Data

The code below shows the building, prediction, evaluation, and K-Fold cross-validation of several models using the train-test split method:
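A condensed sketch of that loop using scikit-learn estimators on synthetic data; the gradient-boosting libraries named earlier (XGBoost, LightGBM, CatBoost) would slot into the same dictionary in the same way:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the cleaned 19-feature dataset
X, y = make_classification(n_samples=600, n_features=19, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(random_state=42),
    "GradientBoosting": GradientBoostingClassifier(random_state=42),
    # e.g. "XGBoost": XGBClassifier(...) would be added here the same way
}

for name, model in models.items():
    # Build and predict with the train-test split
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Evaluate with precision/recall/F1 and 5-fold cross-validation
    cv_f1 = cross_val_score(model, X_train, y_train, cv=5, scoring="f1")
    print(name)
    print(classification_report(y_test, y_pred))
    print("5-fold CV F1:", cv_f1.mean())
```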

7. Model Prediction and Interpreting Results

We used precision, recall, and F1 score as evaluation metrics in model prediction, computing them for both the train-test (80%-20%) split and K-Fold cross-validation.

Figure 11: Model Prediction Results

As the prediction results table above shows, the algorithms built on ensemble methods gave better test results than the others. We therefore selected the XGBoost model, as it was the best according to the evaluation metrics.

8. Football Player Ranking Streamlit App

To serve predictions from the model we built, we developed an interactive web app using Streamlit and deployed it to a live web environment with Heroku; a sample figure from the app is given below. You can test the live Streamlit app from here or visit the Streamlit app's GitHub repository from here.

Figure 12: Streamlit App

9. Conclusion

That brings us to the end of the article, in which I have tried to explain the third project of our Data Science Bootcamp in detail. As a reminder, you can visit the project's GitHub repository from here. If you wish, you can follow me on Medium. Hope to see you in my next article…


Data Engineer developing his skills in Big Data Processing, Data Analytics, Machine Learning, and Cloud Technologies. https://ugurselimozen.github.io/