Predict who will win a soccer match using tigergraph, pandas, and GSQL part 2/3

Sarthak Agrawal
5 min readFeb 21, 2021

Welcome back for part 2. If you haven’t read part 1 here is the https://sarthavader.medium.com/predict-who-will-win-a-soccer-match-using-tigergraph-pandas-and-gsql-part-1-3-65c2fef665e.

Anyway, in this article we’ll be messing around and reformatting a lot of data so it can easily go into our graph database. If you aren’t interested in the data formatting you can skip it and download the reformatted data here: https://github.com/sarthyparty/Soccer-Matches-Analysis. Then you can scroll down to the part where we map and upload the data.

Anyway, if you want to reformat the data with me, here is the link if you haven’t already downloaded it. https://www.kaggle.com/martj42/international-football-results-from-1872-to-2017.

Now we are going to be using pandas, a python data science module to edit the csv. Open up a terminal window. To install pandas, run:

pip install pandas

Next, if you have jupyter notebook installed you can run:

jupyter notebook

Alternatively, if you don’t have it and don’t want to wait 20 minutes for it to install, you can use the python console. Simply run “python” in your terminal to open up the python console.

Create a new jupyter notebook file (or open the python console) in the folder that you have downloaded the data. For me, that was my Downloads folder.

The first thing we’ll do in our jupyter notebook (or python console) is import pandas. In the first line type or paste the following. To run a line in jupyter notebook you have to hit Shift + Enter. In the python console you can just hit Enter.

import pandas as pd

Read the csv file:

df = pd.read_csv('results.csv')

Let’s take a look at our dataset:

df.head()

Here, you can see all the data given to us. We won’t be using the location of the match, so you can drop the ‘city’ and ‘country’ columns.

df.drop(['city', 'country'], axis=1, inplace=True)

Next, we’ll use the year for filtering instead of a date column. We’ll use the date for generating unique IDs.

df['year'] = df['date'].apply(lambda date: int(date[0:4]))

The above code may be difficult to understand if you do not know about lambda functions in python, so I would recommend finding a quick video about them, they’re not super complex.

If you look back at the dataset, if doesn’t clearly say who won, rather, it gives the score for the match. It may be easy for us to clearly see who won, but when making queries, we’ll want to reduce the amount of unnecessary logic.

Now to get the winning team, we’ll use a for loop. There are ways to do this involving that are shorter and faster, but using a for loop is the easiest to understand intuitively.

First we’ll create an empty list:

who_won = []

Then we’ll run a for loop to add who won for each match.

for i in range(0,41876):
if df['home_score'][i] > df['away_score'][i]:
who_won.append(df['home_team'][i])
elif df['home_score'][i] == df['away_score'][i]:
who_won.append("Tie")
else:
who_won.append(df['away_team'][i])

Finally we’ll add the list to the DataFrame:

df['winner'] = pd.Series(who_won)

We’ll make a column for the score difference, and we’ll make an isFriendly column:

df['score_diff'] = df['home_score'] - df['away_score']
df['score_diff'] = df['score_diff'].apply(abs)
df['isFriendly'] = df['tournament'].apply(lambda x: x=='Friendly')
df.drop(['tournament', 'home_score', 'away_score'], axis=1, inplace=True)

Finally, for the matches, we’ll need to make unique IDs consisting of the home and away team, and the date.

df['id'] = df['home_team'] + df['date'] + df['away_team']

Nice job! You’ve finished reformatting the data for the matches. Now we’ll get the data for each team. This will just be a list of all the unique teams.

teams = pd.DataFrame({"team": pd.Series(df['home_team'].unique())})

And now finally, we’ll need to make a dataset for the edges between the team vertex and match vertex. To do this, there will have to be two edges for each match, one for each team.

edges = pd.DataFrame({'match': [None]*2*41876, 'team': [None]*2*41876})for i in range(0,41876):
edges['match'][2*i] = df['id'][i]
edges['team'][2*i] = df['home_team'][i]
edges['match'][2*i+1] = df['id'][i]
indivs['team'][2*i+1] = df['away_team'][i]

The above code will take a couple minutes. It feels like it’s not doing anything, but it’ll be finished soon.

Now that we’ve finished, you can download the dataframes.

df.to_csv('matches.csv')
edges.to_csv('edges.csv')
teams.to_csv('teams.csv')

Now that we’ve downloaded the final data, we’ll map and upload the data to our database.

After you’ve uploaded the 3 csv files, click on the matches file to start the mapping. Make sure to select the “Has Header” option. Then add the file.

Use the button next to the file upload button and then click on the file and then the match vertex. Map the data as such.

Then repeat the process for the the edges and teams.

My columns were named weirdly when I did the data mapping, but ‘home_un’ is your ‘team’ column and ‘ids’ is your ‘match’ column.

Now that you’ve finished that, you can publish your data mapping and move onto the load data tab.

Wait for the play button on the top left to appear, and then click it. It may take 20–30 seconds, and then you should have graph statistics like these.

Congrats! You’ve just uploaded data to a graph database you made yourself. Next we’ll discuss and write queries to our graph database in order to actually predict who will win a soccer match.

Part 3 link: https://sarthavader.medium.com/predict-who-will-win-a-soccer-match-using-tigergraph-pandas-and-gsql-part-3-3-fb728e0f6c81

--

--