Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.
07-23-2019 05:00 PM
The Football Transfer window is currently open, which means players are moving around the globe for increasingly exorbitant fees.
The best place to keep track of what’s going on is the Latest Transfers page of the transfermarkt website, which captures details of the players, clubs, and fees involved.
Irfan and I were trying to work out where the money was flowing based on the transfers, so we decided to create a Neo4j Graph to help us out.
The scraping code to get the data from the transfermarkt website is in the mneedham/football-transfers repository. The transfers.json file contains JSON documents for all the transfers that happened since June 2019.
One line of the transfers.json file looks like this:
{
"season":"2019/2020",
"player":{
"href":"/antoine-griezmann/profil/spieler/125781",
"name":"Antoine Griezmann",
"position":"Centre-Forward",
"age":"28",
"image":"https://tmssl.akamaized.net//images/portrait/medium/125781-1533626871.jpg?lm=1533626889",
"nationality":"France"
},
"from":{
"href":"/atletico-madrid/startseite/verein/13",
"name":"Atl\u00e9tico Madrid",
"country":"Spain",
"league":"LaLiga",
"leagueHref":"/primera-division/transfers/wettbewerb/ES1",
"image":"https://tmssl.akamaized.net//images/wappen/tiny/13.png?lm=1519120744"
},
"to":{
"href":"/fc-barcelona/startseite/verein/131",
"name":"FC Barcelona",
"country":"Spain",
"league":"LaLiga",
"leagueHref":"/primera-division/transfers/wettbewerb/ES1",
"image":"https://tmssl.akamaized.net//images/wappen/tiny/131.png?lm=1406739548"
},
"transfer":{
"href":"/jumplist/transfers/spieler/125781/transfer_id/2552096",
"value":"\u00a3108.00m",
"timestamp":1563058800
}
}
We have players, from and to clubs, and the transfer itself. We’ll import that data into the following graph model:
We can execute the following query, that uses APOC’s Load JSON procedure, to create players, transfers, leagues, and clubs:
CALL apoc.load.json("https://github.com/mneedham/football-transfers/raw/master/data/transfers.json")
YIELD value
WITH value, apoc.text.replace(value.transfer.value, "£", "") AS transferValue
WHERE transferValue <> "?" AND transferValue <> "-"
MERGE (p:Player {id: value.player.href})
SET p.name = value.player.name
MERGE (from:Club {id: value.from.href})
SET from.name = value.from.name
FOREACH(ignoreMe IN CASE WHEN value.from.leagueHref = "" THEN [] ELSE [1] END |
MERGE (fromLeague:League {id: value.from.leagueHref})
SET fromLeague.name = value.from.league
MERGE (from)-[:IN_LEAGUE]->(fromLeague)
)
MERGE (to:Club {id: value.to.href})
SET to.name = value.to.name
FOREACH(ignoreMe IN CASE WHEN value.to.leagueHref = "" THEN [] ELSE [1] END |
MERGE (toLeague:League {id: value.to.leagueHref})
SET toLeague.name = value.to.league
MERGE (to)-[:IN_LEAGUE]->(toLeague)
)
MERGE (t:Transfer {id: value.transfer.href})
SET t.value = CASE
WHEN transferValue contains "k"
THEN toFloat(apoc.text.replace(transferValue, "k", ""))
* 1000
WHEN transferValue contains "m"
THEN toFloat(apoc.text.replace(transferValue, "m", ""))
* 1000000
ELSE 0.0 END
SET t.date = date(datetime({epochseconds:value.transfer.timestamp}))
MERGE (t)-[:OF_PLAYER]->(p)
MERGE (t)-[:FROM_CLUB]->(from)
MERGE (t)-[:TO_CLUB]->(to);
After we’ve done that we’ll run the following query to create a relationship from a league to the country that it belongs to:
CALL apoc.load.json("https://github.com/mneedham/football-transfers/raw/master/data/leagues.json")
YIELD value
MATCH (l:League {id: value.league})
MERGE (c:Country {name: value.country})
MERGE (l)-[:IN_COUNTRY]->(c);
Now we’ve ready to query the graph.
The following query finds the top 10 transfers by value, and returns the player and clubs involved:
MATCH (t:Transfer)-[:OF_PLAYER]->(player),
(from)<-[:FROM_CLUB]-(t)-[:TO_CLUB]->(to)
RETURN player.name, from.name, to.name,
apoc.number.format(t.value) AS price
ORDER BY t.value DESC
LIMIT 10
Atlético Madrid and Real Madrid appear on several of these transfers.
We can aggregate the queries involving these teams to see how much money has been involved in their transfers.
The following query finds the clubs that have spent and received the most money during the transfer window:
MATCH (club:Club)
WITH club,
apoc.coll.sumLongs(
[(club)<-[:FROM_CLUB]-(t) | t.value]) AS moneyIn,
apoc.coll.sumLongs(
[(club)<-[:TO_CLUB]-(t) | t.value]) AS moneyOut
RETURN club.name,
apoc.number.format(moneyIn) AS in,
apoc.number.format(moneyOut) AS out
ORDER BY moneyIn + moneyOut DESC
LIMIT 10
As we guessed, the Madrid clubs are at the top of the list. Surprisingly there aren’t any English clubs in the top 10.
What about if we only look at the money spent on transfers?
The following query finds the clubs that have spent and received the most money during the transfer window:
MATCH (club:Club)-[*2]->(country:Country)
WITH club, country,
apoc.coll.sumLongs(
[(club)<-[:TO_CLUB]-(t) | t.value]) AS moneyOut
RETURN club.name, country.name, apoc.number.format(moneyOut) AS out
ORDER BY moneyOut DESC
LIMIT 10
The Spanish teams still dominate the top 3 positions, but interestingly Aston Villa have spent the most money of the English teams so far. Presumably that will change by the end of the summer.
Next let’s go a level up, and see which countries money is flowing between.
The following query finds the total fees spent moving players from teams in one country to another, excluding transfers between clubs that play in the same country.
We also return the most expensive transfer between those countries:
MATCH (t:Transfer)-[:OF_PLAYER]->(player),
(fromCountry)<-[:IN_COUNTRY]-(fromLeague),
(fromLeague)<-[:IN_LEAGUE]-(from)<-[:FROM_CLUB]-(t),
(t)-[:TO_CLUB]->(to)-[:IN_LEAGUE]->(toLeague),
(toLeague)-[:IN_COUNTRY]->(toCountry)
WITH *
ORDER BY fromLeague, toLeague, t.value DESC
WITH fromLeague, toLeague, sum(t.value) AS totalFees,
fromCountry, toCountry,
collect({player: player.name, fee: t.value}) AS transfers
WHERE fromCountry <> toCountry
RETURN fromCountry.name, toCountry.name,
apoc.number.format(totalFees) AS total,
transfers[0].player AS player,
apoc.number.format(transfers[0].fee) AS fee,
size(transfers) AS numberOfTransfers
ORDER By totalFees DESC
LIMIT 10
The most money has transferred from Portugal to Spain, although this is a bit skewed by the transfer of João Felix, which accounts for almost 60% of the money flow.
We hope you enjoy the dataset and if you have any questions or suggestions on what we should do next let us know in the comments or send us an email to devrel@neo4j.com.
Football Transfers Graph was originally published in Neo4j Developer Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.
All the sessions of the conference are now available online