Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.
03-30-2019 05:00 PM
This is the 2nd in a series of posts showing how to analyse Brexit with graphs. In this post we cluster MPs based on voting records.
On Friday I wrote a blog post showing how to do graph analysis of Brexit data using Neo4j, and towards the end of the first post I showed how to compute the similarities of MPs to Boris Johnson, based on the position that they took on different motions.
While doing this analysis I noticed that there were a lot of people who voted in an identical way on every motion, so I wanted to explore that further.
Robin Bramley also pointed me towards the CommonsVotes website, which has the voting records available in CSV format.
body[data-twttr-rendered="true"] {background-color: transparent;}.twitter-tweet {margin: auto !important;}@markhneedham Might be interesting to track https://t.co/ute2TedmaF
This will be helpful for future Brexit analysis, so let’s quickly learn how to import data from there into Neo4j.
I’ve downloaded the data for all the indicative votes and put the files into the data/commonsvotes directory of the graphing-brexit GitHub repository.
We can import the data for all the indicative motions using the Cypher query below:
UNWIND [655,656,657,658,659,660,661,662] AS division
LOAD CSV FROM "https://github.com/mneedham/graphing-brexit/raw/master/data/commonsvotes/Division" + division + ".csv" AS row
// Create motion nodes
WITH division, collect(row) AS rows
MERGE (motion:Motion {division: trim(split(rows[0][0], ":")[1]) })
SET motion.name = rows[2][0]
// Skip the first 6 rows as they have metadata we don't need
WITH motion, rows
UNWIND rows[7..] AS row
// Create person, party, constituency, and corresponding rels
MERGE (person:Person {name: row[0]})
MERGE (party:Party {name: row[1]})
MERGE (constituency:Constituency {name: row[2]})
MERGE (person)-[:MEMBER_OF]->(party)
MERGE (person)-[:REPRESENTS]->(constituency)
WITH person, motion,
CASE WHEN row[3] = "Aye" THEN "FOR"
WHEN row[3] = "No" THEN "AGAINST"
ELSE "DID_NOT_VOTE" END AS vote
CALL apoc.merge.relationship(person, vote, {}, {}, motion)
YIELD rel
RETURN count(*)
Now that we’ve got the data loaded we’re going to explore those identical voters. I tried a few different approaches, and eventually settled on the following:
After we’ve done these two steps we want to have a node representing each community or cluster that we find, and a relationship from each person to their community.
Let’s get started!
We’re going to create a similarity graph of MPs, which is a fancy way of saying that we’re going to create relationships between pairs of MPs that voted in an identical way.
As in the first post, we’re going to use the Cosine Similarity algorithm from the Neo4j Graph Algorithms Library to do this. Before we create any relationships in our database, we’re going to run the non streaming version of the procedure with write: false and similarityCutoff: 1.0 (i.e. they behaved identically) to see how many pairs of identical voters we have without writing to the database.
The following Cypher query does this:
MATCH (p:Person), (c:Motion)
OPTIONAL MATCH (p)-[vote]->(c)
WITH p, c,
CASE WHEN type(vote) = "FOR" THEN 1
WHEN type(vote) = "DID_NOT_VOTE" THEN 0.5
ELSE 0 END AS score
WITH {item:id(p), weights: collect(score)} as userData
WITH collect(userData) as data
CALL algo.similarity.cosine(data, {
similarityCutoff: 1.0, write: false
})
YIELD nodes, similarityPairs, writeRelationshipType, writeProperty
RETURN nodes, similarityPairs, writeRelationshipType, writeProperty
There are more than 10,000 similarity pairs for our 649 MPs! This should be a fun graph to work with.
We’ll now run this procedure with write:true so that SIMILAR relationships are created between nodes that have a similarity of 1.0. Below is a sample of the graph that is created:
Now we’re going to run the connected components algorithm over this similarity graph. But what does this algorithm do?
From the documentation page:
The Connected Components, or Union Find, algorithm finds sets of connected nodes in an undirected graph where each node is reachable from any other node in the same set.
We can run the streaming version of this algorithm over our similarity graph by executing the following query:
// Run the algorithm for:
// nodes with 'Person' label, based on the 'SIMILAR' rel type
CALL algo.unionFind.stream('Person', 'SIMILAR', {direction: "BOTH"})
YIELD nodeId,setId
// Grouping by setId, create a 'Community' node for each community
WITH setId, collect(algo.getNodeById(nodeId)) AS nodes
MERGE (community:Community {
id: setId, type: "Connected Components"
})
// Create 'IN_COMMUNITY' rel from each node to its community
WITH community, nodes
UNWIND nodes AS node
MERGE (node)-[:IN_COMMUNITY]->(community);
This query returns a stream of setId, nodeId pairs. We then create a node with the label Community for each setId, and create a relationship from each node to its corresponding community. We can run the following query to visualise these communities:
MATCH (c:Community)
WHERE size((c)<-[:IN_COMMUNITY]-()) > 10
MATCH path = (c)<-[comm:IN_COMMUNITY]-(person)
RETURN path
LIMIT 200
So…how did the people in each of these communities vote? We can create a visualisation of how the communities voted by running the following Cypher query:
// Find communities with more than 10 people
MATCH (c:Community)
WITH c, size((c)<-[:IN_COMMUNITY]-()) AS size
WHERE size > 10
// Take one person to act as a representative of each community
MATCH (c)<-[rel:IN_COMMUNITY]-(person)
WITH c, collect(person)[0] AS person
// Check which motions that person voted for
WITH c, [(person)-[:FOR]->(motion:Motion) | motion] AS votes
// Create a virtual relationship from each community to the motion
UNWIND votes AS vote
CALL apoc.create.vRelationship(c,"FOR",{},vote) yield rel
RETURN *
In this query we pick one person to act as a representative for their community. We then check how they voted, and finally create a virtual relationship (using the APOC library) from each community to the motions voted for. This results in the following visualisation:
The yellow nodes represent communities, and the number indicates the number of MPs in that community. We can see that there’s a large community over to the right containing 106 people who voted for No Deal and to get Contingent preferential arrangements. We would assume that the people in those clusters are mostly in the Conservative Party.
Over the other side we have smaller clusters voting in favour of the other motions, including Jeremy Corbyn’s alternative plan, the revocation of article 50, and a Confirmatory public vote.
When I showed this visualisation to Michael, he suggested that we should identify the most influential people in each cluster, and I thought a simple way to do that would be to count the views on their Wikipedia page.
I wrote a script to extract these counts into a CSV file, and then imported those into Neo4j:
load csv with headers from "https://github.com/mneedham/graphing-brexit/raw/master/data/pageviews.csv" AS row
MATCH (p:Person {name: row.person})
SET p.pageviews = toInteger(row.pageviews);
Michael also suggested adding party labels to each MP so that we can colour them with their party colours on the visualisation. The following query does this:
match (p:Person)-[:MEMBER_OF]->(pa:Party)
CALL apoc.create.addLabels(p, [apoc.text.replace(pa.name, " ", "")]) YIELD node
RETURN count(*)
Now that we’ve done this we can update our query to add the top 3 most influential people per cluster into our visualisation:
// Find communities with more than 10 people
MATCH (c:Community)
WHERE size((c)<-[:IN_COMMUNITY]-()) > 10
// Find the top 3 most influential people per community
MATCH (c)<-[rel:IN_COMMUNITY]-(person)
WITH c, rel, size, person
ORDER BY person.pageviews DESC
WITH c, collect({person: person, rel:rel })[..3] AS topPeople
// Select the first of those people to represent the cluster
WITH c, topPeople, topPeople[0].person AS person
// Check how that person voted
WITH c, topPeople,
[(person)-[:FOR]->(motion:Motion) | motion] AS votes
// Create a virtual relationship from each community to the motion
UNWIND votes AS vote
CALL apoc.create.vRelationship(c,"FOR",{},vote)
yield rel
RETURN *
Now we can clearly see the Conservative clusters over to the right hand side. The biggest one contains Boris Johnson, Jacob Rees-Mogg, and Priti Patel, while the other one contains some lesser known members of the party.
Over to the left the SNP are in yellow and are in favour of having a public vote and revoke article 50. The Independent Party have also voted in favour of both of those motions. The reason those two communities are separate is because we took AGAINST and DID_NOT_VOTE relationships into account when building our similarity graph, and they must have voted differently there
Finally the Labour members are split out into several clusters, although all of them voted for Jeremy Corbyn’s motion. As with the SNP and Independent clusters, we would end up with a smaller number of clusters if we built a similarity graph based only on FOR relationships.
We’ve done most of the ground work, so let’s see what happens if we only take the motions that people voted for into consideration.
We’ll use a similar approach to build our similarity graph, but this time since we only care about one relationship type, we can use the Jaccard Similarity algorithm to compute similarity between nodes.
We want to create a SIMILAR_FOR relationship between people who voted for motions in an identical way. The following Cypher query will do this:
MATCH (p:Person)-[:FOR]->(motion:Motion)
WITH {item:id(p), categories: collect(id(motion))} as userData
WITH collect(userData) as data
CALL algo.similarity.jaccard(data, {
similarityCutoff: 1.0,
write:true,
writeRelationshipType: "SIMILAR_FOR"
})
YIELD nodes, similarityPairs, writeRelationshipType, writeProperty
RETURN nodes, similarityPairs, writeRelationshipType, writeProperty;
And now we’ll run the connected components algorithm over this graph and, as before, we’ll connect people to community nodes:
// Run the algorithm for:
// nodes with 'Person' label, based on the 'SIMILAR_FOR' rel type
CALL algo.unionFind.stream('Person', 'SIMILAR_FOR', {
direction: "BOTH"
})
YIELD nodeId,setId
// Grouping by setId, create a 'Community' node for each community
WITH setId, collect(algo.getNodeById(nodeId)) AS nodes
MERGE (community:Community {
id: setId, type: "Connected Components FOR"
})
WITH community, nodes
// Create 'IN_COMMUNITY' rel from each node to its community
UNWIND nodes AS node
MERGE (node)-[:IN_COMMUNITY]->(community);
Once we’ve done this we can re-run the query that finds clusters, motions, and the most famous people per cluster. We determine the most famous people based on the number of views their Wikipedia pages received.
MATCH (c:Community {type: "Connected Components FOR"})
WITH c, size((c)<-[:IN_COMMUNITY]-()) AS size
WHERE size > 10
MATCH (c)<-[rel:IN_COMMUNITY]-(person)
WITH c, rel, size, person
ORDER BY person.pageviews DESC
WITH c, collect({person: person, rel:rel })[..3] AS topPeople, size
WITH c, topPeople, topPeople[0].person AS person, size
WITH c, topPeople, size, [(person)-[:FOR]->(motion:Motion) | motion] AS votes
UNWIND votes AS vote
CALL apoc.create.vRelationship(c,"FOR",{},vote) yield rel
RETURN *;
We’ve now got three clusters over on the right. The conservative clusters remain, but we now also have a cluster containing Democratic Unionist Party members who only voted in favour of the Contingent Preferential Arrangement.
Over on the left we now have a cluster containing members of the Conservative, Independent, and Liberal Democrat parties. This cluster votes in favour of Revocation to avoid no deal and the Confirmatory public vote.
There are then a bunch of different Labour clusters voting for some subset of the other 6 motions.
After the first post, Chris Eyre asked the following:
body[data-twttr-rendered="true"] {background-color: transparent;}.twitter-tweet {margin: auto !important;}@markhneedham @askkerush Could you propose new political groupings based upon these?
It certainly seems like there could be some political groupings based on these clusters:
I’ve created a list of the other types of analysis that people suggested on twitter after I wrote the first post.
Jo Stichbury suggested that I explore whether MPs are voting in line with what their constituency wants:
body[data-twttr-rendered="true"] {background-color: transparent;}.twitter-tweet {margin: auto !important;}@markhneedham @GraknLabs Maybe the breakdown of the constituencies the MPs represent? Banded into leave/remain strong or weak, then look at their majorities and work out who is representing their community and who is acting in self interest?
As part of this post I’ve loaded the constituencies and the percentage of people that voted leave, and am now in a position to try and answer this question.
And if you have a general interest in graph analysis of data, you might enjoy the O’Reilly Graph Algorithms Book that Amy Hodler and I have been working on over the last 9 months. We’re in the final review stage and it should be available in the next few weeks.
You can register to get your free digital copy from the Neo4j website, at: neo4j.com/graph-algorithms-book.
Graphing Brexit: Clustering Edition was originally published in neo4j on Medium, where people are continuing the conversation by highlighting and responding to this story.
All the sessions of the conference are now available online