Neo4j

UncleKenroy · ‎07-26-2021

Hello I have a project that contains 109k nodes and 2.8 million connections. I have a place nodes, Visited edges and census block nodes. I am trying to find the top 10 places that have the most shared census block group nodes within a certain month.

Match (n:Place)<-[r:Visited{month: "April"}]-(c:CensusBG)-[b:Visited{month: "April"}]->(m:Place) 

return n.name, m.name, count(distinct c) as cnt

order by cnt desc limit 10

I tried running it over night but it still did not work and I had to restart my machine to get the database back up again.

andrew_bowman · ‎07-26-2021

This is a very large graph-wide query.

First, I think we'd recommend running this on Neo4j 4.3.x, and creating a relationship index on :Visted(month):

Also since your pattern is symmetric, it would be better to add a filtering so you don't see symmetric results (where it's the same two places and the same count, with just the places swapped for n and m).

You can add WHERE id(n) < id(m), that should do the trick.

That said, if the issues you're facing are heap-related, it would be best to use subqueries to divide up the work, which should make aggregations less of a threat to your heap. Something like this maybe:

MATCH (n:Place)
CALL {
 WITH n
 MATCH (n)<-[r:Visited{month: "April"}]-(c:CensusBG)-[b:Visited{month: "April"}]->(m:Place) 
 WHERE id(n) < id(m)
 RETURN n.name as name1, m.name as name2, count(c) as cnt
}
RETURN name1, name2, cnt
ORDER BY cnt DESC
LIMIT 10

This ensures the aggregations are performed with respect to only the matches from a single starting :Place node at a time, instead of executing over all possible paths in the graph, which was likely blowing your heap.

View solution in original post

andrew_bowman · ‎07-26-2021