
Reducing a multigraph to a single graph - relationship query with "count" too slow/crashes

I am trying to reduce a multigraph to a single graph using the following query for the cypher projection:

CALL gds.graph.create.cypher(
    'myGraph',
    'MATCH (n:WebContent {componentId: 0} ) RETURN id(n) AS id, labels(n) AS labels',
    'MATCH (n1:WebContent {componentId: 0} )-[:Includes|Has_signature]->(m)<-[:Includes|Has_signature]-(n2:WebContent {componentId: 0} )
     WHERE id(n1)<id(n2)
     RETURN id(n1) AS source, id(n2) AS target, count(m) AS weight'
)

This follows the example provided in the cypher projection docs shown below:

CALL gds.graph.create.cypher(
    'coauthorship-graph',
    'MATCH (n:Author) RETURN id(n) AS id, labels(n) AS labels',
    'MATCH (p1:Author)-[:WROTE]->(a:Article)<-[:WROTE]-(p2:Author)
     RETURN id(p1) AS source, id(p2) AS target, count(a) AS weight'
)

Some notes about my data for context: there are ~38,000 WebContent nodes in this particular component (i.e. componentId 0) and ~77,000 "m" nodes (with various labels) that link these WebContent nodes together in some way. The most heavily connected "m" node has about 1,000 WebContent nodes linked to it.

The problem is that this query never completes and eventually crashes the db server. I tried giving all the "m" nodes a shared label and using (m:SharedLabel) in the query above to see if that would help, but it still didn't work. Also, I want to note that componentId is indexed for WebContent nodes, so the node lookups shouldn't be contributing to the issue.
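(For reference, the index in question is just a plain property index along these lines - Neo4j 4.x syntax, written here from memory:)

CREATE INDEX FOR (n:WebContent) ON (n.componentId)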

Here's the strange thing that I don't yet understand. I decided to examine things outside of the cypher projection and instead query the database directly, using the same query as in the relationship projection above. This is the first query I tried:

MATCH (n1:WebContent {componentId: 0} )-[:Includes|Has_signature]->(m)<-[:Includes|Has_signature]-(n2:WebContent {componentId: 0} )
WHERE id(n1)<id(n2)
RETURN count(duration.inDays(n1.latestDate, n2.earliestDate).days)

This query ran to completion in a little under 3 minutes, performing about 85 million pairwise calculations. I tried it to see whether pairwise duration calculations between the date properties of n1 and n2 are feasible at this scale. I realized that this query actually calculates the duration for every n1->m<-n2 path, not just for each unique n1-n2 pair (i.e. if a certain n1-n2 pair shares 5 nodes in between them, the time difference between their date properties is calculated 5 times). So the fact that it still finished within a reasonable time gave me hope.
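(To check that path-vs-pair distinction directly, something like the following should report how many raw paths collapse into unique pairs - though the DISTINCT count has to hold every pair it has seen in memory, so it may hit the same wall on a large component:)

MATCH (n1:WebContent {componentId: 0} )-[:Includes|Has_signature]->(m)<-[:Includes|Has_signature]-(n2:WebContent {componentId: 0} )
WHERE id(n1)<id(n2)
RETURN count(*) AS paths, count(DISTINCT [id(n1), id(n2)]) AS uniquePairs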

The second query I tried was the following:

MATCH (n1:WebContent {componentId: 0} )-[:Includes|Has_signature]->(m)<-[:Includes|Has_signature]-(n2:WebContent {componentId: 0} )
WHERE id(n1)<id(n2)
WITH n1, n2, count(m) AS related_node_count
RETURN count(duration.inDays(n1.latestDate, n2.earliestDate).days)

By adding the WITH n1, n2, count(m) AS related_node_count clause, duplicate n1-n2 pairs are collapsed, so the calculation runs only once per unique n1-n2 pair no matter how many "m" nodes the pair has in common (this is more analogous to the cypher projection query I'm trying to do above, which returns count(m) as the relationship weight between each unique n1-n2 node pair). This query, however, never completed and ended up crashing the server, similar to what happened with the cypher projection I tried.
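(If the weight itself weren't needed, the deduplication could in principle be written with DISTINCT instead of an aggregation - I haven't verified that the planner treats this any more kindly, since it still has to remember every pair it has seen:)

MATCH (n1:WebContent {componentId: 0} )-[:Includes|Has_signature]->(m)<-[:Includes|Has_signature]-(n2:WebContent {componentId: 0} )
WHERE id(n1)<id(n2)
WITH DISTINCT n1, n2
RETURN count(duration.inDays(n1.latestDate, n2.earliestDate).days)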

This seems counter-intuitive to me - shouldn't the second query in theory be faster than the first, since it isn't performing duplicate pairwise calculations? Or is there something inherently slow about counting or collecting the "m" nodes first? I feel like I'm missing something, so any insight into what is going on here would be greatly appreciated! The number of nodes I'm running this on seems relatively small in the scheme of things, so I'm not sure why this isn't working. (Side note: I tried all of these queries on smaller components and they work just fine - this particular component is the largest one I have.)

5 REPLIES

clem
Graph Steward

Do you have the logs? That will help diagnose the crash.

Here is what it's showing in the logs - it's due to an OutOfMemoryError:

Feb 17 01:24:38 ip-10-0-0-96 neo4j[179982]: Terminating due to java.lang.OutOfMemoryError: Java heap space
Feb 17 01:24:38 ip-10-0-0-96 systemd[1]: neo4j.service: Main process exited, code=exited, status=3/NOTIMPLEMENTED
Feb 17 01:24:38 ip-10-0-0-96 systemd[1]: neo4j.service: Failed with result 'exit-code'.
Feb 17 01:24:38 ip-10-0-0-96 systemd[1]: neo4j.service: Scheduled restart job, restart counter is at 2.

Try increasing the memory (obviously).
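(For the heap specifically, that means raising something like the following in neo4j.conf - the sizes here are placeholders, not a recommendation:)

dbms.memory.heap.initial_size=8g
dbms.memory.heap.max_size=8g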

Try adding PROFILE or EXPLAIN to the query to see what's going on.

Maybe you can split the query into multiple parts and then do a UNION on the results.
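(One way to do that split - just a sketch, keyed on id(n1) so that every unique pair lands in exactly one partition and the per-run count(m) stays correct - is to run something like this once per remainder and combine the results:)

MATCH (n1:WebContent {componentId: 0})
WHERE id(n1) % 4 = 0   // repeat with remainders 1, 2, 3
MATCH (n1)-[:Includes|Has_signature]->(m)<-[:Includes|Has_signature]-(n2:WebContent {componentId: 0})
WHERE id(n1) < id(n2)
RETURN id(n1) AS source, id(n2) AS target, count(m) AS weight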

Thanks for the reply; after looking at the results of PROFILE on a smaller component, I definitely think it's a cardinality issue - the query is performing unnecessary duplicate operations. I'm trying to figure out how to rewrite it to avoid this. (Also, I'm not sure why the example provided in the docs - i.e. the 'coauthorship-graph' - is written this way if it results in high cardinality.)

Actually, I think it ultimately still comes down to a memory issue, caused by the count(m) aggregation over the n1-n2 pairs - presumably the aggregation has to hold every distinct n1-n2 group in the heap before it can emit a single row. The execution plan when I run this on a smaller component (of about 1,500 WebContent nodes) shows that it takes 22MB of memory just to perform this count operation. See the screenshot below.
[Screenshot: PROFILE execution plan showing the count aggregation step using 22MB of memory]

So on my larger component of ~38,000 WebContent nodes, this memory requirement is obviously much higher, and that's what causes the crashing. This is concerning because we will eventually need to work with even larger components. I'm currently looking into a batching option described here.
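The rough idea - a sketch assuming APOC is installed, and :RELATED is just a name I picked for the precomputed relationship - would be to materialize the reduced graph in batches and then use a native projection instead of a cypher projection:

CALL apoc.periodic.iterate(
  'MATCH (n1:WebContent {componentId: 0}) RETURN n1',
  'MATCH (n1)-[:Includes|Has_signature]->(m)<-[:Includes|Has_signature]-(n2:WebContent {componentId: 0})
   WHERE id(n1) < id(n2)
   WITH n1, n2, count(m) AS weight
   MERGE (n1)-[r:RELATED]->(n2)
   SET r.weight = weight',
  {batchSize: 100, parallel: false}
)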