Neo4j

s_chatzopoulos · ‎09-20-2019

In a neo4j database, I have created 4 node types: A, B, C and D each one with 1500 nodes.

For each node a in A and d in D, I want to find the tuples (a, d, count) where count is the number of paths of type A->B->C->D that connect nodes a and d.

The number of relations between types are: :AB (22382), :BC (22388) and :CD (22387)

I have tried the following Cypher query:

CALL apoc.export.csv.query("MATCH (a:A)-[:AB]->(b:B)-[:BC]->(c:C)-[:CD]->(d:D) RETURN a.`_id`, d.`_id`, count(*)", "results.csv", {})

That seems to return correct results but needs ~40 seconds to execute. Adding one more relation to the path seems to dramatically increase the execution time.

The query plan is the following:

I have increased the ulimit in the server and set

dbms.memory.heap.initial_size=5100m
dbms.memory.heap.max_size=5100m
dbms.memory.pagecache.size=6900m

in neo4j configuration as neo4j-admin memrec suggests.

I use Neo4j 3.5.8 community edition. Without exporting to a file, the query runs in 32 seconds. The query returns 1,927,493 tuples in the form (source_node, target_node, count_of_paths). Each tuple counts only a few paths (for most of them less than 10). Note that when adding two more relations to the same query, the execution time is increased to approximately two hours.

The PROFILE can be found here:

My goal is to compare the performance of Neo4j to a linear algebra library like Eigen as the same result can be obtained with matrix multiplication of the adjacency matrices of AB, BC and CD. Note that this matrix multiplication with Eigen takes less that a second to execute.

Is there a way to optimize such a query?

andrew_bowman · ‎09-20-2019

The planner seems to be doing a good amount of label scans node hash joins, and that doesn't seem to be the best approach here.

Since you already have the relationships between the nodes that force the types to be what you want (for example :CD will always connect a :C node to a 😄 node) you can leave the labels of the nodes out, and that should remove the possibility of the planner using separate label scans and hash joins.

Give this a try:

CALL apoc.export.csv.query("MATCH (a:A)-[:AB]->()-[:BC]->()-[:CD]->(d) WITH a, d, count(*) as totalPaths RETURN a.`_id`, d.`_id`, totalPaths", "results.csv", {})

s_chatzopoulos · ‎09-21-2019

Indeed, that seems to help a little. In fact the query time dropped from ~37sec to ~30sec.
The query plan for the new query is the following:

I was expecting a greater speedup though.
Is there probably a better way to re-write the query ?
Also it is not clear to me if the memory allocated to the cache is sufficient.
The column Cache H/M has zero values everywhere. What does this mean ? Is this normal ?

andrew_bowman · ‎09-22-2019

You've provided the EXPLAIN plan but not the PROFILE plan. Please PROFILE the query so we can see the rows and db hits.

EXPLAIN does not run the query, which is why you're not seeing cache hit or miss info.

s_chatzopoulos · ‎09-23-2019

Yes, sorry I had in mind the PROFILE when writing the previous comment, but I haven't posted it.

I was talking about the Cache H/M column being zero for all rows. Is that normal ?
Is there a chance that the cache is disabled ?
I have set dbms.memory.pagecache.size=6900m in the neo4j.conf

Neo4j

How to efficiently count all paths between two nodes in Cypher?