cancel
Showing results for 
Search instead for 
Did you mean: 

Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.

A slow running cypher query

shai
Node Link

Hi,

I have this query. This query is going through a little more of a 1000000 nodes. It takes over 7 minutes to return and I'm trying to understand why.

First I'll start with the following query:

match (ap:AgamPolicy)
where not (ap)-[:RELATED_TO]->(pa:PolicyAccount)
return count(ap)

this query takes about 11 seconds returning a count of 680733 nodes.
But I need a little more

match (ap:AgamPolicy)
where not (ap)-[:RELATED_TO]->(pa:PolicyAccount)-[:LAST{fetch:true}]->(pr:PolicyReport)
return count(ap)

this query takes about 7 minutes to return a count of 733288 nodes.
There are a little more then a 1000000 AgamPolicy nodes.
There are about 625000 PolicyAccount nodes.
There are a few relationships going out of each PolictAccount but only one is LAST, Always going to a PolicyReport node.

Running explain on the query looks preety good. The row count in the "explain" is almost the correct answer and it takes a second.

2X_3_3bdd213aea20799f81db096dd94704e148b1315c.png

Here is the profile:

The complete DB is about 67Gb. The page cache is 10GB, Heap is 4Gb, which is not enough, I know, but I still think this query should not take this long. Could memory be the only issue here?

  • neo4j version, 3.5.1
  • Bolt through the browser
8 REPLIES 8

The short answer is this: You have two problems. First, negation in a where clause is inefficient because it has to retrieve all potential matches prior to negating them. Second, path match in a where clause is inefficient because it must recheck the path for every row of the preceding match clause.

The solution is think in terms of deforming and reforming your graph. Rather than treating it gently, as we must with relational databases, consider what graph patterns would make your query efficient. Then create that pattern, and query against it instead.

You graph must be structured for efficient queries to be useful, that's why good data design models are much more important to Graph DBs than anywhere else.

The hard and fast rule is this:
Relationships between nodes should be many in, out few Or more simply, arrows from children to parents, Match in the direction of your arrows.

That said, try a quick fix for now:
(watch the semicolons ;, if you're using the Browser, you'll need to run each command separately.)

MATCH (ap:AgamPolicy)
  SET ap.hasReport = 0; # give them all a default 0

MATCH (ap:AgamPolicy)-[:RELATED_TO]->(:PolicyAccount)-[:LAST{fetch:true}]->(:PolicyReport)
  SET ap.hasReport = 1; # those with a report, set to 1

MATCH (ap:AgamPolicy)
  WHERE ap.hasReport = 0
  RETURN count(ap); # match those still without a report

MATCH (ap:AgamPolicy)
  REMOVE ap.hasReport; # revert changes, removing that property

Thanks,

What you wrote opens up a whole discussion for me.

This query is from an on going report. Data of AgamPolicy is received on a weekly basis where as data on PolicyAccount we receive hourly. It is possible for me to set the "hasReport" property through code whenever the AgamPolicy arrives, but this is kind of an "index" solution which puts in question the whole use of a graph for doing such reports. It seems that I need another database solution for running reports, which is not an uncommon thing to do, but is an expensive solution.

Another issue with the query is the '[:LAST{fetch:false}]'. This is also something that neo4j is not good in doing. Thus, to make this part of the query more efficient I will need to create a relationship for FETCH and a relationship for NOT_FETCH from PolicyAccount -> PolicyReport. It seems that the graph db is good on running through clean relationships between Labels, when properties are involved it slows down heavily and needs a lot of indexing. Problem is reality is not built this way.

Perhaps I could be of more assistance if you could tell me a bit more about your graph, and what exactly you're trying to get out of it.

There are three things regarding graphDBs that I am certain of:

  1. Anything that can be done in a relational database, can be done in a graph, almost always more efficiently.
  2. Designing a good graph is orders of magnitude simpler than querying it efficiently.
  3. Database design is orders of magnitude more important to a graph database than a relational database.

Fortunately, once we can clearly describe the situation, the problem, and the goal, we can find a solution.

Final Note:
LEFT: what you have
RIGHT: what you should have

shai
Node Link

Thanks for showing an interest. Here is the relevant part of the graph


there are over a million AP about 650000 PA. Both have a different life cycle. The relation RELATED_TO is only created when we receive an AP. There are about 120 CARRIER nodes thus these kind of become super nodes with many relations comming into them.
I took your previous advice so my query is now:

MATCH (p:CUST{status:'ACTIVE'})
MATCH (emp:Employer)<-[:EMPLOYEE_OF]-(e:Employee{status:'ACTIVE'})-[:ACCOUNT_OF]->(ei:ei)<-[:EMPLOYEE]-(p)
MATCH (p)-[:AGAM]->(ai:ai)<-[:ACCOUNT_OF]-(ap:ap{status:'ACTIVE',hasReport:false}) 
MATCH (ap)-[:CARRIER]->(prod:Carrier) 
return p, ap, prod

Here is the completion of the graph to match the query.

As you can see I don't query the RELATED_TO part at all. I make this decision when I create the AP

Sorry for the delay in response, it has been a busy month.

Glad I could help a bit! How is your performance looking with that latest query? Really looks like your close to good solution. Now I'm curious how the optimizer handled the second MATCH statement.

shai
Node Link

the query improved greatly and return a response, although it does take over a minute to return this response.

What bothers me is that solving the issue by adding this property is like using in index in normal relational DB. I thought traversing through relationships in a graph db should be faster then such indexing.

Performance depends on your graph-structure, and query. It still needs to retrieve all the nodes through each relationship step. There is no escaping some basic patterns, like linked-lists, keys, and indices. GraphDBs let you build and query graphs and patterns that aren't possible in relational dbs, while also letting you avoid rdb patterns that force nulls, empty foreign keys, polymorphism, etc.

If you're trying to load a million nodes that are connected to a million nodes, there is no avoiding that your RAM will have to load them all, and your processor will have to check them all.

Typically where relational dbs lose to graphs is on multiple table joins, and especially when you have minimal information (or even no information) about how those tables are supposed to be joined, or what tables to join, or if the number of joins to perform is variable. All this is extremely easy to model with relationships and query over with Cypher, and relationship traversal is much more efficient than table joins, since efficiency is based on the number of those types of relationships on the nodes being traversed, and not the total number of those relationships in the graph (whereas table join efficiency is based on the number of total entries in the table, and gets worse the more tables that must be joined).

Aside from this, the efficiency of graph query performance is typically on how you can get away with touching as little of the graph as possible. That typically starts from initial index lookup (as opposed to a label scan and filter), modeling with more specific relationships as opposed to general ones (at least where there can be many of a general relationship type on a node...this is a case of quick selection being more efficient than expanding out many relationships and filtering on properties), and filtering and limiting early if possible.

But some operations are faster than others, and rolling up the presence of a certain (hopefully not easily changing) graph pattern into a property or label on a node (or even some additional relationship to the end node of the pattern) is a decent way to improve efficiency, especially when coupled with an index on that property (if it's modeled as a property). After all, a single index lookup (or even a label scan and property filter) will be more efficient than a more complex graph pattern plus property filtering.