Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.
01-03-2023 11:56 AM - edited 01-04-2023 10:31 AM
I'm trying to understand why below query is taking too long to retrieve results. I have mocked up the values used but the below query is right and is returning 40 records (a node has 8 diff values and z node has 5 diff values so total 40 combinations). It's taking 2.5 min to return those 40 records. Please let me know what the issue is here. I'm suspecting this to be Neo4j version and infrastructure we're using right now in production.
After the below query we have algo.kShortestPaths.stream so the whole thing together is taking more than 5 min. What do you suggest? Is there no other way where we can handle such combinations (a and z node combinations > 40) within 5 min?
Infrastructure details: Neo4j 3.5 community edition
2 separate datacenters, sync job - 64GB mem 16GB CPU 4 cores
Cypher Query:
MATCH (s:SiteNode {siteName: 'siteName1'})-[rl:CONNECTED_TO]-(a:EquipmentNode)
WHERE a.locationClli = s.siteName AND toUpper(a.networkType) = 'networkType1' AND NOT (toUpper(a.equipmentTid) CONTAINS 'TEST')
WITH a.equipmentTid AS tid_A
MATCH pp = (a:EquipmentNode)-[rel:CONNECTED_TO]-(a1:EquipmentNode)
WHERE a.equipmentTid = tid_A AND ALL( t IN relationships(pp)
WHERE t.type IN ['Type1'] AND (t.totalChannels > 0 AND t.totalChannelsUsed < t.totalChannels) AND t.networkId IN ['networkId1'] AND t.status IN ['status1', 'status2'] )
WITH a
MATCH (d:SiteNode {siteName: 'siteName2'})-[rl:CONNECTED_TO]-(z:EquipmentNode)
WHERE z.locationClli = d.siteName AND toUpper(z.networkType) = 'networkType2' AND NOT (toUpper(z.equipmentTid) CONTAINS 'TEST')
WITH z.equipmentTid AS tid_Z, a
MATCH pp = (z:EquipmentNode)-[rel:CONNECTED_TO]-(z1:EquipmentNode)
WHERE z.equipmentTid=tid_Z AND ALL(t IN relationships(pp)
WHERE t.type IN ['Type2'] AND (t.totalChannels > 0 AND t.totalChannelsUsed < t.totalChannels) AND t.networkId IN ['networkId2'] AND t.status IN ['status1', 'status2'])
WITH DISTINCT z, a CALL algo.kShortestPaths.stream(a, z, 5, 'mileage', {
nodeQuery:'MATCH (n:EquipmentNode) RETURN id(n) AS id ',
relationshipQuery:'MATCH p = (n:EquipmentNode)-[r:CONNECTED_TO]-(m:EquipmentNode) WHERE ALL ( Q IN relationships(p) WHERE Q.type IN ["type1"] AND Q.totalChannels > 0 AND (Q.totalChannelsUsed < Q.totalChannels) AND Q.networkId IN ["networkId1"] AND Q.status IN ["status1", "status2"] ) RETURN id(n) AS source, id(m) AS target, r.mileage AS weight, r.trailName AS linkName, r.networkId AS linkNetworkId ',
maxDepth:80,
direction:'BOTH',
defaultValue:0.0,
write: false ,
graph:'cypher',
path: true
}) YIELD path, index, nodeIds, costs
WITH nodeIds, a, z, index, path, costs, [node IN algo.getNodesById(nodeIds) | node.locationClli] AS clliInRoutes, [node IN algo.getNodesById(nodeIds) | node.nodeId] AS tidsInRoutes
UNWIND range(0, size(nodeIds)-2) AS idx
WITH clliInRoutes, tidsInRoutes, algo.getNodeById(nodeIds[idx]) AS current, algo.getNodeById(nodeIds[idx+1]) AS next, path, a, z, index, nodeIds, costs
WITH clliInRoutes, tidsInRoutes, path, current, next, a, z, index, nodeIds, costs
MATCH pp=(current)-[t:CONNECTED_TO]-(next)
WHERE ALL( t IN relationships(pp)
WHERE t.type IN ['type1'] AND (t.totalChannels > 0 AND t.totalChannelsUsed < t.totalChannels) AND t.networkId IN ['networkId1'] AND t.status IN ['status1', 'status2'] )
WITH clliInRoutes, tidsInRoutes, path AS p, a, z, index, nodeIds, costs, relationships(pp) AS connections
UNWIND connections AS routes
WITH clliInRoutes, tidsInRoutes, costs, index, nodeIds, routes, connections, p, a, z, size(nodeIds)-1 AS noOfLinks, reduce(acc = 0.0, cost IN costs | acc + cost) AS totalMileage, EXTRACT(n IN connections | n.trailName) AS shortestPath, a.nodeId AS startNode, z.nodeId AS endNode, a.locationClli AS startLocation, z.locationClli AS endLocation, REDUCE(COST=0.0, r IN connections | COST+toFloat(r.cost)) AS totalCost, 10^2 AS factor
WITH clliInRoutes, tidsInRoutes, costs, index, p, nodeIds, routes, factor, totalMileage, totalCost, shortestPath, noOfLinks, startNode, endNode, startLocation, endLocation
WHERE NOT (routes.trailName IN ['trail1','trail2'])
RETURN
This query was built to handle small combinations upto 4 total a and z node combinations but today we might have combinations greater than 10 or 40 or 100 so this is timing out. I'm not sure if there's a better way to write the query to improve performance assuming the community edition is good enough for our case.
Solved! Go to Solution.
01-05-2023 06:45 AM
Hi @rakesh ,
Couple of things.
1. Neo4J 3.5 is on EOL, so there's no support on it anymore.
2. Use of subqueries that may help you (in general) avoid undesired extra operations is not there.
3. Most important, and here we get into the real problem. Your query relies on property based filtering on relationship while traversing. This has a really high computational cost on plain cypher (some improvement is already there in Neo4J 5), so you may like to create your custom procedure that traverse the graph based on your custom logic. This can be used starting from 4 as well. https://neo4j.com/docs/java-reference/current/traversal-framework/
If you have simple ways to reload your data, I would strongly suggest you to move directly to Neo4j 5.
01-03-2023 04:08 PM
Hi @rakesh ,
This looks like a perfect scenario for a Traversal API . It supported from 4.0 tho, so you may need to move to a most recent version of Neo4J. (Also considering that 3.5 it's not supported EOL anymore since May this year)
01-04-2023 10:33 AM
thanks for the reply. I have added the rest of the algo to the post. So before even getting to the algo to get the a and z values it was taking upto 2 min so I'm wondering what was going on. Does the query get all the combinations and execute the algo or does it execute the algo for each combination? Overall please let me know if upgrading to a higher version will make things to fetch faster?
01-03-2023 10:47 PM
I am surprised you are getting a result, since your first match's 'where' clause is a concatenation of expressions with AND and one of the component expressions should always be false. The expression is the one below where you are testing whether an upper case string is equal to a literal string that is not all upper case.
toUpper(a.networkType) = 'networkType1'
I may be misinterpreting your query, but this is my analysis. Your first match produces EquipmentNodes represented by 'a'. Each of their 'equipmentTid' values are passed to the next phase of the query. In this phase you seem to be matching to get the same EquipmentNode based on the 'equipmentTid' passed through the 'with' clause (this 'match' seems redundant). You are matching each of these EquipmentNodes to another EquipmentNode with specific relationships properties. You are passing the original 'a' value through to the next stage and are ignoring the other parts of the path that is match. This to me means you don't care about the other values of the path, but just that a path exists. If no path exists, then the query stops; if a path exists the value of 'a' is passed on. As such, you can use an 'exists' predicate here instead of a match that will create multiple unnecessary rows that you eventually reduce through 'Distinct'.
Next, the value of 'a' that is passed to the third match is not used in the match. As a result, the same query is being executed for each value of 'a' passed through the 'with' clause. This takes more time and produces more redundant rows of data. This 'match' block is identical to the first, but with a few different literal values as parameters.
You stated that 'a' has eight values and 'z' has five values, resulting in forty rows. I suspect the rows of data have the equipentTid for each 'a' repeated for each value of 'z'. Since the 'a' nodes and 'z' nodes are not related, there really are only thirteen rows of data. The forty rows resulted from the Cartesian product that was created following the 'with a' line and subsequent uncorrelated 'match'.
If my understand is correct, it would be more efficient to execute the two blocks of 'match' statements using a 'union' instead. This would avoid the Cartesian product creating forty rows of data when there are only thirteen equipmentTid's found. I believe the following query should produce the same thirteen equipmentTid values.
Note, I removed the 'all' predicate because the path pattern has only one relationship, so relationships(pp) will consist of one element (rel); therefore, there is no need to use the 'all' predicate. I added 'siteName' to the output so you can identify which query block the equipmentTid came from. Remove it once you see the results are what you expected and you will get the distinct list of equipmentTid's across both siteNames.
MATCH (s:SiteNode {siteName: 'siteName1'})-[rl:CONNECTED_TO]-(a:EquipmentNode)
WHERE a.locationClli = s.siteName
AND toUpper(a.networkType) = 'networkType1'
AND NOT (toUpper(a.equipmentTid) CONTAINS 'TEST')
AND EXISTS {
MATCH (a)-[rel:CONNECTED_TO]-(:EquipmentNode)
WHERE rel.type IN ['Type1']
AND (rel.totalChannels > 0 AND rel.totalChannelsUsed < rel.totalChannels)
AND rel.networkId IN ['networkId1']
AND rel.status IN ['status1', 'status2']
}
RETURN DISTINCT a.equipmentTid as equipmentTid, 'siteName1' as siteName
UNION
MATCH (d:SiteNode {siteName: 'siteName2'})-[rl:CONNECTED_TO]-(z:EquipmentNode)
WHERE z.locationClli = d.siteName
AND toUpper(z.networkType) = 'networkType2'
AND NOT (toUpper(z.equipmentTid) CONTAINS 'TEST')
AND EXISTS {
MATCH (z)-[rel:CONNECTED_TO]-(:EquipmentNode)
WHERE rel.type IN ['Type2']
AND (rel.totalChannels > 0 AND rel.totalChannelsUsed < rel.totalChannels)
AND rel.networkId IN ['networkId2']
AND rel.status IN ['status1', 'status2']
}
RETURN DISTINCT z.equipmentTid as equipmentTid, 'siteName2' as siteName
One last note, your first 'match' in each block is looking for a relationship between a SiteNode and EquipmentNode have equal 'siteName' and 'locationClli' values respectively. This seems odd, as would you have these two nodes related if they had different values? If yes, then could you create a new relationships between these two types of nodes so you can search for them faster?
I apologize if I completely missed the mark. Let me know where I went wrong and we can try to optimize the query.
01-04-2023 11:08 AM
thanks for replying. We are actually using all caps in the below query snippet this is just me modifying the values as it is insignificant for this post. I have updated my post to include the whole cypher query that we are using to retrieve the shortest paths between A and Z sites (which can have multiple equipmentTids). The reason I didn't include the whole cypher query initially is I thought retrieving the A and Z values itself is taking too long before proceeding to the algo.
toUpper(a.networkType) = 'networkType1'
What we are trying to do here is we have nodes A and Z. We're trying to find the shortest routes between A and Z with A as starting node and Z as ending node. For that we're trying to define the nodes A and Z with some attributes we already know about them like their name and what network does it belong to etc. Each site A and Z can have several equipmentTids which are the actual starting points to finding different paths to Z site or equipmentTids at Z site.
You have understood the problem well I think. I think we need the cartesian product here as we are trying to find all possible shortest routes between A and Z equipmentTids so cartesian product of combinations sounds about right to me. So in my example A has 8 equipmentTids and Z has 5 equipmentTids so in theory there could be 40 different routes starting from A to Z site. I have tried executing your suggestion but it didn't work with below exception which I'm trying to understand and fix. But if your suggestion doesn't return the cartesian product then it wouldn't work for us right?
Neo.ClientError.Statement.SyntaxError: Invalid input '(': expected whitespace, comment, ':', ',' or '}' (line 6, column 11 (offset: 223)) " MATCH (a)-[rel:CONNECTED_TO]-(:EquipmentNode)"
Coming to your last paragraph each SiteNode can have multiple EquipmentNodes so in my case if we are trying to search routes between 2 sitenodes then at this point I only have sitenames of the sitenodes but in reality it is the equipmentNodes in each of these sitenames which are actually connected. If we are trying to find routes between specific equpementNodes then I wouldn't be using sitenames in that query. Each equipmentNode has an attribute locationClli/sitename referring to its siteNode's sitename where the equipmentnode is actually located.
01-04-2023 03:31 PM
I am not sure what the error is. I don't get a syntax error. You can see examples of the syntax in the manual. What version do you have?
https://neo4j.com/developer/cypher/subqueries/
I understand the need for the Cartesian product. Please try the query below to see if it gives the correct results and is faster. We can look at the remaining part of the query once we have retrieving the equipment id's optimized.
explain
CALL {
MATCH (s:SiteNode {siteName: 'siteName1'})-[rl:CONNECTED_TO]-(a:EquipmentNode)
WHERE a.locationClli = s.siteName
AND toUpper(a.networkType) = 'networkType1'
AND NOT toUpper(a.equipmentTid) CONTAINS 'TEST'
AND EXISTS {
MATCH (a)-[rel:CONNECTED_TO]-(:EquipmentNode)
WHERE rel.type IN ['Type1']
AND (rel.totalChannels > 0 AND rel.totalChannelsUsed < rel.totalChannels)
AND rel.networkId IN ['networkId1']
AND rel.status IN ['status1', 'status2']
}
RETURN collect(DISTINCT a.equipmentTid) as a_list
}
CALL {
MATCH (d:SiteNode {siteName: 'siteName2'})-[rl:CONNECTED_TO]-(z:EquipmentNode)
WHERE z.locationClli = d.siteName
AND toUpper(z.networkType) = 'networkType2'
AND NOT (toUpper(z.equipmentTid) CONTAINS 'TEST')
AND EXISTS {
MATCH (z)-[rel:CONNECTED_TO]-(:EquipmentNode)
WHERE rel.type IN ['Type2']
AND (rel.totalChannels > 0 AND rel.totalChannelsUsed < rel.totalChannels)
AND rel.networkId IN ['networkId2']
AND rel.status IN ['status1', 'status2']
}
RETURN collect(DISTINCT z.equipmentTid) as z_list
}
UNWIND a_list as a
UNWIND z_list as z
RETURN a, z
01-04-2023 04:42 PM
Don't forget 3.5 limitations 🙂
01-04-2023 07:09 PM
@bennu_neo is correct. I noticed you did mention you are using 3.5. The existential subquery in ‘where’ clause was introduced in 4.0. This would explain your syntax error.
https://neo4j.com/docs/cypher-manual/4.0/clauses/where/#existential-subquery-with-where
01-05-2023 06:33 AM
Can you mention few limitations so that I can inform my management so that we can push for upgrade?
01-05-2023 06:45 AM
Hi @rakesh ,
Couple of things.
1. Neo4J 3.5 is on EOL, so there's no support on it anymore.
2. Use of subqueries that may help you (in general) avoid undesired extra operations is not there.
3. Most important, and here we get into the real problem. Your query relies on property based filtering on relationship while traversing. This has a really high computational cost on plain cypher (some improvement is already there in Neo4J 5), so you may like to create your custom procedure that traverse the graph based on your custom logic. This can be used starting from 4 as well. https://neo4j.com/docs/java-reference/current/traversal-framework/
If you have simple ways to reload your data, I would strongly suggest you to move directly to Neo4j 5.
01-05-2023 07:40 AM
would creating more indexes help here?
01-05-2023 07:59 AM
I created an index for equipmentTid as my lead asked me to try that and see. I tried this so the query returning the combinations completed within seconds but the algo itself has timed out. So I guess there's no other way.
All the sessions of the conference are now available online