cancel
Showing results for 
Search instead for 
Did you mean: 

Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.

apoc.path.subgraphAll: different results depending of Breadth or Depth First Search parameter

Dear community, I want to ask for your help to understand something.
I am using the function apoc.path.subgraphAll to get the subgraph containing 2 hops from the source node in any direction and with any relationship type. This function has the parameter bfs to select the way to traverse the graph, Breadth or Depth first search. I assumed the results would be the same with any of those parameters, just the calculation time would be different. However using BFS I get more results than using DFS, is that correct? Do am I doing something wrong?

Here is my code:

MATCH (concept:Concept) 
WHERE concept.id in ['/c/en/sweeping']  
CALL apoc.path.subgraphAll(concept, { 
    relationshipFilter: null, 
    minLevel: 0, 
    maxLevel: 2, 
    bfs: true }) 
YIELD nodes, relationships 
RETURN nodes, relationships;

I attach some images so you can see the difference in the results


I would really appreciate your help.

Best regards,
Andrés

1 ACCEPTED SOLUTION

Hi Andres, but unfortunately this is not true: "At the end the same set of nodes will be visited in both cases."

The reason for this is that we're using NODE_GLOBAL uniqueness in the procedure, meaning that once a node has been visited once, it won't be visited again. This is useful when we want to prevent redundant processing/visiting of the same nodes, but it will conflict with DFS expansion.

When we use BFS, we encounter nodes at the earliest possible point we could possibly encounter them. Meaning that if some connected node (let's call it x) is reachable at both 1 and 2 hops from your starting node, BFS will encounter node x at the first hop, and then it can look beyond it to connected nodes at the second hop. Because of NODE_GLOBAL uniqueness, even though node x is ALSO reachable at 2 hops, that path won't be processed. But it doesn't need to be. The proc is meant to find reachable nodes, it doesn't need to know or process all possible paths to them. Once those reachable nodes are found, only then does it find all relationships between those nodes.

But if you're using DFS, it might (by chance) expand the path where node x is encountered at 2 hops away first. When it eventually gets to the path where it would have encountered node x at 1 hop away, it will not process that path, because node x has already been visited. This means that although there are additional nodes at 2-hop distance from the starting node (and 1 hop from node x) they will not be visited because node x was first visited at a 2-hop distance, and your proc is configured with a maxLevel of 2.

So the order of expansion/visitation matters in DFS. If something changes in the planner such that a different order of expansion occurs, then your results could differ. It's best to use BFS for more reliable and deterministic results.

View solution in original post

4 REPLIES 4

@ameyasoft, thank you very much for your reply. Nice explanation indeed. However, my question is why the same function apoc.path.subgraphAll running over the same data give different results just by changing the way I traverse the graph?

Shouldn't both DFS and BFS give the same results (same nodes and relationships) starting from a particular node up to 2 hops? Just the order of how the nodes in the subgraph are traversed is what changes. At the end the same set of nodes will be visited in both cases.

Hi Andres, but unfortunately this is not true: "At the end the same set of nodes will be visited in both cases."

The reason for this is that we're using NODE_GLOBAL uniqueness in the procedure, meaning that once a node has been visited once, it won't be visited again. This is useful when we want to prevent redundant processing/visiting of the same nodes, but it will conflict with DFS expansion.

When we use BFS, we encounter nodes at the earliest possible point we could possibly encounter them. Meaning that if some connected node (let's call it x) is reachable at both 1 and 2 hops from your starting node, BFS will encounter node x at the first hop, and then it can look beyond it to connected nodes at the second hop. Because of NODE_GLOBAL uniqueness, even though node x is ALSO reachable at 2 hops, that path won't be processed. But it doesn't need to be. The proc is meant to find reachable nodes, it doesn't need to know or process all possible paths to them. Once those reachable nodes are found, only then does it find all relationships between those nodes.

But if you're using DFS, it might (by chance) expand the path where node x is encountered at 2 hops away first. When it eventually gets to the path where it would have encountered node x at 1 hop away, it will not process that path, because node x has already been visited. This means that although there are additional nodes at 2-hop distance from the starting node (and 1 hop from node x) they will not be visited because node x was first visited at a 2-hop distance, and your proc is configured with a maxLevel of 2.

So the order of expansion/visitation matters in DFS. If something changes in the planner such that a different order of expansion occurs, then your results could differ. It's best to use BFS for more reliable and deterministic results.

Hi Andrew,

Thank you very much for such a nice explanation of how the proc actually works. Now everything is clear to me. I was using DFS because it is a lot faster than BFS in the graph I am working on. Then, I noticed the difference in the results and that's why I was wondering if I was doing something wrong. Now I will analyze if for my use case it is still valid to use the results provided by DFS.

Thanks once again!.