Neo4j

andres_tellog · ‎09-20-2021

Dear community, I want to ask for your help to understand something.
I am using the function apoc.path.subgraphAll to get the subgraph containing 2 hops from the source node in any direction and with any relationship type. This function has the parameter bfs to select the way to traverse the graph, Breadth or Depth first search. I assumed the results would be the same with any of those parameters, just the calculation time would be different. However using BFS I get more results than using DFS, is that correct? Do am I doing something wrong?

Here is my code:

MATCH (concept:Concept) 
WHERE concept.id in ['/c/en/sweeping']  
CALL apoc.path.subgraphAll(concept, { 
    relationshipFilter: null, 
    minLevel: 0, 
    maxLevel: 2, 
    bfs: true }) 
YIELD nodes, relationships 
RETURN nodes, relationships;

I attach some images so you can see the difference in the results

I would really appreciate your help.

Best regards,
Andrés

andrew_bowman · ‎09-24-2021

Hi Andres, but unfortunately this is not true: "At the end the same set of nodes will be visited in both cases."

The reason for this is that we're using NODE_GLOBAL uniqueness in the procedure, meaning that once a node has been visited once, it won't be visited again. This is useful when we want to prevent redundant processing/visiting of the same nodes, but it will conflict with DFS expansion.

When we use BFS, we encounter nodes at the earliest possible point we could possibly encounter them. Meaning that if some connected node (let's call it x) is reachable at both 1 and 2 hops from your starting node, BFS will encounter node x at the first hop, and then it can look beyond it to connected nodes at the second hop. Because of NODE_GLOBAL uniqueness, even though node x is ALSO reachable at 2 hops, that path won't be processed. But it doesn't need to be. The proc is meant to find reachable nodes, it doesn't need to know or process all possible paths to them. Once those reachable nodes are found, only then does it find all relationships between those nodes.

But if you're using DFS, it might (by chance) expand the path where node x is encountered at 2 hops away first. When it eventually gets to the path where it would have encountered node x at 1 hop away, it will not process that path, because node x has already been visited. This means that although there are additional nodes at 2-hop distance from the starting node (and 1 hop from node x) they will not be visited because node x was first visited at a 2-hop distance, and your proc is configured with a maxLevel of 2.

So the order of expansion/visitation matters in DFS. If something changes in the planner such that a different order of expansion occurs, then your results could differ. It's best to use BFS for more reliable and deterministic results.

View solution in original post

ameyasoft · ‎09-21-2021

Check this link for nicer explanation.
https://www.techiedelight.com/depth-first-search-dfs-vs-breadth-first-search-bfs/

andres_tellog · ‎09-22-2021

@ameyasoft, thank you very much for your reply. Nice explanation indeed. However, my question is why the same function apoc.path.subgraphAll running over the same data give different results just by changing the way I traverse the graph?

Shouldn't both DFS and BFS give the same results (same nodes and relationships) starting from a particular node up to 2 hops? Just the order of how the nodes in the subgraph are traversed is what changes. At the end the same set of nodes will be visited in both cases.

andrew_bowman · ‎09-24-2021

Hi Andres, but unfortunately this is not true: "At the end the same set of nodes will be visited in both cases."

The reason for this is that we're using NODE_GLOBAL uniqueness in the procedure, meaning that once a node has been visited once, it won't be visited again. This is useful when we want to prevent redundant processing/visiting of the same nodes, but it will conflict with DFS expansion.

When we use BFS, we encounter nodes at the earliest possible point we could possibly encounter them. Meaning that if some connected node (let's call it x) is reachable at both 1 and 2 hops from your starting node, BFS will encounter node x at the first hop, and then it can look beyond it to connected nodes at the second hop. Because of NODE_GLOBAL uniqueness, even though node x is ALSO reachable at 2 hops, that path won't be processed. But it doesn't need to be. The proc is meant to find reachable nodes, it doesn't need to know or process all possible paths to them. Once those reachable nodes are found, only then does it find all relationships between those nodes.

But if you're using DFS, it might (by chance) expand the path where node x is encountered at 2 hops away first. When it eventually gets to the path where it would have encountered node x at 1 hop away, it will not process that path, because node x has already been visited. This means that although there are additional nodes at 2-hop distance from the starting node (and 1 hop from node x) they will not be visited because node x was first visited at a 2-hop distance, and your proc is configured with a maxLevel of 2.

So the order of expansion/visitation matters in DFS. If something changes in the planner such that a different order of expansion occurs, then your results could differ. It's best to use BFS for more reliable and deterministic results.

andres_tellog · ‎09-25-2021

Hi Andrew,

Thank you very much for such a nice explanation of how the proc actually works. Now everything is clear to me. I was using DFS because it is a lot faster than BFS in the graph I am working on. Then, I noticed the difference in the results and that's why I was wondering if I was doing something wrong. Now I will analyze if for my use case it is still valid to use the results provided by DFS.

Thanks once again!.

Neo4j

apoc.path.subgraphAll: different results depending of Breadth or Depth First Search parameter