cancel
Showing results for 
Search instead for 
Did you mean: 

Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.

Commit History - all paths

dpetit1
Node Link

I have created a graph based on commit history of a source code control tool. The basic graph is

(c:Commit)-[:PARENT]->(p:Commit)

in which every commit except the first references one or more parent commits. The commit is indexed on its URI.
I'm looking to find all nodes between 2 given commits (answering the question: what commits have been added since some starting commit, excluding commits that aren't part of the path).

I have a query that works:

MATCH p = (c:Commit {uri:"latest_commit_uri"})-[:PARENT*0..10]->(o:Commit {uri:"starting_commit_uri"})
return p;

This gives me what I need, but doesn't perform very well if it remove the 0..10. I have tried with 2 commits that are fairly close to one another in the graph and it performs well, but the further apart the nodes get, the worse it gets. I would have expected the performance to be roughly linear since both nodes are indexed and can be found quickly.
For example, the shortest path between one node I have been working with and another node is 33 commits, and there are probably 200-300 nodes in all of the paths put together. But this query never returns.

My question is: is there a better way to do this? It seems to me this would be exactly the kind of query neo4j should excel at. What am I missing?

1 ACCEPTED SOLUTION

Thanks for the reply - I forgot to circle back on this. subgraphAll didn't do the trick, but subgraphNodes worked well. Given 2 commits (1 in the path of the other as a parent), a can now collect all commits that are "new" and all commits that are "missing". This works by retracing the entire commit history for each commit, and doing a comparison when finished.

MATCH (c: Commit {uri:"Commit1URI"})
CALL apoc.path.subgraphNodes(c, {
relationshipFilter: "PARENT>",
minLevel: 0
})
YIELD node
WITH collect(node) as latest
MATCH (c: Commit {uri:"Commit2URI)
CALL apoc.path.subgraphNodes(c, {
relationshipFilter: "PARENT>",
minLevel: 0
})
YIELD node
WITH latest, collect(node) as current
return [x in latest where not x in current] as whatsnewlist, [y in current where not y in latest] as whatsremovedlist

View solution in original post

4 REPLIES 4

This is looking for all possible paths that meet the pattern, so it won't stop looking even if it finds a path, it will exhaustively keep searching. If there is only one such possible path, then there are some optimizations we can make.

You could add a LIMIT 1 to the end.

Or you can use a shortestPath() function on the pattern:

MATCH p = shortestPath((c:Commit {uri:"latest_commit_uri"})-[:PARENT*0..10]->(o:Commit {uri:"starting_commit_uri"}))
return p;

This will do a bidirectional bfs expansion until such a path is found.

Also make sure to EXPLAIN the query to make sure index lookups are being used for both nodes.

dpetit1
Node Link

Andrew -
Thanks for the reply.
What I am really looking for is the subgraph of nodes between 2 nodes. My strategy was to traverse all possible paths between 2 nodes, then gather and dedupe the nodes (hence the question). But I can understand why this might be challenging on a huge graph (I didn't expect the perform to get so bad so quickly for the database in question, though - the paths are relatively simple and limited).
After a bit more research, I've decided to try apoc.path.subgraphAll, which appear to provide what I need. It's going to take me a bit to get it apoc installed to try it out. Do you think this will be an efficient approach?

You might want to check for yourself how complex this can get by counting the paths found. Try this, and see how it grows as you keep increasing the upper bound of the variable length pattern:

MATCH p = (c:Commit {uri:"latest_commit_uri"})-[:PARENT*0..10]->(o:Commit)
return count(p) as paths;

Note that this isn't including the filtering on the end, so this is just showing how many paths it has to check and filter to see if it meets the pattern.

SubgraphAll() may not be the best choice. It doesn't work so well when you have a specific end node. It is possible it will not be able to visit all the nodes you want, since it isn't about finding all possible paths between two nodes. It isn't a good fit for your use case. I would recommend sticking with Cypher for this one.

Thanks for the reply - I forgot to circle back on this. subgraphAll didn't do the trick, but subgraphNodes worked well. Given 2 commits (1 in the path of the other as a parent), a can now collect all commits that are "new" and all commits that are "missing". This works by retracing the entire commit history for each commit, and doing a comparison when finished.

MATCH (c: Commit {uri:"Commit1URI"})
CALL apoc.path.subgraphNodes(c, {
relationshipFilter: "PARENT>",
minLevel: 0
})
YIELD node
WITH collect(node) as latest
MATCH (c: Commit {uri:"Commit2URI)
CALL apoc.path.subgraphNodes(c, {
relationshipFilter: "PARENT>",
minLevel: 0
})
YIELD node
WITH latest, collect(node) as current
return [x in latest where not x in current] as whatsnewlist, [y in current where not y in latest] as whatsremovedlist