Neo4j

lorenzh · ‎06-21-2022

Hi, I am currently working on a project that involves machine learning over paths in a graph. For this, I have to extract all possible paths between a known start and end node. Unfortunately, the graphs that are involved tend to be rather large, such that normal cypher queries would be to slow to perform this kind of operation. I have been looking into the APOC- and the GDSL-libraries, but unfortunately I could not find an algorithm that would suit this kind of problem. The closest I could find in terms of the APOC-library, was the apoc.path.subgraphAll() procedure. As for the GDSL there appears to be no such procedure at all, since the BFS- and DFS-algorithms only return a single, but not all paths. As a last resort, I also considered implementing my own algorithm using the Pregel-API.

Did I miss any option or is that all there is? I'd be glad if there was a simple way performing this operation, as I do not have much time unfortunately. So if I could prevent having to resort to the Pregel-API, I'd be more than happy.

Thanks a ton!

bennu_neo · ‎06-21-2022

Hi @lorenzh ,

Have you tried something like...?

MATCH(start : {id : $startId}), (end {id : $idEnd})
with start, end
CALL apoc.path.expandConfig(start, {
   terminatorNodes : [end],
   uniqueness : 'RELATIONSHIP_PATH' //You may like changing this
   optional : true
}) yield path
return count(path)

Bennu

Oh, y’all wanted a twist, ey?

lorenzh · ‎06-22-2022

Thanks for your reply. While in theory this is exactly what I need, it appears that even on small graphs (~400 nodes, ~700 edges) and path deeper than five nodes, the algorithm is too slow. I am afraid, I'll have to go with GDS then and somehow manually implement what I need.

bennu_neo · ‎06-23-2022

Sounds fair. If you have a public instance/db dump that i can test I would be happy to help (Love this challenging problems) 😄

Oh, y’all wanted a twist, ey?

lorenzh · ‎06-24-2022

Thanks a lot, that is very kind! I have made some progress since, by resorting to the findAllPaths() algorithm found in the standard graphdb-library. It seems to work at least reasonably fast for depths of up to 100 nodes in my example.

Ultimately, however, it would be great to have this kind of method also in the GDSL, such that one could leverage graph projections. In this regard, I have tried using the pregel-API to solve this problem, but I gave up for now due to some of it's limitations. As far as I understand, the order in which messages are sent is not retained with pregel and there is no way to tell from which node a message originated, right? If that's correct, then I would assume there is also no way one could construct paths using computations on the individual nodes alone. There might be a chance to deal with is kind of problem within the masterCompute()method, but as I said, the solution I found seems to be acceptable for now.

If you want to give it a try for yourself, I could provide you with a db-dump of course 🙂

bennu_neo · ‎06-29-2022

Hi @lorenzh !

In general this a 'well known' problem.. the loopy lattices problem. It scales really bad, generally speaking, so it's always good knowing some context on every use case so it's possible to evaluate particular optimizations.

I'll take a look on the problem in general, but it could be good if you can share me that dump so we can land it on your use case. Pregel may be a good option (once you arrive to this levels you need to power of parallel computing). I saw some other post you made about, indeed handling everything with integer messages may be tricky, but lets try 😄

Oh, y’all wanted a twist, ey?

lorenzh · ‎06-30-2022

Thanks, so the problem I am facing is called loopy lattices? I knew this was going to be hard, but taking a closer look at it, now that I know what it's called, it's definitely harder than I imagined. It may still be possible in this case, but let me explain a little more what I am trying to do here:

The purpose of all of this is to run some static code analysis on a graph representation of a given program. The type of graph in particular is a so-called Sparse Value Flow Graph (SVFG), generated using a tool called SVF. In short, this graph representation models the flows of values throughout a program, either directly through name-taken variables or indirectly through address-taken values. Using the available methods, it is easy to answer questions such as: "Does a value produced at some point in the program, reach another defined point in the program". This is also known as a taint-style analysis. Imagine we have a function that introduces a user defined value into the program (e.g. the gets()-function) we must make sure that it never reaches the call-site of, lets say, the malloc()-function.

Now, what I want to do is, I want to retrieve not one single path between two of such points, I want to retrieve all possible paths, through which a value could flow from point A to point B. As the "S" in "SVFG" stands for "Sparse", the graph is not that densely connected. I counted an average of 2.5 outgoing edges per node. Another characteristic that may be beneficial in this case is that while some nodes are highly connected, the maximum depth of the paths they define are mostly very short. Nodes on longer paths are usually very sparsely connected. See the example below:

If you're interested and you'd like to help me out, I'd be very thankful. I've create d a dump of a small-ish httpd server called "darkhttpd" and pushed it to this GitHub-repository that you could have a look at. I could of course also provide you with push access, if you want to. And yeah, as you've already inferred correctly from my other post, I am currently giving pregel a try, since I had some mild success wrt. another problem 😉

Thanks a ton,

Lorenz

bennu_neo · ‎06-30-2022

Hi @lorenzh!

I'll take a look on that dump later today of possible. Can you gimme an example of two nodes that you are using to calculate every possible path? (preferably an slow combination).

Another thing, technically, considering the sequentially of executions, direction matters, no?

PS: This problem/project/idea/homework looks fun (hard and fun)

Oh, y’all wanted a twist, ey?

lorenzh · ‎06-30-2022

Hi, so I've run a couple of experiments and a particularly slow combination appears to be the following, where n is the source node and m is the destination node.

MATCH (n) WHERE n.n_hash = "42222368"
MATCH (m) WHERE m.n_hash IN ["43603232"]
RETURN n AS source, m AS sink

And yes, that's correct, the direction does matter. Using pregel, I thought it might be possible to do something along the lines of:

Beginning at the start node, mark it as reached and send the own node id to all neighbors
Neighbors that receive node ids mark themselves as reached and keep record of all ids sent to them
In the masterCompute() method, after each superstep a list of pairs is built for each node reached in the last step, such that each pair contains a) the id sent to the node and b) the own id of the node.
Now with these "puzzle pieces", paths can be constructed using the list of pairs from the last super step. For each node that occurs multiple times as a sender in the pairs, all paths ending in that node must be copied

SS 1   SS 2   SS 3 | SS n
A->B | B->C | E->F | ...
A->C | B->E | E->G | ...
A->D | C->E | E->D | ...

After SS 1
A->B
A->C
A->D

After SS 2
A->B->C
A->B->E
A->C->E
A->D

After SS 3
A->B->C
A->B->E->F
A->B->E->G
A->B->E->D
A->C->E->F
A->C->E->G
A->C->E->D
A->D

I haven't had the time to think all of that through and I am not sure how much it may speed up things, if it does at all. After all there is a lot of stuff going on the in masterCompute() method that is not being executed in parallel.

Thanks a lot,

Lorenz

bennu_neo · ‎06-30-2022

Hi @lorenzh ,

Quick question. From sink to source, which relationship are you planning to traverse (with direction)? svfg? direct +indirect svfg edge?

Oh, y’all wanted a twist, ey?

lorenzh · ‎06-30-2022

Ah yeah of course, sorry I forgot. The "svfg"-type edges are what you want to follow. Technically the different kinds of edges mark different types relationships. I introduced the "svfg"-edges as kind of a "super"-relationship. Makes it easier to project using native projections...

bennu_neo · ‎06-30-2022

Well @lorenzh , how many path do you expect on ur previous example? Using the relationshigFilter you get one (and fast).

MATCH (start) WHERE start.n_hash = "42222368"
MATCH (end) WHERE end.n_hash IN ["43603232"]
with start, end
CALL apoc.path.expandConfig(start, {
   terminatorNodes : [end],
   uniqueness : 'NODE_PATH',
   relationshipFilter: 'svfg>',
   optional : true
}) yield path
return path

Bennu

Oh, y’all wanted a twist, ey?

Neo4j

Best way of retrieving all paths between a start and end node