Neo4j

wumirose · ‎06-29-2022

Hi folks,

I am trying to extract a subgraph and its graph data (as '.txt' or another format) from a large graph.

Approach 1:

Randomly sample nodes of all types from the large graph:

MATCH (source: Node)-[r*..]-(target: Node)
WHERE source.name<>target.name
WITH source, target
SKIP 10
LIMIT 1+rand(10)
RETURN *

I couldn't get this to work because the estimated rows are large, and the connection times out frequently while streaming.
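
One way to keep Approach 1 bounded is to sample the start nodes first and only then expand a fixed number of hops. The following is a minimal sketch, not a tuned solution: the sample size of 100, the 3-hop cap, and the final row limit are arbitrary assumptions. Note that `rand()` takes no arguments and `LIMIT` expects an integer, so `LIMIT 1+rand(10)` is unlikely to be accepted as written.

```cypher
// Sketch: pick a random set of start nodes, then expand at most 3 hops.
// The numbers 100, 3 and 1000 are illustrative assumptions.
MATCH (source:Node)
WITH source, rand() AS r
ORDER BY r
LIMIT 100
MATCH p = (source)-[*1..3]-(target:Node)
WHERE source.name <> target.name
RETURN p
LIMIT 1000
```

Ordering every node by a random key still scans and sorts all `Node` nodes, so on a very large graph a filter such as `WHERE rand() < 0.01` may be a cheaper way to pick start nodes, at the cost of an inexact sample size.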

Approach 2:

Get the n-hop relationships between two kinds of nodes, then extract the path data (including the source and target nodes, the relationships, and node data such as node degree and node type). I have tried:

MATCH (source:Node{type: 'typeA'}),(target:Node{type: 'typeB'})
WHERE source.name<>target.name
CALL apoc.algo.allSimplePaths(source, target, '', 3) YIELD path AS Paths
WITH Paths AS P
WHERE length(P)<2
SKIP 10
LIMIT 500
RETURN P, apoc.path.elements(P) as elements

For path length 2:

MATCH (source:Node{type: 'typeA'}),(target:Node{type: 'typeB'})
WHERE source.name<>target.name
CALL apoc.algo.allSimplePaths(source, target, '', 3) YIELD path AS Paths
WITH Paths AS P
WHERE length(P)>1 AND length(P)<3
SKIP 10
LIMIT 500
RETURN P, apoc.path.elements(P) as elements

Then for path length 3:

MATCH (source:Node{type: 'typeA'}),(target:Node{type: 'typeB'})
WHERE source.name<>target.name
CALL apoc.algo.allSimplePaths(source, target, '', 3) YIELD path AS Paths
WITH Paths AS P
WHERE length(P)>2 
SKIP 10
LIMIT 500
RETURN P, apoc.path.elements(P) as elements

This yields about a million rows. However, I would like to sample the subpaths so that, for a three-hop subgraph, I get 3000 rows in total:

1000 rows of 1-hop connections (randomly sampled, or the top or bottom rows)
1000 rows of 2-hop connections
1000 rows of 3-hop connections

Columns: source | source type | relationship | target | target type | PathLength
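
A rough sketch of how rows in that shape might be produced, keeping the first 1000 paths found for each path length. It assumes the relationship name is stored in a property called `r` (as in the sample data posted later in this thread) and that the node type lives in the `type` property; adjust both to your model.

```cypher
// Sketch: keep up to 1000 paths per path length, then emit one row per relationship.
// Assumes a relationship property `r` and node properties `name` and `type`.
MATCH (source:Node {type: 'typeA'}), (target:Node {type: 'typeB'})
WHERE source.name <> target.name
CALL apoc.algo.allSimplePaths(source, target, '', 3) YIELD path
WITH length(path) AS pathLength, collect(path)[..1000] AS paths
UNWIND paths AS p
UNWIND relationships(p) AS rel
RETURN startNode(rel).name AS source,
       startNode(rel).type AS sourceType,
       rel.r               AS relationship,
       endNode(rel).name   AS target,
       endNode(rel).type   AS targetType,
       pathLength
ORDER BY pathLength
```

Each path expands to one row per relationship, so 1000 three-hop paths give 3000 rows; lower the slice size if the cap should apply to rows rather than paths. `startNode(rel)` and `endNode(rel)` follow the stored direction of the relationship, which may differ from the direction in which the path was traversed.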

Any help will be greatly appreciated.

ameyasoft · ‎06-29-2022

Please explain a little more about your data model. Does 'Node' have a property 'name' besides 'type'? Are you expecting thousands of nodes at each level? If so, then one source node is connected to thousands of target nodes at level 1. I am trying to understand your model so that I can offer some solutions.

ameyasoft · ‎06-29-2022

Try this and check the numbers:

MATCH (source:Node{type: 'typeA'})
CALL apoc.path.spanningTree(source, {maxLevel: 1}) YIELD path
WITH distinct length(path) as lvl, nodes(path) as n1, relationships(path) as rel
UNWIND n1 as n2
UNWIND rel as rels
RETURN lvl, count(distinct n2) as nodeCnt, count(distinct type(rels)) as relCnt

ameyasoft · ‎07-01-2022

I used your sample data and ran this query:

MATCH (source:Node{type: 'Molecule'}),(target:Node{type: 'Gene'})
WHERE source.name<>target.name
CALL apoc.algo.allSimplePaths(source, target, '', 4) YIELD path
WITH relationships(path) as rel, nodes(path) as n1, length(path) as lvl
UNWIND n1 as n2
UNWIND rel as rels
WITH lvl, collect(distinct n2.type) as lbl, collect(distinct id(n2)) as ids, collect(distinct rels.r) as r1
RETURN lvl, lbl, ids, r1, size(ids) as cnt ORDER BY lvl

Result: [screenshot: Screen Shot 2022-07-01 at 2.21.59 PM.png]

ameyasoft · ‎07-01-2022

Please run the above query in your database. If there is too much data, run it for levels 1 and 2 only and let me know the node counts. Based on the node counts, we can try some methods to extract a subset of nodes from each level. This is not going to be a direct process and may involve several steps.

wumirose · ‎07-03-2022

I deeply appreciate your help. Maybe a few more lines here will clarify my issue:

Say allSimplePaths(A, B, '', 3) returns paths that look like this:

[A -> relation1 -> B]
[A -> relation2 -> B]
[A -> relation2 -> C -> relation1 -> B]
[A -> relation5 -> D -> relation3 -> B]
[A -> relation2 -> Y -> relation1 -> B]
[A -> relation2 -> E -> relation1 -> B]
[A -> relation2 -> D -> relation2 -> F -> relation1 -> B]
[A -> relation2 -> F -> relation4 -> Y -> relation2 -> B]

Desired result: for each path length, randomly return 1 row:

[A -> relation2 -> B]
[A -> relation2 -> Y -> relation1 -> B]
[A -> relation2 -> D -> relation2 -> F -> relation1 -> B]

The result is representative of all path lengths:

The first row, [A -> relation2 -> B], is a sample from path length 1.

The second row, [A -> relation2 -> Y -> relation1 -> B], is a sample from path length 2.

The third row, [A -> relation2 -> D -> relation2 -> F -> relation1 -> B], is a sample from path length 3.
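
The "one random row per path length" idea described above could be sketched by grouping the paths by length and drawing from each group with `apoc.coll.randomItems`. This is an illustration under assumptions, not a confirmed solution from the thread: the 'Molecule' and 'Gene' types are taken from the sample data, and the sample size of 1 can be raised to 1000 for the original requirement.

```cypher
// Sketch: group paths by length, then draw a random sample from each group.
// Sample size 1 mirrors the example above; use 1000 for the original request.
MATCH (source:Node {type: 'Molecule'}), (target:Node {type: 'Gene'})
WHERE source.name <> target.name
CALL apoc.algo.allSimplePaths(source, target, '', 3) YIELD path
WITH length(path) AS pathLength, collect(path) AS paths
UNWIND apoc.coll.randomItems(paths, 1) AS sampledPath
RETURN pathLength, sampledPath
ORDER BY pathLength
```

Because `collect(path)` holds every path of a given length in memory before sampling, this may be too heavy when the query produces millions of paths; restricting the source or target nodes first is one way to reduce that.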

ameyasoft · ‎07-05-2022

This code returns the results as JSON. To select random rows for each level, export the data for each level separately, select the rows you need from each level, and then combine the results.

MATCH (source:Node{type: 'Molecule'}),(target:Node{type: 'Gene'})
WHERE source.name<>target.name
CALL apoc.algo.allSimplePaths(source, target, '', 3) YIELD path

with relationships(path) as rels , nodes(path) as n1, length(path) as lvl
with lvl, collect(distinct n1) as n2, collect(distinct rels) as r2
with apoc.coll.toSet(apoc.coll.flatten(n2)) AS n12, apoc.coll.toSet(apoc.coll.flatten(r2)) AS r12, lvl

with n12 as nodes, r12 as relationships, lvl

WITH lvl, [ node in nodes | node {.*, label:labels(node)[0], id:tostring(id(node))}] as nodes,
[rel in relationships | rel {.*, fromNode:{label:labels(startNode(rel))[0], id:tostring(id(startNode(rel)))},type:type(rel), toNode:{label:labels(endNode(rel))[0], id:tostring(id(endNode(rel)))}}] as rels
With lvl, collect(distinct rels) as Allrels, collect(distinct nodes) as AllNodes order by lvl
WITH {nodes:AllNodes, relationships:Allrels, level:lvl} as json
RETURN apoc.convert.toJson(json)
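
If a file on disk is preferred over a JSON string, one level at a time can be written out with `apoc.export.csv.query`. This is a sketch under assumptions: the file name `paths_level_2.csv` is hypothetical, the relationship property `r` comes from the sample data, and file export has to be enabled in the APOC configuration (`apoc.export.file.enabled=true`).

```cypher
// Sketch: export up to 1000 paths of length 2 to a CSV file.
// The file name is a placeholder; repeat with a different length per level.
CALL apoc.export.csv.query(
  "MATCH (source:Node {type: 'Molecule'}), (target:Node {type: 'Gene'})
   WHERE source.name <> target.name
   CALL apoc.algo.allSimplePaths(source, target, '', 3) YIELD path
   WITH path WHERE length(path) = 2
   RETURN [n IN nodes(path) | n.name] AS nodeNames,
          [rel IN relationships(path) | rel.r] AS relations,
          length(path) AS pathLength
   LIMIT 1000",
  "paths_level_2.csv",
  {}
)
YIELD file, rows
RETURN file, rows
```

Running this once per level and concatenating the files is one way to do the "export each level, then combine" step described above.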

wumirose · ‎06-30-2022

For instance:

I have a network created with:

CREATE (a:Node {name: 'mola', type: 'Molecule'})
CREATE (g:Node {name: 'molg', type: 'Molecule'})
CREATE (b:Node {name: 'drgb', type: 'Drug'})
CREATE (h:Node {name: 'drgh', type: 'Drug'})
CREATE (c:Node {name: 'mola', type: 'Disease'})
CREATE (i:Node {name: 'disi', type: 'Disease'})
CREATE (j:Node {name: 'disj', type: 'Disease'})
CREATE (m:Node {name: 'dism', type: 'Disease'})
CREATE (d:Node {name: 'chemd', type: 'Chemical'})
CREATE (k:Node {name: 'chemk', type: 'Chemical'})
CREATE (e:Node {name: 'genee', type: 'Gene'})
CREATE (l:Node {name: 'genel', type: 'Gene'})
CREATE (f:Node {name: 'mola', type: 'DNA'})
MERGE (a)-[:REL {r: 'subclass_of'}]->(b)
MERGE (a)-[:REL {r: 'cure'}]->(c)
MERGE (a)-[:REL {r: 'inhibits'}]->(d)
MERGE (b)-[:REL {r: 'heals'}]->(d)
MERGE (c)-[:REL {r: 'causes'}]->(d)
MERGE (c)-[:REL {r: 'expands'}]->(e)
MERGE (d)-[:REL {r: 'kills'}]->(e)
MERGE (d)-[:REL {r: 'involved_in'}]->(f)
MERGE (b)-[:REL {r: 'heals'}]->(i)
MERGE (c)-[:REL {r: 'part_of'}]->(j)
MERGE (c)-[:REL {r: 'expands'}]->(k)
MERGE (f)-[:REL {r: 'kills'}]->(l)
MERGE (b)-[:REL {r: 'heals'}]->(i)
MERGE (c)-[:REL {r: 'part_of'}]->(j)
MERGE (c)-[:REL {r: 'expands'}]->(k)
MERGE (l)-[:REL {r: 'kills'}]->(l)
MERGE (m)-[:REL {r: 'heals'}]->(i)
MERGE (a)-[:REL {r: 'part_of'}]->(e)
MERGE (c)-[:REL {r: 'expands'}]->(m)
MERGE (e)-[:REL {r: 'interacts_with'}]->(f)

Using

MATCH (source),(target)
WHERE source<> 'None' AND target<>'None' AND source<target
CALL apoc.algo.allSimplePaths(source, target, '', 4)
YIELD path AS P
RETURN P, length(P)

I got:

P, length(P)

(mola)-[:REL {r: 'subclass_of'}]->(drgb),1

(mola)-[:REL {r: 'inhibits'}]->(chemd),1

(drgb)-[:REL {r: 'heals'}]->(disi),1

(chemd)-[:REL {r: 'kills'}]->(genee),1

(chemd)-[:REL {r: 'involved_in'}]->(mola),1

(disi)<-[:REL {r: 'heals'}]-(dism),1

(mola)-[:REL {r: 'inhibits'}]->(chemd)<-[:REL {r: 'heals'}]-(drgb),2

(mola)-[:REL {r: 'part_of'}]->(genee)<-[:REL {r: 'expands'}]-(mola),2

(mola)-[:REL {r: 'inhibits'}]->(chemd)<-[:REL {r: 'causes'}]-(mola),2

(mola)-[:REL {r: 'subclass_of'}]->(drgb)-[:REL {r: 'heals'}]->(disi),2

(mola)-[:REL {r: 'cure'}]->(mola)-[:REL {r: 'part_of'}]->(disj),2

(mola)-[:REL {r: 'cure'}]->(mola)-[:REL {r: 'expands'}]->(dism),2

(mola)-[:REL {r: 'part_of'}]->(genee)<-[:REL {r: 'kills'}]-(chemd),2

(mola)-[:REL {r: 'inhibits'}]->(chemd)-[:REL {r: 'involved_in'}]->(mola)-[:REL {r: 'kills'}]->(genel),3

(mola)-[:REL {r: 'inhibits'}]->(chemd)-[:REL {r: 'kills'}]->(genee)-[:REL {r: 'interacts_with'}]->(mola),3

(mola)-[:REL {r: 'part_of'}]->(genee)<-[:REL {r: 'kills'}]-(chemd)-[:REL {r: 'involved_in'}]->(mola),3

(mola)-[:REL {r: 'cure'}]->(mola)-[:REL {r: 'expands'}]->(genee)-[:REL {r: 'interacts_with'}]->(mola),3

(mola)-[:REL {r: 'cure'}]->(mola)-[:REL {r: 'causes'}]->(chemd)-[:REL {r: 'involved_in'}]->(mola),3

(mola)-[:REL {r: 'subclass_of'}]->(drgb)-[:REL {r: 'heals'}]->(chemd)-[:REL {r: 'involved_in'}]->(mola),3

(drgb)-[:REL {r: 'heals'}]->(disi)<-[:REL {r: 'heals'}]-(dism)<-[:REL {r: 'expands'}]-(mola),3

(drgb)-[:REL {r: 'heals'}]->(chemd)-[:REL {r: 'kills'}]->(genee)<-[:REL {r: 'expands'}]-(mola),3

(drgb)<-[:REL {r: 'subclass_of'}]-(mola)-[:REL {r: 'part_of'}]->(genee)<-[:REL {r: 'expands'}]-(mola),3

(drgb)<-[:REL {r: 'subclass_of'}]-(mola)-[:REL {r: 'inhibits'}]->(chemd)<-[:REL {r: 'causes'}]-(mola),3

(drgb)-[:REL {r: 'heals'}]->(chemd)<-[:REL {r: 'inhibits'}]-(mola)-[:REL {r: 'cure'}]->(mola),3

(drgb)-[:REL {r: 'heals'}]->(chemd)<-[:REL {r: 'causes'}]-(mola)-[:REL {r: 'part_of'}]->(disj),3

(mola)<-[:REL {r: 'cure'}]-(mola)-[:REL {r: 'subclass_of'}]->(drgb)-[:REL {r: 'heals'}]->(chemd)-[:REL {r: 'involved_in'}]->(mola),4

(disi)<-[:REL {r: 'heals'}]-(drgb)-[:REL {r: 'heals'}]->(chemd)<-[:REL {r: 'causes'}]-(mola)-[:REL {r: 'part_of'}]->(disj),4

(disi)<-[:REL {r: 'heals'}]-(drgb)<-[:REL {r: 'subclass_of'}]-(mola)-[:REL {r: 'cure'}]->(mola)-[:REL {r: 'part_of'}]->(disj),4

(disi)<-[:REL {r: 'heals'}]-(drgb)-[:REL {r: 'heals'}]->(chemd)<-[:REL {r: 'causes'}]-(mola)-[:REL {r: 'expands'}]->(dism),4

(disi)<-[:REL {r: 'heals'}]-(drgb)<-[:REL {r: 'subclass_of'}]-(mola)-[:REL {r: 'cure'}]->(mola)-[:REL {r: 'expands'}]->(dism),4

(drgb)<-[:REL {r: 'subclass_of'}]-(mola)-[:REL {r: 'cure'}]->(mola)-[:REL {r: 'expands'}]->(genee)-[:REL {r: 'interacts_with'}]->(mola),4

My Question:

How can I randomly return only a subset of the paths that is representative of all path lengths? E.g.