Getting a subgraph from a big graph

wumirose
Node Clone

Hi folks,

I am trying to extract a subgraph, along with its graph data (as '.txt' or another format), from a big graph.

  • Approach 1:

Randomly sample all node types from the large graph:

MATCH (source: Node)-[r*..]-(target: Node)
WHERE source.name<>target.name
WITH source, target
SKIP 10
LIMIT 1+rand(10)
RETURN *

I couldn't get this to work because the estimated number of rows is large, and the connection frequently times out while streaming.

  • Approach 2:

Get the n-hop relationships between two kinds of nodes, then extract the path data (the source and target nodes, the relationships, and node data such as node degree and node type). For path length 1, I have tried:

MATCH (source:Node{type: 'typeA'}),(target:Node{type: 'typeB'})
WHERE source.name<>target.name
CALL apoc.algo.allSimplePaths(source, target, '', 3) YIELD path AS Paths
WITH Paths AS P
WHERE length(P)<2
SKIP 10
LIMIT 500
RETURN P, apoc.path.elements(P) as elements

For path length 2:

MATCH (source:Node{type: 'typeA'}),(target:Node{type: 'typeB'})
WHERE source.name<>target.name
CALL apoc.algo.allSimplePaths(source, target, '', 3) YIELD path AS Paths
WITH Paths AS P
WHERE length(P)>1 AND length(P)<3
SKIP 10
LIMIT 500
RETURN P, apoc.path.elements(P) as elements

Then for path length 3:

MATCH (source:Node{type: 'typeA'}),(target:Node{type: 'typeB'})
WHERE source.name<>target.name
CALL apoc.algo.allSimplePaths(source, target, '', 3) YIELD path AS Paths
WITH Paths AS P
WHERE length(P)>2
SKIP 10
LIMIT 500
RETURN P, apoc.path.elements(P) as elements

This yields about a million rows. However, I would like to sample the paths so that, for a three-hop subgraph, I get 3000 rows in total:

  • 1000 rows of 1-hop connections (randomly sampled, or the top or bottom rows)
  • 1000 rows of 2-hop connections
  • 1000 rows of 3-hop connections

Each row would have these columns:

    source | source type | relationship | target | target type | PathLength

Any help will be greatly appreciated.
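
One way to get a fixed number of rows per path length is to shuffle the paths with a random sort key and keep a slice of each length group. The following is a minimal sketch under stated assumptions, not a tested solution: it assumes APOC is installed, that relationships carry a property named r (as in the sample data posted later in this thread), and that it is acceptable for the "relationship" column to be a list with one entry per hop. It still enumerates every path before sampling, so it may not solve the timeout on a very large graph.

```cypher
// Sketch: keep at most 1000 randomly chosen paths for each path length.
MATCH (source:Node {type: 'typeA'}), (target:Node {type: 'typeB'})
WHERE source.name <> target.name
CALL apoc.algo.allSimplePaths(source, target, '', 3) YIELD path
// Attach a random sort key to every path and shuffle on it.
WITH path, length(path) AS pathLength, rand() AS r
ORDER BY r
// Group by length; the slice keeps the first 1000 shuffled paths per group.
WITH pathLength, collect(path)[..1000] AS sampled
UNWIND sampled AS p
RETURN head(nodes(p)).name AS source,
       head(nodes(p)).type AS sourceType,
       [rel IN relationships(p) | rel.r] AS relationships, // assumes property r
       last(nodes(p)).name AS target,
       last(nodes(p)).type AS targetType,
       pathLength
ORDER BY pathLength
```

Changing the slice bound changes the sample size per length. Taking the top rows instead of random ones would only need a different ORDER BY key.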

8 REPLIES

ameyasoft
Graph Maven

Please explain a little more about your data model. Does 'Node' have a 'name' property besides 'type'? Are you expecting thousands of nodes at each level? If so, then one source node is connected to thousands of target nodes at level 1. I am trying to understand your model so that I can offer some solutions.

Try this and check the numbers:

MATCH (source:Node {type: 'typeA'})
CALL apoc.path.spanningTree(source, {maxLevel: 1}) YIELD path
WITH DISTINCT length(path) AS lvl, nodes(path) AS n1, relationships(path) AS rel
UNWIND n1 AS n2
UNWIND rel AS rels
RETURN lvl, count(DISTINCT n2) AS nodeCnt, count(DISTINCT type(rels)) AS relCnt

I used your sample data and ran this query:

MATCH (source:Node{type: 'Molecule'}),(target:Node{type: 'Gene'})
WHERE source.name<>target.name
CALL apoc.algo.allSimplePaths(source, target, '', 4) YIELD path

with relationships(path) as rel , nodes(path) as n1, length(path) as lvl
unwind n1 as n2
unwind rel as rels
with lvl, collect(distinct n2.type) as lbl, collect(distinct id(n2)) as ids, collect(distinct rels.r) as r1
return lvl, lbl, ids, r1, size(ids) as cnt order by lvl
Result: (screenshot: Screen Shot 2022-07-01 at 2.21.59 PM.png)

 

 

Please run the above query in your database. If there is too much data, run it for levels 1 and 2 only and let me know the node counts. Based on the node counts, we can try some methods to extract a subset of nodes from each level. This is not going to be a direct process and may involve several steps.

I deeply appreciate your help. Maybe a few more lines here will clarify my issue:

Say I have allSimplePaths(A, B, '', 3) results that look like this:

  • [A -> relation1 -> B]
  • [A -> relation2 -> B]
  • [A -> relation2 -> C -> relation1 -> B]
  • [A -> relation5 -> D -> relation3 -> B]
  • [A -> relation2 -> Y -> relation1 -> B]
  • [A -> relation2 -> E -> relation1 -> B]
  • [A -> relation2 -> D -> relation2 -> F -> relation1 -> B]
  • [A -> relation2 -> F -> relation4 -> Y -> relation2 -> B]

Desired result: for each path length, randomly return 1 row:

  • [A -> relation2 -> B]
  • [A -> relation2 -> Y -> relation1 -> B]
  • [A -> relation2 -> D -> relation2 -> F -> relation1 -> B]

The result is representative of all path lengths:

  • The first row, [A -> relation2 -> B], is a sample from path length 1.
  • The second row, [A -> relation2 -> Y -> relation1 -> B], is a sample from path length 2.
  • The third row, [A -> relation2 -> D -> relation2 -> F -> relation1 -> B], is a sample from path length 3.

This query returns the results as JSON, one row per level. To select random rows for each level, you need to export the data for each level separately, pick the rows you want from each one, and then combine the results from all levels.

MATCH (source:Node{type: 'Molecule'}),(target:Node{type: 'Gene'})
WHERE source.name<>target.name
CALL apoc.algo.allSimplePaths(source, target, '', 3) YIELD path

with relationships(path) as rels , nodes(path) as n1, length(path) as lvl
with lvl, collect(distinct n1) as n2, collect(distinct rels) as r2
with apoc.coll.toSet(apoc.coll.flatten(n2)) AS n12, apoc.coll.toSet(apoc.coll.flatten(r2)) AS r12, lvl

with n12 as nodes, r12 as relationships, lvl

WITH lvl, [ node in nodes | node {.*, label:labels(node)[0], id:tostring(id(node))}] as nodes,
[rel in relationships | rel {.*, fromNode:{label:labels(startNode(rel))[0], id:tostring(id(startNode(rel)))},type:type(rel), toNode:{label:labels(endNode(rel))[0], id:tostring(id(endNode(rel)))}}] as rels
With lvl, collect(distinct rels) as Allrels, collect(distinct nodes) as AllNodes order by lvl
WITH {nodes:AllNodes, relationships:Allrels, level:lvl} as json
RETURN apoc.convert.toJson(json)
Result: (screenshot: Screen Shot 2022-07-05 at 5.56.18 PM.png)

 

For instance, I have a network created with:

CREATE (a:Node {name: 'mola', type: 'Molecule'})
CREATE (g:Node {name: 'molg', type: 'Molecule'})
CREATE (b:Node {name: 'drgb', type: 'Drug'})
CREATE (h:Node {name: 'drgh', type: 'Drug'})
CREATE (c:Node {name: 'mola', type: 'Disease'})
CREATE (i:Node {name: 'disi', type: 'Disease'})
CREATE (j:Node {name: 'disj', type: 'Disease'})
CREATE (m:Node {name: 'dism', type: 'Disease'})
CREATE (d:Node {name: 'chemd', type: 'Chemical'})
CREATE (k:Node {name: 'chemk', type: 'Chemical'})
CREATE (e:Node {name: 'genee', type: 'Gene'})
CREATE (l:Node {name: 'genel', type: 'Gene'})
CREATE (f:Node {name: 'mola', type: 'DNA'})
MERGE (a)-[:REL {r: 'subclass_of'}]->(b)
MERGE (a)-[:REL {r: 'cure'}]->(c)
MERGE (a)-[:REL {r: 'inhibits'}]->(d)
MERGE (b)-[:REL {r: 'heals'}]->(d)
MERGE (c)-[:REL {r: 'causes'}]->(d)
MERGE (c)-[:REL {r: 'expands'}]->(e)
MERGE (d)-[:REL {r: 'kills'}]->(e)
MERGE (d)-[:REL {r: 'involved_in'}]->(f)
MERGE (b)-[:REL {r: 'heals'}]->(i)
MERGE (c)-[:REL {r: 'part_of'}]->(j)
MERGE (c)-[:REL {r: 'expands'}]->(k)
MERGE (f)-[:REL {r: 'kills'}]->(l)
MERGE (b)-[:REL {r: 'heals'}]->(i)
MERGE (c)-[:REL {r: 'part_of'}]->(j)
MERGE (c)-[:REL {r: 'expands'}]->(k)
MERGE (l)-[:REL {r: 'kills'}]->(l)
MERGE (m)-[:REL {r: 'heals'}]->(i)
MERGE (a)-[:REL {r: 'part_of'}]->(e)
MERGE (c)-[:REL {r: 'expands'}]->(m)
MERGE (e)-[:REL {r: 'interacts_with'}]->(f)

 

Using

MATCH (source),(target)
WHERE source<> 'None' AND target<>'None' AND source<target
CALL apoc.algo.allSimplePaths(source, target, '', 4)
YIELD path AS P
RETURN P, length(P)

 

I got:
P                                                              length(P)
(mola)-[:REL {r: 'subclass_of'}]->(drgb),1
(mola)-[:REL {r: 'inhibits'}]->(chemd),1
(drgb)-[:REL {r: 'heals'}]->(disi),1
(chemd)-[:REL {r: 'kills'}]->(genee),1
(chemd)-[:REL {r: 'involved_in'}]->(mola),1
(disi)<-[:REL {r: 'heals'}]-(dism),1
(mola)-[:REL {r: 'inhibits'}]->(chemd)<-[:REL {r: 'heals'}]-(drgb),2
(mola)-[:REL {r: 'part_of'}]->(genee)<-[:REL {r: 'expands'}]-(mola),2
(mola)-[:REL {r: 'inhibits'}]->(chemd)<-[:REL {r: 'causes'}]-(mola),2
(mola)-[:REL {r: 'subclass_of'}]->(drgb)-[:REL {r: 'heals'}]->(disi),2
(mola)-[:REL {r: 'cure'}]->(mola)-[:REL {r: 'part_of'}]->(disj),2
(mola)-[:REL {r: 'cure'}]->(mola)-[:REL {r: 'expands'}]->(dism),2
(mola)-[:REL {r: 'part_of'}]->(genee)<-[:REL {r: 'kills'}]-(chemd),2
(mola)-[:REL {r: 'inhibits'}]->(chemd)-[:REL {r: 'involved_in'}]->(mola)-[:REL {r: 'kills'}]->(genel),3
(mola)-[:REL {r: 'inhibits'}]->(chemd)-[:REL {r: 'kills'}]->(genee)-[:REL {r: 'interacts_with'}]->(mola),3
(mola)-[:REL {r: 'part_of'}]->(genee)<-[:REL {r: 'kills'}]-(chemd)-[:REL {r: 'involved_in'}]->(mola),3
(mola)-[:REL {r: 'cure'}]->(mola)-[:REL {r: 'expands'}]->(genee)-[:REL {r: 'interacts_with'}]->(mola),3
(mola)-[:REL {r: 'cure'}]->(mola)-[:REL {r: 'causes'}]->(chemd)-[:REL {r: 'involved_in'}]->(mola),3
(mola)-[:REL {r: 'subclass_of'}]->(drgb)-[:REL {r: 'heals'}]->(chemd)-[:REL {r: 'involved_in'}]->(mola),3
(drgb)-[:REL {r: 'heals'}]->(disi)<-[:REL {r: 'heals'}]-(dism)<-[:REL {r: 'expands'}]-(mola),3
(drgb)-[:REL {r: 'heals'}]->(chemd)-[:REL {r: 'kills'}]->(genee)<-[:REL {r: 'expands'}]-(mola),3
(drgb)<-[:REL {r: 'subclass_of'}]-(mola)-[:REL {r: 'part_of'}]->(genee)<-[:REL {r: 'expands'}]-(mola),3
(drgb)<-[:REL {r: 'subclass_of'}]-(mola)-[:REL {r: 'inhibits'}]->(chemd)<-[:REL {r: 'causes'}]-(mola),3
(drgb)-[:REL {r: 'heals'}]->(chemd)<-[:REL {r: 'inhibits'}]-(mola)-[:REL {r: 'cure'}]->(mola),3
(drgb)-[:REL {r: 'heals'}]->(chemd)<-[:REL {r: 'causes'}]-(mola)-[:REL {r: 'part_of'}]->(disj),3
(mola)<-[:REL {r: 'cure'}]-(mola)-[:REL {r: 'subclass_of'}]->(drgb)-[:REL {r: 'heals'}]->(chemd)-[:REL {r: 'involved_in'}]->(mola),4
(disi)<-[:REL {r: 'heals'}]-(drgb)-[:REL {r: 'heals'}]->(chemd)<-[:REL {r: 'causes'}]-(mola)-[:REL {r: 'part_of'}]->(disj),4
(disi)<-[:REL {r: 'heals'}]-(drgb)<-[:REL {r: 'subclass_of'}]-(mola)-[:REL {r: 'cure'}]->(mola)-[:REL {r: 'part_of'}]->(disj),4
(disi)<-[:REL {r: 'heals'}]-(drgb)-[:REL {r: 'heals'}]->(chemd)<-[:REL {r: 'causes'}]-(mola)-[:REL {r: 'expands'}]->(dism),4
(disi)<-[:REL {r: 'heals'}]-(drgb)<-[:REL {r: 'subclass_of'}]-(mola)-[:REL {r: 'cure'}]->(mola)-[:REL {r: 'expands'}]->(dism),4
(drgb)<-[:REL {r: 'subclass_of'}]-(mola)-[:REL {r: 'cure'}]->(mola)-[:REL {r: 'expands'}]->(genee)-[:REL {r: 'interacts_with'}]->(mola),4
 
My Question:
How can I randomly return only a subset of the paths that is representative of all path lengths? E.g.:
 
(mola)-[:REL {r: 'inhibits'}]->(chemd),1
(drgb)-[:REL {r: 'heals'}]->(disi),1
(chemd)-[:REL {r: 'kills'}]->(genee),1
(mola)-[:REL {r: 'part_of'}]->(genee)<-[:REL {r: 'expands'}]-(mola),2
(mola)-[:REL {r: 'inhibits'}]->(chemd)<-[:REL {r: 'causes'}]-(mola),2
(mola)-[:REL {r: 'subclass_of'}]->(drgb)-[:REL {r: 'heals'}]->(disi),2
(mola)-[:REL {r: 'cure'}]->(mola)-[:REL {r: 'causes'}]->(chemd)-[:REL {r: 'involved_in'}]->(mola),3
(mola)-[:REL {r: 'subclass_of'}]->(drgb)-[:REL {r: 'heals'}]->(chemd)-[:REL {r: 'involved_in'}]->(mola),3
(drgb)-[:REL {r: 'heals'}]->(disi)<-[:REL {r: 'heals'}]-(dism)<-[:REL {r: 'expands'}]-(mola),3
(mola)<-[:REL {r: 'cure'}]-(mola)-[:REL {r: 'subclass_of'}]->(drgb)-[:REL {r: 'heals'}]->(chemd)-[:REL {r: 'involved_in'}]->(mola),4
(disi)<-[:REL {r: 'heals'}]-(drgb)-[:REL {r: 'heals'}]->(chemd)<-[:REL {r: 'causes'}]-(mola)-[:REL {r: 'part_of'}]->(disj),4
(drgb)<-[:REL {r: 'subclass_of'}]-(mola)-[:REL {r: 'cure'}]->(mola)-[:REL {r: 'expands'}]->(genee)-[:REL {r: 'interacts_with'}]->(mola),4
 
Thanks for your help.
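
For the sample network above, one possible way to return a few random paths per length is to collect the paths by length and draw from each list with apoc.coll.randomItems. This is only a sketch under assumptions: it assumes APOC is installed, it uses id(source) < id(target) as a stand-in for the source<target condition so that each node pair is considered once, and it collects every path of a given length into memory, which may not scale to the full graph.

```cypher
// Sketch: draw up to 3 random paths for each path length.
MATCH (source:Node), (target:Node)
WHERE id(source) < id(target) // assumed replacement for source<target
CALL apoc.algo.allSimplePaths(source, target, '', 4) YIELD path
// One list of paths per distinct length.
WITH length(path) AS len, collect(path) AS paths
// Pick 3 items from each list without repeating a path.
UNWIND apoc.coll.randomItems(paths, 3) AS P
RETURN P, len
ORDER BY len
```

The second argument of apoc.coll.randomItems sets how many paths are drawn per length, so the same query shape would cover a larger sample such as 1000 per length.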

ameyasoft
Graph Maven

Thanks for sharing the info. The solution is not straightforward, and I am working on it. Hopefully by this weekend I can send you the first steps of a solution. Note that the path level 2 results contain the nodes from levels 1 and 2, and so on.