Find same sequences of nodes in a graph with multiple chains of nodes (Sequential Pattern Mining algorithm)
‎10-28-2021 10:21 AM
Given a graph with several letters with every letter modelled as a chain of word nodes connected with NEXT-edges. Is there a way to find sequences of nodes of a given length covering the same text without setting any start node for the query ?
Bests
Andreas
- Labels:
-
Cypher
‎10-29-2021 01:11 AM
‎10-29-2021 01:51 AM
Yes, sounds good to me 😉
‎10-29-2021 01:59 AM
I'm also looking for one ...
I asked Neo4j if they were going to add any in GDS in the future but they told me it was not planned. I know there is a pattern mining library in R but I am looking for a Cypher solution. I haven't started to think about how to do it, but maybe we can start to find a Cypher solution here.
‎10-29-2021 02:08 AM
Yes, we can start here.
My first try was to query with fixes path length.
‎10-29-2021 02:13 AM
What is your current cypher query?
‎10-29-2021 05:47 AM
// 4 Gleiche
match (t1:Token)-[:NEXT_TOKEN]->
(t2:Token)-[:NEXT_TOKEN]->(t3:Token)-[:NEXT_TOKEN]->(t4:Token)
match (b1:Token)-[:NEXT_TOKEN]->
(b2:Token)-[:NEXT_TOKEN]->(b3:Token)-[:NEXT_TOKEN]->(b4:Token)
where t1.text = b1.text
and t2.text = b2.text
and t3.text = b3.text
and t4.text = b4.text
and t1.Briefnr <> b1.Briefnr
return t1.Briefnr, b2.Briefnr, b1.text, b2.text, b3.text, b4.text;
Runs forever ...
‎10-29-2021 09:30 AM
‎10-29-2021 12:01 PM
No i stopped it. Can you reach the DB ?
‎11-01-2021 10:49 AM
Works again now here: http://sandstormserv.mni.thm.de:7474/browser/
Login is the same 😉
‎10-29-2021 02:23 AM
I found a few topics which could be interested:
- How to find the common relationship pattern between two different graphs?
- Find most common paths between nodes having certain labels
- neo4j cypher how to find the most frequent pattern of nodes in a chain of word-nodes
- neo4j cypher query to get a sequence of nodes having same sequence id in the relationship property -...
I understand the logic but I wonder if it is optimize for big graphs
‎10-29-2021 02:37 AM
There was an approach by Stefan Armbruster but it wasn't merged to apoc: Commits · sarmbruster/neo4j-apoc-procedures · GitHub
‎10-29-2021 02:44 AM
I don't know if I'm ready to develop my own plugin but it will definitely be quicker in java than in full cypher.
‎10-29-2021 02:55 AM
Cypher might be to slow ... thats why apoc could be a good solution.
‎10-29-2021 02:56 AM
Hi all!
If you have a small portion of the data with some examples I may like helping with a Pregel Implementation.
Bennu
‎10-29-2021 03:04 AM
‎10-29-2021 04:34 AM
You can use this dataset from this source, it's a little social network dataset. Here is the cypher queries to create it:
CREATE (nAlice:User {id:'Alice', name:'Node Alice'})
,(nBridget:User {id:'Bridget', name:'Node Bridget'})
,(nCharles:User {id:'Charles', name:'Node Charles'})
,(nDoug:User {id:'Doug', name:'Node Doug'})
,(nMark:User {id:'Mark', name:'Node Mark'})
,(nNancy:User {id:'Nancy', name:'Node Nancy'})
,(nOliver:User {id:'Oliver', name:'Node Oliver'})
CREATE (nAlice)-[:FOLLOWS]->(nBridget)
,(nAlice)-[:FOLLOWS]->(nCharles)
,(nAlice)-[:FOLLOWS]->(nDoug)
,(nBridget)-[:FOLLOWS]->(nAlice)
,(nBridget)-[:FOLLOWS]->(nCharles)
,(nBridget)-[:FOLLOWS]->(nDoug)
,(nCharles)-[:FOLLOWS]->(nAlice)
,(nCharles)-[:FOLLOWS]->(nBridget)
,(nCharles)-[:FOLLOWS]->(nDoug)
,(nDoug)-[:FOLLOWS]->(nAlice)
,(nDoug)-[:FOLLOWS]->(nBridget)
,(nDoug)-[:FOLLOWS]->(nCharles)
,(nMark)-[:FOLLOWS]->(nNancy)
,(nMark)-[:FOLLOWS]->(nOliver)
,(nNancy)-[:FOLLOWS]->(nMark)
,(nNancy)-[:FOLLOWS]->(nOliver)
,(nOliver)-[:FOLLOWS]->(nMark)
,(nOliver)-[:FOLLOWS]->(nNancy)
,(nCharles)-[:FOLLOWS]->(nMark)
,(nMark)-[:FOLLOWS]->(nCharles)
CREATE (msg1:Message {id:'1', contents:'This is a message', day_sent: 'Monday'})
,(msg2:Message {id:'2', contents:'This is a message', day_sent: 'Monday'})
,(msg3:Message {id:'3', contents:'This is a message', day_sent: 'Monday'})
,(msg4:Message {id:'4', contents:'This is a message', day_sent: 'Monday'})
,(msg5:Message {id:'5', contents:'This is a message', day_sent: 'Monday'})
,(msg6:Message {id:'6', contents:'This is a message', day_sent: 'Monday'})
,(msg7:Message {id:'7', contents:'This is a message', day_sent: 'Monday'})
,(msg8:Message {id:'8', contents:'This is a message', day_sent: 'Monday'})
,(msg9:Message {id:'9', contents:'This is a message', day_sent: 'Tuesday'})
,(msg10:Message {id:'10', contents:'This is a message', day_sent: 'Tuesday'})
,(msg11:Message {id:'11', contents:'This is a message', day_sent: 'Tuesday'})
,(msg12:Message {id:'12', contents:'This is a message', day_sent: 'Tuesday'})
,(msg13:Message {id:'13', contents:'This is a message', day_sent: 'Tuesday'})
,(msg14:Message {id:'14', contents:'This is a message', day_sent: 'Monday'})
,(msg15:Message {id:'15', contents:'This is a message', day_sent: 'Thursday'})
,(msg16:Message {id:'16', contents:'This is a message', day_sent: 'Thursday'})
,(msg17:Message {id:'17', contents:'This is a message', day_sent: 'Thursday'})
,(msg18:Message {id:'18', contents:'This is a message', day_sent: 'Thursday'})
,(msg19:Message {id:'19', contents:'This is a message', day_sent: 'Thursday'})
,(msg20:Message {id:'20', contents:'Conversation Starter', day_sent: 'Friday'})
,(msg21:Message {id:'21', contents:'Conversation Starter', day_sent: 'Friday'})
,(msg22:Message {id:'22', contents:'Conversation Starter', day_sent: 'Friday'})
,(msg23:Message {id:'23', contents:'Conversation Starter', day_sent: 'Friday'})
,(msg24:Message {id:'24', contents:'Reply', day_sent: 'Friday'})
,(msg25:Message {id:'25', contents:'Reply', day_sent: 'Friday'})
,(msg26:Message {id:'26', contents:'Reply', day_sent: 'Friday'})
,(msg27:Message {id:'27', contents:'Reply', day_sent: 'Friday'})
,(msg28:Message {id:'28', contents:'Reply', day_sent: 'Friday'})
,(msg29:Message {id:'29', contents:'Reply', day_sent: 'Friday'})
,(msg30:Message {id:'30', contents:'Reply', day_sent: 'Friday'})
,(msg31:Message {id:'31', contents:'Reply', day_sent: 'Friday'})
,(msg32:Message {id:'32', contents:'Reply', day_sent: 'Friday'})
,(msg33:Message {id:'33', contents:'Reply', day_sent: 'Friday'})
,(msg34:Message {id:'34', contents:'Reply', day_sent: 'Friday'})
,(msg35:Message {id:'35', contents:'Conversation Starter', day_sent: 'Saturday'})
,(msg36:Message {id:'36', contents:'Conversation Starter', day_sent: 'Saturday'})
,(msg37:Message {id:'37', contents:'Reply', day_sent: 'Saturday'})
,(msg38:Message {id:'38', contents:'Reply', day_sent: 'Saturday'})
,(msg39:Message {id:'39', contents:'Reply', day_sent: 'Saturday'})
,(msg40:Message {id:'40', contents:'Reply', day_sent: 'Saturday'})
,(msg41:Message {id:'41', contents:'Reply', day_sent: 'Saturday'})
,(msg42:Message {id:'42', contents:'Reply', day_sent: 'Saturday'})
,(msg43:Message {id:'43', contents:'Reply', day_sent: 'Saturday'})
,(msg44:Message {id:'44', contents:'Reply', day_sent: 'Saturday'})
,(msg45:Message {id:'45', contents:'Reply', day_sent: 'Saturday'})
,(msg46:Message {id:'46', contents:'Reply', day_sent: 'Saturday'})
,(msg47:Message {id:'47', contents:'Reply', day_sent: 'Saturday'})
,(msg48:Message {id:'48', contents:'Reply', day_sent: 'Saturday'})
,(msg49:Message {id:'49', contents:'Reply', day_sent: 'Sunday'})
,(msg50:Message {id:'50', contents:'Reply', day_sent: 'Sunday'})
,(msg51:Message {id:'51', contents:'Reply', day_sent: 'Sunday'})
,(msg52:Message {id:'52', contents:'Reply', day_sent: 'Sunday'})
,(msg53:Message {id:'53', contents:'Reply', day_sent: 'Sunday'})
,(msg54:Message {id:'54', contents:'Conversation Starter', day_sent: 'Sunday'})
,(msg55:Message {id:'55', contents:'Conversation Starter', day_sent: 'Sunday'})
,(msg56:Message {id:'56', contents:'Conversation Starter', day_sent: 'Sunday'})
,(msg57:Message {id:'57', contents:'Reply', day_sent: 'Sunday'})
,(msg58:Message {id:'58', contents:'Reply', day_sent: 'Sunday'})
,(msg59:Message {id:'59', contents:'Reply', day_sent: 'Sunday'})
,(msg60:Message {id:'60', contents:'Reply', day_sent: 'Sunday'})
,(msg61:Message {id:'61', contents:'Reply', day_sent: 'Sunday'})
,(msg62:Message {id:'62', contents:'Reply', day_sent: 'Sunday'})
,(msg63:Message {id:'63', contents:'Reply', day_sent: 'Sunday'})
,(msg64:Message {id:'64', contents:'Reply', day_sent: 'Sunday'})
,(msg65:Message {id:'65', contents:'Reply', day_sent: 'Sunday'})
,(msg66:Message {id:'66', contents:'Reply', day_sent: 'Sunday'})
,(msg67:Message {id:'67', contents:'Reply', day_sent: 'Sunday'})
,(msg68:Message {id:'68', contents:'Reply', day_sent: 'Sunday'})
,(msg69:Message {id:'69', contents:'Reply', day_sent: 'Sunday'})
,(nAlice)-[:SENT]->(msg2)
,(nBridget)-[:SENT]->(msg3)
,(nBridget)-[:SENT]->(msg4)
,(nBridget)-[:SENT]->(msg5)
,(nCharles)-[:SENT]->(msg6)
,(nDoug)-[:SENT]->(msg7)
,(nDoug)-[:SENT]->(msg8)
,(nAlice)-[:SENT]->(msg9)
,(nBridget)-[:SENT]->(msg10)
,(nCharles)-[:SENT]->(msg11)
,(nDoug)-[:SENT]->(msg12)
,(nDoug)-[:SENT]->(msg13)
,(nAlice)-[:SENT]->(msg14)
,(nAlice)-[:SENT]->(msg15)
,(nAlice)-[:SENT]->(msg16)
,(nAlice)-[:SENT]->(msg17)
,(nBridget)-[:SENT]->(msg18)
,(nCharles)-[:SENT]->(msg19)
,(nBridget)-[:SENT]->(msg20)
,(nBridget)-[:SENT]->(msg21)
,(nNancy)-[:SENT]->(msg22)
,(nMark)-[:SENT]->(msg23)
,(nAlice)-[:SENT]->(msg24)
,(nBridget)-[:SENT]->(msg25)
,(nAlice)-[:SENT]->(msg26)
,(nBridget)-[:SENT]->(msg27)
,(nAlice)-[:SENT]->(msg28)
,(nNancy)-[:SENT]->(msg29)
,(nMark)-[:SENT]->(msg30)
,(nAlice)-[:SENT]->(msg31)
,(nBridget)-[:SENT]->(msg32)
,(nBridget)-[:SENT]->(msg33)
,(nMark)-[:SENT]->(msg34)
,(nMark)-[:SENT]->(msg35)
,(nAlice)-[:SENT]->(msg36)
,(nAlice)-[:SENT]->(msg37)
,(nBridget)-[:SENT]->(msg38)
,(nMark)-[:SENT]->(msg39)
,(nMark)-[:SENT]->(msg40)
,(nBridget)-[:SENT]->(msg41)
,(nCharles)-[:SENT]->(msg42)
,(nBridget)-[:SENT]->(msg43)
,(nAlice )-[:SENT]->(msg44)
,(nCharles)-[:SENT]->(msg45)
,(nDoug)-[:SENT]->(msg46)
,(nDoug)-[:SENT]->(msg47)
,(nMark)-[:SENT]->(msg48)
,(nAlice)-[:SENT]->(msg49)
,(nMark)-[:SENT]->(msg50)
,(nAlice)-[:SENT]->(msg51)
,(nBridget)-[:SENT]->(msg52)
,(nAlice)-[:SENT]->(msg53)
,(nAlice)-[:SENT]->(msg54)
,(nAlice)-[:SENT]->(msg55)
,(nAlice)-[:SENT]->(msg56)
,(nCharles)-[:SENT]->(msg57)
,(nAlice)-[:SENT]->(msg58)
,(nCharles)-[:SENT]->(msg59)
,(nAlice)-[:SENT]->(msg60)
,(nCharles)-[:SENT]->(msg61)
,(nCharles)-[:SENT]->(msg62)
,(nBridget)-[:SENT]->(msg63)
,(nCharles)-[:SENT]->(msg64)
,(nMark)-[:SENT]->(msg65)
,(nMark)-[:SENT]->(msg66)
,(nCharles)-[:SENT]->(msg67)
,(nBridget)-[:SENT]->(msg68)
,(nCharles)-[:SENT]->(msg69)
CREATE (msg5)-[:FORWARD]->(msg2)
,(msg6)-[:FORWARD]->(msg2)
,(msg7)-[:FORWARD]->(msg3)
,(msg8)-[:FORWARD]->(msg1)
,(msg11)-[:FORWARD]->(msg10)
,(msg12)-[:FORWARD]->(msg10)
,(msg13)-[:FORWARD]->(msg11)
,(msg14)-[:FORWARD]->(msg3)
,(msg15)-[:FORWARD]->(msg4)
,(msg16)-[:FORWARD]->(msg5)
,(msg17)-[:FORWARD]->(msg10)
,(msg18)-[:FORWARD]->(msg6)
,(msg46)-[:FORWARD]->(msg39)
,(msg47)-[:FORWARD]->(msg40)
CREATE (msg24)-[:REPLY_TO]->(msg20)
,(msg25)-[:REPLY_TO]->(msg24)
,(msg26)-[:REPLY_TO]->(msg25)
,(msg27)-[:REPLY_TO]->(msg26)
,(msg28)-[:REPLY_TO]->(msg27)
,(msg29)-[:REPLY_TO]->(msg21)
,(msg30)-[:REPLY_TO]->(msg21)
,(msg31)-[:REPLY_TO]->(msg21)
,(msg32)-[:REPLY_TO]->(msg28)
,(msg33)-[:REPLY_TO]->(msg30)
,(msg34)-[:REPLY_TO]->(msg33)
,(msg37)-[:REPLY_TO]->(msg35)
,(msg38)-[:REPLY_TO]->(msg35)
,(msg39)-[:REPLY_TO]->(msg37)
,(msg40)-[:REPLY_TO]->(msg38)
,(msg41)-[:REPLY_TO]->(msg38)
,(msg42)-[:REPLY_TO]->(msg37)
,(msg43)-[:REPLY_TO]->(msg37)
,(msg44)-[:REPLY_TO]->(msg43)
,(msg45)-[:REPLY_TO]->(msg42)
,(msg48)-[:REPLY_TO]->(msg36)
,(msg49)-[:REPLY_TO]->(msg48)
,(msg50)-[:REPLY_TO]->(msg49)
,(msg51)-[:REPLY_TO]->(msg41)
,(msg52)-[:REPLY_TO]->(msg50)
,(msg53)-[:REPLY_TO]->(msg52)
,(msg57)-[:REPLY_TO]->(msg54)
,(msg58)-[:REPLY_TO]->(msg57)
,(msg59)-[:REPLY_TO]->(msg58)
,(msg60)-[:REPLY_TO]->(msg59)
,(msg61)-[:REPLY_TO]->(msg60)
,(msg62)-[:REPLY_TO]->(msg55)
,(msg63)-[:REPLY_TO]->(msg62)
,(msg64)-[:REPLY_TO]->(msg63)
,(msg65)-[:REPLY_TO]->(msg64)
,(msg66)-[:REPLY_TO]->(msg56)
,(msg67)-[:REPLY_TO]->(msg66)
,(msg68)-[:REPLY_TO]->(msg67)
,(msg69)-[:REPLY_TO]->(msg68)
I think there are three output possibilities:
- most frequent node sequence
- most frequent sequence of relationships
- most frequent path sequence (nodes & relations)
‎10-29-2021 05:21 AM
I have a running graphDB here: http://jlu-buster.mni.thm.de:9580/browser/
Login is neo4j/1234
There are chains of Token-Nodes which show the Words of letters.
The task is now to show sequences in these chains of the same pattern in the property text.
‎10-30-2021 05:44 AM
Hi Andreas,
I suggest to first prepare the data properly so that it fully can benefit from the power of the graph. Actually you do mainly text searches.
My suggestions are:
1.) There are 13840 Token with text equals " " - if not needed, delete them
2.) All text properties I have seen, ends with a space - if not needed, strip those spaces
3.) There was no index on the text property, I created one using CREATE INDEX ON :Token(text)
4.) The property BriefNr that is used in a condition in your cypher doesn't exist in the Token nodes
5.) There are many duplicates texts (15611 unique texts, 11319 multiple used texts blown up to 145427 nodes). I would suggest not using duplicate Tokens but just one 1 unique Token per "word". Then you could relate those unique Tokens with relations storing the source of the relation (BriefNr, startIndex, endIndex) on the relation.
With data structured like this, you could easily run a query that shows similar paths in seconds, because similar senctences will traverse exactly the same nodes in the path. Also you will not longer rely on slow heavy text searches.
Best,
Reiner
‎10-30-2021 06:47 AM
Ok, i changed the Graph accordingly. Now every Token has a unique value-Property.
Does this work for your query ?
Best,
Andreas
‎10-30-2021 07:30 AM
There was a Token with value "" that would interfere the results. Just tried to delete it with
match (t:Token) where coalesce(t.value,"") = "" detach delete t
but either that crashed the db or someone stopped it.
My approach then would have been something like this, but I couldn't test it anymore
MATCH (t1:Token)-[r1:NEXT]->(t2:Token)-[r2:NEXT]->(t3:Token)-[r3:NEXT]->(t4:Token)
where r1.Briefnr = r2.Briefnr = r3.Briefnr
with t1.value as v1, t2.value as v2, t3.value as v3, t4.value as v4, collect(r1.Briefnr) as Briefe
return v1, v2, v3, v4, Briefe
order by v1, v2, v3, v4
Best,
Reiner
‎10-30-2021 08:14 AM
A more graph-like approach would be something like this:
MATCH path = ((n:Token)-[:NEXT*4]-(m:Token))
WHERE [x IN relationships(path) | x.Briefnr] = head(relationships(path)).Briefnr
WITH nodes(path) as n, head(relationships(path)).Briefnr as Brief
WITH n, COLLECT(Brief) as Briefe
WITH [x in n | x.value] as x, Briefe
RETURN *
``
‎10-31-2021 12:26 PM
I transfered the db to another server:
http://sandstormserv.mni.thm.de:7474/browser/
login is the same. Maybe you can test it here ?
Best
Andreas
‎11-01-2021 10:29 AM
The DB now contains only the Text and the Token Nodes.
‎11-02-2021 01:25 AM
Hi Andreas,
I just quickly looked into your sample data and discovered, that you have multiple NEXT relationship which are redundant. I suggest to distill them down to just one.
‎11-02-2021 03:16 AM
Its not a bug, its a feature. Every edge has a property with the letternumber to get the textual contexts.
‎11-02-2021 03:27 AM
The following cypher working well for the first 7 letters (limit 7), but when the eights is included, this results in 7 million additional paths that will take too long to come a result.
Also that node with "" is still included.
match (:Token)-[r:NEXT]->(:Token)
with distinct r.Briefnr as Brief
limit 7
MATCH path = ((n:Token)-[:NEXT*3 {Briefnr:Brief}]-(m:Token))
with Brief, path
WITH distinct nodes(path) as n, Brief
WITH n, COLLECT(Brief) as Briefe
WHERE size(Briefe) > 1
WITH [x in n | x.value] as x, Briefe
RETURN x, Briefe
I'm sorry not beeing able to invest more time here.
Best,
Reiner
‎11-02-2021 03:29 AM
The "" seems to be a linebreak in our data and we have to adress this. Thanks for this starting point !
‎11-02-2021 04:15 AM
Ok. But you don't need a relationship for every context. You could have one relationship with an array of contexts. The sheer number of relationships leads to a combinatorial explosion with the effect that Reiner's cypher query times out.
When aggregating the context in one relationship and with a BFS-style query you should be able to achieve what you want.
‎11-02-2021 05:10 AM
Ok. With this query i create the tokens:
MATCH (t:Text)
WITH t.Value as sentence, t.Name as b
WITH split(sentence," ") as words, b
FOREACH ( idx IN range(0,size(words)-2) |
MERGE (w1:Token {value:trim(words[idx])})
MERGE (w2:Token {value:trim(words[idx+1])})
CREATE (w1)-[r:NEXT {Briefnr:b}]->(w2));
I changed it this way:
// Property as Array with one Relation
MATCH (t:Text)
WITH t.Value as sentence, t.Name as b
WITH split(sentence," ") as words, b
FOREACH ( idx IN range(0,size(words)-2) |
MERGE (w1:Token {value:trim(words[idx])})
MERGE (w2:Token {value:trim(words[idx+1])})
MERGE (w1)-[r:NEXT]->(w2)
ON CREATE SET r.list = b
ON MATCH SET r.list = r.list + ", " + b);
Now we have a list of letter numbers on the relationships. But how can i find common chains of Tokens of length 5 of two letters ?
‎11-05-2021 03:48 AM
My colleague @liebig figured out that with the above modell its not possible to recreate the letters texts from the graph as outgoing NEXT-edges could be ambigue. We will now try to create an explicit modell of the text with one node for every Token from a letter.