Find the nodes with the unique stLine and the lowest iNode

One of our databases reads through lines of text, creating a node for each line (a.text) and a node for each word. Occasionally, I need to go back and find just the root lines.

Example:
Text: My dog runs fast. iNode:400, stLine:1
Text:my, iNode:401, stLine:1
Text:dog, iNode:402, stLine:1
Text:runs,iNode:403, stLine:1
Text:fast, iNode:403, stLine:1
Text:., iNode:404, stLine:1
Text: He is very large., iNode:405, stLine:2
Text:He, iNode:406, stLine:2
etc

I just need
Text: My dog runs fast. iNode:400, stLine:1
Text: He is very large., iNode:405, stLine:2

The sentences themselves are connected, so I can't just find the nodes with no parents.
I'd be looking for each distinct stLine, but with the lowest iNode within it.

Any thoughts on a fast way to do that?

ameyasoft
Graph Maven
Assuming the parent node's iNode:xxx will be one number less than the iNode:yyy of a node with the first word, here is my attempt:

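Created two nodes:
MERGE (a:Text {text: "My dog runs fast. iNode:400, stLine:1"})
MERGE (b:Text {text: "My, iNode:401, stLine:1"})

// Get the node with the first word (this selects node b)
MATCH (a:Text)
WHERE a.text CONTAINS "My,"

// Split "My, iNode:401, stLine:1" on commas
WITH split(a.text, ',') AS s4
// Split " iNode:401" into its name and its number
WITH s4, split(s4[1], ':') AS s41
// The parent's iNode is assumed to be one less than the first word's
WITH s4, s41, toInteger(s41[1]) - 1 AS s42
// Rebuild the parent's "iNode:400" fragment
WITH s4, s41[0] + ':' + toString(s42) AS s43
WITH trim(s43) AS s431, trim(s4[2]) AS s421
// Find the node whose text contains both the parent iNode and the same stLine
MATCH (b:Text)
WHERE b.text CONTAINS s431 AND b.text CONTAINS s421
RETURN b.text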

Result:
"My dog runs fast. iNode:400, stLine:1"

Since a sentence is different than a word, you should be using a different label (or at least an additional label) on the sentence node, such as :Sentence or :Phrase. Since your nodes seem to have stLine to tell which sentence they are a part of, that should allow you, from any word node, to MATCH to the sentence node with the same stLine.

And if the sentences have the additional label, then finding the root sentences is as easy as a MATCH on that label.

Other alternatives include creating relationships between the nodes, but that may not be needed if the above is enough.
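As a rough sketch of what the label buys you, the two lookups described above might look like this. It assumes the root line nodes already carry an extra :Sentence label (a hypothetical name, as suggested above) and that stLine, iNode and text are stored as node properties under a :Text label.

```cypher
// Assumption: root line nodes carry the extra :Sentence label.
// All root sentences:
MATCH (s:Sentence)
RETURN s.stLine AS stLine, s.text AS text
ORDER BY stLine;

// From any word node to its sentence
// (402 is the "dog" node in the example data):
MATCH (w:Text {iNode: 402})
MATCH (s:Sentence {stLine: w.stLine})
RETURN s;
```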

Yeah, I see what you are doing here. That's going to work, but it's nontrivial. It does bring up an interesting approach, though. Is there a clever way to get the length and select the longest?

If I grouped them by lnStart (getting me all of the stLine in a set) and then selected the longest of the texts, would that do what you were doing? Is there any way to make that work?

Thanks

Leveraging Neo4j's labels (as well as the ability for nodes to be multi-labeled) is one of the best and fastest ways to ensure you can quickly get nodes that are categorically different than other nodes.

But if you are unable to change the modeling, then yes, you can MATCH to your nodes, and either sort by the longest text value, or the lowest iNode value.

MATCH (t:Text)
WITH t.stLine as stLine, size(t.text) as size, t
ORDER BY size DESC
WITH stLine, head(collect(t)) as sentenceNode
RETURN stLine, sentenceNode

or

MATCH (t:Text)
WITH t.stLine as stLine, t
ORDER BY t.iNode ASC
WITH stLine, head(collect(t)) as sentenceNode
RETURN stLine, sentenceNode
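
A possible variant of the second query, offered as a sketch rather than a measured improvement: aggregate with min() instead of sorting and collecting every node, then look the root node up by its (stLine, iNode) pair. The index name text_stline_inode is made up for illustration, and the query assumes iNode is numeric and unique within a line; if two nodes share the lowest iNode, both come back. It is worth comparing the plans with PROFILE on real data.

```cypher
// Optional composite index so the second MATCH can use an index seek.
CREATE INDEX text_stline_inode IF NOT EXISTS
FOR (t:Text) ON (t.stLine, t.iNode);

// Lowest iNode per stLine, then fetch the node that holds it.
MATCH (t:Text)
WITH t.stLine AS stLine, min(t.iNode) AS rootINode
MATCH (s:Text {stLine: stLine, iNode: rootINode})
RETURN stLine, s AS sentenceNode
ORDER BY stLine;
```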

Thank you - we did a few changes but that approach worked fine. And thank you Ameyasoft for triggering the response.

Labels on nodes have been one of the most annoying features in the database. We did do it that way to start, and over time we got rid of all but one or two. Yes, we may add this one by changing the load program to apply it, and using this match to update the ones already in place.

But thank you both! Now if I can get my APOC issue solved I'll be quiet for a while.
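
For the in-place update mentioned above, one way it could look is the lowest-iNode query with a SET added. This is only a sketch: the :Sentence label name is an assumption, and on a very large graph the update would likely need to be batched rather than run in a single transaction.

```cypher
// Pick the node with the lowest iNode per stLine and tag it.
// :Sentence is a hypothetical label name.
MATCH (t:Text)
WITH t.stLine AS stLine, t
ORDER BY t.iNode ASC
WITH stLine, head(collect(t)) AS sentenceNode
SET sentenceNode:Sentence
RETURN count(sentenceNode) AS labeled;
```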

Glad that helped!

If you wouldn't mind, could you add some detail as to why labels on nodes ended up giving you headaches? With a bit more info, we can probably provide some recommendations (or recommend against any antipatterns) to help you out. We wouldn't want you to miss out on capabilities or efficiencies that they enable, when used well.

Sure -

We are analyzing computer code. We take in the code and break it down into its component parts. So a 25M-line Java program will end up being about 300-400M nodes, and clients typically have about 80-100 applications, with at least 12 releases. Each node has about 10-15 attributes.

A statement like this

let i = 1

is 4 nodes:
Node 1: (text:'let', transform, line 1, col 1)
Node 2: (text:'i', primitive, integer, line 1, col 5)
Node 3: (text:'=', transform, primitive, line 1, col 7)
Node 4: (text:'1', primitive, const, line 1, col 9)

That's easy enough on simple statements, but when there is a call or method, it goes out to the symbol table and resolves it. So it's basic compiler logic, where each atom is assigned memory and an offset.

Each node is connected exactly as you would expect: the line node (text:"let i = 1") is the root, and the components are linked in a simple tree structure.
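
To make the shape concrete, here is a guess at how that one statement might be stored. Everything beyond the text, line and col values is an assumption for illustration: the tags list property, the :CHILD relationship type, and the flat one-level tree are not taken from the actual load program.

```cypher
// Label-free nodes, described by attributes only.
// tags and :CHILD are hypothetical names.
CREATE (root {text: 'let i = 1', line: 1})
CREATE (n1 {text: 'let', tags: ['transform'], line: 1, col: 1})
CREATE (n2 {text: 'i', tags: ['primitive', 'integer'], line: 1, col: 5})
CREATE (n3 {text: '=', tags: ['transform', 'primitive'], line: 1, col: 7})
CREATE (n4 {text: '1', tags: ['primitive', 'const'], line: 1, col: 9})
// The line node is the root; the components hang off it.
CREATE (root)-[:CHILD]->(n1),
       (root)-[:CHILD]->(n2),
       (root)-[:CHILD]->(n3),
       (root)-[:CHILD]->(n4);
```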


In this case, the dark blue nodes represent violations in coding practice. Those were found by either matching a string (e.g. RSA/NULL) or a pattern.

In this case, the dark blue nodes represent changes between releases of the program.

We do compare good vs bad practice. So we have an example of one done right, and one done badly. We use an APOC procedure that compares two "paths" and tells us what's different and in what order. We go looking for that.

This only works when the nodes are homogeneous or we ignore the node "label". Otherwise we found it to be too specific. Worse, because each language has unique verbs, the labels end up getting in the way (= in Python and = in COBOL are not the same). So we ended up creating everything with as few labels as possible. Units that are methods get a label, as do a few others (root nodes, etc.), but most of the heavy lifting is done through attributes.

So they don't have the same value to us as they do in most other Neo4j scenarios.

Thanks

This is a very interesting use case. Thanks for sharing this. I may find some use for it in the near future! Thanks again.

lol. You know far more about this than we do, so I doubt it's new! Thank you.

It's one of the reasons we smile every time someone suggests "oh just redesign....". Size makes that nearly impossible, and reloading data is no picnic. For better or worse, we need to stay minimalist and only change what absolutely must change.
Thanks for asking!
