Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.
03-15-2020 02:59 AM
Hello Cypher professionals,
I found myself in a situation where there's a discrepancy between the number of nodes and relationships coming out of these nodes. According to my business logic, the following should always be true:
Every node labeled 'Episode' should have exactly one outgoing relationship ':EPISODE_OF'.
Yet, right now, #'Episode' < #':EPISODE_OF' which makes me think that some episodes have several of these outgoing relationships. I know this because by checking node types on both ends of the path, I get the expected values. My goal is to identify these exceptional 'Episode' nodes.
I need help constructing a query that I would be able to run on a modest machine but a large graph (think ~25M relationships) (without killing it) that would return all distinct nodes which have more than one outgoing :EPISODE_OF relationship.
So, only return (e:Episode)-[r:EPISODE_OF]->(p:Podcast) if e has more than one r. Notice that, at the other end of this path, p can be the same node or a bunch of distinct nodes of the same type. I'm interested in both cases.
Both native and APOC-based solutions are acceptable.
Here's what I tried (and killed the server):
MATCH (e:Episode)-[:EPISODE_OF]->(p:Podcast)
WITH e,COUNT(p) as rels, collect(p) as podcasts
WHERE rels > 1
RETURN e,podcasts, rels
Thank you in advance.
Solved! Go to Solution.
03-16-2020 01:51 AM
Friends,
I got it!
For those looking for something similar (or for future me), do the following:
To find duplicate links to distinct nodes
MATCH (p1:Podcast)<-[:EPISODE_OF]-(e)-[:EPISODE_OF]->(p2:Podcast)
WHERE p1.podcastId <> p2.podcastId
RETURN e LIMIT 10
To find duplicate links between pairs of nodes
MATCH (e:Episode)-[r1:EPISODE_OF]->(p:Podcast)
WITH e,r1,p
MATCH (e)-[r2:EPISODE_OF]->(p)
WHERE ID(r2) <> ID(r1)
RETURN e,r1,r2,p
Now for the subsequent question:
Is there a way to set up a DB constraint that will prevent something like this from happening? A uniqueness constraing on a path of some sort? So far I was only working with UUID or String uniqueness but can it be more sophisticated than this?
03-15-2020 11:05 AM
Try this.
MATCH (e:Episode)-[r:EPISODE_OF]->(p:Podcast)
WITH e, p, collect(r) as rels
WHERE size(rels) > 1
RETURN e, p
03-15-2020 11:42 AM
You might put the collect step after your WHERE clause. It would probably reduce the size of your collection by quite a bit.
03-15-2020 10:29 PM
Thanks for chipping in. I highly appreciate your help. So, I tried both approaches and both of them yield the same result: query running indefinitely until the server dies. Here's the EXPLAIN result which happens to be identical for both approaches:
Is there a way to do some sort of periodic commit equivalent or something?
03-15-2020 11:26 PM
Change the return statement to RETURN e, p limit 20 and see if you are getting the desired output.
03-15-2020 11:41 PM
No luck:
Neo.TransientError.General.OutOfMemoryError
There is not enough memory to perform the current task. Please try increasing 'dbms.memory.heap.max_size' in the neo4j configuration (normally in 'conf/neo4j.conf' or, if you you are using Neo4j Desktop, found through the user interface) or if you are running an embedded installation increase the heap by using '-Xmx' command line flag, and then restart the database.
03-15-2020 11:49 PM
Could you please try give a distinct Episode and test the result .
if it work then try to check if you do not have any loop in the graph
03-15-2020 11:59 PM
Tried this:
MATCH (e:Episode)-[:EPISODE_OF]->(p:Podcast)
WITH DISTINCT e, COUNT(p) as rels
WHERE rels > 1
RETURN DISTINCT e LIMIT 10
and this:
MATCH (e:Episode)-[r:EPISODE_OF]->(p:Podcast)
WITH DISTINCT e, p, collect(r) as rels
WHERE size(rels) > 1
RETURN DISTINCT e, p LIMIT 10
To no avail... Still running indefinitely until I run out of memory.
03-16-2020 12:54 AM
What I meant distinct Episode is
MATCH (e:Episode{name:"some specific value")-[r:EPISODE_OF]->(p:Podcast)
WITH e, collect(r) as rels
WHERE size(rels) > 1
RETURN DISTINCT e, rels
03-16-2020 01:11 AM
Oh, I see,
Yes, that works well but it always was.
I was trying to identify loops by doing the following:
MATCH p=(e1:Episode)-[*]->(e2:Episode)
WHERE e1.episodeId = e2.episodeId
RETURN NODES(p) as episodes
But the query didn't return any results. I guess, I can safely assume there are no loops involving Episodes as the query is relatively broad relationship-wise.
Here's another test I did to determine if there are any duplicates between e and p:
MATCH (e:Episode)-[r1:EPISODE_OF]->(p:Podcast)
WITH e,r1,p
MATCH (e)-[r2:EPISODE_OF]->(p)
WHERE ID(r2) <> ID(r1)
RETURN e,r1,r2,p
This query returned no results. I assume it means no pairs Episode+Podcast have duplicate EPISODE_OF relationships.
03-16-2020 01:51 AM
Friends,
I got it!
For those looking for something similar (or for future me), do the following:
To find duplicate links to distinct nodes
MATCH (p1:Podcast)<-[:EPISODE_OF]-(e)-[:EPISODE_OF]->(p2:Podcast)
WHERE p1.podcastId <> p2.podcastId
RETURN e LIMIT 10
To find duplicate links between pairs of nodes
MATCH (e:Episode)-[r1:EPISODE_OF]->(p:Podcast)
WITH e,r1,p
MATCH (e)-[r2:EPISODE_OF]->(p)
WHERE ID(r2) <> ID(r1)
RETURN e,r1,r2,p
Now for the subsequent question:
Is there a way to set up a DB constraint that will prevent something like this from happening? A uniqueness constraing on a path of some sort? So far I was only working with UUID or String uniqueness but can it be more sophisticated than this?
03-16-2020 04:14 AM
There's currently no way to create a uniqueness constraint on relationships. If you're interacting with your instance via one of the officially supported drivers you could add some of your own code to perform checks that would probably be able to prevent this from happening.
03-16-2020 05:30 AM
Got it. Makes sense.
All the sessions of the conference are now available online