apoc.path.subgraphAll Filter Issue

akira
Node Link

Greetings,

I am trying to obtain all paths and relationships of a specific node.

I currently run the following:

MATCH  (n)
WHERE id(n) = 2607 AND 1665263599000 <= apoc.date.fromISO8601(toString(n.ns0__atTime[0])) <= 1665273599000
CALL apoc.path.subgraphAll(n, {
    relationshipFilter:'ns0__has'
})
YIELD nodes, relationships
RETURN nodes, relationships
 
The above does net me the results I desire; however, the run time is very long.
 
It appears that the filter I provide before the call of apoc.path.subgraphAll does not apply to the call.
It seems to use all of the database to find the result.
I know my results are on day 1 of the 10 days of accumulated data, so I try to filter the dataset to be on the time window of day 1 (400K nodes) only.
But, apoc.path.subgraphAll still goes through all the 10 days of nodes (2.5 million nodes) instead.
 
Many thanks for your time and any help is much appreciated 
 
1 ACCEPTED SOLUTION

You can add node label filters to the configuration, so only nodes with those labels are traversed. As such, you can label all the nodes that meet your time constraint with a new label and add the label to the subgraphAll method's configuration so only those nodes are traversed. 

You can set the labels with the following query.

match(n)
where 1665263599000 <= apoc.date.fromISO8601(toString(n.ns0__atTime[0])) <= 1665273599000
set n:TempLabel

You can remove the label once you are done with your query:

match(n:TempLabel)
remove n:TempLabel

Then, while the label is still in place, you can execute the following. It should give you paths that only contain nodes with the TempLabel.

MATCH  (n)
WHERE id(n) = 2607
CALL apoc.path.subgraphAll(n, {
    relationshipFilter:'ns0__has',
    labelFilter: "+TempLabel"
})
YIELD nodes, relationships
RETURN nodes, relationships

Let's see how this works.
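
To check whether the label filter actually shrinks the traversal, one option is to return only the sizes of the two lists, as suggested elsewhere in this thread. This is a minimal sketch, assuming TempLabel has already been set and not yet removed; the aliases nodeCount and relCount are illustrative names, not part of the original query.

```cypher
// Same traversal as above, but return only the list sizes so the
// effect of the label filter can be compared with the unfiltered run.
MATCH (n)
WHERE id(n) = 2607
CALL apoc.path.subgraphAll(n, {
    relationshipFilter: 'ns0__has',
    labelFilter: '+TempLabel'
})
YIELD nodes, relationships
RETURN size(nodes) AS nodeCount, size(relationships) AS relCount
```

Running the same query without the labelFilter entry gives the baseline to compare against.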

12 REPLIES

That doesn’t seem to be right, as your match predicate specifies the id of a single node. As such, your match should result in either one node or zero nodes (if the date condition is not met for the specified node). The query should return either zero rows or one row consisting of two lists. What result are you getting?

The result I'm getting is the result of apoc.path.subgraphAll for that single start node id = 2607, so all of the paths and relationships, going through all 2.5M rows.

Are you saying that the single node is connected to every node in your graph?  What is the result if you ‘return size(nodes), size(relationships)’ instead?

The match statement finds the one node that is then used as the starting point for the apoc.path method. Did you think the ‘where’ condition would apply to the nodes traversed in the path finder algorithm?  It does not apply to them.

Hello again glilienfield

The return of the sizes is 

╒═════════════╤═════════════════════╕
│"size(nodes)"│"size(relationships)"│
╞═════════════╪═════════════════════╡
│75895        │231033               │
└─────────────┴─────────────────────┘

 This is typical and I know that the path finder is going through all nodes in the database.

I wanted the path finder algorithm, apoc.path.subgraphAll (or another possible solution), to net me the paths of nodes and relationships of particular nodes within a specific time frame.

And so my WHERE filter does not seem to be applying to the call function.

The ‘where’ clause is not applied to the nodes during traversal by apoc.path.subgraphAll(). I don’t know of a mechanism to do so directly with this method. It’s not designed to check node properties during traversal. It is configurable to specify labels to include and labels to exclude, and you can use this to achieve the result you want. You could execute a query that adds a new label to all nodes that match your time frame, then configure the subgraphAll() method to traverse only nodes with that label. You can remove the new label once you are done using it.

Greetings, glilienfield

What do you mean when you say that I can configure labels to include and exclude?

And how do I add a new label to nodes that I want within my time frame and then include that match into my path finder?

Thank you very much for your help

You mentioned you had millions of nodes.  You may want to batch the updates. Are you running this in Neo4j Browser or using a driver?  You need the ":auto" prefix if using Neo4j Browser, but not with a driver. With a driver, you will need to use 'session.run' instead of a transaction write function.

:auto match(n)
where 1665263599000 <= apoc.date.fromISO8601(toString(n.ns0__atTime[0])) <= 1665273599000
call {
    with n
    set n:TempLabel
} in transactions of 10000 rows
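
Before running the traversal, it may be worth confirming how many nodes received the label. This is a minimal sketch; the alias labelledNodes is an illustrative name.

```cypher
// Count the nodes that currently carry the temporary label.
MATCH (n:TempLabel)
RETURN count(n) AS labelledNodes
```

If the count is far from what the time window should contain, the WHERE clause is worth rechecking before traversing.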

 

:auto match(n:TempLabel)
call {
    with n
    remove n:TempLabel
} in transactions of 10000 rows

  

Thank you so much, glilienfield

When you mentioned labels and traversing only the specially labelled nodes, I saw that apoc.path.subgraphAll also accepts a whitelist, so I stored all the pooled nodes in that whitelist.

It reduced the overall run time by nearly 20x in my use case.
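
The poster did not share the final query, so the following is only a sketch of what a whitelist-based variant might look like. It assumes the option meant here is whitelistNodes from the APOC path expander configuration (later APOC releases name it allowlistNodes), and the variable names m and inWindow are illustrative.

```cypher
// Collect the nodes inside the time window once, up front.
MATCH (m)
WHERE 1665263599000 <= apoc.date.fromISO8601(toString(m.ns0__atTime[0])) <= 1665273599000
WITH collect(m) AS inWindow
// Then traverse from the start node, allowing only the collected nodes.
// Assumption: whitelistNodes is the config key available in this APOC version.
MATCH (n)
WHERE id(n) = 2607
CALL apoc.path.subgraphAll(n, {
    relationshipFilter: 'ns0__has',
    whitelistNodes: inWindow
})
YIELD nodes, relationships
RETURN nodes, relationships
```

Compared with the TempLabel approach, this avoids writing to the graph, at the cost of holding the collected node list in memory for the duration of the query.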

Thank you again greatly.

 

akira
Node Link

The strange thing is that, if the filter is the following, using AND 

 


WHERE id(n) = 2607 AND 1665263599000 <= apoc.date.fromISO8601(toString(n.ns0__atTime[0])) <= 1665273599000

 

or using OR

 

WHERE id(n) = 2607 OR 1665263599000 <= apoc.date.fromISO8601(toString(n.ns0__atTime[0])) <= 1665273599000

 

The OR results in the full paths and relationships, even though EXPLAIN gives the AND filter 0 estimated rows and the OR filter 1,871,702 estimated rows 

However, AND has its NodeByIdSeek@neo4j at 0 estimated rows, and the OR has its AllNodesScan@neo4j at 2,495,603 estimated rows.

Yet, when the CALL function is running, it will result in all the paths, completely ignoring the filter for the OR filter phase, and giving no results for the AND filter.

I understand that it should give me zero results for the AND filter, however that means it should also be using the OR filter, but it includes nodes outside of the time window which is not making any sense to me. 

I think the AND has no results because your node with id = 2607 does not meet the time constraint, so the match produces no rows and the query stops. On the other hand, when using OR, you get the one node with id = 2607 plus all nodes that meet the time constraint.  That explains the node scan.  You will get multiple rows from the match, and each value of ‘n’ will be passed to the apoc path method.  You should get multiple rows as the result.
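
One way to test this explanation is to look at the start node on its own. This is a minimal sketch, assuming the start node from the original query (id 2607) has an ns0__atTime list; the aliases atTime and inWindow are illustrative names.

```cypher
// Show the start node's timestamp and whether it falls inside the
// time window used in the thread.
MATCH (n)
WHERE id(n) = 2607
RETURN n.ns0__atTime[0] AS atTime,
       1665263599000 <= apoc.date.fromISO8601(toString(n.ns0__atTime[0])) <= 1665273599000 AS inWindow
```

If inWindow comes back false, the AND version of the filter has no start node to pass to apoc.path.subgraphAll, which would match the empty result.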

Greetings, glilienfield

Indeed, the AND does result in no rows; however, the OR filter results in many more rows than expected.

When I run an EXPLAIN with just the OR filter with return count(n), I get around 400 nodes.

However, when I run EXPLAIN with the OR filter with the CALL apoc, it nets me the entire 2.5M node database which is what is confusing me.