Complex filter with big graph very slow

Hello,
I work with large graphs and need to be able to filter over them.

My data consists of Domain nodes that point to Subdomain nodes, which point to IPv4 nodes, then to ports, then to services, and so on.

--

The goal is, starting from the domain, to keep only the branches of the graph that contain, for example, IPv4 1.1.1.1 and port 80.

--

Today, the solution we use is:

MATCH (e:Entity)-[:HAS_ASSET *0..]->(a:Asset:Ipv4)
WHERE id(e)=**
AND EXISTS { MATCH (b0)-[:HAS_ASSET *0..]->(a)-[:HAS_ASSET *0..]->(c0)
WHERE (b0:Service AND (b0.value='220 microsoft ftp shttpervice'))
OR (c0:Service AND (c0.value='http')) }

OPTIONAL MATCH (b:Asset)-[:HAS_ASSET *0..]->(a:Asset)

OPTIONAL MATCH (a:Asset)-[:HAS_ASSET *0..]->(c:Asset)
...

The first MATCH finds the IPv4 nodes from which we search the branches. The OPTIONAL MATCH clauses then walk up and down from each of them to get the whole branch back.
These queries are generated by our API, hence the need to search both before and after the IPv4 for services, even though we know that services always come after the IPv4. (We have several hundred different types, so special-casing is not possible.)

--
Two questions:

  • Is this the right way to do it?

  • If so, the problem is that the query is slow over large perimeters. Take a Domain pointing to, say, 10,000 subdomains that all point to the same IP. The second match, inside the EXISTS:

MATCH (b0)-[:HAS_ASSET *0..]->(a)-[:HAS_ASSET *0..]->(c0)
WHERE (b0:Service AND (b0.value='220 microsoft ftp shttpervice'))
OR (c0:Service AND (c0.value='http')) 

is evaluated once for every a that was found, that is, once per subdomain pointing to it, even though they are all the same node.
The workaround is to split the two matches, like this:

MATCH (e:Entity)-[:HAS_ASSET *0..3]->(a:Asset:Ipv4)
WHERE id(e)=**
WITH DISTINCT a AS a

MATCH (a) WHERE
EXISTS { MATCH (b0)-[:HAS_ASSET *0..]->(a)-[:HAS_ASSET *0..]->(c0)
WHERE (b0:Service AND (b0.value='220 microsoft ftp shttpervice'))
OR (c0:Service AND (c0.value='http')) }

It's faster, but very dirty. Do you have any advice?
Thanks,
Gautier

1 REPLY

My recommendation is to work out your query step by step with PROFILE.

It looks as if you're touching a lot of data.
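
As a rough sketch of that first step, one could profile just the initial traversal and compare the raw row count with the number of distinct IPv4 nodes. The parameter $entityId stands in for the id elided in the question, and the *0..3 bound is borrowed from the asker's second query; both are assumptions here.

```cypher
// Profile only the first traversal to see how many rows it produces.
// $entityId is a hypothetical parameter replacing the elided id.
PROFILE
MATCH (e:Entity)-[:HAS_ASSET*0..3]->(a:Asset:Ipv4)
WHERE id(e) = $entityId
RETURN count(a) AS matchedRows, count(DISTINCT a) AS distinctIpv4
```

If matchedRows is much larger than distinctIpv4, every later clause repeats the same work once per duplicate row.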

You might also be better off using NOT shortestPath() IS NULL in some places.

All your unbounded paths can go really deep,
and you might want to use WITH DISTINCT a to reduce the number of intermediate results.

I would also split the OR check into two:

WHERE NOT shortestPath( 
  (b0:Service {value:'220 microsoft ftp shttpervice'})-[:HAS_ASSET *0..]->(a)) IS NULL OR 
 NOT shortestPath( (a)-[:HAS_ASSET *0..]->(c0:Service {value:'http'})) IS NULL
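
Putting the deduplication and the split OR together, one possible shape for the whole filter is sketched below. It keeps the EXISTS { } subquery form from the question instead of shortestPath(), with the label and value checks written inline in each pattern. $entityId is a hypothetical parameter replacing the elided id, and the final RETURN is only a placeholder for whatever the generated query does next.

```cypher
// Find the candidate IPv4 nodes once, then deduplicate before filtering.
MATCH (e:Entity)-[:HAS_ASSET*0..3]->(a:Asset:Ipv4)
WHERE id(e) = $entityId
WITH DISTINCT a
// Check upstream and downstream services independently.
WHERE EXISTS {
        MATCH (b0:Service {value: '220 microsoft ftp shttpervice'})-[:HAS_ASSET*0..]->(a)
      }
   OR EXISTS {
        MATCH (a)-[:HAS_ASSET*0..]->(c0:Service {value: 'http'})
      }
RETURN a
```

Because both variable-length patterns allow zero hops, this should accept the same a nodes as the single combined pattern in the question, while each a is tested only once.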

Also make sure you have indexes/constraints on :Service(value)
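
For reference, in Neo4j 4.x syntax (the EXISTS { } subqueries in the question suggest at least 4.0), such an index could be created roughly as follows. The index name service_value is made up for this example, and older 3.x releases use a different CREATE INDEX ON form.

```cypher
// Index on the value property of Service nodes; the index name is hypothetical.
CREATE INDEX service_value FOR (s:Service) ON (s.value)
```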