Neo4j

risingodegua · ‎11-10-2021

Hi folks

I have database with about 7m nodes and relationships, and I'm currently executing the following query:

CALL db.index.fulltext.queryNodes('organization', 'food' )
YIELD node
WITH node as org LIMIT 5000
MATCH (org)<-[:LOCATED_IN]-(loc:Location)
WHERE loc.zipCode="75219" OR loc.zipCode="75001" OR loc.zipCode="75002" OR loc.zipCode="75006" OR loc.zipCode="75007" OR loc.zipCode="75009" OR loc.zipCode="75010" OR loc.zipCode="75011" OR loc.zipCode="75013" OR loc.zipCode="75014" OR loc.zipCode="75015" OR loc.zipCode="75016" OR 
RETURN DISTINCT org, loc
SKIP 0 LIMIT 5

The query above simply does a full-text search with the word food and retrieves all related location nodes where the zipcode is any of the above.

This works fine, but it's relatively slow, and I would appreciate any help to make it faster.
Using the PROFILE tool, I can see that the bottleneck is in two places, the MATCH and WHERE clauses.

So my question is, is there an optimal way of doing this to make it faster?. The current speed is approximately 30-40 seconds.

Bennu · ‎11-10-2021

Hi @risingodegua!

Let's try two things.

First

CREATE INDEX ZIP_LOCATION FOR (n:Location) ON (n.zipCode);

It will take some time before it arrives to 100% completation, so check with call db.indexes in order to be sure that is populated before testing second part.

Then...

CALL db.index.fulltext.queryNodes('organization', 'food' )
YIELD node
WITH node as org LIMIT 5000
MATCH (org)<-[:LOCATED_IN]-(loc:Location)
WHERE loc.zipCode in ["75219", ...]
RETURN DISTINCT org, loc
SKIP 0 LIMIT 5

Lemme know how it goes.

Bennu

risingodegua · ‎11-10-2021

Thanks for the suggestion @Bennu

Sadly, I get the same performance after creating the index.

Bennu · ‎11-10-2021

Hi!

Can you share a profile if this newer version of the query?

Bennu

risingodegua · ‎11-10-2021

Sure, here it is:

andrew_bowman · ‎11-10-2021

Do you think this query is asking the right question? I'm wondering if the approach would work better if you flipped the ordering.

Right now, you find any 5000 food organizations, then of those 5000, you check if any are located in the zip codes given and try to limit it to 5. Your PROFILE plan shows that none of those 5000 food organizations are in those zip codes, so nothing gets returned.

I would think it would be more useful to find food organizations in those zip codes, where you either start from an index lookup on the zipCode, or a composite index lookup by zipCode and "food" organization value, though that would require some duplication of properties I think on the node that is looked up. In any case, I think that would lead to more relevant results, instead of relying on the first random 5000 nodes of the fulltext index to be present in the zip codes filtered.

risingodegua · ‎11-10-2021

Thanks for the suggestion @andrew.bowman

I considered this approach as well, and if possible I'm sure it's definitely better. The only problem I have is how to do a full text search on the returned results for location.

Something like:

MATCH (org)<-[:LOCATED_IN]-(loc:Location)
WHERE loc.zipCode="75219" OR loc.zipCode="75001" OR loc.zipCode="75002" OR loc.zipCode="75006" OR loc.zipCode="75007" OR loc.zipCode="75009" OR loc.zipCode="75010" OR loc.zipCode="75011" OR loc.zipCode="75013" OR loc.zipCode="75014" OR loc.zipCode="75015" OR loc.zipCode="75016"
WITH org, loc
CALL db.index.fulltext.queryNodes("organization", 'food' )
YIELD node
RETURN DISTINCT node, loc
SKIP 0 LIMIT 5

Is it possible to do a full text search on filtered nodes?

Update:

I tried the following and it works way faster:

MATCH (org)<-[:LOCATED_IN]-(loc:Location)
WHERE 
 (org.description CONTAINS 'food' OR org.name CONTAINS 'food' OR org.classification CONTAINS 'food')
AND 
 (loc.zipCode IN ["75219","75001","75002","75006","75007","75009","75010" ,"75011","75013","75014","75015","75016"])
RETURN DISTINCT org, loc
SKIP 0 LIMIT 5

Although this isn't as powerful as FTS, but it improves speed

risingodegua · ‎11-10-2021

I tried this, but regex doesn't seem to work.

 MATCH (org:Organization)<-[:LOCATED_IN]-(loc:Location)
 WHERE (
       org.name=~ '.*food.*' OR org.description=~ '.*food.*' OR org.classification=~ '.*food.*'
      )
 AND (
        loc.zipCode IN ["75219","75001","75002","75006","75007","75009","75010","75011","75013","75014","75015","75016","75017","75019","75022","75023","75024","75025","75026","75027"]
     )
RETURN DISTINCT org, loc
SKIP 0 LIMIT 5

This would have been better if it worked.

Update!!

This works and is way faster. I added a case-insensitive regex check. The final solution is:

PROFILE
MATCH (org:Organization)<-[:LOCATED_IN]-(loc:Location)
 WHERE (
       org.name=~ '(?i).*Food.*' OR org.description=~ '(?i).*Food.*' OR org.classification=~ '(?i).*Food.*'
      )
 AND (
        loc.zipCode IN ["75219","75001","75002","75006","75007","75009","75010","75011","75013","75014","75015","75016","75017","75019","75022","75023","75024","75025","75026","75027"]
     )
RETURN DISTINCT org, loc
SKIP 0 LIMIT 5

Neo4j

Optimizing Relationship query with Where filters