11-10-2021 12:25 AM
Hi folks
I have a database with about 7 million nodes and relationships, and I'm currently running the following query:
CALL db.index.fulltext.queryNodes('organization', 'food' )
YIELD node
WITH node as org LIMIT 5000
MATCH (org)<-[:LOCATED_IN]-(loc:Location)
WHERE loc.zipCode="75219" OR loc.zipCode="75001" OR loc.zipCode="75002" OR loc.zipCode="75006" OR loc.zipCode="75007" OR loc.zipCode="75009" OR loc.zipCode="75010" OR loc.zipCode="75011" OR loc.zipCode="75013" OR loc.zipCode="75014" OR loc.zipCode="75015" OR loc.zipCode="75016"
RETURN DISTINCT org, loc
SKIP 0 LIMIT 5
The query above runs a full-text search for the word food and retrieves all related Location nodes whose zipCode is any of the above.
This works fine, but it's relatively slow, and I would appreciate any help making it faster.
Using PROFILE, I can see that the bottleneck is in two places: the MATCH and WHERE clauses.
So my question is: is there a better way of doing this to make it faster? The query currently takes approximately 30-40 seconds.
11-10-2021 10:11 AM
Hi @risingodegua!
Let's try two things.
First
CREATE INDEX ZIP_LOCATION FOR (n:Location) ON (n.zipCode);
The index will take some time to populate fully, so check with CALL db.indexes that it has reached 100% before testing the second part.
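As a sketch of that check, assuming Neo4j 4.x (where the named-index syntax above is available): db.indexes reports the population state of each index, and db.awaitIndex blocks until the named index is online. The 300-second timeout is an arbitrary choice for illustration.

```cypher
// Inspect the state and population percentage of the new index.
// Assumes Neo4j 4.x, where db.indexes yields these columns.
CALL db.indexes() YIELD name, state, populationPercent
WHERE name = "ZIP_LOCATION"
RETURN name, state, populationPercent;

// Alternatively, block until the index is online (or fail after 300 seconds).
CALL db.awaitIndex("ZIP_LOCATION", 300);
```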
Then...
CALL db.index.fulltext.queryNodes('organization', 'food' )
YIELD node
WITH node as org LIMIT 5000
MATCH (org)<-[:LOCATED_IN]-(loc:Location)
WHERE loc.zipCode in ["75219", ...]
RETURN DISTINCT org, loc
SKIP 0 LIMIT 5
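As a sketch, the zip codes could also be passed in as a query parameter rather than written inline, which keeps the query text short. The parameter name $zipCodes is an assumption for illustration; it would be supplied by the driver or, in Neo4j Browser, by a :param command.

```cypher
// $zipCodes is a hypothetical list-of-strings parameter, e.g. ["75219", "75001"]
CALL db.index.fulltext.queryNodes('organization', 'food')
YIELD node
WITH node AS org LIMIT 5000
MATCH (org)<-[:LOCATED_IN]-(loc:Location)
WHERE loc.zipCode IN $zipCodes
RETURN DISTINCT org, loc
SKIP 0 LIMIT 5
```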
Lemme know how it goes.
Bennu
11-10-2021 10:53 PM
Thanks for the suggestion @Bennu
Sadly, I get the same performance after creating the index.
11-10-2021 11:44 PM
Hi!
Can you share a profile of this newer version of the query?
Bennu
11-10-2021 11:49 PM
Sure, here it is:
11-10-2021 11:46 AM
Do you think this query is asking the right question? I'm wondering if the approach would work better if you flipped the ordering.
Right now, you find any 5000 food organizations, then of those 5000, you check if any are located in the zip codes given and try to limit it to 5. Your PROFILE plan shows that none of those 5000 food organizations are in those zip codes, so nothing gets returned.
I would think it would be more useful to find the food organizations in those zip codes directly. You could start from an index lookup on zipCode, or from a composite index lookup on zipCode plus a "food" organization value, though I think that would require duplicating some properties onto the node being looked up. Either way, that should give more relevant results than relying on the first arbitrary 5000 nodes from the full-text index happening to be in the filtered zip codes.
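A minimal sketch of the flipped ordering, assuming an index on :Location(zipCode) exists and using only two of the zip codes from the question for brevity. Starting the pattern from the Location nodes gives the planner the option of seeding the query from that index; the "food" filtering step is deliberately left out here.

```cypher
// Start from the zip codes so the :Location(zipCode) index can seed the query.
MATCH (loc:Location)
WHERE loc.zipCode IN ["75219", "75001"]
// Then expand to the organizations at those locations.
MATCH (org)<-[:LOCATED_IN]-(loc)
RETURN DISTINCT org, loc
LIMIT 5
```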
11-10-2021 10:53 PM
Thanks for the suggestion @andrew.bowman
I considered this approach as well, and if it's possible I'm sure it's better. The only problem is how to run a full-text search on just the organizations returned for those locations.
Something like:
MATCH (org)<-[:LOCATED_IN]-(loc:Location)
WHERE loc.zipCode="75219" OR loc.zipCode="75001" OR loc.zipCode="75002" OR loc.zipCode="75006" OR loc.zipCode="75007" OR loc.zipCode="75009" OR loc.zipCode="75010" OR loc.zipCode="75011" OR loc.zipCode="75013" OR loc.zipCode="75014" OR loc.zipCode="75015" OR loc.zipCode="75016"
WITH org, loc
CALL db.index.fulltext.queryNodes("organization", 'food' )
YIELD node
RETURN DISTINCT node, loc
SKIP 0 LIMIT 5
Is it possible to do a full text search on filtered nodes?
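One possible shape for this, offered as an untested sketch rather than a known answer: collect the organizations found through the zip code filter, then keep only the full-text hits that appear in that collection. The shortened zip code list is for illustration, and the list membership check may itself be slow if the candidate set is large.

```cypher
// Step 1: gather candidate organizations from the filtered locations.
MATCH (org:Organization)<-[:LOCATED_IN]-(loc:Location)
WHERE loc.zipCode IN ["75219", "75001"]
WITH collect(DISTINCT org) AS candidates
// Step 2: run the full-text query and keep only hits among the candidates.
CALL db.index.fulltext.queryNodes('organization', 'food')
YIELD node, score
WHERE node IN candidates
RETURN node AS org, score
ORDER BY score DESC
LIMIT 5
```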
Update:
I tried the following and it works way faster:
MATCH (org)<-[:LOCATED_IN]-(loc:Location)
WHERE
(org.description CONTAINS 'food' OR org.name CONTAINS 'food' OR org.classification CONTAINS 'food')
AND
(loc.zipCode IN ["75219","75001","75002","75006","75007","75009","75010" ,"75011","75013","75014","75015","75016"])
RETURN DISTINCT org, loc
SKIP 0 LIMIT 5
This isn't as powerful as full-text search, but it is much faster.
11-10-2021 11:53 PM
I tried this, but regex doesn't seem to work.
MATCH (org:Organization)<-[:LOCATED_IN]-(loc:Location)
WHERE (
org.name=~ '.*food.*' OR org.description=~ '.*food.*' OR org.classification=~ '.*food.*'
)
AND (
loc.zipCode IN ["75219","75001","75002","75006","75007","75009","75010","75011","75013","75014","75015","75016","75017","75019","75022","75023","75024","75025","75026","75027"]
)
RETURN DISTINCT org, loc
SKIP 0 LIMIT 5
This would have been better if it worked.
Update!!
This works and is way faster. I added a case-insensitive regex check. The final solution is:
PROFILE
MATCH (org:Organization)<-[:LOCATED_IN]-(loc:Location)
WHERE (
org.name=~ '(?i).*Food.*' OR org.description=~ '(?i).*Food.*' OR org.classification=~ '(?i).*Food.*'
)
AND (
loc.zipCode IN ["75219","75001","75002","75006","75007","75009","75010","75011","75013","75014","75015","75016","75017","75019","75022","75023","75024","75025","75026","75027"]
)
RETURN DISTINCT org, loc
SKIP 0 LIMIT 5
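For comparison, a sketch of the same filter written with toLower() and CONTAINS instead of a regular expression. Cypher's =~ must match the entire string and follows Java regex rules, under which . does not match line breaks unless the (?s) flag is set, so a multi-line description could slip past '(?i).*Food.*'; CONTAINS has no such caveat. The shortened zip code list is for illustration only.

```cypher
MATCH (org:Organization)<-[:LOCATED_IN]-(loc:Location)
WHERE (
  // Lower-case each property once, then do a plain substring check.
  toLower(org.name) CONTAINS 'food'
  OR toLower(org.description) CONTAINS 'food'
  OR toLower(org.classification) CONTAINS 'food'
)
AND loc.zipCode IN ["75219", "75001", "75002"]
RETURN DISTINCT org, loc
SKIP 0 LIMIT 5
```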