Neo4j

michael4 · ‎01-05-2021

I have a very slow search I need to tune up.

I have this schema:

This is a typical email box for a user who might have multiple email accounts.

I want to allow users to search their email, returning all Threads which match
a keyword in any of its own properties, or its participants or messages.

The problem is that my query is very slow at the moment, even for a relatively small data set.

  CALL db.index.fulltext.createNodeIndex(
    "emailThreadsIndex",
    [
      "Thread",
      "Message",
      "Participant",
    ],
    [
      "snippet",
      "body",
      "subject",
      "name",
      "email",
      "description"
    ]
  )

This returns 3 different nodes, but I only want to return 1 to the user.
Therefore I need to

In a small dataset this takes 20 seconds to return no results,
or over 20 minutes before I killed the process to query for data for which I
know there should be results

:param { user_id: "292d41ba-d5b7-4f89-ab1a-8bb4382eeaa5" }

CALL db.index.fulltext.queryNodes(
  "emailThreadsIndex",
  'chloe'
) YIELD node as root_item, score
WITH root_item

MATCH (user)<-[:emails]-(:Email)-[:account]->(:Account)<-[:account_threads]-(threads:Thread)
WHERE user.uuid = {user_id}
WITH threads, root_item
MATCH (threads)
WHERE
(
  threads.uuid = root_item.uuid
  OR (threads)-[:messages]->(:Message { uuid: root_item.uuid })
  OR (threads)-[:participants]->(:Participant { uuid: root_item.uuid })
)
RETURN threads

The following is surprisingly slightly better, but it takes 2.5 seconds simply to return a count,
which on a small data set is far too long - imagine when I'm searching other
data types in an 'all' site search, or when this user's data grows larger than
6 months worth of email.

I'm also unsure if this is additive, i.e. whether each optional match adds onto the next one.

MATCH (user:User)
WHERE user.uuid = {user_id}
WITH user

CALL db.index.fulltext.queryNodes(
    "emailThreadsIndex",
'
    joe
'
) YIELD node as root_item, score

OPTIONAL MATCH (user)<-[:emails]-(emails:Email)-[:account]->(account:Account)<-[:account_threads]-(threads:Thread)
WHERE threads.uuid = root_item.uuid
OPTIONAL MATCH (user)<-[:emails]-(emails:Email)-[:account]->(account:Account)<-[:account_threads]-(threads:Thread)
WHERE (threads)-[:messages]->(:Message { uuid: root_item.uuid })
OPTIONAL MATCH (user)<-[:emails]-(emails:Email)-[:account]->(account:Account)<-[:account_threads]-(threads:Thread)
WHERE (threads)-[:thread_participants]->(:Participant { uuid: root_item.uuid })
OPTIONAL MATCH (user)<-[:emails]-(emails:Email)-[:account]->(account:Account)<-[:account_threads]-(threads:Thread)
WHERE (threads)-[:thread_tasks]->(:Task { uuid: root_item.uuid })

WITH threads
WHERE threads IS NOT NULL

RETURN count(DISTINCT (threads)) as count

I need to get this down to sub 100ms to be acceptable speed for modern users, desirably faster even, since there's post-processing to do as well.

I view the problem in two parts:

Limiting the haystack we're searching through to only the current user's information.

I happened upon a talk on Youtube where Christophe Willemsen advocated for this method:

How would I pass a keyword as well as the ids to this?
Would this even help?
The drawback here is I'd have to pass 3 ids: the threads, the message ids, and the participant ids.

Is there a more efficient way to do this? Am I missing something?

Efficiently searching through that user's email threads by its relationships, and only returning email threads.

It seems my queries above aren't efficient, is there something obvious I'm doing wrong?

Later, I want to be able to search across my entire site, which also has the same kind of rules - I want to be able to search relations, scoped by user. This was working OK with optional matches and Ors, but as soon as I introduced a substantial amount of data the searches are now taking 6+ seconds to execute.

For now this seems complicated enough, so let's leave the 'all' search alone for now 🙂

REALLY appreciate any help / pointers people can give me with this! Thanks!

michael_hunger · ‎01-09-2021

Best to run your query with PROFILE to see where the time is spent.

My suggestion

create indexes/constraints on your uuids
simplify the query first, test each part individually
instead of the OR use 3 queries or a UNION

e.g. how long takes this

CALL db.index.fulltext.queryNodes(
  "emailThreadsIndex",
  'chloe'
) YIELD node as root_item, score
RETURN count(*)

or this

MATCH (user)<-[:emails]-(:Email)-[:account]->(:Account)<-[:account_threads]-(threads:Thread)
WHERE user.uuid = {user_id}
RETURN *

If the threat is the root item, then use it as such and don't do an UUID lookup but use it directly

try a USING JOIN ON account

to trigger a two sided index lookup that's joined in the middle

michael4 · ‎01-14-2021

Thanks very much for your response, Michael - will give your suggestions a shot and get back to you.

clem · ‎01-14-2021

Definitely run your queries with PROFILE to see where the big DB hits are.

I think subqueries can help here by winnowing down a lot of the DB before expanding the search.

I suggest something like this (I might not have the syntax exactly right):


CALL{
MATCH(user {uuid:user_id})  // having a Label for user could help
RETURN user // returns only one user.
}
WITH user
CALL db.index.fulltext.queryNodes(
  "emailThreadsIndex",
  'chloe'
) YIELD node as root_item, score
WITH root_item, user

MATCH (user)<-[:emails]-(:Email)-[:account]->(:Account)<-[:account_threads]-(threads:Thread)
WITH threads, root_item
MATCH (threads)
WHERE
(
  threads.uuid = root_item.uuid
  OR (threads)-[:messages]->(:Message { uuid: root_item.uuid })
  OR (threads)-[:participants]->(:Participant { uuid: root_item.uuid })
)
RETURN threads

(It goes without saying, that you should have indexes on uuid.)

I hope this helps.

michael4 · ‎02-04-2021

Thanks for your responses guys, appreciate you taking the time out of your busy days to reply.

At this point I think we'll just move to elasticsearch instead.

Thanks though!

Neo4j

queryNodes - efficiently searching for one node across related nodes, only for one user's information