
Heads up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.

Query optimization that collects and orders nodes on very large graph

I have a decently large graph (1.8 billion nodes and roughly the same number of relationships) on which I am performing the following query:

MATCH (n:Article)
WHERE n.id IN $pmids
MATCH (n)-[:HAS_MENTION]->(m:Mention)
WITH n, collect(m) as mentions
RETURN n.id as pmid, mentions
ORDER BY pmid

where $pmids is a list of strings, e.g. ["1234", "4567"], whose length varies from 100 to 500.

I am currently holding the data in a Neo4j Docker community instance with the following conf modifications: NEO4J_dbms_memory_pagecache_size=32G, NEO4J_dbms_memory_heap_max__size=32G. An index has been created on Article.id.

This query has been quite slow to run (roughly 5 seconds) and I would like to optimize it for a faster runtime. Through work I have access to Neo4j Enterprise, so one approach would be to ingest this data into an Enterprise instance where I can tweak advanced configuration settings.

In general, does anyone have tips on how I might improve performance, whether by optimizing the Cypher query itself, increasing workers, or changing other settings in neo4j.conf?

Thanks in advance.

9 REPLIES

clem
Graph Steward

[EDIT] My initial guess turns out to be incorrect: Neo4J implements Lists as Arrays. That means your clause WHERE n.id IN $pmids causes a linear search to be done. [WRONG! The list is actually hashed the first time it's used, so subsequent searches are fast.]

Unfortunately, I don't think there is a way to make Neo4j use the IN operator on a hash (or dictionary, if you're a Python person). That does seem like an oversight. If Neo4j could operate on a hash, it would speed things up by a factor of roughly 100x (the length of your lists). I posted this as a suggestion: Want: in hash
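To make the list-versus-hash distinction concrete, here is a small Python analogy (this is illustrative only, not how Neo4j is implemented): a membership test against a plain list scans linearly, while a set hashes its elements once and then answers lookups in roughly constant time.

```python
# Illustrative analogy, not Neo4j internals: list membership scans
# linearly; set membership uses a hash table built once up front.

pmids = [str(i) for i in range(100_000)]

# list: each lookup is an O(n) scan
assert "99999" in pmids

# set: pay the hashing cost once, then lookups are O(1) on average
pmid_set = set(pmids)
assert "99999" in pmid_set
assert "nope" not in pmid_set
```

This mirrors the [EDIT] above: if the list is hashed the first time it is used, repeated membership checks become cheap even for long lists.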

One trick you can use is to give nodes multiple labels. With that, you could tag your article nodes with a label, e.g. L1234, and then form complex queries matching on labels.

Hey @clem. Thanks for the very thoughtful answer and also for submitting the suggestion.

I will give the label trick a try and report back on performance improvements 🙂

For this kind of query (WHERE n.id IN $pmids), if there's an index on :Article(id) then it will use an index lookup; that's your fastest bet.

You can also improve the performance by using pattern comprehensions instead of the collect() aggregation:

MATCH (n:Article)
WHERE n.id IN $pmids AND (n)-[:HAS_MENTION]->()
WITH n, [(n)-[:HAS_MENTION]->(m:Mention) | m] as mentions
ORDER BY n.id   // should use index-backed ordering, if a type hint is supplied earlier
RETURN n.id as pmid, mentions

We want to also allow this to use index-backed ordering, but we may need to supply a hint for the type of n.id. Is this an integer or string? Using something like AND n.id > 0 or AND n.id > '' should be enough to supply the type hint.

clem
Graph Steward

Two thoughts....

A. Here's a possibility (though I'm going out on a limb here, as I'm at the intermediate level of Cypher...). Use UNWIND to expand the id list into one row per id. I think Cypher will do the right thing here:

UNWIND $pmids AS id // expands $pmids into one row per id
MATCH (n:Article {id: id})-[:HAS_MENTION]->() // an index lookup is run for each id
WITH n, [(n)-[:HAS_MENTION]->(m:Mention) | m] as mentions
ORDER BY n.id   // should use index-backed ordering, if a type hint is supplied earlier
RETURN n.id as pmid, mentions

B. [EDIT] Not so good a thought apparently... but I'll leave it in, in case somebody wants to try it.

FOREACH doesn't permit MATCH inside its clause (I don't know why this restriction exists).

What you could do instead, is use another language like Python to call this fragment N times (for each id)

MATCH (n:Article {id: $id})-[:HAS_MENTION]->()

and then glue the results together in the external language. I haven't done that, so I'm not sure how well it would work though.

I wouldn't recommend that, as it would require a separate transaction for each id: between 100 and 500 separate Cypher queries, depending on the list length. The cost of the I/O and separate query processing isn't worth it.

The current approach to do it all in one query is the recommended way, we just need to streamline it a bit.

I wouldn't recommend that

Whose (and which) suggestion wouldn't you recommend?

Sorry, I should clarify:

This sounded like the intent was to call the MATCH fragment and RETURN as separate queries per id, and then glue the results back together from all those queries on the client-side. That involves more network I/O than you want, as well as the overhead of doing each in its own transaction instead of all of them in one.
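A rough back-of-envelope model makes the cost of per-id round trips concrete. The numbers below are hypothetical (per-query overhead and execution times are made up for illustration), but the shape of the result holds: each separate query pays a fixed network and transaction overhead, while a single batched query pays it once.

```python
# Hypothetical latency model: overhead (network round trip + transaction
# setup) is paid once per query, on top of actual execution time.

def total_latency(n_queries, per_query_overhead_ms, exec_ms):
    """Total time when each query pays the fixed overhead separately."""
    return n_queries * (per_query_overhead_ms + exec_ms)

# 500 per-id queries, 2 ms overhead each, 1 ms execution each:
per_id = total_latency(500, per_query_overhead_ms=2.0, exec_ms=1.0)

# One batched query pays the overhead once; execution still covers 500 lookups:
batched = total_latency(1, per_query_overhead_ms=2.0, exec_ms=500 * 1.0)

assert per_id == 1500.0   # 500 * (2 + 1)
assert batched == 502.0   # 1 * (2 + 500)
```

Even with these generous made-up numbers, the per-id approach roughly triples the total time, and real transaction overhead is often far larger than 2 ms.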

Perhaps you can help me understand a point that I believe isn't documented.

I believe that the problem is with:

n.id IN $pmids

I'm presuming (and I may be wrong!) that this does a linear search through the list to find if n.id is in the list. This is mighty expensive, especially if there are hundreds of items in the list. However, if the items in the list are put into a hash, it would be quicker.

On the other hand, I think the code as-is has to scan every n:Article node to check for a match, which is still very expensive. I think my UNWIND example might do the trick, though.

Thanks for enlightening us!

If there wasn't an index present, then you would be correct: it would require a NodeByLabelScan through :Article nodes and property filtering to find the ones that have an id that are in the parameter list.

But if an index is present, the planner can use the list to do an index lookup for the exact :Article nodes with ids in the parameter list, no need for filtering.

You can do an EXPLAIN to see the differences in the query plan yourself, just run it with an index present and without.

Typically, b-tree-type indexes are used for fast lookups, which are O(log n). That scales much better across larger sets of data than you might expect relative to a hash, which is why you typically won't see databases using hashtables to back their indexes. Additionally, a tree allows us to maintain ordering (you can't do that with a straight hashmap), which is needed for range queries and index-backed ordering; these can be very valuable for some queries.

There are some StackOverflow answers that go into greater detail on the tree-vs-hashtable question for database index implementations.

All that said... it would be really neat to work with a local hashmap structure within Cypher, not as the implementation of an index, but to assist with quick local mappings. We can do this to a degree with Cypher map structures, but we typically need map functions from APOC Procedures to create them.

These can only use string keys, however, not integers or nodes. This won't be useful for the query we've discussed here, since the index lookup will take care of things nicely, but it may be useful in other cases.
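In Python terms, the local-map idea is just a string-keyed dict built once and consulted many times. The sketch below is an analogy (the data is invented, and `dict(pairs)` plays roughly the role APOC's pair-to-map functions play in Cypher); note that the string-keys-only limitation mentioned above is mirrored here by using string pmids as keys.

```python
# Python analogy for a local map in Cypher: build a string-keyed map
# once from (key, value) pairs, then do quick local lookups instead of
# repeated scans. Data here is invented for illustration.

pairs = [("1234", ["mentionA", "mentionB"]), ("4567", ["mentionC"])]
mentions_by_pmid = dict(pairs)  # roughly what an APOC pairs-to-map function does

assert mentions_by_pmid["1234"] == ["mentionA", "mentionB"]
assert mentions_by_pmid.get("9999") is None  # missing key: no match
```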