Neo4j

xmlsysop · ‎06-15-2021

Hello.
Thanks in advance if anyone help me with this.

-- Status --
I have 1.5M Document nodes, 4.6M Keyword nodes, and 242M relations between them in Neo4j 4.1.9 Community Edition, and all properties are indexed.
Documents nodes have created date property between 2021-05-01 and 2021-05-31.
Neo4j database uses about 9G disk size, so I allocate 12G JVM heap for neo4j(the total memory size of vm is 16G).

-- Problems --
My query is to get 100 keywords of documents and its' count.

MATCH (d:Document)-->(k:Keyword) WHERE DATE('2021-05-01') <= d.created <= DATE('2021-05-15') RETURN k.name, COUNT(k) AS c LIMIT 100

This query takes about 200 sec like below message in the Neo4j Browser.

Started streaming 100 records after 2 ms and completed after 201559 ms.

-- Qeustions --

Is it a problem of my Cypher query?
How to increase performance of Neo4j?
The message in Neo4j Browser is an exact query execution time? If not, how can I get exact query execution time?

I apologize for my poor English and thank you in advance for any help.

andrew_bowman · ‎06-17-2021

First, show us the plan from a PROFILE of the query (with all elements expanded). That should show index use (or lack of) as well as the number of rows flowing between operations.

Also, looking at the query itself, you should be aggregating with respect to k, not k.name, that will be more efficient. Since you're aggregating before your LIMIT, all records have to be checked and processed for correct counting before the LIMIT can be applied, so the LIMIT isn't going to save you any time here, just as an fyi.

xmlsysop · ‎06-17-2021

Thank you for your help.

As your reply, I aggregate with respect to k, not k.name.
But it is still slow, about 130sec.

Here is the query plan for count().

Cypher version: CYPHER 4.1, planner: COST, runtime: INTERPRETED. 234430749 total db hits in 131011 ms.

andrew_bowman · ‎06-17-2021

Well that seems to make sense, you're aggregating across 116.4 million rows, that would take some time.

Your LIMIT isn't helping you since you only have 10 distinct rows (10 distinct k nodes) at the end, far below your limit of 100. It just happens that you likely have massive counts across the resulting k nodes.

Your query just seems to be doing a lot of necessary work. If the number of relationships between :Document and :Keyword nodes isn't likely to change, then it may be useful to process these every so often and create a special relationship between a document and its distinct keyword nodes and encode the sum of the counts of the others into the relationship between them. That would result in a simpler retrieval query, since you would only be querying for these fewer special relationships and getting the sums saved in their relationship properties.

xmlsysop · ‎06-17-2021

Thank you for your replies and help, and these are really helpful.

We currently collect news articles and its keywords everyday and store them in a relational database.
However, as the amount of data increases, the speed of queries like Cypher I wrote is slowing down.
So we are considering to adapt a Graph DB such as Neo4j to increase query performance.

vialard · ‎06-18-2021

What information do the keyword nodes contain?
You probably do not have 4.5M different keywords.
I would try using the keywords as labels for the documents.

xmlsysop · ‎06-18-2021

Keywords only have a name.
Our keywords include not only nouns but also etc. such as number, so it is possible in my case.
Reducing keywords is one of the simplest way, but you can't do that.

Neo4j

My count() query is too slow