Neo4j

Ciro · ‎09-29-2020

I'd like to get your help to understand what I can do to reduce the time it takes a query to return.

I created a graph database of the 2020 Stack Overflow data dump. I've started simple with a data model consisting of four types of nodes (User, Question, Answer, and Comment) and three types of relationships (POSTS, SUGGESTS, and GENERATES):

User -- POSTS -- Question
User -- SUGGESTS -- Answer
User -- GENERATES -- Comment

The Question node has several properties, one of which contains tags (words) that help categorize the question. This property can have one or more words in it.

I'm using the Community Edition (V3.5.20) running on an iMac with 32 GB of RAM). I wanted to start exploring the data and refining my queries before I code them into a Java app to do the heavy lifting. So, I asked, what I thought, was a simple question, how many Questions in the database contain the tag "?" I used the following query:

MATCH (u:User)-[r:POSTS]->(q:Question) WHERE q.Tags='<java>' return count(q.QID) as NumberofQuestionsWithJavaTag

The database contains ~19 M questions. This query took ~40 minutes to return a result (95,442). In the hopes of reducing the processing time of the query, I created indices on tags, among other properties of the Question node. That reduced the query time to ~33 minutes.

Do you have any ideas how I can reduce this query time to something a bit more sensible?

Thanks in advance for your help.

anthapu · ‎09-29-2020

Can you please provide your neo4j configuration and the database size?

Most likely reason would be there is not enough page cache is configured.

Also, is your disk HDD or SSD?

Also it would be useful if you post the query profile.

Ciro · ‎09-30-2020

Anthapu,
Thanks so much for your quick reply. Here are some of the sections of the database's config file for your review. I notice via the CleanMyMac app that the database consumes almost all available RAM while the query runs.

#dbms.memory.heap.initial_size=512m
#dbms.memory.heap.max_size=512m
#dbms.memory.pagecache.size=10g
dbms.jvm.additional=-XX:+UseG1GC
dbms.jvm.additional=-XX:+AlwaysPreTouch
dbms.jvm.additional=-XX:+DisableExplicitGC

Please let me know if there are other settings you'd like me to share with you.

The database folder takes-up 127.86 GB on the disk. The disk is an Apple Fusion drive. It's a hybrid that keeps frequently accessed files on a solid state disk (~121 GB) and the other files on a conventional hard disk (~1 TB).

I have not profiled the query, but will try to do that with the next run and share the results with you.
Regards,
Ciro

anthapu · ‎09-30-2020

It looks like your db might be too big for the memory available and coupled with regular HDD it might be causing too many page faults that are impacting the performance

Also for the thing you are doing you can change the query from

MATCH (u:User)-[r:POSTS]->(q:Question) WHERE q.Tags='<java>' return count(q.QID)

to

MATCH (q:Question) WHERE q.Tags='<java>' return count(q)

This query should be fine unless you can have queries that are not connected via POSTS.

Also, you do not want to refer to the property for count. You can count the nodes themselves. When you refer to the property, DB has to read that property, thus causing more slowness.

Ciro · ‎10-01-2020

Anthapu,
Thanks for the sobering news 🙂

After I posed my initial question, I had the idea to simplify the query much like you suggested. After I created an index on the Tags property, I ran the simplified query. I thought I would run it in the background while I worked on other tasks. To my surprise, the database responded in less than one second! Incredulous, I tried another query; same result! I tried another half-dozen queries...same result.

So, the creation of indices and simplifying the query did the trick.

Thanks so much for your help, I really appreciate it!
Regards,
Ciro

anthapu · ‎10-02-2020

Glad to hear it worked. I know from RDBMS experience we alway count the columns. In Neo4J the properties are stored separately. When you refer to a property it has to fetch it, thus going back to page cache which in turn goes to disk if it is not in page cache. By removing the property we are using traversal and counting the nodes.

Neo4j

Dreadfully Long Query Times