Neo4j

tomswinkels · ‎02-01-2019

There are a lot of difference option to make a query faster. But in my case sometimes is an other query faster than the other query.

But i need help, what is the best way!

I use Neo4j version 3.5.2.

In my case i have a big note with 4 miljoen messages and sometimes i want the last messages, and sometimes i want filter the messages. The messages has some relation to a hourly timetree and we have split all the words to a new note with relations between the word en message.

In some cases it is faster to use that and in some cases it is better to filter the message note by itself.

We want only filter or display the messages in the last month to make the performance better.

The examples are only word filters but there are more filters like a group-code (this code are not in the message but it is always a relation to a other note).

Some queries and results

PROFILE
MATCH (startleaf:Hour{hash: '2018/04/01/05'}), (endleaf:Hour{hash: '2018/04/30/05'}), p = shortestPath((startleaf)-[:NEXT*0..]->(endleaf))
UNWIND nodes(p) AS leaf
MATCH (leaf)<-[:SENDED]-(message:TS_P2000Message)
WITH distinct message
MATCH (message)-[:HAS_WORD]->(:TS_Word { name:'someren'})
WITH distinct message AS message
MATCH (message)-[:HAS_WORD]->(:TS_Word { name:'kruisbaan'})
WITH distinct message AS message
WITH count(message) AS results, collect(message) AS messages
UNWIND(messages) AS message
WITH results, message AS message
SKIP 0 LIMIT 15
RETURN results, message

First: Cypher version: CYPHER 3.5, planner: COST, runtime: SLOTTED. 848193 total db hits in 2099 ms.
Second: Cypher version: CYPHER 3.5, planner: COST, runtime: SLOTTED. 848176 total db hits in 763 ms.

PROFILE
MATCH (startleaf:Hour{hash: '2018/04/01/05'}), (endleaf:Hour{hash: '2018/04/30/05'}), p = shortestPath((startleaf)-[:NEXT*0..]->(endleaf))
UNWIND nodes(p) AS leaf
MATCH (leaf)<-[:SENDED]-(message:TS_P2000Message)
WHERE message.message =~ '(?i).*someren.*' AND message.message =~ '(?i).*kruisbaan.*'
WITH count(message) AS results, collect(message) AS messages
UNWIND(messages) AS message
WITH results, message AS message
SKIP 0 LIMIT 15
RETURN results, message

First: Cypher version: CYPHER 3.5, planner: COST, runtime: SLOTTED. 115168 total db hits in 3732 ms.

Second: Cypher version: CYPHER 3.5, planner: COST, runtime: SLOTTED. 115168 total db hits in 338 ms.

PROFILE 
MATCH  p = shortestPath((startleaf:Hour{hash: '2018/04/01/05'})-[:NEXT*0..]->(endleaf:Hour{hash: '2018/04/30/05'}))
WITH NODES(p) AS dates
  MATCH (message:TS_P2000Message)-[:SENDED]->(leaf),
        (message)-[:HAS_WORD]->(word:TS_Word)
  WHERE leaf IN dates AND
        word.name IN ['kruisbaan', 'someren']
  WITH distinct message AS message
  WITH count(message) AS results, collect(message) AS messages
  UNWIND(messages) AS message
  WITH results, message AS message
  SKIP 0 LIMIT 15
  RETURN results, message

First: Cypher version: CYPHER 3.5, planner: COST, runtime: SLOTTED. 4694 total db hits in 1086 ms.

Second: Cypher version: CYPHER 3.5, planner: COST, runtime: SLOTTED. 4694 total db hits in 36 ms.

But this query is very long when i use this without word filter or with other words (popular words) this query is faster, check the next result

PROFILE 
MATCH  p = shortestPath((startleaf:Hour{hash: '2018/04/01/05'})-[:NEXT*0..]->(endleaf:Hour{hash: '2018/04/30/05'}))
WITH NODES(p) AS dates
  MATCH (message:TS_P2000Message)-[:SENDED]->(leaf),
        (message)-[:HAS_WORD]->(word:TS_Word)
  WHERE leaf IN dates AND
        word.name IN ['brand']
  WITH distinct message AS message
  WITH count(message) AS results, collect(message) AS messages
  UNWIND(messages) AS message
  WITH results, message AS message
  SKIP 0 LIMIT 15
  RETURN results, message

First: Cypher version: CYPHER 3.5, planner: COST, runtime: SLOTTED. 149648 total db hits in 21066 ms.

Second: Cypher version: CYPHER 3.5, planner: COST, runtime: SLOTTED. 149652 total db hits in 1679 ms.

When we use without the other notes

PROFILE
MATCH (message:TS_P2000Message)
WHERE 1546281754000 <= message.sended <= 1548960154000
WITH count(message) AS results, collect(message) AS messages
WITH results AS results, messages AS messages
UNWIND(messages) AS message
WITH results, message AS message
ORDER BY message.sended desc
SKIP 0 LIMIT 15
RETURN results, message

First: Started streaming 15 records after 3098 ms and completed after 3100 ms.

Second: Started streaming 15 records after 303 ms and completed after 303 ms.

Other way to filter.

PROFILE
MATCH (message:TS_P2000Message)
WHERE 1546281754000 <= message.sended <= 1548960154000
WITH message
WHERE message.message =~ "(?i).*\\bsomeren\\b.*"
OR message.message =~ "(?i).*\\bbrand\\b.*"
WITH count(message) AS results, collect(message) AS messages
WITH results AS results, messages AS messages
UNWIND(messages) AS message
WITH results, message AS message
ORDER BY message.sended desc
SKIP 0 LIMIT 10
RETURN results, message

First: Cypher version: CYPHER 3.5, planner: COST, runtime: SLOTTED. 342324 total db hits in 3346 ms.

Second: Cypher version: CYPHER 3.5, planner: COST, runtime: SLOTTED. 342324 total db hits in 1114 ms.

kees_vegter · ‎02-06-2019

Hi Tom,

Did you also try CONTAINS?
And is TS_P2000Message.message and TS_P2000Message.sended indexed?

It is all about on controlling the size of the data set for each cypher step.
So assuming that there is also an index on message than you may start with the evaluation of the message first and after that filtering on time. If that results in a 'smaller' intermediate data set to work with for the next query step...

regards

tomswinkels · ‎02-06-2019

Hi Kees,

Yes, have try that also but in the cases that i have test the results are not better.

I have a lot of indexes, see the list of my indexes

description,indexName,tokenNames,properties,state,type,progress,provider,id,failureMessage

INDEX ON :TS_P2000Message(message),index_1465,[TS_P2000Message],[message],ONLINE,node_label_property,100.0,"{version:1.0,key:native-btree}",1465,""
INDEX ON :TS_P2000Message(sended),index_1655,[TS_P2000Message],[sended],ONLINE,node_label_property,100.0,"{version:1.0,key:native-btree}",1655,""
INDEX ON :TS_P2000Message(message, sended)",index_1458,[TS_P2000Message],"[message,sended]",ONLINE,node_unique_property,100.0,"{version:1.0,key:native-btree}",1458,""
INDEX ON :TS_P2000Message(uuid),index_1444,[TS_P2000Message],[uuid],ONLINE,node_unique_property,100.0,"{version:1.0,key:native-btree}",1444,""

INDEX ON :Hour(hash),index_1670,[Hour],[hash],ONLINE,node_unique_property,100.0,"{version:1.0,key:native-btree}",1670,""
INDEX ON :Hour(uuid),index_1677,[Hour],[uuid],ONLINE,node_unique_property,100.0,"{version:1.0,key:native-btree}",1677,""

INDEX ON :TS_P2000Capcode(capcode),index_1691,[TS_P2000Capcode],[capcode],ONLINE,node_unique_property,100.0,"{version:1.0,key:native-btree}",1691,""
INDEX ON :TS_P2000Capcode(uuid),index_1684,[TS_P2000Capcode],[uuid],ONLINE,node_unique_property,100.0,"{version:1.0,key:native-btree}",1684,""


INDEX ON :TS_Word(name),index_1663,[TS_Word],[name],ONLINE,node_unique_property,100.0,"{version:1.0,key:native-btree}",1663,""
INDEX ON :TS_Word(uuid),index_1656,[TS_Word],[uuid],ONLINE,node_unique_property,100.0,"{version:1.0,key:native-btree}",1656,""

New result for filter first on the message:

Query:

PROFILE
MATCH (message:TS_P2000Message)
WHERE message.message =~ "(?i).*\\bsomeren\\b.*"
OR message.message =~ "(?i).*\\bbrand\\b.*"
WITH message
WHERE 1546281754000 <= message.sended <= 1548960154000
WITH count(message) AS results, collect(message) AS messages
WITH results AS results, messages AS messages
UNWIND(messages) AS message
WITH results, message AS message
ORDER BY message.sended desc
SKIP 0 LIMIT 10
RETURN results, message

Result: Cypher version: CYPHER 3.5, planner: COST, runtime: SLOTTED. 342325 total db hits in 5378 ms.

michael_hunger · ‎02-06-2019

Can you share the profile output?

michael_hunger · ‎02-06-2019

I think if you simplify your query you should benefit from the new index backed order by operations in Neo4j 3.5
Esp if message.sended is indexed in a native index.

EXPLAIN
MATCH (message:TS_P2000Message)
WHERE 1546281754000 <= message.sended <= 1548960154000
AND message.message =~ ("(?i).*\\b"+"(someren|brand)"+"\\b.*")
RETURN message
ORDER BY message.sended desc
SKIP 0 LIMIT 10

Leads to an quite optimal plan for me:

tomswinkels · ‎02-06-2019

Sure!

tomswinkels · ‎02-06-2019

I have running the query that you give me.

michael_hunger · ‎02-06-2019

Sorry please run it with PROFILE

tomswinkels · ‎02-06-2019

No problem

Cypher version: CYPHER 3.5, planner: COST, runtime: SLOTTED. 864 total db hits in 68 ms.

tomswinkels · ‎02-06-2019

I think the count(results) is the problem in my case. But how can i get than the count of de total results of a filter?

michael_hunger · ‎02-06-2019

You mean the total count? Yep that requires pulling through all data, which is not effective.
Esp. with the regexp filters which are very inefficient as they require each record to be touched.

Do you really need the total counts, or would be "more than xxx" be enough?

tomswinkels · ‎02-06-2019

I use the total counts now for pagination.

The fulltext index takes very long time, there are 4 mil TS_P2000Message notes.

Maybe the best way is to use

PROFILE
MATCH (message:TS_P2000Message) 
WHERE 1546281754000 <= message.sended <= 1548960154000 
AND message.message =~ ("(?i).*\\b"+"(someren|brand)"+"\\b.*") 
RETURN message
ORDER BY message.sended desc 
SKIP 0 LIMIT 10

But in this case i get all the messages with someren or brand. But in some cases i want say only if someren and rosdaal. And in some cases i even want filter on codes (that is a relation between the message and code). When the word match is ok than i make some new examples.

Another thing what we do is give the relationships of something directly back (in one query). But is it better to do that, or a new request? I think it is better to do a new query but than you have with a limit of 50 results sometimes 1+50*(4 subqueries) = 201 queries for one request.

Now the indexes not used, this takes more time.

PROFILE
MATCH (message:TS_P2000Message)
WHERE 1546281754000 <= message.sended <= 1548960154000
OPTIONAL MATCH (message)-[:HAS_URGENCY]->(urgency:P2000Urgency)
WITH {urgency: {urgency: urgency}, message: message} AS message
RETURN message
ORDER BY message.sended desc
SKIP 0 LIMIT 10

michael_hunger · ‎02-07-2019

That last statement won't work as you need to use message.message.sended but then you loose all benefits of the index backed ordering.

So you should not shadow the name.

I can't believe that fetching results from an index with only 4M entries takes so long.
How big are your messages usually?

michael_hunger · ‎02-06-2019

If you instead used a Neo4j 3.5 fulltext index on your messages, you could try to use this:

PROFILE
CALL db.index.fulltext.queryNodes("messages", "someren brand") yield node as message1
MATCH (message:TS_P2000Message)
WHERE 1546281754000 <= message.sended <= 1548960154000
AND message1 = message
RETURN count(*)

tomswinkels · ‎02-07-2019

First i have create the fulltext index

CALL db.index.fulltext.createNodeIndex("messages",["TS_P2000Message"],["message"])

PROFILE
CALL db.index.fulltext.queryNodes("messages", "someren brand") yield node as message1
MATCH (message:TS_P2000Message)
WHERE 1546281754000 <= message.sended <= 1548960154000
AND message1 = message
RETURN count(*)

It takes more than 1035732 ms after that time i have killed the query.

michael_hunger · ‎02-08-2019

What does the EXPLAIN look like?

And what does this look like:

PROFILE
CALL db.index.fulltext.queryNodes("messages", "someren brand") yield node 
RETURN count(*)

michael_hunger · ‎02-08-2019

Would it be possible to share your db with us for testing?
How big are your message texts usually?

tomswinkels · ‎02-08-2019

I have run that query:

Cypher version: CYPHER 3.5, planner: COST, runtime: SLOTTED. 1 total db hits in 3049 ms.

I will share my db with Neo4j, how can i do that? I can send a download link.

The messages texts are not big, max 200 characters.

michael_hunger · ‎02-08-2019

Thanks I also meant the explain from the other statement that didn't finish.
You can send me the link to michael at neo4j.com or via PM.

Thanks so much.

tomswinkels · ‎02-08-2019

Sorry, see here the explain.

I have send the link to you're mail.

michael_hunger · ‎02-12-2019

I just ran this with your data on my laptop (with 4G PC and 2G heap configured).

PROFILE
CALL db.index.fulltext.queryNodes("messages", "someren brand") yield node as message
WHERE 1546281754000 <= message.sended <= 1548960154000
RETURN message.uuid

It finished in less than 200ms

tomswinkels · ‎02-12-2019

Hmmm thats sounds good.

But then i run that query:

Started streaming 2310 records after 39 ms and completed after 22869 ms, displaying first 1000 rows.

michael_hunger · ‎02-13-2019

What is your memory (heap + page-cache config again)?
What kind of disk are you running?

tomswinkels · ‎02-13-2019

This are the settings that i use now:

dbms.memory.heap.initial_size=3000m
dbms.memory.heap.max_size=3000m
dbms.memory.pagecache.size=1000m

150GB SSD disk.

tomswinkels · ‎02-13-2019

I have change the settings to:

dbms.memory.heap.initial_size=2000m
dbms.memory.heap.max_size=2000m
dbms.memory.pagecache.size=4000m

Cypher version: CYPHER 3.5, planner: COST, runtime: SLOTTED. 121489 total db hits in 17191 ms.

Maybe you result is from the second (cache) run?

michael_hunger · ‎02-24-2019

Yes very likely as i ran multiple queries there. The page-cache should be warmed up.

I just ran it on my laptop and it was fast enough.

Neo4j

Query optimalisation