Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.
02-01-2019 01:40 PM
There are a lot of difference option to make a query faster. But in my case sometimes is an other query faster than the other query.
But i need help, what is the best way!
I use Neo4j version 3.5.2.
In my case i have a big note with 4 miljoen messages and sometimes i want the last messages, and sometimes i want filter the messages. The messages has some relation to a hourly timetree and we have split all the words to a new note with relations between the word en message.
In some cases it is faster to use that and in some cases it is better to filter the message note by itself.
We want only filter or display the messages in the last month to make the performance better.
The examples are only word filters but there are more filters like a group-code (this code are not in the message but it is always a relation to a other note).
Some queries and results
PROFILE
MATCH (startleaf:Hour{hash: '2018/04/01/05'}), (endleaf:Hour{hash: '2018/04/30/05'}), p = shortestPath((startleaf)-[:NEXT*0..]->(endleaf))
UNWIND nodes(p) AS leaf
MATCH (leaf)<-[:SENDED]-(message:TS_P2000Message)
WITH distinct message
MATCH (message)-[:HAS_WORD]->(:TS_Word { name:'someren'})
WITH distinct message AS message
MATCH (message)-[:HAS_WORD]->(:TS_Word { name:'kruisbaan'})
WITH distinct message AS message
WITH count(message) AS results, collect(message) AS messages
UNWIND(messages) AS message
WITH results, message AS message
SKIP 0 LIMIT 15
RETURN results, message
First: Cypher version: CYPHER 3.5, planner: COST, runtime: SLOTTED. 848193 total db hits in 2099 ms.
Second: Cypher version: CYPHER 3.5, planner: COST, runtime: SLOTTED. 848176 total db hits in 763 ms.
PROFILE
MATCH (startleaf:Hour{hash: '2018/04/01/05'}), (endleaf:Hour{hash: '2018/04/30/05'}), p = shortestPath((startleaf)-[:NEXT*0..]->(endleaf))
UNWIND nodes(p) AS leaf
MATCH (leaf)<-[:SENDED]-(message:TS_P2000Message)
WHERE message.message =~ '(?i).*someren.*' AND message.message =~ '(?i).*kruisbaan.*'
WITH count(message) AS results, collect(message) AS messages
UNWIND(messages) AS message
WITH results, message AS message
SKIP 0 LIMIT 15
RETURN results, message
First: Cypher version: CYPHER 3.5, planner: COST, runtime: SLOTTED. 115168 total db hits in 3732 ms.
Second: Cypher version: CYPHER 3.5, planner: COST, runtime: SLOTTED. 115168 total db hits in 338 ms.
PROFILE
MATCH p = shortestPath((startleaf:Hour{hash: '2018/04/01/05'})-[:NEXT*0..]->(endleaf:Hour{hash: '2018/04/30/05'}))
WITH NODES(p) AS dates
MATCH (message:TS_P2000Message)-[:SENDED]->(leaf),
(message)-[:HAS_WORD]->(word:TS_Word)
WHERE leaf IN dates AND
word.name IN ['kruisbaan', 'someren']
WITH distinct message AS message
WITH count(message) AS results, collect(message) AS messages
UNWIND(messages) AS message
WITH results, message AS message
SKIP 0 LIMIT 15
RETURN results, message
First: Cypher version: CYPHER 3.5, planner: COST, runtime: SLOTTED. 4694 total db hits in 1086 ms.
Second: Cypher version: CYPHER 3.5, planner: COST, runtime: SLOTTED. 4694 total db hits in 36 ms.
But this query is very long when i use this without word filter or with other words (popular words) this query is faster, check the next result
PROFILE
MATCH p = shortestPath((startleaf:Hour{hash: '2018/04/01/05'})-[:NEXT*0..]->(endleaf:Hour{hash: '2018/04/30/05'}))
WITH NODES(p) AS dates
MATCH (message:TS_P2000Message)-[:SENDED]->(leaf),
(message)-[:HAS_WORD]->(word:TS_Word)
WHERE leaf IN dates AND
word.name IN ['brand']
WITH distinct message AS message
WITH count(message) AS results, collect(message) AS messages
UNWIND(messages) AS message
WITH results, message AS message
SKIP 0 LIMIT 15
RETURN results, message
First: Cypher version: CYPHER 3.5, planner: COST, runtime: SLOTTED. 149648 total db hits in 21066 ms.
Second: Cypher version: CYPHER 3.5, planner: COST, runtime: SLOTTED. 149652 total db hits in 1679 ms.
When we use without the other notes
PROFILE
MATCH (message:TS_P2000Message)
WHERE 1546281754000 <= message.sended <= 1548960154000
WITH count(message) AS results, collect(message) AS messages
WITH results AS results, messages AS messages
UNWIND(messages) AS message
WITH results, message AS message
ORDER BY message.sended desc
SKIP 0 LIMIT 15
RETURN results, message
First: Started streaming 15 records after 3098 ms and completed after 3100 ms.
Second: Started streaming 15 records after 303 ms and completed after 303 ms.
Other way to filter.
PROFILE
MATCH (message:TS_P2000Message)
WHERE 1546281754000 <= message.sended <= 1548960154000
WITH message
WHERE message.message =~ "(?i).*\\bsomeren\\b.*"
OR message.message =~ "(?i).*\\bbrand\\b.*"
WITH count(message) AS results, collect(message) AS messages
WITH results AS results, messages AS messages
UNWIND(messages) AS message
WITH results, message AS message
ORDER BY message.sended desc
SKIP 0 LIMIT 10
RETURN results, message
First: Cypher version: CYPHER 3.5, planner: COST, runtime: SLOTTED. 342324 total db hits in 3346 ms.
Second: Cypher version: CYPHER 3.5, planner: COST, runtime: SLOTTED. 342324 total db hits in 1114 ms.
02-06-2019 05:43 AM
Hi Tom,
Did you also try CONTAINS?
And is TS_P2000Message.message and TS_P2000Message.sended indexed?
It is all about on controlling the size of the data set for each cypher step.
So assuming that there is also an index on message than you may start with the evaluation of the message first and after that filtering on time. If that results in a 'smaller' intermediate data set to work with for the next query step...
regards
02-06-2019 07:47 AM
Hi Kees,
Yes, have try that also but in the cases that i have test the results are not better.
I have a lot of indexes, see the list of my indexes
description,indexName,tokenNames,properties,state,type,progress,provider,id,failureMessage
INDEX ON :TS_P2000Message(message),index_1465,[TS_P2000Message],[message],ONLINE,node_label_property,100.0,"{version:1.0,key:native-btree}",1465,""
INDEX ON :TS_P2000Message(sended),index_1655,[TS_P2000Message],[sended],ONLINE,node_label_property,100.0,"{version:1.0,key:native-btree}",1655,""
INDEX ON :TS_P2000Message(message, sended)",index_1458,[TS_P2000Message],"[message,sended]",ONLINE,node_unique_property,100.0,"{version:1.0,key:native-btree}",1458,""
INDEX ON :TS_P2000Message(uuid),index_1444,[TS_P2000Message],[uuid],ONLINE,node_unique_property,100.0,"{version:1.0,key:native-btree}",1444,""
INDEX ON :Hour(hash),index_1670,[Hour],[hash],ONLINE,node_unique_property,100.0,"{version:1.0,key:native-btree}",1670,""
INDEX ON :Hour(uuid),index_1677,[Hour],[uuid],ONLINE,node_unique_property,100.0,"{version:1.0,key:native-btree}",1677,""
INDEX ON :TS_P2000Capcode(capcode),index_1691,[TS_P2000Capcode],[capcode],ONLINE,node_unique_property,100.0,"{version:1.0,key:native-btree}",1691,""
INDEX ON :TS_P2000Capcode(uuid),index_1684,[TS_P2000Capcode],[uuid],ONLINE,node_unique_property,100.0,"{version:1.0,key:native-btree}",1684,""
INDEX ON :TS_Word(name),index_1663,[TS_Word],[name],ONLINE,node_unique_property,100.0,"{version:1.0,key:native-btree}",1663,""
INDEX ON :TS_Word(uuid),index_1656,[TS_Word],[uuid],ONLINE,node_unique_property,100.0,"{version:1.0,key:native-btree}",1656,""
New result for filter first on the message:
Query:
PROFILE
MATCH (message:TS_P2000Message)
WHERE message.message =~ "(?i).*\\bsomeren\\b.*"
OR message.message =~ "(?i).*\\bbrand\\b.*"
WITH message
WHERE 1546281754000 <= message.sended <= 1548960154000
WITH count(message) AS results, collect(message) AS messages
WITH results AS results, messages AS messages
UNWIND(messages) AS message
WITH results, message AS message
ORDER BY message.sended desc
SKIP 0 LIMIT 10
RETURN results, message
Result: Cypher version: CYPHER 3.5, planner: COST, runtime: SLOTTED. 342325 total db hits in 5378 ms.
02-06-2019 08:13 AM
Can you share the profile output?
02-06-2019 08:21 AM
I think if you simplify your query you should benefit from the new index backed order by operations in Neo4j 3.5
Esp if message.sended
is indexed in a native index.
EXPLAIN
MATCH (message:TS_P2000Message)
WHERE 1546281754000 <= message.sended <= 1548960154000
AND message.message =~ ("(?i).*\\b"+"(someren|brand)"+"\\b.*")
RETURN message
ORDER BY message.sended desc
SKIP 0 LIMIT 10
Leads to an quite optimal plan for me:
02-06-2019 08:24 AM
Sure!
02-06-2019 08:26 AM
I have running the query that you give me.
02-06-2019 08:40 AM
Sorry please run it with PROFILE
02-06-2019 09:38 AM
No problem
Cypher version: CYPHER 3.5, planner: COST, runtime: SLOTTED. 864 total db hits in 68 ms.
02-06-2019 10:33 AM
I think the count(results) is the problem in my case. But how can i get than the count of de total results of a filter?
02-06-2019 02:01 PM
You mean the total count? Yep that requires pulling through all data, which is not effective.
Esp. with the regexp filters which are very inefficient as they require each record to be touched.
Do you really need the total counts, or would be "more than xxx" be enough?
02-06-2019 03:14 PM
I use the total counts now for pagination.
The fulltext index takes very long time, there are 4 mil TS_P2000Message notes.
Maybe the best way is to use
PROFILE
MATCH (message:TS_P2000Message)
WHERE 1546281754000 <= message.sended <= 1548960154000
AND message.message =~ ("(?i).*\\b"+"(someren|brand)"+"\\b.*")
RETURN message
ORDER BY message.sended desc
SKIP 0 LIMIT 10
But in this case i get all the messages with someren or brand. But in some cases i want say only if someren and rosdaal. And in some cases i even want filter on codes (that is a relation between the message and code). When the word match is ok than i make some new examples.
Another thing what we do is give the relationships of something directly back (in one query). But is it better to do that, or a new request? I think it is better to do a new query but than you have with a limit of 50 results sometimes 1+50*(4 subqueries) = 201 queries for one request.
Now the indexes not used, this takes more time.
PROFILE
MATCH (message:TS_P2000Message)
WHERE 1546281754000 <= message.sended <= 1548960154000
OPTIONAL MATCH (message)-[:HAS_URGENCY]->(urgency:P2000Urgency)
WITH {urgency: {urgency: urgency}, message: message} AS message
RETURN message
ORDER BY message.sended desc
SKIP 0 LIMIT 10
02-07-2019 04:06 PM
That last statement won't work as you need to use message.message.sended
but then you loose all benefits of the index backed ordering.
So you should not shadow the name.
I can't believe that fetching results from an index with only 4M entries takes so long.
How big are your messages usually?
02-06-2019 02:05 PM
If you instead used a Neo4j 3.5 fulltext index on your messages, you could try to use this:
PROFILE
CALL db.index.fulltext.queryNodes("messages", "someren brand") yield node as message1
MATCH (message:TS_P2000Message)
WHERE 1546281754000 <= message.sended <= 1548960154000
AND message1 = message
RETURN count(*)
02-07-2019 11:36 PM
First i have create the fulltext index
CALL db.index.fulltext.createNodeIndex("messages",["TS_P2000Message"],["message"])
PROFILE
CALL db.index.fulltext.queryNodes("messages", "someren brand") yield node as message1
MATCH (message:TS_P2000Message)
WHERE 1546281754000 <= message.sended <= 1548960154000
AND message1 = message
RETURN count(*)
It takes more than 1035732 ms after that time i have killed the query.
02-08-2019 03:49 AM
What does the EXPLAIN look like?
And what does this look like:
PROFILE
CALL db.index.fulltext.queryNodes("messages", "someren brand") yield node
RETURN count(*)
02-08-2019 04:45 AM
Would it be possible to share your db with us for testing?
How big are your message texts usually?
02-08-2019 05:22 AM
I have run that query:
I will share my db with Neo4j, how can i do that? I can send a download link.
The messages texts are not big, max 200 characters.
02-08-2019 06:20 AM
Thanks I also meant the explain from the other statement that didn't finish.
You can send me the link to michael at neo4j.com or via PM.
Thanks so much.
02-08-2019 06:54 AM
Sorry, see here the explain.
I have send the link to you're mail.
02-12-2019 02:04 PM
I just ran this with your data on my laptop (with 4G PC and 2G heap configured).
PROFILE
CALL db.index.fulltext.queryNodes("messages", "someren brand") yield node as message
WHERE 1546281754000 <= message.sended <= 1548960154000
RETURN message.uuid
It finished in less than 200ms
02-12-2019 11:36 PM
Hmmm thats sounds good.
But then i run that query:
Started streaming 2310 records after 39 ms and completed after 22869 ms, displaying first 1000 rows.
02-13-2019 07:06 AM
What is your memory (heap + page-cache config again)?
What kind of disk are you running?
02-13-2019 08:16 AM
This are the settings that i use now:
dbms.memory.heap.initial_size=3000m
dbms.memory.heap.max_size=3000m
dbms.memory.pagecache.size=1000m
150GB SSD disk.
02-13-2019 11:41 PM
I have change the settings to:
dbms.memory.heap.initial_size=2000m
dbms.memory.heap.max_size=2000m
dbms.memory.pagecache.size=4000m
Cypher version: CYPHER 3.5, planner: COST, runtime: SLOTTED. 121489 total db hits in 17191 ms.
Maybe you result is from the second (cache) run?
02-24-2019 11:31 PM
Yes very likely as i ran multiple queries there. The page-cache should be warmed up.
I just ran it on my laptop and it was fast enough.
All the sessions of the conference are now available online