07-15-2022 12:38 PM
Hi all,
I have a problematically large Cypher query that seems to run forever.
A little background:
I am running Neo4j in a Docker container as part of a Docker Compose setup.
The machine this is running on has 4 vCPUs and 32 GiB of memory.
Neo4j currently uses 5 GB of disk space.
neo4j:
  image: neo4j:4.4.3
  hostname: neo4j
  container_name: neo4j
  environment:
    - NEO4J_dbms_logs_debug_level=debug
    - NEO4J_dbms_connector_https_advertised__address=localhost:7473
    - NEO4J_dbms_connector_http_advertised__address=localhost:7474
    - NEO4J_dbms_connector_bolt_advertised__address=localhost:7687
    - NEO4J_apoc_uuid_enabled=true
    - NEO4JLABS_PLUGINS=["apoc","graph-data-science"]
    - NEO4J_dbms_security_procedures_unrestricted=gds.*,apoc.*
    - NEO4J_dbms_memory_heap_initial__size=12000m
    - NEO4J_dbms_memory_heap_max__size=12000m
    - NEO4J_dbms_memory_pagecache_size=12100m
    - NEO4J_dbms_jvm_additional=-XX:+ExitOnOutOfMemoryError
    - NEO4J_dbms_logs_gc_enabled=true
The data model is:
(d:Document)<-[:IN_DOC]-(t:Table)<-[:IN_TABLE]-(tr:TableRow)<-[:IN_ROW]-(p:Phrase)
The constraints and indexes are:
CREATE CONSTRAINT ON (p:Phrase) ASSERT p.content IS UNIQUE;
CREATE INDEX ON :Phrase(content);
CALL db.index.fulltext.createNodeIndex("Phrases", ["Phrase"], ["content"]);
CREATE INDEX ON :TableRow(encoded_content);
CALL db.index.fulltext.createNodeIndex("TableRows", ["TableRow"], ["text"]);
Query v1:
MATCH (d:Document {id: $doc_id})
MATCH (tr:TableRow)-[:IN_TABLE]->(t:Table)-[:IN_DOC]->(d)
MATCH (p1:Phrase)-[c1:IN_ROW]->(tr)
MATCH (p2:Phrase)-[:IN_ROW]->(tr)
WHERE NOT p1=p2
MERGE (p1)-[:CONNECTED_TO {relation: c1.label}]->(p2)
Query v2:
CALL apoc.periodic.iterate(
'
MATCH (d:Document {id: $doc_id})
MATCH (tr:TableRow)-[:IN_TABLE]->(t:Table)-[:IN_DOC]->(d)
MATCH (p1:Phrase)-[c1:IN_ROW]->(tr) MATCH (p2:Phrase)-[:IN_ROW]->(tr)
WHERE NOT p1=p2
RETURN p1, p2, c1
',
'MERGE (p1)-[:CONNECTED_TO {relation: c1.label}]->(p2)',
{batchSize:10000, parallel:true, params:{doc_id: $doc_id}}
);
Query v3 (an attempt to use less RAM):
CALL apoc.periodic.commit(
"
MATCH (d:Document {id: $doc_id})
MATCH (tr:TableRow)-[:IN_TABLE]->(t:Table)-[:IN_DOC]->(d)
MATCH (p1:Phrase)-[c1:IN_ROW]->(tr) MATCH (p2:Phrase)-[:IN_ROW]->(tr)
WHERE NOT p1=p2 AND NOT (p1)-[:CONNECTED_TO]->(p2)
WITH c1, p2, p1 LIMIT $limit
MERGE (p1)-[:CONNECTED_TO {relation: c1.label}]->(p2) RETURN count(*)
",
{limit:1000, doc_id: $doc_id}
);
For scale, here are some counts:
MATCH (d:Document {id:$doc_id}) RETURN count(d)
1
MATCH (d:Document {id:$doc_id})--(t:Table) RETURN count(t)
1
MATCH (d:Document {id:$doc_id})--(t:Table)--(tr:TableRow) RETURN count(tr)
253,762
MATCH (d:Document {id:$doc_id})--(t:Table)--(tr:TableRow)--(p:Phrase) RETURN count(p)
7,612,860
MATCH (d:Document {id:$doc_id})--(t:Table)--(tr:TableRow)
MATCH (tr)--(p1:Phrase)
MATCH (tr)--(p2:Phrase)
WHERE NOT p1=p2
RETURN count(p2)
203,009,870
So this is a huge query, and it justifiably should take a long time to run: with an average of roughly 30 phrases per row across 253,762 rows, the number of ordered phrase pairs explodes to about 203 million. Even so, I haven't been able to get the query to complete after giving it 12+ hours.
Any recommendations on how to fix this query, or on what I might be doing wrong, are greatly appreciated.
07-16-2022 12:38 PM
Hi @Max-Tait,
I wrote some code similar to yours. I think doing it this way will save memory, since the outer statement only streams the TableRow nodes and the phrase matching and MERGE work is done in batches:
CALL apoc.periodic.iterate(
'
MATCH (d:Document {id: $doc_id})
MATCH (tr:TableRow)-[:IN_TABLE]->(t:Table)-[:IN_DOC]->(d)
RETURN tr
',
'
MATCH (p1:Phrase)-[c1:IN_ROW]->(tr),
      (p2:Phrase)-[:IN_ROW]->(tr)
WHERE p1 <> p2
MERGE (p1)-[:CONNECTED_TO {relation: c1.label}]->(p2)
',
{batchSize:10000, parallel:true, params:{doc_id: $doc_id}}
);
07-16-2022 01:17 PM
It seems to me that you are creating a new relationship between every pair of phrases associated with the specified document. The way this query is written, you should get relationships created in each direction. Is my understanding correct? If so, you can reduce the number of pairs that need to be processed by creating the relationships in both directions at the same time.
Building on @koji's optimization, you can accomplish this as follows:
CALL apoc.periodic.iterate(
'
MATCH (d:Document {id: $doc_id})
MATCH (tr:TableRow)-[:IN_TABLE]->(t:Table)-[:IN_DOC]->(d)
RETURN tr
',
'
MATCH (p1:Phrase)-[c1:IN_ROW]->(tr),
(p2:Phrase)-[c2:IN_ROW]->(tr)
WHERE id(p1) < id(p2)
MERGE (p1)-[:CONNECTED_TO {relation: c1.label}]->(p2)
MERGE (p2)-[:CONNECTED_TO {relation: c2.label}]->(p1)
',
{batchSize:10000, parallel:true, params:{doc_id: $doc_id}}
);
07-18-2022 09:53 AM
Your first Cypher statement includes
MATCH (d:Document {id: $doc_id})
yet it does not appear that there is an index on :Document and property id. Having that index should help performance.
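A minimal sketch of creating such an index, using Neo4j 4.x syntax (the index name document_id is just an illustrative choice):
CREATE INDEX document_id IF NOT EXISTS FOR (d:Document) ON (d.id);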
Also from your debug.log we see
[system/00000000] Storage summary:
[system/00000000] Total size of store: 1.047MiB
[system/00000000] Total size of mapped files: 312.0KiB
...
If you sum the lines that contain `Total size of mapped files`, the total is a good value for dbms.memory.pagecache.size. In this case, setting dbms.memory.pagecache.size=6G or even 7G should be adequate; your current 11G is probably overkill.
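In the docker-compose environment block above, that change would look something like this (using the 6g estimate; adjust to whatever total you measure from your own debug.log):
- NEO4J_dbms_memory_pagecache_size=6g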
07-18-2022 10:06 AM
@dana_canzano, that's good to know about the index and the pagecache. When I set the memory settings for neo4j I was grasping at straws about what might be causing this issue.