Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.
04-11-2022 12:29 AM
Hello everyone,
I'm a bit puzzled because I have what seems to be a very simple graph model and insert batch query that is taking multiple minutes to run.
Here is the query:
UNWIND $batch as row
Merge (tld:TLD{tld:row.tld})
Merge (ip:Ip{address:row.address})
Merge (domain:Domain{root:row.root,sub:row.subdomain})
Merge (domain)-[:tld]->(tld)
Merge (domain)<-[:host]-(ip)
The model (as is evident from the query) is a group of TlD
nodes (such as Tld{tld:"com"}
) attached via a :root
relationship to Domain
nodes (such as Domain{domain:"foo", subdomain:"www"}
) which are attached to Ip
nodes (such as `Ip{address:"1.1.1.1"}.
Here are the explain and profile png's for single inserts:
explain:
profile:
If it helps here's the golang code I'm using to generate the query:
MapResults
converts the result structs into a generic map structure to satisfy the tx.Run
method.
_, err := sess.Session.WriteTransaction(func(tx neo4j.Transaction) (interface{}, error) {
result, err := tx.Run(
"UNWIND $batch as row "+
"Merge(tld:Tld{tld:row.Tld}) "+
"Merge(ip:Ip{ip:row.Address, rdn: row.Rdn}) "+
"Merge(domain:Domain{domain:row.Domain, subdomain:row.Subdomain}) "+
"Merge (domain)-[:root]->(tld) "+
"Merge (domain)<-[:host]-(ip) ",
map[string]interface{}{"batch": MapResults(results)})
return result, err
})
04-11-2022 04:13 AM
Look at the beginning of the two branches that are orange. It indicates each starts with a node label scan, I,e, it is scanning all nodes with the label to filter on the predicate in the next stage. Looks like the are lots of nodes for these two labels. You may want to try adding an index to those two labels and corresponding properties.
04-11-2022 04:14 AM
From the profile it does not appear you have any indexes to support the Merge statements. Since a MERGE is effectively a create or update, if it is to do an update than having an index will improve the performance.
MERGE - Neo4j Cypher Manual describes MERGE and states
For performance reasons, creating a schema index on the label or property is highly recommended when using MERGE. See Indexes for search performance for more information.
I might suggest creating indexes on
:TLD(tld)
:Ip(address)
:Domain(root, sub)
04-11-2022 07:47 AM
The IP address index helped speed it up a bit but for some reason my domains index isn't being used.
CREATE INDEX domains_index_name FOR (d:Domain) ON (d.domain, d.subdomain)
New profile:
Does it just take some time to propagate?
04-11-2022 07:49 AM
is there a index on :TLD?
04-11-2022 07:54 AM
@dana.canzano yup,
7 "tld_index" "ONLINE" 100.0 "NONUNIQUE" "BTREE" "NODE" ["Tld"] ["tld"] "native-btree-1.0"
I'm not too concerned about TLD node lookups since there's only going to be <100 of them. Its the Domain node index that would speed it up the most.
This is odd. It uses the domain index for searches but not on inserts.
profile match (t:Tld{tld:"com"})<-[r]-(d:Domain{domain:"att", subdomain:""}) return t,d
profile:
04-11-2022 09:00 AM
The first profile shows 'sub', not 'subdomain' as the 'domain' attribute used in the filter predicate.
Are you being consistent with your attribute names?
04-11-2022 09:19 AM
Ended up going with unique constraints on Tld, Ip, and Domain (using a domainkey field which is a concat of root and subdomain) and now the batch insert finishes in ~10 secs which is fine!
nope apparently this didn't fix it. Example input:
PROFILE
UNWIND {tld:"com", Domain:"test2",address:"1.2.37.4",Subdomain:"test2", Domainkey: "test2.test2",Rdn:"luz", Port:443} as row
Merge (tld:Tld{tld:row.tld})
Merge (ip:Ip{address:row.address})
ON CREATE SET ip.rdn = row.Rdn SET ip.port = row.Port
Merge (domain:Domain{domainkey:row.Domainkey})
ON CREATE SET domain.domain = row.Domain, domain.subdomain = row.Subdomain
Merge (domain)-[:tld]->(tld)
Merge (domain)<-[:host]-(ip)
plan:
47 seconds to enter one row. Is it because the relationships are anonymous? It's using all the unique index lookups I've created so it doesn't seem to be the lookup that's the issue.
04-14-2022 01:19 AM
Just a note and I doubt this is the cause, but on your ip merge you are setting the ‘ip’ and ‘rdn’ attributes in separate SET clauses. You can see in the profile that the second SET is separate and always done separately from the first. It is not associated with the MERGE clause. Did you mean the following instead:
ON CREATE SET ip.rdn = row.Rdn, ip.port = row.Port
Just an observation.
All the sessions of the conference are now available online