Neo4j

tanner989 · ‎04-11-2022

Hello everyone,

I'm a bit puzzled because I have what seems to be a very simple graph model and insert batch query that is taking multiple minutes to run.

Here is the query:

UNWIND $batch as row 
    Merge (tld:TLD{tld:row.tld})
    Merge (ip:Ip{address:row.address})
    Merge (domain:Domain{root:row.root,sub:row.subdomain})
    Merge (domain)-[:tld]->(tld)
    Merge (domain)<-[:host]-(ip)

The model (as is evident from the query) is a group of TlD nodes (such as Tld{tld:"com"}) attached via a :root relationship to Domain nodes (such as Domain{domain:"foo", subdomain:"www"}) which are attached to Ip nodes (such as `Ip{address:"1.1.1.1"}.

Here are the explain and profile png's for single inserts:

explain:

profile:

If it helps here's the golang code I'm using to generate the query:
MapResults converts the result structs into a generic map structure to satisfy the tx.Run method.

_, err := sess.Session.WriteTransaction(func(tx neo4j.Transaction) (interface{}, error) {
		result, err := tx.Run(
			"UNWIND $batch as row  "+
				"Merge(tld:Tld{tld:row.Tld}) "+
				"Merge(ip:Ip{ip:row.Address, rdn: row.Rdn}) "+
				"Merge(domain:Domain{domain:row.Domain, subdomain:row.Subdomain}) "+
				"Merge (domain)-[:root]->(tld) "+
				"Merge (domain)<-[:host]-(ip) ",
			map[string]interface{}{"batch": MapResults(results)})
		return result, err
	})

glilienfield · ‎04-11-2022

Look at the beginning of the two branches that are orange. It indicates each starts with a node label scan, I,e, it is scanning all nodes with the label to filter on the predicate in the next stage. Looks like the are lots of nodes for these two labels. You may want to try adding an index to those two labels and corresponding properties.

dana_canzano · ‎04-11-2022

@tanner989

From the profile it does not appear you have any indexes to support the Merge statements. Since a MERGE is effectively a create or update, if it is to do an update than having an index will improve the performance.

MERGE - Neo4j Cypher Manual describes MERGE and states

For performance reasons, creating a schema index on the label or property is highly recommended when using MERGE. See Indexes for search performance for more information.

I might suggest creating indexes on

:TLD(tld)
:Ip(address)
:Domain(root, sub)

tanner989 · ‎04-11-2022

The IP address index helped speed it up a bit but for some reason my domains index isn't being used.

CREATE INDEX domains_index_name FOR (d:Domain) ON (d.domain, d.subdomain)

New profile:

Does it just take some time to propagate?

dana_canzano · ‎04-11-2022

@tanner989

is there a index on :TLD?

tanner989 · ‎04-11-2022

@dana.canzano yup,
7 "tld_index" "ONLINE" 100.0 "NONUNIQUE" "BTREE" "NODE" ["Tld"] ["tld"] "native-btree-1.0"

I'm not too concerned about TLD node lookups since there's only going to be <100 of them. Its the Domain node index that would speed it up the most.

This is odd. It uses the domain index for searches but not on inserts.

profile match (t:Tld{tld:"com"})<-[r]-(d:Domain{domain:"att", subdomain:""}) return t,d profile:

glilienfield · ‎04-11-2022

The first profile shows 'sub', not 'subdomain' as the 'domain' attribute used in the filter predicate.

Are you being consistent with your attribute names?

tanner989 · ‎04-11-2022

@glilienfield ..........I need to get more sleep.......

tanner989 · ‎04-11-2022

Ended up going with unique constraints on Tld, Ip, and Domain (using a domainkey field which is a concat of root and subdomain) and now the batch insert finishes in ~10 secs which is fine!

nope apparently this didn't fix it. Example input:

PROFILE
UNWIND {tld:"com", Domain:"test2",address:"1.2.37.4",Subdomain:"test2", Domainkey: "test2.test2",Rdn:"luz", Port:443} as row 
    Merge (tld:Tld{tld:row.tld})
    Merge (ip:Ip{address:row.address})
    ON CREATE SET ip.rdn = row.Rdn SET ip.port = row.Port 
    Merge (domain:Domain{domainkey:row.Domainkey})
    ON CREATE SET domain.domain = row.Domain, domain.subdomain = row.Subdomain 
    Merge (domain)-[:tld]->(tld)
    Merge (domain)<-[:host]-(ip)

plan:

47 seconds to enter one row. Is it because the relationships are anonymous? It's using all the unique index lookups I've created so it doesn't seem to be the lookup that's the issue.

glilienfield · ‎04-14-2022

Just a note and I doubt this is the cause, but on your ip merge you are setting the ‘ip’ and ‘rdn’ attributes in separate SET clauses. You can see in the profile that the second SET is separate and always done separately from the first. It is not associated with the MERGE clause. Did you mean the following instead:

ON CREATE SET ip.rdn = row.Rdn, ip.port = row.Port

Just an observation.

Neo4j

Seemingly simple batch query with 2500 rows takes over a minute to insert