Neo4j

busymo16 · ‎04-08-2022

Hi everyone,
I have a query like this:

WITH event AS data
MERGE (i:Individual {id: data.id})
SET
i.corporateTitleEnumId = data.corporateTitleEnumId,
i.costCenter = data.costCenter,
i.department = data.department,
i.externalId = data.externalId,
i.email = data.email,
i.emailVerified = data.emailVerified,
i.fax = data.fax,
i.firstName = data.firstName,
i.genderEnumId = data.genderEnumId,
i.languageEnumId = data.languageEnumId,
i.lastName = data.lastName,
i.middleName = data.middleName,
i.mobile = data.mobile,
i.officeNumber = data.officeNumber,
i.phone1 = data.phone1,
i.phone2 = data.phone2,
i.pCreated = data.pCreated,
i.pModified = data.pModified,
i.titleEnumId = data.titleEnumId,
i.userId = data.userId,
i.userPrincipalName = data.userPrincipalName,
i.uuid = data.uuid
WITH i, data
MERGE (a:Address{id: COALESCE(data.addressId,'')})
MERGE (i)-[:HAS_ADDRESS]->(a)
MERGE (e:ExternalID:AGO{value: COALESCE(data.agoPersonId,''), type: "Person ID"})
MERGE (i)-[:HAS_EXTERNAL_ID]->(e)
MERGE (c:Customer{id: COALESCE(data.customerId,'')})
MERGE (i)<-[:IS_RELATED_WITH]-(c)
MERGE (e1:ExternalID:IAS{value: COALESCE(data.iasvSubjectId,''), type: "Subject ID"})
MERGE (i)-[:HAS_EXTERNAL_ID]->(e1)
MERGE (e2:ExternalID:IAS{value: COALESCE(data.iasPersonId,''), type: "Person ID"})
MERGE (i)-[:HAS_EXTERNAL_ID]->(e2)
MERGE (e3:ExternalID:ILM{value: COALESCE(data.ilmEmployeeId,''), type: "Employee ID"})
MERGE (i)-[:HAS_EXTERNAL_ID]->(e3)
MERGE (e4:ExternalID:IPP{value: COALESCE(data.ippContactId,''), type: "Contact ID"})
MERGE (i)-[:HAS_EXTERNAL_ID]->(e4)
MERGE (e5:ExternalID:SAP{value: COALESCE(data.sapContactId,''), type: "Contact ID"})
MERGE (i)-[:HAS_EXTERNAL_ID]->(e5)
MERGE (e6:ExternalID:SFR{value: COALESCE(data.sfrIndividualId,''), type: "Individual ID"})
MERGE (i)-[:HAS_EXTERNAL_ID]->(e6)
MERGE (e7:ExternalID:SYM{value: COALESCE(data.symvContactId,''), type: "Contact ID"})
MERGE (i)-[:HAS_EXTERNAL_ID]->(e7)
MERGE (e8:ExternalID:SYM{value: COALESCE(data.symContactId,''), type: "Contact ID"})
MERGE (i)-[:HAS_EXTERNAL_ID]->(e8)
MERGE (e9:ExternalID:TIS{value: COALESCE(data.tisContactId,''), type: "Contact ID"})
MERGE (i)-[:HAS_EXTERNAL_ID]->(e9)

I get the data from Kafka but it is the case that there might be duplicates and null values. Since this query is very heavy (regarding performance) the load time is extremely high. Does anyone have any idea on how to optimise this query while considering duplicate and null values removal? I tried to use apoc.periodic.iterate but it does not seem to have a huge impact. Thank you in advance!

glilienfield · ‎04-08-2022

That is a crazy difference. Do you have indexes on many or all of the keys you are merging on?

View solution in original post

glilienfield · ‎04-08-2022

You could try something like the following. If the attributes are all the same between node(i) and data, then you can set i = data. The call blocks will not attempt the merges if the attribute is null. Since i = data, you probably only need to pass 'I' in the 'with' clauses, but all your code uses data, so I passed it so you could just cut/paste what you have into the call blocks and make minor adjustments. You need the double 'with' clauses in the call blocks, as it complains about a 'where' clause following the first 'with' clause, so it's a work around.

Maybe it will help.

WITH event AS data
MERGE (i:Individual {id: data.id})
SET i = data
WITH i, data
CALL {
    with i, data
    with i, data
    where data.addressid IS NOT NULL
    MERGE (a:Address{id: data.addressId})
    MERGE (i)-[:HAS_ADDRESS]->(a)
}
// add all the other attributes to be set using the same call pattern 
CALL {
    with i, data
    with i, data
    where data.tisContactId IS NOT NULL
    MERGE (e9:ExternalID:TIS{value: data.tisContactId,  type: "Contact ID"})
    MERGE (i)-[:HAS_EXTERNAL_ID]->(e9)
}

busymo16 · ‎04-08-2022

Thank you for your answer. I will give it a try. Just to create and idea that while I am using the MERGE everywhere alongside with apoc.periodic.itearate, it takes around 10-20 mins to create/merge 30,000 nodes. While when I just use CREATE only it takes 5-10 mins to create 10 million nodes and 8 million relationships. I guess MERGE is supposed to be slower but not this much slow or am I wrong?

glilienfield · ‎04-08-2022

That is a crazy difference. Do you have indexes on many or all of the keys you are merging on?

busymo16 · ‎04-08-2022

I had indexes on some of them but not all of the properties used for merging. Now, the data import is faster (reasonable time compared with CREATE).

Neo4j

Exhaustive query optimization