08-26-2020 11:36 PM
Hello,
I'm trying to create or update nodes from a batch using the neo4j Python driver.
I have approximately 100k items in the batch.
Here is one item from the batch:
batch[0] = {'contactid': '1', 'gender': 'Mr', 'firstname': 'Marc', 'lastname': 'Brown', 'customerid': 'abc123', 'password': 'cbz', 'salutation': 'Marc', 'organizationid': '20', 'companyid': '100003.0', 'eipuserid': nan, 'email_address': 'xyz@hotmail.com', 'email2': nan, 'url': nan, 'academictitle': nan, 'jobtitle': 'Director'}
import time

from neo4j import GraphDatabase

driver = GraphDatabase.driver(uri, auth=(username, password))

def create_update_nodes(insert_batch, label, label_id):
    # The label and property key cannot be Cypher parameters, so they are
    # interpolated into the query text; the data itself is passed as $batch.
    query = """
    CALL apoc.periodic.iterate(
      'UNWIND $batch AS row RETURN row',
      'MERGE (n:{label} {{{label_id}: row.{label_id}}})
       ON MATCH SET n += row
       ON CREATE SET n += row',
      {{batchSize: 5000, iterateList: true, params: {{batch: $batch}}}})
    """.format(label=label, label_id=label_id)
    start_time = time.time()
    with driver.session() as session:
        result = session.run(query, batch=insert_batch)
    print(label + f" node insertion took {time.time() - start_time} s")
    return None
I have created constraints too.
The first 20-25k insertions are quite fast, but after that insertion becomes very slow. I have tried different batch sizes ranging from 500 to 25,000; it seems fastest in the 1,000 to 5,000 range.
What am I doing wrong in the function?
Thanks.
08-26-2020 11:40 PM
Hello @neo4j_noob and welcome to the Neo4j community
You should only use parameters; don't fill in your query like a string.
If you want to use different labels, have a look at the APOC functions.
Moreover, you should build your own batches and not use apoc.periodic.iterate() for loading data.
I did an example here.
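Roughly, a minimal sketch of that approach (an illustration only, assuming APOC is installed and reusing the driver setup from the question; the function name, batch size, and use of apoc.merge.node are my choices, not necessarily what the linked example does). The data stays in parameters, and apoc.merge.node takes care of the dynamic label and key:

from neo4j import GraphDatabase

driver = GraphDatabase.driver(uri, auth=(username, password))

def create_update_nodes(insert_batch, label, label_id, batch_size=5000):
    # Label, key, and rows are all parameters; apoc.merge.node performs the
    # MERGE with a dynamic label and identifying property.
    query = """
    UNWIND $rows AS row
    CALL apoc.merge.node([$label],
                         apoc.map.fromPairs([[$label_id, row[$label_id]]]),
                         row, row) YIELD node
    RETURN count(node)
    """
    with driver.session() as session:
        for i in range(0, len(insert_batch), batch_size):
            session.run(query, rows=insert_batch[i:i + batch_size],
                        label=label, label_id=label_id)

If the label is fixed, you can drop the APOC call and write the MERGE pattern directly in the query, keeping only the rows as a parameter.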
Regards,
Cobra
08-27-2020 06:04 AM
Hey, really appreciate your help.
Now I've gotten rid of APOC and am just executing the UNWIND query in a loop with a batch of 10k each time:
def create_update_nodes(self, insert_batch, label, label_id):
    with self.driver.session() as session:
        start_time = time.time()
        for i in range(0, len(insert_batch), 10000):
            query = """
            UNWIND $batch AS row
            MERGE (n:{label} {{{label_id}: row.{label_id}}})
            ON MATCH SET n += row
            ON CREATE SET n += row
            """.format(label=label, label_id=label_id)
            session.run(query, batch=insert_batch[i:i + 10000])
        print(f"Time taken for {label} = {time.time() - start_time}")
    return None
So now, for 50k nodes, ingestion took 3 minutes.
08-27-2020 11:03 AM
Remember that MERGE is like first doing a MATCH and, if the pattern doesn't exist, a CREATE. As such, you need to leverage indexes for best performance when using MATCH or MERGE to find or create starting nodes in the graph. Make sure you have supporting indexes for these operations, or you'll see a notable decrease in efficiency.
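For instance, a uniqueness constraint on the merge key also creates a backing index (a sketch assuming a recent Neo4j 4.x and the Contact label / contactid key from the question; the constraint name is illustrative, and in Neo4j 5.x the syntax is FOR ... REQUIRE instead of ON ... ASSERT):

from neo4j import GraphDatabase

driver = GraphDatabase.driver(uri, auth=(username, password))

# With the unique constraint in place, each MERGE does an index seek on
# contactid; without it, MERGE falls back to a label scan, which gets
# slower as the graph grows.
with driver.session() as session:
    session.run(
        "CREATE CONSTRAINT contact_id IF NOT EXISTS "
        "ON (n:Contact) ASSERT n.contactid IS UNIQUE"
    )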
08-27-2020 06:19 AM
Please do not use format() to fill in your query; use parameters as in the example I linked.