08-26-2020 11:36 PM
Hello,
I'm trying to create or update nodes from a batch using the neo4j Python driver.
I have approximately 100k items in the batch.
Here is one item from the batch:
batch[0] = {'contactid': '1', 'gender': 'Mr', 'firstname': 'Marc', 'lastname': 'Brown', 'customerid': 'abc123', 'password': 'cbz', 'salutation': 'Marc', 'organizationid': '20', 'companyid': '100003.0', 'eipuserid': nan, 'email_address': 'xyz@hotmail.com', 'email2': nan, 'url': nan, 'academictitle': nan, 'jobtitle': 'Director'}
import time

from neo4j import GraphDatabase

driver = GraphDatabase.driver(uri, auth=(username, password))

def create_update_nodes(insert_batch, label, label_id):
    # The label and property key cannot be Cypher parameters, so they are
    # interpolated into the query text; the data itself is passed as $batch.
    query = """
    CALL apoc.periodic.iterate(
      'UNWIND $batch AS row RETURN row',
      'MERGE (n:{label} {{{label_id}: row.{label_id}}})
       ON MATCH SET n += row
       ON CREATE SET n += row',
      {{batchSize: 5000, iterateList: true, params: {{batch: $batch}}}})
    """.format(label=label, label_id=label_id)
    start_time = time.time()
    with driver.session() as session:
        result = session.run(query, batch=insert_batch)
    print(label + f" node insertion took {time.time() - start_time} s")
    return None
I have created constraints too.
The first 20-25k insertions are quite fast, but after that insertion becomes very slow. I have tried different batch sizes ranging from 500 to 25,000; it seems fastest in the 1,000 to 5,000 range.
What am I doing wrong in the function?
Thanks.
08-26-2020 11:40 PM
Hello @neo4j_noob and welcome to the Neo4j community
You should only use parameters; don't fill in your query like a string.
If you want to use different labels, have a look at the APOC functions.
Moreover, you should build your own batches and not use apoc.periodic.iterate() for loading data.
I did an example here.
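Roughly, a minimal sketch of that approach (an illustration only, assuming APOC is installed and reusing the driver setup from the question; the function name, batch size, and use of apoc.merge.node are my choices, not necessarily what the linked example does). The data stays in parameters, and apoc.merge.node takes care of the dynamic label and key:

from neo4j import GraphDatabase

driver = GraphDatabase.driver(uri, auth=(username, password))

def create_update_nodes(insert_batch, label, label_id, batch_size=5000):
    # Label, key, and rows are all parameters; apoc.merge.node performs the
    # MERGE with a dynamic label and identifying property.
    query = """
    UNWIND $rows AS row
    CALL apoc.merge.node([$label],
                         apoc.map.fromPairs([[$label_id, row[$label_id]]]),
                         row, row) YIELD node
    RETURN count(node)
    """
    with driver.session() as session:
        for i in range(0, len(insert_batch), batch_size):
            session.run(query, rows=insert_batch[i:i + batch_size],
                        label=label, label_id=label_id)

If the label is fixed, you can drop the APOC call and write the MERGE pattern directly in the query, keeping only the rows as a parameter.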
Regards,
Cobra
08-27-2020 06:04 AM
Hey, really appreciate your help.
Now I've gotten rid of APOC and am just executing the UNWIND query in a loop with a batch of 10k each time:
def create_update_nodes(self, insert_batch, label, label_id):
    with self.driver.session() as session:
        start_time = time.time()
        for i in range(0, len(insert_batch), 10000):
            query = """
            UNWIND $batch AS row
            MERGE (n:{label} {{{label_id}: row.{label_id}}})
            ON MATCH SET n += row
            ON CREATE SET n += row
            """.format(label=label, label_id=label_id)
            session.run(query, batch=insert_batch[i:i + 10000])
        print(f"Time taken for {label} = {time.time() - start_time}")
    return None
So now, for 50k nodes, ingestion took 3 minutes.
08-27-2020 11:03 AM
Remember that MERGE is like first doing a MATCH and, if the pattern doesn't exist, a CREATE. As such, you need to leverage indexes for best performance when using MATCH or MERGE to find or create starting nodes in the graph. Make sure you have supporting indexes for these operations, or you'll see a notable decrease in efficiency.
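For instance, a uniqueness constraint on the merge key also creates a backing index (a sketch assuming a recent Neo4j 4.x and the Contact label / contactid key from the question; the constraint name is illustrative, and in Neo4j 5.x the syntax is FOR ... REQUIRE instead of ON ... ASSERT):

from neo4j import GraphDatabase

driver = GraphDatabase.driver(uri, auth=(username, password))

# With the unique constraint in place, each MERGE does an index seek on
# contactid; without it, MERGE falls back to a label scan, which gets
# slower as the graph grows.
with driver.session() as session:
    session.run(
        "CREATE CONSTRAINT contact_id IF NOT EXISTS "
        "ON (n:Contact) ASSERT n.contactid IS UNIQUE"
    )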
08-27-2020 06:19 AM
Please do not use format() to fill in your query; use parameters as in the example I linked.