Heads up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.

How to batch JSON records using the APOC library for better importing?

damisg7
Node Clone

I want to import a large JSON dataset (~750,000 records) using the APOC library and create nodes and relationships. My script was running slowly, and the Neo4j community suggested batching the records for better efficiency. First of all, which method should I use: apoc.periodic.commit or apoc.periodic.iterate? For example, when I use commit, it executes without doing anything.

// Insert CPEs and CPEs Children - Cypher Script
UNWIND ["nvdcpematch-1.0.json"] AS files

CALL apoc.periodic.commit("
  CALL apoc.load.json($files) YIELD value

  // Insert Base Platform
  UNWIND value.matches AS value_cpe
  WITH value_cpe LIMIT $limit
  MERGE (cpe:CPE {
    uri: value_cpe.cpe23Uri
  })

  // Insert Children
  FOREACH (value_child IN value_cpe.cpe_name |
    MERGE (child:CPE {
      uri: value_child.cpe23Uri
    })
    MERGE (cpe)-[:parentOf]->(child)
  )", {parallel:true, files:files, limit:2000}
) YIELD updates RETURN updates

Hi, @damisg7 !

You should try apoc.periodic.iterate(). It loads your data in transactional batches and can run them in parallel. Heap memory is released after every batch, so the load completes faster.

An example of usage is:

CALL apoc.periodic.iterate(
  'CALL apoc.load.jdbc("jdbc:mysql://localhost:3306/northwind?user=root","company")',
  'CREATE (p:Person) SET p += value',
  { batchSize:10000, parallel:true})
RETURN batches, total

I tried several combinations of this, but I always end up with an out-of-memory exception. I tried different batch sizes, different ways to split the query, etc. However, I think this approach is the right one!

// Insert CPEs and CPEs Children - Cypher Script
UNWIND ["nvdcpematch-1.0.json"] AS files

CALL apoc.periodic.iterate("
  CALL apoc.load.json($files) YIELD value",

  "// Insert Base Platform
  UNWIND value.matches AS value_cpe
  MERGE (cpe:CPE {
    uri: value_cpe.cpe23Uri
  })

  // Insert Children
  FOREACH (value_child IN value_cpe.cpe_name |
    MERGE (child:CPE {
      uri: value_child.cpe23Uri
    })
    MERGE (cpe)-[:parentOf]->(child)
  )", {parallel:true, batchSize:10000, params:{files:files}}
) YIELD batches, total RETURN batches, total

The exception I get is shown below. I have 8 GB of RAM, with dbms.memory.heap.initial_size=1.5G, dbms.memory.heap.max_size=3G, and dbms.memory.pagecache.size=1.5G.

Failed to invoke procedure `apoc.periodic.iterate`: Caused by: java.lang.OutOfMemoryError: Java heap space
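One thing that may help here (a suggestion on my part, not something confirmed in this thread): apoc.load.json accepts a JSON path as its second argument, so the driving statement can stream one element of the top-level matches array at a time instead of materializing the whole ~750,000-record document as a single value on the heap. A sketch, assuming the same file and structure as the script above:

// Stream one element of "matches" per row via a JSON path,
// so the outer statement never holds the entire file in memory.
CALL apoc.periodic.iterate("
  CALL apoc.load.json($file, '$.matches[*]') YIELD value",
  "MERGE (cpe:CPE {uri: value.cpe23Uri})
   FOREACH (value_child IN value.cpe_name |
     MERGE (child:CPE {uri: value_child.cpe23Uri})
     MERGE (cpe)-[:parentOf]->(child)
   )",
  {batchSize:10000, parallel:false, params:{file:'nvdcpematch-1.0.json'}}
) YIELD batches, total RETURN batches, total

Note that parallel is set to false here: parallel batches that MERGE overlapping :CPE nodes and relationships can deadlock on node locks, so it is safer to keep this load sequential.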

Hello @damisg7

I assume the property uri is unique, so did you create a UNIQUE CONSTRAINT on this property beforehand?
Normally, your data will load faster with APOC when a UNIQUE CONSTRAINT exists, since MERGE can then use the backing index instead of scanning all :CPE nodes.
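For reference, assuming the label and property are :CPE and uri as in the script above, the constraint would be created like this (the first form is the Neo4j 3.x/4.x syntax current when this thread was written; the second is the newer 4.4+ syntax):

CREATE CONSTRAINT ON (c:CPE) ASSERT c.uri IS UNIQUE;

// Neo4j 4.4 and later
CREATE CONSTRAINT cpe_uri IF NOT EXISTS
FOR (c:CPE) REQUIRE c.uri IS UNIQUE;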

Regards,
Cobra

I already have it, thanks!