Neo4j

brett · ‎08-30-2019

I am loading a large data file into a Neo4j database and run into a problem when calling apoc.merge.node inside apoc.periodic.iterate(). I have run this query several times, and it never finishes. The reason appears to be that apoc.load.csv() keeps running in the background and never passes the data on to apoc.merge.node().

The version of Neo4j I'm using is:

Neo4j Desktop - 1.1.17
Version: 3.5.3 Enterprise

Settings:

dbms.memory.heap.initial_size=2G
dbms.memory.heap.max_size=4G  
apoc.import.file.enabled=true 
apoc.import.file.use_neo4j_config=false

The query uses apoc.periodic.iterate together with two other APOC procedures: apoc.load.csv to read the CSV file and apoc.merge.node to merge the nodes.

CALL apoc.periodic.iterate(
  "CALL apoc.load.csv('/data/all_data_now.csv') yield map return map",
  "CALL apoc.merge.node(map.NODE_NAME, {uuid: map.NODE_ID},
     {indicies: split(map.NODE_INDICIES, ','),
      data: split(map.NODE_DATA, ','),
      line: map.NODE_LINE})",
  {batchSize: 10000, iterateList: true, parallel: true})

An earlier version ran fine. It did not need apoc.merge.node, because I had initially created separate files that each contained only one type of node, and I used a Python script to generate a MERGE with the proper node label. The new version has to take the node label from a column in the CSV file. This is the old call, which worked:

CALL apoc.periodic.iterate( "CALL apoc.load.csv('/data/all_data_now.csv') yield map return map",
 "  MERGE (n:ICD9{uuid:map.NODE_ID}) 
         SET n.indicies = split(map.NODE_INDICIES,',') ,
             n.data = split(map.NODE_DATA,',') ,
             n.line = map.NODE_LINE "

Data File:
I created the CSV data file from an open-source file by processing it in Python and writing the result into a single large file. I was able to read the whole file with pandas and determined that it has four columns and over 124 million rows.

NODE 124091330
NODE_ID 124091330
NODE_INDICIES 124091327
NODE_DATA 49309879
dtype: int64

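For anyone who wants to reproduce per-column counts like these without holding 124 million rows in memory, here is a minimal sketch using only the Python standard library. It streams the file one row at a time and counts non-empty values per column, which should roughly correspond to the pandas counts above (pandas treats empty cells as missing by default). The function name `count_non_empty` and the tiny sample are made up for illustration; they are not the real data.

```python
import csv
import io
from collections import Counter


def count_non_empty(lines):
    """Count non-empty values per column, reading one row at a time."""
    reader = csv.DictReader(lines)
    counts = Counter({name: 0 for name in reader.fieldnames or []})
    for row in reader:
        for name, value in row.items():
            # Skip surplus fields (key None) and missing or empty cells.
            if name is not None and value not in (None, ""):
                counts[name] += 1
    return dict(counts)


# Tiny made-up sample, not the real data file.
sample = io.StringIO(
    "NODE,NODE_ID,NODE_INDICIES,NODE_DATA\n"
    'ICD9,1,"a,b","x,y"\n'
    "ICD9,2,c,\n"
)
print(count_non_empty(sample))
```

With a real file, pass an open file object (opened with `newline=""`) instead of the `StringIO` sample.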
I looked into the problem by running the procedure dbms.listQueries().

The query list shows that the apoc.load.csv call is running with cypher runtime=slotted and keeps running, and the main statement never executes because it is waiting for the slotted query to complete.

I've utilized apoc.periodic.iterate() many times to load large CSV files, and it has always worked well.
What causes the procedure not to execute?
Brett Taylor

michael_hunger · ‎08-30-2019

It should be streaming.

But you don't need apoc.load.csv; you can just use LOAD CSV WITH HEADERS as your driving statement.
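A minimal sketch of that suggestion, written as a Python helper since the loading in this thread is already scripted in Python. It only swaps the driving statement for LOAD CSV WITH HEADERS and keeps apoc.periodic.iterate and apoc.merge.node. Several details are assumptions rather than something confirmed in the thread: the helper name `build_iterate_query`, the file URL (LOAD CSV normally resolves `file:///` URLs against the import directory), passing the label as a one-element list, and the trailing `YIELD node RETURN count(*)`. The `parallel` flag defaults to off here as a conservative choice.

```python
def build_iterate_query(csv_url, batch_size=10000, parallel=False):
    """Build an apoc.periodic.iterate call driven by LOAD CSV WITH HEADERS."""
    # Driving statement: plain LOAD CSV instead of apoc.load.csv.
    driving = f"LOAD CSV WITH HEADERS FROM '{csv_url}' AS map RETURN map"
    # Action statement: label list and YIELD are assumptions, see the note above.
    action = (
        "CALL apoc.merge.node([map.NODE_NAME], {uuid: map.NODE_ID}, "
        "{indicies: split(map.NODE_INDICIES, ','), "
        "data: split(map.NODE_DATA, ','), line: map.NODE_LINE}) "
        "YIELD node RETURN count(*)"
    )
    parallel_text = "true" if parallel else "false"
    config = "{batchSize: %d, iterateList: true, parallel: %s}" % (
        batch_size,
        parallel_text,
    )
    return f'CALL apoc.periodic.iterate("{driving}", "{action}", {config})'


# Hypothetical URL: assumes the file was copied into the import directory.
print(build_iterate_query("file:///all_data_now.csv"))
```

The printed statement can be pasted into the browser or sent through a driver. The column names are the ones used in the original query and may need adjusting to the actual CSV header.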

michael_hunger · ‎08-30-2019

Remove parallel:true; it could be that the constraints are waiting on locks for the Label(id) combinations across threads.

You have constraints for all your Label + ID combos, right?
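With hundreds of node types, typing those constraints by hand is tedious, so here is a small sketch that generates one uniqueness-constraint statement per label. It assumes the Neo4j 3.5 syntax (`CREATE CONSTRAINT ON ... ASSERT ... IS UNIQUE`) and that `uuid` is the identifying property, as in the queries above. The helper name `constraint_statements` and the label `ICD10` are hypothetical; only `ICD9` appears in this thread.

```python
def constraint_statements(labels, key="uuid"):
    """Build one uniqueness-constraint statement per distinct label."""
    statements = []
    for label in sorted(set(labels)):
        # Backtick-quote the label so unusual names remain valid Cypher.
        escaped = label.replace("`", "``")
        statements.append(
            f"CREATE CONSTRAINT ON (n:`{escaped}`) ASSERT n.{key} IS UNIQUE;"
        )
    return statements


# Example labels; in practice they would come from the NODE column of the CSV.
for statement in constraint_statements(["ICD9", "ICD10", "ICD9"]):
    print(statement)
```

Running the generated statements before the load means each MERGE on (label, uuid) can use an index lookup instead of a label scan.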

brett · ‎08-30-2019

I'll rewrite this tomorrow to use the traditional load method you just recommended. Over the last few hours, I rewrote my data extraction system to place each type of node in a separate file, and wrote another Python script to load each file. There are now over 400 files, and I'm seeing different problems with several of them. Some of the relatively large files fail in a similar way to the single large file. A smaller file has a problem with the last column of a line not having the correct closing double quote (e.g. “Unterminated quoted field at the end of the file”). I'm running a new version of the Python code so that the quote problem will go away. Thanks for your ideas, and I'll give feedback once I get it working tomorrow.
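For the quoting issue, one common way to avoid hand-built quotes is to let Python's csv module write the rows. This is only a sketch with made-up values, not the actual extraction script: `write_rows` and `read_rows` are hypothetical helpers, and the round trip merely shows that fields containing commas or double quotes come back unchanged. Whether the result matches every expectation of the Neo4j CSV reader (for example around backslashes) is not verified here.

```python
import csv
import io


def write_rows(rows, out):
    """Write rows as CSV and let the csv module decide what to quote."""
    writer = csv.writer(out, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerows(rows)


def read_rows(text):
    """Parse CSV text back into a list of rows."""
    return list(csv.reader(io.StringIO(text)))


# Made-up rows: the last field contains both a comma and double quotes.
rows = [
    ["NODE", "NODE_ID", "NODE_INDICIES", "NODE_DATA"],
    ["ICD9", "1", "a,b", 'say "x", then y'],
]
buffer = io.StringIO()
write_rows(rows, buffer)
print(buffer.getvalue())
print(read_rows(buffer.getvalue()) == rows)
```

Embedded double quotes are written doubled inside a quoted field, which is the standard CSV convention.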

michael_hunger · ‎08-31-2019

I meant only for the driving statement, not overall.

APOC load csv file with call apoc.merge.node does not execute