Heads up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.
08-30-2019 09:08 AM
I am loading a large data file into a Neo4j database, and I see a problem when using apoc.merge.node inside the apoc.periodic.iterate() procedure. I have run this query several times, and it has never completed. The apparent reason is that the apoc.load.csv() procedure keeps running in the background and never streams its data to the apoc.merge.node() procedure.
The version of Neo4j I'm using is:
Neo4j Desktop 1.1.17
Version: 3.5.3 Enterprise
Settings:
dbms.memory.heap.initial_size=2G
dbms.memory.heap.max_size=4G
apoc.import.file.enabled=true
apoc.import.file.use_neo4j_config=false
The procedure I'm running combines apoc.periodic.iterate with two other APOC procedures: apoc.load.csv to load the CSV file and apoc.merge.node to merge the nodes.
CALL apoc.periodic.iterate(
  "CALL apoc.load.csv('/data/all_data_now.csv') YIELD map RETURN map",
  "CALL apoc.merge.node(map.NODE_NAME, {uuid: map.NODE_ID},
     {indicies: split(map.NODE_INDICIES, ','),
      data: split(map.NODE_DATA, ','),
      line: map.NODE_LINE})",
  {batchSize:10000, iterateList:true, parallel:true})
When I ran a different version that didn't need apoc.merge.node, because I had initially created separate files each containing only one type of node, it executed fine. I used a Python script to generate a MERGE statement with the proper node label. The new version requires that I pick the correct node label from a column in the CSV file. This is the old call, which worked:
CALL apoc.periodic.iterate(
  "CALL apoc.load.csv('/data/all_data_now.csv') YIELD map RETURN map",
  "MERGE (n:ICD9 {uuid: map.NODE_ID})
   SET n.indicies = split(map.NODE_INDICIES, ','),
       n.data = split(map.NODE_DATA, ','),
       n.line = map.NODE_LINE",
  {batchSize:10000, iterateList:true, parallel:true})
Data File:
I created the CSV data file from an open-source file by processing it in Python and combining everything into a single large file. I was able to read the whole file with Python pandas and determined that it has four columns and over 124 million rows.
I've utilized apoc.periodic.iterate() many times to load large CSV files, and it has always worked well.
What causes the procedure not to execute?
Brett Taylor
08-30-2019 05:39 PM
It should be streaming.
But you don't need apoc.load.csv; you can just use LOAD CSV WITH HEADERS
as your driving statement.
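An untested sketch of what that reworked call could look like (file path and column names taken from the query above; note that apoc.merge.node expects its labels argument as a list, so the label is wrapped in brackets here, and the procedure's result is consumed with YIELD):

```cypher
CALL apoc.periodic.iterate(
  // LOAD CSV streams rows directly as the driving statement
  "LOAD CSV WITH HEADERS FROM 'file:///data/all_data_now.csv' AS map RETURN map",
  "CALL apoc.merge.node([map.NODE_NAME], {uuid: map.NODE_ID},
     {indicies: split(map.NODE_INDICIES, ','),
      data: split(map.NODE_DATA, ','),
      line: map.NODE_LINE}) YIELD node
   RETURN count(node)",
  {batchSize:10000, iterateList:true})
```

The parallel:true option is omitted here, per the other suggestion in this thread about lock contention.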
08-30-2019 05:40 PM
Remove the parallel:true; it could be that merges are contending for locks on the same Label(id) combination across threads.
You have constraints for all your Label + ID combos, right?
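For reference, a uniqueness constraint of that kind in Neo4j 3.5 syntax looks like the following (the :ICD9 label and uuid property are taken from the queries above; every other label used by apoc.merge.node would need its own constraint):

```cypher
CREATE CONSTRAINT ON (n:ICD9) ASSERT n.uuid IS UNIQUE
```

Besides enforcing uniqueness, such a constraint backs each MERGE lookup with an index, which matters at 124 million rows.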
08-30-2019 08:06 PM
I'll rewrite this tomorrow to use the traditional load method you just recommended. Over the last few hours, I rewrote my data extraction system to place each type of node in a separate file, and wrote another Python script to load each file. There are now over 400 files, and I'm seeing different problems with several of them. Some of the relatively large files run into a failure similar to the one with the single large file. A smaller file is having issues with the last column of a line missing its closing double quote (e.g. "Unterminated quoted field at the end of the file"). I'm running a new version of the Python code so that the quote problem will go away. Thanks for your ideas; I'll give feedback once I get it working tomorrow.
08-31-2019 03:27 AM
I meant only for the driving statement, not overall.