cancel
Showing results for 
Search instead for 
Did you mean: 

Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.

Using Nested `LOAD CSV` with `apoc.periodic.iterate`

I am suing LOAD CSV and apoc.periodic.iterate to import data. I have one .csv that contains the folder and file names that contain the data to be imported. Below is the code I currently have:

CALL apoc.periodic.iterate("
LOAD CSV WITH HEADERS FROM 'file:///NBIFilesListFiles.csv' AS line1
WITH collect('file:///'+line1.Folder+'/'+line1.File) as Files
UNWIND Files AS file
LOAD CSV WITH HEADERS FROM file AS row RETURN row",
"
MERGE (state:State {id: row.STATE_CODE_001})
MERGE (state)<-[:OF_STATE]-(county:County {id: row.COUNTY_CODE_003})
MERGE (county)<-[:OF_COUNTY]-(place:Place {id: row.PLACE_CODE_004})
MERGE (place)<-[:OF_PLACE]-(bridge:Bridge {id: row.STRUCTURE_NUMBER_008})
ON CREATE SET bridge.name = row.STRUCTURE_NUMBER_008,
			  bridge.latitude = row.LAT_016,
			  bridge.longitude = row.LONG_017,
			  bridge.yearbuilt = row.YEAR_BUILT_027,
			  
			  bridge.deck_cond = row.DECK_COND_058,
			  bridge.superstructure_cond = row.SUPERSTRUCTURE_COND_059,
			  bridge.substructure_cond = row.SUBSTRUCTURE_COND_060,
			  bridge.channel_cond = row.CHANNEL_COND_061,
			  bridge.culvert_cond = row.CULVERT_COND_062,

			  bridge.structural_eval = row.STRUCTURAL_EVAL_067,
			  bridge.deck_geometry_eval = row.DECK_GEOMETRY_EVAL_068,
			  bridge.undclrence_eval = row.UNDCLRENCE_EVAL_069,
			  bridge.posting_eval = row.POSTING_EVAL_070,
			  bridge.waterway_eval = row.WATERWAY_EVAL_071,
			  bridge.appr_road_eval = row.APPR_ROAD_EVAL_072,
			  
			  place.name = row.PLACE_CODE_004,
			  county.name = row.COUNTY_CODE_003,
			  state.name = row.STATE_CODE_001
",
{batchSize:1000,iterateList:true})

My goal is to use LOAD CSV to load the .csv that contains all the other file names to import and pass that into apoc.periodic.iterate. Is that possible?

Right now what I have in the code above tends to bog down my machine and cause Neo4j Desktop to crash. I have adjusted these settings to try and help

dbms.memory.heap.initial_size=1G
dbms.memory.heap.max_size=2G
1 ACCEPTED SOLUTION

SOLUTION: (explanation below)

LOAD CSV WITH HEADERS FROM "https://docs.google.com/spreadsheets/d/1sFcY7LFBCGXSFG336UPoOf72BBv3bmv_AVaYLxwiV4A/export?format=csv&id=1sFcY7LFBCGXSFG336UPoOf72BBv3bmv_AVaYLxwiV4A&gid=1318941318" AS row1
WITH CASE
	WHEN NOT row1.Year IS NULL THEN collect(row1.URL)
    END AS fileURLs
UNWIND fileURLs as fileURL
CALL apoc.periodic.iterate(
'
LOAD CSV WITH HEADERS FROM $url AS row RETURN row
//CALL apoc.load.csv($url) YIELD map AS row RETURN row
','
MERGE (state:State {id: row.STATE_CODE_001})
MERGE (state)<-[:OF_STATE]-(county:County {id: row.COUNTY_CODE_003})
MERGE (county)<-[:OF_COUNTY]-(place:Place {id: row.PLACE_CODE_004})
MERGE (place)<-[:OF_PLACE]-(bridge:Bridge {id: row.STRUCTURE_NUMBER_008})
ON CREATE SET bridge.name = row.STRUCTURE_NUMBER_008,
			  bridge.latitude = row.LAT_016,
			  bridge.longitude = row.LONG_017,
			  bridge.yearbuilt = row.YEAR_BUILT_027,
			  
			  bridge.deck_cond = row.DECK_COND_058,
			  bridge.superstructure_cond = row.SUPERSTRUCTURE_COND_059,
			  bridge.substructure_cond = row.SUBSTRUCTURE_COND_060,
			  bridge.channel_cond = row.CHANNEL_COND_061,
			  bridge.culvert_cond = row.CULVERT_COND_062,
			  bridge.structural_eval = row.STRUCTURAL_EVAL_067,
			  bridge.deck_geometry_eval = row.DECK_GEOMETRY_EVAL_068,
			  bridge.undclrence_eval = row.UNDCLRENCE_EVAL_069,
			  bridge.posting_eval = row.POSTING_EVAL_070,
			  bridge.waterway_eval = row.WATERWAY_EVAL_071,
			  bridge.appr_road_eval = row.APPR_ROAD_EVAL_072,
			  
			  place.name = row.PLACE_CODE_004,
			  county.name = row.COUNTY_CODE_003,
			  state.name = row.STATE_CODE_001
',
{batchSize:1000, parallel:false, params:{url:fileURL}}) YIELD batches, total
RETURN batches, total

You will notice that //CALL apoc.load.csv($url) YIELD map AS row RETURN row is commented out. While that is supposed to work. One of the files at the URL I was calling was throwing an error, so at this point I had to use LOAD CSV instead.

EXPLANATION

The first part of the code

LOAD CSV WITH HEADERS FROM "https://docs.google.com/spreadsheets/d/1sFcY7LFBCGXSFG336UPoOf72BBv3bmv_AVaYLxwiV4A/export?format=csv&id=1sFcY7LFBCGXSFG336UPoOf72BBv3bmv_AVaYLxwiV4A&gid=1318941318" AS row1
WITH CASE
	WHEN NOT row1.Year IS NULL THEN collect(row1.URL)
    END AS fileURLs
UNWIND fileURLs as fileURL

loads the file URLs from the shared Google Sheet. The WITH clause is included to "ignore" rows where the file URL has not been verified. All that satisfy are put into a collection and then using the UNWIND passed in to the next batch of code contained within the apoc.periodic.iterate function. The code within apoc.periodic.iterate load the data for each url and add it to the graph using MERGE.

View solution in original post

5 REPLIES 5

If I recall we don't recommend using LOAD CSV within apoc.periodic.iterate(), but you can use apoc.load.csv()

@andrew.bowman so the recommendation is to use apoc.load.csv() inside of apoc.periodic.iterate()?

@andrew.bowman I somehow skipped over that in the APOC library. I have taken that and applied it as follows (I simplified the MERGE statements in my original post):

CALL apoc.periodic.iterate(
'
CALL apoc.load.csv({"https://docs.google.com/spreadsheets/d/1sFcY7LFBCGXSFG336UPoOf72BBv3bmv_AVaYLxwiV4A/export?format=csv&id=1sFcY7LFBCGXSFG336UPoOf72BBv3bmv_AVaYLxwiV4A&gid=1318941318"}) yield map as row return row
','
WITH row.URL as url
CALL apoc.periodic.iterate(
'
CALL apoc.load.csv({url}) yield map as row1 RETURN row1
','
MERGE (state:State {id: row1.STATE_CODE_001})
ON CREATE SET state.name = row1.STATE_CODE_001
',
{batchSize:1000, iterateList:true})
',
{batchSize:10, iterateList:true})

I am getting an error on the line CALL apoc.load.csv({url}) yield map as row1 RETURN row1 that reads
"Neo.ClientError.Statement.SyntaxError: Invalid input 'A': expected 'o/O' (line 8, column 2 (offset: 302))
"CALL apoc.load.csv({url}) yield map as row1 RETURN row1"

SOLUTION: (explanation below)

LOAD CSV WITH HEADERS FROM "https://docs.google.com/spreadsheets/d/1sFcY7LFBCGXSFG336UPoOf72BBv3bmv_AVaYLxwiV4A/export?format=csv&id=1sFcY7LFBCGXSFG336UPoOf72BBv3bmv_AVaYLxwiV4A&gid=1318941318" AS row1
WITH CASE
	WHEN NOT row1.Year IS NULL THEN collect(row1.URL)
    END AS fileURLs
UNWIND fileURLs as fileURL
CALL apoc.periodic.iterate(
'
LOAD CSV WITH HEADERS FROM $url AS row RETURN row
//CALL apoc.load.csv($url) YIELD map AS row RETURN row
','
MERGE (state:State {id: row.STATE_CODE_001})
MERGE (state)<-[:OF_STATE]-(county:County {id: row.COUNTY_CODE_003})
MERGE (county)<-[:OF_COUNTY]-(place:Place {id: row.PLACE_CODE_004})
MERGE (place)<-[:OF_PLACE]-(bridge:Bridge {id: row.STRUCTURE_NUMBER_008})
ON CREATE SET bridge.name = row.STRUCTURE_NUMBER_008,
			  bridge.latitude = row.LAT_016,
			  bridge.longitude = row.LONG_017,
			  bridge.yearbuilt = row.YEAR_BUILT_027,
			  
			  bridge.deck_cond = row.DECK_COND_058,
			  bridge.superstructure_cond = row.SUPERSTRUCTURE_COND_059,
			  bridge.substructure_cond = row.SUBSTRUCTURE_COND_060,
			  bridge.channel_cond = row.CHANNEL_COND_061,
			  bridge.culvert_cond = row.CULVERT_COND_062,
			  bridge.structural_eval = row.STRUCTURAL_EVAL_067,
			  bridge.deck_geometry_eval = row.DECK_GEOMETRY_EVAL_068,
			  bridge.undclrence_eval = row.UNDCLRENCE_EVAL_069,
			  bridge.posting_eval = row.POSTING_EVAL_070,
			  bridge.waterway_eval = row.WATERWAY_EVAL_071,
			  bridge.appr_road_eval = row.APPR_ROAD_EVAL_072,
			  
			  place.name = row.PLACE_CODE_004,
			  county.name = row.COUNTY_CODE_003,
			  state.name = row.STATE_CODE_001
',
{batchSize:1000, parallel:false, params:{url:fileURL}}) YIELD batches, total
RETURN batches, total

You will notice that //CALL apoc.load.csv($url) YIELD map AS row RETURN row is commented out. While that is supposed to work. One of the files at the URL I was calling was throwing an error, so at this point I had to use LOAD CSV instead.

EXPLANATION

The first part of the code

LOAD CSV WITH HEADERS FROM "https://docs.google.com/spreadsheets/d/1sFcY7LFBCGXSFG336UPoOf72BBv3bmv_AVaYLxwiV4A/export?format=csv&id=1sFcY7LFBCGXSFG336UPoOf72BBv3bmv_AVaYLxwiV4A&gid=1318941318" AS row1
WITH CASE
	WHEN NOT row1.Year IS NULL THEN collect(row1.URL)
    END AS fileURLs
UNWIND fileURLs as fileURL

loads the file URLs from the shared Google Sheet. The WITH clause is included to "ignore" rows where the file URL has not been verified. All that satisfy are put into a collection and then using the UNWIND passed in to the next batch of code contained within the apoc.periodic.iterate function. The code within apoc.periodic.iterate load the data for each url and add it to the graph using MERGE.

Just following up with some additional information about the websocket error I was encountering.

In several of the files there are internal ' and " within column entries. Using CALL apoc.load.csv(url,{header:true, quoteChar: "\u0000"}), in lieu of LOAD CSV WITH HEADERS, where \u0000 is a null character I was able to go through all the file rows and create nodes.

@jsmccrumb and @mdfrenchman helped me with figuring out how to use \u0000 to get around this issue.