
Should LOAD CSV or UNWIND be used in a MERGE situation with two EC2 servers (complexity vs. performance)?

(Using the official Neo4j Python driver.) I am wondering whether UNWIND or LOAD CSV should be used in a situation where there are two EC2 servers: one hosts a Neo4j instance, and the other runs ETL scripts. Using LOAD CSV would be rather complex, since I am reading CSVs from S3, cleaning them, writing new CSVs (re-uploading them to S3), and then pre-signing URLs so LOAD CSV can fetch them. With UNWIND there would be no pre-signing or re-uploading of CSV files, just a basic UNWIND Cypher statement. Below are sample code segments; the LOAD CSV one requires a CSV file to be referenced via $link. Performance-wise, UNWIND seems to be slow even with indexes, but it is less complex than LOAD CSV.

# unwind
def largeUnwind(driver, batch, label):
    # labels cannot be Cypher parameters, so the label is concatenated
    # into the query string; batch is passed as a real parameter
    query = "UNWIND $batch AS props MERGE (n:" + label + " {name: props.name}) SET n = props"
    with driver.session() as session:
        session.run(query, batch=batch)

# load csv
def largeInsert(driver, body, label):
    # $link is a real parameter (the CSV's URL); the label is concatenated
    query = "LOAD CSV WITH HEADERS FROM $link AS line MERGE (s:" + label + " {name: line.name}) SET s = line"
    with driver.session() as session:
        session.run(query, link=body)


Hi Richard,

Welcome to the forum.

I wonder if you would see better performance by adjusting the size of the batch you pass to UNWIND; a sketch of that follows below. You can also try adding USING PERIODIC COMMIT 5000 (or some other number) in front of the LOAD CSV.
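
For illustration, a minimal client-side batching sketch; the helper name and the 5000-row chunk size are assumptions, not something from the original post:

def unwindInChunks(driver, rows, label, chunk_size=5000):
    # send the rows in smaller slices so each auto-commit
    # transaction keeps a modest heap footprint
    query = "UNWIND $batch AS props MERGE (n:" + label + " {name: props.name}) SET n = props"
    with driver.session() as session:
        for i in range(0, len(rows), chunk_size):
            session.run(query, batch=rows[i:i + chunk_size])

The periodic-commit variant is just the same LOAD CSV statement with USING PERIODIC COMMIT 5000 prepended; note it must run in an auto-commit transaction (session.run), since USING PERIODIC COMMIT is not allowed inside an explicit transaction.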

Can you tell me what your heap and page cache settings are?

Thank you

Hi David,

Thanks for the reply. Below are my current settings for heap and page cache. I have tried USING PERIODIC COMMIT with LOAD CSV, and it seems to work faster than unwinding batches. My other question is whether UNWIND or LOAD CSV is the better fit when my Python script lives on one EC2 server and the Neo4j database on the other. The reason is that LOAD CSV has limitations: it seems it can only read CSV files through pre-signed URLs, or the files have to sit on the same server as the Neo4j database (see the pre-signing sketch after the settings).

dbms.memory.pagecache.size=512m

dbms.memory.heap.initial_size=3G
dbms.memory.heap.max_size=5G
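
For reference, the pre-signing step is roughly this (a boto3 sketch; the bucket and key names below are placeholders):

import boto3

def presignedCsvUrl(bucket, key, expires=3600):
    # returns a time-limited https:// URL that LOAD CSV can fetch
    # directly, so the file never has to live on the Neo4j server
    s3 = boto3.client("s3")
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires,
    )

link = presignedCsvUrl("my-etl-bucket", "clean/data.csv")
largeInsert(driver, link, "Person")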

Are the two EC2 instances in the same region?
If the file is available from S3 via https://, LOAD CSV can read it from another server.
You'll note I specify the fields one by one so that I can convert the types; SET s = line would always store every property as a string.
e.g.
LOAD CSV WITH HEADERS FROM 'https://s3.eu-north-1.amazonaws.com/com.neo4j.gds.browser-guide/data/battles.csv' AS row
MERGE (b:Battle {name: row.name})
ON CREATE SET b.year = toInteger(row.year),
    b.summer = row.summer,
    b.major_death = row.major_death,
    b.major_capture = row.major_capture,
    b.note = row.note,
    b.battle_type = row.battle_type,
    b.attacker_size = toInteger(row.attacker_size),
    b.defender_size = toInteger(row.defender_size);

A couple of suggestions.
Make the initial and max heap the same to avoid garbage-collection overhead when the heap grows.
I make the page cache the same size as the database on disk if possible.
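
Applied to the settings above, that would look something like this (the page-cache value is only a placeholder; size it to your store files on disk):

dbms.memory.heap.initial_size=5G
dbms.memory.heap.max_size=5G
dbms.memory.pagecache.size=4G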