Neo4j

gabrielsantos_i · ‎02-21-2019

Hello guys,

I'm using the apoc.periodic.iterate to create a graph from a csv file containing 10k rows (to start off), but I think it's taking longer than it should to finish the command (around 50 seconds). Here is how the entire command looks like and the image of the results:

CALL apoc.periodic.iterate("LOAD CSV WITH HEADERS FROM 'file:///vendas_cupom_subset.csv' AS file RETURN file", 
"MERGE (n:Cliente {cpf: file.CPF}) MERGE (m:Cupom {codigoCupom: file.NUMEROUNICOCUPOM}) MERGE (o:Produto {codigoProduto: file.CODIGOPRODUTO}) MERGE (p:Filial {codigoFilial:file.CODIGOFILIAL})  MERGE (n)-[:FEZ_PEDIDO {dataMovimento: file.DATAMOVIMENTO}]->(m) MERGE (m)-[:CONTEM {qtd:file.QTD_PRODUTO}]->(o) MERGE (o)-[:VENDIDO]->(p)", 
{batchSize: 1000, iterateList:true, parallel: false});

I'm running it on neo4j desktop. My computer is aN i5 vpro 2.50 ghz, 8gb ram and 200 gb ssd. I didn't alter the configurations to increase the heap/ram memory, so it's default. I also tested running the same command on cypher-shell, but the difference was very little. Also tried changing the batchsize to 100 and 10k, but it got worse.

I'm really concerned because I'll have to create the same graph, reading from the same csv containing millions of lines, and it doesn't seem very performatic.

How can I improve my query, having in mind that in the future i'll have to insert millions of registers.?

here's how my csv file looks like:

If I set the parallel parameter 'true', it inserts faster, but with failed operations:

stefan_armbrust · ‎02-23-2019

parallel: true makes sense if you don't have overlapping locks in concurrently running batches. Otherwise they have to wait for each other, which effectively serializes the parallel operations.

Please take a look a log/debug.log regarding suspicious messages there. The NullPointerException indicates some unexpected behaviour.

View solution in original post

stefan_armbrust · ‎02-22-2019

Since you do MERGE on nodes during import, be sure to have indexes in place before running the import. In your case:

create index on :Cliente(cpf);
create index on :Cupom(codigoCupom);
create index on :Produto(codigoProduto);
create index on :Filial(codigoFilial);

Run them either one-by-one or in one block with the multi-statement editor config enabled. Then try the import again.

gabrielsantos_i · ‎02-22-2019

Thank you very much! The time decreased from 50 seconds to 1 second.

Do you know if it's possible to run the command with "parallel:true" without the failed operations? If not, what is causing the failures?

Thanks again, in advance.

stefan_armbrust · ‎02-23-2019

parallel: true makes sense if you don't have overlapping locks in concurrently running batches. Otherwise they have to wait for each other, which effectively serializes the parallel operations.

Please take a look a log/debug.log regarding suspicious messages there. The NullPointerException indicates some unexpected behaviour.

gabrielsantos_i · ‎02-23-2019

Alright, thank you very much.

michael_hunger · ‎04-04-2019

Also you need to check your heap + page-cache config for your large import.
heap should be 2-8G and PC should be according to neo4j-admin memrec but best as large as your expected store. You can build a small store and multiply the size.

You can increase page-size to 100k. with parallel:false in your case.

Neo4j

How can I improve the performance of this query?