03-01-2019 09:18 AM
I'm currently importing nodes and relationships from several CSV files with a total size of ~45 GB using neo4j-admin import. In the beginning all 4 CPU cores were used, but from (at least) 55% of the (1/4) Node import stage only one core is used. It has now been running for over 16 hours and is still at 60%. You can see that in the following console output:

[console output omitted]
I started the import with the following command:
./bin/neo4j-admin import \
--mode=csv \
--database=btctest.db \
--nodes $HEADERS/addresses-header.csv,$DATA/addresses.csv \
--nodes $HEADERS/blocks-header.csv,$DATA/blocks.csv \
--nodes $HEADERS/transactions-header.csv,$DATA/transactions.csv \
--relationships $HEADERS/before_rel-header.csv,$DATA/before_rel.csv \
--relationships $HEADERS/belongs_to_rel-header.csv,$DATA/belongs_to_rel.csv \
--relationships $HEADERS/receives_rel-header.csv,$DATA/receives_rel.csv \
--relationships $HEADERS/sends_rel-header.csv,$DATA/sends_rel.csv \
--ignore-missing-nodes=true \
--ignore-duplicate-nodes=true \
--multiline-fields=true \
--high-io=true
The headers look like the following (as examples, here are one node header and one relationship header):
node (transactions-header.csv):
txid:ID,:LABEL
relationship (sends_rel-header.csv):
:START_ID,value,:END_ID,:TYPE
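For illustration, data rows matching those headers would look like this (the values here are made up; the header files hold only the column names, so the data files contain just the values):
transactions.csv row:
tx1001,Transaction
sends_rel.csv row:
tx1001,0.5,addr42,SENDS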
Is it normal that Neo4j uses only one CPU core after a while? And is it normal for an import with the import tool to take this long? Do you have any recommendations on how to make this faster? By the way, I'm using an SSD.
03-01-2019 05:51 PM
Hmm, usually it is quite efficient.
Do you by chance have a lot of duplicate nodes in your data?
Btw, you can pass the main label/rel-type directly on the command line.
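For example (a sketch; :Transaction and :SENDS are assumed names), the label or relationship type goes on the option itself, and the :LABEL/:TYPE columns can then be dropped from the CSV files:
--nodes:Transaction $HEADERS/transactions-header.csv,$DATA/transactions.csv \
--relationships:SENDS $HEADERS/sends_rel-header.csv,$DATA/sends_rel.csv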
You could also try configuring less heap (e.g. 2G) and reserving the rest for the page cache.
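Something like this (a sketch; I'm assuming the 3.x neo4j-admin wrapper script reads its maximum JVM heap from the HEAP_SIZE environment variable):
export HEAP_SIZE=2G
./bin/neo4j-admin import ...   # rerun with the same flags as in your original command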
Can you get the "c" and "i" outputs in that stage?
03-02-2019 08:07 AM
Only addresses.csv can contain duplicates. I'll try to remove the duplicates before importing the data.
Yes, I realised that just after I had started generating the CSV files.
"c" and "i" work in that stage. This is the output ("i"):
03-03-2019 03:00 AM
It does not create an index at this stage. It's probably just the de-duplication.
I'll ask the devs about it; this is really not what it should look like. Your whole import should be done in a few minutes.
Also, if you look at your memory information, it seems there is not much available.
03-04-2019 12:19 AM
Answer from the team:
There are some special cases where the sorting in there isn't particularly optimal and only one thread gets the majority of the work.
So the solution for him would be to de-duplicate upfront with Unix tools for the time being.
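A sketch of that upfront de-duplication with standard Unix tools (this assumes addresses.csv has no header row, since the headers live in a separate file; pass the de-duplicated file to --nodes afterwards):
# keep only unique whole rows
sort -u $DATA/addresses.csv > $DATA/addresses-dedup.csv
# or, if only the ID column (field 1) must be unique, keep the first occurrence
awk -F',' '!seen[$1]++' $DATA/addresses.csv > $DATA/addresses-dedup.csv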
Sorry about that; it's something we're going to address going forward.
03-04-2019 03:29 AM
Removing the duplicates in advance did it. The import now took less than 30 minutes. Thank you for your help.