10-05-2018 04:56 AM
Hi there! I have a problem importing data. I have managed to import really huge datasets without any problems using the neo4j-admin import
tool, but I've run into an issue while importing one particular dataset.
The dataset contains only two fields per row: an id and a language code.
Here is a sample of that file http://joxi.ru/J2byBXpcXLyqbm
Here is the header file content:
:IGNORE languageCode:ID(LanguageCode-ID)
So the first field is ignored and the second field is processed as the ID.
Here is PaperLanguages.txt
847234 en
283432 fr
344533 en
Here is import.conf
--nodes:LanguageCode "PaperLanguages-header.txt,./src/PaperLanguages.txt"
--delimiter \9
--database test
--ignore-extra-columns
--quote \0
--high-io true
--id-type STRING
--ignore-missing-nodes true
--ignore-duplicate-nodes true
The import goes very fast at first but then stops and never finishes.
It hangs at this stage (the number of done batches does not change for hours):
Prepare node index
[*SORT----------------------------------------------------------------------------------------] 109M
Memory usage: 2.86 GB
Duration: 46m 29s 423ms
Done batches: 10911
.......... .......... .......... .......... .......... 5% ∆36m 14s 870ms
.......... .......... .......... .......... .......... 10% ∆2ms
.......... .......... .......... .......... .......... 15% ∆0ms
.......... .......... .......... .......... .......... 20% ∆0ms
.......... .......... .......... .......... .......... 25% ∆0ms
.......... .......... .......... .......... .......... 30% ∆1ms
.......... .......... .......... .......... .......... 35% ∆0ms
.......... .......... .......... .......... .......... 40% ∆0ms
.......... .......... .......... .......... .
I'm using Neo4j v3.4.8.
Does anybody have any ideas about what should be done to import this?
10-05-2018 01:18 PM
I've observed that --ignore-duplicate-nodes true
can cause performance issues with the importer. My strategy is to use external tooling (Unix text tools or something fancier) to ensure you don't have duplicate nodes before importing.
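One way to check whether a file has this problem before running the import is to count how often each ID value occurs. Below is a minimal sketch in Python, assuming a tab-delimited file with the ID in the second column (as in the header posted above). The function name count_duplicate_ids and the inline sample rows are only for illustration.

```python
from collections import Counter


def count_duplicate_ids(lines, id_column=1, delimiter="\t"):
    """Return {id: occurrences} for every ID that appears more than once.

    Assumes one record per line; lines with too few fields are skipped.
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        if len(fields) > id_column:
            counts[fields[id_column]] += 1
    return {key: n for key, n in counts.items() if n > 1}


# Sample rows in the shape of PaperLanguages.txt (tab-separated).
sample = ["847234\ten\n", "283432\tfr\n", "344533\ten\n"]
print(count_duplicate_ids(sample))
```

For a real file you could pass an open file handle instead of the list, for example count_duplicate_ids(open("PaperLanguages.txt", encoding="utf-8")). If the result is large relative to the number of distinct IDs, deduplicating before the import is probably worth it.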
10-08-2018 09:14 AM
Thanks! It looks like you're right. I've prepared the data using
sort -u -k2,2 PaperLanguages.txt > PaperLanguages-normalized.txt
After that, there were only 80 unique rows, so the import finished in less than one second.
Without this preparation, the import ran for more than 2 days without success.
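For anyone who prefers to do the same preparation in a script, here is a rough Python equivalent. It is a sketch, not a drop-in replacement for the sort command: it assumes a tab delimiter and the ID in the second column, and unlike sort -u it keeps rows in their original order and always keeps the first row seen for each ID. The name unique_by_column is made up for this example.

```python
def unique_by_column(lines, column=1, delimiter="\t"):
    """Yield only the first line seen for each distinct value in `column`.

    Lines with too few fields are dropped.
    """
    seen = set()
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        if len(fields) <= column:
            continue
        key = fields[column]
        if key not in seen:
            seen.add(key)
            yield line


# Sample rows in the shape of PaperLanguages.txt (tab-separated).
sample = ["847234\ten\n", "283432\tfr\n", "344533\ten\n"]
normalized = list(unique_by_column(sample))
print("".join(normalized), end="")
```

The function only needs an iterable of lines, so it can stream a large file line by line; memory use grows with the number of distinct IDs, not with the number of rows.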