
Neo4j import error - "There is insufficient memory for the Java Runtime Environment to continue" - 2.3 TB dataset

I am trying to import 2.3 TB of data onto an EC2 box with 8 cores, 120 GB of RAM, and an 8 TB SSD. I have been able to load smaller datasets but am now scaling up to a larger one. The command to invoke the import is:

~/../../../usr/bin/neo4j-admin import  \
--nodes "import/uids-header.csv,import/uid_no.*"  \
--nodes "import/age-header.csv,import/age_no.*"  \
--nodes "import/gender-header.csv,import/gender_no.*"  \
--nodes "import/ip-header.csv,import/ip_no.*"  \
--nodes "import/device-header.csv,import/device_no.*"  \
--nodes "import/os-header.csv,import/os_no.*"  \
--nodes "import/browser-header.csv,import/browser_no.*"  \
--nodes "import/identitylink-header.csv,import/idlink_no.*"  \
--nodes "import/opti-header.csv,import/opti_no.*"  \
--nodes "import/bluekai-header.csv,import/bk_no.*"  \
--nodes "import/acxiom-header.csv,import/axm_no.*"  \
--nodes "import/adobe-header.csv,import/adb_no.*"  \
--nodes "import/lr-header.csv,import/lr_no.*"  \
--nodes "import/viant-header.csv,import/vnt_no.*"  \
--nodes "import/ga-header.csv,import/ggl_no.*"  \
--nodes "import/segment-header.csv,import/seg_no.*"  \
--nodes "import/email-header.csv,import/email_no.*"  \
--nodes "import/country-header.csv,import/cntry_no.*"  \
--nodes "import/citystate-header.csv,import/city_no.*"   \
--relationships:OBSERVED_WITH "import/rels-header.csv,import/opti_li.*,import/idlink_li.*,import/bk_li.*,import/axm_li.*,import/adb_li.*,import/lr_li.*,import/vnt_li.*,import/ggl_li.*,import/seg_li.*,import/email_li.*"  \
--relationships:VISITED_ON "import/rels-header.csv,import/device_li.*,import/os_li.*,import/browser_li.*"  \
--relationships:VISITED_FROM "import/rels-header.csv,import/city_li.*,import/cntry_li.*,import/ip_li.*"  \
--relationships:IDENTIFIED_AS "import/rels-header.csv,import/gender_li.*,import/age_li.*"  \
--ignore-duplicate-nodes=true  \
--ignore-missing-nodes=true  \
--delimiter="~"  \
--max-memory=95%
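
The header files themselves are not posted in the thread, so the following is only an assumed shape (field names, label values, and the data-file sample are made up), using the '~' delimiter and the import tool's :ID / :LABEL / :START_ID / :END_ID header tokens:

import/uids-header.csv (assumed):
uid:ID~:LABEL
import/uid_no.000 (assumed data rows, no header, '~'-delimited):
8f3a2c1d~Uid
import/rels-header.csv (assumed):
:START_ID~:END_ID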

Please provide the following information if you ran into a more serious issue:

  • neo4j version: Community 3.4.9
  • neo4j.log and debug.log:
    There is insufficient memory for the Java Runtime Environment to continue.
    Native memory allocation (mmap) failed to map 224919552 bytes for committing reserved memory.
    Possible reasons:
      - The system is out of physical RAM or swap space
      - In 32 bit mode, the process size limit was hit
    Possible solutions:
      - Reduce memory load on the system
      - Increase physical memory or swap space
      - Check if swap backing store is full
      - Use 64 bit Java on a 64 bit OS
      - Decrease Java heap size (-Xmx/-Xms)
      - Decrease number of Java threads
      - Decrease Java thread stack sizes (-Xss)
      - Set larger code cache with -XX:ReservedCodeCacheSize=
    This output file may be truncated or incomplete.

Out of Memory Error (os_linux.cpp:2657), pid=7928, tid=0x00007fbd2e13c700

JRE version: OpenJDK Runtime Environment (8.0_181-b13) (build 1.8.0_181-b13)
Java VM: OpenJDK 64-Bit Server VM (25.181-b13 mixed mode linux-amd64 compressed oops)
Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
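
For context, this is the JVM failing to get more memory from the operating system, not an ordinary Java heap OutOfMemoryError: the heap plus the importer's off-heap buffers (sized here by --max-memory=95%) plus the OS itself exceeded the 120 GB available. One mitigation sketch, with illustrative rather than tuned values, is to cap the heap explicitly and leave the off-heap buffers more head-room:

export HEAP_SIZE=20g        # read as the JVM max heap by the neo4j-admin wrapper script
neo4j-admin import --max-memory=80% --nodes "import/uids-header.csv,import/uid_no.*"   # remaining --nodes/--relationships arguments as in the command above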

1 ACCEPTED SOLUTION

IMPORT DONE in 18h 51m 44s 165ms.
Imported:
7553667978 nodes
29805914822 relationships
18671681291 properties
Peak memory usage: 92.45 GB

Thanks for everyone's help


8 REPLIES

--ignore-duplicate-nodes=true is known to slow down the import and to have a much higher memory footprint. If possible, try to get rid of duplicates before running the import.
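
One way to do that on the exported files themselves, assuming duplicate rows are byte-identical (the scratch path and output name below are illustrative):

sort -u -T /mnt/scratch import/city_no.* > import/city_no.dedup

The corresponding --nodes entry would then point at the de-duplicated file instead of the original glob.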

That would not be possible given the data, as some of the IDs across different node sets inherently share the same value, which was causing an error and required me to have the importer ignore duplicate nodes.
I am handling this per Max Demarzi's suggestion here: https://maxdemarzi.com/2012/02/28/batch-importer-part-2/
I have used row_number sequentially on the tables for the different node sets to ensure a unique numerical ID, which I will load into the DB with id-type ACTUAL, and I will no longer use --ignore-duplicate-nodes. Hopefully this reduces the memory needed and speeds up the import, as it was taking 4.5 hours just to hit a brick wall.
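
On the import side, the only change that approach implies is the --id-type flag (ACTUAL tells the tool to use the supplied numeric values directly as the internal node IDs). A hedged sketch, with the other arguments unchanged:

neo4j-admin import --id-type=ACTUAL --delimiter="~" --max-memory=90% \
  --nodes "import/uids-header.csv,import/uid_no.*"
  # ...remaining --nodes/--relationships arguments as before, with --ignore-duplicate-nodes dropped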

Are there other processes running that consume memory? 120 GB might also be a bit on the low side.
Can you share the output of the tool?

@michael.hunger there are no other processes running. I am reformatting the data per Max's suggestion to see if ordering it with ACTUAL node IDs, generated via a row_number over the distinct values, helps.

@mpviolet it fails about 45% of the way into stage 1/4 (node import). It took around 4.5 hours to hit the error; it never reached the relationship stage.

Does it fail before even starting to import, or midway through? Can you include the printout from the import run? How much heap do you give it?

I had a critical realization: in my CSV export from Redshift I failed to perform a DISTINCT on one of the node sets, namely citystate. This meant I had duplicates in the range of 5.4 billion, since every record had a city recorded, and it broke the import in the duplicate-handling step, as stated previously.

Key discovery: make sure to double-check your data before loading it with the import process.
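
For completeness, the de-duplication can also be pushed into the export itself; a hedged sketch of a DISTINCT UNLOAD run through psql (the connection URL, column names, bucket, and IAM role are placeholders):

psql "$REDSHIFT_URL" -c "
  UNLOAD ('SELECT DISTINCT citystate_id, city, state FROM citystate')
  TO 's3://example-bucket/import/city_no.'
  IAM_ROLE 'arn:aws:iam::000000000000:role/example-unload-role'
  DELIMITER '~' ALLOWOVERWRITE;"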

Despite fixing the UNLOAD command in Redshift and double-checking my data, I found that 120 GB of RAM is not enough for 2.2 TB of data. I extended to 244 GB of RAM, and it is now on stage 3/4 (relationship linking) after a load time of around 10 hours. I did follow Max's link about using ACTUAL IDs; I'm not sure whether it sped up the process, but at least it has almost loaded the 33 billion nodes, which is the max limit of Neo4j Community.
