

Automate data upload process from AWS S3 to Neo4j CE running on EC2

Neo4j Community Edition is running on an EC2 instance, and I am looking to upload data from S3. I came across articles that told me to run "aws s3 presign ..." and then use the output link to load the data. I want to know how this process can be automated. Please advise.

My other question is: what is the best way to load a large number of CSV files from S3 into Neo4j? Right now I am using apoc.load.csv to fetch each file, with many MERGE commands in the loading operation. Should I create separate CSV files for nodes and relationships? Please suggest. Thanks in advance.

10 REPLIES

Hi,
Which part of the process did you want to automate? The S3 upload and configuration, or the Neo4j db build?

By far the fastest way to load data into neo4j is to use the bulk loader

neo4j-admin import

However, please note that the bulk loader cannot be used to add new data to an existing neo4j database, it can only build a new database from scratch. The bulk loader takes node and edge files (e.g. csv/tsv) as input, and they can be compressed files (e.g. node.tsv.gz).
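
To give a rough idea of what a bulk-load invocation looks like (Neo4j 4.x syntax; the labels, relationship type, file names and database name below are just placeholders, not anything from this thread):

# Rough sketch of a neo4j-admin bulk load (Neo4j 4.x syntax).
# Labels, relationship type, file names and database name are placeholders.
bin/neo4j-admin import \
  --database=neo4j \
  --nodes=Person=import/person_header.csv,import/person_nodes.csv.gz \
  --nodes=Company=import/company_header.csv,import/company_nodes.csv.gz \
  --relationships=WORKS_AT=import/works_at_header.csv,import/works_at_edges.csv.gz \
  --skip-bad-relationships=true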

I am building graphs from a collection of parquet files, so I was able to build an automated workflow that creates the bulk-load files (e.g. header files with the correct type assignments) and automatically generates the script to execute with the neo4j-admin command line, with all parameters (lengthy if you have a lot of node and edge files).

Note: I'm running multiple Neo4j instances inside docker containers on an EC2 instance.

If you must add to an existing neo4j database, then in order of increasing complexity, some options I'm familiar with are:

  • check out the apoc library; it has file-load procedures that offer more functionality and, I believe, are a fair bit faster than LOAD CSV.
  • or do something more complicated to hide/ease the burden of the load time (e.g. stream the data in, perhaps with Kafka, etc.)

When adding a collection of edges/nodes that would add a significant percentage of new nodes/edges to a database, it can be faster to rebuild from scratch than to add them; in my experience it has always been faster. The bulk loader is lightning fast in comparison to LOAD CSV.
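
For comparison, an incremental apoc-based load might look roughly like the sketch below (the CSV URL, label and property names are placeholders, and it assumes apoc is installed and allowed to load from URLs); apoc.periodic.iterate is one way to keep the transactions small:

# Rough sketch: batch an apoc.load.csv ingest through cypher-shell.
# The CSV URL, label and property names are placeholders.
CSV_URL='https://example-bucket.s3.amazonaws.com/people.csv?X-Amz-Signature=...'

bin/cypher-shell -a bolt://localhost:7687 -u "$NEO4J_USERNAME" -p "$NEO4J_PASSWORD" <<EOF
CALL apoc.periodic.iterate(
  "CALL apoc.load.csv('$CSV_URL', {header:true}) YIELD map RETURN map",
  "MERGE (p:Person {id: map.id}) SET p.name = map.name",
  {batchSize: 10000, parallel: false}
);
EOF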

Joel

Joel, I'm curious what your approach is for automated loading of data into Neo4j from within the Docker container when it's started. Does your script use a sleep command, or does it otherwise take the startup time into account before importing into the db? I have been looking for a good way to do this.

Hi gclindsey,

High-level process (no neo4j-related installations on the host, only docker):

  • First I start up a temporary/ephemeral neo4j docker container (this creates a blank db I don't use and can't use). I need to do this in order to use the neo4j-admin tool inside docker to bulk load into a separate, new neo4j db.
  • Once complete, I shut down this ephemeral neo4j container.
  • Then I start neo4j in docker again, but configured to use the newly created db (sketched below). It feels like a hack, but it works, and I haven't seen a better way (keeping to my docker-only approach).
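
Very roughly, and with the container names, volume paths, image tag and database name all as placeholders, that flow looks something like this:

# Rough sketch of the docker-only rebuild flow above; names, paths, image tag
# and database name are placeholders.

# 1. Ephemeral container (only needed to reach neo4j-admin inside the image).
docker run -d --name neo4j-ephemeral \
  -v /data/neo4j:/data -v /data/import:/import \
  --env NEO4J_AUTH=none \
  neo4j:4.2

# ... wait for it to come up (see the wait4bolt_outside script below) ...

# 2. Bulk load into a database name the ephemeral server is not using,
#    then throw the ephemeral container away.
docker exec neo4j-ephemeral bin/neo4j-admin import \
  --database=imported \
  --nodes=Person=/import/person_header.csv,/import/person_nodes.csv.gz \
  --relationships=KNOWS=/import/knows_header.csv,/import/knows_edges.csv.gz
docker stop neo4j-ephemeral && docker rm neo4j-ephemeral

# 3. Start the "real" container configured to serve the newly built database.
docker run -d --name neo4j-live \
  -v /data/neo4j:/data \
  -p 7474:7474 -p 7687:7687 \
  --env NEO4J_dbms_default__database=imported \
  neo4j:4.2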

There are times when the script needs to wait for neo4j to be ready, so I borrowed a script to wait for neo4j, but it didn't work; I borrowed another and it failed as well. I suspect you have experienced this too: many techniques you can find online to check whether neo4j is ready don't work reliably. For a while I resorted to sleep as well (after the checks), but any reasonable wait time will eventually fail on you too. Over time I've tried waiting inside the container and outside; it seems to be a bit of a challenge. A port test and a web-server response test can both pass while neo4j is still not ready.

NOTE: If I recall correctly, even this approach may have failed a few rare times; maybe the query can succeed before everything is fully functional? I'm still on the lookout for a better way.

For now I've resorted to what feels like a heavy test, but one I hoped would be reliable: I try to run a cypher query repeatedly until it works. I use the script like this:

# wait for ephemeral neo4j to go fully online before proceeding
# note: ephemeral is always on the default port
./wait4bolt_outside ${EPHEMERALCONTAINER} ${NEO_BOLT}

--- script file ---


#!/bin/bash
echo "wait for neo4j bolt to respond at port $1"

CONTAINER=$1
PORT=$2

if [ -z "${CONTAINER}" ] || [ -z "${PORT}" ]
  then
     echo "Usage:"
     echo "  wait4bolt_outside docker_container port"
     echo "  e.g. wait4bolt_outside neo_ag 7687"
     exit 1
fi

# this returns before server is ready
#    curl -i http://127.0.0.1:${PORT} 2>&1 | grep -c -e '200 OK' || break

# try an actual query as test?
docker exec -t \
  --env NEO4J_USERNAME=${NEO4J_USERNAME} \
  --env NEO4J_PASSWORD=${NEO4J_PASSWORD} \
  ${CONTAINER} \
  bash -c "until echo 'match (n) return count(n);' | bin/cypher-shell -a bolt://localhost:${PORT}; do echo $? ; sleep 1; done"

echo 'neo4j online!'

--- script file ---

Note: If there were a simple, lightweight way to test whether neo4j is ready and usable, I'd set up --health-cmd like I do on my database dockers, but for now I run my neo4j dockers without a --health-cmd.
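
For what it's worth, if the repeated cypher-shell probe does prove reliable, it could presumably be wired straight into docker's health check, roughly like this (credentials, ports and the image tag are placeholders):

# Rough sketch: the same cypher-shell probe used as a Docker health check.
# Credentials, ports and the image tag are placeholders.
docker run -d --name neo4j-live \
  -p 7474:7474 -p 7687:7687 \
  --env NEO4J_AUTH=neo4j/secret \
  --health-cmd "echo 'RETURN 1;' | bin/cypher-shell -a bolt://localhost:7687 -u neo4j -p secret || exit 1" \
  --health-interval 10s \
  --health-timeout 10s \
  --health-retries 10 \
  neo4j:4.2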

Hey Joel, thanks for the information. My challenge has been invoking a command to automatically run a cypher script (containing a LOAD CSV statement) from within the new instance, and it has to wait for Neo4j to start up when the container is launched. I might need to reprocess the CSV so that a different tool like neo4j-admin import can be used instead. Trying different things for now.

Perhaps you've seen these already but they sound similar to your case:

cheers

@Joel

Hi Joel,

How were you able to automate the execution of the neo4j-admin commands? I've been googling around and searching this forum for an answer to that question. Based on the quote above, I think you may have it.

You also state that,

When adding a collection of edges/nodes that would add a significant percentage of new nodes/edges to a database, it can be faster to rebuild from scratch than to add them; in my experience it has always been faster. The bulk loader is lightning fast in comparison to LOAD CSV.

Now that my database has grown to include 500,000,000 relationships, I am finding this to be completely true and accurate. I've detailed my struggle here:

Hi

Thank you for the reply.

I am looking to automate the step where I run "aws s3 presign file_link". This gives me a new link, which I then use in the load command to load the data. Is there a way to automate this process rather than running "aws s3 presign file_link" manually for each file and then loading the data?
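
In other words, something that would loop over the files, presign each one, and feed the resulting link into the load, roughly along these lines (the bucket, prefix, label and property names below are placeholders, assuming the AWS CLI and cypher-shell are available on the instance):

# Rough sketch: presign every CSV under an S3 prefix and load each one with
# apoc.load.csv. Bucket, prefix, label and property names are placeholders.
BUCKET=my-bucket
PREFIX=exports/csv/

for KEY in $(aws s3api list-objects-v2 --bucket "$BUCKET" --prefix "$PREFIX" \
               --query 'Contents[].Key' --output text); do
  URL=$(aws s3 presign "s3://${BUCKET}/${KEY}" --expires-in 3600)
  echo "Loading ${KEY}"
  echo "CALL apoc.load.csv('${URL}', {header:true}) YIELD map
        MERGE (n:Record {id: map.id}) SET n += map;" \
    | bin/cypher-shell -a bolt://localhost:7687 -u "$NEO4J_USERNAME" -p "$NEO4J_PASSWORD"
done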

Also, I need to mutate the existing graph multiple times a day, so based on what you said about the import tool, I will not be able to use it. Right now I am using apoc.load.csv to load the data.

Please let me know if there is any way to automate the process mentioned above.

Thanks

Hi,
You can use this Python ingest utility to load data into Neo4j from CSV or JSON files stored in S3, at a URL, or locally.

This is a simple utility my colleague and I built to help us ingest large amounts of data.

Regarding the update challenge: if you are currently doing one bulk update per day, you might try switching to making small changes as they occur throughout the day. Trickle updates might be more performant. Either way, I'd avoid committing large transactions.

You asked how to "automate the execution of the neo4j-admin commands": a cron job programmatically creates a script (from snippets, adding the part that is dynamic), then runs it (a bash script in my case).
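
Roughly, that pattern could look like the sketch below; all the paths, snippet names and the crontab entry are placeholders:

# Rough sketch of the "generate a script, then run it" pattern described above.
# All paths, snippet names and the crontab entry are placeholders.
#
# crontab entry, e.g. rebuild nightly at 02:00:
#   0 2 * * * /opt/graph/rebuild.sh >> /var/log/graph-rebuild.log 2>&1

OUT=/opt/graph/run_import.sh
# import_header.sh would hold the fixed start of the command, e.g.
#   "docker exec neo4j-ephemeral bin/neo4j-admin import \"
cat /opt/graph/snippets/import_header.sh > "$OUT"
for f in /data/import/nodes_*.csv.gz; do
  echo "  --nodes=/import/$(basename "$f") \\" >> "$OUT"
done
for f in /data/import/edges_*.csv.gz; do
  echo "  --relationships=/import/$(basename "$f") \\" >> "$OUT"
done
# import_footer.sh would hold the fixed tail, e.g. "  --database=imported"
cat /opt/graph/snippets/import_footer.sh >> "$OUT"
chmod +x "$OUT" && "$OUT"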

Joel,
I am trying to use neo4j-admin import with signed S3 URLs for the node files. Is it documented anywhere how to do this? The URLs themselves have been tested in a browser and resolve properly. If I have the files locally (not in S3) the command runs, but if I use the signed S3 URLs instead, the neo4j-admin import command does not parse them properly.
It gives an error like this:
Invalid value for option '--nodes' at index 0 ([<label>[:<label>]...=]<files>): Invalid nodes file: https://mutualmarkets-development.s3.amazonaws.com/etl-pipline/prepared/iva/SKiQVn/csv/Show/iva-show-200-300_headers.csv?AWSAccessKeyId=AKIAZGQ4RUHNHAD5ZCV5&Expires=1622398364&Signature=p1r5S90N1XbDw4uhqxQeFdZEnVY%3D,https://mutualmarkets-development.s3.amazonaws.com/etl-pipline/prepared/iva/SKiQVn/csv/Show/iva-show-0-100.csv?AWSAccessKeyId=AKIAZGQ4RUHNHAD5ZXX5&Expires=1622398392&Signature=QnVt%2FJvcn%2BcN8QtIzClL9njDDrU%3D (java.lang.IllegalArgumentException: File 'AKIAZGQ4RUHNHAD5ZXX5&Expires=1622398364&Signature=p1r5S90N1XbDw4uhqxQeFdZEnVY%3D' doesn't exist)

I've not tried that myself; I wonder whether it can accept HTTP addresses for the files at all. If it can, then in my experience this is usually an escaped-character issue, either in the value given or in processing somewhere.

You could sync the files locally and then use them, right?
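
For example, roughly (the bucket and key names come from the error message above; the local import path and database name are placeholders):

# Rough sketch of the "sync locally, then import" route. The bucket and keys
# come from the error message above; the local path and database name are
# placeholders.
aws s3 sync \
  "s3://mutualmarkets-development/etl-pipline/prepared/iva/SKiQVn/csv/" \
  /data/import/iva/

bin/neo4j-admin import \
  --database=iva \
  --nodes=Show=/data/import/iva/Show/iva-show-200-300_headers.csv,/data/import/iva/Show/iva-show-0-100.csv
  # ...plus the remaining --nodes/--relationships arguments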