

AWS Glue + Neo4j : Tutorials?

I'm learning AWS Glue (essentially managed Spark, from what I understand) and I'd like to use Neo4j as my destination. I have a bunch of JSON in S3 that I'm hoping to process. Does anyone know if this pipeline is possible: [S3] -> [AWS Glue] -> [Neo4j]? I'm hoping I can just follow the write-ups on Spark and they'll transfer over. If anyone has any resources they can point me to, I'd appreciate it! I'm new to the whole Spark ecosystem, so I know I have a lot of googling ahead of me.

1 ACCEPTED SOLUTION

I finally got around to spending some more time on this project today and found some success that I wanted to share for anyone else who may come across this post in their education.

  • Neo4j Spark documentation - a helpful starting point
  • Download the latest release of the connector from GitHub and upload it to an S3 bucket.
  • I also downloaded the GraphFrames jar and uploaded it to the same S3 bucket.

AWS Glue Job
I made a Scala job because that's what the examples are written in (to do: figure out the Python equivalent).

Dependent Jars
Include the two jars, comma separated.

Parameters
This was the tricky part: AWS only lets you specify a key once, and they don't encourage passing in --conf settings, but that's how the Neo4j connector wants its connection parameters. So I specified a single --conf key and just kept chaining more configs in the value:

Key: --conf
Value: spark.neo4j.bolt.url=bolt://mydomain.com:7687 --conf spark.neo4j.bolt.user=neo4j --conf spark.neo4j.bolt.password=password

The Neo4j documentation says you can embed the user and password in the URL parameter, but I could never get that to work.

That's all you need as far as specifying the Glue job. For the actual code of my job, I did just a very basic read and printed the results.

import com.amazonaws.services.glue.ChoiceOption
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.MappingSpec
import com.amazonaws.services.glue.ResolveSpec
import com.amazonaws.services.glue.errors.CallSite
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._

import org.neo4j.spark._
import org.graphframes._

object GlueApp {
  def main(sysArgs: Array[String]) {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    // @params: [JOB_NAME]
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)
    
    
    // Neo4j() picks up the spark.neo4j.bolt.* connection settings passed in via --conf
    val neo = Neo4j(spark)

    // Load the (Person)-[:KNOWS]->(Person) pattern as a GraphFrame
    val graphFrame = neo.pattern(("Person","id"),("KNOWS",null), ("Person","id")).partitions(3).rows(1000).loadGraphFrame

    graphFrame.vertices.show()
  }
}

Most of this is just the boilerplate that AWS provides when making a new Scala job. Not knowing Scala and being new to Spark in general, it took me some trial and error to get all of this figured out. But it runs successfully, and in the CloudWatch logs I can see values from my database printed!

Things still left to do:

  • Figure out how to do this in Python
  • How to write data and do general manipulations with GraphFrames
  • Manage connection information passed in as parameters (rough sketch below)
  • How you would manage two database connections if you needed them
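For the connection-info item, my untested guess (using hypothetical --NEO4J_URL, --NEO4J_USER and --NEO4J_PASSWORD job parameters, resolved the same way JOB_NAME is) is that the top of main would change to something like this, building the SparkConf before the context is created so the connector picks the values up:

import org.apache.spark.SparkConf

// Hypothetical job parameters: --NEO4J_URL, --NEO4J_USER, --NEO4J_PASSWORD
val args = GlueArgParser.getResolvedOptions(sysArgs,
  Seq("JOB_NAME", "NEO4J_URL", "NEO4J_USER", "NEO4J_PASSWORD").toArray)

// Feed the resolved values into the spark.neo4j.bolt.* settings instead of chaining them onto --conf
val conf = new SparkConf()
  .set("spark.neo4j.bolt.url", args("NEO4J_URL"))
  .set("spark.neo4j.bolt.user", args("NEO4J_USER"))
  .set("spark.neo4j.bolt.password", args("NEO4J_PASSWORD"))

val spark: SparkContext = new SparkContext(conf)
val glueContext: GlueContext = new GlueContext(spark)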


15 REPLIES

Did you get anywhere with how to do this in Python/PySpark?

Not yet. Life got busy, I had to put the project down, and I haven't picked it back up yet. If you figure it out, please post and let me know.

To @VGVG and @mike.r.black: we're working on an updated Spark connector. You can currently find it in the Spark connector repo on GitHub, on the 4.0 branch. We're going to put out a pre-release with full documentation on September 30th. If you're interested in testing and providing feedback, please drop me a note at david DOT allen @ neo4j DOT com.

The overall API is changing to use the DataSource API within Spark, and it's going to be quite a bit nicer.

Link to the updated release is right here -- would love to get any feedback from those playing with these systems. https://github.com/neo4j-contrib/neo4j-spark-connector/releases/tag/4.0.0-pre1
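For anyone curious, a minimal read with the DataSource API looks roughly like the sketch below. This is untested in Glue; the bolt URL, credentials and :Person label are placeholders, and the option names match the write example further down this thread:

import org.apache.spark.sql.SparkSession

// In a Glue job this would normally come from glueContext.getSparkSession
val sparkSession: SparkSession = SparkSession.builder().getOrCreate()

// Read :Person nodes through the DataSource API (url, credentials and label are placeholders)
val people = sparkSession.read
  .format("org.neo4j.spark.DataSource")
  .option("url", "bolt://mydomain.com:7687")
  .option("authentication.type", "basic")
  .option("authentication.basic.username", "neo4j")
  .option("authentication.basic.password", "password")
  .option("labels", ":Person")
  .load()

people.show()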

Hello,

Following this thread, has anyone found a successful way to get Glue to work with Python? I have successfully:

  1. incorporated the Neo4j driver
  2. placed both jar files in the appropriate S3 buckets and pointed the job at their locations
  3. set the --conf job parameter on the AWS job with the string:
    - spark.neo4j.bolt.url=bolt://"hostname":7687 --conf spark.neo4j.bolt.user=neo4j --conf spark.neo4j.bolt.password=password
    - with no quotation marks in the actual value

but I still get this error when the AWS Glue job runs:
java.net.UnknownHostException: "hostname": Name or service not known

where "hostname" is a host (already approved through our firewall) similar to hey-neo4j-opm.cth.compn.net

Any help is fully appreciated.

Neo4j Version: 4.4.1
Spark Version: 3.1.1+amzn.0
Python 3.7

I am thinking that I just need to convert the small Scala Glue job above (the one ending in graphFrame.vertices.show()) to Python code.

Thoughts?

Any help is fully appreciated!

Hi Mike,

I keep getting a syntax error while running your script above as an AWS Glue job:

object GlueApp {
^
SyntaxError: invalid syntax

Could you please let me know what's wrong?

Also, is it possible to use AWS Glue to get just the Neo4j properties/schema?

Thanks,
Charles

Here are some things I'd verify:

  • You're using AWS Glue Scala and not AWS Glue Python
  • You have the dependent Jars specified when creating the job
  • You've set the job parameter for --conf

To test that at least the jars and configuration are loading correctly, you can comment out that object and just print out "hello world", something like the skeleton below. Then slowly start uncommenting code until things break.
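Something like this minimal skeleton (the same boilerplate as the job above with everything Neo4j-related stripped out) should run on its own if the jars and the --conf parameter are wired up correctly:

import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._

object GlueApp {
  def main(sysArgs: Array[String]) {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)

    // If this prints in the CloudWatch logs, the job definition itself is fine;
    // start adding the Neo4j code back in from here
    println("hello world")
  }
}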

Are you asking if you can use Cypher to get the Neo4j database schema? The same information as if you were to execute this in the Neo4j Browser?

call db.schema()
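If that's the goal, the same connector used in the job above can run arbitrary Cypher and hand the results back as a DataFrame. A rough, untested sketch (the Cypher here only pulls the distinct node labels, so swap in whatever schema query you actually need):

// Same setup as in the job above
val neo = Neo4j(spark)

// Pull the distinct node labels back as a DataFrame (a very rough stand-in for "schema")
val labels = neo
  .cypher("MATCH (n) UNWIND labels(n) AS label RETURN DISTINCT label")
  .loadDataFrame

labels.show()

// Procedures such as db.schema.nodeTypeProperties() could be swapped into the Cypher
// above if you also need property names and types (depends on your Neo4j version)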

Hi Mike,

Thanks for your quick reply! I am currently working on a project to use AWS Glue to collect database schemas on AWS. For example, we use an AWS Glue crawler job to collect RDS MySQL schema information, such as table names, column names, data types, etc.
One of our application teams is using Neo4j, so we are wondering whether we can do something similar for that database technology too, and that is how I found your article on the internet. Could you please let us know whether that is possible?

Thank you in advance!

Charles

I'm sure it's possible. I'm still new to GraphFrames and working with AWS Glue (i.e. Apache Spark), so I don't have specific instructions for how to do it.

If I may, I'd question the choice of AWS Glue for collecting database metadata. Glue (Apache Spark) is good for large volumes of data where you want to leverage massively parallel processing (MPP). Using it to query a database for its schema seems like overkill: not only is it a much larger engine than you need, it will also be costly, because AWS Glue bills a minimum of 10 minutes per run. Gathering database metadata would take far less time than that and involves relatively few rows, so there is nothing to shard out across multiple DPUs.

What would be more appropriate, and likely far easier and cheaper to implement, is an AWS Glue Python Shell job. It's a much cheaper, lighter-weight processing engine that runs plain Python. Refer to the instructions on how to connect vanilla Python to Neo4j.

That's just my opinion and you may have other reasons why you need to be using Apache Spark to gather metadata.

Thanks Mike for the information. I was pulled into a project to save all database schemas into the AWS Glue catalog, so that our developers don't need to create various "collectors" to retrieve the schema for different database technologies. I am actually a relational database engineer, and new to both AWS Glue and Neo4j. I will definitely relay your information to our developer teams.

By the way, I still can't get your script working under my account. When I run the Scala job, I get this error message:

Compilation result: /tmp/g-5de1747b394a6f0bf75445dd36dd03e4d27dbe16-790763640924356726/script_2020-01-13-14-34-21.scala:15: error: expected class or object definition
GlueApp {
^
one error found
Compilation failed.

Could you please review my Glue job definition below to see if anything is wrong? What should the Scala class name be?

Name: yju200-neo4j-scala
IAM role: AWSGlueServiceAdminRole
Type: Spark
Spark version: 2.4
ETL language: scala
Scala class name: GlueApp
Script location: s3://aws-glue-scripts-xxxxx-us-east-1/admin/yju200-neo4j-scala
Temporary directory: s3://aws-glue-temporary-xxxxx-us-east-1/admin
Job bookmark: Disable
Job metrics: Disable
Continuous logging: Disable
Server-side encryption: Disabled
Python lib path: s3://yju200-glue/graphframes-f9e13ab4ac1a7113f8439744a1ab45710eb50a72.zip,s3://yju200-glue/neo4j-spark-connector-2.2.1-M5.zip
Jar lib path: s3://yju200-glue/graphframes-f9e13ab4ac1a7113f8439744a1ab45710eb50a72.zip,s3://yju200-glue/neo4j-spark-connector-2.2.1-M5.zip
Other lib path:
Job parameters: --conf spark.neo4j.bolt.url=bolt://10.140.107.171:7687 --conf spark.neo4j.bolt.user=neo4j --conf spark.neo4j.bolt.password=xxxxx
Connections:
Maximum capacity: 10
Job timeout (minutes): 2880
Delay notification threshold (minutes):
Tags:

Hi,
we have successfully managed to create an AWS Glue job (with Neo4j Aura 4.4.0 and Scala 2.12) which reads data from a CSV on S3 and writes it to a database on Aura, following these steps:

1. Placed the Spark connector jar for Spark 3 / Scala 2.12 (connector release 4.1.0, downloaded from Releases · neo4j-contrib/neo4j-spark-connector · GitHub) on S3
2. Placed the Scala script on S3
3. Created the AWS Glue job and set the following parameters:

  • IAM role: yourIAMRole
  • Glue version 3.0
  • Spark version 3.1
  • ETL language: scala
  • Scala class name: Test
  • script location: S3 path of the Scala script (step 2)
  • jar lib path: S3 path of the Spark connector jar (step 1)

It is important that the IAM role you choose has all the necessary permissions on S3 and AWS Glue.
Below is the script we used, which reads a CSV file and writes to Neo4j Aura:

import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import java.util.Calendar
import scala.util.Random
import org.apache.spark.SparkContext
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.Row
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.streaming.Trigger
import scala.collection.JavaConverters._

object Test {
  def main(sysArgs: Array[String]) {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    val sparkSession: SparkSession = glueContext.getSparkSession
    import sparkSession.implicits._

    // Read the source CSV from S3
    val staticData = sparkSession.read
      .format("csv")
      .option("header", "true")
      .load("s3://graph-ai-datasets/athena/tables/bandi_gara/addresses.csv")
      .toDF

    // Write the rows to Aura as :Person:Customer nodes, merged on the name and surname keys
    staticData
      .write
      .format("org.neo4j.spark.DataSource")
      .mode(SaveMode.Overwrite)
      .option("url", "neo4j+s://***")
      .option("database", "neo4j")
      .option("authentication.type", "basic")
      .option("authentication.basic.username", "neo4j")
      .option("authentication.basic.password", "password")
      .option("node.keys", "name,surname")
      .option("labels", ":Person:Customer")
      .save()
  }
}

I greatly appreciate this!

Would you think there would be any issues with following the directions above, but with ETL language: python? I don't think so, but thought I'd ask.

Thanks for all the assistance on this!

I can answer your question: no, there is no issue with using Python as the programming language, you just need to translate the code.