

Using MATCH with Millions of Nodes

TK36
Node Clone

I am using Neo4j 4.4.5 and streaming big data from Spark to Neo4j. I have streamed around 36 million addresses belonging to a blockchain as nodes with no problems. Now I'm trying to connect the addresses that transacted with each other. The data is preprocessed in Spark, so that part has no issues. My issue is with the Cypher code that creates relationships among the nodes. Here is the code I'm using now:

# Stream the transactions DataFrame to Neo4j; the connector batches the
# rows and exposes each one to the Cypher query below as `event`.
Tx.write.format("org.neo4j.spark.DataSource") \
    .mode("append") \
    .option("url", "bolt://<IP>:7687") \
    .option("query", "<SEE BELOW>") \
    .save()

Here is the query, which is probably causing the problem:

MATCH (a1:Address {address: event.input_addr})
MATCH (a2:Address {address: event.output_addr})
CREATE (a1)-[:TRANSACTED_WITH {tx_hash: event.tx_hash, value: event.value}]->(a2)

The code runs with no errors, but the streaming never seems to make progress, probably because the node lookups take too long with millions of nodes. I created a TEXT index on Address to make the MATCH lookups more efficient, but that still doesn't seem to solve the problem. Is there a way to rewrite the query more efficiently? Do you have any suggestions on how to do the same thing differently?

 

8 REPLIES

You may get better performance if you use a relationship type/property index on tx_hash and value instead of relying on a node property index on address. This was added in 4.3. I have not used it myself, but it is a thought.

https://neo4j.com/developer-blog/neo4j-4-3-blog-series-relationship-indexes/

Quote from the blog:

Creating a Relationship and /or Relationship Property Index will enable you to run more complex queries on relationships in less time. The indexes help reduce scanning across the entire database, which translates into fewer hits on the database. 
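
For example, a relationship property index on TRANSACTED_WITH might be created like this (the index name here is just an example; this is the syntax for 4.3+):

CREATE INDEX tx_rel_idx IF NOT EXISTS
FOR ()-[r:TRANSACTED_WITH]-()
ON (r.tx_hash, r.value)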

Thank you @glilienfield for your response. I believe a relationship index would not have an impact in this case, since the MATCH lookups are done on the nodes. I tried to stream 10 transactions instead of millions, and they were streamed in about 10 minutes. I'm trying to stream 1,000 now to see how long that takes, but I need a workable solution since I have millions of transactions to stream! Is there a way to rewrite the query while avoiding MATCH?

You are correct... my mistake... you haven't created the relationships yet.

Are you stating that if you execute that Cypher for two specific nodes, it does not work? If so, I would conclude it is not finding one or both of the nodes.

I don't see a way around needing MATCH, as you need a handle on the nodes in order to create the relationship between them. I assume you can't create the relationships during the node-streaming step, when you would already have the nodes in hand, because you don't have both nodes to relate at that time.

@glilienfield 

I believe all the nodes exist already. I tried running the code for 10 transactions and it worked; it just took a long time (~10 minutes), probably to find the right nodes out of the ~40 million nodes I have. Running the same query for 1,000 transactions is taking far too long (more than 2 hours so far!). I thought Neo4j was designed to handle millions of records, so I'm surprised there seems to be no way to accomplish my goal efficiently! Do you know of any other way to stream the data from Spark without having to run Cypher queries? Is there a way to preprocess the whole network in Python and then stream the nodes together with their relationships to Neo4j?

This is surprising. I suspect it may not be using your index. Let's look at the query plan. Add EXPLAIN as the first line of the query and execute it in the Browser. There will be a new icon on the left-hand side to select the plan view, where you normally select graph, text, table, etc.
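
For example, with placeholder values standing in for the connector's event fields (fill in real values from your data):

EXPLAIN
MATCH (a1:Address {address: '<input address>'})
MATCH (a2:Address {address: '<output address>'})
CREATE (a1)-[:TRANSACTED_WITH {tx_hash: '<tx hash>', value: 0}]->(a2)

EXPLAIN only plans the query without executing it, so the CREATE is safe to include.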

Also, it may not pick the TEXT index because it can't tell that the value of address is a string, since it is a parameter. You may want to try a range index instead. There is a discussion about this here:

https://neo4j.com/docs/cypher-manual/current/query-tuning/indexes/
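
A sketch of creating one (the index name is arbitrary; in 4.4 the default CREATE INDEX produces a BTREE index, which can serve equality lookups even when the value arrives as a parameter):

CREATE INDEX address_btree IF NOT EXISTS
FOR (a:Address)
ON (a.address)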

If you use the Python driver, you will still have to execute the same query.

Sorry, I have not used the Spark connector.
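
That said, the connector's documentation describes a declarative relationship write mode that avoids a hand-written query altogether. A rough, untested sketch (option names per the connector docs; verify against your connector version, and it assumes your DataFrame has input_addr, output_addr, tx_hash, and value columns):

Tx.write.format("org.neo4j.spark.DataSource") \
    .mode("append") \
    .option("url", "bolt://<IP>:7687") \
    .option("relationship", "TRANSACTED_WITH") \
    .option("relationship.save.strategy", "keys") \
    .option("relationship.properties", "tx_hash,value") \
    .option("relationship.source.labels", ":Address") \
    .option("relationship.source.save.mode", "Match") \
    .option("relationship.source.node.keys", "input_addr:address") \
    .option("relationship.target.labels", ":Address") \
    .option("relationship.target.save.mode", "Match") \
    .option("relationship.target.node.keys", "output_addr:address") \
    .save()

With save mode Match on both ends, the connector matches existing Address nodes by their address property instead of creating them.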

Thank you so much @glilienfield for taking the time to help me. Your effort is greatly appreciated 🙂

I think the text index is working. Here is a screenshot that captures the first part:

[screenshot: text index.png — query plan showing the MATCH lookups using the TEXT index]

I also tried deleting it to see how it compares to the LOOKUP index (which I found is created automatically):

[screenshot: lookup index.png — query plan using the automatic LOOKUP index]

I tried to create the range index, but it failed for some reason. Maybe I need to create it before creating the nodes?

Do you think that adding a new numerical ID property to each address node would make the lookup faster? If so, is there a way to do that in Neo4j easily and efficiently?

You may have gotten the error when creating the index because it already existed. You can see your indexes with SHOW INDEXES.

I am really confused by your query plan results. They show both matches using the TEXT index, but with 2 million rows for one match and 4 trillion rows for the other. Are the address values unique, so that each match should return exactly one node?

There is not an easy way to assign a unique numerical key to each node. Neo4j assigns a unique long value to each node and relationship, but those values are reused when entities are deleted. I ended up creating my own mechanism: I have a node for each sequence I want to generate, which holds the next number for that sequence, and I update it in a custom procedure. You could do something similar using a driver and transactions, but I think it would take a long time, considering how long it is taking to create your relationships. I guess you could do it in batches.
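
The core of that counter pattern looks something like this in plain Cypher (the Sequence label and property names are illustrative, not the exact procedure I wrote):

MERGE (s:Sequence {name: 'address_id'})
SET s.next = coalesce(s.next, 0) + 1
RETURN s.next AS id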

It would be nice to see what the query plan looks like if you use PROFILE instead of EXPLAIN. PROFILE actually executes the query and shows what happened; EXPLAIN is only an estimate. I would do it on a small set of addresses that you know will finish.

tard_gabriel
Ninja

Hi @TK36 

Looks like a simple but crazy Cartesian product to me; try this:
MATCH (a1:Address {address: event.input_addr}), (a2:Address {address: event.output_addr})
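
With the CREATE clause from your original query, the full statement would be:

MATCH (a1:Address {address: event.input_addr}), (a2:Address {address: event.output_addr})
CREATE (a1)-[:TRANSACTED_WITH {tx_hash: event.tx_hash, value: event.value}]->(a2)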

If my solution worked, please press the solution button on my answer; it helps me provide more insights to the community.