cancel
Showing results for 
Search instead for 
Did you mean: 

Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.

Creating relationship over several millions of nodes

Hello!

I have 2 node labels, one of them has around 750K nodes, and another one with almost 50millions nodes. I have a stand-alone neo4j 3.5 running with 8 core and 96g memory. I am creating a relationship between these node labels using apoc.periodic.itererate but this query is been running for too long.
Some of the node label properties get refreshed every day and I want to be able to refresh the relationship once the node properties are updated in both labels. The last time when I ran the query to update the relationship, it was running for more than 8 hrs and I had to kill it.
Do you guys have any solution for this to speed up this process? Should I be doing this in a different way? If so, would you please let me know the approach.(I'm new to Neo4j).

Thanks!!

23 REPLIES 23

call apoc.periodic.iterate("match (i:Inventory) , (p:Products) where p.p3 <> '0' and i.p1= p.p1 OR where p.p3 <> '0' and i.p2= p.p2 return i,p","merge (i) -[:PRODUCES]->(p)",{batchSize:20000, parallel:false})

This is how the query looks like and I have indexes on p1,p2 and p3.

  1. Negation is expensive
    (good formatting isn't)

First, lest get this one running faster:

MATCH (i:Inventory), (p:Products)
WHERE p.p3 <> '0' AND  i.p1= p.p1 OR # this one is gonna *kill* your cpy and ram
WHERE p.p3 <> '0' AND i.p2= p.p2
RETURN i,p

So... let's make sure we only need to do that negation once, and group the rest of the WHERE clause more carefully

MATCH (p:Products)
WHERE p.p3 <> '0'

WITH p
MATCH (i :Inventory) 
WHERE i.p1= p.p1 OR i.p2= p.p2
RETURN i,p

...which will get you to a faster solution:

CALL apoc.periodic.iterate("
    MATCH (p:Products) WHERE p.p3 <> '0' WITH p
    MATCH (i :Inventory) WHERE i.p1= p.p1 OR i.p2= p.p2
    RETURN i,p
","
    MERGE (i) -[:PRODUCES]->(p)
",{batchSize:20000, parallel:false})

However, these would probably be better handled when importing data, or changes. When you're updating the inventory, the additional Cypher to adjust these relationships wouldn't have to check the entire DB. I strongly advise taking a very close look at whatever is creating/updating this data, as it will be orders of magnitude more efficient to do it then, instead of after the fact.

Thank you @tony.chiboucas!! I'll try your query and see if it helps.

I'm sorry about the formatting. 🙂

I absolutely agree with you. I should be creating this relationship when the Product nodes are created/updated batch-wise. I am using golang-neo4j driver and session.WriteTransaction to write/update 20000 Product nodes at a time using go-routine. I will just plug in this cypher query into that go-routine function.
Since I am still in the testing phase I wanted to give this query a try by running manually in the browser.

Thanks again!

Could you share the query/command that is creating/updating the nodes to begin with?

sure .. here is it is - I pasted only the data create portions


func InsertToNeo(data interface{}) error {
	m := ArrayStructToMap(data)
	session, err := NewSession(InfraObj.NeoDriver)
	if err != nil {
		return err
	}
	defer session.Close()

	_, err = session.WriteTransaction(**execute**(m), neo4j.WithTxTimeout(300*time.Second))
	if err != nil {
		return err
	}

	return nil
} 

func execute(data interface{}) func(tx neo4j.Transaction) (interface{}, error) {
	return func(tx neo4j.Transaction) (interface{}, error) {
		result, err := tx.Run(
			`unwind $events as event MERGE (a:Products {key: event.key}) SET
			          a.p1 = event.p1,
		              a.p2 = event.p2,
                      a.key = event.key
                       .
                       .
                       .
                       .
                       .
	              a.p10 = event.p10`,
			map[string]interface{}{"events": data})
		if err != nil {
			return nil, err
		}
		return nil, result.Err()
	}
}

I am still not sure how to plug these events (data) to relationship creation query. If you know how to do it please let me know. That would save a ton of my time Thanks in advance!

UNWIND $events as event
MERGE (a:Products {key: event.key})
SET
	a.p1 = event.p1,
	a.p2 = event.p2,
	a.key = event.key,
	...
# Just add at the end...

WITH a
MATCH (i :Inventory)
WHERE i.p1 = a.p1 OR i.p2 = a.p2
MERGE (i)-[:PRODUCES]->(a)

Note: This won't clean up old connections... so as your data changes, old i - p relationships will persist. You'll want to come up with a strategy for when/how to clear old :PRODUCES rels.

WOW, I honestly didn't think this is going to be this easy Thanks so much!!!!

Yes, I am going to wipe the DB and reload the data from scratch.

Ohh .. a qq - do you think I still need to plug in the below query to my Products update function? or just enough running this query in the create function like above?

MATCH (i :Inventory)
WHERE i.p1 = a.p1 OR i.p2 = a.p2
MERGE (i)-[:PRODUCES]->(a)

That's the point of doing it during CREATE, the relationships are created then. No extra work needed. Everything better, faster, stronger.

gotcha.. thanks a ton, Sir!
Have a great day!

I'm sorry to bother you again.

It seems that it's still taking too long to create the relationship. Below is my code snippet

result, err := tx.Run(
		`unwind $events as event MERGE (a:Products {key: event.key}) SET
			a.p1 = event.p1,
			a.p2 = event.p2,
			a.key = event.key,
			.
            .
		    .	
            a.pN = event.pN

			WITH a MATCH i:Inventory) WHERE i.p1 = a.p1 OR i.p2 = a.p2  
			MERGE (i) -[:PRODUCES]->(a)`,
			map[string]interface{}{"events": data})

For testing purposes, I am only creating 25 a records at a time and trying to create PRODUCES rel with 750K i records. It's taking forever.... (I just killed the program).
If you have any insight on this please let me know. Thank you!

  1. Did you create indices on i.p1 and i.p2?
CREATE INDEX inventory_p1 FOR (i:Inventory) ON (i.p1)
CREATE INDEX inventory_p2 FOR (i:Inventory) ON (i.p2)
  1. You're missing a parenthesis: WITH a MATCH i:Inventory)

  2. Isolate code steps, and get the profile of that Cypher.

result, err := tx.Run(
    `UNWIND $events AS event CREATE (a:TempEvent) SET a = event`,
    map[string]interface{}{"events": data})

Once that's run once, open a Neo4j Browser, and run the next step there, with EXPLAIN. That'll show you where the bottleneck is.

EXPLAIN 
MATCH (a:TempEvent)
MATCH (i:Inventory) WHERE i.p1 = a.p1 OR i.p2 = a.p2  
MERGE (i) -[:PRODUCES]->(a)

Note: simplify

If you really are trying to copy all $event properties onto a :Product node, and an $event never contains an object in a property, you can simply do the following:

result, err := tx.Run(
    `UNWIND $events AS event
     MERGE (a:Product {key: event.key}) SET a = event`,
    map[string]interface{}{"events": data}
)

If an event does not meet that criteria, I'd give that Class a method which returns the structure you want in Neo that does meet that criteria.

Thank you for your time.

  1. It was created before running the program.
  2. Sorry it was copy/paste error here
  3. correct.. an $event does not contain an object and your unwind just worked fine. I'll follow this for creating new nodes.

Here is the EXPLAIN o/p. I didn't get much out of it. I am reading about EXPLAIN now and meanwhile, I thought I would share the screenshot here. Thanks!

....i just tried below query in the browser without OR condition.. it finished within 1ms.

MATCH (a:TempEvent) 
MATCH (i:Inventory) WHERE i.p1 = a.p2 
MERGE (i) -[:PRODUCES]->(a)

That profile looks good. It is using the index. These should be running pretty quick, even with such a large dataset.

... and how'd it do when you keep the OR condition?

I had to kill it after 30 mins. I was tracking CPU usage as well while it was running. I did not see it's going beyond 13%

Okay, the for whatever reason, it's trying to load all the :Inventory before applying the OR to exclude them. Shouldn't be doing that, but that'll be a tough one to debug.

We can work around the problem by never using an OR, and just calling the MATCH ... MERGE once for each property. The MERGE prevents creating duplicate rels, so there will only ever be one or none [:PRODUCES] for each i : p pair.

result, err := tx.Run(`
        UNWIND $events AS event
        MERGE (a:Product {key: event.key})
        SET a = event

        WITH a MATCH (i:Inventory) WHERE i.p1 = a.p1 MERGE (i)-[:PRODUCES]-(a)
        WITH a MATCH (i:Inventory) WHERE i.p2 = a.p2 MERGE (i)-[:PRODUCES]-(a)
    `,
    map[string]interface{}{"events": data}
)

...sorry there's not really a cleaner way to do that without writing your own Neo4j procedure.

Now at least we know why your process was taking forever... it was trying to load your entire :Inventory into RAM.

Ok .. that makes sense. Thanks!!
(My earlier statement was incorrect so deleted it)

result, err := tx.Run(`
        UNWIND $events AS event
        MERGE (a:Product {key: event.key})
        SET a = event

        WITH a MATCH (i:Inventory) WHERE i.p1 = a.p1 MERGE (i)-[:PRODUCES]-(a)
        WITH a MATCH (i:Inventory) WHERE i.p2 = a.p2 MERGE (i)-[:PRODUCES]-(a)
    `,
    map[string]interface{}{"events": data}
)

I tried this but it was still hanging and my program wouldn't complete. I killed it and interchanged the position of 2 queries in my code and voila it finished within a few ms.

result, err := tx.Run(`
        UNWIND $events AS event
        MERGE (a:Product {key: event.key})
        SET a = event

        WITH a MATCH (i:Inventory) WHERE i.p2 = a.p2 MERGE (i)-[:PRODUCES]-(a)
        WITH a MATCH (i:Inventory) WHERE i.p1 = a.p1 MERGE (i)-[:PRODUCES]-(a)
    `,
    map[string]interface{}{"events": data}
)

I am going to test it against 5000 nodes now. Will see how it behaves.

Thanks!

That is very odd. There might be something off with your i.p1 data or index.

Relationship Direction

You're missing the direction on your :PRODUCES relationship:

(i)-[:PRODUCES]-(a)

Which leads me to three graph-data-design questions:

  • Which is the parent? :Inventory or :Product?
  • What will your queries be looking for?
  • Where's the "many" side of the relationship between i and p?

The answers to these questions should inform the directionality of the :PRODUCES relationship, because query performance starts with choosing either in or out relationships from initially selected nodes. When in doubt, point from many to few, or children to parents. For your case, I suspect you have many more Products than Inventory, and each Inventory item "produces" many different Products? If that is the case, you should consider (:Product)-[:PRODUCED_BY]->(:Inventory).

Code Standards Note: :Products shouldn't be plural unless each node represents many products.

Sorry, my bad. There were some issues with i.p1. I fixed it this morning and it's not hanging anymore!!

Your understanding is right. I have many more Products(~10-20M) that are PRODUCED_BY by Inventory nodes.
I am very new to Neo4j and I was reading a blog last week where they recommended to go with one -> many relationship so I chose Inventory -> Product.

  • Which is the parent? :Inventory or :Product ?
    Inventory. As I am planning to connect other node labels (Price, Owner, Location.. etc) with my Inventory as well.
  • What will your queries be looking for? My queries will start from the Inventory node and then it will pull all data connected to that inventory node.
  • Where's the "many" side of the relationship between i and p ?
    This is where I am getting confused a bit.I think I can go with either way right? In my case an Inventory node can produce many products. Similarly, same product can be produced by more than one Inventory node.

Could you link me, so I can go yell at them?

Information Architecture, Entity Markup Language, Type Theory, Object Modeling, and many, many other notation systems which are inter-related all follow a standard rotational system of child --> parent or many --> one. For Neo4j and Cypher it doesn't matter much, as long as you stay consistent in your data and queries. However, in logic, object modeling, type theory, and information architecture (which can all inform or be informed by the graph) directionality can have significant impact.

Obsessed! This really makes sense and thanks to all parties involved in deriving this solution.