

Creating edges with apoc.periodic.iterate suffers from a Cartesian product

The following is the Cypher that I ran:

####

CALL apoc.periodic.iterate("MATCH(e:User), (f:User)  WHERE  e.buyid = f.sellid RETURN  e,f",

"CREATE(e)-[r:sell_prodcut_to]->(f)",

{batchSize:10000, parallel: true}) YIELD batch

####

It seems to suffer from something like a Cartesian product: with 10 billion nodes it runs for a very long time without returning any result, although it still completes with 5 billion nodes. How can I change the Cypher to solve this problem?

 

Remark: both sellers and buyers must use the node label "User".

 

Thanks.

 

8 REPLIES

Do you have indexes created for these two properties?  

create index user_sellid if not exists for (n:User) on n.sellid;
create index user_buyid if not exists for (n:User) on n.buyid;

 

@glilienfield 

No. Excuse me, but should I create the indexes before or after creating the nodes?

You should create them as early as possible, so they can be leveraged as soon as they are needed. In any case, the indexes are built in the background and come online when finished. The above query should then run faster, as should other queries that use these properties.
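To see whether the background build has finished, something like the following could be used. This is a minimal sketch, assuming Neo4j 4.3 or later and the index names `user_sellid` and `user_buyid` from the statements above; the 600-second timeout is an arbitrary illustrative value.

```cypher
// List the two indexes with their state (e.g. POPULATING or ONLINE)
// and how far the population has progressed.
SHOW INDEXES
YIELD name, state, populationPercent
WHERE name IN ['user_sellid', 'user_buyid'];

// Optionally block until all indexes are online, or fail after the timeout.
// The timeout is given in seconds; 600 is only an example value.
CALL db.awaitIndexes(600);
```

Running the relationship-creating query only after both indexes report ONLINE avoids measuring it against a half-built index.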

Creating the indexes does let the query return successfully in 20 min, but building them takes too much time, not only before but also after creating the nodes. My data has to be deleted and re-added dynamically every day, so this is not what I want.

 

I tried the following Cypher and it works; sharing it with you!

####

CALL apoc.periodic.iterate("MATCH(e:User) MATCH (f:User{sellid:e.buyid})    RETURN  e,f",
"CREATE(e)-[r:sell_prodcut_to]->(f)",
{batchSize:10000, parallel: true}) YIELD batch
####
 
It takes only 10 min without the help of an index. Creating an index would surely improve the speed further, but I don't use one because creating it takes too much time.
 
However, is there any method that avoids matching twice? For example:
####
CALL apoc.periodic.iterate("MATCH(e:User) where e.buyid=e.sellid   RETURN  e,f",
"CREATE(e)-[r:sell_prodcut_to]->(e)",
{batchSize:10000, parallel: true}) YIELD batch
####
But then the resulting edges (sell_product_to) would be duplicated...
 
Thanks.

I figured it would take a while to create the initial index because you have a lot of nodes. Did it really take long to save a new node once the indexes were online?

Your issue is that you are asking to find all pairs of nodes that match your criteria, then you create a relationship between each pair of nodes.  Can you create the relationship when you add the nodes, instead of after all the nodes are entered? 

For the first one, I will run some tests and show you the results.

For the second, how can I create the relationship when adding the node?

The Cypher that I use to add the nodes is the following:

 

###

CALL{

CALL apoc.periodic.iterate('

CALL apoc.load.csv("user.csv") YIELD  value  return  value','

WITH  value

CREATE (u:User {sellid: value.sellid, buyid: value.buyid, name: value.name})',

{batchSize:10000, iterateList:true, parallel:true})  YIELD batches

}

###
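One possible way to create the relationships while loading, as a rough sketch rather than a tested solution: after creating each node, look up its counterparts through the `sellid` and `buyid` properties and link them in the same statement. The sketch assumes Neo4j 4.1 or later (for `CALL { WITH ... }` subqueries), that both indexes on `User(sellid)` and `User(buyid)` are online, and that `parallel` is set to false so that every batch can see the nodes committed by earlier batches. `MERGE` is used instead of `CREATE` for the relationship so that a pair found from both sides is linked only once.

```cypher
CALL apoc.periodic.iterate('
  CALL apoc.load.csv("user.csv") YIELD value RETURN value
','
  CREATE (u:User {sellid: value.sellid, buyid: value.buyid, name: value.name})
  WITH u
  // Link the new node to every existing seller it buys from.
  CALL {
    WITH u
    MATCH (f:User {sellid: u.buyid})
    MERGE (u)-[:sell_prodcut_to]->(f)
    RETURN count(*) AS outgoing
  }
  // Link every existing buyer to the new node.
  CALL {
    WITH u
    MATCH (b:User {buyid: u.sellid})
    MERGE (b)-[:sell_prodcut_to]->(u)
    RETURN count(*) AS incoming
  }
  RETURN outgoing, incoming
',
{batchSize: 10000, parallel: false})
YIELD batches, total, errorMessages
RETURN batches, total, errorMessages
```

Whether this is faster overall than loading first and linking afterwards depends on the data and on the index lookups, so it is worth timing on a small sample before running it on the full file.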

 

@glilienfield It's strange that the index does not improve performance and even makes it worse.

I tried the following:

Case 1

CALL apoc.periodic.iterate("MATCH(e:User) MATCH (f:User{sellid:e.buyid})    RETURN  e,f",
"CREATE(e)-[r:sell_prodcut_to]->(f)",
{batchSize:10000, parallel: true}) YIELD batch
 
Case 2
CALL apoc.periodic.iterate("MATCH(e:User) MATCH (f:User{sellid:e.buyid})   USING INDEX f:User(sellid) RETURN  e,f",
"CREATE(e)-[r:sell_prodcut_to]->(f)",
{batchSize:10000, parallel: true}) YIELD batch
 
Case 1: 10 min
Case 2: 12 min
 
Why? (In both cases there are 0.1 billion nodes in total.)
 
 
By the way, case 3:
CALL apoc.periodic.iterate("MATCH(e:User) MATCH (f:User{sellid:e.buyid})   USING INDEX e:User(buyid) RETURN  e,f",
"CREATE(e)-[r:sell_prodcut_to]->(f)",
{batchSize:10000, parallel: true}) YIELD batch
 
This fails with:
Failed to invoke procedure `apoc.periodic.iterate`: Caused by: org.neo4j.exceptions.SyntaxException: Cannot use index hint `USING INDEX e:User(buyid)` in this context: Must use label `User`, that the hint is referring to, on the node `e` either in the pattern or in supported predicates in `WHERE` (either directly or as part of a top-level `AND` or `OR`), but no label was found. Predicates must include the label literal `User`. That is, the function `labels()` is not compatible with indexes. Note that label `User` must be specified on a non-optional node

 

Did you run each case multiple times? What if you go back to your original query:

CALL apoc.periodic.iterate("MATCH(e:User) MATCH (f:User)  
WHERE e.buyid = f.sellid
RETURN  e,f",
"CREATE(e)-[r:sell_prodcut_to]->(f)",
{batchSize:10000, parallel: true}) YIELD batch
 
Is there any difference? It may be the same. I would have to look at the query plan, which I can't do on my phone.
 
The index hint failure in case 3 is probably due to the query not having a predicate on the property of the index you hinted.
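
One way to check what the planner does, as a sketch: prefix the inner MATCH with EXPLAIN, which shows the plan without running the query. A CartesianProduct operator followed by a Filter would confirm the pairwise comparison; a ValueHashJoin, or a NodeIndexSeek under an Apply, would indicate that the planner found something cheaper. Which operator appears depends on the Neo4j version and on which indexes are online, so treat these names as things to look for rather than a prediction.

```cypher
// Show the plan for the original pair-finding query without executing it.
EXPLAIN
MATCH (e:User), (f:User)
WHERE e.buyid = f.sellid
RETURN e, f
```

Running the same EXPLAIN on the `MATCH (f:User {sellid: e.buyid})` variant makes it possible to compare the two plans side by side.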