
Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.

Handling duplicate nodes across 5 billion rows of data takes too much time

Peter_Lian
Node Clone

The data are shown below (each row is the information about one buyer):

[Image: Peter_Lian_0-1671787479445.png — sample table of transaction data]

Remark: a buyer does not repeat across rows, but a seller does, and a seller may not be one of the buyers.

 

I want to create the nodes and edges as follows:

[Image: Peter_Lian_1-1671787597660.png — desired graph linking buyers to sellers]

i.e., the relationship between buyer and seller. The steps and code are as follows.

Step (1) Create buyer nodes (labeled User)

CREATE (user:User {id: data.id, name: data.buyer, buyer: TRUE})

Step (2) Create seller nodes (labeled User)

CREATE (user:User {id: data.id, name: data.seller, seller: TRUE})

Step (3) Link each buyer and seller based on id

MATCH (n:User), (p:User)
WHERE n.id = p.id AND n.buyer IS NOT NULL AND p.seller IS NOT NULL
MERGE (n)-[:PURCHASED_FROM]->(p)

Step (4) Merge the duplicate nodes (i.e., merge nodes that have the same name)

MATCH (n:User)
WITH n.name AS repeatname, collect(n) AS nodes
WHERE size(nodes) > 1
CALL {
    WITH nodes
    CALL apoc.refactor.mergeNodes(nodes, {properties:'combine'})
    YIELD node
    RETURN node
}
RETURN count(*) AS merged
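One thing that may help before step (4) is an index on name, so the grouping and any later merge-by-name lookups avoid full label scans. A minimal sketch, assuming Neo4j 4.x syntax (the index name is made up):

// "user_name" is an arbitrary index name
CREATE INDEX user_name IF NOT EXISTS FOR (u:User) ON (u.name)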

With that, I can successfully get the figure I want. However, I suffer from a performance problem, since the data size is not just the 5 sample rows but about 5 billion. The bottleneck is step (4), i.e., apoc.refactor.mergeNodes: it takes too much time (more than one day).

 

Is there any other method (for example, one that does not need apoc.refactor.mergeNodes) that spends less time but still produces the graph in the figure? I tried the periodic method, which computes in parallel, but it still takes too much time.

 

Thanks a lot!

 


5 REPLIES

It looks like you are creating your own problem by the way you are importing the data. Each row represents a transaction between a buyer and a seller; the id is the transaction id. In your algorithm, you create a buyer-specific node and a seller-specific node for each transaction, since you create new nodes carrying the transaction id. You are then left with duplicate nodes for each seller and buyer, when viewed by name only, that you then want to merge by name. If my understanding is correct, you can avoid the merge phase entirely by importing the data differently.

I suggest you create the seller and buyer nodes for each transaction and link the two nodes. The seller and buyer nodes do not need the transaction id as a property; it is not an attribute of the user. This is what is causing the duplication. Your merge operation results in the same graph. 

You could do something like the following:

merge(b:User{name:data.buyer})
merge(s:User{name:data.seller})
merge(b)-[t:PURCHASED_FROM{id:data.id}]->(s)
set t.item=data.object, t.time=data.time
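For context, the data rows in snippets like this would typically come from a batched import. A minimal sketch, assuming Neo4j 4.4+ and a hypothetical file transactions.csv whose headers match the property names used above:

// file name and headers are assumptions, not from the original post
:auto LOAD CSV WITH HEADERS FROM 'file:///transactions.csv' AS data
CALL {
    WITH data
    merge(b:User{name:data.buyer})
    merge(s:User{name:data.seller})
    merge(b)-[t:PURCHASED_FROM{id:data.id}]->(s)
    set t.item=data.object, t.time=data.time
} IN TRANSACTIONS OF 10000 ROWS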

The resulting graph will have one node per user, a relationship between seller and buyer per transaction, and the relationship will identify the transaction. There is no need for the Boolean 'seller' and 'buyer' properties, as you can tell if a user was ever a buyer, since the user will have an outgoing PURCHASED_FROM relationship, and sellers will have an incoming PURCHASED_FROM relationship. Users that are both seller and buyer will have at least one of each relationship. You can also get all the transactions a user made, of either type.
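For example, the roles can be read straight off the relationship direction (a sketch; the $name parameter is illustrative):

// users who were ever a buyer (have an outgoing PURCHASED_FROM)
MATCH (u:User) WHERE (u)-[:PURCHASED_FROM]->() RETURN u.name

// all transactions a given user took part in, in either role
MATCH (u:User {name:$name})-[t:PURCHASED_FROM]-() RETURN t.id, t.item, t.time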

You may want to consider having a transaction node that has the transaction properties and links the buyer and seller through HAS_BUYER and HAS_SELLER relationships respectively. This would be useful if you needed the transaction to have relationships to other entities, like shipper or supplier, ship to address, payment info, etc. 
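A sketch of that variant, using the HAS_BUYER and HAS_SELLER relationships described above (the Transaction label is an assumption):

merge(b:User{name:data.buyer})
merge(s:User{name:data.seller})
create(tx:Transaction{id:data.id, item:data.object, time:data.time})
merge(tx)-[:HAS_BUYER]->(b)
merge(tx)-[:HAS_SELLER]->(s)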

 

Thanks, it almost works, but two problems remain. In summary:

(1) How can the merge code recognize only the specific property (name) used as the criterion for duplicate nodes when other properties are present?

(2) How can the merge method be applied with apoc.periodic, or with parallel computation but no batch size?

 

Detail:

In fact, I need to add one more property (not shown in the table I provided) to each node, which cannot be stored on the edge, i.e., the following:

 

merge(b:User{name:data.buyer, property:data.aaa})
merge(s:User{name:data.seller, property:data.bbb})
merge(b)-[t:PURCHASED_FROM{id:data.id}]->(s)
set t.item=data.object, t.time=data.time

 

but this merge results in duplicate nodes again (duplicates based on name), since I added the properties aaa and bbb. Is there any way to make merge match on just the name, and not the two-dimensional pair (name and aaa, or name and bbb)?

 

The other problem is that with the merge code it seems I cannot use the periodic command; otherwise duplicate nodes would appear across different batches. Is there a method to use periodic, or just parallel computation with no batch size in periodic? Thanks.

When merging, you should include in the '{}' only the parameters that represent the primary key of the node. In most cases, this is just one property that uniquely identifies the node; it can also be multiple properties, as would be the case for a composite primary key. I think in your case the primary key is just the name. As such, you only want to include the name in the merge clause, to avoid duplicates as the other properties change. I suggest something like the following, where you merge only on the entity's primary key and use set to set all the remaining properties.

 

 

merge(b:User{name:data.buyer})
set b.property = data.aaa
merge(s:User{name:data.seller})
set s.property = data.bbb
merge(b)-[t:PURCHASED_FROM{id:data.id}]->(s)
set t.item=data.object, t.time=data.time
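At this scale, each merge on name needs an index to avoid a label scan. A uniqueness constraint created before the import gives you the index and also guards against duplicates; a sketch, assuming Neo4j 4.4+ syntax (the constraint name is made up):

// the constraint name "user_name_unique" is an arbitrary choice
CREATE CONSTRAINT user_name_unique IF NOT EXISTS
FOR (u:User) REQUIRE u.name IS UNIQUE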

 

You are correct. If you need to remediate your data by merging nodes with the same 'name', you will need each batch to contain all the data for each name. This is achieved by collecting the nodes by name outside of the batching, and sending the collections to the call subquery, which will batch them into groups of collections to process. Try this:

:auto MATCH (n:User)
WITH n.name AS name, collect(n) AS nodes
WHERE size(nodes) > 1
CALL{
    WITH nodes
    CALL apoc.refactor.mergeNodes(nodes,{properties:'combine'}) 
    YIELD node
} IN TRANSACTIONS OF 10000 ROWS

You could try apoc.periodic.iterate to see if it is better:

CALL apoc.periodic.iterate(
  '
    MATCH (n:User)
    WITH n.name AS name, collect(n) AS nodes
    WHERE size(nodes) > 1 
    RETURN nodes
  ',
  'CALL apoc.refactor.mergeNodes(nodes, {properties:"combine"})',
  {batchSize:10000, parallel:false})

As I suggested, the best approach is to import the data so that you avoid the need to merge the duplicate nodes afterwards.

It showed:

Procedure call inside a query does not support naming results implicitly (name explicitly using `YIELD` instead)

The following works:

CALL apoc.periodic.iterate(
  'MATCH (n:User) WITH n.name AS name, collect(n) AS nodes WHERE size(nodes) > 1 RETURN nodes',
  'CALL apoc.refactor.mergeNodes(nodes) YIELD node RETURN node',
  {batchSize:10000, parallel:false}) YIELD batch
 
But it takes too much time, so I chose to use merge with set on the property instead. Appreciated!

 

Just for anyone who may be searching for a solution to the same problem:

I changed the data structure to avoid using merge (or anything else) at the end, and this is the best method for my data of more than 10 billion nodes; otherwise it takes so long that I could almost play 50 games of League of Legends.
