12-23-2022 01:48 AM - edited 12-23-2022 02:02 AM
The data are shown below (each row holds the information about one purchase by a buyer).
Remark: a buyer does not repeat across rows, but a seller does, and a seller is not necessarily one of the buyers.
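For illustration, the rows look like this (the names and values are hypothetical; the columns are the ones the code below reads):
id | buyer | seller | object | time
1  | Alice | Carol  | book   | 2022-12-01
2  | Bob   | Carol  | pen    | 2022-12-02
3  | Dave  | Erin   | lamp   | 2022-12-03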
I want to create nodes and edges as follows, i.e., a relationship between each buyer and the corresponding seller. These are the steps and the code:
Step (1) Create the buyer nodes (labeled as User)
CREATE (user:User {id: data.id, name: data.buyer, buyer: TRUE})
Step (2) Create the seller nodes (also labeled as User)
CREATE (user:User {id: data.id, name: data.seller, seller: TRUE})
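(For context, 'data' in these snippets must come from an import source. A minimal sketch, assuming the rows live in a hypothetical transactions.csv with id, buyer, seller, object and time columns:)
// each row is bound to 'data'; the CREATE from step (1) then runs once per row
LOAD CSV WITH HEADERS FROM 'file:///transactions.csv' AS data
CREATE (user:User {id: data.id, name: data.buyer, buyer: TRUE})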
Step (3) Link the buyer and seller based on id
MATCH (n:User)
MATCH (p:User)
WHERE n.id = p.id AND n.buyer IS NOT NULL AND p.seller IS NOT NULL
// create the edge between the two nodes of the same transaction
CREATE (n)-[:PURCHASED_FROM]->(p)
Step (4) Merge the duplicate nodes (i.e., merge the nodes whose name is the same)
MATCH (n:User)
WITH n.name AS repeatname, collect(n) AS nodes
WHERE size(nodes) > 1
CALL {
  WITH nodes
  CALL apoc.refactor.mergeNodes(nodes, {properties:'combine'})
  YIELD node
  RETURN node
}
RETURN count(node)
With that I can successfully get the graph in the figure I want. However, I suffer from a performance problem, since the data size is not just 5 rows but about 5 billion. The problem happens in step (4), i.e., apoc.refactor.mergeNodes: it takes too much time (more than one day).
Is there any other method (for example, one that does not need apoc.refactor.mergeNodes) that spends less time but still produces the same graph? I tried the periodic method, which computes in parallel, but it still takes too much time.
Thanks a lot!
12-23-2022 04:45 AM
It looks like you are creating your own problem with the way you are importing the data. Each row represents a transaction between a buyer and a seller, and the id is the transaction id. In your algorithm you create a buyer-specific node and a seller-specific node for each transaction, since you create new nodes carrying the transaction id every time. You are then left with duplicate nodes for each seller and buyer, when viewed by name only, which you then have to merge by name. If my understanding is correct, you can avoid the need for a merge phase entirely by importing the data differently.
I suggest you merge the seller and buyer nodes for each transaction and link the two nodes. The seller and buyer nodes do not need the transaction id as a property; it is not an attribute of the user, and it is what is causing the duplication. Your merge operation results in the same graph anyway.
You could do something like the following:
merge(b:User{name:data.buyer})
merge(s:User{name:data.seller})
merge(b)-[t:PURCHASED_FROM{id:data.id}]->(s)
set t.item=data.object, t.time=data.time
The resulting graph will have one node per user, a relationship between buyer and seller per transaction, and the relationship will identify the transaction. There is no need for the Boolean 'seller' and 'buyer' properties: you can tell whether a user was ever a buyer, since that user will have an outgoing PURCHASED_FROM relationship, while sellers will have an incoming PURCHASED_FROM relationship. Users that are both seller and buyer will have at least one of each. You can also get all the transactions a user made, of either type.
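For example (a sketch; the user name 'Alice' is hypothetical):
// users who were ever buyers: they have an outgoing PURCHASED_FROM
MATCH (u:User)
WHERE (u)-[:PURCHASED_FROM]->()
RETURN u.name
// all transactions one user made, as buyer or seller
MATCH (u:User {name: 'Alice'})-[t:PURCHASED_FROM]-()
RETURN t.id, t.item, t.time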
You may want to consider having a transaction node that holds the transaction properties and links the buyer and seller through HAS_BUYER and HAS_SELLER relationships respectively. This would be useful if you needed the transaction to have relationships to other entities, like shipper or supplier, ship-to address, payment info, etc.
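A sketch of that alternative model (the relationship directions shown are one reasonable choice, not a requirement):
// one node per transaction; users stay deduplicated by name
merge(tx:Transaction{id:data.id})
set tx.item=data.object, tx.time=data.time
merge(b:User{name:data.buyer})
merge(s:User{name:data.seller})
merge(tx)-[:HAS_BUYER]->(b)
merge(tx)-[:HAS_SELLER]->(s)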
12-25-2022 06:07 AM - edited 12-25-2022 07:02 AM
Thanks, it almost works, but two problems are left. In summary:
(1) How can I make merge recognize only the specific property (name) used as the duplicate criterion when other properties are present?
(2) How can the merge approach be applied in apoc.periodic, or run as a parallel computation without a batch size?
Details:
In fact, I need to add one more property (not shown in the table I provided) to each node, which cannot be stored on the edge, i.e., the following:
merge(b:User{name:data.buyer, property:data.aaa})
merge(s:User{name:data.seller, property:data.bbb})
merge(b)-[t:PURCHASED_FROM{id:data.id}]->(s)
set t.item=data.object, t.time=data.time
but this merge results in duplicate nodes again (duplicated based on name), since I added the extra properties aaa and bbb. Is there any way to make merge match on just the name rather than on the two-property combination (name and aaa, or name and bbb)?
The other problem is that with this merge code it seems I cannot use the periodic approach, since otherwise duplicates of a node would appear in different batches. Is there a way to use periodic, or just parallel computation with no batch size in periodic? Thanks
12-25-2022 12:36 PM
When merging, you should include in the '{}' only the properties that represent the primary key of the node. In most cases this is just one property that uniquely identifies the node; it can also be multiple properties, as would be the case for a composite primary key. I think in your case the primary key is just the 'name'. As such, you only want to include the name in the 'merge' clause, to avoid duplicates as the other properties change. I suggest something like this, where you 'merge' only on the entity's primary key and use 'set' to set all the remaining properties.
merge(b:User{name:data.buyer})
set b.property = data.aaa
merge(s:User{name:data.seller})
set s.property = data.bbb
merge(b)-[t:PURCHASED_FROM{id:data.id}]->(s)
set t.item=data.object, t.time=data.time
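(A side note, since the data is large: merge on 'name' is only fast when it is backed by an index, and a uniqueness constraint is the usual way to get one. A sketch, using Neo4j 4.4+ syntax:)
CREATE CONSTRAINT user_name IF NOT EXISTS
FOR (u:User) REQUIRE u.name IS UNIQUE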
You are correct. If you need to remediate your data by merging nodes with the same 'name', each batch must contain all the nodes for a given name. This is achieved by collecting the nodes by name outside of the batching, and sending the collections to the call subquery, which will batch them into groups of collections to process. Try this:
:auto MATCH (n:User)
WITH n.name AS name, collect(n) AS nodes
WHERE size(nodes) > 1
CALL {
  WITH nodes
  CALL apoc.refactor.mergeNodes(nodes, {properties:'combine'})
  YIELD node
  RETURN node
} IN TRANSACTIONS OF 10000 ROWS
RETURN count(node)
You could try apoc.periodic.iterate to see if it is better:
CALL apoc.periodic.iterate(
'
MATCH (n:User)
WITH n.name AS name, collect(n) AS nodes
WHERE size(nodes) > 1
RETURN nodes
',
'CALL apoc.refactor.mergeNodes(nodes, {properties:"combine"})',
{batchSize:10000, parallel:false})
As I suggested, the best approach is to import that data so that you avoid the need to merge the duplicate nodes afterwards.
12-25-2022 05:57 PM - edited 12-25-2022 06:01 PM
It showed the following error:
Procedure call inside a query does not support naming results implicitly (name explicitly using `YIELD` instead)
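That error comes from the second statement calling apoc.refactor.mergeNodes without an explicit YIELD; a likely fix (a sketch, untested at this data scale) is to yield and return the merged node:
CALL apoc.periodic.iterate(
'
MATCH (n:User)
WITH n.name AS name, collect(n) AS nodes
WHERE size(nodes) > 1
RETURN nodes
',
'
CALL apoc.refactor.mergeNodes(nodes, {properties:"combine"})
YIELD node
RETURN node
',
{batchSize:10000, parallel:false})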
01-05-2023 02:26 AM
Just for everyone who may be searching for a solution to the same problem:
In the end I changed the data structure to avoid needing "merge" (or anything like it) at all, and this was the best method for my data of more than 10 billion nodes; otherwise it takes so long that I could almost play 50 games of League of Legends in the meantime.