cancel
Showing results for 
Search instead for 
Did you mean: 

Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.

Delete duplicate node checking if specific keys have same values

lx2pwnd
Node Clone

In my db are some nodes which have this form:

n:Contact
{
  "name": "name surname",
  "created_timestampe": 1460411261,
  "updated_timestamp": 1494083876,
  "contact_point": [
    "+39*********"
  ]
}

but there are some nodes where I have some values for the keys name and contact_point which are duplicated. I don't care about created_timestamp or updated_timestamp, but I want to delete this duplicates checking if the contact_point value is really the same. I wrote this, but I'm not sure if it's correct:

MATCH (c:Contact)
WITH c.contact_point AS contact_point, COLLECT(c) AS contacts
WHERE SIZE(contacts) > 1
FOREACH (n IN contacts| DETACH DELETE n);
1 ACCEPTED SOLUTION

I am assuming that you're looking for the combination of the nodes above that would be the union of properties based on a common name property into one node with a distinct name. Always an interesting question on how to combine things in a graph when a classic MERGE statement does not quiet fit. Fortunately there's an apoc procedure apoc.refactor.mergeNodes that can help. The Cypher query below walks through the list of nodes by name and uses apoc.refactor.mergeNodes to decide what to do with the properties. I changed the created_timestampe values to unique integers so that you can see that there's not an ordering. That doesn't seem to apply to this case, but is good to know. A test set of data is below the query and the results.

One other thought is to clean the data before as part of the load process which is usually easier to do.

// apoc.refactor.mergeNodes() to combine nodes by name and handle the node properties
//  note: indexes are 0 based
MATCH (c:Contact)
WITH c.name AS name, COLLECT(c) AS contacts,
     SIZE(COLLECT(c)) AS nbr_nodes
WHERE SIZE(contacts) > 1  // only want names that have more than one node
// uncomment the RETURN  and delete lines below it to see the grouping
// RETURN name, contacts, SIZE(contacts) 
WITH name as name, contacts as contacts, nbr_nodes AS nbr_nodes
  UNWIND RANGE(1, nbr_nodes - 1) as idx
    CALL apoc.refactor.mergeNodes([contacts[0], contacts[idx]], {properties: {name:'combine', created_timestampe:'overwrite', updated_timestamp:'overwrite', contact_point:'combine'}}) YIELD node
RETURN name, max(idx) + 1 AS `Nbr Nodes Merged`

OUTPUT

╒═══════╤══════════════════╕
│"name" │"Nbr Nodes Merged"│
╞═══════╪══════════════════╡
│"Mamà" │3                 │
├───────┼──────────────────┤
│"Kiddo"│2                 │
├───────┼──────────────────┤
│"Papà" │4                 │
└───────┴──────────────────┘
MATCH(n) RETURN n
╒══════════════════════════════════════════════════════════════════════╕
│"n"                                                                   │
╞══════════════════════════════════════════════════════════════════════╡
│{"name":"Mamà","created_timestampe":5,"updated_timestamp":1520798382,"│
│contact_point":"+39*********332"}                                     │
├──────────────────────────────────────────────────────────────────────┤
│{"name":"Kiddo","created_timestampe":9,"updated_timestamp":1520798382,│
│"contact_point":["1234567","abcdefg"]}                                │
├──────────────────────────────────────────────────────────────────────┤
│{"name":"Papà","created_timestampe":3,"updated_timestamp":1520798382,"│
│contact_point":["plus-minus-plus-minus","+39********009","+39********0│
│0X","39*********332","+39*********332"]}                              │
├──────────────────────────────────────────────────────────────────────┤
│{"name":"The Dag","created_timestampe":10,"updated_timestamp":15207983│
│82,"contact_point":["food dish"]}                                     │
└──────────────────────────────────────────────────────────────────────┘
// Test Data
CREATE (:Contact {
  name: "Papà",
  created_timestampe: 1,
  updated_timestamp: 1493240714,
  contact_point: [
    "+39********009",
    "+39********00X"
  ]
} ),

(:Contact {
  name: "Papà",
  created_timestampe: 2,
  updated_timestamp: 1493240714,
  contact_point: [
    "+39********009",
    "39*********332"
  ]
}),

(:Contact {
  name: "Papà",
  created_timestampe: 3,
  updated_timestamp: 1520798382,
  contact_point: [
    "+39*********332"
  ]
}),
(:Contact {
  name: "Papà",
  created_timestampe: 4,
  updated_timestamp: 1520798382,
  contact_point: [
    "plus-minus-plus-minus"
  ]
}),

(:Contact {
  name: "Mamà",
  created_timestampe: 5,
  updated_timestamp: 1520798382,
  contact_point: [
    "+39*********332"
  ]
}),
(:Contact {
  name: "Mamà",
  created_timestampe: 6,
  updated_timestamp: 1520798382,
  contact_point: [
    "+39*********332"
  ]
}),
(:Contact {
  name: "Mamà",
  created_timestampe: 7,
  updated_timestamp: 1520798382,
  contact_point: [
    "+39*********332"
  ]
}),
(:Contact {
  name: "Kiddo",
  created_timestampe: 8,
  updated_timestamp: 1520798382,
  contact_point: [
    "1234567"
  ]
}),
(:Contact {
  name: "Kiddo",
  created_timestampe: 9,
  updated_timestamp: 1520798382,
  contact_point: [
    "abcdefg"
  ]
}),
(:Contact {
  name: "The Dag",
  created_timestampe: 10,
  updated_timestamp: 1520798382,
  contact_point: [
    "food dish"
  ]
})

View solution in original post

4 REPLIES 4

That looks pretty close, but right now it will delete all of the nodes when presumably you want to delete all but one of them?

The code below does what you want:

MATCH (c:Contact)
WITH c.contact_point AS contact_point, COLLECT(c) AS contacts
WHERE SIZE(contacts) > 1
UNWIND contacts[1..] AS contact
DETACH DELETE contact

Yes, In fact the code that I wrote deleted all the nodes where size(contact)=1. The code that you write works great, but after I execute it i have a problem. In fact, for example, I have this 3 node

{
  "name": "Papà",
  "created_timestampe": 1442311441,
  "updated_timestamp": 1493240714,
  "contact_point": [
    "+39********009"
  ]
}

{
  "name": "Papà",
  "created_timestampe": 1442194672,
  "updated_timestamp": 1493240714,
  "contact_point": [
    "+39********009",
    "39*********332"
  ]
}

{
  "name": "Papà",
  "created_timestampe": 1515581146,
  "updated_timestamp": 1520798382,
  "contact_point": [
    "+39*********332"
  ]
}

If I want to delete also the first node and the third node, because the second one contains both, how i should do ?

I am assuming that you're looking for the combination of the nodes above that would be the union of properties based on a common name property into one node with a distinct name. Always an interesting question on how to combine things in a graph when a classic MERGE statement does not quiet fit. Fortunately there's an apoc procedure apoc.refactor.mergeNodes that can help. The Cypher query below walks through the list of nodes by name and uses apoc.refactor.mergeNodes to decide what to do with the properties. I changed the created_timestampe values to unique integers so that you can see that there's not an ordering. That doesn't seem to apply to this case, but is good to know. A test set of data is below the query and the results.

One other thought is to clean the data before as part of the load process which is usually easier to do.

// apoc.refactor.mergeNodes() to combine nodes by name and handle the node properties
//  note: indexes are 0 based
MATCH (c:Contact)
WITH c.name AS name, COLLECT(c) AS contacts,
     SIZE(COLLECT(c)) AS nbr_nodes
WHERE SIZE(contacts) > 1  // only want names that have more than one node
// uncomment the RETURN  and delete lines below it to see the grouping
// RETURN name, contacts, SIZE(contacts) 
WITH name as name, contacts as contacts, nbr_nodes AS nbr_nodes
  UNWIND RANGE(1, nbr_nodes - 1) as idx
    CALL apoc.refactor.mergeNodes([contacts[0], contacts[idx]], {properties: {name:'combine', created_timestampe:'overwrite', updated_timestamp:'overwrite', contact_point:'combine'}}) YIELD node
RETURN name, max(idx) + 1 AS `Nbr Nodes Merged`

OUTPUT

╒═══════╤══════════════════╕
│"name" │"Nbr Nodes Merged"│
╞═══════╪══════════════════╡
│"Mamà" │3                 │
├───────┼──────────────────┤
│"Kiddo"│2                 │
├───────┼──────────────────┤
│"Papà" │4                 │
└───────┴──────────────────┘
MATCH(n) RETURN n
╒══════════════════════════════════════════════════════════════════════╕
│"n"                                                                   │
╞══════════════════════════════════════════════════════════════════════╡
│{"name":"Mamà","created_timestampe":5,"updated_timestamp":1520798382,"│
│contact_point":"+39*********332"}                                     │
├──────────────────────────────────────────────────────────────────────┤
│{"name":"Kiddo","created_timestampe":9,"updated_timestamp":1520798382,│
│"contact_point":["1234567","abcdefg"]}                                │
├──────────────────────────────────────────────────────────────────────┤
│{"name":"Papà","created_timestampe":3,"updated_timestamp":1520798382,"│
│contact_point":["plus-minus-plus-minus","+39********009","+39********0│
│0X","39*********332","+39*********332"]}                              │
├──────────────────────────────────────────────────────────────────────┤
│{"name":"The Dag","created_timestampe":10,"updated_timestamp":15207983│
│82,"contact_point":["food dish"]}                                     │
└──────────────────────────────────────────────────────────────────────┘
// Test Data
CREATE (:Contact {
  name: "Papà",
  created_timestampe: 1,
  updated_timestamp: 1493240714,
  contact_point: [
    "+39********009",
    "+39********00X"
  ]
} ),

(:Contact {
  name: "Papà",
  created_timestampe: 2,
  updated_timestamp: 1493240714,
  contact_point: [
    "+39********009",
    "39*********332"
  ]
}),

(:Contact {
  name: "Papà",
  created_timestampe: 3,
  updated_timestamp: 1520798382,
  contact_point: [
    "+39*********332"
  ]
}),
(:Contact {
  name: "Papà",
  created_timestampe: 4,
  updated_timestamp: 1520798382,
  contact_point: [
    "plus-minus-plus-minus"
  ]
}),

(:Contact {
  name: "Mamà",
  created_timestampe: 5,
  updated_timestamp: 1520798382,
  contact_point: [
    "+39*********332"
  ]
}),
(:Contact {
  name: "Mamà",
  created_timestampe: 6,
  updated_timestamp: 1520798382,
  contact_point: [
    "+39*********332"
  ]
}),
(:Contact {
  name: "Mamà",
  created_timestampe: 7,
  updated_timestamp: 1520798382,
  contact_point: [
    "+39*********332"
  ]
}),
(:Contact {
  name: "Kiddo",
  created_timestampe: 8,
  updated_timestamp: 1520798382,
  contact_point: [
    "1234567"
  ]
}),
(:Contact {
  name: "Kiddo",
  created_timestampe: 9,
  updated_timestamp: 1520798382,
  contact_point: [
    "abcdefg"
  ]
}),
(:Contact {
  name: "The Dag",
  created_timestampe: 10,
  updated_timestamp: 1520798382,
  contact_point: [
    "food dish"
  ]
})

@dan.flavin1 yes, my idea is to combine the nodes which have the same name. Thanks a lot ! It's works perfectly !