cancel
Showing results for 
Search instead for 
Did you mean: 

Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.

Remove property duplicates

Hi everybody!
I'm new at Neo4j and i'm trying to learn it in order to start a project for my thesis, i spent a lot of time to manually reorganize a large dataset from xml to csv, then i imported a piece of this db inside Neo4j, i would avoid those duplicates but i canno't find a proper query, can you help me?
Thanks!
Please tell me if you need to know anything about this database or other stuff to undestand better the situation.

2 REPLIES 2

intouch_vivek
Graph Steward

Hi @alessio9899,

Welcome to the Community!!

I guess post Creating Nodes using Create Statement your applied
CALL apoc.refactor.mergeNodes(nodes, {properties: "combine"}) YIELD node

If that is the case try something like

MATCH (p:Person)
WITH p ORDER BY p.created DESC // newest one first
WITH p.email, collect(p) as nodes
CALL apoc.refactor.mergeNodes(nodes, {properties: {name:'discard', age:'overwrite', kids:'combine', `addr.*`, 'overwrite',`.*`: 'discard'}}) YIELD node
RETURN node

given @ http://neo4j-contrib.github.io/neo4j-apoc-procedures/3.5/graph-refactoring/merge-nodes/
If your already have that data in the DB then you can try split function on the property and take first index [0]

thanks you for answer!
i'll try to explaine you step by step what exactly i did:

  1. first of all i used this query to load from .CSV:
    LOAD CSV WITH HEADERS FROM
    "file:///C:/opere_prova.csv"
    AS line FIELDTERMINATOR ';'
    merge(o:Work{
    shortTitle:COALESCE(line.ShortTitle,$defaultValue),
    longTitle:COALESCE(line.LongTitle,$defaultValue),
    academyName:COALESCE(line.ItacAcademyItem,$defaultValue),
    academyId:COALESCE(toInteger(line.AcademyId),$defaultInteger),
    format:COALESCE(line.Format,$defaultValue),
    cityPublicationNameIt:COALESCE(line.CityPublicationPlaceItalianName,$defaultValue),
    cityPublicationNameLat:COALESCE(line.CityPublicationPlaceLatinName,$defaultValue),
    cityPublicationNameEng:COALESCE(line.CityPublicationPlaceEnglishName,$defaultValue),
    cityPublicationPlaceFictitiousName:COALESCE(line.CityPublicationPlaceFictitiousName,$defaultValue),
    cityId:COALESCE(toInteger(line.CityId),$defaultInteger),
    publicationYear:COALESCE(line.PublicationYear,$defaultValue),
    dedicationPlaceDate:COALESCE(line.DedicationPlaceDate,$defaultValue),
    shelfmark:COALESCE(line.Shelfmark,$defaultValue),
    content:COALESCE(line.Content,$defaultValue),
    subjects:COALESCE(line.Subjects,$defaultValue),
    language:COALESCE(line.Language,$defaultValue),
    illustration:COALESCE(line.Illustration,$defaultValue),
    pagination:COALESCE(line.Pagination,$defaultValue),
    marginalia:COALESCE(line.Marginalia,$defaultValue),
    authorInfo:COALESCE(line.AuthorsInfo,$defaultValue),
    authorPersonId:COALESCE(toInteger(line.AuthorsPersonId),$defaultInteger),
    edition:COALESCE(line.Edition,$defaultValue),
    contributorInfo:COALESCE(line.ContributorsInfo,$defaultValue),
    contributorPersonId:COALESCE(toInteger(line.ContributorsPersonId),$defaultInteger),
    censorInfo:COALESCE(line.CensorsInfo,$defaultValue),
    censorPersonId:COALESCE(toInteger(line.CensorsPersonId),$defaultInteger),
    censorAgreement:COALESCE(line.CensorsAgreement,$defaultValue),
    dedicateeInfo:COALESCE(line.DedicateesInfo,$defaultValue),
    dedicateePersonId:COALESCE(toInteger(line.DedicateesPersonId),$defaultInteger),

editorInfo:COALESCE(line.EditorsInfo,$defaultValue),
editorPersonId:COALESCE(toInteger(line.EditorsPersonId),$defaultInteger),
artistInfo:COALESCE(line.ArtistsInfo,$defaultValue),
artistPersonId:COALESCE(toInteger(line.ArtistsPersonId),$defaultInteger),
illustratorInfo:COALESCE(line.IllustratorsInfo,$defaultValue),
illustratorPersonId:COALESCE(toInteger(line.IllustratorsPersonId),$defaultInteger),
printerInfo:COALESCE(line.PrintersInfo,$defaultValue),
printerPersonId:COALESCE(toInteger(line.PrintersPersonId),$defaultInteger),

printerOrnament:COALESCE(line.PrinterOrnament,$defaultValue),
engraverReference:COALESCE(line.EngraverReference,$defaultValue),
essayTitle:COALESCE(line.EssayTitle,$defaultValue),
essayContributorInfo:COALESCE(line.EssayContributorInfo,$defaultValue),
essayContributorPersonId:COALESCE(toInteger(line.EssayContributorPersonId),$defaultInteger),

citation:COALESCE(line.Citation,$defaultValue),
workId:COALESCE(toInteger(line.WorkId),$defaultInteger),
notes:COALESCE(line.Notes,$defaultValue)
})
RETURN o

  1. i obtained this result: https://ibb.co/6mvYKgS
    as you can see there is a lot of nodes while i just need 3 nodes in total, based on uniqueness of WorkId.

  2. then i executed this:

MATCH (o:Work)
WITH o.shortTitle AS title, collect(o) AS node2Merge

WITH node2Merge, extract(x IN node2Merge | x.dedicateePersonId) AS dedicateePersonId,
extract(x IN node2Merge | x.dedicateeInfo) AS dedicateeInfo,

extract(x IN node2Merge | x.authorPersonId) AS authorPersonId,
extract(x IN node2Merge | x.authorInfo) AS authorInfo,

extract(x IN node2Merge | x.censorPersonId) AS censorPersonId,
extract(x IN node2Merge | x.censorInfo) AS censorInfo,

extract(x IN node2Merge | x.contributorPersonId) AS contributorPersonId,
extract(x IN node2Merge | x.contributorInfo) AS contributorInfo,

extract(x IN node2Merge | x.editorPersonId) AS editorPersonId,
extract(x IN node2Merge | x.editorInfo) AS editorInfo,

extract(x IN node2Merge | x.artistPersonId) AS artistPersonId,
extract(x IN node2Merge | x.artistInfo) AS artistInfo,

extract(x IN node2Merge | x.illustratorPersonId) AS illustratorPersonId,
extract(x IN node2Merge | x.illustratorInfo) AS illustratorInfo,

extract(x IN node2Merge | x.printerPersonId) AS printerPersonId,
extract(x IN node2Merge | x.printerInfo) AS printerInfo,

extract(x IN node2Merge | x.essayContributorPersonId) AS essayContributorPersonId,
extract(x IN node2Merge | x.essayContributorInfo) AS essayContributorInfo,
extract(x IN node2Merge | x.essayTitle) AS essayTitle

CALL apoc.refactor.mergeNodes(node2Merge) YIELD node

SET node.dedicateePersonId = dedicateePersonId
SET node.dedicateeInfo = dedicateeInfo

SET node.authorPersonId = authorPersonId
SET node.authorInfo = authorInfo

SET node.contributorPersonId = contributorPersonId
SET node.contributorInfo = contributorInfo

SET node.censorPersonId = censorPersonId
SET node.censorInfo = censorInfo

SET node.editorPersonId = editorPersonId
SET node.editorInfo = editorInfo

SET node.artistPersonId = artistPersonId
SET node.artistInfo = artistInfo

SET node.illustratorPersonId = illustratorPersonId
SET node.illustratorInfo = illustratorInfo

SET node.printerPersonId = printerPersonId
SET node.printerInfo = printerInfo

SET node.essayContributorPersonId = essayContributorPersonId
SET node.essayContributorInfo = essayContributorInfo
SET node.essayTitle = essayTitle

  1. i obtained this:
    https://ibb.co/DQCntDn
    as you can see there are a lot of repetitions, not for every properties but for many of them

any help?