12-15-2019 04:16 AM
We receive a large amount of data, and we build a Cypher query that needs to "sync" the relationships of a single :Talent node with a large number of :Segment nodes. On average, around 900,000 relationships need to be synced, and we need to write that data to the database in as little time as possible. The current setup is as follows:
heap.max_size and heap.initial_size are both set to 13Gi
The approach we've taken is to first delete all relationships between the :Talent and its :Segment nodes, and then write the updated relationships again. This delete step is one of the questionable parts, but we decided on it in contrast to querying the existing relationships and diffing at the application layer.
Question 1: Is there a better way to approach syncing the relationships than delete/create, in order to remove existing relationships that are not part of the given list?
The query that writes the relationships is as follows:
WITH $list as list
UNWIND list as item
MATCH (t:Talent {talent_id: $talentID}), (s:Segment {checksum: item.checksum})
MERGE (t)-[r:BELONGS_TO]->(s)
ON CREATE SET r.created_at = $created_at
SET r.updated_at = $updated_at,
r.affinity = item.affinity,
r.audience_size_percentage = item.audience_size_percentage
Note that a single run of this query does not receive the full list of segments. Since we are using Golang, we decided to utilise its concurrency capabilities to split the list across concurrent writes and reach a high write rate in little time, but we still weren't able to achieve it.
Example: say we received a list of 700,000 :Segment entries that need to be written, and the :Talent already had 500,000. That means 140 concurrent operations for the 700k, each writing 5,000 relationships.
Question 2: This process takes up to an hour to finish per :Talent. How can we bring it down to the minimum possible (ideally a minute, though that seems impossible)?
It is also worth mentioning the current schema:
Indexes
ON :AudienceSegment(checksum) ONLINE
ON :Talent(talent_id) ONLINE
ON :AudienceSegment(uuid) ONLINE (for uniqueness constraint)
ON :Talent(uuid) ONLINE (for uniqueness constraint)
Constraints
ON ( audiencesegment:AudienceSegment ) ASSERT audiencesegment.uuid IS UNIQUE
ON ( talent:Talent ) ASSERT talent.uuid IS UNIQUE
Note: uuid is a randomly generated UUIDv4 that is added to all nodes.
Note: :Segment is identified by the checksum field rather than uuid.
12-19-2019 12:25 AM
Welcome back Abed my friend
did you configure the page-cache size too?
what is the IO load of your system?
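For reference, these settings live in neo4j.conf (names as of the 3.5 line; the page-cache value below is only a placeholder and should be sized to the store files):
# heap, as described in the question
dbms.memory.heap.initial_size=13g
dbms.memory.heap.max_size=13g
# page cache; ideally large enough to hold the store
dbms.memory.pagecache.size=8g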
Use apoc.periodic.iterate with a batchSize of 100k, so each tx has 100k updates:
call apoc.periodic.iterate(
'// drive the batches: bind the talent once and stream the list items
 MATCH (t:Talent {talent_id: $talentID})
 UNWIND $list as item
 RETURN t, item',
'// executed per batch: look up the segment and merge the relationship
 MATCH (s:Segment {checksum: item.checksum})
 MERGE (t)-[r:BELONGS_TO]->(s)
 ON CREATE SET r.created_at = $created_at
 SET r.updated_at = $updated_at,
   r.affinity = item.affinity,
   r.audience_size_percentage = item.audience_size_percentage
', {batchSize:100000, params: {talentID:$talentID, list:$list, updated_at:$updated_at, created_at:$created_at}})
In general the model leads to super nodes, so that might be a bit of an issue.
Neo4j should still MERGE from the Segment, not the Talent, to check for the relationship, since a Segment has far fewer relationships to scan than the dense Talent node.
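One way to nudge that is to anchor the MERGE pattern on the Segment side (just a sketch; with both nodes already bound the planner may still choose its own expansion order, which EXPLAIN will show):
MATCH (s:Segment {checksum: item.checksum})
// start the relationship check from the low-degree Segment node
// instead of the ~900k-relationship Talent node
MERGE (s)<-[r:BELONGS_TO]-(t)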
In principle you could also take the already existing segment checksums and remove them from the list, like this:
MATCH (t:Talent {talent_id: $talentID})-[r:BELONGS_TO]->(s)
WITH t, collect(s.checksum) as existing
WITH t, [item IN $list WHERE NOT item.checksum IN existing] as filtered
...
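where the continuation could feed the filtered list back into the same write as before, e.g. (a sketch only; the snippet above ends at the ellipsis):
UNWIND filtered as item
MATCH (s:Segment {checksum: item.checksum})
MERGE (t)-[r:BELONGS_TO]->(s)
ON CREATE SET r.created_at = $created_at
SET r.updated_at = $updated_at,
    r.affinity = item.affinity,
    r.audience_size_percentage = item.audience_size_percentage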