12-15-2019 04:16 AM
We receive a large amount of data, and we build a Cypher query that needs to "sync" the relationships of a single :Talent node with a large number of :Segment nodes. On average, around 900,000 relationships need to be synced, and we need to write that data to the database in as little time as possible. The current setup is as follows:
heap.max_size and heap.initial_size are both set to 13Gi
The approach we've taken is to first delete all relationships between the :Talent and its :Segment nodes, and then write the updated relationships again. This delete step is one of the questionable parts, but we decided on it in contrast to querying the existing relationships and diffing at the application layer.
Question 1: Is there a better way to approach syncing the relationships than delete/create, in order to remove existing relationships that are not part of the given list?
The query that writes the relationships is as follows:
WITH $list as list
UNWIND list as item
MATCH (t:Talent {talent_id: $talentID}), (s:Segment {checksum: item.checksum})
MERGE (t)-[r:BELONGS_TO]->(s)
ON CREATE SET r.created_at = $created_at
SET r.updated_at = $updated_at,
r.affinity = item.affinity,
r.audience_size_percentage = item.audience_size_percentage
Note that a single run of this query does not receive the full list of segments. Since we are using Golang, we decided to utilise its concurrency capabilities to split the list across concurrent writes and reach a high write rate in little time, but we still weren't able to achieve it.
Example: say we received a list of 700,000 :Segment entries that need to be written, and the :Talent already had 500,000. That means 140 concurrent operations for the 700k, each writing 5,000 relationships.
Question 2: This process takes up to an hour to finish per :Talent. How can we bring it down to the minimum possible (ideally a minute, though that seems impossible)?
It is also worth mentioning the current schema:
Indexes
ON :AudienceSegment(checksum) ONLINE
ON :Talent(talent_id) ONLINE
ON :AudienceSegment(uuid) ONLINE (for uniqueness constraint)
ON :Talent(uuid) ONLINE (for uniqueness constraint)
Constraints
ON ( audiencesegment:AudienceSegment ) ASSERT audiencesegment.uuid IS UNIQUE
ON ( talent:Talent ) ASSERT talent.uuid IS UNIQUE
Note: uuid is a randomly generated UUIDv4 that is added to all nodes.
Note: :Segment is identified by the checksum field rather than uuid.
12-19-2019 12:25 AM
Welcome back Abed my friend
did you configure the page-cache size too?
what is the IO load of your system?
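For reference, these settings live in neo4j.conf (names as of the 3.5 line; the page-cache value below is only a placeholder and should be sized to the store files):
# heap, as described in the question
dbms.memory.heap.initial_size=13g
dbms.memory.heap.max_size=13g
# page cache; ideally large enough to hold the store
dbms.memory.pagecache.size=8g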
Use apoc.periodic.iterate with a batchSize of 100k, so each tx has 100k updates:
call apoc.periodic.iterate(
'// drive the batches: bind the talent once and stream the list items
 MATCH (t:Talent {talent_id: $talentID})
 UNWIND $list as item
 RETURN t, item',
'// executed per batch: look up the segment and merge the relationship
 MATCH (s:Segment {checksum: item.checksum})
 MERGE (t)-[r:BELONGS_TO]->(s)
 ON CREATE SET r.created_at = $created_at
 SET r.updated_at = $updated_at,
   r.affinity = item.affinity,
   r.audience_size_percentage = item.audience_size_percentage
', {batchSize:100000, params: {talentID:$talentID, list:$list, updated_at:$updated_at, created_at:$created_at}})
In general the model leads to super nodes, so that might be a bit of an issue.
Neo4j should still MERGE from the Segment, not the Talent, to check for the relationship, since a Segment has far fewer relationships to scan than the dense Talent node.
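One way to nudge that is to anchor the MERGE pattern on the Segment side (just a sketch; with both nodes already bound the planner may still choose its own expansion order, which EXPLAIN will show):
MATCH (s:Segment {checksum: item.checksum})
// start the relationship check from the low-degree Segment node
// instead of the ~900k-relationship Talent node
MERGE (s)<-[r:BELONGS_TO]-(t)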
In principle you could also take the already existing segment checksums and remove them from the list, like this:
MATCH (t:Talent {talent_id: $talentID})-[r:BELONGS_TO]->(s)
WITH t, collect(s.checksum) as existing
WITH t, [item IN $list WHERE NOT item.checksum IN existing] as filtered
...
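where the continuation could feed the filtered list back into the same write as before, e.g. (a sketch only; the snippet above ends at the ellipsis):
UNWIND filtered as item
MATCH (s:Segment {checksum: item.checksum})
MERGE (t)-[r:BELONGS_TO]->(s)
ON CREATE SET r.created_at = $created_at
SET r.updated_at = $updated_at,
    r.affinity = item.affinity,
    r.audience_size_percentage = item.audience_size_percentage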