11-01-2022 10:00 PM
We are trying to run the following script:
CALL apoc.periodic.commit(
  "MATCH (device:Device)
   WHERE device.rr_id = 'null'
     AND device.useragent_family <> 'Storebot'
     AND device.user_agent <> 'Shopify-Captain-Hook'
   WITH device LIMIT $limit
   SET device.rr_id = randomUUID(), device.rrid_time = datetime()
   WITH device, device.rr_id AS rr_id, device.rrid_time AS rrid_time
   CALL {
     WITH device, rr_id, rrid_time
     MATCH (device)-[:HAS_IDENTIFIER|HAS_FINGERPRINT|HAS_EMAIL|HAS_IP*]-(node)
     SET node.rr_id = rr_id,
         node.rrid_time = rrid_time
     RETURN count(*) AS sg_nodes
   }
   WITH device, sg_nodes
   RETURN count(*)",
  {limit:1, parallel:false});
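A hedged sketch of one possible mitigation (untested against our data): cap the variable-length expansion and skip paths that run through the two high-degree Identifiers shown in the table further down. The depth bound `*..4` and the hard-coded id list are assumptions to tune:

```cypher
CALL apoc.periodic.commit(
  "MATCH (device:Device)
   WHERE device.rr_id = 'null'
     AND device.useragent_family <> 'Storebot'
     AND device.user_agent <> 'Shopify-Captain-Hook'
   WITH device LIMIT $limit
   SET device.rr_id = randomUUID(), device.rrid_time = datetime()
   WITH device
   // Bounded depth instead of an unbounded '*' expansion.
   MATCH p = (device)-[:HAS_IDENTIFIER|HAS_FINGERPRINT|HAS_EMAIL|HAS_IP*..4]-(node)
   // Skip paths through the two supernodes (ids taken from the degree table).
   WHERE none(n IN nodes(p)
              WHERE n:Identifier AND n.id IN ['d41d8cd98f00b204e9800998ecf8427e', ''])
   SET node.rr_id = device.rr_id, node.rrid_time = device.rrid_time
   RETURN count(*)",
  {limit: 1});
```

Even with the depth capped, the same node can still be reached via multiple paths; the SET is idempotent, but a DISTINCT on `node` before the SET may cut the work further.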
And we are experiencing the following problem when running it:
Our Neo4j database crashes: it stops responding entirely after we start the query, and only a restart of the Neo4j instance helps.
The reason we are having this problem is that we have the following edge cases in our data model / data:
Device - 20 million nodes
Identifier - 4.5 million nodes
Fingerprint - 200 thousand nodes
IP - 3 million nodes
Email - 300 thousand nodes
Identifiers are connected to Devices; Devices are connected to IPs, Emails and Fingerprints.
Devices can be connected to each other via Email, IP, Fingerprint and Identifier nodes.
The relation between Identifier and Device is one-to-many: a single Identifier may have up to 50,000 connected Devices; have a look at the table below.
Top 5 Identifiers with the number of connected nodes:
MATCH (i:Identifier)--(r)
RETURN i.name, i.id, COUNT(r) AS rcnt ORDER BY rcnt DESC LIMIT 5;
╒═══════════════╤══════════════════════════════════════╤═══════╕
│"i.name" │"i.id" │"rcnt" │
╞═══════════════╪══════════════════════════════════════╪═══════╡
│"person_id" │"d41d8cd98f00b204e9800998ecf8427e" │4144791│
├───────────────┼──────────────────────────────────────┼───────┤
│"domain_userid"│"" │3551436│
├───────────────┼──────────────────────────────────────┼───────┤
│"domain_userid"│"a36f6585-cd3e-467c-ba74-f00684896d88"│ 53449│
├───────────────┼──────────────────────────────────────┼───────┤
│"domain_userid"│"bd73ee3b-a36f-4585-8d3e-967cba74f006"│ 17632│
├───────────────┼──────────────────────────────────────┼───────┤
│"domain_userid"│"b8cd94b0-3fda-4680-821c-35c3cbb96396"│ 9457│
└───────────────┴──────────────────────────────────────┴───────┘
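On Neo4j 5, a hedged variant of the degree query above reads each node's degree from the store instead of expanding every relationship, which is much cheaper when supernodes are present (on 4.x, `size((i)--())` plays the same role):

```cypher
MATCH (i:Identifier)
RETURN i.name, i.id, COUNT { (i)--() } AS rcnt
ORDER BY rcnt DESC LIMIT 5;
```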
We have 2 nodes with an extremely high number of connected nodes,
and we would like to exclude those two Identifiers because:
If we start from the Device node D1 and try to find paths with no limit on path length/depth, the number of paths grows exponentially.
So for a path of length 4, as shown in the diagram, the number of combinations becomes 1 x 1000 x 10 x 5 = 50000 paths.
And this is not the end in our case: we are trying to find even longer paths.
Devices: D1,D2,D3
Identifier: I1
Email: E1
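The arithmetic above can be checked with a quick script; the branching factors are the hypothetical ones from the example, and the 50,000-neighbour figure is roughly the scale of the top Identifiers in the degree table:

```python
from math import prod

# Branching factors along the length-4 example path:
# D1 -> 1 Identifier -> 1000 devices -> 10 -> 5 (figures from the example above).
branching = [1, 1000, 10, 5]
paths = prod(branching)
print(paths)  # 50000

# Routing one hop through an Identifier with ~50,000 neighbours
# multiplies the path count accordingly:
print(paths * 50_000)  # 2500000000
```

This is why an unbounded `*` expansion through the supernodes makes the query effectively unbounded in memory and time.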
The filter WHERE device.rr_id='null' above is used for performance, to avoid matching on NULLs.
The IDs we form connections on come from cookies we load on user devices.
The rest of the filter, AND device.useragent_family <> 'Storebot' AND device.user_agent <> 'Shopify-Captain-Hook', excludes nodes loaded via the Shopify API, because the Shopify API's IP is a Shopify server IP rather than a user IP, and those calls have no domain user id since no cookie is loaded for them.
We have tried the following troubleshooting steps:
Adjusted the memory settings using the recommendations from the neo4j-admin memrec utility:
# Based on the above, the following memory settings are recommended:
dbms.memory.heap.initial_size=31300m
dbms.memory.heap.max_size=31300m
dbms.memory.pagecache.size=79900m
#
# It is also recommended turning out-of-memory errors into full crashes,
# instead of allowing a partially crashed database to continue running:
dbms.jvm.additional=-XX:+ExitOnOutOfMemoryError
How can we address the above problems? What is a better approach?
PS
We are trying to use Weakly Connected Components, as described in https://neo4j.com/docs/graph-data-science/current/algorithms/wcc/, to identify disconnected subgraphs in our graph, but without success.
The code generated only 4 subgraphs, so a single rr_id ended up assigned to 3 million IP nodes.
// 1. Project graph (Approx time taken: 17 Seconds)
CALL gds.graph.project(
'rr_id-sub-graph',
'*',
'*',
{}
);
// 2. Run Weakly Connected Components algo (Approx time taken: 260 Seconds)
CALL gds.wcc.write('rr_id-sub-graph', {
writeProperty: 'componentId'
})
YIELD nodePropertiesWritten, componentCount;
// 3. Generate "rr_id" from "componentId" (Approx time taken: 140 Seconds)
CALL apoc.periodic.iterate(
"MATCH (n) RETURN n",
"SET n.rr_id = apoc.util.md5([n.componentId])",
{batchSize:1000, parallel:true});
// 4. Drop projected graph
CALL gds.graph.drop('rr_id-sub-graph') YIELD graphName;
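A hedged idea for the WCC run (untested against our data): if the graph collapses into a few giant components because of the two supernodes, a Cypher projection can leave them out before running WCC. The id list is taken from the degree table above and would need to be adapted:

```cypher
// Project everything except the two high-degree Identifier nodes.
CALL gds.graph.project.cypher(
  'rr_id-sub-graph-filtered',
  'MATCH (n)
   WHERE NOT (n:Identifier AND n.id IN ["d41d8cd98f00b204e9800998ecf8427e", ""])
   RETURN id(n) AS id',
  'MATCH (a)-[]->(b)
   WHERE NOT (a:Identifier AND a.id IN ["d41d8cd98f00b204e9800998ecf8427e", ""])
     AND NOT (b:Identifier AND b.id IN ["d41d8cd98f00b204e9800998ecf8427e", ""])
   RETURN id(a) AS source, id(b) AS target'
);

// Then run WCC on the filtered projection as before.
CALL gds.wcc.write('rr_id-sub-graph-filtered', {
  writeProperty: 'componentId'
})
YIELD nodePropertiesWritten, componentCount;
```

With the supernodes removed, the component count should rise from 4 to something closer to the real number of user clusters; the excluded nodes can then be assigned an rr_id separately.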