11-01-2022 10:00 PM
We are trying to run the following script:
CALL apoc.periodic.commit(
  "MATCH (device:Device)
   WHERE device.rr_id = 'null'
     AND device.useragent_family <> 'Storebot'
     AND device.user_agent <> 'Shopify-Captain-Hook'
   WITH device LIMIT $limit
   SET device.rr_id = randomUUID(), device.rrid_time = datetime()
   WITH device, device.rr_id AS rr_id, device.rrid_time AS rrid_time
   CALL {
     WITH device, rr_id, rrid_time
     MATCH (device)-[:HAS_IDENTIFIER|HAS_FINGERPRINT|HAS_EMAIL|HAS_IP*]-(node)
     SET node.rr_id = rr_id,
         node.rrid_time = rrid_time
     RETURN count(*) AS sg_nodes
   }
   WITH device, sg_nodes
   RETURN count(*)",
  {limit:1, parallel:false});
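A hedged sketch of one possible mitigation (untested against our data): cap the variable-length expansion and skip paths that run through the two high-degree Identifiers shown in the table further down. The depth bound `*..4` and the hard-coded id list are assumptions to tune:

```cypher
CALL apoc.periodic.commit(
  "MATCH (device:Device)
   WHERE device.rr_id = 'null'
     AND device.useragent_family <> 'Storebot'
     AND device.user_agent <> 'Shopify-Captain-Hook'
   WITH device LIMIT $limit
   SET device.rr_id = randomUUID(), device.rrid_time = datetime()
   WITH device
   // Bounded depth instead of an unbounded '*' expansion.
   MATCH p = (device)-[:HAS_IDENTIFIER|HAS_FINGERPRINT|HAS_EMAIL|HAS_IP*..4]-(node)
   // Skip paths through the two supernodes (ids taken from the degree table).
   WHERE none(n IN nodes(p)
              WHERE n:Identifier AND n.id IN ['d41d8cd98f00b204e9800998ecf8427e', ''])
   SET node.rr_id = device.rr_id, node.rrid_time = device.rrid_time
   RETURN count(*)",
  {limit: 1});
```

Even with the depth capped, the same node can still be reached via multiple paths; the SET is idempotent, but a DISTINCT on `node` before the SET may cut the work further.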
And we are experiencing the following problem when running it:
Our Neo4j database crashes: it stops responding entirely after we start the query, and only a restart of the Neo4j instance helps.
The reason we are having this problem is that we have the following edge cases in our data model / data:
Device - 20 million nodes
Identifier - 4.5 million nodes
Fingerprint - 200 thousand nodes
IP - 3 million nodes
Email - 300 thousand nodes
Identifiers are connected to Devices; Devices are connected to IPs, Emails and Fingerprints.
Devices can be connected to each other via Email, IP, Fingerprint and Identifier nodes.
The relation between Identifier and Device is one-to-many: a single Identifier may have up to 50,000 connected Devices; have a look at the table below.
Top 5 Identifiers with the number of connected nodes:
MATCH (i:Identifier)--(r)
RETURN i.name, i.id, COUNT(r) AS rcnt ORDER BY rcnt DESC LIMIT 5;
╒═══════════════╤══════════════════════════════════════╤═══════╕
│"i.name" │"i.id" │"rcnt" │
╞═══════════════╪══════════════════════════════════════╪═══════╡
│"person_id" │"d41d8cd98f00b204e9800998ecf8427e" │4144791│
├───────────────┼──────────────────────────────────────┼───────┤
│"domain_userid"│"" │3551436│
├───────────────┼──────────────────────────────────────┼───────┤
│"domain_userid"│"a36f6585-cd3e-467c-ba74-f00684896d88"│ 53449│
├───────────────┼──────────────────────────────────────┼───────┤
│"domain_userid"│"bd73ee3b-a36f-4585-8d3e-967cba74f006"│ 17632│
├───────────────┼──────────────────────────────────────┼───────┤
│"domain_userid"│"b8cd94b0-3fda-4680-821c-35c3cbb96396"│ 9457│
└───────────────┴──────────────────────────────────────┴───────┘
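On Neo4j 5, a hedged variant of the degree query above reads each node's degree from the store instead of expanding every relationship, which is much cheaper when supernodes are present (on 4.x, `size((i)--())` plays the same role):

```cypher
MATCH (i:Identifier)
RETURN i.name, i.id, COUNT { (i)--() } AS rcnt
ORDER BY rcnt DESC LIMIT 5;
```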
We have 2 nodes with an extremely high number of connected nodes,
and we would like to exclude those two Identifiers because:
If we start from the Device node D1 and try to find paths with no limit on path length/depth, the number of paths grows exponentially.
So for a path of length 4, as shown in the diagram, the number of combinations becomes 1 x 1000 x 10 x 5 = 50000 paths.
And this is not the end in our case: we are trying to find even longer paths.
Devices: D1,D2,D3
Identifier: I1
Email: E1
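The arithmetic above can be checked with a quick script; the branching factors are the hypothetical ones from the example, and the 50,000-neighbour figure is roughly the scale of the top Identifiers in the degree table:

```python
from math import prod

# Branching factors along the length-4 example path:
# D1 -> 1 Identifier -> 1000 devices -> 10 -> 5 (figures from the example above).
branching = [1, 1000, 10, 5]
paths = prod(branching)
print(paths)  # 50000

# Routing one hop through an Identifier with ~50,000 neighbours
# multiplies the path count accordingly:
print(paths * 50_000)  # 2500000000
```

This is why an unbounded `*` expansion through the supernodes makes the query effectively unbounded in memory and time.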
The filter WHERE device.rr_id='null' above is used for performance, to avoid matching on NULLs.
The IDs we form connections on come from cookies we load on user devices.
The rest of the filter, AND device.useragent_family <> 'Storebot' AND device.user_agent <> 'Shopify-Captain-Hook', excludes nodes loaded via the Shopify API, because the Shopify API's IP is a Shopify server IP rather than a user IP, and those calls have no domain user id since no cookie is loaded for them.
We have tried the following troubleshooting steps:
Adjusted the memory settings using the recommendations from the neo4j-admin memrec utility:
# Based on the above, the following memory settings are recommended:
dbms.memory.heap.initial_size=31300m
dbms.memory.heap.max_size=31300m
dbms.memory.pagecache.size=79900m
#
# It is also recommended turning out-of-memory errors into full crashes,
# instead of allowing a partially crashed database to continue running:
dbms.jvm.additional=-XX:+ExitOnOutOfMemoryError
How can we address the above problems? What is a better approach?
PS
We are trying to use Weakly Connected Components, as described in https://neo4j.com/docs/graph-data-science/current/algorithms/wcc/, to identify disconnected subgraphs in our graph, but without success.
The code generated only 4 subgraphs, so a single rr_id ended up assigned to 3 million IP nodes.
// 1. Project graph (Approx time taken: 17 Seconds)
CALL gds.graph.project(
'rr_id-sub-graph',
'*',
'*',
{}
);
// 2. Run Weakly Connected Components algo (Approx time taken: 260 Seconds)
CALL gds.wcc.write('rr_id-sub-graph', {
writeProperty: 'componentId'
})
YIELD nodePropertiesWritten, componentCount;
// 3. Generate "rr_id" from "componentId" (Approx time taken: 140 Seconds)
CALL apoc.periodic.iterate(
"MATCH (n) RETURN n",
"SET n.rr_id = apoc.util.md5([n.componentId])",
{batchSize:1000, parallel:true});
// 4. Drop projected graph
CALL gds.graph.drop('rr_id-sub-graph') YIELD graphName;
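A hedged idea for the WCC run (untested against our data): if the graph collapses into a few giant components because of the two supernodes, a Cypher projection can leave them out before running WCC. The id list is taken from the degree table above and would need to be adapted:

```cypher
// Project everything except the two high-degree Identifier nodes.
CALL gds.graph.project.cypher(
  'rr_id-sub-graph-filtered',
  'MATCH (n)
   WHERE NOT (n:Identifier AND n.id IN ["d41d8cd98f00b204e9800998ecf8427e", ""])
   RETURN id(n) AS id',
  'MATCH (a)-[]->(b)
   WHERE NOT (a:Identifier AND a.id IN ["d41d8cd98f00b204e9800998ecf8427e", ""])
     AND NOT (b:Identifier AND b.id IN ["d41d8cd98f00b204e9800998ecf8427e", ""])
   RETURN id(a) AS source, id(b) AS target'
);

// Then run WCC on the filtered projection as before.
CALL gds.wcc.write('rr_id-sub-graph-filtered', {
  writeProperty: 'componentId'
})
YIELD nodePropertiesWritten, componentCount;
```

With the supernodes removed, the component count should rise from 4 to something closer to the real number of user clusters; the excluded nodes can then be assigned an rr_id separately.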