
What is the most efficient way to Delete 20 Million Nodes from a database with over 7.5 Billion Nodes and 15 Billion relationships?

I have been having trouble doing a mass delete of the data observed on the first day. I am trying to delete around 20-30 million nodes each day to cycle my data.

I have 3 proposed methods, and they are all too slow (a rough sketch of method 1 follows the list).

Method 1:
step 1 - Mark all the nodes to be deleted in parallel using apoc.periodic
step 2 - Detach delete the marked user nodes using apoc.periodic (non-parallel)

Method 2:
step 1 - Mark all the nodes to be deleted in parallel using apoc.periodic
step 2 - Delete nodes using apoc.periodic + apoc.nodes.delete (non-parallel)

Method 3:
step 1 - Mark all the nodes to be deleted in parallel using apoc.periodic
step 2 - Delete all the relationships using apoc.periodic (non-parallel)
step 3 - Delete all the disconnected marked nodes in parallel using apoc.periodic
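
Roughly, method 1 looks like this (a sketch only: the label, date property, and cutoff mirror the queries posted later in this thread, and the batch sizes are illustrative):

// Step 1: mark the nodes to delete, in parallel.
CALL apoc.periodic.iterate(
  "MATCH (u:User) WHERE u.last_obs < datetime('2018-01-03') RETURN u",
  "SET u:MarkDel",
  {batchSize: 10000, parallel: true});

// Step 2: detach delete the marked users serially, to avoid lock contention.
CALL apoc.periodic.iterate(
  "MATCH (u:MarkDel:User) RETURN u",
  "DETACH DELETE u",
  {batchSize: 1000, parallel: false});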

Right now my estimate is that deleting 20 million nodes using method 1 will take 55 hours.
Method 2 is even slower.

Using method 3, it will take 6 hours to delete all the relationships, and I am testing how long it takes to delete all the nodes in parallel now that the relationships have been removed.

Looking for thoughts and ideas on improving this. Seems like Neo4j struggles at scale.

  • Neo4j 3.4.9
  • 32 vCPU, 244 GB RAM server with a 16,000 IOPS SSD

12kunal34
Graph Fellow

Hi @benjamin.squire ,

There is a trick to deleting a large dataset from Neo4j:
simply use LIMIT with each delete and perform the operation 100k records at a time.
It takes some manual effort, but it is much faster.
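
For example, a minimal sketch of that pattern (the label, property, and cutoff date are illustrative; re-run it until it returns 0):

// Delete at most 100k matching nodes per run, reporting how many were removed.
MATCH (u:User)
WHERE u.last_obs < datetime('2018-01-03')
WITH u LIMIT 100000
DETACH DELETE u
RETURN count(*);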

Just curious to know why you have to delete 20M nodes and relationships every day.

[1] If so, I believe the data model needs to be revisited.

[2] If cost and time allow, could we create one more graph DB, take only the needed data and relationships, and import them into the new graph DB instead of reworking the existing one (assuming the data is not sensitive and storage cost is not a big constraint)?

[3] I think this kind of operation would take time in any DB.

@12kunal34 I have deleted in this fashion using LIMIT 1000 and 10000; it takes 8-10 seconds for 1k and 80-90 seconds for 10k detach-deletes, hence the time estimate of 55 hours. Further, apoc.periodic.commit or iterate should batch the same way LIMIT does, so they should be considered equivalent.
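
For reference, the apoc.periodic.commit form of that batched-LIMIT pattern looks roughly like this (cutoff and batch size illustrative); it keeps re-running the statement until the returned count is 0:

CALL apoc.periodic.commit(
  "MATCH (u:User) WHERE u.last_obs < datetime('2018-01-03')
   WITH u LIMIT $limit
   DETACH DELETE u
   RETURN count(*)",
  {limit: 10000});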

@senthilc78 The reasoning for deleting 20M records per day comes down to the amount of data I am dealing with and the retention period. I need to maintain at least 1 year of clickstream records for 150 websites. Right now, at 10 months, the data is 7.5 billion nodes and 15 billion relationships, and it will be higher by the time I hit 12 months. I am using Community edition, which has a limit of 34 billion nodes and relationships; with a previous data model I had already hit that limit at 10 months, and since then I have cut the model down to the bare minimum. The 20M records per day corresponds to deleting the first day of logs from a year ago. Since I am adding about 20M records per day and getting closer to the 34 billion limit (plus the hardware constraints of my machine), I need to be able to recycle that space by deleting old logs which we identify as no longer required.

It's not a technical solution, but it does get you around the Community size limit: purchase an Enterprise license. Then you won't have a size limit and won't need to prune your data. It would also give you more flexibility, such as clustering, as your data grows. Otherwise, 2 hrs to prune that much data isn't that bad; you could consider moving to an hourly pruning schedule if daily is too large.

Got it, Benjamin. Thanks for the details. You got me thinking about time-to-live (TTL) nodes. I can see some third-party plugins for that, but I believe that with a trigger, a common counter, and a small change to the data model we could achieve it in Community edition as an alternative. Let me check that out in detail and update. Thanks!
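
As a rough illustration of that idea (assuming APOC triggers are enabled via apoc.trigger.enabled=true; the trigger name, property, and retention period are made up):

// Stamp every newly created node with an expiry timestamp; a daily batched
// pruning job can then delete nodes whose expires_at has passed.
CALL apoc.trigger.add(
  'stamp-expiry',
  "UNWIND $createdNodes AS n SET n.expires_at = datetime() + duration('P365D')",
  {phase: 'after'});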

What are your batch sizes for the periodic.iterate calls?
If you observe the machine, is it bound on IO or CPU?

What are the degrees of the nodes you are deleting?
Are those nodes related/connected to each other, or unrelated?

I will try to reproduce your use-case and see how it behaves on our test machine.

Also what is your heap config and your page-cache size?
And how do you invoke the procedures?

@michael.hunger The batch sizes I tested were 1k and 10k. Using method 1, detach-deleting nodes in serial batches took 8-10 seconds for 1k and 80-100 seconds for 10k. When I tested apoc.nodes.delete (method 2), it was even slower: 15-17 seconds for 1k, and 10k took almost 2.5 minutes per batch. Method 3, where I delete rels in batches, was much faster: 1k took 5-6 seconds and 10k took about 30 seconds. The estimated time for method 1 is therefore around 55 hours ((20 MM / 10,000-node batches) * 100 sec ≈ 55 hrs). Method 2 was even slower, and method 3 finished deleting all the rels in 7 hours and the remaining floating nodes in 9 hours.

I don't have the exact degree of each node, but there should be an average of 4 nodes connected to each user, given there were 5 MM users on the first day and 20 MM connected nodes; I will have to do some sampling to give a direct answer. They do have some connecting third-party nodes.
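
A quick sampling query along those lines might look like this (untested; size((u)--()) is the node's total degree):

MATCH (u:User)
WHERE u.last_obs < datetime('2018-01-03')
WITH u LIMIT 10000
RETURN avg(size((u)--())) AS avgDegree, max(size((u)--())) AS maxDegree;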

dbms.memory.heap.initial_size=25g
dbms.memory.heap.max_size=25g
dbms.memory.pagecache.size=165g

Total machine memory is 244 GB, and all of it is dedicated to the Neo4j server.

This is method 3 (deleting rels serially, then deleting the remaining disconnected nodes in parallel):

CALL apoc.periodic.iterate(
  "MATCH (u:User) WHERE u.last_obs < datetime('2018-01-03') RETURN u",
  "MATCH (u) WITH apoc.coll.toSet([(u)-[:OBSERVED_WITH]->(a) | a]) + collect(distinct u) AS nodes UNWIND nodes AS n SET n:MarkDel",
  {batchSize:100, iterateList:true, parallel:true, retries:3});

CALL apoc.periodic.iterate(
  "MATCH (u:MarkDel:User)-[r]->() RETURN r",
  "DELETE r",
  {batchSize:10000});

CALL apoc.periodic.iterate(
  "MATCH (u:MarkDel) WHERE size((u)--()) = 0 RETURN u",
  "DETACH DELETE u",
  {batchSize:1000, iterateList:true, parallel:true, retries:3});

The reason I mark :MarkDel but only delete rels from Users is: 1) users are the central connector in the graph, so everything that is connected has a rel from a user, i.e. (User)-[r]->(id); 2) I don't want to delete every (id) connected to a :MarkDel user, because some (id)s also touch more recent users; an (id) connected to a user from the first day may also have been observed connected to a user yesterday, and I don't want those deleted. So the third iterate looks for orphaned nodes after I delete all the users which connect them. Note: there is no race condition once all the nodes' rels are deleted, because they all have degree 0.

Did you observe the machine during the deletion process to see where the bottleneck was?
Also check with iotop how much write throughput is happening.

For single-threaded deletes you can also increase the batch size to somewhere between 100k and 10M.
One other thing that could help is to order the to-be-deleted rels by id, but I have to try that first.
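
Untested, but the ordered variant of the driving statement would look something like:

CALL apoc.periodic.iterate(
  "MATCH (:MarkDel:User)-[r]->() RETURN r ORDER BY id(r)",
  "DELETE r",
  {batchSize: 100000});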

One other thing one could do is to run a clustering algo on the :MarkDel nodes and delete clusters (nodes and rels) in parallel.

I checked iotop and these are the results. I upped the serial rel-delete batch size from 1,000 to 1,000,000:

Total DISK READ :      23.07 M/s | Total DISK WRITE :	 1703.66 K/s
Actual DISK READ:      23.01 M/s | Actual DISK WRITE:       0.00 B/s
   TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND                                                                     
 50759 be/4 neo4j      10.91 M/s   46.25 K/s  0.00 % 73.48 % java -cp /var/lib/neo4j/plugins:/etc/ne~=/var/lib/neo4j --config-dir=/etc/neo4j
 46510 be/4 neo4j      11.14 M/s   34.69 K/s  0.00 % 70.67 % java -cp /var/lib/neo4j/plugins:/etc/ne~=/var/lib/neo4j --config-dir=/etc/neo4j
 42817 be/4 neo4j    1040.70 K/s 1622.72 K/s  0.00 % 15.63 % java -cp /var/lib/neo4j/plugins:/etc/ne~=/var/lib/neo4j --config-dir=/etc/neo4j
 43008 be/4 neo4j	0.00 B/s    0.00 B/s  0.00 %  0.00 % java -cp /var/lib/neo4j/plugins:/etc/ne~=/var/lib/neo4j --config-dir=/etc/neo4j

Are these expected results? Do I need a higher-IOPS SSD? Currently it is a 16,000 IOPS EBS volume.

Tasks: 344 total,   2 running, 182 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.2 us,  0.1 sy,  0.0 ni, 97.3 id,  2.4 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 25174608+total,  1702016 free, 20220547+used, 47838596 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 47205400 avail Mem 

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                               
 42623 neo4j     20   0  210.2g 192.0g   3656 S  25.9 80.0   1961:36 java                                             

Those numbers are not really high.
I think we should see at least 10x that (250 MB/s) on IO.

CPU utilization is near zero; it's all I/O wait.

I need to test this on our machine to see how it behaves there.

I have reviewed my EC2 instance and everything appears to be configured correctly; Neo4j is heavily underutilizing the available hardware.

ec2 - 32 vCPUs - 244 GB Ram - 8 TB 16000 IOPS Provisioned SSD - r4.8xlarge

Instance type | EBS-optimized by default | Maximum bandwidth (Mbps) | Maximum throughput (MB/s, 128 KB I/O) | Maximum IOPS (16 KB I/O)
r4.8xlarge    | Yes                      | 7,000                    | 875                                   | 37,500

These instance types can support maximum performance for 30 minutes at least once every 24 hours. For example, r4.8xlarge instances can deliver 875 MB/s for 30 minutes at least once every 24 hours.

I have reached out to AWS support team to see what their opinion on this is.

I forgot to mention one thing: due to the way the compiled runtime works, it doesn't behave well in periodic.iterate, which is why it's best to prefix the driving statement with cypher runtime=slotted.

Here I have tried to delete 1M rels, which took about 90 s:

 Call apoc.periodic.iterate("cypher runtime=slotted Match (:MarkDel)-[r]->() return r limit 1000000", "delete r",{batchSize:100000});
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| batches | total   | timeTaken | committedOperations | failedOperations | failedBatches | retries | errorMessages | batch                                                     | operations                                                          | wasTerminated |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 10      | 1000000 | 91        | 1000000             | 0                | 0             | 0       | {}            | {total -> 10, committed -> 10, failed -> 0, errors -> {}} | {total -> 1000000, committed -> 1000000, failed -> 0, errors -> {}} | false         |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row
92048 ms

I'm recording my experiments here:

I have now deleted 82,291,607 rels on our machine, and it took roughly 2 hrs 25 minutes.

With disk I/O going up to these levels:

Total DISK READ :       0.00 B/s | Total DISK WRITE :     772.07 M/s
Actual DISK READ:       0.00 B/s | Actual DISK WRITE:    1790.73 M/s

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| batches | total    | timeTaken | committedOperations | failedOperations | failedBatches | retries | errorMessages | batch                                                       | operations                                                            | wasTerminated |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 823     | 82291607 | 8677      | 82291607            | 0                | 0             | 0       | {}            | {total -> 823, committed -> 823, failed -> 0, errors -> {}} | {total -> 82291607, committed -> 82291607, failed -> 0, errors -> {}} | false         |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row
8677673 ms

Now off to delete the nodes.

gigauser
Graph Buddy

We did it with the code below, and it took a lot of time, so we tried Michael's guide, but that took a little longer than our method:

MATCH (n:News)-[r]-(w:Word2)
WITH r LIMIT 10000
DELETE r;

#!/bin/bash
# Run the 10k-relationship delete batch 50,772 times via cypher-shell.
counter=1

while [ $counter -le 50772 ]
do
  echo $counter
  time cat cypher/delete.cypher | bin/cypher-shell -a bolt://192.168.1.:* -u neo4j -p "********"
  ((counter++))
done

Btw, I realized that I had too little page cache configured on that machine; that's why it took much longer than expected. If you can increase your page cache, it works better. Also, you can delete more per batch, e.g. 100k.

I would have used APOC's periodic.iterate instead so you don't start a new driver/connection for every query.
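
E.g. something roughly equivalent to the shell loop above, in a single call (batch size illustrative):

CALL apoc.periodic.iterate(
  "MATCH (:News)-[r]-(:Word2) RETURN r",
  "DELETE r",
  {batchSize: 100000});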

In graph-algos we're currently trying to use clustering (e.g. unionFind) to find independent clusters whose relationships can then be created/deleted concurrently. That might also be an option for you.

In graph-algos we're currently trying to use clustering (e.g. unionFind) to find independent clusters whose relationships can then be created/deleted concurrently.

@michael.hunger Does this imply that relationships should be able to be deleted in parallel if neither of a relationship's nodes is also being deleted? I tried to run a parallel delete of relationships via apoc.path.subgraphAll to extract fully disjoint subgraphs (which I had thought would be similar to your suggested unionFind approach above), but got NullPointerExceptions in the logs.

e.g.

CALL apoc.periodic.iterate(
  "MATCH (metadata:METADATA)
  WHERE metadata.uri STARTS WITH $sourceUri
  RETURN metadata",
  "CALL apoc.path.subgraphAll(metadata, {relationshipFilter: 'METADATA>,RELATIONSHIP'}) YIELD relationships
  UNWIND relationships as r
  DELETE r",
  { batchSize: 1000, iterateList: true, parallel: true, concurrency: 50, params: { sourceUri: $sourceUri } }
)

@jameslaneconkling What @michael.hunger was talking about: when you try to delete in parallel, the reason you get NullPointerException errors is basically that there is a race condition if you can't guarantee segregation of nodes and relationships per thread. When he mentioned unionFind, he is saying that if you can cluster all the data up front (e.g. by running subgraphAll over all of your data) and then use periodic.iterate so that each thread only deletes rels within a given community, you can avoid the race condition.

What you are doing is gathering all the rels for each node from your first query, but on successive iterations some of those nodes may already have been deleted because they were reachable via subgraphAll in one of the other batches. To delete in parallel, you must make sure that for each "UNWIND ... DELETE r", no other thread touches the data being deleted in a different thread.

Hence: run unionFind over all the data so it writes a community property (write:true), or stream the clusters (unionFind.stream()) into a periodic.iterate, then delete the clusters in parallel so that each thread's batch contains data exclusive to that thread and cannot be accessed by another thread.
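
A rough sketch of that flow, assuming the legacy graph-algorithms plugin's algo.unionFind (option names such as partitionProperty vary between plugin versions, and 'community' is an arbitrary property name):

// 1) Write a community/partition id onto every :MarkDel node.
CALL algo.unionFind('MarkDel', null, {write: true, partitionProperty: 'community'});

// 2) Delete one community per transaction so no two threads touch the same subgraph.
//    An index on :MarkDel(community) helps the inner match.
CALL apoc.periodic.iterate(
  "MATCH (n:MarkDel) RETURN DISTINCT n.community AS community",
  "MATCH (m:MarkDel) WHERE m.community = community DETACH DELETE m",
  {batchSize: 1, parallel: true, concurrency: 8});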

Hope that helps

Hi Benjamin,

Did AWS get back to you on why the disk was being underutilized? We are seeing 1/10th the max throughput on our Neo4j box, even though iotop says 99% for most threads.

Any ideas?

Yes, get a different EC2 type. After working for a few months on various boxes, I found the only way Neo4j is performant is on i3-type instances. EBS will not perform well under high IO loads, and it is throttled based on the size of the box, i.e. 16xlarge types get throttled less. Using local instance storage, although ephemeral and a bit of a pain to back up to EBS/S3 as a gzip, is 10-20x faster. If you don't want to use i3, look at the Nitro-based systems; we did not go that route because the m5 and c5 instance types with Nitro didn't offer enough RAM and had too much CPU for us to use properly. i3 or i3en are your best bets. I get reads and writes exceeding 1000 M/s, whereas with EBS at 32,000 IOPS (which is very expensive, I might add) we were only getting 100 K/s, if that.

Thanks Benjamin, that helps a lot. We will be using equivalent machines on Azure (Ls_v2). Out of curiosity, how do you manage your database backups? If I understand correctly, having a non-EBS store means that your data is much more susceptible to loss.

Yes, if the EC2 instance freezes or bugs out, the DB needs to be rebuilt from scratch. We are working on our own auto-deployment, but at a minimum, to keep the data backed up, compress it with pigz or zstd on a daily basis and upload it to S3. I ran a compression test between zstd and gzip with a 170 GB tar file:
zstd - start 16:58, end 17:28, size 58G
gzip - start 18:01, end 20:38, size 56G
Note that as you get into the TB range it will inevitably get slower.
Alternatively, you could just upload the database directly to S3. As long as it is stored in the cloud somewhere, it is not too bad.

Ah, we used to do something similar, but with 2.3 TB of data it's painfully slow. Now we run neo4j-backup on another machine and snapshot that disk. We keep 3 snapshots around in case one of them is corrupt.

How does your neo4j-backup-on-another-machine-plus-snapshot setup work? Could you describe the process? I'm worried about backups; we have not scaled to the full DB size yet, which makes me nervous about when we reach the 2+ TB mark.

The backup tool can be run on an external machine, as long as the backup port of your production Neo4j is accessible to that machine. Most of the workflow is managed through a bunch of Lambdas, but in a nutshell here is what happens:

  1. Allocate a new ec2 machine with neo4j installed
  2. Mount a new volume with enough space to store a full backup of production neo4j
  3. Initialize the backup process by calling neo4j-backup -from $EXTERNAL_IP on the newly allocated box
  4. Wait for the backup process to finish, once done initiate EBS snapshotting
  5. Wait for snapshotting to finish, then deallocate the machine and destroy the EBS volume

You can choose to keep the backups lying around on your EBS volume rather than taking a snapshot. We mainly do it to save cost, as 4 TB of SSD is much more expensive than storing a 2.3 TB snapshot. If you have a large database, you also want to have decent bandwidth on your backup machine.

sam
Node Clone

Thanks Ben,

We have a database that is 2.3TB in size so your suggestion will really help us manage the scale.