Neo4j

doug · ‎07-11-2022

For some reason I was unable to post the following with <code> or <pre> so I've included it as an image. I was getting a 404 on submit of this form.

david_allen · ‎07-12-2022

If you're a tiny company and you want to use Neo4j EE, first thing is you should check out the Neo4j Startup Program which can permit you to use EE in certain circumstances. Scroll to the bottom of the page and identify the heading that says: the Startup Self-Hosted Option.

Transaction logs are super important for the database. The situation you're describing is something I've seen before that can occur when your disk fills up. To keep data in Neo4j safe, we log transactions to disk. If your database gets corrupted, we can re-apply transactions in order, just like a git repo re-applies commits to get up to the current state of a git repo. None of this works if the database or transaction logs are corrupt or broken. Can't say for sure, but like I said a common reason why this could occur is failure to log the TXs in the first place because of disk IO errors or full disk.

To diagnose and come further, probably the best you can try is to use the option (fail_on_missing_files=false) and then come back with a separate question with the full logs of *that* operation and study the stack trace carefully to figure out what it is saying. Sadly it is possible to be missing so much out of the transaction logs or database that recovery isn't possible, but you haven't said enough to establish that's the case here.

doug · ‎07-12-2022

Thank you very much David. Your help is greatly appreciated. I am currently running the instance with the last working version of the database. I can still access data in this version, but I get the "Recovery required from position..." sequence mentioned above on start:

2022-07-12 14:14:19.164+0000 INFO  [o.n.k.d.Database] [neo4j/2443e357] Commits found after last check point (which is at LogPosition{logVersion=0, byteOffset=191545629}). First txId after last checkpoint: 98943 
2022-07-12 14:14:19.165+0000 INFO  [o.n.k.d.Database] [neo4j/2443e357] Recovery required from position LogPosition{logVersion=0, byteOffset=191545629}
2022-07-12 14:14:21.466+0000 INFO  [o.n.k.r.Recovery] [neo4j/2443e357]   10% completed
2022-07-12 14:14:21.803+0000 INFO  [o.n.k.r.Recovery] [neo4j/2443e357]   20% completed
2022-07-12 14:14:22.159+0000 INFO  [o.n.k.r.Recovery] [neo4j/2443e357]   30% completed
2022-07-12 14:14:22.840+0000 INFO  [o.n.k.r.Recovery] [neo4j/2443e357]   40% completed
2022-07-12 14:14:23.167+0000 INFO  [o.n.k.r.Recovery] [neo4j/2443e357]   50% completed
2022-07-12 14:14:23.494+0000 INFO  [o.n.k.i.a.i.IndexingService] [neo4j/2443e357] IndexingService.init: indexes not specifically mentioned above are ONLINE
2022-07-12 14:14:26.768+0000 INFO  [o.n.k.r.Recovery] [neo4j/2443e357]   60% completed
2022-07-12 14:14:28.446+0000 INFO  [o.n.k.r.Recovery] [neo4j/2443e357]   70% completed
2022-07-12 14:14:29.775+0000 INFO  [o.n.k.r.Recovery] [neo4j/2443e357]   80% completed
2022-07-12 14:14:30.859+0000 INFO  [o.n.k.r.Recovery] [neo4j/2443e357]   90% completed
2022-07-12 14:14:31.757+0000 INFO  [o.n.k.r.Recovery] [neo4j/2443e357]   100% completed
2022-07-12 14:14:31.781+0000 INFO  [o.n.k.d.Database] [neo4j/2443e357] Recovery completed. 195308 transactions, first:98943, last:294250 recovered, time spent: 12s 612ms
2022-07-12 14:14:31.788+0000 INFO  [o.n.k.i.a.i.IndexingService] [neo4j/2443e357] IndexingService.start: indexes not specifically mentioned above are ONLINE
2022-07-12 14:14:31.816+0000 INFO  [o.n.k.i.t.l.c.CheckPointerImpl] [neo4j/2443e357] Checkpoint triggered by "Recovery completed." @ txId: 294250 checkpoint started...
2022-07-12 14:14:32.541+0000 INFO  [o.n.k.i.t.l.c.CheckPointerImpl] [neo4j/2443e357] Checkpoint triggered by "Recovery completed." @ txId: 294250 checkpoint completed in 724ms
2022-07-12 14:14:32.607+0000 INFO  [o.n.k.i.s.f.RecordFormatSelector] [neo4j/2443e357] Selected RecordFormat:StandardV4_0[SF4.0.0] record format from store /var/lib/neo4j/data/databases/neo4j

This happens every time I restart neo4j. So it's like it's repairing in memory only or something.

I updated the neo4j.conf to include "dbms.recovery.fail_on_missing_files=false", but when I restart neo4j server I see exactly the sequence above again. So that doesn't seem to have an impact in the current situation.

It's odd because I can see data and access it etc, but then when I run more inserts / updates / deletes it goes into the "recovery not possible" state mentioned in the first line and the database fails. There is plenty of disk space and I doubled disk space during my experiements in case that was the issue.

Here is the full dump from the debug.log (with urls removed) when I do a restart of the neo4j database. I had to do it as an drive link as there is a character limit in this post and I couldn't attach a txt file. I hope that's ok.

What would you recommend next please? Thank you very much!

doug · ‎07-12-2022

Thank you very much for your help on this. It's greatly appreciated. For some reason my reply yesterday doesn't seem to be on here, but I tried this approach. I tried it on a working version of the database which has the inconsistancy issue (i.e. before it reached the stage that it would completely fail to load). I get the same 0-100% recovery steps mentioned above:

2022-07-13 05:58:14.950+0000 INFO  [o.n.k.d.Database] [neo4j/2443e357] Commits found after last check point (which is at LogPosition{logVersion=0, byteOffset=191545629}). First txId after last checkpoint: 98943 
2022-07-13 05:58:14.951+0000 INFO  [o.n.k.d.Database] [neo4j/2443e357] Recovery required from position LogPosition{logVersion=0, byteOffset=191545629}
2022-07-13 05:58:17.190+0000 INFO  [o.n.k.r.Recovery] [neo4j/2443e357]   10% completed
2022-07-13 05:58:17.535+0000 INFO  [o.n.k.r.Recovery] [neo4j/2443e357]   20% completed
2022-07-13 05:58:17.897+0000 INFO  [o.n.k.r.Recovery] [neo4j/2443e357]   30% completed
2022-07-13 05:58:18.580+0000 INFO  [o.n.k.r.Recovery] [neo4j/2443e357]   40% completed
2022-07-13 05:58:18.942+0000 INFO  [o.n.k.r.Recovery] [neo4j/2443e357]   50% completed
2022-07-13 05:58:19.299+0000 INFO  [o.n.k.i.a.i.IndexingService] [neo4j/2443e357] IndexingService.init: indexes not specifically mentioned above are ONLINE
2022-07-13 05:58:22.669+0000 INFO  [o.n.k.r.Recovery] [neo4j/2443e357]   60% completed
2022-07-13 05:58:24.501+0000 INFO  [o.n.k.r.Recovery] [neo4j/2443e357]   70% completed
2022-07-13 05:58:25.966+0000 INFO  [o.n.k.r.Recovery] [neo4j/2443e357]   80% completed
2022-07-13 05:58:27.095+0000 INFO  [o.n.k.r.Recovery] [neo4j/2443e357]   90% completed
2022-07-13 05:58:28.081+0000 INFO  [o.n.k.r.Recovery] [neo4j/2443e357]   100% completed
2022-07-13 05:58:28.102+0000 INFO  [o.n.k.d.Database] [neo4j/2443e357] Recovery completed. 195315 transactions, first:98943, last:294257 recovered, time spent: 13s 150ms
2022-07-13 05:58:28.112+0000 INFO  [o.n.k.i.a.i.IndexingService] [neo4j/2443e357] IndexingService.start: indexes not specifically mentioned above are ONLINE
2022-07-13 05:58:28.139+0000 INFO  [o.n.k.i.t.l.c.CheckPointerImpl] [neo4j/2443e357] Checkpoint triggered by "Recovery completed." @ txId: 294257 checkpoint started...
2022-07-13 05:58:28.861+0000 INFO  [o.n.k.i.t.l.c.CheckPointerImpl] [neo4j/2443e357] Checkpoint triggered by "Recovery completed." @ txId: 294257 checkpoint completed in 721ms

If I restart neo4j get this every time, but then if new transactions are added it seems to "tip it over the edge" and then the database goes into the failed state where I can't get it to run anymore.

So it's like it's fixing the in memory version but not actually affecting the on disk transaction logs or something.

Here is the full debug log. I link to it because I can't include it in this post due to character length and attachment type restrictions.

Thanks so so much, I really appreciate your help. And thank you also for the info about the new pricing setup. I will definately check out Aura.

david_allen · ‎07-13-2022

Unfortunately these logs don't have enough information to spot what's going on, in fact the database looks like system and neo4j are starting relatively normally. Can't tell what the issue is or whether data is missing from this.

In general if you complete a recovery and your database comes up, then you should be good. If post-recovery, new transactions cause corruption again, then this needs to be captured and reported in a log we could take a look at.

doug · ‎07-13-2022

Hi David. Thank you very much. I've just stepped through the sequence which causes the database failure (it's exactly the same every time I do it in case that's helpful).

The steps are:

Database running OK, but having gone through the 0-100% recovery thing mentioned
Restarted database to check that restarting doesn't cause it to fail at this point. Again database runs OK, but goes through the 0-100% recovery thing.
Run my script which updates/deletes/inserts a fair bit of data
Database still running
Run sudo service neo4j restart
Neo4j service runs, but the "neo4j" database now fails with the following error.

"Error reading transaction logs, recovery not possible. To force the database to start anyway, you can specify 'unsupported.dbms.tx_log.fail_on_corrupted_log_files=false'. This will try to recover as much as possible and then truncate the corrupt part of the transaction log. Doing this means your database integrity might be compromised, please consider restoring from a consistent backup instead."

The script I run at step 3 queries a large number of nodes. Might it be that it queries some of the nodes recovered in the 0-100% thing and thus reads bad data into the transaction log or something?

Here is the full debug log. I've annotated the points at which each of the steps above were executed for convenience.

Thanks so much again. Your help is really very much appreciated.

david_allen · ‎07-18-2022

This is a tough one. I will attempt to check with our internal engineering teams and find any pointers I can on what's going on, and check back if I can.

doug · ‎09-02-2022

Thanks. I've actually sorted it by a different means. I had no success with changing settings in the conf file or forcing checkpoints etc.

The solution that worked for me was:

Used the apoc.export.cypher.all function to export a cypher-shell script based backup of the data
Stopped the database
Removed everything in /var/lib/neo4j/data/
Restarted the database to create a new "neo4j" database.
Created the user account credentials
Ran the /usr/bin/cypher-shell script to restore the database from the scripted backup

This keeps all the settings, gives a completely new database and populates it with the data from the original database. It looks like everything is there and there were no errors with this approach. Needed to scale up the server to get a bit more memory during the restore, but that was easy enough. I am assuming that this is all ok because it was the transaction log rather than the data store that was corrupted.

After doing this I was able to run my pre-calculation scripts without error.

So now I cross my fingers and hope it's all intact!

Thanks for your help David.