Neo4j

wookie · ‎05-30-2020

Hello, I'm Lukas. Since this is my first post on the forums, I would like to take this opportunity to thank the Neo4j team for making the dream of a swift, easy-to-use graph database come true, and the whole community for the support that helps rookies overcome their issues.

On to the problem. I am running a locally hosted Neo4j Community Edition 4.0.4 server on Linux (18.04.4 LTS) with Neo4j Python Driver 4.0; the problem also occurred with a Neo4j Community Edition 3.5.18 server and Neo4j Python Driver 1.7. In short: data imported with the neo4j-admin import tool persists after reconnecting to the local server (localhost), but data imported with the Neo4j Python Driver persists only for the duration of the connection it was imported in. After reconnecting, the driver-imported data is gone (as seen in Neo4j Browser), which suggests that it was either held only in memory or that the transaction was not properly committed. The details of the problem and of my import procedure are described below.

I'm developing an archaeological database, and since the standard import tools do not cope with more complex data patterns, I wrote my own import procedure using the Neo4j driver and divided it into four steps. They are executed from the main Python import script.

First, I wipe all database data so that the import tool can be used. To do so, I remove the contents of two directories: data/transactions/<db_name> and data/databases/<db_name>.
Second, I use the standard import tool provided by neo4j-admin to load the so-called attributary data. These are nodes and relationships arranged in tree-like structures that are immutable during application usage and are used to describe the proper data. From the main script I call:

subprocess.run(f"./bin/neo4j-admin import {nodes} {relationships} --database={args.dbname} --report-file={importReportPath}", stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True, check=True)

where "{nodes}" and "{relationships}" are sequences of "--nodes=<CSV_FILE>" and "--relationships=<CSV_FILE>" parameters that pass the paths of the node and relationship files. This step works as expected and the data is loaded.
Example of a node data file:

sexId:ID(sexGroup),lang_pl__name:string,lang_en__name:string,:LABEL
SEX_male,mężczyzna,male,Sex
SEX_female,kobieta,female,Sex
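
The neo4j-admin call from the second step can also be assembled as a list of arguments, which avoids shell=True and any quoting problems with file paths. This is only a sketch: build_import_command, run_import and the default admin_bin path are names made up for illustration, while the flags are the ones already used above.

```python
import subprocess


def build_import_command(node_files, relationship_files, db_name, report_path,
                         admin_bin="./bin/neo4j-admin"):
    """Assemble the neo4j-admin import call as a list of arguments."""
    command = [admin_bin, "import"]
    # One --nodes / --relationships flag per CSV file.
    command += [f"--nodes={path}" for path in node_files]
    command += [f"--relationships={path}" for path in relationship_files]
    command += [f"--database={db_name}", f"--report-file={report_path}"]
    return command


def run_import(command):
    # Passing a list (no shell=True) hands each argument to the tool unchanged.
    return subprocess.run(command, stdout=subprocess.PIPE,
                          stderr=subprocess.PIPE, check=True)
```

With check=True, a non-zero exit code of the import tool raises subprocess.CalledProcessError, so a failed import cannot go unnoticed by the main script.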

Third, I use the Python Driver and self-written scripts to import data from other files inside explicit transactions. This is the pattern I use to create new nodes and relationships:
a) At the beginning, I create the driver and session instances in an object of my own Graph class:

self.driver = GraphDatabase.driver("bolt://localhost:7687", auth=(login, password), encrypted=False)
self.session = self.driver.session()
self.transaction = None

b) Then, based on my own data schema stored in a JSON file, I build a Cypher query that creates unique constraints. Once the query is built, I run it in an auto-commit transaction:

self.session.run(query)
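
As an illustration of step b), the constraint statements could be generated like this. The schema layout (a mapping from label to a list of unique properties) and the function name are assumptions made for this sketch and not the actual JSON schema; the statement uses the CREATE CONSTRAINT ON ... ASSERT ... IS UNIQUE form accepted by the 3.5 and 4.0 servers mentioned above.

```python
def build_unique_constraint_queries(schema):
    """Return one CREATE CONSTRAINT statement per (label, property) pair.

    `schema` is a hypothetical mapping such as {"Sex": ["sexId"]}.
    """
    queries = []
    for label, properties in sorted(schema.items()):
        for prop in properties:
            # Backticks keep labels and property names with unusual characters valid.
            queries.append(
                f"CREATE CONSTRAINT ON (n:`{label}`) ASSERT n.`{prop}` IS UNIQUE"
            )
    return queries
```

Each returned statement would then be passed to session.run(query) on its own, since schema changes and data changes cannot share a transaction.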

c) This is where the actual data import begins. I want to make sure that the whole imported dataset is consistent, so whenever an error occurs while importing a data row, I roll back everything imported so far. That's why I do:

graph.beginTransaction()
if loadDataToGraph(data, graph):
  graph.commit()
else:
  graph.rollback()
# beginTransaction() method body: self.transaction = self.session.begin_transaction()
# commit() method body: self.transaction.commit()
# rollback() method body: self.transaction.rollback()

In the body of loadDataToGraph, the graph.match and graph.create methods are called. As with the unique constraints, they build Cypher queries using either MATCH or CREATE clauses. After building the query, both methods call graph.explicitRun(query), whose body is:

if self.transaction is not None and not self.transaction.closed():
  return self.transaction.run(query)
else:
  raise NoBeganTransaction(query)
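
The all-or-nothing idea from step c) can be condensed into one helper. This is a sketch and not my actual Graph class: import_all_or_nothing and load_row are illustrative names, and the session argument only needs begin_transaction(), with the returned transaction offering commit() and rollback() as the driver's explicit transactions do.

```python
def import_all_or_nothing(session, rows, load_row):
    """Run load_row(tx, row) for every row inside one explicit transaction.

    Commits only if every row succeeds; on the first error everything is
    rolled back and False is returned.
    """
    tx = session.begin_transaction()
    try:
        for row in rows:
            load_row(tx, row)
    except Exception:
        tx.rollback()
        return False
    tx.commit()
    return True
```

Here load_row would call tx.run(query, ...) for a single data row. An error raised by commit() itself is deliberately not swallowed, so a failed commit stays visible to the caller.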

d) After the whole dataset has been imported successfully, the connection is closed with a graph.close() call. Its body is:

if self.transaction is not None and not self.transaction.closed():
  self.transaction.commit()  # Just to make sure that the transaction is committed
if not self.session.closed():
  self.session.close()
if not self.driver.closed():
  self.driver.close()

The import is done and the Python script ends. The data, both proper and attributary, is present in Neo4j Browser (16k+ nodes). After terminating the connection and connecting again (the "console" and "start" commands give the same behaviour), the proper data disappears and only the attributary data, which was loaded in the standard way from CSV files with the import tool, persists (14k+ nodes).
Another odd thing I discovered is that my transactions/<db_name>/neostore.transaction.db.0 file is 262.1 MB in size and consists mostly of zeros. It begins like this:

0700 0000 0000 0000 0000 0000 0000 0001
0000 0172 65ea d03a 4939 5b28 e9a4 b883
302e 302e 3446 5307 0000 0172 65ea d03a
0000 0000 0000 0001 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
... ... ...

and from there to the end of the file there is nothing but zeros. By contrast, my databases/<db_name> directory is only 2.7 MB in total. Should it look like this?
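
To check how much of such a log file actually holds data, one can look for the offset of the last non-zero byte. A minimal sketch; the function name and chunk size are illustrative. As a side note, 262.1 MB is about 250 MiB (262,144,000 bytes), which would be consistent with a file that was preallocated at a fixed size and not with a damaged one, but that is only an assumption.

```python
def last_nonzero_offset(path, chunk_size=1 << 20):
    """Return the offset of the last non-zero byte, or -1 if the file is all zeros."""
    last = -1
    offset = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # Drop trailing zero bytes; what remains ends at the last non-zero byte.
            stripped = chunk.rstrip(b"\x00")
            if stripped:
                last = offset + len(stripped) - 1
            offset += len(chunk)
    return last
```

Comparing the result with os.path.getsize(path) shows how much of the file is just trailing zeros.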

I am attaching my config file, as it may be relevant.
neo4j.txt (15.3 KB)

I would greatly appreciate any help or clue. Solving this problem is a matter of life and death for the project.

Regards,
Lukas

wookie · ‎06-05-2020

Hello again!

I have finally managed to solve my problem.
In case someone else struggles with the same issue, here's what I've done:

I produced a minimal example that reproduces the problem. It turned out that I had misunderstood how the import tool is meant to be used and what the consequences of removing all database files are.
What I was doing wrong: I first started the server and, WHILE the server was up, removed the contents of the Neo4j data directory through the script. I then IMMEDIATELY used the import tool to load the attributary data and the driver to load the rest, without reconnecting to the server and, I presume, without giving Neo4j time to build fresh database tables and set itself up. I'm surprised that it worked at all that way and that no error was thrown at me.

So the correct workflow for importing persistent data into a Neo4j database using both the import tool and the Neo4j Python Driver is the following:

# Calls commands: 
# > rm -r <NEO4J_HOME>/data/transactions/<DB_NAME>
# > rm -r <NEO4J_HOME>/data/databases/<DB_NAME>
# > rm -r <NEO4J_HOME>/data/data/transactions/<DB_NAME>
wipeAwayPreviousData(args) 

# Calls command:
# > <NEO4J_HOME>/bin/neo4j-admin import <params...>
loadAttributaryData(args)

# Calls command:
# > <NEO4J_HOME>/bin/neo4j start
# and then calls function to give server some time to setup
# time.sleep(10)
startServer(args)

# Establish connection with the server
# Internally it calls:
# GraphDatabase.driver(URI, auth=(login, password), encrypted=False)
# encrypted=False flag is mandatory since Neo4j 4.0.0^
graph = Graph(login, password, "bolt://localhost:7687")

graph.beginTransaction()
# Create some nodes with transaction.run(query) ...
graph.commit()

# Internally calls:
# transaction.commit()
# session.close()
# driver.close()
graph.close()

# Calls command:
# > <NEO4J_HOME>/bin/neo4j stop
# and then call function to give server some time to rest in peace...
# time.sleep(10)
stopServer(args)
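
Instead of a fixed time.sleep(10) after starting the server, the script could poll the Bolt port until it accepts connections. A minimal sketch; wait_for_port and its defaults are illustrative, and an open port only shows that the server is listening, not that the database is fully ready.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For the setup above, the call would be wait_for_port("localhost", 7687) right after the start command; if it returns False, the script can abort before creating the driver.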

Regards!
L

Data imported using Neo4j-Driver does not persist after reconnecting to Community Edition local server