Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.
04-01-2021 03:06 PM
Hi,
I'm interested by using Neo4j as a version control database for a large hierarchical tree (which is a filesystem like a source code tree, or bigger).
My pet project is neogit
, an implementation of a git client, backed by Neo4J.
Git uses a Merkle-Tree like structure to store its data
I have implemented the same model:
And imported my data (example on neo4j source code repository)
And I made a little tet, modifying a specific file and recapturing the while tree.
As you can see in the following screenshot, PublicApiDoclet.java
has been modified, and a new set of Trees has been created for the new commit, but they have relationship with the old Trees as well, since the rest of the files are unchanged.
Colors:
Branch
Grey: Commit
Tree
Blob
the commit -[:OWNS_FILESYSTEM]-(root_tree:Tree)
and the each Tree -[:HAS_CHILD_TREE | :HAS_CHILD_BLOB]->(:Tree | :Blob)
with each relationship having a name
property corresponding to the child name (file or directory)
The main issue I'm facing now is how to query the difference between 2 commits, and highlight
the new state of files:
My current algorithm to query the differences is the following:
1 Get the list of modified Tree
in the new commit
2 For each modified Tree
Tree
at this path in the old commitTree
and the new Tree
respective children for differencesIn the following query example, I'm trying to find the modification between 2 releases
of Windows 10.
1 win10-20H1-xxxx
2 win10-20h2-xxxx
Slow: Checking all the Trees originating from the new commit, but not having the old commit as ancestor.
MATCH (new_os:OS {name: "win10-20H2-2009.19042.631"})-[:OWNS_FILESYSTEM]->(new_root:Tree)
WITH new_root
MATCH (old_os {name: "win10-20H1-2004.19041.264"})-[:OWNS_FILESYSTEM]->(old_root:Tree)
WITH old_root,new_root
MATCH (new_root)-[:HAS_CHILD_TREE*]->(new_child_t:Tree)
WHERE NOT (old_root)-[*]->(new_child_t)
RETURN count(new_child_t)
This method is unbearably slow, so I reworked the query to filter using the calculated sha1sum.
Fast : sha1sums filter
MATCH (new_os:OS {name: "win10-20H2-2009.19042.631"})-[:OWNS_FILESYSTEM]->(new_root:Tree)
WITH new_root
MATCH (old_os {name: "win10-20H1-2004.19041.264"})-[:OWNS_FILESYSTEM]->(old_root:Tree)
WITH old_root,new_root
// collect old SHA1 trees
MATCH (old_root)-[:HAS_CHILD_TREE*]->(old_child_tree:Tree)
WITH collect(old_child_tree.sha1sum) as old_sha1_list
// filter new trees
MATCH new_paths = (new_root)-[:HAS_CHILD_TREE*]->(new_child_tree:Tree)
WHERE NOT new_child_tree.sha1sum IN old_sha1_list
RETURN new_child_tree.sha1sum, new_paths
The goal now is to get the filesystem path from each new_paths
collected
before, and build a new query that would retrieve the old Tree, if it exists.
This has to be done by a script outside of cypher as there is not way to build a variable-length pattern dynamically (AFAIK)
tree_path = Path("/Windows/System32/cmd.exe")
tree_match_partial = ["MATCH path = (old_root_fs:Tree {sha1sum: $old_root_sha1})"]
for path_part_index, path_part in enumerate(tree_path.parts[1:]):
# {
# 'path_part_0': 'Windows',
# 'path_part_1': 'System32',
# }
param_var_name = f'path_part_{path_part_index}'
params[param_var_name] = cypher_escape(path_part)
rel = f'-[{{name: ${param_var_name}}}]->(:Tree)'
# -[name: "Windows"]->(:Tree) ......
tree_match_partial.append(rel)
This is trivial, once we have both Trees to be compared:
MATCH (t:Tree {sha1sum: $sha1})-[r:HAS_CHILD_TREE|HAS_CHILD_BLOB]->(c)
RETURN r.name as filename, c.sha1sum as sha1sum
If we sum up the algorithm, it takes a lot of queries to achieve a full diff of this tree structure:
1 query to get the modified Trees
n
queries to get the related Trees
n
queries to get their respective children
Unfortunately this obliterates the performance of my git diff
command.
The goal would be to do the full diff in a single Cypher query.
I would like your suggestions, both on the data model and the Cypher language to improve this situtation.
I'm ccing @lju , since her talk: "How to keep track of change - versioning approaches in Neo4j" is clearly on the topic here 🙂
4.2.4
, Ubuntu 20.04, DockerThank you very much for your help !
06-30-2021 02:49 AM
Hi @Wenzel
Completely missed this, so sorry!
Will have a think about this - do you have a repo with the data you loaded into the graph?
All the sessions of the conference are now available online