Querying Linux Kernel Git history with neo4j
08-03-2022 02:56 AM
Hi everyone,
TLDR: A newbie loads a large graph, is disappointed with traversal performance, seeks advice.
I work on the KCIDB project, a part of KernelCI project, sponsored by The Linux Foundation. The KCIDB project is trying to aggregate results coming from all CI systems testing the Linux Kernel (such as Red Hat CKI, Intel 0day, Google Syzbot, Linaro Tuxsuite, KernelCI native testing, and others), unify reporting, and provide long-term analysis. So far we store testing results in Google BigQuery (prototype dashboard available), and generate basic result notifications based on various criteria.
One of our next steps is implementing known-issue triaging, where we have a database of registered known issues, look for them in incoming test results, and amend test outcomes based on what is found. We have to cope with constant incoming test failures, because it's unfeasible to test every change on every possible bit of hardware, and fix issues before merging.
To be able to intelligently pick which issues should be triaged for which revision, and which issues were found previously we have to reason about revision connectivity (Git commit graph, plus patches attached to it), and answer questions such as: "is this revision based on another revision", "is this revision between these revision", "what are the last 50 revisions before this one", "what are 10 releases (tagged commits) before this one", and so on, joining these with other data, such as triaging results, known issues, and so on.
I loaded minimum information (commit hash, parent commit hash, author name and email, and commit subject) on complete history of the Linux mainline branch into neo4j Desktop 1.4.15 on my laptop, resulting in a graph with 1088612 nodes and 1176193 relations, and created an index on the commit hash.
I then tried making a few queries we might use, and hit a performance wall. Attempting a naive approach to checking if a commit is reachable from another commit:
MATCH (s:Commit {hash:'3123109284176b1532874591f7c81f3837bbdc17'})
MATCH (e:Commit {hash:'1da177e4c3f41524e886b7f1b8a0c1fc7321cac2'})
WHERE exists((s)-[:PARENT*0..]->(e))
RETURN true
using the start and end commits in the history, completes in about 38ms, great!
But doing the same for just 34 hops doesn't complete in 20 minutes:
MATCH (s:Commit {hash:'3123109284176b1532874591f7c81f3837bbdc17'})
MATCH (e:Commit {hash:'d0d96121d03d6d9cf608d948247a9f24f5a02da9'})
WHERE exists((s)-[:PARENT*0..]->(e))
RETURN true
It gets better with APOC (completes in 356ms):
MATCH (s:Commit {hash:'3123109284176b1532874591f7c81f3837bbdc17'})
MATCH (e:Commit {hash:'d0d96121d03d6d9cf608d948247a9f24f5a02da9'})
CALL apoc.path.expandConfig(s, {
relationshipFilter: 'PARENT>',
terminatorNodes: e,
limit: 1
})
YIELD path
RETURN length(path)
But once you go longer, to a subgraph with 85 nodes (a path through it would be shorter), even with a level limit:
MATCH (s:Commit {hash:'3123109284176b1532874591f7c81f3837bbdc17'})
MATCH (e:Commit {hash:'68e77ffbfd06ae3ef8f2abf1c3b971383c866983'})
CALL apoc.path.expandConfig(s, {
relationshipFilter: 'PARENT>',
terminatorNodes: e,
maxLevel: 100,
limit: 1
})
YIELD path
RETURN length(path)
it just runs out of memory after a few minutes. Other similar queries wouldn't return in at least 10 minutes (after which I'd terminate them).
Needless to say, that the last query would take git about half a second, without an explicit level limit, starting the process from scratch, and finding *all* the paths, not just one.
Am I doing something wrong, or am I just using the wrong tool for the job?
Any advice of how (and how much) I can improve this with neo4j, or perhaps a suggestion of another graph database?
Thank you for any responses!
P.S. The sample I tried this on is just one branch of one repo. We need to store and query a whole bunch of repos, some of them with multiple branches. Those would reuse the bulk of history, but there would be considerable data churn, so we could probably end up with 3x the graph size quickly.
08-03-2022 05:28 AM
Hi @spbnick - thanks for the details in your question. Could I ask you a few questions?
Have you made any changes to the memory parameters in Neo4j desktop? The memory defaults are pretty skimpy out-of-the-box. How large is your database on disk? One possibility of what's going on here is that you're running with a too small page cache, and all of the time is spent swapping data between disk and the cache.
This: https://neo4j.com/docs/operations-manual/current/performance/memory-configuration/#memory-configurat... may be useful information.
Do you have the data available somewhere in a format that I could take a look at? I'm willing to load it into Desktop (provided it fits in my laptop) and have a look with you.
08-03-2022 05:46 AM
Hi @steggy,
Thanks for your prompt response! Sure, the data is from publicly-available Linux git history. You can find the CSV files I was loading into neo4j, along with the script you can use to generate them here: https://drive.google.com/file/d/1QpjCYdXjsu-O0DQZbyJgx2DFtukU7ZRZ/view
The script was executed on a recent mainline master like this:
git log --format='%H%n%P%n%an%n%ae%n%aI%n%s' | ./log2csv commits.csv relations.csv
And then the files were loaded into neo4j like this:
// Load commits
:auto USING PERIODIC COMMIT LOAD CSV WITH HEADERS FROM 'file:///commits.csv' AS commit
CREATE (c:Commit)
SET c = commit
// Create index for hash
CREATE INDEX FOR (c:Commit) ON (c.hash)
// Load relations
:auto USING PERIODIC COMMIT LOAD CSV WITH HEADERS FROM 'file:///relations.csv' AS relation
MATCH (child:Commit {hash:relation.hash})
MATCH (parent:Commit {hash:relation.parent_hash})
CREATE (child)-[:PARENT]->(parent)
This would leave the author timestamps (which I added later) as strings, though, so if you want to use them, you'd have to tweak your loading a bit.
Thanks for the advice! I'll take try to tweak the memory parameters later this week, I used the defaults so far.
08-03-2022 10:05 AM
Hi @spbnick ,
In your second example, are you sure you have a path between those two nodes? If I'm looking for existence, I may want to consider the uniqueness parameter. Something like:
MATCH (s:Commit {hash:'3123109284176b1532874591f7c81f3837bbdc17'})
MATCH (e:Commit {hash:'68e77ffbfd06ae3ef8f2abf1c3b971383c866983'})
CALL apoc.path.expandConfig(s, {
relationshipFilter: 'PARENT>',
uniqueness: 'NODE_GLOBAL',
terminatorNodes: [e],
limit: 1
})
YIELD path
RETURN length(path)
Otherwise your memory can easily drown on nasty loops.
08-04-2022 02:24 AM
Indeed! I made a mistake in that query - pasted an incorrect commit hash. Coincidentally, these are actually connected, but the other way around. I had problems with similar queries.
Thank you very much, 'NODE_GLOBAL' indeed makes all the difference! As does 'RELATIONSHIP_GLOBAL'.
I read about the parameter and considered trying different values, but for some reason stopped short of actually doing that. Now, queries I've tried so far complete in very reasonable time, but I will have to experiment more with these queries to see if we can get all we need. Will get back with results after a while!
Thanks!
08-05-2022 04:55 AM
Alright, I experimented a bit more with queries, and indeed using 'NODE_GLOBAL' would work quite well for checking if a commit is reachable from another commit 👍
However, we also want to query things like "all issues found since the last release", or "count the percentage of times this issue reproduced since the last release". And that involves joining each commit since the last release with the issues (and incidents for the latter).
Trying to find the fastest way to just get 90 commits between two commits (to e.g. join them to something):
MATCH (s:Commit {hash:'68e77ffbfd06ae3ef8f2abf1c3b971383c866983'})
MATCH (e:Commit {hash:'1e20904e417738066b26490de2daf7ef3ed34483'})
CALL apoc.path.expandConfig(s, {
relationshipFilter: 'PARENT>',
uniqueness:'NONE',
terminatorNodes: [e]
})
YIELD path
UNWIND nodes(path) AS node
RETURN count(DISTINCT node)
goes away for several minutes and then runs out of memory. Git, of course, finishes that in under 300ms, from the usual cold start.
The naive approach works decently with a small depth limit of 50:
MATCH p=(s:Commit {hash:'68e77ffbfd06ae3ef8f2abf1c3b971383c866983'})-[:PARENT*0..50]->(e:Commit {hash:'1e20904e417738066b26490de2daf7ef3ed34483'})
UNWIND nodes(p) AS c
RETURN count(distinct c)
But once you bump the limit to just 100, it also runs out of memory after 10 minutes or so.
So, I presume to be able to approach git speeds (or go faster, it's a daemon after all?), neo4j needs to be made aware (like git) that this is an acyclic graph (DAG), which has nodes added on top only, so it could avoid traversing hopeless paths, and build/use appropriate indices (?). However, I couldn't find any way to do this 🤔
08-05-2022 05:39 AM
Hi @spbnick - have you explored the GDS (graph data science) library (https://neo4j.com/docs/graph-data-science/current/) I tried this:
call gds.graph.project('steggy', 'Commit', 'PARENT');
MATCH (s:Commit {hash:'68e77ffbfd06ae3ef8f2abf1c3b971383c866983'})
MATCH (e:Commit {hash:'1e20904e417738066b26490de2daf7ef3ed34483'})
call gds.bfs.stream('steggy', {sourceNode: ID(s), targetNodes: [ID(e)]})
yield path as p
unwind nodes(p) as node
return count(distinct node)
I'm admittedly not a data science expert, but the GDS library has several path finding algorithms that appear to be very fast when there *is* a path. When there isn't a path, I don't know that there is any algorithm that will be able to figure that out without visiting every node that's reachable from the starting point.
08-05-2022 06:13 AM
Thank you, @steggy. I've seen the library before, and took a better look now, after your message. However, I think this is a fundamental problem. Neo4j (and seemingly many other graph databases) are trying to tackle the case of a general graph, which is much harder than a DAG, if I understand correctly. But at the same time, they lack the algorithms and optimizations (essentially specialization) useful for DAGs.
The GDS library doesn't have topological sorting (because it cannot assume DAG), which I presume is what makes git fast, so I don't expect it to be fast enough in the general case either. Plus, I'm not comfortable with the idea of having the graph in memory for routine (as opposed to research) processing.
It would've been great if neo4j supported DAGs as a special case with the associated performance boost, but I don't expect that to be easy to implement.
08-03-2022 07:05 AM
...this doesn't seem to be memory related. It looks to be a combinatorial explosion problem 🙂 In your first problematic query, if you specify the path length in your query (34) or even use a range like 0..50, the query is fast. Let me keep noodling on this one some more
08-04-2022 02:30 AM
Thank you, @steggy! Yes, the combinatorial explosion is easy to hit, of course, and memory running out as the result is expected 😁 Unfortunately, I can't really limit the path, at least not to any small number, since we can expect, for example, issues to be present in the tree for months at a time, or longer, and that runs in tens of thousands of commits.
I'll explore @bennu_neo's suggestion further, as it seems to produce very good results so far.
08-05-2022 11:49 AM
A couple things to chime in here:
- Depending on the structure of your DAG, you could potentially reduce computation time by traversing the opposite direction, child -> parent. In the extreme case of a Tree this would be obvious as there is only one option for going up the tree as opposed to multiple going down.
- While I hear your concerns about using an in-memory graph for routine processing, I wouldn’t be so quick to overlook it, especially given the amount of memory you are already battling using transactions. GDS has a family of relatively well optimized path finding algorithms. The in-memory graph is fairly robust and the size of your data seems small enough that it should be pretty quick to re-project/update the in-memory graph regularly with the right configuration.
A few questions:
In this problem is there potential for multiple paths? Or just one at most. If multiple paths do you want all of them? just the shortest ones? Or filter by another criteria?
Are you always going to be looking for paths between a single source node and single target node? Or will you want to go from a single source node to multiple target nodes?
08-06-2022 02:16 AM
Thank you, @zach_blumenfel1, for insightful questions and advice.
I'm traversing from child to parent and my relation direction is from child to parent currently. There are a lot of merges in the Linux kernel repository, so this is normally a full DAG (not just a tree), with a lot of branches.
I see. I'll reevaluate the practical memory usage of GDS, if I come back to it, and I'll think about it more. What I'm concerned with is the potential non-linear performance of the path-finding algorithms in GDS, making it difficult to predict response times, since they're not aware of, and are not optimized for DAG (which could have linear-time algorithms). But I agree, I'll have to try it before deciding.
I'm yet to build this system, and am in the research phase right now. However, I can already imagine two kinds of queries: connectivity (whether a node is connected to another node), and getting all nodes (from *all* paths) between and including any two nodes, ordered either topologically, or by date. And of course filtering and joining the latter nodes with other types of nodes. We don't really need to look for shortest or longest path, or pick the paths we take.
The commits/revisions will always be a single graph, but it would be good to have other types of nodes too, like issues and incidents, which are mostly connected in a simple manner, well-suited for an SQL database too.
I don't think we would need to go from one to multiple nodes, so far 🤔
Thank you!
08-06-2022 02:34 AM
Hi @spbnick!
I'm glad that the initial approach was kinda working for you. Of course as @zach_blumenfel1 suggested, keep in mind that deciding your starting point correctly my speed up some queries.
About your second question, I feel you are looking for a simple subgraph (having certain commits as starting points). This should work.
https://neo4j.com/labs/apoc/4.1/overview/apoc.path/apoc.path.subgraphNodes/
Sorry if I don't send an example (not in the pc right now).
Regards
08-08-2022 01:19 AM - edited 08-08-2022 01:22 AM
Thank you, @bennu_neo. Yeah, I had a glimmer of hope for a while, that I won't have to work too hard for this one, and Neo4j already did everything for me 😄
I tried both 'subgraphNodes' and 'subgraphAll', but could never get more than the last node out of either of them, when I would specify either the 'endNodes' or the 'terminatorNodes' 🤷🏻
I would post an example, but Neo4j desktop's browser is failing to connect to the database *again* for some reason, and I don't have the time to debug that, sorry.
I'll have to shelve the idea of using a graph database for now and move onto closer goals which could bring us value faster, then will get back to the problem later. So far it seems my only option is to literally use a Git server to hold the revision graph, but that, of course, doesn't let us store anything else, like the issues and incidents. I might try something TinkerPop-based and Gremlin. Perhaps I can squeeze some performance out with its imperative approach.
08-08-2022 12:50 PM
Hi @spbnick,
I'm sorry to hear about that. I remember that endNodes interaction. If you can share the query you were trying and some examples of your tries on the most challenging one I would love to help you with that. (after this week cuz I'm out on holidays 🙂 )
I think your use case is really interesting and should get a lot of value from a graphy solution.
Best regards