Dear Community,
we are Connor and Daniel, software engineers at Anaconda, and we are currently evaluating Neo4j as a way to represent our package metadata and answer questions like:
- What packages are available for Python X on platform Y but not on platform Z?
- Which missing dependencies do I need to build, and in which order, if I want to build package X (currently available on platform Y) on platform Z?
- What are all the downstream packages of package X (e.g. to run their tests when package X is updated)?
Background: Package managers like conda, pip, apt, and yum resolve dependency trees to install a package, often via SAT solvers. Loading dependency trees into a graph database like Neo4j feels like the most natural way to model the relations and to query, visualize, and compare graphs and sub-graphs, and several articles on the topic inspired us.
But can this be applied to conda packages, where multiple versions of each package are available at the same time and dependencies carry specific version-range constraints?
Each package contains an index.json file that describes its dependencies:
...
"depends": [
    "bleach",
    "bokeh >=2.4.0,<2.5.0",
    "markdown",
    "param >=1.12.0",
    "pyct >=0.4.4",
    "python >=3.8,<3.9.0a0",
    "pyviz_comms >=0.7.4",
    "requests",
    "tqdm >=4.48.0"
],
...
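Before ingestion, each "depends" entry would have to be split into a package name and its version constraints. A minimal Python sketch of that step follows; the helper name `parse_dependency` is hypothetical, and it only handles the simple "name constraint,constraint" forms shown in the excerpt above, not the full conda match specification (build strings, wildcards, `|` alternatives).

```python
import re

# Matches an optional comparison operator followed by a version string.
_CONSTRAINT = re.compile(r"^(>=|<=|==|!=|>|<)?\s*(.+)$")


def parse_dependency(spec):
    """Split one "depends" entry into (name, [(operator, version), ...]).

    Hypothetical helper: only covers entries such as "bleach" or
    "bokeh >=2.4.0,<2.5.0". An entry without constraints yields an
    empty list; a constraint without operator keeps "" as operator.
    """
    parts = spec.split(None, 1)  # name, then the rest (if any)
    name = parts[0]
    constraints = []
    if len(parts) == 2:
        for item in parts[1].split(","):
            op, version = _CONSTRAINT.match(item.strip()).groups()
            constraints.append((op or "", version))
    return name, constraints


for entry in ["bleach", "bokeh >=2.4.0,<2.5.0", "python >=3.8,<3.9.0a0"]:
    print(parse_dependency(entry))
```

The resulting (operator, version) pairs could then become properties of the dependency relation described in the model below.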
Model Idea
Specifics:
- The blue nodes are virtual packages: they represent a library for which packages exist, but are not installable packages themselves (e.g. python, numpy)
- The orange nodes are real packages that can be installed (e.g. `conda install numpy`)
- Each package can have 0-n dependencies of type run:
- Dependencies are not drawn as direct relations between real packages, because a dependency is a constraint on a virtual package, and multiple package versions can fulfill it
- Instead, each dependency is drawn as a relation to a version-less blue virtual package node, and the relation's properties state which versions would satisfy the dependency
- Finding all dependencies of a package, including the indirect (transitive) ones, requires matching from the real package through the blue virtual package to the next real package. Idea:
MATCH (x:real {name: "numpy", version: "1.19.2"})-[r:DEPENDS_ON {type: "run"}]->
(y:virtual)<-[s:PROVIDES]-
(z:real)
WHERE s.version IN r.constraints
RETURN z.name, max(z.version), max(z.build)
- Open model questions:
- How to apply the above MATCH pattern recursively to find all transitive dependencies of each dependency?
- How to apply the "s.version IN r.constraints" condition? We could normalize the versions to comparable integers before data ingestion and split the constraints into multiple versions and their comparison operators, which would result in 3 WHERE conditions.
- How to handle duplicates within the transitive dependency relations for the same package, but with slightly different constraints?
- How to prefer the max version of packages (max(z.version))?
- How to do all of this efficiently?
- The ultimate question: Can the problem of finding all the dependencies of a package, including their order, be solved by Neo4j and Cypher queries (maybe together with a path search algorithm like A*)?
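To make the normalization idea from the open questions concrete, here is a rough Python sketch of turning versions into comparable integers, checking them against split constraints, and preferring the highest match. All names (`version_to_int`, `satisfies`, `best_match`) and the `WIDTH`/`BASE` limits are assumptions for illustration. It only covers purely numeric, dot-separated versions; pre-release suffixes such as the `3.9.0a0` in the excerpt above would need their own rule.

```python
import operator

WIDTH = 4      # assumption: a version has at most four components
BASE = 10_000  # assumption: every component is smaller than 10000

OPS = {
    ">=": operator.ge, "<=": operator.le, ">": operator.gt,
    "<": operator.lt, "==": operator.eq, "!=": operator.ne,
}


def version_to_int(version):
    """Encode e.g. "1.19.2" as one integer that sorts like the version."""
    parts = [int(p) for p in version.split(".")]
    if len(parts) > WIDTH or any(p >= BASE for p in parts):
        raise ValueError(f"version out of supported range: {version}")
    parts += [0] * (WIDTH - len(parts))  # pad so "2.4" equals "2.4.0"
    value = 0
    for p in parts:
        value = value * BASE + p
    return value


def satisfies(version, constraints):
    """True if the version fulfills every (operator, bound) pair."""
    key = version_to_int(version)
    return all(OPS[op](key, version_to_int(bound)) for op, bound in constraints)


def best_match(candidates, constraints):
    """Highest candidate version that satisfies all constraints, or None."""
    matching = [v for v in candidates if satisfies(v, constraints)]
    return max(matching, key=version_to_int, default=None)


bokeh_constraints = [(">=", "2.4.0"), ("<", "2.5.0")]
print(best_match(["2.3.3", "2.4.1", "2.4.3", "2.5.0"], bokeh_constraints))
```

With the limits assumed here, the encoded values fit into a 64-bit integer, so they could be stored as node or relation properties and compared with plain numeric operators in a WHERE clause.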
Any feedback, hints, model ideas appreciated.
Thanks