‎09-30-2021 07:11 AM
Hey neo4j community,
I'm using the Neo4j server version 4.3.2 (community).
I'm trying to find all dependencies for a given software package. In this special case I'm working with the Node.js / JavaScript ecosystem and scraped the whole npm registry. My data model is simple, I've got packages and a package can have multiple versions. A version can have multiple dependencies.
In my database I have 113.339.030 dependency relationships and 19.753.269 versions.
My whole code works fine until I found a package that has so many dependencies (direct and transitive) that all my queries break down. It's called react-scripts
.
Here you can see the package information.
https://registry.npmjs.org/react-scripts
One visualizer never finishes
https://npm.anvaka.com/#/view/2d/react-scripts
and another one creates a dependency graph so big it's hard to analyze.
https://npmgraph.js.org/?q=react-scripts
My nodes have the properties
-
version_id
: integer -
name
: string -
version
: string
I'm starting with what I thought would be a simple query but it's already failing. Start with version that has version_id
16674850
and give me all its dependencies.
MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..11]->(b:Version)
return DISTINCT b;
I have an index on version_id
.
CREATE INDEX FOR (version:Version) ON (version.version_id)
That works until I set the depth / variable length to 12
or greater. Then the query runs forever. Here is the query plan.
Neo4j runs inside Docker. I've increased some memory settings.
- NEO4J_dbms_memory_heap_initial__size=2G
- NEO4J_dbms_memory_heap_max__size=2G
- NEO4J_dbms_memory_pagecache_size=1G
I uploaded a sample data set. It's the same data I'm currently using. Here are the links
- https://s3.amazonaws.com/blog.spolytics.com/versions.csv (737.1 MB)
- https://s3.amazonaws.com/blog.spolytics.com/dependencies.csv (1.7 GB)
Here is the script to import the data.
neo4j-admin import \
--database=deps \
--skip-bad-relationships \
--id-type=INTEGER \
--nodes=Version=import/versions.csv \
--relationships=DEPENDS_ON=import/dependencies.csv
That might help to do some experiments on your side and to reproduce my problem.
Any ideas? I'm really lost right now and don't want to give up on my "software dependency analysis graph". I spent the last 6 weeks on this problem. Thank you very much!
Solved! Go to Solution.
- Labels:
-
Cypher
‎09-30-2021 11:55 AM
The plan changes to one that is far less efficient. It may be a little tough to try to force the planner not to do that.
As an alternate approach, if you have APOC Procedures installed, please try using apoc.path.subgraphNodes()
, as this should be a more efficient approach to matching to distinct nodes when there is a web of paths that can lead to the same nodes.
MATCH (a:Version {version_id: 16674850})
CALL apoc.path.subgraphNodes(a, {relationshipFilter:'DEPENDS_ON>', labelFilter:'>Version', maxLevel:11}) YIELD node as b
RETURN b
‎09-30-2021 11:55 AM
The plan changes to one that is far less efficient. It may be a little tough to try to force the planner not to do that.
As an alternate approach, if you have APOC Procedures installed, please try using apoc.path.subgraphNodes()
, as this should be a more efficient approach to matching to distinct nodes when there is a web of paths that can lead to the same nodes.
MATCH (a:Version {version_id: 16674850})
CALL apoc.path.subgraphNodes(a, {relationshipFilter:'DEPENDS_ON>', labelFilter:'>Version', maxLevel:11}) YIELD node as b
RETURN b
‎09-30-2021 01:06 PM
Holy moly! Thank you very much. It works! I'm so glad that I finally found a solution.
Started streaming 1725 records after 3 ms and completed after 28 ms, displaying first 1000 rows.
Here is the query plan.
Unfortunately I have to learn more about apoc
now
Maybe one additional question. How can I include the path to each node? I'd like to show my users their dependencies and also why this specific dependency is included in the list. Since there are usually multiple paths to a dependency it would be OK to simply return the shortest path.
‎10-01-2021 11:48 AM
You can use apoc.path.spanningTree()
which yields path
instead of node
, that will have a single path per end node.
‎10-01-2021 01:46 PM
Awesome! Thank you very much. If you want you can also grab the bounty at Stack Overflow cypher - Neo4j - Variable length greater 11 runs forever and query never returns - Stack Overflow.
I would be happy to give to you.
‎10-01-2021 05:58 PM
I'll take you up on that! My username there is InverseFalcon, I've posted my answer.