Neo4j

mirco_zeiss · ‎09-30-2021

Hey neo4j community,

I'm using the Neo4j server version 4.3.2 (community).

I'm trying to find all dependencies for a given software package. In this special case I'm working with the Node.js / JavaScript ecosystem and scraped the whole npm registry. My data model is simple, I've got packages and a package can have multiple versions. A version can have multiple dependencies.

In my database I have 113.339.030 dependency relationships and 19.753.269 versions.

My whole code works fine until I found a package that has so many dependencies (direct and transitive) that all my queries break down. It's called react-scripts.

Here you can see the package information.

https://registry.npmjs.org/react-scripts

One visualizer never finishes

https://npm.anvaka.com/#/view/2d/react-scripts

and another one creates a dependency graph so big it's hard to analyze.

https://npmgraph.js.org/?q=react-scripts

My nodes have the properties

version_id : integer
name : string
version : string

I'm starting with what I thought would be a simple query but it's already failing. Start with version that has version_id 16674850 and give me all its dependencies.

MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..11]->(b:Version)
return DISTINCT b;

I have an index on version_id .

CREATE INDEX FOR (version:Version) ON (version.version_id)

That works until I set the depth / variable length to 12 or greater. Then the query runs forever. Here is the query plan.

Neo4j runs inside Docker. I've increased some memory settings.

- NEO4J_dbms_memory_heap_initial__size=2G
- NEO4J_dbms_memory_heap_max__size=2G
- NEO4J_dbms_memory_pagecache_size=1G

I uploaded a sample data set. It's the same data I'm currently using. Here are the links

https://s3.amazonaws.com/blog.spolytics.com/versions.csv (737.1 MB)
https://s3.amazonaws.com/blog.spolytics.com/dependencies.csv (1.7 GB)

Here is the script to import the data.

neo4j-admin import \
    --database=deps \
    --skip-bad-relationships \
    --id-type=INTEGER \
    --nodes=Version=import/versions.csv \
    --relationships=DEPENDS_ON=import/dependencies.csv

That might help to do some experiments on your side and to reproduce my problem.

Any ideas? I'm really lost right now and don't want to give up on my "software dependency analysis graph". I spent the last 6 weeks on this problem. Thank you very much!

andrew_bowman · ‎09-30-2021

The plan changes to one that is far less efficient. It may be a little tough to try to force the planner not to do that.

As an alternate approach, if you have APOC Procedures installed, please try using apoc.path.subgraphNodes(), as this should be a more efficient approach to matching to distinct nodes when there is a web of paths that can lead to the same nodes.

MATCH (a:Version {version_id: 16674850})
CALL apoc.path.subgraphNodes(a, {relationshipFilter:'DEPENDS_ON>', labelFilter:'>Version', maxLevel:11}) YIELD node as b
RETURN b

View solution in original post

andrew_bowman · ‎09-30-2021

The plan changes to one that is far less efficient. It may be a little tough to try to force the planner not to do that.

As an alternate approach, if you have APOC Procedures installed, please try using apoc.path.subgraphNodes(), as this should be a more efficient approach to matching to distinct nodes when there is a web of paths that can lead to the same nodes.

MATCH (a:Version {version_id: 16674850})
CALL apoc.path.subgraphNodes(a, {relationshipFilter:'DEPENDS_ON>', labelFilter:'>Version', maxLevel:11}) YIELD node as b
RETURN b

mirco_zeiss · ‎09-30-2021

Holy moly! Thank you very much. It works! I'm so glad that I finally found a solution.

Started streaming 1725 records after 3 ms and completed after 28 ms, displaying first 1000 rows.

Here is the query plan.

Unfortunately I have to learn more about apoc now

Maybe one additional question. How can I include the path to each node? I'd like to show my users their dependencies and also why this specific dependency is included in the list. Since there are usually multiple paths to a dependency it would be OK to simply return the shortest path.

andrew_bowman · ‎10-01-2021

You can use apoc.path.spanningTree() which yields path instead of node, that will have a single path per end node.

mirco_zeiss · ‎10-01-2021

Awesome! Thank you very much. If you want you can also grab the bounty at Stack Overflow cypher - Neo4j - Variable length greater 11 runs forever and query never returns - Stack Overflow.

I would be happy to give to you.

andrew_bowman · ‎10-01-2021

I'll take you up on that! My username there is InverseFalcon, I've posted my answer.

Neo4j

Variable length greater 11 cause my query to run forever