Neo4j

JaHo · 03-02-2021

Can anyone tell me roughly how long I should expect call apoc.meta.graph() to run on a graph with 500 million nodes and 1.5 billion relationships?

I ran into the bug with call db.schema.visualize(). apoc.meta.graph() gave the correct answer before, but it is taking a while this time.

Thanks!

markhneedham · 03-04-2021

It's probably gonna take ages to return. db.schema.visualize is using pre-computed data, whereas apoc.meta.graph is computing it all from scratch. Maybe you can take a look at apoc.meta.graphSample instead?

JaHo · 03-04-2021

I tried apoc.meta.graphSample but, like db.schema.visualize (and as stated in the documentation), it returned extra relationships.
I also played around with apoc.meta.subGraph a bit, which yielded a satisfactory result in the end. I'm still a bit confused, though, about where the computational cost is coming from: for many subsets of nodes and relationships the result was instant, while including some labels with fairly small sets of nodes/relationships resulted in long runtimes, and I stopped those runs after a while.

markhneedham · 03-04-2021

I don't know this code off by heart, but this is the function that it's calling:

https://github.com/neo4j-contrib/neo4j-apoc-procedures/blob/4.2/core/src/main/java/apoc/meta/Meta.java#L993

@Procedure
@Description("apoc.meta.graphSample() - examines the database statistics to build the meta graph, very fast, might report extra relationships")
public Stream<GraphResult> graphSample(@Name(value = "config",defaultValue = "{}") Map<String,Object> config) {
    MetaConfig metaConfig = new MetaConfig(config);
    return metaGraph(null, null, false, metaConfig);
}

@Procedure
@Description("apoc.meta.subGraph({labels:[labels],rels:[rel-types], excludes:[labels,rel-types]}) - examines a sample sub graph to create the meta-graph")
public Stream<GraphResult> subGraph(@Name("config") Map<String,Object> config ) {

    MetaConfig metaConfig = new MetaConfig(config);

    return filterResultStream(metaConfig.getExcludes(), metaGraph(metaConfig.getIncludesLabels(), metaConfig.getIncludesRels(),true, metaConfig));
}

private Stream<GraphResult> filterResultStream(Set<String> excludes, Stream<GraphResult> graphResultStream) {
    if (excludes == null || excludes.isEmpty()) return graphResultStream;
    return graphResultStream.map(gr -> {
        Iterator<Node> it = gr.nodes.iterator();

That then calls the metaGraph function:

https://github.com/neo4j-contrib/neo4j-apoc-procedures/blob/f3a42f8b6e344a0ca1a2b7d497be7c02853b4ca9/core/src/main/java...

        return RelationshipType.withName(type);
    }
}
@Procedure
@Description("apoc.meta.graph - examines the full graph to create the meta-graph")
public Stream<GraphResult> graph(@Name(value = "config",defaultValue = "{}") Map<String,Object> config) {
    MetaConfig metaConfig = new MetaConfig(config);
    return metaGraph(null, null, true, metaConfig);
}

private Stream<GraphResult> metaGraph(Collection<String> labelNames, Collection<String> relTypeNames, boolean removeMissing, MetaConfig metaConfig) {
    Read read = kernelTx.dataRead();
    TokenRead tokenRead = kernelTx.tokenRead();

    Map<String, Integer> labels = labelsInUse(tokenRead, labelNames);
    Map<String, Integer> relTypes = relTypesInUse(tokenRead, relTypeNames);

    Map<String, Node> vNodes = new TreeMap<>();
    Map<Pattern, Relationship> vRels = new HashMap<>(relTypes.size() * 2);

    labels.forEach((labelName, id) -> {

And actually it doesn't look like it computes everything from scratch like I thought it did. It's kinda hard to say why it would be working better for some labels than others.

JaHo · 03-04-2021

Thanks for the pointer, I don't really know any Java, though.
It actually only started being slow after I recently added some new labels that roughly doubled the number of nodes. Before that, even with an already pretty large number of nodes, it worked instantly and returned the correct result.

markhneedham · 03-04-2021

And on that graph you said apoc.meta.graphSample returns quickly but has extra relationships?

The only difference between apoc.meta.graphSample and apoc.meta.graph is a post processing step where missing relationships are removed (or not) so that's where the time must be spent.
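
To make the "extra relationships" part concrete, here is a toy model in Java. It is not APOC's code: the class, method, and label names are made up, and it assumes the statistics only record how many relationships of a type start at a label and how many end at a label, not how many connect a specific pair of labels. Under that assumption, any pair with both counts above zero can only be reported as a candidate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy model, not APOC's code. All names here are hypothetical.
class CandidatePatterns {

    // outCounts: relationships of a type that START at a label, keyed "Label|TYPE".
    // inCounts:  relationships of a type that END at a label, keyed "Label|TYPE".
    // Assumption: there is no joint (from)-[type]->(to) count, so a pair
    // with both counts above zero is only a candidate pattern.
    static List<String> candidates(List<String> labels, List<String> types,
                                   Map<String, Long> outCounts, Map<String, Long> inCounts) {
        List<String> patterns = new ArrayList<>();
        for (String type : types) {
            for (String from : labels) {
                if (outCounts.getOrDefault(from + "|" + type, 0L) == 0) continue;
                for (String to : labels) {
                    if (inCounts.getOrDefault(to + "|" + type, 0L) == 0) continue;
                    patterns.add("(" + from + ")-[" + type + "]->(" + to + ")");
                }
            }
        }
        return patterns;
    }

    public static void main(String[] args) {
        // Made-up data: only (Person)-[RATED]->(Movie) and (Critic)-[RATED]->(Book) exist.
        List<String> labels = List.of("Person", "Critic", "Movie", "Book");
        Map<String, Long> out = Map.of("Person|RATED", 1L, "Critic|RATED", 1L);
        Map<String, Long> in = Map.of("Movie|RATED", 1L, "Book|RATED", 1L);
        // Prints four patterns; two of them do not exist in the data.
        candidates(labels, List.of("RATED"), out, in).forEach(System.out::println);
    }
}
```

In this sketch, (Person)-[RATED]->(Book) and (Critic)-[RATED]->(Movie) are reported even though no such relationships exist, which is the kind of result a post-processing step would have to remove.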

Reading the code of that function, I can see that it scans all the nodes with each label and then checks all of the relationships for 1 in 1000 of those nodes, which would be time-consuming. You can configure the sampling rate via the sample key, e.g. sample: 10000 would make it check 1 in every 10,000 nodes instead of 1 in every 1,000.
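
A rough way to picture that check is the Java sketch below. It is a toy model under stated assumptions, not the APOC implementation: the class, method, and data layout are hypothetical, and it assumes the check visits every sample-th node with the label and stops as soon as it finds one matching relationship.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy model, not APOC's code. All names here are hypothetical.
class SampledCheck {

    static final class Result {
        final boolean found;
        final long inspected; // relationships looked at before returning

        Result(boolean found, long inspected) {
            this.found = found;
            this.inspected = inspected;
        }
    }

    // neighbourLabels.get(i) lists the label at the far end of each relationship of node i.
    // Visits node 0, node sample, node 2*sample, ... and stops at the first match.
    // sample must be at least 1.
    static Result exists(List<List<String>> neighbourLabels, String targetLabel, int sample) {
        long inspected = 0;
        for (int i = 0; i < neighbourLabels.size(); i += sample) {
            for (String label : neighbourLabels.get(i)) {
                inspected++;
                if (label.equals(targetLabel)) return new Result(true, inspected);
            }
        }
        return new Result(false, inspected); // no early exit: every sampled node was walked
    }

    public static void main(String[] args) {
        // 10,000 nodes, each with three relationships that all end at a Movie node.
        List<List<String>> nodes =
                Collections.nCopies(10_000, Arrays.asList("Movie", "Movie", "Movie"));
        System.out.println(exists(nodes, "Movie", 1_000).inspected);  // existing pattern: 1
        System.out.println(exists(nodes, "Book", 1_000).inspected);   // missing pattern: 30
        System.out.println(exists(nodes, "Book", 10_000).inspected);  // coarser sampling: 3
    }
}
```

In this model, a pattern that really exists is confirmed almost immediately, while a pattern that does not exist costs roughly (nodes with the label / sample) × (relationships per sampled node), because nothing ends the scan early. If the real procedure behaves similarly, a label with few but very densely connected nodes could still be expensive to check, though that is not confirmed for the graph discussed here.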

JaHo · 03-10-2021

Sorry for the delay.
Yes, for the full graph, apoc.meta.graphSample runs quickly but has extra relationships. I tried running it with different sample sizes, but I must have been doing it wrong, as there was no difference in either runtime or result. Is call apoc.met.graphSample({sample: 1000}) the correct syntax?

markhneedham · 03-10-2021

Yup, just gotta fix the typo on here:

call apoc.meta.graphSample({sample: 1000})

JaHo · 03-10-2021

Ah my bad. Still, even if I call it with sample: 1 (which I guess would mean it checks every node), it returns instantly and contains additional relationships.

markhneedham · 03-10-2021

Can you try:

call apoc.meta.graph({sample: 1000})

JaHo · 03-10-2021

That seems to run slowly irrespective of what I set sample to. I haven't let it run longer than a minute or so, though.

markhneedham · 03-10-2021

I'm playing around with it on a dummy graph with 40m nodes/relationships and I can see different speeds of response when specifying sample.

JaHo · 03-11-2021

That's strange, not exactly sure what's going on. Anyways, it's not a pressing issue for me at the moment so I don't want to steal too much of your time. If I can help by providing more info I'd be happy to. Thanks again for your help!

Call apoc.meta.graph() expected runtime