Neo4j

tie219 · ‎09-26-2022

Hi all,

I have a graph which traces material relationships in a manufacturing process from raw material to finished good with semi-finished good steps in between. On each edge of the graph, there are unit factors with values that range from .0001 to 1000's. I am calculating an initial unit factor (Raw Material: Finished Good) by using the reduce function to multiply the values across the edges.

A raw material can have a relationship to thousands of finished goods. This calculation is painfully slow as you can imagine. I already took some steps to 'optimize' the query by removing edges with less than a .000001 value.

My thought is, I can further speed up the query but ignoring paths where the sum of the reduce function becomes less than .000001. You can imagine if a path has 3 edges in succession with .01, then you would arrive at an initial unit factor of .000001 which I wouldn't want in my end result. If I can flag these prior to Neo4j traversing the paths, then I 'think' I can save significant amounts of solve time.

I am a neo4j user of about 6 months and have a lot of to learn. Your help and expertise would be very appreciated,

sfg represents a list of materials I am tracing to an end finished good. Here is my code as it stands:

CALL {
WITH sfg
MATCH p = (input_sku:SKU{ID:sfg})-[r:GOES_INTO_B_UF|TRANSFERS_TO_B_UF*1..10]->(output_sku:SKU)
MATCH m = (mat:Material)-[:BASE_MATERIAL_OF]->(output_sku)
MATCH (output_sku)-[:MADE_AT]->(pl:Plant)
WHERE none(rel in relationships(p) where startNode(rel) = endNode(rel)) AND mat.MATERIAL_TYPE = 'FERT' AND output_sku.PLANT_ID = pl.PLANT_ID
RETURN input_sku.ID as COMPONENT_NAME,input_sku.UOM as COMPONENT_UOM,reduce(res = 1.00, r IN relationships(p) | res*tofloat(r.UF)) as SUM_UF,output_sku.ID AS OUTPUT_NAME,output_sku.UOM as OUTPUT_UOM
}
WITH COMPONENT_NAME,COMPONENT_UOM,SUM(SUM_UF) as UNIT_FC,OUTPUT_NAME,OUTPUT_UOM
WHERE UNIT_FC > .0000001
RETURN DISTINCT  COMPONENT_NAME,COMPONENT_UOM,UNIT_FC,OUTPUT_NAME,OUTPUT_UOM

glilienfield · ‎09-26-2022

The first observation I have is with the use of the three sequential MATCH clauses. The result of this is the Cartesian product of the results from each match statement. This is not relevant if each of the match results in one row. Not knowing your data, I would look at implementing it differently. As I see it, the second and third MATCH statements are being used as a constraint, since you don't reference any of their variable in the later part of the query. If my interpretation is true, you can convert those to a predicate using the EXISTS clause. This eliminates the Cartesian product issue and would improve performance if the those two matches result in multiple rows. Also, the Cartesian product can create unintentional duplicate values in your result that requires using DISTINCT to eliminate them. The following query is a refactor of yours with this change.

CALL {
WITH sfg
MATCH p = (input_sku:SKU{ID:sfg})-[r:GOES_INTO_B_UF|TRANSFERS_TO_B_UF*1..10]->(output_sku:SKU)
WHERE none(rel in relationships(p) where startNode(rel) = endNode(rel)) 
AND EXISTS {
    MATCH (mat:Material)-[:BASE_MATERIAL_OF]->(output_sku)
    WHERE mat.MATERIAL_TYPE = 'FERT' 
}
AND EXISTS {
    MATCH (output_sku)-[:MADE_AT]->(pl:Plant)
    WHERE output_sku.PLANT_ID = pl.PLANT_ID
}
RETURN input_sku.ID as COMPONENT_NAME, input_sku.UOM as COMPONENT_UOM, reduce(res = 1.00, r IN relationships(p) | res*tofloat(r.UF)) as SUM_UF, output_sku.ID AS OUTPUT_NAME, output_sku.UOM as OUTPUT_UOM
}
WITH COMPONENT_NAME, COMPONENT_UOM, SUM(SUM_UF) as UNIT_FC,O UTPUT_NAME, OUTPUT_UOM
WHERE UNIT_FC > .0000001
RETURN DISTINCT COMPONENT_NAME, COMPONENT_UOM,UNIT_FC, OUTPUT_NAME, OUTPUT_UOM

My second observation is about the 'none' predicate. This will always be true unless you have self-referencing relationships, i.e. a relationship that relates a node to itself. Do you have any of these? If not, you should be able to safely remove the predicate in line 4.

Next, the CALL subquery is probably unnecessary in this cause. You should be able to remove it and change the RETURN clause in line 13 to a WITH and get the same results. Even so, it does not harm.

Next, the predicate 'output_sku.PLANT_ID = pl.PLANT_ID' looks like an inner join condition used in relational databases. I would figure that the existence of the :MADE_AT relationship is satisfactory.

Finally, you may be able to get rid of the DISTINCT after the refactoring out the sequential MATCH statements. That may been the cause for you to use it.

Neo4j

Optimize Number of Possible Paths Based on Value Derived from Reduce()