Neo4j

redha_benhisse1 · ‎09-23-2022

Hello everyone,
Please, can someone help me?
I have set up a large date graph representing the ssb as shown in the diagram below. however, queries performed only on nodes even those with aggregation functions give very fast answers. however, as soon as I introduce relations with nodes in my queries, the results take a long time.

example

match (l:lineorder) return sum (l.revenue) gives a very fast response (0.89s).

however as soon as a relationship is in the query, the response time is very long. example

match (l:lineorder)-[r:order-date]->(d:date {year :1993})
where l.quantity > 15
return sum (l.revenue)

I put indexes on most of the attributes but not on the relationship since they don't contain any attribute.
I have tried everything even with explain and profile command but no results.
If someone could help me because it's blocking for me
Thank you in advance.

michael_hunger · ‎09-27-2022

Hi,

there are a few issues at work here.
First of all this is not really a graph query but a massive aggregation query.
Secondly you need to check your page-cache configuration as you have a lot of page-cache misses, which means that for large parts of your query you measure disk speed.
Third it is a bit of a modeling problem, one question is - why do you have 23M date nodes? Shouldn't it only be 365*years-covered?
I don't see the concept of an "Order" in your model, most likely that might be something that is currently represented by your date node.

Anyhow, here are some suggestions

* increase your page-cache, so that you don't get the page-cache misses anymore
* the neo4j property store is not the fastest to read from, it's faster to read from indexes
* so if you create a compound index on revenue and quantity that might help
* best would be to find the line-orders by quantity and then check against the 365 date nodes

match (d:Date {year :1993}) 
match (l:LineItem)-[r:ON_DATE]->(d)
using join on l
// using index on l:lineorder(quantity,revenue)
where l.quantity > 15 and l.revenue is not null
return sum (l.revenue);

You can also leave off the index hint, they are just there the illustrate the index usage.

I put it all into a self-contained example here.

In an ideal setup for those 4.7M rows it goes down to 500ms response.

https://gist.github.com/jexp/255dad7bc6328060e20fdb937b950056

View solution in original post

steggy · ‎09-23-2022

Hi @redha_benhisse1 could you share what the profile of your query looks like? Thanks!

redha_benhisse1 · ‎09-24-2022

hi steggy

you have attached 2 requests with relations and a simple request. note that there are 2556 nodes of type DATE and not 23,000,000 (an error on my part in the diagram). there is only lineorder which contains 23,000,000 nodes

profile match (l:lineorder)-[r:order_date]->(d:date {D_YEAR: 1993})
where l.LO_QUANTITY < 15 return sum(l.LO_REVENUE);

profile match (l:lineorder)-[r:order_date]->(d:date {D_YEAR: 1993}) return count(r);

profile match (l:lineorder) return sum(l.LO_REVENUE);

thanks in advance

glilienfield · ‎09-24-2022

It seems the match by date is faster and produces less line order items to filter through when expanded, then when the first query which starts with finding the line order nodes as the query's anchor node. As such, I may try refactoring the query as follows to see if it faster:

match (d:date {D_YEAR: 1993})
match (d)<-[r:order_date]-(l:lineorder)
where l.LO_QUANTITY < 15 
return sum(l.LO_REVENUE)

Also, are there any other nodes other than 'lineorder' nodes that can have an 'order_date' relationship to a 'date' node? If not, then you can remove the 'lineorder' label on line two, as it will require a filter operation after expanding 'd' on line two. You can see this in the profile for the second query. Worth a shot to see if it has a positive impact.

redha_benhisse1 · ‎09-25-2022

hi glilienfield

he problem still persists

When I launch the request like picture

profile match (d:date{D_YEAR:1993})
match (d)<-[r:order_date]-(l)
where l.LO_QUANTITY>15
return sum(l.LO_REVENUE);

the index relative to D_YEAR is used but not the one of LO_QUANTITY and LO_REVENUE that I created myself

but when I enter directly on the node, it uses the index relative to LO_QUANTITY and LO_REVENUE

as soon as we use the relations, the request becomes very heavy

please, any another idea

redha_benhisse1 · ‎09-25-2022

Can I use the same coposite index for two nodes or one node and one relationship ?

glilienfield · ‎09-25-2022

This is not my expertise, but from what I read, it looks like the query planner picks one index to use as the starting point of the query. The rest of the query is performed by an 'expand' (through relationships) and applying 'filters' to evaluate predicates.

You could try adding two query hints to force the query planner to use both indexes. It will then perform a join on the two results to get the final result. You can see if this is any better. I think it would look like this:

match (d:date {D_YEAR: 1993})<-[r:order_date]-(l:lineorder)
USING INDEX d:date(D_YEAR)
USING INDEX l:lineorder(LO_QUANTITY)
where l.LO_QUANTITY < 15 
return sum(l.LO_REVENUE)

Adding just one of the hints at a time specifies where to start in the query.

https://neo4j.com/docs/cypher-manual/current/query-tuning/using/

Adding just one hint as follows, should give you the first query plan you got, which was the query plans preferred choice.

match (d:date {D_YEAR: 1993})<-[r:order_date]-(l:lineorder)
USING INDEX l:lineorder(LO_QUANTITY)
where l.LO_QUANTITY < 15 
return sum(l.LO_REVENUE)

Adding the hint as follows, should give you a query plan similar to the modified query I proposed.

match (d:date {D_YEAR: 1993})<-[r:order_date]-(l:lineorder)
USING INDEX d:date(D_YEAR)
where l.LO_QUANTITY < 15 
return sum(l.LO_REVENUE)

redha_benhisse1 · ‎09-27-2022

hi glilienfield steggy

I have tried all the proposed solutions but without success. I don't understand anything, the joins take too much time. sometimes 100 times that of oracle for the same configuration; however we have chosen a graph oriented database for the absence of joins. I'm really stuck here.

glilienfield · ‎09-25-2022

A composite node is for multiple properties for the same label. In your case, you have a query over two properties from different labels.

michael_hunger · ‎09-27-2022

Hi,

there are a few issues at work here.
First of all this is not really a graph query but a massive aggregation query.
Secondly you need to check your page-cache configuration as you have a lot of page-cache misses, which means that for large parts of your query you measure disk speed.
Third it is a bit of a modeling problem, one question is - why do you have 23M date nodes? Shouldn't it only be 365*years-covered?
I don't see the concept of an "Order" in your model, most likely that might be something that is currently represented by your date node.

Anyhow, here are some suggestions

* increase your page-cache, so that you don't get the page-cache misses anymore
* the neo4j property store is not the fastest to read from, it's faster to read from indexes
* so if you create a compound index on revenue and quantity that might help
* best would be to find the line-orders by quantity and then check against the 365 date nodes

match (d:Date {year :1993}) 
match (l:LineItem)-[r:ON_DATE]->(d)
using join on l
// using index on l:lineorder(quantity,revenue)
where l.quantity > 15 and l.revenue is not null
return sum (l.revenue);

You can also leave off the index hint, they are just there the illustrate the index usage.

I put it all into a self-contained example here.

In an ideal setup for those 4.7M rows it goes down to 500ms response.

https://gist.github.com/jexp/255dad7bc6328060e20fdb937b950056

glilienfield · ‎09-27-2022

Would you consider this refactored data model?

A similar query may look like this. Does this model and query provide the same information? I was not sure about the quantity > 15 criteria on a line-item level.

match (o:Order {Year :1993}) 
match (o)-[r:HAS_ITEM]->()
where r.Quantity > 15 and r.Revenue is not null
return sum (r.Revenue);

redha_benhisse1 · ‎09-28-2022

As Michael has already said, it depends on the issue you want to address. The model you have proposed is a production model for order tracking and it answers the request of course. The model that I have set up represents a data warehouse and more precisely a star schema and it provides more information, particularly in terms of analysis for the future.

redha_benhisse1 · ‎09-28-2022

Thank you very much Michael

It was clear and precise and boosted the response time of my requests. I did make a typo on the number of date nodes, there are 2556 and not 23 000 000.

However, I have one last question please. If I want to do a left join (optimal match) on several nodes that will match the following sql query :

select d_year, c_nation, sum(lo_revenue) - sum(lo_supplycost) as profit
from p_lineorder
left join dates on lo_orderdate = d_datekey
left join customer on lo_custkey = c_custkey
left join supplier on lo_suppkey = s_suppkey
left join part on lo_partkey = p_partkey
where c_region = 'AMERICA' and s_region = 'AMERICA' and (p_mfgr = 'MFGR#1' or p_mfgr = 'MFGR#2')
group by d_year, c_nation
order by d_year, c_nation;

is it ideal to do like the 1st cypher query or like the 2nd

1st cypher query :

profile match (c:customer) where c.C_REGION starts with "AMERICA"
optional match (c)<-[:order_customer]-(l:lineorder)-[:order_part]->(p:part),(d:date)<-[:order_date]-(l)-[:order_supplier]->(s:supplier)
using join on l
where (p.P_MFGR = "MFGR#2" or p.P_MFGR = "MFGR#2")
and s.S_REGION starts with "AMERICA"
return d.D_YEAR, c.C_NATION, sum(l.LO_REVENUE) as revenu, sum(l.LO_SUPPLYCOST) as supplycost
ORDER BY d.D_YEAR, c.C_NATION;

2nd cypher query :

profile match (c:customer) where c.C_REGION starts with "AMERICA"
optional match (c)<-[:order_customer]-(l:lineorder)-[:order_part]->(p:part)
optional match (d:date)<-[:order_date]-(l)-[:order_supplier]->(s:supplier)
using join on l
where (p.P_MFGR = "MFGR#2" or p.P_MFGR = "MFGR#2")
and s.S_REGION starts with "AMERICA"
return d.D_YEAR, c.C_NATION, sum(l.LO_REVENUE) as revenu, sum(l.LO_SUPPLYCOST) as supplycost
ORDER BY d.D_YEAR, c.C_NATION;

Should I use one optimal match with a semi-colon in the middle or two optimal matches (or some other formulation of the query?!). If there were linear nodes the problem does not arise, but as there is a star structure with a central node, the query is not clear.

Thank you in advance

michael_hunger · ‎09-27-2022

Depends on the questions you need to ask, both LineItem as Node and Relationship can make sense in different use-cases.

Neo4j

query relationship poor performance