Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.
09-23-2022 04:13 PM
Hello everyone,
Please, can someone help me?
I have set up a large date graph representing the ssb as shown in the diagram below. however, queries performed only on nodes even those with aggregation functions give very fast answers. however, as soon as I introduce relations with nodes in my queries, the results take a long time.
example
match (l:lineorder) return sum (l.revenue) gives a very fast response (0.89s).
however as soon as a relationship is in the query, the response time is very long. example
match (l:lineorder)-[r:order-date]->(d:date {year :1993})
where l.quantity > 15
return sum (l.revenue)
I put indexes on most of the attributes but not on the relationship since they don't contain any attribute.
I have tried everything even with explain and profile command but no results.
If someone could help me because it's blocking for me
Thank you in advance.
Solved! Go to Solution.
09-27-2022 07:44 AM - edited 09-27-2022 01:23 PM
Hi,
there are a few issues at work here.
First of all this is not really a graph query but a massive aggregation query.
Secondly you need to check your page-cache configuration as you have a lot of page-cache misses, which means that for large parts of your query you measure disk speed.
Third it is a bit of a modeling problem, one question is - why do you have 23M date nodes? Shouldn't it only be 365*years-covered?
I don't see the concept of an "Order" in your model, most likely that might be something that is currently represented by your date node.
Anyhow, here are some suggestions
* increase your page-cache, so that you don't get the page-cache misses anymore
* the neo4j property store is not the fastest to read from, it's faster to read from indexes
* so if you create a compound index on revenue and quantity that might help
* best would be to find the line-orders by quantity and then check against the 365 date nodes
match (d:Date {year :1993})
match (l:LineItem)-[r:ON_DATE]->(d)
using join on l
// using index on l:lineorder(quantity,revenue)
where l.quantity > 15 and l.revenue is not null
return sum (l.revenue);
You can also leave off the index hint, they are just there the illustrate the index usage.
I put it all into a self-contained example here.
In an ideal setup for those 4.7M rows it goes down to 500ms response.
https://gist.github.com/jexp/255dad7bc6328060e20fdb937b950056
09-23-2022 07:51 PM
Hi @redha_benhisse1 could you share what the profile of your query looks like? Thanks!
09-24-2022 04:49 AM
hi steggy
you have attached 2 requests with relations and a simple request. note that there are 2556 nodes of type DATE and not 23,000,000 (an error on my part in the diagram). there is only lineorder which contains 23,000,000 nodes
profile match (l:lineorder)-[r:order_date]->(d:date {D_YEAR: 1993})
where l.LO_QUANTITY < 15 return sum(l.LO_REVENUE);
profile match (l:lineorder)-[r:order_date]->(d:date {D_YEAR: 1993}) return count(r);
profile match (l:lineorder) return sum(l.LO_REVENUE);
thanks in advance
09-24-2022 11:01 AM
It seems the match by date is faster and produces less line order items to filter through when expanded, then when the first query which starts with finding the line order nodes as the query's anchor node. As such, I may try refactoring the query as follows to see if it faster:
match (d:date {D_YEAR: 1993})
match (d)<-[r:order_date]-(l:lineorder)
where l.LO_QUANTITY < 15
return sum(l.LO_REVENUE)
Also, are there any other nodes other than 'lineorder' nodes that can have an 'order_date' relationship to a 'date' node? If not, then you can remove the 'lineorder' label on line two, as it will require a filter operation after expanding 'd' on line two. You can see this in the profile for the second query. Worth a shot to see if it has a positive impact.
09-25-2022 10:43 AM
hi glilienfield
he problem still persists
When I launch the request like picture
profile match (d:date{D_YEAR:1993})
match (d)<-[r:order_date]-(l)
where l.LO_QUANTITY>15
return sum(l.LO_REVENUE);
the index relative to D_YEAR is used but not the one of LO_QUANTITY and LO_REVENUE that I created myself
but when I enter directly on the node, it uses the index relative to LO_QUANTITY and LO_REVENUE
as soon as we use the relations, the request becomes very heavy
please, any another idea
09-25-2022 01:24 PM
Can I use the same coposite index for two nodes or one node and one relationship ?
09-25-2022 02:09 PM - edited 09-25-2022 02:10 PM
This is not my expertise, but from what I read, it looks like the query planner picks one index to use as the starting point of the query. The rest of the query is performed by an 'expand' (through relationships) and applying 'filters' to evaluate predicates.
You could try adding two query hints to force the query planner to use both indexes. It will then perform a join on the two results to get the final result. You can see if this is any better. I think it would look like this:
match (d:date {D_YEAR: 1993})<-[r:order_date]-(l:lineorder)
USING INDEX d:date(D_YEAR)
USING INDEX l:lineorder(LO_QUANTITY)
where l.LO_QUANTITY < 15
return sum(l.LO_REVENUE)
Adding just one of the hints at a time specifies where to start in the query.
https://neo4j.com/docs/cypher-manual/current/query-tuning/using/
Adding just one hint as follows, should give you the first query plan you got, which was the query plans preferred choice.
match (d:date {D_YEAR: 1993})<-[r:order_date]-(l:lineorder)
USING INDEX l:lineorder(LO_QUANTITY)
where l.LO_QUANTITY < 15
return sum(l.LO_REVENUE)
Adding the hint as follows, should give you a query plan similar to the modified query I proposed.
match (d:date {D_YEAR: 1993})<-[r:order_date]-(l:lineorder)
USING INDEX d:date(D_YEAR)
where l.LO_QUANTITY < 15
return sum(l.LO_REVENUE)
09-27-2022 04:45 AM
I have tried all the proposed solutions but without success. I don't understand anything, the joins take too much time. sometimes 100 times that of oracle for the same configuration; however we have chosen a graph oriented database for the absence of joins. I'm really stuck here.
09-25-2022 02:12 PM
A composite node is for multiple properties for the same label. In your case, you have a query over two properties from different labels.
09-27-2022 07:44 AM - edited 09-27-2022 01:23 PM
Hi,
there are a few issues at work here.
First of all this is not really a graph query but a massive aggregation query.
Secondly you need to check your page-cache configuration as you have a lot of page-cache misses, which means that for large parts of your query you measure disk speed.
Third it is a bit of a modeling problem, one question is - why do you have 23M date nodes? Shouldn't it only be 365*years-covered?
I don't see the concept of an "Order" in your model, most likely that might be something that is currently represented by your date node.
Anyhow, here are some suggestions
* increase your page-cache, so that you don't get the page-cache misses anymore
* the neo4j property store is not the fastest to read from, it's faster to read from indexes
* so if you create a compound index on revenue and quantity that might help
* best would be to find the line-orders by quantity and then check against the 365 date nodes
match (d:Date {year :1993})
match (l:LineItem)-[r:ON_DATE]->(d)
using join on l
// using index on l:lineorder(quantity,revenue)
where l.quantity > 15 and l.revenue is not null
return sum (l.revenue);
You can also leave off the index hint, they are just there the illustrate the index usage.
I put it all into a self-contained example here.
In an ideal setup for those 4.7M rows it goes down to 500ms response.
https://gist.github.com/jexp/255dad7bc6328060e20fdb937b950056
09-27-2022 03:10 PM
Would you consider this refactored data model?
A similar query may look like this. Does this model and query provide the same information? I was not sure about the quantity > 15 criteria on a line-item level.
match (o:Order {Year :1993})
match (o)-[r:HAS_ITEM]->()
where r.Quantity > 15 and r.Revenue is not null
return sum (r.Revenue);
09-28-2022 01:56 AM
As Michael has already said, it depends on the issue you want to address. The model you have proposed is a production model for order tracking and it answers the request of course. The model that I have set up represents a data warehouse and more precisely a star schema and it provides more information, particularly in terms of analysis for the future.
09-28-2022 02:48 AM
Thank you very much Michael
It was clear and precise and boosted the response time of my requests. I did make a typo on the number of date nodes, there are 2556 and not 23 000 000.
However, I have one last question please. If I want to do a left join (optimal match) on several nodes that will match the following sql query :
select d_year, c_nation, sum(lo_revenue) - sum(lo_supplycost) as profit from p_lineorder left join dates on lo_orderdate = d_datekey left join customer on lo_custkey = c_custkey left join supplier on lo_suppkey = s_suppkey left join part on lo_partkey = p_partkey where c_region = 'AMERICA' and s_region = 'AMERICA' and (p_mfgr = 'MFGR#1' or p_mfgr = 'MFGR#2') group by d_year, c_nation order by d_year, c_nation;
is it ideal to do like the 1st cypher query or like the 2nd
1st cypher query :
profile match (c:customer) where c.C_REGION starts with "AMERICA"
optional match (c)<-[:order_customer]-(l:lineorder)-[:order_part]->(p:part),(d:date)<-[:order_date]-(l)-[:order_supplier]->(s:supplier)
using join on l
where (p.P_MFGR = "MFGR#2" or p.P_MFGR = "MFGR#2")
and s.S_REGION starts with "AMERICA"
return d.D_YEAR, c.C_NATION, sum(l.LO_REVENUE) as revenu, sum(l.LO_SUPPLYCOST) as supplycost
ORDER BY d.D_YEAR, c.C_NATION;
2nd cypher query :
profile match (c:customer) where c.C_REGION starts with "AMERICA"
optional match (c)<-[:order_customer]-(l:lineorder)-[:order_part]->(p:part)
optional match (d:date)<-[:order_date]-(l)-[:order_supplier]->(s:supplier)
using join on l
where (p.P_MFGR = "MFGR#2" or p.P_MFGR = "MFGR#2")
and s.S_REGION starts with "AMERICA"
return d.D_YEAR, c.C_NATION, sum(l.LO_REVENUE) as revenu, sum(l.LO_SUPPLYCOST) as supplycost
ORDER BY d.D_YEAR, c.C_NATION;
Should I use one optimal match with a semi-colon in the middle or two optimal matches (or some other formulation of the query?!). If there were linear nodes the problem does not arise, but as there is a star structure with a central node, the query is not clear.
Thank you in advance
09-27-2022 04:15 PM
Depends on the questions you need to ask, both LineItem as Node and Relationship can make sense in different use-cases.
All the sessions of the conference are now available online