Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.
05-23-2022 09:13 AM
Hi,
we are trying to optimize a query but the time explodes (~20 seconds) when having around 40K nodes in the database, but it should be way faster.
First, I will describe a simplified description of our schema. We have the following nodes:
We also have the following relationships:
The goal is, given a Usergroup, to list the top 20 assets ordered by date, that don't have any restrictions with the Usergroup.
The current query is similar to:
(1)
MATCH path=(:UserGroup {uid: $usergroup_uid})-[:HAS_PARENT_GROUP*0..]->(root:UserGroup)
WHERE NOT (root)-[:HAS_PARENT_GROUP]->(:UserGroup)
WITH nodes(path) AS usergroups
UNWIND usergroups AS ug
(2)
MATCH (node:Asset)
WHERE AND NOT (node)<-[:CONTAINS_ASSET]-(:Feature)-[:IS_IN_SECTION]->(:Section)<-[:HAS_RESTRICTED_SECTION {restriction_type: "view"}]-(ug)
AND NOT (node)<-[:HAS_RESTRICTED_ASSET {restriction_type: "view"}]-(ug)
AND NOT (node)<-[:CONTAINS_ASSET]-(:Feature)<-[:HAS_RESTRICTED_FEATURE {restriction_type: "view"}]-(ug)
RETURN DISTINCT node
ORDER BY node.date DESC
SKIP 0
LIMIT 20
We have a few more types of restrictions but here we have the main idea.
Some observations we have made are:
return ug
after unwind, this query is solved in 1msMATCH (ug:Usergroup {uid: $usergroup_uid})
ignoring the parent groups, the query is solved in around 800ms. And if we add back the original part (1) it is solved in 8 seconds even if the Usergroup has no parents.Currently, our database is small compared to the expected number of nodes (~6 millions), and the number of restrictions will grow, and we need to optimize this kind of queries.
For that, we have these questions:
NOT <restrictions>
(ex: NOT (node)<-[:HAS_RESTRICTED_ASSET {restriction_type: "view"}]-(ug)
) conditions is correct in this kind of situation or are there other approachs to get the job done more efficiently?The database version is Neo4j 3.5.X
Captures of the query profile are:
Thanks in advance
Solved! Go to Solution.
05-25-2022 10:10 AM
Now that I looked back on my query, this should also work:
MATCH (u:UserGroup {uid: $usergroup_uid})
CALL apoc.path.subgraphNodes(u, {
relationshipFilter: "HAS_PARENT_GROUP>|HAS_RESTRICTED_ASSET>|CONTAINS_ASSET>|HAS_RESTRICTED_FEATURE>|HAS_RESTRICTED_SECTION|<IS_IN_SECTION",
labelFilter: "/Asset"
}) yield node
with collect(id(node)) as bl
MATCH (node:Asset)
WHERE id(node) not in bl
RETURN node
ORDER BY node.date DESC
LIMIT 20
Bennu
05-24-2022 06:53 AM
Is the logic in query 2 correct? You get a user’s groups in query 1. You then search group-by-group for all the nodes that that one group doesn’t have restrict access to. You then return the union of all these nodes that have been found to be unrestricted by at least one group. Don’t you need to find the nodes that are found to be unrestricted by all groups? In your result you could have a node that is found unrestricted by one group A, included in the result, but later found to be restricted by group B. This assets will be included in the result with the present logic.
Your algorithm needs to interrogate every asset node to find the ones that are not restricted. To do so, every path from that asset needs to be determinedw not to match one of the three patterns. I don’t think performance is going to scale well as the number of assets grows.
Maybe approaching this from the opposite direction would be better. Find the assets that the user is restricted from viewing. This is easy since you have paths to each restricted paths. You then collect the uniques asset identifier for each restricted asset. This should be a fast query. Now you need the asset nodes whose unique identifier is not in this list. This is a simple query and should be fast if you have an index on the property.
The question is how big of a list can cypher handle for a ‘not in’ clause. Oracle has a limit of 1000 elements, we approach that restriction by partitioning large lists into lists of a 1000 elements and ORing the results. What is the number of restricted assets for a typical user?
05-24-2022 07:03 AM
Hello @ainsausti !
As @glilienfield suggested, retrieving nodes that not supposed to be traversed and then using their identifiers in order to remove them from a general match may be a better approach.
Quick question, do you APOC on your installation?
Bennu
05-25-2022 01:54 AM
Thanks for the responses @bennu.neo and @glilienfield.
You are right, we've just tested this, and it's making the union of the lists.
We've considered this approach but we may have built the query incorrectly because all the attempts to face it in this way were even slower.
Restricted directly from a usergroup to an asset through the HAS_RESTRICTED_ASSET
relationship there could be tipically less than 1000, but indirectly trhough the other relationships HAS_RESTRICTED_FEATURE
, HAS_RESTRICTED_SECTION
we could have 100K or more.
Yes we have APOC installed. So any suggestion with this approach will be valid for us
05-25-2022 02:21 AM
Nice!
So let's play together a little bit. I'll not consider the property condition HAS_RESTRICTED_ASSET for a while so we can first get an idea of the timing like this.
MATCH (u:UserGroup {uid: $usergroup_uid})
WITH u
CALL apoc.path.subgraphNodes(u, {
relationshipFilter: "HAS_PARENT_GROUP>"
}) yield node
WITH collect(node) AS usergroups
CALL apoc.path.subgraphNodes(usergroups, {
relationshipFilter: "HAS_RESTRICTED_ASSET>|CONTAINS_ASSET>|HAS_RESTRICTED_FEATURE>|HAS_RESTRICTED_SECTION|<IS_IN_SECTION",
labelFilter: "/Asset"
}) yield node
with collect(id(node)) as bl
MATCH (node:Asset)
WHERE id(node) not in bl
RETURN node
ORDER BY node.date DESC
LIMIT 20
Let see if this runs
05-25-2022 09:54 AM
One way to fix the logic to ensure all user groups don't have restricted access for an Asset node is to use the 'all' predicate. The following should work. I have included this for educational purposes so you can see how to solve the problem. Like @bennu.neo and I stated, your original approach doesn't seem like it will scale well. Try @bennu.neo's query to see how that works.
MATCH path=(:UserGroup {uid: $usergroup_uid})-[:HAS_PARENT_GROUP*0..]->(root:UserGroup)
WHERE NOT (root)-[:HAS_PARENT_GROUP]->(:UserGroup)
WITH nodes(path) AS usergroups
MATCH (node:Asset)
WHERE all(ug in usergroups where NOT (node)<-[:CONTAINS_ASSET]-(:Feature)-[:IS_IN_SECTION]->(:Section)<-[:HAS_RESTRICTED_SECTION {restriction_type: "view"}]-(ug)
AND NOT (node)<-[:HAS_RESTRICTED_ASSET {restriction_type: "view"}]-(ug)
AND NOT (node)<-[:CONTAINS_ASSET]-(:Feature)<-[:HAS_RESTRICTED_FEATURE{restriction_type:"view"}]-(ug))
RETURN DISTINCT node
ORDER BY node.date DESC
SKIP 0
LIMIT 20
05-25-2022 10:10 AM
Now that I looked back on my query, this should also work:
MATCH (u:UserGroup {uid: $usergroup_uid})
CALL apoc.path.subgraphNodes(u, {
relationshipFilter: "HAS_PARENT_GROUP>|HAS_RESTRICTED_ASSET>|CONTAINS_ASSET>|HAS_RESTRICTED_FEATURE>|HAS_RESTRICTED_SECTION|<IS_IN_SECTION",
labelFilter: "/Asset"
}) yield node
with collect(id(node)) as bl
MATCH (node:Asset)
WHERE id(node) not in bl
RETURN node
ORDER BY node.date DESC
LIMIT 20
Bennu
05-25-2022 11:22 AM
Thank you!! @bennu.neo and @glilienfield
I have tried the first query proposed by @bennu.neo, and it's incredibly fast (~100ms), I've also tried the one of @glilienfield, but this is still slow (~25s).
The following step is adapt and add the APOC query to our tests and verify that it works as expected.
I expect to do this tomorrow and I'll tell you back the results.
05-25-2022 12:25 PM
Hi @ainsausti !
Keep in mind that my query should be equivalent to yours without the [:HAS_RESTRICTED_ASSET {restriction_type: "view"}]
condition and just [:HAS_RESTRICTED_ASSET ]
. Keep that on mind when comparing results. Then if they are they same, we can talk about ways to introduce that condition without loosing our performance.
Oh @glilienfield , don't abandon your child (query)
06-08-2022 04:18 AM
Hello @bennu_neo , I'm a workmate of @ainsausti , and I have succesfully implemented your suggested query, but I would like to introduce the {restriction_type: "view"} condition. Could you help us with this?
Thank you so much.
06-08-2022 04:56 AM
Hi @Isidoro !
Welcome to the community!
About this query condition. We have 2 solutions/options.
1. Remodeling. How many restriction types do you have? does have sense on your model having different relationships for every different relationship type? In that case, your can use that especific relationship on your APOC filter.
2. Creating your own traverser. You can 'easily' extend apoc in order to add this kind of conditions in a custom traverser. Do you have any experience forking public repositories?
06-09-2022 12:17 AM
Hi again @bennu_neo.
As I understand, with the first option you propose to convert the restriction property in a relationship. For example, we have download restrictions and view restrictions, so instead of having HAS_RESTRICTED_ASSET with restriction_type:"view" or restriction_type: "download", we could have HAS_RESTRICTED_VIEW_ASSET and HAS_RESTRICTED_DOWNLOAD_ASSET. Is this correct? This option is feasible, but we may need to migrate the existing structure in our production environment.
The second option is not valid for us because we do not want to maintain a fork customized branch.
Thank you very much for your response!
06-09-2022 02:30 AM
Hi @ainsausti !
Exactly. The first option may force you to remodel (something perfectly normal) in order to optimize other traversals but it will keep the performance of the query I showed you. I understand you position about the second option and it's completely valid. You can always create your own procedure that executes the traversal the way you want if changing the model became eventually not possible.
Soon this kind of queries will become much easier 😉
05-25-2022 11:47 AM
That is excellent. Hey, don't label that as my query. I didn't expect it to work well.
06-09-2022 12:19 AM
Hahaha, sorry, that was not my intention, I misread the post and interpreted that you were proposing that query 🙏
09-09-2022 09:19 AM
Hi again,
optimization issues are coming back 😅
We have now around 600K assets in the production database and the final load that the database should stand is around 4 million assets.
Our query now looks like this:
MATCH (u:UserGroup {uid: "fake_usergroup_uid"})
CALL apoc.path.subgraphNodes(u, {
relationshipFilter: "HAS_PARENT_GROUP>|HAS_RESTRICTED_ASSET_VIEW>|CONTAINS_ASSET>|HAS_RESTRICTED_FEATURE_VIEW>|HAS_RESTRICTED_SECTION_VIEW|<IS_IN_SECTION|HAS_RESTRICTED_PROVIDER_VIEW|<HAS_PROVIDER|HAS_RESTRICTED_CATEGORY_VIEW|<HAS_CATEGORY",
labelFilter: "/Asset"
})
yield node
with collect(id(node)) as bl
MATCH (node:Asset)
WHERE not id(node) in bl
WITH node
WHERE node.logically_deleted = false
RETURN DISTINCT node.date ORDER BY node.date DESC
SKIP 0 LIMIT 50
This query is completed in around 3~4 seconds. But this is the base query, more filters could be added dynamically making the query last more than 10 seconds.
After a bit of research, we've realized that by removing the "order by" statement from the query, the completion time is reduced to 30 ms, so we've been focused on reducing the time of the "order by" with indexes. The experiments we've tried are:
Is there something that we are not understanding about how indexes work in order by clauses?
Another piece of information that may be impacting the performance is that node.date is not a native Neo4j date but a string with the format "YYYY-MM-DD". Should transforming the date to a native data structure improve the performance?
Thanks in advance
09-12-2022 01:02 AM
Hey @ainsausti !
Good to see you around here again. Let's build a case...
First thing, Why does the relationship HAS_RESTRICTED_SECTION_VIEW has no relationship attached on your subgraph expansion?
What's the order of magnitude (cardinality) of assets on that subgraph expansion?
What's the count of Assets in general? What's the percentage of them with the property logically_deleted being true/false?
See u around
09-12-2022 02:30 AM
Hi again @bennu_neo! Thanks for your fast response!
As for the first question, sorry, but I don't really understand what you mean with "has no relationship attached". Do you mean why it doesn't have a direction specified?
The asset count we currently have is 537446.
There are 8700 assets restricted for the usergroup we are testing that match with the subgraph expansion. On average, the cardinality of these nodes is ~13 relationships.
The percentage of assets with the property logically deleted being true is very low -> 0.003% (17 assets)
I've also attached the profile in png.
09-12-2022 04:02 AM
My bad @ainsausti , I was trying to mean "direction".
How long does the first query takes? (just the apoc).
So if I understand properly, you look for some Assets in your first query but then you want EVERY Asset that's not softly deleted that it's not part of your previous list (while also ordering)?
09-12-2022 04:28 AM
In other thoughts @ainsausti ,
Do you have an ids on Assets with an index? Can you create a composite index with the delte property and try something like:
MATCH (u:UserGroup {uid: "fake_usergroup_uid"})
CALL apoc.path.subgraphNodes(u, {
relationshipFilter: "HAS_PARENT_GROUP>|HAS_RESTRICTED_ASSET_VIEW>|CONTAINS_ASSET>|HAS_RESTRICTED_FEATURE_VIEW>|HAS_RESTRICTED_SECTION_VIEW|<IS_IN_SECTION|HAS_RESTRICTED_PROVIDER_VIEW|<HAS_PROVIDER|HAS_RESTRICTED_CATEGORY_VIEW|<HAS_CATEGORY",
labelFilter: "/Asset"
})
yield node
with collect(node.id) as bl
MATCH (node:Asset)
WHERE not node.id in bl
AND node.logically_deleted = false
RETURN DISTINCT node.date ORDER BY node.date DESC
SKIP 0 LIMIT 50
And share the profile. It may not solve the issue but we can get some ideas 🙂
09-13-2022 05:47 AM
Hi again!
There is no reason why the direction is not set in some of the relationship filters, I've added them.
The apoc query takes 32 ms. So, I think, it's not the problematic part of the query.
You understood correctly the intention of the query. Nonetheless, this is the base query, some filters are added dynamically to this one. For example, to find the assets with a description equal to "Test description", the condition "AND node.description = 'Test description' " will be added.
We have a "uid" property with a unique index. I've added a composite index on Asset.uid and Asset.logically_deleted and chnged the query. But still there is no improvement. I think the index are not applied in the execution.
The profile with the index is:
I think the index was created succesfully:
All the sessions of the conference are now available online