Neo4j

ainsausti · ‎05-23-2022

Hi,

we are trying to optimize a query but the time explodes (~20 seconds) when having around 40K nodes in the database, but it should be way faster.

First, I will describe a simplified description of our schema. We have the following nodes:

Usergroup
Feature
Asset
Section

We also have the following relationships:

A Feature has only one Section (IS_IN_SECTION)
A Feature has one or more Asset (CONTAINS_ASSET)
An asset may be restricted for a Usergroup (HAS_RESTRICTED_ASSET)
A Feature may be restricted for a Usergroup (HAS_RESTRICTED_FEATURE)
A Section, and therefore, all the Feature of that Section, may be restricted for a Usergroup (HAS_RESTRICTED_SECTION)
A Usergroup may have a parent Usergroup (HAS_PARENT_GROUP) and it should fulfill its restrictions and those of its parents

The goal is, given a Usergroup, to list the top 20 assets ordered by date, that don't have any restrictions with the Usergroup.

The current query is similar to:

(1)
MATCH path=(:UserGroup {uid: $usergroup_uid})-[:HAS_PARENT_GROUP*0..]->(root:UserGroup)
  WHERE NOT (root)-[:HAS_PARENT_GROUP]->(:UserGroup)
  WITH nodes(path) AS usergroups
  UNWIND usergroups AS ug

(2)
MATCH (node:Asset)
  WHERE AND NOT (node)<-[:CONTAINS_ASSET]-(:Feature)-[:IS_IN_SECTION]->(:Section)<-[:HAS_RESTRICTED_SECTION {restriction_type: "view"}]-(ug) 
  AND NOT (node)<-[:HAS_RESTRICTED_ASSET {restriction_type: "view"}]-(ug)
  AND NOT (node)<-[:CONTAINS_ASSET]-(:Feature)<-[:HAS_RESTRICTED_FEATURE {restriction_type: "view"}]-(ug)

RETURN DISTINCT node 
ORDER BY node.date DESC
SKIP 0
LIMIT 20

We have a few more types of restrictions but here we have the main idea.

Some observations we have made are:

If we execute the query part (1) adding return ug after unwind, this query is solved in 1ms
If we change the query part (1) to MATCH (ug:Usergroup {uid: $usergroup_uid}) ignoring the parent groups, the query is solved in around 800ms. And if we add back the original part (1) it is solved in 8 seconds even if the Usergroup has no parents.

Currently, our database is small compared to the expected number of nodes (~6 millions), and the number of restrictions will grow, and we need to optimize this kind of queries.

For that, we have these questions:

The NOT <restrictions> (ex: NOT (node)<-[:HAS_RESTRICTED_ASSET {restriction_type: "view"}]-(ug)) conditions is correct in this kind of situation or are there other approachs to get the job done more efficiently?
Do we need any type of index?
Is the structure of the schema right, or are there any inefficiencies?
How can we rewrite the part (1) of the query or what do you thinks is causing the overhead with it?

The database version is Neo4j 3.5.X
Captures of the query profile are:

Thanks in advance

bennu_neo · ‎05-25-2022

Now that I looked back on my query, this should also work:

MATCH (u:UserGroup {uid: $usergroup_uid})
CALL apoc.path.subgraphNodes(u, {
	relationshipFilter: "HAS_PARENT_GROUP>|HAS_RESTRICTED_ASSET>|CONTAINS_ASSET>|HAS_RESTRICTED_FEATURE>|HAS_RESTRICTED_SECTION|<IS_IN_SECTION",
    labelFilter: "/Asset"
}) yield node
with collect(id(node)) as bl
MATCH (node:Asset)
  WHERE id(node) not in bl
RETURN node 
ORDER BY node.date DESC
LIMIT 20

Bennu

Oh, y’all wanted a twist, ey?

View solution in original post

glilienfield · ‎05-24-2022

Is the logic in query 2 correct? You get a user’s groups in query 1. You then search group-by-group for all the nodes that that one group doesn’t have restrict access to. You then return the union of all these nodes that have been found to be unrestricted by at least one group. Don’t you need to find the nodes that are found to be unrestricted by all groups? In your result you could have a node that is found unrestricted by one group A, included in the result, but later found to be restricted by group B. This assets will be included in the result with the present logic.

Your algorithm needs to interrogate every asset node to find the ones that are not restricted. To do so, every path from that asset needs to be determinedw not to match one of the three patterns. I don’t think performance is going to scale well as the number of assets grows.

Maybe approaching this from the opposite direction would be better. Find the assets that the user is restricted from viewing. This is easy since you have paths to each restricted paths. You then collect the uniques asset identifier for each restricted asset. This should be a fast query. Now you need the asset nodes whose unique identifier is not in this list. This is a simple query and should be fast if you have an index on the property.

The question is how big of a list can cypher handle for a ‘not in’ clause. Oracle has a limit of 1000 elements, we approach that restriction by partitioning large lists into lists of a 1000 elements and ORing the results. What is the number of restricted assets for a typical user?

bennu_neo · ‎05-24-2022

Hello @ainsausti !

As @glilienfield suggested, retrieving nodes that not supposed to be traversed and then using their identifiers in order to remove them from a general match may be a better approach.

Quick question, do you APOC on your installation?

Bennu

Oh, y’all wanted a twist, ey?

glilienfield · ‎05-24-2022

@bennu.neo, I feel an efficient apoc solution coming.

bennu_neo · ‎05-24-2022

@glilienfield Allegedly guilty!

Oh, y’all wanted a twist, ey?

ainsausti · ‎05-25-2022

Thanks for the responses @bennu.neo and @glilienfield.

You are right, we've just tested this, and it's making the union of the lists.

We've considered this approach but we may have built the query incorrectly because all the attempts to face it in this way were even slower.

Restricted directly from a usergroup to an asset through the HAS_RESTRICTED_ASSET relationship there could be tipically less than 1000, but indirectly trhough the other relationships HAS_RESTRICTED_FEATURE, HAS_RESTRICTED_SECTION we could have 100K or more.

Yes we have APOC installed. So any suggestion with this approach will be valid for us

bennu_neo · ‎05-25-2022

Nice!

So let's play together a little bit. I'll not consider the property condition HAS_RESTRICTED_ASSET for a while so we can first get an idea of the timing like this.

MATCH (u:UserGroup {uid: $usergroup_uid})
WITH u
CALL apoc.path.subgraphNodes(u, {
	relationshipFilter: "HAS_PARENT_GROUP>"
}) yield node
WITH collect(node) AS usergroups
CALL apoc.path.subgraphNodes(usergroups, {
	relationshipFilter: "HAS_RESTRICTED_ASSET>|CONTAINS_ASSET>|HAS_RESTRICTED_FEATURE>|HAS_RESTRICTED_SECTION|<IS_IN_SECTION",
    labelFilter: "/Asset"
}) yield node
with collect(id(node)) as bl
MATCH (node:Asset)
  WHERE id(node) not in bl
RETURN node 
ORDER BY node.date DESC
LIMIT 20

Let see if this runs

Oh, y’all wanted a twist, ey?

glilienfield · ‎05-25-2022

One way to fix the logic to ensure all user groups don't have restricted access for an Asset node is to use the 'all' predicate. The following should work. I have included this for educational purposes so you can see how to solve the problem. Like @bennu.neo and I stated, your original approach doesn't seem like it will scale well. Try @bennu.neo's query to see how that works.

MATCH path=(:UserGroup {uid: $usergroup_uid})-[:HAS_PARENT_GROUP*0..]->(root:UserGroup)
  WHERE NOT (root)-[:HAS_PARENT_GROUP]->(:UserGroup)
  WITH nodes(path) AS usergroups
  
MATCH (node:Asset)
  WHERE all(ug in usergroups where NOT (node)<-[:CONTAINS_ASSET]-(:Feature)-[:IS_IN_SECTION]->(:Section)<-[:HAS_RESTRICTED_SECTION {restriction_type: "view"}]-(ug) 
  AND NOT (node)<-[:HAS_RESTRICTED_ASSET {restriction_type: "view"}]-(ug)
  AND NOT (node)<-[:CONTAINS_ASSET]-(:Feature)<-[:HAS_RESTRICTED_FEATURE{restriction_type:"view"}]-(ug))

RETURN DISTINCT node 
ORDER BY node.date DESC
SKIP 0
LIMIT 20

bennu_neo · ‎05-25-2022

Now that I looked back on my query, this should also work:

MATCH (u:UserGroup {uid: $usergroup_uid})
CALL apoc.path.subgraphNodes(u, {
	relationshipFilter: "HAS_PARENT_GROUP>|HAS_RESTRICTED_ASSET>|CONTAINS_ASSET>|HAS_RESTRICTED_FEATURE>|HAS_RESTRICTED_SECTION|<IS_IN_SECTION",
    labelFilter: "/Asset"
}) yield node
with collect(id(node)) as bl
MATCH (node:Asset)
  WHERE id(node) not in bl
RETURN node 
ORDER BY node.date DESC
LIMIT 20

Bennu

Oh, y’all wanted a twist, ey?

ainsausti · ‎05-25-2022

Thank you!! @bennu.neo and @glilienfield

I have tried the first query proposed by @bennu.neo, and it's incredibly fast (~100ms), I've also tried the one of @glilienfield, but this is still slow (~25s).

The following step is adapt and add the APOC query to our tests and verify that it works as expected.

I expect to do this tomorrow and I'll tell you back the results.

bennu_neo · ‎05-25-2022

Hi @ainsausti !

Keep in mind that my query should be equivalent to yours without the [:HAS_RESTRICTED_ASSET {restriction_type: "view"}] condition and just [:HAS_RESTRICTED_ASSET ]. Keep that on mind when comparing results. Then if they are they same, we can talk about ways to introduce that condition without loosing our performance.

Oh @glilienfield , don't abandon your child (query)

Oh, y’all wanted a twist, ey?

Isidoro · ‎06-08-2022

Hello @bennu_neo , I'm a workmate of @ainsausti , and I have succesfully implemented your suggested query, but I would like to introduce the {restriction_type: "view"} condition. Could you help us with this?

Thank you so much.

bennu_neo · ‎06-08-2022

Hi @Isidoro !

Welcome to the community!

About this query condition. We have 2 solutions/options.

1. Remodeling. How many restriction types do you have? does have sense on your model having different relationships for every different relationship type? In that case, your can use that especific relationship on your APOC filter.

2. Creating your own traverser. You can 'easily' extend apoc in order to add this kind of conditions in a custom traverser. Do you have any experience forking public repositories?

Oh, y’all wanted a twist, ey?

ainsausti · ‎06-09-2022

Hi again @bennu_neo.

As I understand, with the first option you propose to convert the restriction property in a relationship. For example, we have download restrictions and view restrictions, so instead of having HAS_RESTRICTED_ASSET with restriction_type:"view" or restriction_type: "download", we could have HAS_RESTRICTED_VIEW_ASSET and HAS_RESTRICTED_DOWNLOAD_ASSET. Is this correct? This option is feasible, but we may need to migrate the existing structure in our production environment.

The second option is not valid for us because we do not want to maintain a fork customized branch.

Thank you very much for your response!

bennu_neo · ‎06-09-2022

Hi @ainsausti !

Exactly. The first option may force you to remodel (something perfectly normal) in order to optimize other traversals but it will keep the performance of the query I showed you. I understand you position about the second option and it's completely valid. You can always create your own procedure that executes the traversal the way you want if changing the model became eventually not possible.

Soon this kind of queries will become much easier 😉

Oh, y’all wanted a twist, ey?

glilienfield · ‎05-25-2022

That is excellent. Hey, don't label that as my query. I didn't expect it to work well.

ainsausti · ‎06-09-2022

Hahaha, sorry, that was not my intention, I misread the post and interpreted that you were proposing that query 🙏

ainsausti · ‎09-09-2022

Hi again,

optimization issues are coming back 😅

We have now around 600K assets in the production database and the final load that the database should stand is around 4 million assets.

Our query now looks like this:

MATCH (u:UserGroup {uid: "fake_usergroup_uid"})
CALL apoc.path.subgraphNodes(u, {

	relationshipFilter: "HAS_PARENT_GROUP>|HAS_RESTRICTED_ASSET_VIEW>|CONTAINS_ASSET>|HAS_RESTRICTED_FEATURE_VIEW>|HAS_RESTRICTED_SECTION_VIEW|<IS_IN_SECTION|HAS_RESTRICTED_PROVIDER_VIEW|<HAS_PROVIDER|HAS_RESTRICTED_CATEGORY_VIEW|<HAS_CATEGORY",

	labelFilter: "/Asset"

	})

	yield node
	with collect(id(node)) as bl
	MATCH (node:Asset)
	WHERE not id(node) in bl
	WITH node
	WHERE node.logically_deleted = false 
	RETURN DISTINCT node.date ORDER BY node.date DESC 
        SKIP 0 LIMIT 50

This query is completed in around 3~4 seconds. But this is the base query, more filters could be added dynamically making the query last more than 10 seconds.

After a bit of research, we've realized that by removing the "order by" statement from the query, the completion time is reduced to 30 ms, so we've been focused on reducing the time of the "order by" with indexes. The experiments we've tried are:

Simply adding an index to Asset node.date -> Without significant improvement.
With the index, adding an artificial where clause to activate the index WHERE node.date is not NULL as in the example showed in the docs (https://neo4j.com/docs/cypher-manual/current/query-tuning/advanced-example/)-> Without significant improvements
We thought that maybe our neo4j version was outdated and didn't applied the index correctly (v3.5.14), we've migrated to v4.4.11 and tried -> Without significant improvements.

Is there something that we are not understanding about how indexes work in order by clauses?

Another piece of information that may be impacting the performance is that node.date is not a native Neo4j date but a string with the format "YYYY-MM-DD". Should transforming the date to a native data structure improve the performance?

Thanks in advance

bennu_neo · ‎09-12-2022

Hey @ainsausti !

Good to see you around here again. Let's build a case...

First thing, Why does the relationship HAS_RESTRICTED_SECTION_VIEW has no relationship attached on your subgraph expansion?

What's the order of magnitude (cardinality) of assets on that subgraph expansion?

What's the count of Assets in general? What's the percentage of them with the property logically_deleted being true/false?

See u around

Oh, y’all wanted a twist, ey?

ainsausti · ‎09-12-2022

Hi again @bennu_neo! Thanks for your fast response!

As for the first question, sorry, but I don't really understand what you mean with "has no relationship attached". Do you mean why it doesn't have a direction specified?

The asset count we currently have is 537446.

There are 8700 assets restricted for the usergroup we are testing that match with the subgraph expansion. On average, the cardinality of these nodes is ~13 relationships.

The percentage of assets with the property logically deleted being true is very low -> 0.003% (17 assets)

I've also attached the profile in png.

bennu_neo · ‎09-12-2022

My bad @ainsausti , I was trying to mean "direction".

How long does the first query takes? (just the apoc).

So if I understand properly, you look for some Assets in your first query but then you want EVERY Asset that's not softly deleted that it's not part of your previous list (while also ordering)?

Oh, y’all wanted a twist, ey?

bennu_neo · ‎09-12-2022

In other thoughts @ainsausti ,

Do you have an ids on Assets with an index? Can you create a composite index with the delte property and try something like:

MATCH (u:UserGroup {uid: "fake_usergroup_uid"})
CALL apoc.path.subgraphNodes(u, {

	relationshipFilter: "HAS_PARENT_GROUP>|HAS_RESTRICTED_ASSET_VIEW>|CONTAINS_ASSET>|HAS_RESTRICTED_FEATURE_VIEW>|HAS_RESTRICTED_SECTION_VIEW|<IS_IN_SECTION|HAS_RESTRICTED_PROVIDER_VIEW|<HAS_PROVIDER|HAS_RESTRICTED_CATEGORY_VIEW|<HAS_CATEGORY",

	labelFilter: "/Asset"
	})
	yield node
	with collect(node.id) as bl
	MATCH (node:Asset)
	WHERE not node.id in bl
	AND node.logically_deleted = false 
	RETURN DISTINCT node.date ORDER BY node.date DESC 
        SKIP 0 LIMIT 50

And share the profile. It may not solve the issue but we can get some ideas 🙂

Oh, y’all wanted a twist, ey?

ainsausti · ‎09-13-2022

Hi again!

There is no reason why the direction is not set in some of the relationship filters, I've added them.

The apoc query takes 32 ms. So, I think, it's not the problematic part of the query.

You understood correctly the intention of the query. Nonetheless, this is the base query, some filters are added dynamically to this one. For example, to find the assets with a description equal to "Test description", the condition "AND node.description = 'Test description' " will be added.

We have a "uid" property with a unique index. I've added a composite index on Asset.uid and Asset.logically_deleted and chnged the query. But still there is no improvement. I think the index are not applied in the execution.

The profile with the index is:

I think the index was created succesfully:

Neo4j

Query performance with not existing relationship checks in Neo4j