What is the impact of unused indices?
‎03-07-2022 08:19 AM
What is the impact, if any, of creating indices on a labeled node that are unused in many instances of that node? More specifically, what is the impact on write performance, read performance, and space?
I'm using neo4j-enterprise v4.4.4.
I have a labeled node (:AnalysisResult
) whose instances are created by several different analyzers. Each analyzer is represented by a different labeled node (:Analyzer
). My current design adds an index to :AnalysisResult
for each property created by a specific analyzer.
There are several million instances of AnalyisResult in the database. Any one Analysis Result instance has only a few properties -- the others are unused. There are several dozen indices defined on AnalysisResult, and each instance populates only a handful (less than six) of these.
When I do read queries, I use labeled relationships in a cypher query to collect the AnalysisResult
instances for a given analyzer:
MATCH (analysisResult:AnalysisResult)-[ANALYZER]->(analyzer:Analyzer {analyzerID: 'aUniqueID'})
RETURN analysisResult, analyzer
The instances of AnalysisResult
returned by this query do not have property values for the properties that are unused by the Analyzer
that created them.
Is it worth refactoring the database to reduce the number of unused properties? For example, is it worth adding multiple labels, so that instead of :AnalysisResult
, I instead create :AnalysisResult:HotSpot
?
I ask because I'm about to introduce another Analyzer that will create several new properties on the AnalysisResult nodes that it creates.
This is perhaps an object modeling question, but my immediate concern is whether the current approach I'm using is good enough.
The result is that I'll have several million instances of AnalysisResult that will have only 2-3 properties populated, and the other indices defined on AnalysisResult will be empty.
Please feel free to suggest whether this question belongs in a different discussion group.
- Labels:
-
Cypher
‎03-14-2022 02:58 PM
What makes you think you need all of these indexes on AnalysisResult? The cypher you provided will not use any of the indexes on AnalysisResult but will use an index on Analyzer.analyserID if one exists. The relationship between the two nodes is a direct lookup, no indexes necessary, and will be executed starting with Analyzer nodes.
Now if you have other cypher that restricts on a particular property in AnalysisResult, it might be used.
Every index you add adds overhead to a transaction, thus making it longer. Plus they take up storage space and add to backup space (you are backing up your database?).
‎03-14-2022 03:33 PM
The cypher you provided will not use any of the indexes on AnalysisResult but will use an index on Analyzer.analyserID if one exists. The relationship between the two nodes is a direct lookup, no indexes necessary, and will be executed starting with Analyzer nodes.
Perhaps my thread-starter was unclear. I know that the given query doesn't require the indices, I offered it as an example of a query that does not use them. Most of the queries that hit the DB will be like this.
I note that, in the context of my question, the alternative to these sparsely-populated property indices is to add the complexity of multiple labels. I would expect this latter to have an impact -- and again, I want to quantify that impact.
The indexed property names of AnalysisResult
are sparsely populated -- several million read queries use four property names and do not use another.
Every index you add adds overhead to a transaction, thus making it longer. Plus they take up storage space and add to backup space (you are backing up your database?).
Again, perhaps my thread-starter was unclear. I know that some overhead is added, the documentation is clear about that.
My question is how much overhead and where is it seen?
If the overhead is on write queries, how much overhead? In my application, write queries are much less frequent than read queries, so the overhead may be tolerable -- or even welcome if the increase in read performance is significant (which I think it is).
How much additional storage space do they take up? How much additional backup space do they add?
I understand the heuristic -- "Use indices only when the benefit merits the cost". I'm looking to quantify the cost in order to confirm the benefit.
‎03-15-2022 06:15 AM
Sorry, I got hung up on the cypher you provided.
How much overhead and where that overhead is seen depends on a lot of factors. I can say that additional indexes will cause slower updates and a lot of indexes will slow down the optimizer some because it is scanning all those indexes to determine if one will work better than another. Because there are so many factors to consider, the only way to really know how much additional overhead will be caused is to experiment and see if the end result is still within your requirements for performance.
Another modeling option that would not require so many indexes may work better. I have had similar instances where I have a sparse set of data and I use a model that is (:PropertyType)<-[:REPRESENTS]-(:PropertyValue)<-[:HAS]-(:Analyzer). The PropertyValue label has one property, "value", and I can index that property. This does require a more sophisticated cypher if you want multiple properties on the same return row.
I'm not sure I really answered your question. I have a coffee cup on my desk that says, "It Depends ..." and this is one of those situations.
‎03-15-2022 06:45 AM
I appreciate your attention.
I started with cypher very similar to yours (a labeled node for each property value).
I moved away from it because of increasing read performance issues -- as you observe, it requires pulling one node for each property in a scenario where multiple properties are defined. My experience was that the resulting read query was slower as well as more sophisticated :).
I guess I'll try it and see what happens.