Why graph indexing is important

chim3yy
Node Link

Hello Sir/Madam,
I have been reading various papers on graph indexing, but I am still extremely confused.

  1. Firstly, why is graph indexing important? Several papers say it is for finding similar subgraphs.
  2. How is it related to graph algorithms and graph analytics?
  3. Does graph indexing work by dimension reduction, or is it meant to find clusters of similar data sets?
  4. In a practical scenario, how does graph indexing help? For example, does it improve the speed of query processing or the accuracy of clustering?

Any suggestions on this topic would be extremely valuable.

5 REPLIES

12kunal34
Graph Fellow

Hi @chim3yy

Welcome to the community.
We are glad to know that you are interested in Neo4j.

Indexing is something of a rule of thumb for performance in Neo4j. Whenever you run a query against a graph database, indexes are the first thing to look at, because they largely determine the total db hits your query needs. When you have a large amount of data, you can see the difference that indexes make in performance and query time.
For more info, you can refer to the blog below.

When Neo4j creates an index, it creates a redundant copy of the data in the database. Therefore using an index will result in more disk space being utilized, plus slower writes to the disk.
Therefore, you need to weigh up these factors when deciding which data/properties to index.
Generally, it's a good idea to create an index when you know there's going to be a lot of data on certain nodes. Also, if you find queries are taking too long to return, adding an index may help.

Hope this helps with your query.

Please let me know if you require further details.

Cheers.

Thank you so much for your response. This definitely answered why indexing is important for Neo4j. If I'm not wrong, using an index for query processing allows fast retrieval of the features that you have indexed. However, it can lead to high memory usage along with heavier writes. But I'm still confused about why graph-based indexing is required for feature selection. Your suggestions are really valuable, thank you for your time.

Regarding the comment:

"When Neo4j creates an index, it creates a redundant copy of the data in the database. Therefore using an index will result in more disk space being utilized, plus slower writes to the disk.
Therefore, you need to weigh up these factors when deciding which data/properties to index."

As a point of clarification, "creates a redundant copy of the data" should be "creates a redundant copy of the data for the property indexed."

For example, if you have 100 million :Person nodes, each with 20 properties, and you then create an index on :Person(age), we do not create a duplicate copy of those 100 million :Person nodes with their 20 properties. Rather, we simply create a redundant copy of the age property for those 100 million :Person nodes.
Also, with regard to "using an index will result in more disk space being utilized, plus slower writes to the disk": this is true, but it is true of nearly every RDBMS as well. Indexes are not exactly free. They are free to create, yes, but they do impact load/write performance, simply because as you update the data you also need to update the associated indexes.
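
To make the write cost concrete, here is a minimal Python sketch. It is a toy model and not how Neo4j stores anything internally; the `PersonStore` class, the `age_index` list, and the `index_writes` counter are all made up for illustration. The only point is that every write to an indexed property also has to touch the index.

```python
import bisect


class PersonStore:
    """Toy node store with an optional sorted index on `age` (hypothetical, not Neo4j internals)."""

    def __init__(self, indexed):
        self.nodes = {}        # node id -> property map
        self.indexed = indexed
        self.age_index = []    # sorted list of (age, node id) entries
        self.index_writes = 0  # counts the extra index maintenance operations

    def set_age(self, node_id, age):
        old = self.nodes.get(node_id, {}).get("age")
        self.nodes.setdefault(node_id, {})["age"] = age  # the write to the data itself
        if not self.indexed:
            return
        if old is not None:
            # An update must first remove the stale index entry.
            pos = bisect.bisect_left(self.age_index, (old, node_id))
            del self.age_index[pos]
            self.index_writes += 1
        # Every write of the indexed property also inserts a new index entry.
        bisect.insort(self.age_index, (age, node_id))
        self.index_writes += 1


plain = PersonStore(indexed=False)
indexed = PersonStore(indexed=True)
for store in (plain, indexed):
    store.set_age(1, 25)
    store.set_age(2, 40)
    store.set_age(1, 26)  # update: old entry removed, new entry added

print(plain.index_writes, indexed.index_writes)
```

The unindexed store does no extra work, while the indexed one pays for index maintenance on each write. That is the trade-off to weigh when deciding which properties to index.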

As to indexes and why they are important, if one runs

match (n:Person) where n.age>20 and n.age<30 return n;

without an index on the age property, we would need to iterate over the 100 million :Person nodes and check each node to see if it satisfies the WHERE clause. However, with an index on :Person(age), the query would be much faster, because the index already holds the age values and can be used to find the matching nodes directly.
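
The difference between the two approaches can be sketched in Python. This is only an illustration under simplifying assumptions: a sorted list of (age, node id) pairs stands in for the index, 100,000 random nodes stand in for the 100 million, and the function names `scan` and `index_seek` are hypothetical. Neo4j's real index structures are more sophisticated.

```python
import bisect
import random

random.seed(0)
# node id -> properties; a small stand-in for the :Person nodes
people = {node_id: {"age": random.randint(0, 90)} for node_id in range(100_000)}


def scan(people, low, high):
    """No index: check every node against the WHERE clause."""
    checked = 0
    hits = []
    for node_id, props in people.items():
        checked += 1
        if low < props["age"] < high:
            hits.append(node_id)
    return hits, checked


# Toy index on age: sorted (age, node id) entries.
age_index = sorted((props["age"], node_id) for node_id, props in people.items())


def index_seek(age_index, low, high):
    """With an index: binary-search the range boundaries, then read only matching entries."""
    start = bisect.bisect_right(age_index, (low, float("inf")))  # first entry with age > low
    end = bisect.bisect_left(age_index, (high, -1))              # first entry with age >= high
    entries = age_index[start:end]
    return [node_id for _, node_id in entries], len(entries)


scan_hits, scan_checked = scan(people, 20, 30)
seek_hits, seek_touched = index_seek(age_index, 20, 30)
print(scan_checked, seek_touched)
```

Both functions return the same nodes, but the scan inspects every node while the seek only touches the index entries that fall inside the range.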

As a clarification, I don't believe the nodes or their data are duplicated. I believe what the index holds is the graph id, which is essentially a pointer to the nodes in question. The value of the indexed property is stored in the index, however.

So we're keeping the minimal amount of data to serve the index, and not duplicating nodes or node data. Additionally, since we have the indexed property value in the index, we can use that in some optimization scenarios, such as 3.5's index-backed ORDER BY operations, when a hint is provided in the query about the property's type.
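
A rough way to picture this, again as a toy Python model and not the actual on-disk format: each index entry holds just the indexed value and a node id, and because the entries are kept in sorted order, walking the index already yields the ORDER BY order. The sample nodes and the `ordered_by_age` helper are invented for the example.

```python
# node id -> full property map (the node data lives here, once)
nodes = {
    10: {"name": "Ann", "age": 34, "city": "Oslo"},
    11: {"name": "Bob", "age": 27, "city": "Lima"},
    12: {"name": "Cy", "age": 41, "city": "Pune"},
}

# Toy index on age: only the indexed value plus a pointer (node id) per entry.
age_index = sorted((props["age"], node_id) for node_id, props in nodes.items())


def ordered_by_age(age_index, nodes):
    """Walk the index front to back: ascending age with no separate sort step."""
    return [nodes[node_id]["name"] for _, node_id in age_index]


print(age_index)
print(ordered_by_age(age_index, nodes))
```

The name and city properties never appear in the index; they are fetched from the node through its id only when the query needs them.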

chim3yy
Node Link

@dana.canzano @andrew.bowman Thank you so much for your time and response. I believe most of my doubts are clarified.