List lookup vs index property lookup

Hi all,

I am struggling with an optimization of my graph model and I was hoping you could help me!

All Cypher queries and profiles were run on Neo4j Desktop 1.2.7 with Neo4j 4.0.3.

I have a graph model in which I label my Gene nodes like this: CREATE (g:Symbol:Gene {gname:'Gene1'}).
As some of you may know, genes can have quite a large number of aliases, and I am on the fence about which of the following ways I should use to model them:

  1. (g:Symbol:Gene)-[:HAS_ALIASES]->(a:Aliases {synonyms:['Alias1', 'Alias2', ...]})
  2. (g:Symbol:Gene)-[:HAS_ALIAS]->(a:Alias {synonym:'Alias1'}), (g:Symbol:Gene)-[:HAS_ALIAS]->(a:Alias {synonym:'Alias2'}), ...
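
For anyone who wants to reproduce the two shapes, they could be created roughly as follows. This is only a sketch: the gene name and alias values are placeholders, and the two statements are meant to be run in separate test databases so the same gene is not created twice.

```cypher
// Approach 1: one :Aliases node holding all synonyms in a list property.
CREATE (g:Symbol:Gene {gname: 'Gene1'})
CREATE (g)-[:HAS_ALIASES]->(:Aliases {synonyms: ['Alias1', 'Alias2']});

// Approach 2: one :Alias node per synonym.
// (Placeholder values; run in a separate test database from approach 1.)
CREATE (g:Symbol:Gene {gname: 'Gene1'})
WITH g
UNWIND ['Alias1', 'Alias2'] AS synonym
CREATE (g)-[:HAS_ALIAS]->(:Alias {synonym: synonym});
```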

It is important for me that these lookups are as fast as possible, because you never know beforehand whether an incoming dataset contains only official symbols (e.g. HGNC), only synonyms, or a combination of both.

I have run a quick profile using both approaches and the results were a bit surprising to me. For the test I created 1 gene with 4 aliases using approach (1) and again using approach (2). When I profiled my query, approach (2) had fewer db hits but took longer than approach (1). And when I indexed the 'synonym' property, it took even longer with even fewer db hits?

I thought approach (2) would win for sure because Neo4j is optimized for traversals and not for scanning a long list stored in a property. Can someone explain to me why this is happening? Or suggest a better way of modelling this? This problem also applies to other IDs, especially Ensembl gene and protein IDs.

Thanks in advance for your feedback!

1 ACCEPTED SOLUTION

If aliases are only used for retrieval (you already have the gene and want its aliases), never for lookup, then route 1 is what you want, as that will require only a single traversal and a single property read, versus n traversals and n property reads.

If you need to look up a :Gene or :Symbol node via an alias, then you need to go with route 2, since you can index :Alias(synonym) to speed up the lookup, but you cannot apply an index to speed up route 1, since elements in a list property can't be individually indexed at this time.
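
As a rough illustration of the difference, the two lookups might look like the sketch below. The index name alias_synonym and the alias value are made up for the example; the labels, relationship types and properties are the ones from the question, and the index syntax is the 4.0 form.

```cypher
// Route 2: index the single-valued property so the planner can use an index seek.
// (alias_synonym is a hypothetical index name.)
CREATE INDEX alias_synonym FOR (a:Alias) ON (a.synonym);

// Find the gene starting from the alias; the starting node comes from the index.
MATCH (a:Alias {synonym: 'Alias1'})<-[:HAS_ALIAS]-(g:Gene)
RETURN g.gname;

// Route 1: the list has to be checked on every :Aliases node,
// because individual list elements cannot be indexed.
MATCH (a:Aliases)<-[:HAS_ALIASES]-(g:Gene)
WHERE 'Alias1' IN a.synonyms
RETURN g.gname;
```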

Also, using only 4 entries as a test won't be a good indicator of real performance with actual data. You need to consider how this needs to scale as the number of :Alias nodes increases. An index lookup will always beat a label scan + filter at scale.
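
One way to get a more representative test is to generate a larger synthetic graph and profile the lookup against it. The sketch below is only an example: the node counts and the 'GeneN_AliasM' naming scheme are arbitrary assumptions, not real gene data.

```cypher
// Hypothetical test data: 20,000 genes with 5 aliases each (100,000 :Alias nodes).
UNWIND range(1, 20000) AS i
CREATE (g:Symbol:Gene {gname: 'Gene' + toString(i)})
WITH g, i
UNWIND range(1, 5) AS j
CREATE (g)-[:HAS_ALIAS]->(:Alias {synonym: 'Gene' + toString(i) + '_Alias' + toString(j)});

// Profile the alias lookup and inspect the first operator in the plan.
PROFILE
MATCH (a:Alias {synonym: 'Gene12345_Alias3'})<-[:HAS_ALIAS]-(g:Gene)
RETURN g.gname;
```

Without an index on :Alias(synonym) the plan should start with a NodeByLabelScan followed by a Filter; with the index in place it should start with a NodeIndexSeek. Comparing db hits between those two plans at this size says much more than wall-clock time on a 4-node test.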

Hi Andrew,

I thought as much, thank you very much for your answer!

I'll increase the size of my test graph for further performance profiling.