04-18-2020 03:26 AM
Hi all,
I am struggling with an optimization of my Graph Model and I was hoping you could help me!
All cypher query results and profiles have been performed on Neo4j Desktop 1.2.7 and Neo4j 4.0.3
I have a graph model in which I create my Gene nodes like this: CREATE (g:Symbol:Gene {gname:'Gene1'}).
As some of you may know, genes can have quite a large number of aliases, and I am on the fence about which of the following two ways I should use to model them:
(1) (g:Symbol:Gene)-[:HAS_ALIASES]->(a:Aliases {synonyms:['Alias1', 'Alias2', ...]})
(2) (g:Symbol:Gene)-[:HAS_ALIAS]->(a1:Alias {synonym:'Alias1'}), (g)-[:HAS_ALIAS]->(a2:Alias {synonym:'Alias2'}), ...
It is important to me that these lookups are as fast as possible, because you never know beforehand whether an incoming dataset contains only official symbols (e.g. HGNC), only synonyms, or a combination of both.
I ran a quick profile of both approaches and the results were a bit surprising to me. For the test I created 1 gene with 4 aliases using approaches (1) and (2). When I profiled my query, approach (2) had fewer db hits but took longer than approach (1). And when I indexed the 'synonym' property, it took even longer with even fewer db hits.
I thought approach (2) would win for sure, because Neo4j is optimized for traversals and not for retrieving a long list of properties. Can someone explain to me why this is happening, or suggest a better way of modelling this? The same problem also applies to other IDs, especially Ensembl gene and protein IDs.
Thanks in advance for your feedback!
04-19-2020 08:45 PM
If aliases are only ever retrieved from a gene you have already found, and never used to look a gene up, then route 1 is what you want: it requires only a single traversal and a single property read, versus n traversals and n property reads for route 2.
If you need to look up a :Gene or :Symbol node via an alias, then you need to go with route 2. You can index :Alias(synonym) to speed up that lookup, but you cannot apply an index to speed up route 1, since elements in a list property can't be individually indexed at this time.
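To make the difference concrete, here is a minimal sketch of the alias lookup under both routes. It uses the labels and property names from the question; the index name alias_synonym and the $name parameter are assumptions for illustration, and the CREATE INDEX ... FOR ... ON form is the syntax introduced in Neo4j 4.0.

```cypher
// Route 2: index the single-valued synonym property.
// (alias_synonym is a hypothetical index name.)
CREATE INDEX alias_synonym FOR (a:Alias) ON (a.synonym);

// Route 2 lookup: the planner can seek the :Alias node through the index,
// then take one hop back to the gene. $name is a hypothetical parameter.
MATCH (a:Alias {synonym: $name})<-[:HAS_ALIAS]-(g:Gene)
RETURN g.gname;

// Route 1 lookup: the list property cannot be served by an index on its
// elements, so every :Aliases node is scanned and its list is filtered.
MATCH (g:Gene)-[:HAS_ALIASES]->(a:Aliases)
WHERE $name IN a.synonyms
RETURN g.gname;
```

In Neo4j Browser the parameter can be set beforehand with :param name => 'Alias1'.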
Also, using only 4 entries as a test won't be a good indicator of real performance with actual data. You need to consider how this scales as the number of :Alias nodes increases. An index lookup will always beat a label scan + filter at scale.
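A rough way to check this on a larger graph is sketched below. The gene count of 10,000, the four aliases per gene, and the generated names are made-up test data, not anything from the original question.

```cypher
// Generate hypothetical test data for route 2:
// 10,000 genes, each with 4 separate :Alias nodes.
UNWIND range(1, 10000) AS i
CREATE (g:Symbol:Gene {gname: 'Gene' + toString(i)})
WITH g, i
UNWIND range(1, 4) AS j
CREATE (g)-[:HAS_ALIAS]->(:Alias {synonym: 'Alias' + toString(i) + '_' + toString(j)});

// Profile a lookup for one of the generated aliases.
PROFILE
MATCH (a:Alias {synonym: 'Alias5000_2'})<-[:HAS_ALIAS]-(g:Gene)
RETURN g.gname;
```

If an index on :Alias(synonym) exists, the profiled plan should start with a NodeIndexSeek; without it, expect a NodeByLabelScan followed by a Filter. Running the profile a few times before comparing timings helps separate query cost from cold-cache and plan-compilation overhead.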
04-27-2020 01:32 AM
Hi Andrew,
I thought as much, thank you very much for your answer!
I'll increase the size of my test graph for further performance profiling.