05-06-2019 07:15 AM
Hi,
I am working on a Neo4j database on which we are running various machine learning algorithms. We would like to hold out some data for training, e.g. treating certain relationships in the data as our training data and others as validation or test data. We are looking for a way to query the graph in a flexible way (i.e. we don't want to have to do careful logic checking on the query itself) where we can specify that specific relationship ids are not used in the query. Does anyone know how this could be done without having to delete these relationships from the underlying database?
We are currently using the neo4j python driver to interact with the graph; however, I think the implementation is flexible.
Thanks in advance,
Rachel
05-07-2019 10:27 AM
Hi Rachel,
Welcome to the Neo4j online community! Please introduce yourself over here: https://community.neo4j.com/c/general/introduce-yourself
I'd just mark the relationships with a boolean or enum property indicating whether each one is training or production data.
Your query would then just need a condition like r.training = true to indicate that it's looking for training relationships.
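A minimal sketch of that property filter with the Python driver, in case it helps. The `training` property name comes from the suggestion above; the generic `(a)-[r]->(b)` pattern, the returned columns, and the helper names `training_query` and `fetch_edges` are assumptions for illustration, not anything from your schema.

```python
def training_query(training=True):
    """Build (cypher, params) matching only relationships whose
    `training` flag equals the requested value."""
    cypher = (
        "MATCH (a)-[r]->(b) "
        "WHERE r.training = $training "  # the flag suggested above
        "RETURN id(a) AS source, type(r) AS rel_type, id(b) AS target"
    )
    return cypher, {"training": training}


def fetch_edges(uri, user, password, training=True):
    """Run the query against a live database (hypothetical helper).

    Requires the `neo4j` package; it is imported lazily so the query
    builder above can be used and tested without it.
    """
    from neo4j import GraphDatabase

    cypher, params = training_query(training)
    driver = GraphDatabase.driver(uri, auth=(user, password))
    try:
        with driver.session() as session:
            return [record.data() for record in session.run(cypher, params)]
    finally:
        driver.close()
```

Passing the flag as a query parameter rather than formatting it into the string means the same query text serves both the training and the held-out split.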
The only other way I can think of is to write a procedure and execute all your queries through it, so that it filters appropriately. But that seems painful for something you can do in Cypher directly.
I guess you could also run two copies of the database, which isn't too bad nowadays with Docker or even Neo4j Desktop.
Cheers,
-Ryan
05-09-2019 09:14 AM
Hi Ryan,
Thanks for your suggestions; they are in line with our discussions thus far. We are going to try out the option of creating multiple database instances, and may also have to filter out specific edges in certain complicated use cases.
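For the edge-filtering case, one possible sketch is to pass the held-out relationship ids as a query parameter and exclude them in the WHERE clause. The helper name `holdout_query`, the generic match pattern, and the returned columns are assumptions for illustration; `id(r)` refers to Neo4j's internal relationship id, which may be reused after deletions, so a stable property of our own might be a safer key.

```python
def holdout_query(excluded_rel_ids):
    """Build (cypher, params) matching every relationship except the
    ones whose internal ids are listed in `excluded_rel_ids`."""
    cypher = (
        "MATCH (a)-[r]->(b) "
        "WHERE NOT (id(r) IN $excluded) "  # skip held-out relationships
        "RETURN id(a) AS source, type(r) AS rel_type, id(b) AS target"
    )
    # De-duplicate and sort so the parameter list is deterministic.
    return cypher, {"excluded": sorted(set(excluded_rel_ids))}


# Example: hold out three relationships for validation.
query, params = holdout_query([42, 7, 42, 19])
# The pair can then be passed to the driver, e.g. session.run(query, params).
```

The relationships stay in the database; only the query result changes, so a different fold is just a different id list.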
As a side note, I think this question could be translated into a feature request, to enable multiple access levels to the same graph instance. In our case, we would want to set up access='training_fold1', with a different subset of the data available compared to access='full' or access='training_fold2'. Perhaps this is related to some graph forking/versioning capabilities. Would be interested to hear thoughts on this, and whether anyone knows of other graph database technologies that offer such a feature?
Thanks,
Rachel