10-10-2018 02:53 PM
Hello all,
I've been using Neo4j for some weeks and I think it's awesome.
I'm building an NLP application, and basically, I'm using Neo4j for storing the dependency graph generated by a semantic parser, something like this:
In the nodes, I store the single words contained in the sentences, and I connect them through relations with a number of different types.
For my application, I have the requirement to find all the nodes that contain a given word, so basically I have to search through all the nodes, finding those that contain the input word. Of course, I've already created an index on the word text field.
I'm working on a very big dataset (by the way, the CSV importer is a great thing).
Here are the details of the graph.db:
47.108.544 nodes
45.442.034 relationships
13.39 GiB db size
Index created on token.text field
PROFILE MATCH (t:token) WHERE t.text="switch" RETURN t.text
NodeIndexSeek: 251,679 db hits
Projection: 251,678 db hits
ProduceResults: 251,678 db hits
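For readers following along, an index like the one mentioned above could be created roughly as follows. This is a minimal sketch that assumes the Neo4j 3.x syntax current when this thread was written; the exact statement used for this database is not shown in the post.

```cypher
// Neo4j 3.x syntax: index the text property of nodes labelled token.
CREATE INDEX ON :token(text);

// Neo4j 4.x and later use a different form for the same index:
// CREATE INDEX FOR (t:token) ON (t.text);
```

The NodeIndexSeek operator in the PROFILE output indicates that the planner is using such an index for the equality lookup.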
I wasn't sure whether indexing this many nodes was good practice. In the first prototype db, I created a new node for every word I encountered in the text, even when its text was identical to that of an existing node.
I then re-implemented the db structure using one node per unique word: the number of nodes dropped from 47.108.544 to 1.934.049, and the db size to 3.5 gigabytes.
I still have a huge number of relationships (45.442.034), which now point to the unique nodes, and I'm not sure whether this is a good architecture.
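As a rough illustration of the unique-word structure described above, deduplication can be expressed in Cypher with MERGE. This is only a sketch: the file name tokens.csv and its text column are hypothetical, and the post itself used the CSV importer rather than LOAD CSV.

```cypher
// Hypothetical input: tokens.csv with a header row and a "text" column.
// MERGE creates a token node only if no node with the same text exists yet,
// so repeated words in the input collapse into a single node.
// An index on :token(text) keeps the lookup inside MERGE fast.
LOAD CSV WITH HEADERS FROM "file:///tokens.csv" AS row
MERGE (t:token {text: row.text});
```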
My end goal is to find specific patterns in sentence structures, like the following example
(John)<-[NSUBJ]-(eat)-[DOBJ]->(apple)
Could you please help me with a suggestion or best practice to adopt for this specific case? I think that Neo4j is a great piece of software and I'd like to make the most out of it 🙂
thank you very much
10-11-2018 05:24 PM
I think it's better to continue from your original question rather than start new posts.
Perhaps @Christophe_Willemsen has some suggestions.
What do your current queries look like and what's their PROFILE output?
PROFILE
MATCH path = (:token {text:"John"})<-[:NSUBJ]-(:token {text:"eat"})-[:DOBJ]->(:token {text:"apple"})
RETURN path
10-15-2018 02:32 AM
Thank you Michael. Actually, the PROFILE query freezes, but I got a great suggestion from Christophe's reply below.
10-15-2018 12:05 AM
In neo4j-nlp ( https://github.com/graphaware/neo4j-nlp ), we store unique lemmas and keep each occurrence of a word in a TagOccurrence node, which means the database can grow quickly when you want to keep the syntactic dependency graph in Neo4j. We also store the NER on the TagOccurrence and use indexes for the occurrence token value. 47 million nodes is really nothing for Neo4j. What you need to take care of is having a good list of stopwords, because they are generally useless and have a very high degree of incoming relationships, so avoid storing words like "the, if, ...".
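To make the idea concrete, here is a minimal sketch of a "unique lemma plus occurrence" model. The labels Lemma and Occurrence, the OF_LEMMA relationship, and the value, sentenceId and position properties are hypothetical names chosen for illustration; they are not the actual neo4j-nlp schema.

```cypher
// One node per distinct lemma (hypothetical label and property names).
MERGE (john:Lemma {value: "John"})
MERGE (eat:Lemma {value: "eat"})
MERGE (apple:Lemma {value: "apple"})
// One node per occurrence of a word in a sentence.
CREATE (o1:Occurrence {value: "John", sentenceId: 1, position: 0})
CREATE (o2:Occurrence {value: "eats", sentenceId: 1, position: 1})
CREATE (o3:Occurrence {value: "apples", sentenceId: 1, position: 2})
CREATE (o1)-[:OF_LEMMA]->(john)
CREATE (o2)-[:OF_LEMMA]->(eat)
CREATE (o3)-[:OF_LEMMA]->(apple)
// Dependency edges connect occurrences, so each edge stays tied to one sentence.
CREATE (o2)-[:NSUBJ]->(o1)
CREATE (o2)-[:DOBJ]->(o3);

// Find sentences where "John" is the subject and "apple" the object of "eat".
MATCH (:Lemma {value: "eat"})<-[:OF_LEMMA]-(verb:Occurrence)
MATCH (subj:Occurrence)<-[:NSUBJ]-(verb)-[:DOBJ]->(obj:Occurrence)
MATCH (subj)-[:OF_LEMMA]->(:Lemma {value: "John"})
MATCH (obj)-[:OF_LEMMA]->(:Lemma {value: "apple"})
RETURN verb.sentenceId AS sentence, subj.value, verb.value, obj.value;
```

With this layout the lemma nodes act as entry points, while the dependency pattern is matched on occurrence nodes, so relationships from different sentences are not mixed together on a single shared word node.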
10-15-2018 02:32 AM
Thank you Christophe, this is really helpful!