Neo4j

cuneyttyler · ‎07-22-2022

I have a database with Wikidata entities and their properties. For example, the 'Mona Lisa' entity has a 'creator' relationship to the 'Leonardo da Vinci' entity. I want to obtain the paintings of Leonardo da Vinci, and I have the query below. In this query, I first match entities with an 'instance of' property (Property:P31) pointing to the entity 'Painting'. Then I match 'Leonardo da Vinci'. Finally, I check whether any relationship exists between the two with a 'WITH .. WHERE' clause. In simple words: first I find all paintings, then Leonardo, then I check whether they are Leonardo's paintings. The problem is that the 'WITH .. WHERE exists()' clause takes too much time. When I remove it, the query returns in a reasonable time, but with it the run time is beyond acceptable. Is there a more efficient way to achieve this? Thanks.

MATCH(subject:Entity)-[:`https://www.wikidata.org/wiki/Property:P31`]->(subjectIdentifier:Entity), (subject)-[]->(subjectProp:Entity) where ((subjectIdentifier.name = "Painting")) 
MATCH(subjectOwner:Entity), (subjectOwner)-[]->(subjectOwnerProp:Entity) where (subjectOwner.name = "Leonardo da Vinci")
WITH subject, subjectOwner WHERE exists((subject)--(subjectOwner))  RETURN DISTINCT subject SKIP 0 LIMIT 10

Cobra · ‎07-22-2022

Hello @cuneyttyler 🙂

You can try the apoc.nodes.connected() function:

WHERE apoc.nodes.connected(subject, subjectOwner) = true

Regards,
Cobra

cuneyttyler · ‎07-22-2022

Thanks, but it didn't make any difference. The thing is that I have a big graph, and the query first fetches all paintings (104,000 in total) and then checks, for each painting, whether it is connected to Leonardo, so the number of checks equals the number of paintings. Is there any way to reduce the number of checks by changing the query entirely?

cuneyttyler · ‎07-22-2022

After giving it a bit of thought, I changed the query as below. I first fetch Leonardo, then fetch the properties of Leonardo which are 'instances of' 'Painting'. But now, surprisingly, it fails with a 'Not enough memory' error after running for 30 seconds. I have an 8 GB initial and 16 GB max heap size. The problem with this query is that it may need to check the whole graph to determine which nodes have any connection to Leonardo. I can't just make the first MATCH directed to the right, because in the data paintings point to artists, not the other way around.

MATCH (subjectOwner:Entity)-[]-(subject:Entity)-[:`https://www.wikidata.org/wiki/Property:P31`]->(subjectIdentifier:Entity) where ((subjectOwner.name = "Leonardo da Vinci")) and ((subjectIdentifier.name = "Painting"))  RETURN DISTINCT subject SKIP 0 LIMIT 10

But there is another interesting point. In Neo4j Browser, when you click to expand a node, it shows both incoming and outgoing relationships (159 in total) in 10 seconds. When I run the query below, it returns in 20 seconds.

MATCH (subjectOwner:Entity)-[]-(subject:Entity)  where ((subjectOwner.name = "Leonardo da Vinci")) RETURN  subjectOwner,subject

When I add the Property:P31 relationship to this query, it checks each of these 159 nodes to see whether it has a 'Property:P31' ('instance of') relationship to the entity 'Painting', so the time needed increases. Is there any way to reduce the time to get the incoming relationships of a node, and also the time for the second check for 'instance of'?

NOTE: I realize that it is a costly task and I only have an 8-core CPU. With better resources it would surely perform more efficiently, but there might be ways to optimize this query.

Cobra · ‎07-22-2022

Do you have constraints or indexes in your database?

I don't clearly understand what you are trying to achieve. Can you upload some queries to create a small dataset and give the desired output?

Regards,
Cobra
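
For readers following along, here is a minimal sketch of the kind of index the question above is about. It assumes a recent Neo4j 4.x release and that lookups go through the `name` and `dbpedia_uri` properties used in the queries in this thread; the index names are made up for illustration.

```cypher
// Hypothetical index names; the label and properties are the ones used in this thread.
CREATE INDEX entity_name IF NOT EXISTS FOR (e:Entity) ON (e.name);
CREATE INDEX entity_dbpedia_uri IF NOT EXISTS FOR (e:Entity) ON (e.dbpedia_uri);

// List the indexes that already exist to check whether these are in place.
SHOW INDEXES;
```

With such an index, a predicate like subjectOwner.name = "Leonardo da Vinci" can be answered by an index seek rather than a scan over every Entity node; running the query with PROFILE shows which one the planner picked.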

cuneyttyler · ‎07-23-2022

I have the Wikidata dataset in my database. It mainly consists of entities represented by 'Entity' nodes and relationships between them corresponding to RDF triples (subject-predicate-object). In our example both subject and object are instances of the 'Entity' node. Relationships between them are labelled with URIs. Take a look at Guernica, a painting by Pablo Picasso, in Wikidata: https://www.wikidata.org/wiki/Q175036

Here Guernica is an Entity. Pablo Picasso is another Entity, and Guernica has a 'creator' relationship to Pablo Picasso: 'Guernica'-['https://www.wikidata.org/wiki/Property:P170']->'Pablo Picasso'. Guernica also has an 'instance of' relationship to the 'Painting' entity, which is important in our case.

What I'd like to do is simple semantic search, similar to what Google does when you type 'Paintings of Pablo Picasso'. It returns structured knowledge using its Knowledge Graph, which is adopted from Freebase.

In order to do that, in the original question, I first fetch all Entities which have an 'instance of' relationship to the 'Painting' entity. Then I fetch 'Pablo Picasso'. Finally I check whether there are any relationships between them.

Here is the link to data: https://easyupload.io/u6drbd and here is my query. It is fairly simple.

MATCH(subject:Entity)-[:`https://www.wikidata.org/wiki/Property:P31`]->(subjectIdentifier:Entity), (subject)-[]->(subjectProp:Entity) where ((subjectIdentifier.name = "Painting")) 
MATCH(subjectOwner:Entity), (subjectOwner)-[]->(subjectOwnerProp:Entity) where ((subjectOwner.dbpedia_uri = "http://dbpedia.org/resource/Pablo_Picasso")) 
WITH subject, subjectOwner WHERE exists((subject)--(subjectOwner)) RETURN DISTINCT subject SKIP 0 LIMIT 10

An alternative version of my query is below. Here I added a part to the query to allow matching Entities that are an 'instance of' a subclass of 'Painting' rather than 'Painting' itself. Property:P279 is the Wikidata 'subclass of' property.

MATCH(subject:Entity)-[:`https://www.wikidata.org/wiki/Property:P31`]->(subjectIdentifier:Entity)-[:`https://www.wikidata.org/wiki/Property:P279`*0..3]->(adjective:Entity) 
where (adjective.name='Painting') 
MATCH(subjectOwner:Entity) where ((subjectOwner.dbpedia_uri = "http://dbpedia.org/resource/Pablo_Picasso")) 
WITH subject, subjectOwner WHERE exists((subject)--(subjectOwner)) RETURN DISTINCT subject SKIP 0 LIMIT 10

glilienfield · ‎07-24-2022

In your original query above, you have two pattern matches where you don't use the matched related nodes:

1. (subject)-[]->(subjectProp:Entity) --> there is no other reference to 'subjectProp'

2. (subjectOwner)-[]->(subjectOwnerProp:Entity) --> there is no other reference to 'subjectOwnerProp'

Is this because you want to ensure that these relationships exist on the 'subject' and 'subjectOwner' nodes? If not, they should be removed to reduce the number of unnecessary result rows that have to be produced and filtered. If so, I still think they can be removed, because the other match patterns already ensure that these relationships exist.

You could try the following query.

MATCH (subjectOwner)--(subject:Entity)-[:`https://www.wikidata.org/wiki/Property:P31`]->(subjectIdentifier:Entity)
WHERE (subjectOwner.name = "Leonardo da Vinci") AND (subjectIdentifier.name = "Painting")
RETURN DISTINCT subject
LIMIT 10

cuneyttyler · ‎07-24-2022

Thanks. Actually, in my server-side code, I apply filters to 'subjectProp' and 'subjectOwnerProp' based on the given search input. For example, if the search input is 'Watercolor paintings of Paul Klee', then my subject is 'Painting' and subjectProp is 'watercolor'. This way, I make sure to return only paintings that have a relationship to the 'Watercolor' entity. But in this case they are unnecessary, so I removed them and the query returns very fast. In my actual scenario, though, I will need them.

So let me give you an example. Here is a query where those extra matches are needed.

MATCH(subjectOwner:Entity)--(subject:Entity)-[:`https://www.wikidata.org/wiki/Property:P31`|`https://www.wikidata.org/wiki/Property:P279`]->(subjectIdentifier:Entity)<-[:`https://www.wikidata.org/wiki/Property:P279`*0..3]-(adjective:Entity), 
(subject)-[]->(subjectProp:Entity) 
where ((subjectIdentifier.name = "Painting" and subjectProp.name = "Watercolor") or (subjectIdentifier.name = "Watercolor Painting")) 
and ( (subjectOwner.name = "Pablo Picasso")) 
RETURN DISTINCT subject SKIP 0 LIMIT 10

Here you can see the query for the example I gave above. Without (subject)-[]->(subjectProp:Entity), the query runs quickly. With it, it never ends. I think that's because there are 100k paintings in the database and it has to check the subjectProp relationship for each of them. I'm wondering if there is any other way to reduce the number of checks. My use case is described in the answer I gave to @Cobra above.

Thanks anyway.

cuneyttyler · ‎07-24-2022

They are actually needed, depending on the search input. If my search input is 'watercolor paintings by Pablo Picasso', then I use the (subject)-[]->(subjectProp:Entity) relationship with the filter subjectProp.name='watercolor' to return only paintings that have a relationship to the 'watercolor' entity. For simpler queries they are not used, and you are right to mention that. Here is a query where I use subjectProp:

MATCH(subjectOwner:Entity)--(subject:Entity)-[:`https://www.wikidata.org/wiki/Property:P31`|`https://www.wikidata.org/wiki/Property:P279`]->(subjectIdentifier:Entity)<-[:`https://www.wikidata.org/wiki/Property:P279`*0..3]-(adjective:Entity),
(subject)-[]->(subjectProp:Entity)
where ((subjectIdentifier.name = "Painting" and subjectProp.name = "Watercolor") or (subjectIdentifier.name = "Watercolor Painting")) and ( (subjectOwner.name = "Paul Klee"))
RETURN DISTINCT subject SKIP 0 LIMIT 10

This is the query for the example I gave above. It takes too much time (about 6 minutes). Without the subjectProp relationship and filter, it runs quickly, in about 1.5 seconds. I think the reason it takes so long with subjectProp added is that there are 100k paintings in the database and it checks, for each of them, whether it has any relationship to a node named 'Watercolor'. I wonder if there is any way to reduce this number of checks.

The interesting thing is that here I have a 'path matching' part to include entities which are subclasses of 'Painting' in the result. For the Painting entity, it returns 44 subclasses (for 0..3). When I remove that part (<-[:`https://www.wikidata.org/wiki/Property:P279`*0..3]-(adjective:Entity)), the query runs quickly again. It's probably because it now checks, for each painting, whether it has a [:`https://www.wikidata.org/wiki/Property:P31`|`https://www.wikidata.org/wiki/Property:P279`] relationship to one of these 44 subclasses.

Here is the second version of the query - which actually is the one you used in your response:

MATCH(subjectOwner:Entity)--(subject:Entity)-[:`https://www.wikidata.org/wiki/Property:P31`|`https://www.wikidata.org/wiki/Property:P279`]->(subjectIdentifier:Entity),
(subject)-[]->(subjectProp:Entity)
where ((subjectIdentifier.name = "Painting" and subjectProp.name = "Watercolor") or (subjectIdentifier.name = "Watercolor Painting")) and ( (subjectOwner.name = "Paul Klee"))
RETURN DISTINCT subject SKIP 0 LIMIT 10

And here is the query to retrieve those 44 subclasses of 'Painting', which returns in 6 seconds:

MATCH(subjectIdentifier:Entity)
<-[:`https://www.wikidata.org/wiki/Property:P279`*0..3]-(adjective:Entity) where subjectIdentifier.name = "Painting" return adjective

Can you see a better approach which results in more efficient computation?

Thanks

cuneyttyler · ‎07-25-2022

They are actually needed, depending on the search input. If my search input is 'watercolor paintings by Pablo Picasso', then I use the (subject)-[]->(subjectProp:Entity) relationship with the filter subjectProp.name='watercolor' to return only paintings that have a relationship to the 'watercolor' entity. For simpler queries they are not used, and you are right to mention that. Interestingly, the query now runs very quickly, even with the subjectProp match and filter properly added. It's because of how you check the (subjectOwner)--(subject) relationship. What I was doing was inefficient because I was matching subjectOwner separately. Now the query somehow gets optimized.
Thanks

glilienfield · ‎07-25-2022

Your query was matching two separate patterns, most likely producing a lot of results for each pattern. The ‘exists’ clause then made the query sort through each result set looking for those rows in each that had a relationship between them. By not adding that constraint until the end, the query was probably generating a lot of results that did not meet your criteria and had to be filtered after creating them. The query I suggested has the relationship constraint in the pattern with the other constraints, thus the query can start with an initial match of records that meet that requirement and filter that set down to the final result as more of the pattern is applied during filtering.

cuneyttyler · ‎07-25-2022

Thanks, now I have encountered something else. The query below is for searches without any subjectOwner such as 'Pablo Picasso'. These are simple queries like 'watercolor paintings'. In this case there is no subjectOwner relationship, but there is a subjectProp relationship (for filtering only 'watercolor' paintings). This query runs for 70 seconds. Without the subjectProp relationship and filter it runs quickly. Is there a better approach for this query?

MATCH(subject:Entity)-[:`https://www.wikidata.org/wiki/Property:P31`|`https://www.wikidata.org/wiki/Property:P279`]->(adjective:Entity), (subject)-[]->(subjectProp) 
where (
(adjective.name = "Painting" and (subjectProp.name = "Watercolor" or subjectProp.value =~ "(?i).*Watercolor.*")
) or (adjective.name = "Watercolor Painting") or (adjective.name = "Painting" and (subjectProp.name = "Watercolor" or subjectProp.value =~ "(?i).*Watercolor.*")) or (adjective.name = "Watercolor Painting")) 
RETURN DISTINCT subject SKIP 0 LIMIT 10

NOTE: Ignore my messages sent before my last response to you. They were spam and not sent initially; they will be removed.

glilienfield · ‎07-26-2022

I reformatted the query so I can understand the predicate. Adding line feeds, I got the following:

MATCH(subject:Entity)-[:`https://www.wikidata.org/wiki/Property:P31`|`https://www.wikidata.org/wiki/Property:P279`]->(adjective:Entity), (subject)-[]->(subjectProp) 
where (
(adjective.name = "Painting" and (subjectProp.name = "Watercolor" or subjectProp.value =~ "(?i).*Watercolor.*")) 
or 
(adjective.name = "Watercolor Painting") 
or 
(adjective.name = "Painting" and (subjectProp.name = "Watercolor" or subjectProp.value =~ "(?i).*Watercolor.*")) 
or 
(adjective.name = "Watercolor Painting")
) 
RETURN DISTINCT subject SKIP 0 LIMIT 10

Is this what you intended? You have the same predicate repeated twice in the 'where' clause. Also, you have one condition that has no constraint on the 'subjectProp' value, thus you are probably getting a lot of rows when the adjective.name = 'Watercolor Painting.' Definitely remove the redundant predicates, as the query plan was more complex with the extra filtering.

MATCH(subject:Entity)-[:`https://www.wikidata.org/wiki/Property:P31`|`https://www.wikidata.org/wiki/Property:P279`]->(adjective:Entity), (subject)-[]->(subjectProp) 
where (
(adjective.name = "Painting" and (subjectProp.name = "Watercolor" or subjectProp.value =~ "(?i).*Watercolor.*")) 
or 
(adjective.name = "Watercolor Painting") 
) 
RETURN DISTINCT subject SKIP 0 LIMIT 10

cuneyttyler · ‎07-26-2022

I'm sorry, I should have organized the query. The repeated part is situational, though; it changes according to the user's search input. I ran the second query you provided with the PROFILE keyword. It first does an index seek on adjective. Then it expands the first subject--adjective match, which results in 80k rows. After that, it expands the subject--subjectProp relationship, which results in 3 million rows; that means each Entity has about 40 subjectProp neighbours, which is indeed the case. Finally it applies a last step (I attached a screenshot of it) that results in 6 million rows, and I didn't quite understand what it is doing there. That's why the query is expensive. I would have expected that, though, because it needs to expand every Entity to check whether it has the desired subjectProp.name. I'm not sure if there is some way to run this query efficiently; it is simply 'filter nodes by relationship'.

EDIT: What I'm thinking is that I need to reorganize my data so that each Entity has a property containing all of its relationship names and values, and create a text index on that property. Then, when I search for subjectProp.name and subjectProp.value, I don't match subjectProp at all but simply do an index search on the property I created. This way, the query won't expand to millions of rows. I hope a TEXT index will be enough, rather than a FULL TEXT index.
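
The denormalization idea in the EDIT above could be sketched roughly as follows. This is only an illustration under stated assumptions: `relatedNames` and `entity_related_names` are hypothetical names, TEXT indexes require Neo4j 4.4 or later, only neighbour names are covered (not the `value` property), and on a graph of this size the one-off write would likely need to be batched.

```cypher
// Step 1 (one-off): store the lower-cased names of all outgoing neighbours
// in a single string property. `relatedNames` is a hypothetical property name.
MATCH (s:Entity)-->(p)
WITH s, collect(DISTINCT toLower(p.name)) AS names
SET s.relatedNames = reduce(acc = '', n IN names | acc + n + '|');

// Step 2: a TEXT index (Neo4j 4.4+) can serve CONTAINS predicates on string properties.
CREATE TEXT INDEX entity_related_names IF NOT EXISTS
FOR (e:Entity) ON (e.relatedNames);

// Step 3: filter on the stored property instead of expanding (subject)-[]->(subjectProp).
MATCH (subject:Entity)-[:`https://www.wikidata.org/wiki/Property:P31`|`https://www.wikidata.org/wiki/Property:P279`]->(adjective:Entity)
WHERE (adjective.name = "Painting" AND subject.relatedNames CONTAINS "watercolor")
   OR adjective.name = "Watercolor Painting"
RETURN DISTINCT subject SKIP 0 LIMIT 10
```

Lower-casing at write time stands in for the case-insensitive regex used in the earlier query. Whether the planner actually uses the text index for this predicate is not guaranteed, so it is worth checking the plan with PROFILE.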

glilienfield · ‎07-26-2022

The second 'or' clause does not have a constraint on the subjectProp variable, and the query returns just the subject, not the subjectProp. As a result, you can remove the '(subject)-[]->(subjectProp)' match for this scenario. That eliminates expanding the results from the first match pattern for the second scenario. Is there a reason to include this pattern, considering you are not using the subjectProp node? Are you trying to ensure that a relationship from subject to another entity exists, besides the relationship that is matched in the first match? If so, we can implement it to just check for the existence instead of expanding the result set unnecessarily.

Here is the query with the (subject)-[]->(subjectProp) pattern removed from the branch where there is no constraint on subjectProp. I used a 'UNION' clause to separate the two branches.

call {
match (subject:Entity)-[:`https://www.wikidata.org/wiki/Property:P31`|`https://www.wikidata.org/wiki/Property:P279`]->(adjective:Entity), (subject)-[]->(subjectProp)
where adjective.name = "Painting" and (subjectProp.name = "Watercolor" or subjectProp.value =~ "(?i).*Watercolor.*")
return subject
UNION
match (subject:Entity)-[:`https://www.wikidata.org/wiki/Property:P31`|`https://www.wikidata.org/wiki/Property:P279`]->(adjective:Entity)
where adjective.name = "Watercolor Painting"
return subject
}
RETURN DISTINCT subject SKIP 0 LIMIT 10

If you do want to ensure that a relationship exists from 'subject' to another entity, other than the entity matched in the (subject)-->(adjective) pattern, then we need to ensure that there exists more than one outgoing relationship from the 'subject' entity.

call {
match (subject:Entity)-[:`https://www.wikidata.org/wiki/Property:P31`|`https://www.wikidata.org/wiki/Property:P279`]->(adjective:Entity), (subject)-[]->(subjectProp)
where adjective.name = "Painting" and (subjectProp.name = "Watercolor" or subjectProp.value =~ "(?i).*Watercolor.*")
return subject
UNION
match (subject:Entity)-[:`https://www.wikidata.org/wiki/Property:P31`|`https://www.wikidata.org/wiki/Property:P279`]->(adjective:Entity)
where adjective.name = "Watercolor Painting"
call {
    with subject
    match (subject)-[]->(subjectProp)
    return count(*) as cnt
}
with subject, cnt
where cnt>1
return subject
}
RETURN DISTINCT subject SKIP 0 LIMIT 10

BTW- do you really want ten records from the query, or is that limit just for test purposes? If you really only want ten, then I would think you could limit the number of records earlier, so the query doesn't generate millions of rows and then keeps only ten.
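
One way to sketch the "check for existence instead of expanding" idea for the 'Watercolor' filter is an existential subquery, assuming Neo4j 4.0 or later. This is an illustration, not a measured improvement: it still has to inspect the neighbours of each candidate painting, but it can stop at the first matching neighbour and does not multiply the row count, which may let the LIMIT take effect earlier.

```cypher
MATCH (subject:Entity)-[:`https://www.wikidata.org/wiki/Property:P31`|`https://www.wikidata.org/wiki/Property:P279`]->(adjective:Entity)
WHERE adjective.name = "Painting"
  // Keep a subject only if at least one outgoing neighbour matches;
  // subjectProp is never added to the result rows.
  AND EXISTS {
    MATCH (subject)-->(subjectProp)
    WHERE subjectProp.name = "Watercolor"
       OR subjectProp.value =~ "(?i).*Watercolor.*"
  }
RETURN DISTINCT subject SKIP 0 LIMIT 10
```

Comparing this against the UNION version with PROFILE would show whether the row counts and db hits actually drop on the real data.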

cuneyttyler · ‎07-26-2022

Thanks for the answer. My application is Semantic Search and I have Entities and Relationships defining those Entities. For example I have Paintings(Entity) - Entity nodes which have 'instanceof'(Property:P31) relationship to 'Painting' Entity. When I want to search for only 'paintings' there is no need for subjectProp but when I want to search for 'Watercolor paintings' I need to filter Painting entities having a relationship to anything containing 'Watercolor'. One of the above queries does this. I'm copying it here:

MATCH(subject:Entity)-[:`https://www.wikidata.org/wiki/Property:P31`|`https://www.wikidata.org/wiki/Property:P279`]->(adjective:Entity), (subject)-[]->(subjectProp) 
where (
(adjective.name = "Painting" and (subjectProp.name = "Watercolor" or subjectProp.value =~ "(?i).*Watercolor.*")) 
or 
(adjective.name = "Watercolor Painting") 
) 
RETURN DISTINCT subject SKIP 0 LIMIT 10

The point you make is interesting: the second 'or' clause has no subjectProp filter. That's a big issue. The query you provided now runs in 8 seconds; the inner 'call' and with..where part is unnecessary for my case.

Maybe this kind of query can only be reduced to 8 seconds (maybe a machine with a lot of cores would decrease this). But for a production search app this is too much. How about the solution I mentioned in the EDIT section of my previous response? I am creating an additional property on each Entity containing the names of its connected entities, and I create an index on that property. This is simply creating a search index manually. How does that sound to you? I'm running this query now and I'll see how it performs. It seems to me the only solution for now to execute such a query on a huge graph in approximately 1 second, because the last query you provided also has millions of db hits.
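
The manual search index described above could be sketched roughly as follows. This is only an illustration under assumptions: the property name connectedNames and the index name entityConnectedNames are hypothetical, the names are assumed to be strings, the index syntax assumes Neo4j 4.3 or later, and only connected entities' name values are covered (the value property used in the regex filter would need to be added in the same way). On a graph of this size the SET step would probably need to be run in batches rather than in one transaction.

```cypher
// 1. Denormalise: store the names of all outgoing neighbours on each Entity
//    as one space-separated string (connectedNames is a hypothetical property).
MATCH (e:Entity)-->(n:Entity)
WHERE n.name IS NOT NULL
WITH e, collect(DISTINCT n.name) AS names
SET e.connectedNames = reduce(acc = "", x IN names | acc + " " + x);

// 2. Full-text index over that property (index name is hypothetical).
CREATE FULLTEXT INDEX entityConnectedNames IF NOT EXISTS
FOR (e:Entity) ON EACH [e.connectedNames];

// 3. Search: start from the index hits, then keep only the paintings.
CALL db.index.fulltext.queryNodes("entityConnectedNames", "watercolor")
YIELD node AS subject
MATCH (subject)-[:`https://www.wikidata.org/wiki/Property:P31`|`https://www.wikidata.org/wiki/Property:P279`]->(adjective:Entity)
WHERE adjective.name = "Painting"
RETURN DISTINCT subject SKIP 0 LIMIT 10;
```

The idea is that the full-text lookup replaces the per-painting scan of neighbours, so the query starts from the (hopefully much smaller) set of entities connected to something named 'watercolor'. Whether that actually reaches the one-second target would have to be checked with PROFILE on the real data.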

About LIMIT: I simply perform paging with SKIP and LIMIT in my web page (SKIP is omitted here).
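
For reference, SKIP/LIMIT paging of this kind could be parameterised as in the sketch below. The parameter names $skip and $limit are assumptions, and the ORDER BY is added only so that consecutive pages are consistent with each other; it is not part of the original queries and it has a cost, since the full result has to be sorted before a page can be returned.

```cypher
// Page through paintings; $skip and $limit are hypothetical parameters
// supplied by the web page (for example skip = 20, limit = 10 for page 3).
MATCH (subject:Entity)-[:`https://www.wikidata.org/wiki/Property:P31`|`https://www.wikidata.org/wiki/Property:P279`]->(adjective:Entity)
WHERE adjective.name = "Painting"
RETURN DISTINCT subject
ORDER BY subject.name
SKIP $skip LIMIT $limit
```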

cuneyttyler · ‎07-24-2022

@glilienfield They are actually needed, depending on the search input. If my search input is 'watercolor paintings by Pablo Picasso', then I use the (subject)-[]->(subjectProp:Entity) relationship with the filter subjectProp.name='watercolor' to return only paintings that have a relationship to the 'watercolor' entity. For simpler queries they are not used, and you are right to mention that. Here is a query where I use subjectProp:

MATCH(subjectOwner:Entity)--(subject:Entity)-[:`https://www.wikidata.org/wiki/Property:P31`|`https://www.wikidata.org/wiki/Property:P279`]->(subjectIdentifier:Entity)<-[:`https://www.wikidata.org/wiki/Property:P279`*0..3]-(adjective:Entity), 
(subject)-[]->(subjectProp:Entity) 
where ((subjectIdentifier.name = "Painting" and subjectProp.name = "Watercolor") or (subjectIdentifier.name = "Watercolor Painting")) and ( (subjectOwner.name = "Paul Klee")) 
RETURN DISTINCT subject SKIP 0 LIMIT 10

This is the query for the example I gave above. This query takes too much time (about 6 minutes). Without the subjectProp relationship and filter it runs quickly, in about 1.5 seconds. I think the reason it takes so long with subjectProp added is that there are 100k paintings in the database and it checks for each of them whether they have any relationship to an entity named 'Watercolor'. I wonder if there is any way to reduce the number of checks for my use case. My use case is described in the answer I gave above to @Cobra. It's simply doing semantic search on the wikidata dataset.
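
One way the number of checks might be reduced is to anchor the query on the owner first and test the 'Watercolor' relationship with an existential subquery, instead of expanding every (subject)-->(subjectProp) row. The following is a sketch under assumptions, not a measured fix: it assumes Neo4j 4.0 or later for EXISTS { ... }, the index name entity_name is hypothetical (an index on Entity.name may already exist), and the unused `P279`*0..3 expansion to 'adjective' is dropped because with zero hops it always matches and 'adjective' is never referenced.

```cypher
// Index so that lookups by name do not scan every Entity node
// (entity_name is a hypothetical index name).
CREATE INDEX entity_name IF NOT EXISTS FOR (e:Entity) ON (e.name);

// Start from the owner, expand to neighbouring entities,
// and only then check what kind of thing each neighbour is.
MATCH (subjectOwner:Entity {name: "Paul Klee"})--(subject:Entity)
MATCH (subject)-[:`https://www.wikidata.org/wiki/Property:P31`|`https://www.wikidata.org/wiki/Property:P279`]->(subjectIdentifier:Entity)
WHERE subjectIdentifier.name = "Watercolor Painting"
   OR (subjectIdentifier.name = "Painting"
       AND EXISTS { MATCH (subject)-->(:Entity {name: "Watercolor"}) })
RETURN DISTINCT subject SKIP 0 LIMIT 10;
```

Because the existence check stops at the first matching relationship and produces no extra rows, it avoids multiplying each painting by all of its outgoing relationships. Running the query with PROFILE would show whether the planner really starts from the owner node.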

Thanks

Neo4j

WITH .. WHERE exists((e1)--(e2)) takes too much time(To check any relationship exists between nodes)