Neo4j

shash · ‎11-28-2019

Do aggregations not make use of page cache?

I have a database with 4.6G of data, and 3.3G of indexes, the page cache is 10G, and is currently using 7.9G.

My queries are showing millions of db hits, even after all the db is loaded in memory?

shan · ‎11-28-2019

As far as I know, the number of dbhits has nothing to do with whether the graph is loaded into memory or not. The way you write your cypher and the number of rows produced by different components of your cypher determine the number of dbhits. A dbhit does not necessary mean an access to the disk. You can write a simple query that retrieves one single node using index. You will see that even after running the cypher once, the number of dbhits does not change although the data is in memory now. You can read more about dbhits here: https://neo4j.com/docs/cypher-manual/current/execution-plans/#execution-plans-dbhits

shash · ‎11-28-2019

Thanks a lot for the info and link @shan.

I'll go over the link now. Do you also happen to know how we can optimize a query like this to return as quickly as possible?

MATCH (t:Track)-[:HAS_GENRE]->(g:Genre) 
WHERE g.name IN ['rock', 'metal']
RETURN t.name, collect(DISTINCT g.name) AS genres, count(DISTINCT g) AS score ORDER BY score DESC LIMIT 20

I have about 7Million tracks and 2K genres.

Currently the aggregations in their expand and filter operations are hitting the db several million times, and making the query very slow.

shan · ‎11-28-2019

No problem.

I am afraid nothing comes to mind in order to be improve your cypher performance. You are hitting those 7M tracks so it'll be slow.

mike_r_black · ‎11-28-2019

If you're just needing to return aggregations, I've found better query plans and performance by using the Pattern Comprehension technique. This saves the database from actually having to read any of data that you might have told it fetch in the MATCH clause.

Here's an example of how a rewrite of you query might look, though you'll might need to adjust it.

MATCH (t:Track)
RETURN t.name, 
    SIZE( (t)-[:HAS_GENRE]->(g:Genre {name: 'rock'}) ) > 0 AS has_rock,
    SIZE( (t)-[:HAS_GENRE]->(g:Genre {name: 'metal'}) ) > 0 AS has_metal,
    SIZE( (t)-[:HAS_GENRE]->(g:Genre) ) AS genre_count
ORDER BY genre_count DESC 
LIMIT 20

shash · ‎11-28-2019

Thanks a lot @mike.r.black!

This is very interesting, I'll try it right now, and get back to you.
Happy thanksgiving!

shash · ‎11-28-2019

Hello @mike.r.black,

So I tried this query:

MATCH (t:Track)
RETURN t.name, 
    SIZE( (t)-[:HAS_GENRE]->(:Genre {name: 'rock'}) ) > 0 AS has_rock,
    SIZE( (t)-[:HAS_GENRE]->(:Genre {name: 'metal'}) ) > 0 AS has_metal,
    SIZE( (t)-[:HAS_GENRE]->(:Genre) ) AS genre_count
ORDER BY genre_count DESC 
LIMIT 20

Profile:

This turned out to be even more costly than the previous one unfortunately.

Maybe, I am doing a bad job explaining the problem I have. What I want basically is if I query for ['rock', 'metal'], I want to return the tracks that have both of them first, and then tracks who have only one of them.

I want the query to return in less that a few hundred milliseconds.

I have already looked at match intersection, and learned about the count method from there, which obviously is too slow.

Is there any method that might help solve this problem?

mike_r_black · ‎11-28-2019

This query would return tracks that have both genres

MATCH (g1:Genre {name: 'rock'})<-[:HAS_GENRE]-(t:Track)-[:HAS_GENRE]->(g2:Genre {name: 'metal'})
RETURN t

This would be the query for track that has one genre but not the other

MATCH (t:Track)-[:HAS_GENRE]->(g2:Genre {name: 'metal'})
WHERE NOT (:Genre {name: 'rock'})<-[:HAS_GENRE]-(t)
RETURN t

Is this what you're trying to achieve? I assume you have an index on Genre.name to help speed up the seeking of the genre nodes?

shash · ‎11-28-2019

Thanks again for the answer @mike.r.black.

I assume you have an index on Genre.name to help speed up the seeking of the genre nodes?

Yes.

Is this what you're trying to achieve?

Not exactly. A few followup questions:

If the query to find track with both genres fails, I want to find tracks that have either of the genres. These queries might not cover that use case right?
What if there are more than 2 genres I want to query on? That would definitely require something like IN right?

michael_hunger · ‎11-29-2019

It's best to access properties as late as possible.
So a simplistic optimization would be:

MATCH (t:Track)-[:HAS_GENRE]->(g:Genre) 
WHERE g.name IN ['rock', 'metal']
WITH t, collect(g) AS genres, count(g) AS score ORDER BY score DESC LIMIT 20
RETURN t.name, [g in genres | g.name] as genres, score

But I presume you actually want to have all genres of each track, not just your 2.

MATCH (t:Track)-[:HAS_GENRE]->(g:Genre) WHERE g.name IN ['rock', 'metal']
WITH t, size( (t)-[:HAS_GENRE]->() ) as score
ORDER by score DESC LIMIT 20
MATCH (t)-[:HAS_GENRE]->(g) 
RETURN t.name, collect(g.name) AS genres, score

Neo4j

Does count aggregation not use page cache?