Neo4j

aross · ‎01-12-2021

Hello,

I have, what i think is a relatively straightforward cypher query that is unexpectedly slow:

MATCH (potential_user:GraphUser)
  WITH potential_user, distance(point({srid:4326, x:-140.0312186, y:39.33233141}), potential_user.location) as distance
  WHERE potential_user.is_banned=false AND potential_user.is_paused=false AND potential_user.is_invisible=false AND potential_user.profile_complete=true
AND
 distance < 100000
  RETURN count(potential_user)

There are only about 120K GraphUser nodes. Here is the PROFILE:

it's on the order of 500ms. This query takes 2ms in postgres. Am I missing something?

dana_canzano · ‎01-12-2021

Do you have any indexes on :GraphUser ?

aross · ‎01-12-2021

Yes, there is an index on user_id, location, and a composite index on ["is_banned", "is_paused", "is_invisible", "profile_complete"].

clem · ‎01-12-2021

check this out:

I think what you want to do is force Cypher to do a cheap conditional filtering before doing the location filtering. If you look at the PROFILE, you see the big fat red sections occurring early which is what is slowing your query down. I'm sure if you do the profile of what's within the CALL subquery statement below, the big fat red sections will shrink.

You want to do something like this.... (I'm still at the intermediate level, so this almost certainly not exactly right. Cypher is still a bit of a mystery to me. If/when you figure it out, I'll fix my answer.)

MATCH (potential_user:GraphUser)
CALL { // filter out as many potential_users as possibly as cheaply as possible
WHERE potential_user.is_banned=false AND
              potential_user.is_paused=false AND
              potential_user.is_invisible=false AND
              potential_user.profile_complete=true
RETURN potential_user
} // this part I am not so sure about:
WITH potential_user, distance(point({srid:4326, x:-140.0312186, y:39.33233141}), potential_user.location) AS distance
MATCH(potential_user) WHERE distance < 100000
  RETURN count(potential_user)

I hope this helps!

tard_gabriel · ‎01-12-2021

Don't take it wrong, I think I can help a bit, but your query is a mess.
The reason why it's difficult for Neo4j to compute it properly and quickly is roughly that.

SQL has only 2 dimensions to care about, that's why it's faster on SQL.
Neo4j has a multidimensional Graph domain to care about + a projected data domain.

I'm good with the Cypher language but not quite a query tuning expert yet.
But, according to what I can read on your query profile, thank you for having it, Neo4j doesn't even use your index right now.

The first thing I would try it's to put your WHERE clause after the first match clause, at least the property you can, it should improve a lot but the query tuning won't be complete after that.

dana_canzano · ‎01-13-2021

as previously stated per above the first block from the PROFILE is a NodeScanByLabel and thus it not using the index.
And although you have not indicated what version of Neo4j, I am able to reproduce your behavior under 4.2.2 and also have a workaround so as to have the index used. For example my workaround code is

:use system
drop database graphUser;
create database graphUser;
:use graphUser;
// create single index on 4 properties, i.e. the properties in the where clause
create index index01 for (n:GraphUser) on (n.is_banned, n.is_paused, n.is_invisible, n.profile_complete);
call db.indexes();

// create some dummy data
foreach ( x in range (1,100) | create (n:GraphUser {id:x, profile_complete: true, potential_user: false, is_banned: false, is_paused: true}));
foreach ( x in range (101,300) | create (n:GraphUser {id:x, profile_complete: false, potential_user: false, is_banned: true, is_paused: true}));
foreach ( x in range (301,10000) | create (n:GraphUser {id:x, profile_complete: false, potential_user: true, is_banned: true, is_paused: false}));

profile MATCH (potential_user:GraphUser)
  WHERE potential_user.is_banned=false AND potential_user.is_paused=false AND potential_user.is_invisible=false AND potential_user.profile_complete=true
WITH potential_user, distance(point({srid:4326, x:-140.0312186, y:39.33233141}), potential_user.location) as distance
 where distance < 100000
  RETURN count(potential_user);

the main difference between your original post and my workaround is in the MATCH statement I have moved the WHERE clause up. With my example code the PROFILE reports its first block as a NodeIndexSeek

aross · ‎01-13-2021

Thank you guys, this very helpful! I didn't know you could move around WITH and WHERE like that.

Indeed, the modified query is now using the index. Still, it's not performing well, averaging about 180ms.

PROFILE MATCH (potential_user:GraphUser)
WHERE (potential_user.is_banned=false AND potential_user.is_paused=false AND potential_user.is_invisible=false AND potential_user.profile_complete=true)
WITH potential_user, distance(point({srid:4326, x:-140.0312186, y:39.33233141}), potential_user.location) as distance
WHERE distance < 100000
RETURN count(potential_user);

Are spatial functions just that slow or is there something else I could be doing?

clem · ‎01-13-2021

The big fat RED sections shows the slow sections. For the DISTANCE query, it is run on all 2.6M nodes.

Did you try my subquery I suggested above? I think that would help because the DISTANCE calculation would be only done on the 91K results from the boolean sub-query instead of all 2.6M nodes in the DB. I'm not sure if the query planner could be smarter, but it's being sub-optimal.

In theory, you could create a bounding box that includes a box around the geo-location results of the first boolean part of the query. Then somehow limit the distance calculation to only those points that fall within the bound box. But I've never played with Neo4J GeoLocations, so I don't know if that's possible.

aross · ‎01-13-2021

Hey @clem , thanks for the ideas.

Is this what you had in mind?

CALL {
MATCH (potential_user:GraphUser)
WHERE (potential_user.is_banned=false AND potential_user.is_paused=false AND potential_user.is_invisible=false AND potential_user.profile_complete=true)
RETURN potential_user }
MATCH (potential_user)
WITH potential_user, distance(point({srid:4326, x:-140.0312186, y:39.33233141}), potential_user.location) as distance
WHERE distance < 100000
RETURN count(potential_user);

This query performs slightly worse, ~200ms, and has a similar plan:

clem · ‎01-13-2021

Yes, that's what I have in mind, but the way you have it set up isn't working the way I expect (but I don't know how to fix it!)

The first box with the big RED section returns 91K rows. The second box with the big red box searches 2.6M rows. It should (in theory) only search 91K rows but it's not.

I've modified it by commenting out the offending 2.6M MATCH, but I'm not sure if this works or not. Something like this:

CALL {
MATCH (potential_user:GraphUser)
WHERE (potential_user.is_banned=false AND potential_user.is_paused=false AND potential_user.is_invisible=false AND potential_user.profile_complete=true)
RETURN potential_user }
// MATCH (potential_user) // remove this big, bad match: matches 2.6M!!!!
WITH potential_user, distance(point({srid:4326, x:-140.0312186, y:39.33233141}), potential_user.location) as distance
WHERE distance < 100000
RETURN count(potential_user);

aross · ‎01-13-2021

Thanks @clem , that returns the same query plan.

What's weird is that even if I just do a query on distance, it still has millions of db hits:

MATCH (potential_user:GraphUser)
WITH potential_user, distance(point({srid:4326, x:-140.0312186, y:39.33233141}), potential_user.location) as distance
WHERE distance < 100000
RETURN count(potential_user)

I have an index on GraphUser(location) but it doesn't seem to be using it.

clem · ‎01-13-2021

It doesn't surprise me that indexing doesn't help with location because the query doesn't know how to limit the number of nodes searched based on the location because it has to go through a distance calculation. (In theory, some clever algo could handle this better).

Maybe try:

CALL {
MATCH (potential_user:GraphUser)
WHERE (potential_user.is_banned=false AND potential_user.is_paused=false AND potential_user.is_invisible=false AND potential_user.profile_complete=true)
RETURN potential_user AS pu // Try to get the 61k potential_users out of the subquery
 }
WITH pu, distance(point({srid:4326, x:-140.0312186, y:39.33233141}), pu.location) AS distance
WHERE distance < 100000
RETURN count(pu);

I'm not clear on how to get the limited return results out of the subquery and pass it along to the second part of the query (filtering by distance.). You may have to fiddle with this some more.

I'm still at the intermediate level of Cypher and it sometimes surprises me!

dana_canzano · ‎01-13-2021

if you change

MATCH (potential_user:GraphUser)
WHERE (potential_user.is_banned=false AND potential_user.is_paused=false AND potential_user.is_invisible=false AND potential_user.profile_complete=true)
WITH potential_user, distance(point({srid:4326, x:-140.0312186, y:39.33233141}), potential_user.location) as distance
WHERE distance < 100000
RETURN count(potential_user);

to

with point({srid:4326, x:-140.0312186, y:39.33233141}) as point1
MATCH (potential_user:GraphUser)
WHERE (potential_user.is_banned=false AND potential_user.is_paused=false AND potential_user.is_invisible=false AND potential_user.profile_complete=true)
WITH potential_user, distance(point1, potential_user.location) as distance
WHERE distance < 100000
RETURN count(potential_user);

any performance benefit? Also your query response times, is this after the 2nd run of the query. For example on a cold start of Neo4j the data to solve a query may first need to get the data from the file system , where as the 2nd run would get it from RAM, provided dbms.memory.pagecache.size is suffciently defined

aross · ‎01-13-2021

Thanks @dana.canzano ! Unfortunately, it doesn't help, still gives me the same query profile.

Is it possible I need to create the spatial index with different options?

aross · ‎01-13-2021

@clem yes, that appears to help!!

Here is what I changed it to:

PROFILE 
CALL {
MATCH (potential_user:GraphUser)
WHERE (potential_user.is_banned=false AND potential_user.is_paused=false AND potential_user.is_invisible=false AND potential_user.profile_complete=true)
RETURN potential_user AS pu }
WITH pu, distance(point({srid:4326, x:-140.0312186, y:39.33233141}), pu.location) as distance
WHERE distance < 100000
RETURN count(pu);

and the query plan is this:

db hits are way down on the distance calc, awesome! The query is now running about 75ms on average, so starting to get close to usable!

clem · ‎01-14-2021

Maybe try this too:

WITH -140.0312186 AS Xcenter, 39.33233141 AS Ycenter, 
pu,
0.5 AS Xslop, // where slop is some bounding box value is equivalent to +/-10000
0.5 AS Yslop // where slop is some bounding box value is equivalent to +/-10000
WHERE 
  pu.location.x < Xcenter) + Xslop AND 
  pu.location.x > Xcenter) - Xslop AND 
  pu.location.y < Ycenter) + Yslop AND 
  pu.location.y > Ycenter) - Yslop AND
  distance(point({srid:4326, x:Xcenter, y:Ycenter}), pu.location) < maxdistance

The four pu.location tests are short circuit booleans, which will avoid the expensive (unindexed) distance calculations if the pu.location is outside the bounding box.
I'm not sure if Neo4j indexes the individual values of location.x and location.y, but you could create indexes for them. With this index, it will go faster.

The Xslop and Yslop may be trickier to calculate as they are in degrees and distance will vary by what location they are at.

See this:

and

dominicvivek06 · ‎01-14-2021

Hi @aross ,

can you execute -> db.prepareForReplanning() [caution if you are running in production] and rerun @dana.canzano query below ->

with point({srid:4326, x:-140.0312186, y:39.33233141}) as point1
MATCH (potential_user:GraphUser)
WHERE (potential_user.is_banned=false AND potential_user.is_paused=false AND potential_user.is_invisible=false AND potential_user.profile_complete=true)
WITH potential_user, distance(point1, potential_user.location) as distance
WHERE distance < 100000
RETURN count(potential_user);

again clear the cache, and try the below statement too

with point({srid:4326, x:-140.0312186, y:39.33233141}) as point1
MATCH (potential_user:GraphUser)
WHERE potential_user.is_banned=false AND potential_user.is_paused=false AND potential_user.is_invisible=false AND potential_user.profile_complete=true AND distance(potential_user.location,point1) < 100000
return count(potential_user);

one question for composite index -> potential_user.is_banned=false AND potential_user.is_paused=false AND potential_user.is_invisible=false AND potential_user.profile_complete=true is the order of the columns name in the query same as the order of composite index ordering.

Since, you are using composite index, i believe you are using enterprise version of neo4j. Can you let us know what is the build # ?

aross · ‎01-15-2021

Hi @dominicvivek06 ,

Thanks for the suggestions. Both queries have about the same performance after running the prepareForReplanning() function. They are running around 200ms. Here is the query plan:

The only way I can get the db hits lower, is to use @clem 's suggestion of forcing the results of the index scan into a different variable. This reduces the average execution time to around 70ms.

I should point out that this really is just the beginning of a much larger query with a lot more conditions, and likely there will be a UNION with another query that deals with relationships. (The current query that is in production on a relational db has four joins). So it's a little disconcerting that this simple spatial query is performing so slowly, and it's also awkward that just adding a few more conditions to this adds tons of db hits and degrades performance.

Granted, I am very new to cypher and maybe I really need to understand the query chaining and indexing (I clearly don't), but are my expectations just not set appropriately?

FWIW, I'm running this on Desktop MacOS version 4.0.8.

Thanks for your help!

clem · ‎01-15-2021

Did you try adding an index to the location that I asked about above? I believe this will cut down a lot on the number of calculations made.

aross · ‎01-15-2021

Yes, I had added an index with:

CREATE INDEX index_user_location FOR (n:GraphUser) ON (n.location)

Do I need to create it with different options?

clem · ‎01-16-2021

I haven't done anything with geospatial before but... with my experience and book knowledge, I can ask sanity questions to help diagnose the problem...

In the very first PROFILE you show in this thread, the distance query resulted in 6M DB hits. How many total GraphUser nodes do you have in your system? If it's 6M, then the original query probably had to do a linear search (painfully slow!) over the entire GraphUser nodes.

In theory, a geospatial index should help reduce the search space as the query should be able to quickly exclude points outside of the bounding box using the index. (It would do a gathering of nodes that fell within the bounding box of lat, long ranges corresponding to a distance of 100000 ). I'm not sure if you had the geospatial index on location with the first attempt or not.

Are you using PROFILE or EXPLAIN? I think PROFILE might be better as it tells you actually what occurred.

One thing to try, is do the query with and without the index on the location. If it doesn't improve, then it's geospatial indexing is not behaving the way that I expect and you'll have to ask someone else... (You could duplicate the DB and then drop the geospatial index in the duplicated DB and rerun the query.)

I am somewhat puzzled by the PROFILE you did based on my subquery suggestion. The subquery 91,872 DB hits but the following Geospatial query has almost exactly twice as many hits: 183k. This suggests to me that geospatial search on the results of the subquery has to do a linear search twice (once for lat and once for long.). I guess this makes sense, but it's why the query is still slow.

The other thing, are there any other properties that you can narrow down the search on before passing it onto the geospatial search?

Lastly, what if you make the distance to something smaller? (e.g. 1000 instead of 100000?) If the indexing is working, then I believe the PROFILE might show far fewer DB hits. If it doesn't, then something isn't working the way that I expect.

I need to think about this some more...

dominicvivek06 · ‎01-16-2021

Hi @aross , what is your SLA on running this query ? How much time it took before ? What is the performance baseline ? Is your objective to decrease the db hits ? what is the other rdbms you are comparing it to ? After clearing the cache, how much time did you re-run it again ?

Remember having an index doesn't solve everything (it will have performance issues when writing the data). According to RDBMS theory if the optimizer has to scan all the data, it will swift from Index scan to full table index scan after 25-35% scanning the statistics, because it doesn't want to scan through all the index leaf nodes.

you mentioned RDBMS, which RDBMS and what is the hardware spec ? what is your macOS spec ? is this part of an enterprise production database farm, or it is just your local PoC.
are you migrating the data from the RDBMS for any reasons such as cost , licensing etc .

Can you verify if you are running enterprise or community version ?

Apologies to ask so many questions. I just want to understand the objective both technical and non-technical .

aross · ‎01-18-2021

Hi @dominicvivek06 , thank you for your help.

I'll take a step back and explain my motivation. I have a social app currently live in production that is using postgres as it's primary datastore (AWS/RDS). Data is spread across other databases, but currently the tables storing social data are in postgres. This decision was made for familiarity and to speed up development, and also because some business decisions in the beginning did not obviously point to graph as the best option.

We have a relatively small userbase, but due to the way social data grows, we already have some large tables (20M+ rows) that we are doing multiple joins on for some queries. While current infrastructure is fine at this level, this data will only get exponentially bigger, and we are hoping, of course, for many more users when we enter our growth phase later in the year.

So, I am exploring moving the social graph out of postgres and into an actual graph database for long-term growth. Exporting/importing these rows and relationships was quite easy and certain queries already make so much more sense in graph (dealing with established graph relationships).

But I am somewhat confused by the query I've been asking about on this thread: a simple geospatial query that is performing much slower than something similar in postgres. And this is just the first step in a much larger query that will look for both nodes with a relationship to a given user and union those with other nodes without a specific relationship (which will bring cartesian products into this query). If i can't get it to properly use the right index for this one part, I'm feeling fairly hopeless about getting the rest of it right.

Right now, I'm just testing and profiling these queries on my local Mac (however, it is a very good machine, and postgres runs very fast on it with our full production database). I am just running Neo4j Desktop but can try the enterprise or community version if you think it would be useful (the Desktop version has been convenient).

Anyway, that's the high-level view. I'd love for neo4j to work because it does make sense for this collection of data.

clem · ‎01-15-2021

There's always something new to learn in Neo4J!

I'm guessing you did not create an index on the geolocation. (I didn't know you could do this until just now!). The bounding box search is done internally with an index, when there is an index.

(I haven't tried this myself, so I hope it will work for you.)

To use it, see:

For the gory details, see:

dominicvivek06 · ‎01-15-2021

Hi @clem , there is indeed an spatial index.

Link to Neo4j 4.2 documentation -

" If there is a index on a particular :Label(property) combination, and a spatial point is assigned to that property on a node with that label, the node will be indexed in a spatial index. For spatial indexing, Neo4j uses space filling curves in 2D or 3D over an underlying generalized B+Tree. "

this index helps in Polygon In-point queries.

As of as Neo4j 3.x,4.x, it supports only Point. there are other spatial like Linestring, Polygon etc

Neo4j

Why is this geospatial search so slow?