cancel
Showing results for 
Search instead for 
Did you mean: 

Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.

How to handle this path counting query that returns a lot of rows?

Rogie
Node Link

I have a schema representing users that make comments on websites:

(:Website)<-[:POSTED_IN]-(:Comment)<-[:POSTED]-(:User)

I would like to count, for each pair of websites, how many distinct users commented on both sites. I tried this:

Match (w1:Website)<-[:POSTED_IN]-(:Comment)<-[:POSTED]-(u:User)-[:POSTED]->(:Comment)-[:POSTED_IN]->(w2:Website)
RETURN w1.name, w2.name, Count(Distinct u) as weight

There are around 6000 Website nodes in my db, so around 36 million rows need to be returned, and this seems to cause the db to crash.

One workaround I tried was to do a single query for each Website in the db:

for w in websites:
    q = '''Match (w1:Website {name:$w})<-[:POSTED_IN]-(:Comment)<-[:POSTED]-(u:User)-[:POSTED]->(:Comment)-[:POSTED_IN]->(w2:Website)
    RETURN w1.name, w2.name, Count(Distinct u) as weight'''
    tx.run(q, w=w)

This gets the job done, (I think), but is till very slow.

Is there a better way to do this?

2 REPLIES 2

shan
Graph Buddy

The first issue with your query is that for every user who commented on two websites, you get repetitive rows like this:

user1, website1, website2
user1, website2, website1

you can fix that by adding a WHERE condition like this:

WHERE id(w1)>id(w2)

About how many comments does a user typically make per website? Unless the typical user only leaves a single comment per website, you may need to remodel this and your query for better performance.

If a user posts 20 comments on a website, in your query this will multiply out 20 times however many comments they posted on each other site, in terms of paths explored and rows generated that need to be processed with DISTINCT.

If you created and maintained these distinct paths: (:Website)<-[:POSTED_AT]-(:User), such that only one such relationship can exist between a user and a website, that should prevent those multiplicative path explorations and may increase your query efficiency.