Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.
03-22-2019 01:04 AM
I am working with a large graph of Wikipedia topics having more than 16 million nodes and 400 million edges. Trying to find important topics from the set of some topics. For that, I want to use PageRank on a subgraph of the Wikipedia graph. Using PageRank by using match cypher query is taking a lot of time > 5 mins. How can it be optimized? see the first comment below
Is there a way by which I can get subgraph of the whole graph. like If I have titles of the topic, I want to get all nodes and relationship among matched nodes.
03-22-2019 01:19 AM
Currently for pagerank for a subgraph having topics Sport_utility_vehicle, Sports_car and Luxury_vehicle
I am using the below query, which is very slow.
CALL algo.pageRank.stream(
'MATCH (t:TOPIC) WHERE t.title = "Sport_utility_vehicle" or t.title = "Sports_car" or t.title = "Luxury_vehicle" RETURN id(t) as id',
'MATCH (t1:TOPIC)-[:hasLink]-(t2:TOPIC) RETURN id(t1) as source, id(t2) as target',
{graph:'cypher'}
) YIELD nodeId, score with algo.getNodeById(nodeId) as node, score order by score desc limit 10
RETURN node, score
03-23-2019 03:15 AM
Have you tried using an index on the "title" property?
CREATE INDEX ON :Topic(title)
You can check if the index is ready or still populating by running the following cypher command ":schema"
03-24-2019 04:12 AM
Yes, I've created an index on the "title" company. But still the pagerank takes a lot of time. IDK why
03-24-2019 09:29 AM
How many node and relationships are you working with? Is the data all cached in ram? How big is the machine hosting the db?
First thing I would try to understand is if the slow bit is the data load, or the actual execution of the page rank, or the sorting done at the end.
Can you check the neo4j.log and debug.log files and share here the relevant section?
03-24-2019 09:46 AM
Hi, My whole graph has 16 million nodes and 500 million relationships. But for Pagerank I need to query on subgraph which will consist of less than 100 nodes and 10000 relationships. The machine hosting the db has 120GB of RAM. Data loading is fast, The problem mainly lies on the execution of pagerank or probably I am not able to get the subgraph efficiently. Do you know any ways to get subgraph of a few nodes efficiently.
03-24-2019 11:04 AM
Ok, that makes things clearer.
I think you could speed up by using the index on the property title
in the second query.
CALL algo.pageRank.stream(
'MATCH (t:TOPIC) WHERE t.title in ["Sport_utility_vehicle", "Sports_car", "Luxury_vehicle"] RETURN id(t) as id',
'MATCH (t1:TOPIC)-[:hasLink]-(t2:TOPIC) WHERE t1.title in ["Sport_utility_vehicle", "Sports_car", "Luxury_vehicle"] RETURN id(t1) as source, id(t2) as target',
{graph:'cypher'}
) YIELD nodeId, score with algo.getNodeById(nodeId) as node, score order by score desc limit 10
RETURN node, score
Two more suggestions:
$NEO_HOME/logs/neo4j.log
, it provides timings of the various phases of the execution (loading, computing, writing)EXPLAIN
. It will show if indexes will be actually used during the query execution.03-24-2019 11:21 AM
I did the same hack to fasten up the query, Is there some other way so that I don't have to put the same conditions on edges(i.e they belong to set of nodes). I'll have a look into your suggestions tomorrow and get back to you
03-25-2019 03:13 AM
Do you know how can this query be modified to not only include the nodes with (t.title in ["Sport_utility_vehicle", "Sports_car", "Luxury_vehicle"])
but also their adjacent nodes. i.e 1 hop away
All the sessions of the conference are now available online