Qyery takes too much and fail after 2 hours of compiling
‎12-29-2021 10:17 AM
so i'm running the following query on 500k nodes:
load csv with headers from 'file:///df_keyword_urls.csv' as row
with collect({keyword:row.keyword , urls:row.urls}) as ROW
unwind ROW as k1
unwind ROW as k2
with k1.keyword as key1,apoc.convert.fromJsonList(k1.urls) as k1urls ,k2.keyword as key2 ,apoc.convert.fromJsonList(k2.urls) as k2urls
where not k1=k2
with key1,key2 , apoc.coll.intersection(k1urls,k2urls)as urls
return key1 as keyword1, key2 as keyword2, urls as urls_intersections
and even with my 64 gigabytes of ram i still can't complete the tast after completing 3 hours of compiling , any help please to optimize this query i'm still new in cypher i'de love to have instructions to follow or where to look what to learn and what to read to optimize this query .
REGARDS !
- Labels:
-
Cypher
‎12-29-2021 02:07 PM
You can add USING PERIODIC COMMIT to the first line like this.
Adding this will speed up the process.
I think it will work if you have about 2G of heap memory. Not 64G.
Neo4j Browser
:auto USING PERIODIC COMMIT
load csv with headers from 'file:///df_keyword_urls.csv' as row
Cypher Shell
USING PERIODIC COMMIT
load csv with headers from 'file:///df_keyword_urls.csv' as row
‎12-30-2021 03:23 AM
:auto USING PERIODIC COMMIT
doesn't work unless we modify in our graph model , i tried it and it didn't work .
‎01-03-2022 01:00 PM
I don't think this is a good fit for a query.
Performing collect() here means you are materializing all rows of your CSV into heap memory at once. This means you cannot use periodic commit here, or rather that it won't do you any good because of how you're handling the entirety of your data all at once.
You're also doing an UNWIND twice, meaning that you are getting the cross product of every row of your CSV against every row, and this is also happening all at once in your heap memory.
A better, less memory-intenstive approach, would be to process the CSV and save data to your graph (using something like :Keyword and :URL nodes, and ensuring you have indexes (or unique constraints as needed) on both). Avoid doing any collect() or UNWIND operations here (except maybe for UNWIND of your url lists so you can MERGE on the :URL nodes).
That may look something like this (after you've created your indexes or unique constraints):
:auto using periodic commit
load csv with headers from 'file:///df_keyword_urls.csv' as row
MERGE (k:Keyword {keyword:row.keyword})
UNWIND apoc.convert.fromJsonList(row.urls) as url
MERGE (url:URL {url:url})
MERGE (k)-[:HAS_URL]->(url) // or use CREATE if no keyword is repeated, and if urls are unique per keyword
Once your data is saved into the graph, then you can perform a separate query to use graph traversal find related keywords and intersected urls, using subqueries to scope the work needed per keyword:
MATCH (k1:Keyword)
CALL {
WITH k1
MATCH (k1)-[:HAS_URL]->(url)<-[:HAS_URL]-(k2:Keyword)
WHERE id(k1) < id(k2) // this prevents symmetric results
WITH k1, k2, collect(url) as urls
RETURN k1.keyword as key1, k2.keyword as key2, [url IN urls | url.url] as urls_intersection
}
RETURN key1, key2, urls_intersection
‎01-04-2022 09:53 AM
match (k:Keyword)-[a:APPEARS_IN]-(u:Url) // MATCH nodes
with collect(distinct u.id) as urls, k.id as k
with collect({keyword:k , urls:urls}) as all_
unwind all_ as k1
unwind all_ as k2
with k1,k2
where not k1=k2
with k1,k2, apoc.coll.intersection(k1.urls,k2.urls) as intersection
with k1,k2, apoc.coll.combinations(intersection,3,size(intersection)) as url_combinations
unwind url_combinations as comb
with comb, size(comb) as size_urls,apoc.coll.toSet(collect(k1.keyword)+collect(k2.keyword)) as keywords
with comb,size_urls,keywords,size(keywords) as size_keywords
return comb as urls,size_urls,size_keywords,keywords
order by size_keywords desc
First , Thank you so much for giving me your time and answering me i really appreciate it and thank you so much for your explanation .
so let me explain .
i'm here trying to get all the keywords Urls .
after that i get the intersections between all the Urls , and that to get where all the pairs of keywords share one or more intersection .
then i make all combinations of these intersection , because , if k1 and K2 intersect in 5 urls per example , they also intersect in less than 4,3 and 2 , and in different combinations of these cummon urls .
then i group by those different combinations to make keywords clusters .
i know my code is slow , i know i'm not using the graph properties which i should , i should use the relationships and not hard code it this way . and that's why i brake it into smaller chunks , in which i generate just the intersections and after that the combinations , but still soo slow .
‎01-04-2022 05:48 PM
The problem is that your query isn't breaking down the work into smaller chunks. Your collect() operations require materializing all of that in memory at once, and the UNWINDs after don't help.
Too much is going on in memory, and the graphy approach is going to be more efficient both in performance and memory usage.
In my previous suggestion, the MATCH I performed is effectively finding intersections one keyword at a time, finding all other keywords that have relationships with the same urls, and the subqueries used actually break down the process work per keyword.