Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.
05-20-2020 03:15 PM
I'm having a hard time optimizing a query to get co-occurrence data from my graph. The data in my graph represents projects and the companies working on them. I'm trying to get data on how often each company worked with each of the other companies on a project at the same address. Here's the part of my schema that is relevant to this query:
And here's my query:
MATCH (company:Company)-[:WORKED_ON]->(project:Project)-[:LOCATED_IN]->(address:Address)
WITH DISTINCT company, address
MATCH (other_company:Company)-[:WORKED_ON]->(other_project:Project)-[:LOCATED_IN]->(address)
WHERE company <> other_company
WITH company, other_company, COUNT(DISTINCT address) AS frequency ORDER BY frequency DESC
RETURN
company.uuid,
COLLECT(other_company.uuid + ': ' + frequency) AS co_occurrence
This query should return all companies in the graph and for each return an array of other companies that worked on a project in the same address, along with a frequency of how often that occurred for that pair of companies.
Here's an example output when I run the query on a smaller subset of companies:
company.uuid, co_occurrence
5CB009264D78E758233282, [5CAFDF9568280959694542: 12,5CB00653EFB7C920999687: 3,5CAFE3F0651CD240072545: 1]
My graph size is ~25GB. I tried running the query with 30GB cache and 30GB heap allocated to Neo4j. I'm running Neo4j 3.5.1.
My graph has 1.7M company nodes, 4.9M project nodes, and 3.5M address nodes. There's about 11M WORKED_ON relationships between companies and projects. Here's the result of running EXPLAIN on my query:
I can get the query to return results within a couple of minutes when I limit the number of companies to 10k. Once I go to 100k I can't get it to return results. It runs forever and eventually brings Neo4j down.
Any suggestions for how I can optimize this query?
05-21-2020 05:58 AM
Hello @hisham,
Just to know, are you using UNIQUE CONSTRAINTS (https://neo4j.com/docs/cypher-manual/current/administration/constraints/) on your nodes?
Regards,
Cobra
05-21-2020 01:44 PM
Hi @Cobra,
Thank you for your question. Yes, for the nodes involved in this query, I have one UNIQUE constraint on the Address node:
ON ( address:Address ) ASSERT address.full_address IS UNIQUE
I also have the following indexes:
ON :Address(street_address_lower) ONLINE
ON :Address(uuid) ONLINE
ON :Company(cohort_id) ONLINE
ON :Company(name_lower) ONLINE
ON :Company(publish_score) ONLINE
ON :Company(uuid) ONLINE
ON :Project(uuid) ONLINE
ON :Company(name) ONLINE
ON :Address(full_address) ONLINE (for uniqueness constraint)
Thanks,
Hisham
05-21-2020 02:31 PM
Only on Address nodes or are there also constraints on Project and Company?
Regards,
Cobra
05-21-2020 03:16 PM
Only on Address node. No constraints on Project or Company nodes.
05-21-2020 03:20 PM
Try adding unique constraints on them and then try again :)
05-21-2020 03:23 PM
Does it matter which property to add the UNIQUE constraint for? How would that help?
05-21-2020 03:35 PM
I saw there were uuids on Company and Project nodes so you can use them to create the unique constraints:)
Generally, it's a good practice to create a unique constraint on a property that will be different for each node like a product id for example or a company id:)
The unique constraint on the id will help Neo4j speed up the query, but you can still add a unique constraint on another property if you want it to be unique :)
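For example, in Neo4j 3.5 syntax (property names taken from the thread; the Address node already has its full_address constraint, so the uuid one there is optional):

```cypher
CREATE CONSTRAINT ON (c:Company) ASSERT c.uuid IS UNIQUE;
CREATE CONSTRAINT ON (p:Project) ASSERT p.uuid IS UNIQUE;
CREATE CONSTRAINT ON (a:Address) ASSERT a.uuid IS UNIQUE;
```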
05-22-2020 11:05 AM
Thanks, @Cobra. I added the UNIQUE constraint for the 3 node labels. It seemed to speed things up slightly when I limit the number of companies, but I still can't get the query to run on the entire graph.
05-22-2020 11:21 AM
That's good news! I have another idea :)
Put the second MATCH clause in an apoc.cypher.run()
procedure
Doc: https://neo4j.com/docs/labs/apoc/current/cypher-execution/
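A rough, untested sketch of that idea, assuming the APOC plugin is installed and the schema from the original post:

```cypher
MATCH (company:Company)-[:WORKED_ON]->(:Project)-[:LOCATED_IN]->(address:Address)
WITH DISTINCT company, address
// run the second pattern match as an isolated inner query
CALL apoc.cypher.run('
  MATCH (other:Company)-[:WORKED_ON]->(:Project)-[:LOCATED_IN]->(a:Address)
  WHERE a = $address AND other <> $company
  RETURN other
', {address: address, company: company}) YIELD value
WITH company, value.other AS other_company, COUNT(DISTINCT address) AS frequency
ORDER BY frequency DESC
RETURN company.uuid,
       COLLECT(other_company.uuid + ': ' + frequency) AS co_occurrence
```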
Regards,
Cobra
11-02-2020 07:35 PM
I'm having similar issues with co-occurrence link creation. I am trying to create bi-partite links between any two authors that have the same publishers. When the number of items is small, the query finishes in a matter of minutes, but as you increase the numbers, the query almost never finishes. My PC has 64 GB of RAM (only 80% used) and a 10th-gen Core i7 (16 virtual cores, only 30% used). All drives are SSDs with 1 to 3% usage.
Is there a better way to code this query below:
Match (n:Author)-[:HasPublisher]->(p:Publisher)
WHERE EXISTS(n.Last) AND EXISTS(n.First) AND LEFT(p.Name,1)='A'
WITH n,p as pubs
UNWIND pubs as publish
// find other authors published by same publisher
Match (publish)<-[:HasPublisher]-(a:Author)
WHERE EXISTS(a.Last) AND EXISTS(a.First) AND n.Orcid<>a.Orcid AND n.First <>a.First AND n.Last <> a.Last
Merge (n)-[:SamePublisher{Name:publish.Name}]->(a)
Notice that I already deliberately limited the returned items by filtering for only publishers with names starting with "A", and still the query never finishes!
After 8 hours of running:
11-03-2020 12:37 AM
I'm now wrapping it inside a call to apoc.periodic.iterate:
CALL apoc.periodic.iterate(
"cypher runtime=slotted
MATCH (n:Author)-[:HasPublisher]->(p:Publisher)<-[:HasPublisher]-(a:Author)
WHERE EXISTS(n.Last) AND EXISTS(n.First) AND EXISTS(a.Last) AND EXISTS(a.First) AND n.Orcid<>'NONE' AND a.Orcid<>'NONE' AND n.Orcid<>a.Orcid AND n.First<>a.First AND n.Last<>a.Last
MERGE (n)-[r:SamePublisher]->(a)",
"ON CREATE SET r.Name=p.Name",
{batchSize:500, parallel:true, iterateList:true}
)
Still running as we speak.
Still running as we speak.
11-03-2020 05:20 AM
Hello @wilson.chua
I assume that you are using the latest version of Neo4j. You can try this query:
MATCH (a:Author)-[:HasPublisher]->(p:Publisher)<-[:HasPublisher]-(b:Author)
WHERE EXISTS(a.Last) AND EXISTS(a.First) AND EXISTS(b.Last) AND EXISTS(b.First)
MERGE (a)-[r:SamePublisher {Name: p.Name}]->(b)
Or, batched with apoc.periodic.iterate:
CALL apoc.periodic.iterate('
MATCH (a:Author)-[:HasPublisher]->(p:Publisher)<-[:HasPublisher]-(b:Author)
WHERE EXISTS(a.Last) AND EXISTS(a.First) AND EXISTS(b.Last) AND EXISTS(b.First)
RETURN a, b, p
', '
MERGE (a)-[r:SamePublisher {Name: p.Name}]->(b)
', {batchSize: 1000})
What are the characteristics of your database?
Regards,
Cobra
11-03-2020 02:35 PM
Hi Cobra Maxime, thanks for looking into this. To answer your question: I am NOT using the latest version, 1.3.10 (that version does not allow me to save into a different folder location). I am currently using 1.3.4.
The graph database has 1.3m nodes and 43m relationships so far. The Database folder is about 2.36GB
11-04-2020 01:24 AM
11-04-2020 04:29 AM
Hi Cobra Maxime. Neo4j Desktop is version 1.3.4.
I have 64gb ram
I did run your code, it's been over 13 hours now... and still running.
It is still running. The db size has grown from 2.36 GB to 239 GB.
I see that the relationships have grown from 43M to 458M and counting.
11-05-2020 02:13 AM
As of this writing, the relationship count has grown to over 700M and is still growing.
11-05-2020 07:00 PM
It successfully ended at 769M links! Thanks Cobra Maxime. It took almost 48 hours to run.
11-05-2020 02:24 AM
Hi, there is a difference between the Neo4j Database, Neo4j Desktop, and Neo4j Browser. I need to know the configuration of the Neo4j Database (version, RAM, etc.). There is a difference between the computer having 64 GB of RAM and the database being allocated 64 GB of RAM.
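For reference, the database's own memory allocation is set in neo4j.conf; for example (the values below are only placeholders, not recommendations for your machine):

```
dbms.memory.heap.initial_size=8g
dbms.memory.heap.max_size=8g
dbms.memory.pagecache.size=16g
```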
Have you used UNIQUE CONSTRAINT to create your nodes?
Regards,
Cobra
11-13-2020 04:30 PM
Do you have indexes on author first and last name?
Better yet would be a composite index on the author's first+last name:
And you can search using the composite index:
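For example, in Neo4j 3.x syntax (property names taken from the queries above; the author values are just placeholders):

```cypher
// composite index over both name properties
CREATE INDEX ON :Author(First, Last);

// a lookup that can use the composite index
// (both properties must have equality predicates)
MATCH (a:Author)
WHERE a.First = 'Jane' AND a.Last = 'Doe'
RETURN a;
```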
(sorry if this is too obvious...)