
Heads up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.

Query taking a very long time to create relationships

vnagpure
Node Link

I am running the following query.

LOAD CSV WITH HEADERS FROM 'file:///records.csv' AS freq
MATCH (f:sFreq) WHERE f.slfreq=freq.fr
MATCH (d:Date) WHERE d.date=date(freq.dt)
CREATE (f)-[r:STATISTICS_ON]->(d)
I created indexes with the following two queries:

CREATE INDEX unique_id_index FOR (n:sFreq) ON (n.fr)
CREATE INDEX unique_id_index1 FOR (n:Date) ON (n.dt)

I have Date and sFreq nodes. I want to create relationships based on matching frequencies and dates from the CSV.
The CSV is huge (4,800,000 rows) and the query has been running for more than 8 hours now.

Is there any way to reduce this time? I have multiple CSV files with which I need to create relationships in the same database.

Regards
Vaishali

16 REPLIES

According to the query, you are searching on properties 'slfreq' and 'date', not 'fr' and 'dt'. The properties 'fr' and 'dt' are the column headings in your csv file. As such, I think you want the following indexes:

CREATE INDEX unique_id_index FOR (n:sFreq) ON (n.slfreq)
CREATE INDEX unique_id_index1 FOR (n:Date) ON (n.date)

That may help.

You can see what indexes are being used by looking at the query plan. 

https://neo4j.com/docs/cypher-manual/current/query-tuning/query-profile/
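For example, you can profile a read-only version of the query (with the CREATE replaced by a count, so nothing is written) to check whether the indexes are used. A sketch, assuming the properties are slfreq and date as above:

```
PROFILE
LOAD CSV WITH HEADERS FROM 'file:///records.csv' AS freq
MATCH (f:sFreq) WHERE f.slfreq=freq.fr
MATCH (d:Date) WHERE d.date=date(freq.dt)
RETURN count(*)
```

In the plan, each MATCH should show up as a NodeIndexSeek operator; a NodeByLabelScan means the index is not being used, for example because of a type mismatch between the CSV string value and the stored property.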

vnagpure
Node Link

That was a typo. I used slfreq and date; as I was trying various options, I pasted the wrong line here.

Regards

Vaishali

How is your csv file organized? Can you group the data so that you can process one frequency at a time with all its dates?

Can you share the query plan?  

Actually, I can write the query to group the dates by frequency, so you don't have to search for the same frequency many times. That may help speed things up.

Can you explain how I can write this grouping? I have the same frequencies on multiple dates.

Regards

Vaishali

Try this. I collected the dates by frequency in order to do the grouping. Let's see what happens to the memory requirements with the many rows you have:

LOAD CSV WITH HEADERS FROM 'file:///records.csv' AS freq
with freq.fr as freq_fr, collect(distinct freq.dt) as freq_dates
match (f:sFreq{slfreq:freq_fr})
with f, freq_dates
unwind freq_dates as freq_date
match (d:Date{date:date(freq_date)})
create (f)-[r:STATISTICS_ON]->(d)

It's also possible the grouping adds time to the query, negating any benefit. What the query avoids is matching the same frequency node more than once.

I tried the query below and it's running comparatively fast.
CALL apoc.periodic.iterate(
"LOAD CSV WITH HEADERS FROM 'file:///records.csv' as freq RETURN freq",
"WITH freq.freq as sfreq, freq.new_date as fdate, freq.mean as fmean
match (f:sFreq) where f.slfreq=sfreq
match (d:Date) where d.date=date(fdate)
CREATE (f)-[:STATISTICS_ON{mean:toFloat(fmean)}]->(d)",{batchSize:10000, parallel:true})
 
Now the issue is: the CSV has 98169 rows, but the above query is creating around 2225084 relationships. Ideally it should create one relationship for each row. I checked the result with a query and it shows the same relationship multiple times, as below; there should be one record. Can you please tell me what is wrong with the above query?
│44.475          │-108.16310501098631│{"date":"2017-01-01"}│
├────────────────┼───────────────────┼─────────────────────┤
│44.475          │-108.16310501098631│{"date":"2017-01-01"}│
├────────────────┼───────────────────┼─────────────────────┤
│44.475          │-108.16310501098631│{"date":"2017-01-01"}│
├────────────────┼───────────────────┼─────────────────────┤
│44.475          │-108.16310501098631│{"date":"2017-01-01"}│
├────────────────┼───────────────────┼─────────────────────┤
│44.475          │-108.16310501098631│{"date":"2017-01-01"}│

Since you are using 'create' for the relationships, it will create a new one each time. Have you executed this script multiple times in an effort to troubleshoot it? If so, that would explain the duplicates. I agree with you that you should get one new relationship per row in the csv file.

No, I am running it only once. I also tried using MERGE with parallel set to false, but it's still creating duplicate relationships.

 

I saw one post saying following can avoid creating multiple relationships.

MATCH (n), (m) WHERE ID(n) = {id1} AND ID(m) = {id2}
SET n.locked = true, m.locked = true
MERGE (n)-[r:TEST]->(m)
SET n.locked = Null, m.locked = Null

What will be my query like this?

SET  (f:sfreq).locked = true,  (d:Date).locked = true --- gives error

That will not help, as you are not setting properties on nodes n or m. The syntax of your 'SET' clause is not correct: you don't include the label in the set operations, just the variable and the property.

Are you sure you don't have multiple rows with the same frequency and date combination? You can change the 'create' to 'merge' so the relationship is not created twice. Just a note: merge is not multi-thread safe, so running the batches in parallel could result in multiple relationships if you have multiple rows with the same frequency and date.

Also, using parallel:true when creating relationships can result in lock contention, which may negatively impact performance. That can be true with parallel:false as well.
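Putting those two points together, a sequential (parallel:false) batched version using merge might look like this — a sketch, assuming the same column names (freq, new_date, mean) as in your query:

```
CALL apoc.periodic.iterate(
  "LOAD CSV WITH HEADERS FROM 'file:///records.csv' AS freq RETURN freq",
  "MATCH (f:sFreq) WHERE f.slfreq = freq.freq
   MATCH (d:Date) WHERE d.date = date(freq.new_date)
   MERGE (f)-[r:STATISTICS_ON]->(d)
   ON CREATE SET r.mean = toFloat(freq.mean)",
  {batchSize: 10000, parallel: false})
```

Note that ON CREATE SET keeps the mean property out of the MERGE pattern; if the property were in the pattern, two rows with the same frequency and date but different means would create two relationships.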

Can you run the following to ensure there are not duplicate rows?

LOAD CSV WITH HEADERS FROM 'file:///records.csv' as freq
WITH freq.freq as sfreq, freq.new_date as fdate, freq.mean as fmean, count(*) as cnt
WHERE cnt>1
RETURN *
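You can run a similar check against the graph itself, in case the duplicates are in the nodes rather than the file — a sketch:

```
// Date nodes sharing the same date value
MATCH (d:Date)
WITH d.date AS dt, count(*) AS cnt
WHERE cnt > 1
RETURN dt, cnt
```

If this returns rows, a MATCH on that date will bind to every duplicate node, multiplying the relationships created.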

(screenshot attachment: vnagpure_0-1675699799595.png)

I got this. 

I can see a new issue now. I have split the CSV into files with a small number of records.

mean freq occup new_date
-108.163 44.475 0.813008 1/1/2017
-108.433 44.478 0.813008 1/1/2017
-107.406 44.481 0.813008 1/1/2017
-108.059 44.484 0.813008 1/1/2017
-107.676 44.487 0.813008 1/1/2017
-107.018 44.577 0.813008 1/1/2017
-107.148 44.58 0.813008 1/1/2017
      1/2/2017
-108.664 50.28 0.813008 1/3/2017
-108.369 50.283 0.813008 1/3/2017
-108.856 50.286 0.813008 1/3/2017

Now if I create relationship with following query

LOAD CSV WITH HEADERS FROM 'file:///201701.csv' AS freq
WITH freq.freq AS ffreq, freq.new_date AS fdate
MATCH (fr:sFreq {sfreq :toFloat(ffreq)})
MATCH (d:Date {date :date(fdate)})
MERGE (fr)-[:STATISTICS_ON]->(d)

For date 2017-01-01 it is creating a total of 203 relationships. Actually, it should create only 7 relationships to the same date node.

Can you please let me know what is wrong now?

Regards

Vaishali

I think I found the error. I have duplicate Date nodes, which is the cause of the problem.

Thanks for all your help. I will get back if any new issue arises.

Regards

Vaishali