
Heads up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.

Query taking a very long time to create relationships

vnagpure
Node Link

I am running the following query.

LOAD CSV WITH HEADERS FROM 'file:///records.csv' AS freq
MATCH (f:sFreq) WHERE f.slfreq=freq.fr
MATCH (d:Date) WHERE d.date=date(freq.dt)
CREATE (f)-[r:STATISTICS_ON]->(d)
I created indexes with the following two queries:

CREATE INDEX unique_id_index FOR (n:sFreq) ON (n.fr)
CREATE INDEX unique_id_index1 FOR (n:Date) ON (n.dt)

I have Date and sFreq nodes. I want to create relationships based on matching frequencies and dates from the CSV.
The CSV is huge (4,800,000 rows) and the query has been running for more than 8 hours now.

Is there any way to reduce this time? I have multiple CSV files with which I need to create relationships in the same database.

Regards
Vaishali

16 REPLIES

According to the query, you are searching on properties 'slfreq' and 'date', not 'fr' and 'dt'. The properties 'fr' and 'dt' are the column headings in your csv file. As such, I think you want the following indexes:

CREATE INDEX unique_id_index FOR (n:sFreq) ON (n.slfreq)
CREATE INDEX unique_id_index1 FOR (n:Date) ON (n.date)

That may help.

You can see what indexes are being used by looking at the query plan. 

https://neo4j.com/docs/cypher-manual/current/query-tuning/query-profile/
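For example, you can profile a read-only version of the query (with the CREATE replaced by a count, so nothing is written) to check whether the indexes are used. A sketch, assuming the properties are slfreq and date as above:

```
PROFILE
LOAD CSV WITH HEADERS FROM 'file:///records.csv' AS freq
MATCH (f:sFreq) WHERE f.slfreq=freq.fr
MATCH (d:Date) WHERE d.date=date(freq.dt)
RETURN count(*)
```

In the plan, each MATCH should show up as a NodeIndexSeek operator; a NodeByLabelScan means the index is not being used, for example because of a type mismatch between the CSV string value and the stored property.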

vnagpure
Node Link

That was a typo. I used slfreq and date; as I was trying various options, I pasted the wrong line here.

Regards

Vaishali

How is your csv file organized? Can you group the data so that you can process one frequency at a time with all its dates?

Can you share the query plan?  

Actually, I can write the query to group the dates by frequency, so you don't have to search for the same frequency many times. That may help speed things up.

Can you explain how I can write this grouping? I have the same frequencies on multiple dates.

Regards

Vaishali

Try this. I collected the dates by frequency in order to do the grouping. Let's see what happens to the memory requirements with the many rows you have:

LOAD CSV WITH HEADERS FROM 'file:///records.csv' AS freq
with freq.fr as freq_fr, collect(distinct freq.dt) as freq_dates
match (f:sFreq{slfreq:freq_fr})
with f, freq_dates
unwind freq_dates as freq_date
match (d:Date{date:date(freq_date)})
create (f)-[r:STATISTICS_ON]->(d)

It's also possible the grouping adds time to the query, negating any benefit. What the query avoids is matching the same frequency node more than once.

I tried the query below and it's running comparatively fast.
CALL apoc.periodic.iterate(
"LOAD CSV WITH HEADERS FROM 'file:///records.csv' as freq RETURN freq",
"WITH freq.freq as sfreq, freq.new_date as fdate, freq.mean as fmean
match (f:sFreq) where f.slfreq=sfreq
match (d:Date) where d.date=date(fdate)
CREATE (f)-[:STATISTICS_ON{mean:toFloat(fmean)}]->(d)",{batchSize:10000, parallel:true})
 
Now the issue is: the CSV has 98169 rows, but the above query is creating around 2225084 relationships. Ideally it should create one relationship for each row. I checked the result with a query and it shows the same relationship multiple times, as below; there should be one record. Can you please tell me what is wrong with the above query?
│44.475          │-108.16310501098631│{"date":"2017-01-01"}│
├────────────────┼───────────────────┼─────────────────────┤
│44.475          │-108.16310501098631│{"date":"2017-01-01"}│
├────────────────┼───────────────────┼─────────────────────┤
│44.475          │-108.16310501098631│{"date":"2017-01-01"}│
├────────────────┼───────────────────┼─────────────────────┤
│44.475          │-108.16310501098631│{"date":"2017-01-01"}│
├────────────────┼───────────────────┼─────────────────────┤
│44.475          │-108.16310501098631│{"date":"2017-01-01"}│

Since you are using 'create' for the relationships, it will create a new one each time. Have you executed this script multiple times in an effort to troubleshoot it? If so, that would explain the duplicates. I agree with you that you should get one new relationship per row in the csv file.

No, I am running it only once. I also tried using MERGE with parallel set to false, but it's still creating duplicate relationships.

 

I saw one post saying following can avoid creating multiple relationships.

MATCH (n), (m) WHERE ID(n) = {id1} AND ID(m) = {id2}
SET n.locked = true, m.locked = true
MERGE (n)-[r:TEST]->(m)
SET n.locked = Null, m.locked = Null

What will be my query like this?

SET  (f:sfreq).locked = true,  (d:Date).locked = true --- gives error

That will not help, as you are not setting properties on nodes n or m. The syntax of your 'SET' clause is not correct: you don't include the label in the set operations, just the variable and the property.

Are you sure you don't have multiple rows with the same frequency and date combination? You can change the 'create' to 'merge' so the relationship is not created twice. Just a note: merge is not multi-thread safe, so running the batches in parallel could result in multiple relationships if you have multiple rows with the same frequency and date.

Also, using parallel:true when creating relationships can result in lock contention, which may negatively impact performance. That can be true with parallel:false as well.
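Putting those two points together, a sequential (parallel:false) batched version using merge might look like this — a sketch, assuming the same column names (freq, new_date, mean) as in your query:

```
CALL apoc.periodic.iterate(
  "LOAD CSV WITH HEADERS FROM 'file:///records.csv' AS freq RETURN freq",
  "MATCH (f:sFreq) WHERE f.slfreq = freq.freq
   MATCH (d:Date) WHERE d.date = date(freq.new_date)
   MERGE (f)-[r:STATISTICS_ON]->(d)
   ON CREATE SET r.mean = toFloat(freq.mean)",
  {batchSize: 10000, parallel: false})
```

Note that ON CREATE SET keeps the mean property out of the MERGE pattern; if the property were in the pattern, two rows with the same frequency and date but different means would create two relationships.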

Can you run the following to ensure there are not duplicate rows?

LOAD CSV WITH HEADERS FROM 'file:///records.csv' as freq
WITH freq.freq as sfreq, freq.new_date as fdate, freq.mean as fmean, count(*) as cnt
WHERE cnt>1
RETURN *
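You can run a similar check against the graph itself, in case the duplicates are in the nodes rather than the file — a sketch:

```
// Date nodes sharing the same date value
MATCH (d:Date)
WITH d.date AS dt, count(*) AS cnt
WHERE cnt > 1
RETURN dt, cnt
```

If this returns rows, a MATCH on that date will bind to every duplicate node, multiplying the relationships created.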

(screenshot attachment: vnagpure_0-1675699799595.png)

I got this. 

I can see a new issue now. I have split the CSV into files with a small number of records.

mean freq occup new_date
-108.163 44.475 0.813008 1/1/2017
-108.433 44.478 0.813008 1/1/2017
-107.406 44.481 0.813008 1/1/2017
-108.059 44.484 0.813008 1/1/2017
-107.676 44.487 0.813008 1/1/2017
-107.018 44.577 0.813008 1/1/2017
-107.148 44.58 0.813008 1/1/2017
      1/2/2017
-108.664 50.28 0.813008 1/3/2017
-108.369 50.283 0.813008 1/3/2017
-108.856 50.286 0.813008 1/3/2017

Now if I create relationship with following query

LOAD CSV WITH HEADERS FROM 'file:///201701.csv' AS freq
WITH freq.freq AS ffreq, freq.new_date AS fdate
MATCH (fr:sFreq {sfreq :toFloat(ffreq)})
MATCH (d:Date {date :date(fdate)})
MERGE (fr)-[:STATISTICS_ON]->(d)

For date 2017-01-01 it is creating a total of 203 relationships. Actually, it should create only 7 relationships to the same date node.

Can you please let me know what is wrong now?

Regards

Vaishali

I think I found the error. I have duplicate Date nodes, which is the cause of the problem.

Thanks for all your help. I will get back if any new issue arises.

Regards

Vaishali