03-04-2021 01:36 AM
Dear all,
I want to merge some data from a CSV file into Neo4j (v3.5.9). Because the labels are defined dynamically in the CSV, I use APOC to achieve this. After importing the entities, I import the relationships as below:
CALL apoc.load.csv($csvFile, {skip: 0, header: true}) YIELD map AS relRow
WITH relRow, relRow['type'] AS type, relRow['from_entity_type'] AS from_type, relRow['to_entity_type'] AS to_type
CALL apoc.cypher.doIt('MATCH (source:' + from_type + ' {entity_id: $row[\"from_entity_id\"], job_id: $row[\"job_id\"]}), (target:' + to_type + ' {entity_id: $row[\"to_entity_id\"], job_id: $row[\"job_id\"]})
WITH source, target
MERGE (source)-[r:' + type + ']->(target)
SET r = $row, r.type = null, r.from_entity_type = null, r.to_entity_type = null, r.from_entity_id = null, r.to_entity_id = null,
r.is_suspected = (CASE WHEN r.is_suspected = \"t\" THEN true ELSE false END),
r.is_effective = (CASE WHEN r.is_effective = \"t\" THEN true ELSE false END)', {row: relRow}) YIELD value AS v RETURN v
The problem is that it sometimes runs very slowly (more than 2 hours), even for CSV files with fewer than 80,000 entries. However, it runs very fast (less than 1 minute) the second time with the same CSV file and the same script. Of course, indexes have been created on those properties for all entities.
Is there any problem in my script? This has been bothering me for a long time. Any suggestions would be appreciated.
03-04-2021 01:53 AM
Can you do a quick test to check that the index is being used?
PROFILE
MATCH (source:WhateverTheActualTypeIs {entity_id: 1, job_id: 2})
RETURN source
And then if you could upload the query plan we should be able to tell.
Cheers, Mark
03-07-2021 05:10 PM
Thanks. I profiled the script as you suggested; the details of the plan are below.
As we can see, the NodeIndexSeek operator is used.
03-08-2021 01:45 PM
Hmm, I'd say the slowness might be because it's making so many calls to apoc.cypher.doIt, but I'm not sure. Do you have a dummy CSV file that you can share so I can play around with the query?
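If the per-row procedure calls turn out to be the bottleneck, batching them might help. Here is a minimal sketch using apoc.periodic.iterate, assuming that procedure is available in your APOC version; the file name is hypothetical and the property cleanup from your script is omitted for brevity:
CALL apoc.periodic.iterate(
  "CALL apoc.load.csv('rels.csv', {header: true}) YIELD map AS row RETURN row",
  "CALL apoc.cypher.doIt(
     'MATCH (source:' + row.from_entity_type + ' {entity_id: $row.from_entity_id, job_id: $row.job_id}),
            (target:' + row.to_entity_type + ' {entity_id: $row.to_entity_id, job_id: $row.job_id})
      MERGE (source)-[r:' + row.type + ']->(target)
      SET r = $row', {row: row}) YIELD value
   RETURN value",
  {batchSize: 1000, parallel: false})
Committing in batches of 1000 keeps the transaction state small, and parallel: false avoids lock contention on shared nodes.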
03-09-2021 12:15 AM
Because I cannot export data from the production environment, I created several CSV files: two for entities and one for relationships, each with only a few records. In production, e_company.csv contains fewer than 100k records, e_email.csv fewer than 1k, and r_use.csv fewer than 80k. The dummy files are listed below:
r_use.txt (220 Bytes) e_company.txt (260 Bytes) e_email.txt (339 Bytes)
Please import the entities first; then we can check the relationship-import script I mentioned. Notably, the problem is difficult to reproduce even with the same CSV files and script in the production environment.
In addition, there are many calls to apoc.cypher.doIt in other places, and they work fine. It really confuses me.
03-09-2021 02:42 AM
I created indexes on the specified properties, and all of the data is deleted after a job is done. Will the indexes be affected in some way at the next job? I am doubtful about this, even though the execution plan shows the index works, because I found that the script always works well on a Neo4j instance that already holds some data.
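Would resampling the index statistics after the mass delete help? I imagine something like this sketch for 3.5 (the :company label and property pair are just examples from my data):
CALL db.resampleOutdatedIndexes();
or, for a specific index:
CALL db.resampleIndex(":company(entity_id,job_id)");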
03-09-2021 02:47 AM
(I'm still playing around with it but...)
When you run it and the data is already there, the MERGE statement wouldn't actually create a relationship, because it already exists. Or did you mean that it's also faster even when it creates things that don't exist?
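To illustrate with hypothetical values: if you run something like this twice, the relationship is created on the first run only, and the second run merely matches it:
MATCH (source:company {entity_id: 'e1', job_id: '1'}),
      (target:mail {entity_id: 'e2', job_id: '1'})
MERGE (source)-[r:relative_to]->(target)
RETURN r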
03-09-2021 02:54 AM
I delete all the data first, then run it a second time, and I found it works well.
03-09-2021 02:59 AM
The existing data is not related to the current job; it can be regarded as historical data.
03-09-2021 02:57 AM
I've generated a file to play around with. The file is gzipped (rels.csv.gz) but I had to rename it so that I'm allowed to upload it:
LOAD CSV WITH HEADERS FROM "file:///rels.csv" AS relRow
WITH relRow,
relRow['relation_type'] AS type,
relRow['from_entity_type'] AS from_type,
relRow['to_entity_type'] AS to_type
CALL apoc.cypher.doIt(
'MERGE (source:' + from_type + ' {entity_id: $row[\"from_entity_id\"], job_id: $row[\"job_id\"]})
MERGE (target:' + to_type + ' {entity_id: $row[\"to_entity_id\"], job_id: $row[\"job_id\"]})
WITH source, target
MERGE (source)-[r:' + type + ']->(target)
SET r = $row,
r.relation_type = null, r.from_entity_type = null, r.to_entity_type = null, r.from_entity_id = null, r.to_entity_id = null,
r.is_suspected = (CASE WHEN r.is_suspected = \"t\" THEN true ELSE false END),
r.is_effective = (CASE WHEN r.is_effective = \"t\" THEN true ELSE false END)',
{row: relRow})
YIELD value AS v RETURN v
rels.txt (978.0 KB)
And when I import it using Neo4j 4.2.3, it takes about 10 seconds:
03-09-2021 03:02 AM
Yes, the same script works well across many jobs, just as you show. However, it sometimes takes a long time.
03-09-2021 03:04 AM
When I ran it on node labels with no index (by mistake), I found it was taking forever. So could it be that some files contain different, unindexed labels? Or maybe the index is not online yet?
You can check for that by running this query:
CALL db.indexes()
So on my database, I see:
And we want the state to be ONLINE.
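If you want to list only the problematic ones, a filter like this should work on 4.x (the yielded column names differ on 3.5):
CALL db.indexes() YIELD name, labelsOrTypes, properties, state
WHERE state <> 'ONLINE'
RETURN name, labelsOrTypes, properties, state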
03-09-2021 03:14 AM
Thank you for your advice. I rechecked the indexes, and the status of all of them is ONLINE. I did not check whether some files have different unindexed labels; I will check that once I can access the production environment. One question: if it were as you said, the second run of the script with the same CSV should also be slow, but it is actually fine.
I found that rels.csv in the production environment contains about 80k entries and takes about 40 MB of storage.
03-09-2021 03:25 AM
You can test whether there are indexes by running this query:
LOAD CSV WITH HEADERS FROM "file:///rels.csv" AS relRow
WITH relRow['from_entity_type'] AS from_type,
relRow['to_entity_type'] AS to_type
UNWIND [from_type, to_type] AS type
WITH type, count(*) AS count
RETURN type, count, apoc.cypher.runFirstColumn(
'CALL db.indexes()
YIELD labelsOrTypes, properties, state
WHERE $label IN labelsOrTypes
RETURN {labels: labelsOrTypes, properties: properties, state: state}',
{label: type}
) AS indexes
03-09-2021 03:43 AM
OK, I will try it when I can access the production environment. Thanks for your patience.
03-11-2021 02:04 AM
The name won't make a difference. Do you have any rows that aren't mail or company?
03-11-2021 03:04 AM
No, there are no entries like that in rels.csv. And when I added some rows that aren't mail or company to rels.csv (by mistake), it seemed to work well, for example several records like this:
r_id,from_entity_id,from_entity_type,to_entity_id,to_entity_type,job_id,relation_type
id1,e1_1,unnamed_email,e2_1,unnamed_company,1,relative_to
id2,e2_2,unnamed_company,e2_2,unnamed_company,1,relative_to
I do not know how to reproduce what you described ("When I ran it on node labels with no index (about 100k entries) I found it was taking forever").
03-10-2021 10:32 PM
I called db.indexes() in my Neo4j 3.5.9; labelsOrTypes is not returned because the API changed, so the script you provided may not be supported on my Neo4j instance. I captured the details of the indexes by calling db.indexes(); they are shown in the attached file.
indexes.txt (602 Bytes)
What confuses me is that the value for indexName shows "Unnamed index". Will that affect how the index works? The problem now is that the same job with the same data works well most of the time, but sometimes it runs for a very long time.
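I may try adapting your check to 3.5 by matching on tokenNames instead of labelsOrTypes; a sketch, assuming db.indexes() in 3.5.9 yields tokenNames, properties, and state:
LOAD CSV WITH HEADERS FROM "file:///rels.csv" AS relRow
WITH relRow['from_entity_type'] AS from_type,
relRow['to_entity_type'] AS to_type
UNWIND [from_type, to_type] AS type
WITH type, count(*) AS count
RETURN type, count, apoc.cypher.runFirstColumn(
'CALL db.indexes()
YIELD tokenNames, properties, state
WHERE $label IN tokenNames
RETURN {labels: tokenNames, properties: properties, state: state}',
{label: type}
) AS indexes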