03-04-2021 01:36 AM
Dear all,
I want to merge some data from a CSV file into Neo4j (v3.5.9). Because the labels are defined dynamically in the CSV, I use APOC to achieve this. After importing the entities, I import the relationships as below:
CALL apoc.load.csv($csvFile, {skip: 0, header: true}) YIELD map AS relRow
WITH relRow, relRow['type'] AS type, relRow['from_entity_type'] AS from_type, relRow['to_entity_type'] AS to_type
CALL apoc.cypher.doIt('MATCH (source:' + from_type + ' {entity_id: $row[\"from_entity_id\"], job_id: $row[\"job_id\"]}), (target:' + to_type + ' {entity_id: $row[\"to_entity_id\"], job_id: $row[\"job_id\"]})
WITH source, target
MERGE (source)-[r:' + type + ']->(target)
SET r = $row, r.type = null, r.from_entity_type = null, r.to_entity_type = null, r.from_entity_id = null, r.to_entity_id = null,
r.is_suspected = (CASE WHEN r.is_suspected = \"t\" THEN true ELSE false END),
r.is_effective = (CASE WHEN r.is_effective = \"t\" THEN true ELSE false END)', {row: relRow}) YIELD value AS v RETURN v
The problem is that it sometimes runs very slowly (more than 2 hours), even for CSV files with fewer than 80,000 entries. However, it runs very fast (less than 1 minute) the second time with the same CSV file and the same script. Of course, indexes have been created on those properties for all entities.
Is there any problem in my script? This has been bothering me for a long time. Any suggestions would be appreciated.
03-04-2021 01:53 AM
Can you do a quick test to check that the index is being used?
PROFILE
MATCH (source:WhateverTheActualTypeIs {entity_id: 1, job_id: 2})
RETURN source
And then if you could upload the query plan we should be able to tell.
Cheers, Mark
03-07-2021 05:10 PM
Thanks. I profiled the script as you suggested; the details of the plan are below.
As we can see, the NodeIndexSeek operator is used.
03-08-2021 01:45 PM
Hmm, I'd say the slowness might be because it's making so many calls to apoc.cypher.doIt, but I'm not sure. Do you have a dummy CSV file that you can share so I can play around with the query?
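If the per-row procedure calls turn out to be the bottleneck, batching them might help. Here is a minimal sketch using apoc.periodic.iterate, assuming that procedure is available in your APOC version; the file name is hypothetical and the property cleanup from your script is omitted for brevity:
CALL apoc.periodic.iterate(
  "CALL apoc.load.csv('rels.csv', {header: true}) YIELD map AS row RETURN row",
  "CALL apoc.cypher.doIt(
     'MATCH (source:' + row.from_entity_type + ' {entity_id: $row.from_entity_id, job_id: $row.job_id}),
            (target:' + row.to_entity_type + ' {entity_id: $row.to_entity_id, job_id: $row.job_id})
      MERGE (source)-[r:' + row.type + ']->(target)
      SET r = $row', {row: row}) YIELD value
   RETURN value",
  {batchSize: 1000, parallel: false})
Committing in batches of 1000 keeps the transaction state small, and parallel: false avoids lock contention on shared nodes.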
03-09-2021 12:15 AM
Because I cannot export data from the production environment, I created several CSV files: two for entities and one for relationships, each with only a few records. In production, e_company.csv contains fewer than 100k records, e_email.csv fewer than 1k, and r_use.csv fewer than 80k. The dummy files are listed below:
r_use.txt (220 Bytes) e_company.txt (260 Bytes) e_email.txt (339 Bytes)
Please import the entities first; then we can check the relationship-import script I mentioned. Notably, the problem is difficult to reproduce even with the same CSV files and script in the production environment.
In addition, there are many calls to apoc.cypher.doIt in other places, and they work fine. It really confuses me.
03-09-2021 02:42 AM
I created indexes on the specified properties, and all of the data is deleted after a job is done. Will the indexes be affected in some way at the next job? I am doubtful about this, even though the execution plan shows the index works, because I found that the script always works well on a Neo4j instance that already holds some data.
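Would resampling the index statistics after the mass delete help? I imagine something like this sketch for 3.5 (the :company label and property pair are just examples from my data):
CALL db.resampleOutdatedIndexes();
or, for a specific index:
CALL db.resampleIndex(":company(entity_id,job_id)");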
03-09-2021 02:47 AM
(I'm still playing around with it but...)
When you run it and the data is already there, the MERGE statement wouldn't actually create a relationship, because it already exists. Or did you mean that it's also faster even when it creates things that don't exist?
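To illustrate with hypothetical values: if you run something like this twice, the relationship is created on the first run only, and the second run merely matches it:
MATCH (source:company {entity_id: 'e1', job_id: '1'}),
      (target:mail {entity_id: 'e2', job_id: '1'})
MERGE (source)-[r:relative_to]->(target)
RETURN r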
03-09-2021 02:54 AM
I delete all the data first, then run it a second time, and I found it works well.
03-09-2021 02:59 AM
The existing data is not related to the current job; it can be regarded as historical data.
03-09-2021 02:57 AM
I've generated a file to play around with. The file is gzipped (rels.csv.gz) but I had to rename it so that I'm allowed to upload it:
LOAD CSV WITH HEADERS FROM "file:///rels.csv" AS relRow
WITH relRow,
relRow['relation_type'] AS type,
relRow['from_entity_type'] AS from_type,
relRow['to_entity_type'] AS to_type
CALL apoc.cypher.doIt(
'MERGE (source:' + from_type + ' {entity_id: $row[\"from_entity_id\"], job_id: $row[\"job_id\"]})
MERGE (target:' + to_type + ' {entity_id: $row[\"to_entity_id\"], job_id: $row[\"job_id\"]})
WITH source, target
MERGE (source)-[r:' + type + ']->(target)
SET r = $row,
r.relation_type = null, r.from_entity_type = null, r.to_entity_type = null, r.from_entity_id = null, r.to_entity_id = null,
r.is_suspected = (CASE WHEN r.is_suspected = \"t\" THEN true ELSE false END),
r.is_effective = (CASE WHEN r.is_effective = \"t\" THEN true ELSE false END)',
{row: relRow})
YIELD value AS v RETURN v
rels.txt (978.0 KB)
And when I import it using Neo4j 4.2.3, it takes about 10 seconds:
03-09-2021 03:02 AM
Yes, the same script works well across many jobs, just as you show. However, it sometimes takes a long time.
03-09-2021 03:04 AM
When I ran it on node labels with no index (by mistake), I found it was taking forever. So could it be that some files contain different, unindexed labels? Or maybe the index is not online yet?
You can check for that by running this query:
CALL db.indexes()
So on my database, I see:
And we want the state to be ONLINE.
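If you want to list only the problematic ones, a filter like this should work on 4.x (the yielded column names differ on 3.5):
CALL db.indexes() YIELD name, labelsOrTypes, properties, state
WHERE state <> 'ONLINE'
RETURN name, labelsOrTypes, properties, state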
03-09-2021 03:14 AM
Thank you for your advice. I rechecked the indexes, and the status of all of them is ONLINE. I did not check whether some files have different unindexed labels; I will check that once I can access the production environment. One question: if it were as you said, the second run of the script with the same CSV should also be slow, but it is actually fine.
I found that rels.csv in the production environment contains about 80k entries and takes about 40 MB of storage.
03-09-2021 03:25 AM
You can test whether there are indexes by running this query:
LOAD CSV WITH HEADERS FROM "file:///rels.csv" AS relRow
WITH relRow['from_entity_type'] AS from_type,
relRow['to_entity_type'] AS to_type
UNWIND [from_type, to_type] AS type
WITH type, count(*) AS count
RETURN type, count, apoc.cypher.runFirstColumn(
'CALL db.indexes()
YIELD labelsOrTypes, properties, state
WHERE $label IN labelsOrTypes
RETURN {labels: labelsOrTypes, properties: properties, state: state}',
{label: type}
) AS indexes
03-09-2021 03:43 AM
OK, I will try it when I can access the production environment. Thanks for your patience.
03-11-2021 02:04 AM
The name won't make a difference. Do you have any rows that aren't mail or company?
03-11-2021 03:04 AM
No, there are no entries like that in rels.csv. And when I added some rows that aren't mail or company to rels.csv (by mistake), it seemed to work well, for example several records like this:
r_id,from_entity_id,from_entity_type,to_entity_id,to_entity_type,job_id,relation_type
id1,e1_1,unnamed_email,e2_1,unnamed_company,1,relative_to
id2,e2_2,unnamed_company,e2_2,unnamed_company,1,relative_to
I do not know how to reproduce what you described ("When I ran it on node labels with no index (about 100k entries) I found it was taking forever").
03-10-2021 10:32 PM
I called db.indexes() in my Neo4j 3.5.9; labelsOrTypes is not returned because the API changed, so the script you provided may not be supported on my Neo4j instance. I captured the details of the indexes by calling db.indexes(); they are shown in the attached file.
indexes.txt (602 Bytes)
What confuses me is that the value for indexName shows "Unnamed index". Will that affect how the index works? The problem now is that the same job with the same data works well most of the time, but sometimes it runs for a very long time.
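I may try adapting your check to 3.5 by matching on tokenNames instead of labelsOrTypes; a sketch, assuming db.indexes() in 3.5.9 yields tokenNames, properties, and state:
LOAD CSV WITH HEADERS FROM "file:///rels.csv" AS relRow
WITH relRow['from_entity_type'] AS from_type,
relRow['to_entity_type'] AS to_type
UNWIND [from_type, to_type] AS type
WITH type, count(*) AS count
RETURN type, count, apoc.cypher.runFirstColumn(
'CALL db.indexes()
YIELD tokenNames, properties, state
WHERE $label IN tokenNames
RETURN {labels: tokenNames, properties: properties, state: state}',
{label: type}
) AS indexes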