Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.
01-15-2020 01:00 AM
Dear community, dear Michael,
I have managed to load the nodes with LOAD CSV.
The issue that I am struggeling right now with is creation of the relationships accordingly. Is this a thing which is done quicker with Java due to the internal structure of neo4j?
Thank you very much.
Kindest Regards,
Malte
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///relationship_officeraa' AS row
FIELDTERMINATOR ';'
MATCH (m),(n)
WHERE m.node_id = trim(row.start_id) and n.node_id = trim(row.end_id) and n.source = trim(row.source) and m.source = trim(row.source)
MERGE (m)-[r:OFFICER_OF {source: row.source, status: row.status, start_date: row.start_date, end_date: row.end_date, type: row.links }]-(n)
return r;
This is how the relationship files do look like:
start_id;end_id;source;status;start_date;end_date;links
605;729;RISE NETWORK;N;N;N;ALEXANDRU SODRINGA partner AQEEL MOHAMED DHUYAB
Loading relationships like this takes ages:
for 3000 created relationships something like 5 days.
indexes are online and on each node type combined node_id and source.
indexes are online and on each node type combined node_id and source.
description | indexName | tokenNames | properties | state | type | progress | provider | id | failureMessage |
---|---|---|---|---|---|---|---|---|---|
"INDEX ON :ADDRESS(node_id, source)" | "index_19" | ["ADDRESS"] | ["node_id", "source"] | "ONLINE" | "node_label_property" | 100.0 | { "version": "1.0", "key": "native-btree" } | 19 | "" |
match(n) return count(n);
"count(n)"
58570409
Thank you very much.
Kind Regards,
Malte
01-15-2020 03:07 AM
Hi @MalteR
Maybe your import takes ages because your query tries to take a too big bite at once. Currently, your limit is set to 1000.
Second thing i noticed, that your MATCH / MERGE clauses don't have labels specified. When you use MATCH (m) - it will read through the whole graph - it quite costly.
01-15-2020 03:14 AM
Hi Paul,
thank you very much for your reply.
Can you give me any advice how to get here less costly, actually with regards to labels, I guess you mean type of node, right? I might be able to specify. So my scenario would most likely work faster if I specified the node types and then execute on the database? What's your background on big data size wise?
Kindest Regards,
Malte
01-15-2020 03:17 AM
First, maybe you can describe a bit your model and data bit deeper.
From the relationship name "OFFICER_OF", I would suggest adding Officer label to (m) nodes.
After reviewing your graph model and adding labels try to run the queries like
PROFILE MATCH (m) RETURN m
versus
PROFILE MATCH (m: Officer) RETURN m
You should see the cost difference of these queries
It should speed up things.
01-15-2020 04:38 AM
@MalteR
I am using graphs to store identity access history, the current graph is at the moment not too big and I am using Community Edition, but I believe by the end of 2020 it will grow to a few million nodes - as I add more data sources.
I had a similar issue while syncing data from different data sources and my graph started growing. After i added additional labels that corresponded to each data source, I managed to speed up read/write to graph drastically. Some data loading queries that before took hours, finish now in a few minutes. Thing is that you have to keep in mind - how big dataset your query is addressing. Thinking of subgraphs helped me to improve my model, it may help you also.
I suggest reading "Cypher Tips & Tricks" post made by @michael.hunger
There is a lot of good information there
01-15-2020 07:55 AM
Hello Paul,
thank you very much for your input. My background why I am asking about big data is because of what some people think big data would be is pretty small datasets like one million entries. For me this starts not from one milion datasets but from 10 milion datasets the least.
Can I do something a constraint on label (node_type) dynamically? Is there any way to do that loaded from file within the match-part of the cypher query? See attached my labels that would have to be constrained by column of a file if possible.
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///relationship_officeraa' AS row
FIELDTERMINATOR ';'
MATCH (m),(n)
WHERE m.node_id = trim(row.start_id) and n.node_id = trim(row.end_id) and n.source = trim(row.source) and m.source = trim(row.source)
MERGE (m)-[r:OFFICER_OF {source: row.source, status: row.status, start_date: row.start_date, end_date: row.end_date, type: row.links }]-(n)
return r;
Thank you very much in advance.
Kindest Regards,
Malte
01-15-2020 08:38 AM
Test this. (You need APOC)
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///import.csv' AS row
FIELDTERMINATOR ';'
CALL apoc.cypher.run('
MATCH
(m:`'+row.source+'` {node_id: $MNodeID}),
(n:`'+row.source+'` {node_id: $NNodeID})
RETURN m, n'
, {MNodeID: trim(row.start_id),NNodeID: trim(row.end_id)})
YIELD value
WITH value.m as m, value.n as n, row
MERGE (m)-[r:OFFICER_OF {source: row.source, status: row.status, start_date: row.start_date, end_date: row.end_date, type: row.links }]-(n)
return r;
01-15-2020 08:46 AM
@MalteR in addition to using row.source as a label. Maybe your data allows you add some labels to (m) and (n) nodes. If they are different types, then it would improve performance more.
01-16-2020 06:46 AM
Hi @MalteR,
You've created a good composite index. The problem is that it's not being used in your match. The match needs the label portion to utilize the index.
Try replacing MATCH (m),(n)
with MATCH (m:Address), (n:Address)
Also as a matter of clarity you can structure it like below (dropping the where):
MATCH (m:Address {node_id:trim(row.start_id), source: trim(row.source)})
MATCH (n:Address {node_id:trim(row.end_id), source: trim(row.source)})
Hope that helps,
-Mike
03-12-2020 09:25 AM
Thank you very much for this hint, it worked when I added the labels.
Thank you very much!
All the sessions of the conference are now available online