Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.
07-18-2022 03:10 PM - edited 07-19-2022 04:30 PM
Hello everyone!
I’m looking for a solution regarding my phd research where i need to do some data wrangling, ie. transform my dataset into format suitable for graph/network analysis or to be more precise - make relationship from the data based on the list of existing nodes and their temporal properties.
Anyways, I’ve written a cypher query where I’m loading my csv into a Neo4j DB and simultaneously updating “EndOfExport” column - whenever “X” is found, today’s date is inserted. Instead of today, i would like to know how to get the last day of the current month?
LOAD CSV WITH HEADERS FROM "file:///Test - input.csv" AS csvLine FIELDTERMINATOR ';'
CREATE (e:Country {CountryID: csvLine.CountryID, StartOfExport: date(csvLine.StartOfExport), EndOfExport: date(CASE csvLine.EndOfExport WHEN "X" THEN date() ELSE csvLine.EndOfExport END)})
Furthermore, the main question lies in the following: i want to create relationships between all the nodes (countries) who traded in the same period in time.
That means that the following node list input
CountryID | StartOfExport | EndOfExport |
55 | 2008-10-16 | 2014-01-31 |
47 | 2010-04-19 | 2014-09-15 |
73 | 2010-08-09 | 2022-07-18 |
61 | 2010-08-10 | 2013-04-30 |
should result in the following relationships:
CountryID_01 | CountryID_02 | StartOfExport | EndOfExport |
55 | 47 | 2010-04-19 | 2014-01-31 |
55 | 73 | 2010-08-09 | 2014-01-31 |
55 | 61 | 2010-08-10 | 2013-04-30 |
47 | 73 | 2010-08-09 | 2014-09-15 |
47 | 61 | 2010-08-10 | 2013-04-30 |
73 | 61 | 2010-08-10 | 2013-04-30 |
That means that i need to do row by row comparison in the loop until the end of the table. Each iteration should start form the next row and make comparison with the each row until the end of table (top-bottom).
Comparison is based on the CountryID and i’m comparing
1.) StartOfExport of CountryID in the row X with the StartOfExport
of CountryID in the row X+1 where i need to take out the larger date;
2.) EndOfExport of CountryID in the row X with the EndOfExport
of CountryID in the row X+1 where i need to take out the smaller date;
The question is: should i do this kind of data wrangling before loading the data into Neo4j? In that case i would load separately both the node and relationship lists. Alternatively, is something like that possible to do with Cypher/APOC procedures and only while using the mentioned node list?
The ultimate goal would be to calculate centrality metrics in GDS for each year separately.
I’m working on a short notice so any advice/workflow would be very much appreciated!
Solved! Go to Solution.
07-22-2022 07:58 AM
From what I read, native projections don't allow conditions. It is just a list of node labels. I suggest you also add a special temporary label on the nodes you want to project. You can project just those nodes with the special label, the same as you are doing for relationships.
The following will set the nodes with a relationship that has the correct time restrictions:
MATCH (c:Country)
WHERE exists {
MATCH (c)-[e:EXPORTED]-(:Country)
WHERE (e.Started).year <=2010 AND (e.Ended).year >=2010
}
SET c:EXPORTED_2010
If you already have set the type on the relationships, you can simplify the query:
MATCH (c:Country)
WHERE exists ( (c)-[:EXPORTED_2010]-(:Country) )
SET c:EXPORTED_2010
07-18-2022 05:26 PM - edited 07-18-2022 05:27 PM
Try this to import the data:
load csv with headers from "file:///Countries.csv" as line
merge(n:Country {countryID: line.countryID})
set n.startOfExport = date(line.startOfExport), n.endOfExport = date(line.endOfExport)
Note: if you want the countryIDs to be integers, then you will need to use 'toInteger(line.countryID)' instead.
Try this to establish the relationships:
match(n:Country)
match(m:Country)
where id(n)<id(m)
merge(n)-[r:REL]->(m)
set r.startOfExport = case when n.startOfExport > m.startOfExport then n.startOfExport else m.startOfExport end,
r.endOfExport = case when n.endOfExport < m.endOfExport then n.endOfExport else m.endOfExport end
return n.countryID, m.countryID, r.startOfExport, r.endOfExport
You can calculate the end of the month with the following:
with date() as date
return date({year: date.year, month: date.month+1, day:1})-duration({days:1}) as endOfCurrentMonth
07-19-2022 10:17 AM - edited 07-19-2022 10:19 AM
Hi @glilienfield after some minor tweaking it works like a charm, thanks!
Nevertheless, i have two more questions regarding this topic:
1.) Is it possible to load only nodes with LOAD CSV and skip relationship creation and consequentially create those relationships when loading our graph into memory with gds.graph.project.cypher using the same logic, ie. with the following query you wrote:
match(n:Country)
match(m:Country)
where id(n)<id(m)
merge(n)-[r:REL]->(m)
set r.startOfExport = case when n.startOfExport > m.startOfExport then n.startOfExport else m.startOfExport end,
r.endOfExport = case when n.endOfExport < m.endOfExport then n.endOfExport else m.endOfExport end
return n.countryID, m.countryID, r.startOfExport, r.endOfExport
2.) I've loaded my graph into memory but i need it to be UNDIRECTED. That is of course possible with the native projection. Nevertheless, i've loaded my graph with the cypher projection because i need to know when two Countries exported goods in the same (specific) year, for example this is for the year 2010:
CALL gds.graph.project.cypher
(
'graph',
'MATCH (c:Country) RETURN id(c) AS id',
'MATCH (c1:Country)-[e:EXPORTED]-(c2:Country) WHERE (e.StartedExporting).year <=2010 AND (w.EndedExporting).year >=2010 RETURN id(c1) AS source, id(c2) AS target'
)
The downside of this approach (cypher projection memory load) is the fact that i've loaded my graph DIRECTED and therefore i need to define it UNDIRECTED while calculating certain metrics. However, only degree centrality allows that with the "orientation" property. On the contrary, Eigenvector, Closeness, etc. - not.
Ideally, i would use native projection, but how to limit years on the relationship property to the specific range?
07-20-2022 01:54 PM - edited 07-20-2022 02:01 PM
I am not a GDS user, but from what I read you will not be able to create the relationships when projecting. The projection uses existing entities to load them into memory, thus you will need to do them during or post import.
How about temporarily relabeling your relationships with a new label, so that you can leverage the native projections? Write a query that finds all the relationships that meet the date range and set them to a new label, then use native projects to load them using the new label. You can set the labels back after your analysis.
07-20-2022 02:36 PM
I did some testing. It seems that the following cypher projection loads UNDIRECTED graph:
CALL gds.graph.project.cypher
(
'graph',
'MATCH (c:Country) RETURN id(c) AS id',
'MATCH (c1:Country)-[e:EXPORTED]-(c2:Country) WHERE (e.StartedExporting).year <=2010 AND (w.EndedExporting).year >=2010 RETURN id(c1) AS source, id(c2) AS target'
)
...and the following loads DIRECTED graph:
CALL gds.graph.project.cypher
(
'graph',
'MATCH (c:Country) RETURN id(c) AS id',
'MATCH (c1:Country)-[e:EXPORTED]->(c2:Country) WHERE (e.StartedExporting).year <=2010 AND (w.EndedExporting).year >=2010 RETURN id(c1) AS source, id(c2) AS target'
)
Afterwards i did some calculations and it seems ok. So yeah, seems that this works.
07-20-2022 04:05 PM
That does not align with what I just read. See below screenshot.
https://neo4j.com/docs/graph-data-science/current/graph-project-cypher/
From what I read about native projections, is that it loads each relationship in each direction, thus making it undirected.
Well, if it works, then that is what counts.
07-20-2022 04:58 PM - edited 07-20-2022 04:59 PM
please check this link as well - https://github.com/neo4j/graph-data-science/issues/113
I guess somebody encountered the same problem.
I have one final question for you. The query you wrote works perfectly good:
CALL gds.graph.project.cypher
(
'graph',
'MATCH (c:Country) RETURN id(c) AS id',
'MATCH (c1:Country)-[e:EXPORTED]-(c2:Country) WHERE (e.Started).year <=2010 AND (e.Ended).year >=2010 RETURN id(c1) AS source, id(c2) AS target'
)
However, as a result, i get nodes without any relationships as well. How to write query to list only nodes with the relationships?
Tnx!
07-20-2022 05:54 PM
But didn't that person get incorrect results by using a cypher query without specifying the direction?
I assume you have Country nodes that do not have relationships to other countries. You can filter these out in the node query.
CALL gds.graph.project.cypher
(
'graph',
'MATCH (c:Country) WHERE exists((c)--()) RETURN id(c) AS id',
'MATCH (c1:Country)-[e:EXPORTED]-(c2:Country) WHERE (e.Started).year <=2010 AND (e.Ended).year >=2010 RETURN id(c1) AS source, id(c2) AS target'
)
07-21-2022 03:17 AM - edited 07-21-2022 03:26 AM
@glilienfield
The following query unfortunately doesn't work, i still get all the nodes, not only those with specified relationship, any thoughts?
CALL gds.graph.project.cypher
(
'graph',
'MATCH (c:Country) WHERE exists((c)--()) RETURN id(c) AS id',
'MATCH (c1:Country)-[e:EXPORTED]-(c2:Country) WHERE (e.Started).year <=2010 AND (e.Ended).year >=2010 RETURN id(c1) AS source, id(c2) AS target'
)
Finally, i've tried numerous times, but couldn't get any result whatsoever:
How about temporarily relabeling your relationships with a new label, so that you can leverage the native projections? Write a query that finds all the relationships that meet the date range and set them to a new label, then use native projects to load them using the new label. You can set the labels back after your analysis.
How to set a new relationship type for the following query?
MATCH (c1:Country)-[e:EXPORTED]-(c2:Country) WHERE (e.Started).year <=2010 AND (e.Ended).year >=2010 RETURN id(c1) AS source, id(c2) AS target
07-21-2022 07:58 AM - edited 07-21-2022 08:00 AM
Here is the query to change the type on the relationship.
MATCH (c1:Country)-[e:EXPORTED]-(c2:Country)
WHERE (e.Started).year <=2010 AND (e.Ended).year >=2010
SET e:NEW_LABEL
Do you have test data I can use?
07-22-2022 01:12 AM - edited 07-22-2022 01:20 AM
Hi @glilienfield it's not working 😞
I get the following:
Type mismatch: expected Node but was Relationship (line 2, column 5 (offset: 140))
"SET e:NEW_LABEL"
^
This is used for the load:
LOAD CSV WITH HEADERS FROM "file:///CountryExport.csv" AS csvLine FIELDTERMINATOR ';'
CREATE
(
c:Country {CountryID: csvLine.CountryID,
startOfExport: date(csvLine.startOfExport),
endOfExport: date(CASE csvLine.endOfExport WHEN "X" THEN date() ELSE csvLine.endOfExport END)}
)
This is used for the default relationships:
match(n:Country)
match(m:Country)
where id(n)<id(m)
merge(n)-[r:REL]->(m)
set r.startOfExport = case when n.startOfExport > m.startOfExport then n.startOfExport else m.startOfExport end,
r.endOfExport = case when n.endOfExport < m.endOfExport then n.endOfExport else m.endOfExport end
return n.countryID, m.countryID, r.startOfExport, r.endOfExport
and i want to add new relationship labels with the following:
MATCH (n:Country)-[r:EXPORTED]-(m:Country)
WHERE (r.startOfExport).year <=2010 AND (r.endOfExport).year >=2010
SET r:NEW_LABEL
This is the test data:
CountryID;startOfExport;endOfExport
55;2008-10-16;2014-01-31
47;2010-04-19;2014-09-15
73;2010-08-09;X
61;2010-08-10;2013-04-30
74;2010-09-02;X
62;2010-09-13;2013-04-10
51;2011-01-11;2014-06-05
75;2011-01-24;X
41;2011-02-07;2014-12-21
57;2011-03-01;2013-10-17
28;2011-03-08;2016-05-28
64;2011-03-08;2012-10-12
34;2011-04-01;2015-12-10
52;2011-07-11;2014-05-23
63;2012-01-09;2012-12-26
56;2012-01-16;2014-01-14
76;2012-02-01;X
58;2012-02-24;2013-09-18
39;2012-04-16;2015-04-10
07-22-2022 02:13 AM - edited 07-22-2022 02:24 AM
I have never tried changing relationship types. I assumed it could be done like node labels. Apparently you can’t do it. As a work around, you could create a new one, copy the relationship properties, then delete the old one. You can also use APOC library to do it instead:
CALL apoc.refactor.setType(r, 'NEW-TYPE')
this made me think of a new approach. Since this relationship is being use only to facilitate your analysis and would need to be changed back anyways, I suggest you create a new relationship with the new label and copy the relationship properties from the existing relationship. You can then use native projection with this new relationship type. You can delete the extra relationship once you don’t need it for projecting. You were going to return the label type back in the original approach, so there no extra work with the new approach.
I will play with the data
07-22-2022 02:38 AM - edited 07-22-2022 02:38 AM
Hi,
i've tried that approach already (at least one part of it) - i've created new relationships, that is fine:
MATCH (n:Country)-[r:EXPORTED]-(m:Country)
WHERE (r.startOfExport).year <=2010 AND (r.endOfExport).year >=2010
MERGE (n)-[r2:EXPORTED_2010]->(m)
But the following virtualization method:
CALL gds.graph.project('graph', 'Country', {EXPORTED_2010: {orientation: 'UNDIRECTED'}})
loads desirable relationships, but also ALL nodes from the network which is not acceptable. I only want nodes with "EXPORTED_2010" relationship, not nodes without any edges.
How to avoid that at least?
07-22-2022 07:58 AM
From what I read, native projections don't allow conditions. It is just a list of node labels. I suggest you also add a special temporary label on the nodes you want to project. You can project just those nodes with the special label, the same as you are doing for relationships.
The following will set the nodes with a relationship that has the correct time restrictions:
MATCH (c:Country)
WHERE exists {
MATCH (c)-[e:EXPORTED]-(:Country)
WHERE (e.Started).year <=2010 AND (e.Ended).year >=2010
}
SET c:EXPORTED_2010
If you already have set the type on the relationships, you can simplify the query:
MATCH (c:Country)
WHERE exists ( (c)-[:EXPORTED_2010]-(:Country) )
SET c:EXPORTED_2010
07-25-2022 10:21 AM - edited 07-25-2022 10:21 AM
@glilienfield i've tested it and run some optimizations, it is definitely the best way to do it.
So, in the first step we define relationship labels, than based on that we define node labels and use native projection with undirected relationship type.
Although I've assumed that the logical process would be to define some extra relationship labels and use more flexible cypher projection, that is not the way to go (at least in this case).
Thanks!
07-25-2022 03:05 PM
That is great it all worked out. You can delete the extra relationships when you are done projecting with the following query:
MATCH ()-[r:EXPORTED_2010]->()
DELETE r
07-19-2022 05:10 PM - edited 07-19-2022 05:11 PM
btw. @glilienfield
would the mere fact that i'm not loading direction
MATCH (c1:Country)-[e:EXPORTED]-(c2:Country)
with the query mean that the graph is going to be UNDIRECTED?
This is the complete query from above:
CALL gds.graph.project.cypher
(
'graph',
'MATCH (c:Country) RETURN id(c) AS id',
'MATCH (c1:Country)-[e:EXPORTED]-(c2:Country) WHERE (e.StartedExporting).year <=2010 AND (w.EndedExporting).year >=2010 RETURN id(c1) AS source, id(c2) AS target'
)
All the sessions of the conference are now available online