Neo4j

lingvisa · ‎11-18-2021

My csv file has two columns:

nid embedding
001 [0.001, 0.002, 0.003]
002 [0.004, 0.005, 0.006]

My current load statement in Python driver is:

CALL apoc.periodic.iterate("
CALL apoc.load.csv('topic.tsv', {nullValues:['','na','NAN',false], sep:'	'}) 
yield map as row", 
"MERGE (m:Topic {nid: row.nid})
ON CREATE SET m += row 
ON MATCH SET m += row 
RETURN count(m) as mcount", {batchSize:1000, iterateList:true, parallel:true})

However, this will load the 2nd column embedding as a whole string, not an array of floats in Neo4j. How to load them as an array of floats? I can also change my embedding column format if it make things easier.

I am thinking something like this:

CALL apoc.periodic.iterate("
CALL apoc.load.csv('topic.tsv', {nullValues:['','na','NAN',false], sep:'	'}) 
yield map as row", 
"
**WITH row.embedding = apoc.convert.fromJsonList(row.embedding)**
MERGE (m:Topic {nid: row.nid})
ON CREATE SET m += row 
ON MATCH SET m += row 
RETURN count(m) as mcount", {batchSize:1000, iterateList:true, parallel:true})

But the added WITH before MERGE is disallowed. How to update the row's value before MERGE?

giuseppe_villan · ‎11-19-2021

@lingvisa
I might be wrong, but the sep:' ' with multiple spaces looks weird to me,
because is equivalent to a sep:' '(with a single space) because the load.csv doesn't support multi char separator, so it considers only first char.

The provided file is a .tsv, so it should be separated by tabs, and in this case the correct separator is sep:'TAB'. Because with separator ' ', would be considered also spaces between 0.001, and 0.002.

That is, if I have a csv file like yours, with tab character (\t) after nid, 001 and 002
and other spaces are "real" whitespaces, I could be something like this (I used the apoc.map.setKey to update a key):

CALL apoc.periodic.iterate("
  CALL apoc.load.csv('topic.tsv', {nullValues:['','na','NAN',false], sep:'TAB'}) 
  yield map as row", 
  "
  WITH row
  WITH apoc.map.setKey(row, 'embedding', apoc.convert.fromJsonList(row.embedding)) as row
  MERGE (m:Topic {nid: row.nid})
  ON CREATE SET m += row 
  ON MATCH SET m += row 
  RETURN count(m) as mcount", 
{batchSize:1000, iterateList:true, parallel:true})

View solution in original post

giuseppe_villan · ‎11-19-2021

@lingvisa
I might be wrong, but the sep:' ' with multiple spaces looks weird to me,
because is equivalent to a sep:' '(with a single space) because the load.csv doesn't support multi char separator, so it considers only first char.

The provided file is a .tsv, so it should be separated by tabs, and in this case the correct separator is sep:'TAB'. Because with separator ' ', would be considered also spaces between 0.001, and 0.002.

That is, if I have a csv file like yours, with tab character (\t) after nid, 001 and 002
and other spaces are "real" whitespaces, I could be something like this (I used the apoc.map.setKey to update a key):

CALL apoc.periodic.iterate("
  CALL apoc.load.csv('topic.tsv', {nullValues:['','na','NAN',false], sep:'TAB'}) 
  yield map as row", 
  "
  WITH row
  WITH apoc.map.setKey(row, 'embedding', apoc.convert.fromJsonList(row.embedding)) as row
  MERGE (m:Topic {nid: row.nid})
  ON CREATE SET m += row 
  ON MATCH SET m += row 
  RETURN count(m) as mcount", 
{batchSize:1000, iterateList:true, parallel:true})

lingvisa · ‎11-19-2021

The sep:' ' is a display effect. When I compose the cypher statement from my code, delimiter is defined as '\t' in Python, but it shows as 2 white spaces on screen. I don't know the string 'TAB' can be used in here. Does it internally convert 'TAB' into '\t'?

It's good to know apoc.map.setKey() syntax, which is great. I also found another solution by modifying the apoc.load.csv() command:
CALL apoc.load.csv('topic.tsv', {nullValues:['','na','NAN',false], sep:' \t', mapping:{embedding:{array:true, arraySep:',', type:'float'}} })

The mapping parameter seems working well.

giuseppe_villan · ‎11-22-2021

Exactly, the 'TAB' is converted to '\t '

Neo4j

How to load float array values by apoc.load.csv() from csv file?