cancel
Showing results for 
Search instead for 
Did you mean: 

Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.

How to load float array values by apoc.load.csv() from csv file?

lingvisa
Graph Fellow

My csv file has two columns:

nid embedding
001 [0.001, 0.002, 0.003]
002 [0.004, 0.005, 0.006]

My current load statement in Python driver is:

CALL apoc.periodic.iterate("
CALL apoc.load.csv('topic.tsv', {nullValues:['','na','NAN',false], sep:'	'}) 
yield map as row", 
"MERGE (m:Topic {nid: row.nid})
ON CREATE SET m += row 
ON MATCH SET m += row 
RETURN count(m) as mcount", {batchSize:1000, iterateList:true, parallel:true})

However, this will load the 2nd column embedding as a whole string, not an array of floats in Neo4j. How to load them as an array of floats? I can also change my embedding column format if it make things easier.

I am thinking something like this:

CALL apoc.periodic.iterate("
CALL apoc.load.csv('topic.tsv', {nullValues:['','na','NAN',false], sep:'	'}) 
yield map as row", 
"
**WITH row.embedding = apoc.convert.fromJsonList(row.embedding)**
MERGE (m:Topic {nid: row.nid})
ON CREATE SET m += row 
ON MATCH SET m += row 
RETURN count(m) as mcount", {batchSize:1000, iterateList:true, parallel:true})

But the added WITH before MERGE is disallowed. How to update the row's value before MERGE?

1 ACCEPTED SOLUTION

@lingvisa
I might be wrong, but the sep:' ' with multiple spaces looks weird to me,
because is equivalent to a sep:' '(with a single space) because the load.csv doesn't support multi char separator, so it considers only first char.

The provided file is a .tsv, so it should be separated by tabs, and in this case the correct separator is sep:'TAB'. Because with separator ' ', would be considered also spaces between 0.001, and 0.002.

That is, if I have a csv file like yours, with tab character (\t) after nid, 001 and 002
and other spaces are "real" whitespaces, I could be something like this (I used the apoc.map.setKey to update a key):

CALL apoc.periodic.iterate("
  CALL apoc.load.csv('topic.tsv', {nullValues:['','na','NAN',false], sep:'TAB'}) 
  yield map as row", 
  "
  WITH row
  WITH apoc.map.setKey(row, 'embedding', apoc.convert.fromJsonList(row.embedding)) as row
  MERGE (m:Topic {nid: row.nid})
  ON CREATE SET m += row 
  ON MATCH SET m += row 
  RETURN count(m) as mcount", 
{batchSize:1000, iterateList:true, parallel:true})

View solution in original post

3 REPLIES 3

@lingvisa
I might be wrong, but the sep:' ' with multiple spaces looks weird to me,
because is equivalent to a sep:' '(with a single space) because the load.csv doesn't support multi char separator, so it considers only first char.

The provided file is a .tsv, so it should be separated by tabs, and in this case the correct separator is sep:'TAB'. Because with separator ' ', would be considered also spaces between 0.001, and 0.002.

That is, if I have a csv file like yours, with tab character (\t) after nid, 001 and 002
and other spaces are "real" whitespaces, I could be something like this (I used the apoc.map.setKey to update a key):

CALL apoc.periodic.iterate("
  CALL apoc.load.csv('topic.tsv', {nullValues:['','na','NAN',false], sep:'TAB'}) 
  yield map as row", 
  "
  WITH row
  WITH apoc.map.setKey(row, 'embedding', apoc.convert.fromJsonList(row.embedding)) as row
  MERGE (m:Topic {nid: row.nid})
  ON CREATE SET m += row 
  ON MATCH SET m += row 
  RETURN count(m) as mcount", 
{batchSize:1000, iterateList:true, parallel:true})

The sep:' ' is a display effect. When I compose the cypher statement from my code, delimiter is defined as '\t' in Python, but it shows as 2 white spaces on screen. I don't know the string 'TAB' can be used in here. Does it internally convert 'TAB' into '\t'?

It's good to know apoc.map.setKey() syntax, which is great. I also found another solution by modifying the apoc.load.csv() command:
CALL apoc.load.csv('topic.tsv', {nullValues:['','na','NAN',false], sep:' \t', mapping:{embedding:{array:true, arraySep:',', type:'float'}} })

The mapping parameter seems working well.

Exactly, the 'TAB' is converted to '\t '