cancel
Showing results for 
Search instead for 
Did you mean: 

Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.

Encoding issue with space characters

mojo2go
Graph Buddy

I have pulled in data from separate data sources. I am of course de-silo-ing the data by making connections across them. However I found a problem. I ran a DISTINCT on the names and got 2 seemingly identical results. Well, okay, to the eye they are identical:

But their space characters are encoded differently. In the example image, the upper name's spaces all have the byte id of 32, and the lower name's spaces all are constructed of the two byte ids of 194 followed by 160.

Here is an article on the exact problem: UTF-8 encoded space (194 160) problem - Programmer All

I need to homogenize these two datasets so that these names are actually comparable. Is there an APOC or other approach, like using apoc.text.replace() to find one kind of space and convert it to the other to homogenize these?

Note: I try to copy and paste the spaces from the Neo4j desktop back into a replace function but I think the encoding gets homogenized when rendered to my screen.

One more thing: I have loaded all the data from the two datasets from csv files that came from large Excel files. Possibly I can change something in Excel or in my Visual Code editor to make them comparable.

1 ACCEPTED SOLUTION

Hi @mojo2go

Created a node for Keanu Reeves with 1 byte space and 2 bytes space.

CREATE (:Person {name:'Keanu Reeves', code:'ascii'});
CREATE (:Person {name:'Keanu Reeves', code:'utf8'});

This query will only retrieve nodes that have a space of 1 byte.

MATCH (n:Person)
  WHERE n.name = 'Keanu Reeves'
RETURN n;

This query will retrieve both.
It's hard to see the contents of the replace function, but the first character is a UTF8 space, and the next is an ASCII space.

MATCH (n:Person)
  WHERE replace(n.name,' ',' ') = 'Keanu Reeves'
RETURN n;

If I were you, I would create a separate name for the search.

MATCH (n:Person)
SET n.searchName = replace(n.name,' ',' ');

MATCH (n:Person)
  WHERE n.searchName = 'Keanu Reeves'
RETURN n;

Is this answer helpful?

View solution in original post

4 REPLIES 4

Hi @mojo2go

Created a node for Keanu Reeves with 1 byte space and 2 bytes space.

CREATE (:Person {name:'Keanu Reeves', code:'ascii'});
CREATE (:Person {name:'Keanu Reeves', code:'utf8'});

This query will only retrieve nodes that have a space of 1 byte.

MATCH (n:Person)
  WHERE n.name = 'Keanu Reeves'
RETURN n;

This query will retrieve both.
It's hard to see the contents of the replace function, but the first character is a UTF8 space, and the next is an ASCII space.

MATCH (n:Person)
  WHERE replace(n.name,' ',' ') = 'Keanu Reeves'
RETURN n;

If I were you, I would create a separate name for the search.

MATCH (n:Person)
SET n.searchName = replace(n.name,' ',' ');

MATCH (n:Person)
  WHERE n.searchName = 'Keanu Reeves'
RETURN n;

Is this answer helpful?

Hi Koji, I finally was able to find the one instance of the name from my source file that had a different encoding for 2 of its 3 spaces. That name, when compared with an equal sign ( = ) is considered different when compared to all other instances of the name. I learned this through extensive trial and error, copyng and pasting from the original source file into Neo4j. Once I found it, then I was able to build the replace query in the Neo4j browser as you described. I was hoping that there could be a way for me to use Neo4j alone to locate the problem character and fix it...like by using a conversion function. But I didn't find that for UTF8. I tried the only reasonable conversion function I could find: convert to string:

RETURN apoc.convert.toString(freakyspace) = apoc.convert.toString(' ')

But it did not homogenize the two spaces ; they did not become equal so the above query returned false.

Microsoft SQL Server has a lot of conversion options CAST and CONVERT (Transact-SQL) - SQL Server | Microsoft Docs . I don't know if those would help here.

I think what has caused this is a source Microsoft Excel file where a value with a different encoding was pasted into that file, so that there were differnent encodings in the same document...if that's even possible? I exported the excel to a .csv file, and that anomaly was preserved. I would love to be corrected.

Hi @mojo2go

I noticed that c2a0 is not a 2-byte space in UTF8, but a Non-breaking Space.
Non-breaking space can be entered on a Mac with option + space.
The first space in the following is option + space
The second one is space.

MATCH (n:Person)
  WHERE replace(n.name,' ',' ') = 'Keanu Reeves'
RETURN n;

I didn't even know that existed. I had to test in in my editor and it worked perfectly, gluing two adjacent words together. Thanks @koji

Nodes 2022
Nodes
NODES 2022, Neo4j Online Education Summit

All the sessions of the conference are now available online