12-04-2020 03:39 AM
Hello everyone!
I have a database with accented characters (namely “é, í, â, à”) and I'm having issues when filtering on fields that contain them. For instance, imagine a node whose Name is “Ángeles Martínez”. I attempted the following:
MATCH (p:PERSON)-[r:APPEARS_IN]->(p2:ARTICLE) WHERE p.Name =~ '(?i)angeles martinez' RETURN p
What I want is to match names that contain accented characters without typing those characters in the query. That is, I would like to write “angeles martinez” and have Neo4j return the node named “Ángeles Martínez”.
I have implemented the following solutions with no success at all:
I recently saw that a user-defined function (UDF) can be created and that it might solve the filtering issue. However, I'm planning to query the Neo4j database from Python, and these UDFs seem to be available only for Java.
Does anyone know how to address this issue?
Many thanks in advance
12-04-2020 05:41 AM
How about storing a second, accent-free version of the person's name at CREATE time?
The Cypher would look like this:
CREATE (:Person {
  name: $name,
  englishName: replace(replace(replace(replace(replace($name,'é','e'),'í','i'),'â','a'),'à','a'),'Á','A')
})
You can search the englishName.
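If the set of accented characters is open-ended, the accent-free value could also be computed on the client and passed in as a parameter instead of chaining replace() calls. Below is a minimal Python sketch using only the standard library's unicodedata module. The helper name strip_accents and the $englishName parameter are made up for illustration, and the driver call itself is left out.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Return text with combining marks (accents) removed; case is kept."""
    # NFD splits "Á" into "A" plus a separate combining acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    # combining() is non-zero for combining marks, so those are dropped.
    kept = "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)
    # Recompose whatever is left into the usual NFC form.
    return unicodedata.normalize("NFC", kept)


name = "Ángeles Martínez"
# Hypothetical parameter map: $name as in the CREATE above, plus an
# assumed $englishName parameter replacing the replace() chain.
params = {"name": name, "englishName": strip_accents(name)}
print(params["englishName"])
```

With this approach the CREATE would set englishName: $englishName, so any accent that Unicode can decompose is covered without listing each one.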
12-04-2020 08:32 AM
I would store a "DisplayName" property that holds the original Unicode string and a "NormalizedName" property that holds the same string with the diacritical marks removed.
Then you can query:
MATCH (p:PERSON)-[r:APPEARS_IN]->(p2:ARTICLE) WHERE p.NormalizedName = 'angeles martinez' RETURN p
If you don't know which form the search term is in, you can check both:
MATCH (p:PERSON)-[r:APPEARS_IN]->(p2:ARTICLE) WHERE p.NormalizedName = $searchname OR p.DisplayName = $searchname RETURN p
There are functions in various languages to remove diacritical marks for Unicode. Unfortunately, these functions aren't in APOC (yet):
You may have to write a UDF using the Java function.
I have made a PR for a function that will remove diacritical marks:
Storing both properties does take up more storage space, but queries will be faster and more flexible.
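Since the original poster plans to query from Python, the search term could be normalized on the client before it is sent. The following is only a sketch under a few assumptions: the property is spelled NormalizedName, the stored value was produced by the same function (accent-free and lowercased), and the parameter names $normalized and $raw are hypothetical. Running the query through a driver is not shown.

```python
import unicodedata

# Same shape as the two-property query above, with two hypothetical parameters.
QUERY = (
    "MATCH (p:PERSON)-[r:APPEARS_IN]->(p2:ARTICLE) "
    "WHERE p.NormalizedName = $normalized OR p.DisplayName = $raw "
    "RETURN p"
)


def normalize_name(text: str) -> str:
    """Strip diacritical marks and fold case so that comparisons are exact."""
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" covers non-spacing marks such as the combining acute accent.
    no_marks = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", no_marks).casefold()


def search_params(user_input: str) -> dict:
    """Build the parameter map for QUERY from whatever the user typed."""
    return {"raw": user_input, "normalized": normalize_name(user_input)}


print(search_params("Ángeles MARTÍNEZ"))
```

Because both sides of the comparison go through the same function, the WHERE clause can stay a plain equality, which is what allows an index on NormalizedName to be used.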
12-04-2020 10:05 AM
Use apoc.text.clean:
with "é, í, â, à" as s1
return apoc.text.clean(s1) as s2
Result: "eiaa"
12-04-2020 11:59 AM
WITH "Ángeles Martínez" AS s1
RETURN apoc.text.clean(s1) AS s2
The result is "angelesmartinez".
It’s good for search.
I've tried Japanese katakana as well.
WITH "アンジェルス・マルティネス" AS s1
RETURN apoc.text.clean(s1) AS s2
The result is "アンシェルスマルティネス".
Note that the voicing mark (dakuten) is removed from the name: ジ becomes シ.
WITH "はひふへほ ぱぴぷぺぽ ばびぶべぼ" AS s1
RETURN apoc.text.clean(s1) AS s2
The result is "はひふへほはひふへほはひふへほ".
Both the voicing mark (dakuten) and the p-sound mark (handakuten, the little circle) are stripped.
So apoc.text.clean can be used for Japanese text as well.
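The results above are what you would expect if the string is decomposed (Unicode NFD) and everything that is not a letter or digit is then dropped: ジ decomposes into シ plus a combining voiced sound mark, just as Á decomposes into A plus a combining acute accent. The Python sketch below reproduces the outputs quoted in this thread under that assumption. It is an approximation for illustration, not APOC's actual implementation, and the function name clean_like_apoc is made up.

```python
import unicodedata


def clean_like_apoc(text: str) -> str:
    """Approximate the observed behaviour: decompose, keep letters/digits, lowercase."""
    # NFD separates base characters from their combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only Unicode letters (L*) and numbers (N*); marks, spaces and
    # punctuation such as "・" fall away here.
    kept = "".join(
        ch for ch in decomposed if unicodedata.category(ch)[0] in ("L", "N")
    )
    # Recompose what remains and lowercase it.
    return unicodedata.normalize("NFC", kept).lower()


for sample in (
    "Ángeles Martínez",
    "アンジェルス・マルティネス",
    "はひふへほ ぱぴぷぺぽ ばびぶべぼ",
):
    print(sample, "->", clean_like_apoc(sample))
```

A helper like this could be used on the Python side to prepare search terms that are compared against values cleaned in the database, as long as both sides are checked to agree on your own data.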