Neo4j

guilherme_junqu · ‎10-26-2018

Guys,

I am trying to run the following code and it is giving me a "non-related" error:

USING PERIODIC COMMIT 3000 LOAD CSV WITH HEADERS
FROM "file:///tse-votacao_candidato_municipio_zona-municipal-2014-db.07.facts.csv" AS row
FIELDTERMINATOR ';'

MATCH
  (c:City {tse_code: toInteger(row.cod_municipio_tse)}),
  (:Publication {auto_name: row.publication})<-[:is_present_on]-(m:Metric {auto_name: row.cod_metrica}),
  (:State {acronym: row.sigla_uf})<-[:belongs_to]-(z:ElectoralZone {code: row.cod_zona_eleitoral}),
  (:Election {year: date(row.ano_eleicao), auto_name: row.cod_descricao_eleicao})<-[:round_of]-
    (:ElectionRound {number: row.num_turno})<-[:runs_in]-(cand:Candidate {code: row.sq_candidato})
CREATE
  (afe:Measurement)
SET
  afe.value = toInteger(row.total_votos),
  afe.unit  = 'votes',
  afe.date  = date(row.data_arquivo)
WITH
  m, cand, afe, c, z
CREATE UNIQUE
   (afe)-[:taken_from]->(c),
   (afe)-[:taken_of]->(m),
   (afe)-[:filtered_by]->(cand),
   (afe)-[:filtered_by]->(z);

I consider the error non-related because I have tried running the code above without the updating part for each of the MATCH patterns, and each runs without problems. Apparently, the problem occurs only when I put all of them together.

The header and the first 5 data rows of the file (which has more than 7M lines) are listed below:

publication;cod_metrica;cod_descricao_eleicao;ano_eleicao;num_turno;sigla_uf;cod_zona_eleitoral;cod_municipio_tse;sq_candidato;data_arquivo;total_votos
votacao_candidato_municipio_zona-municipal-2014;votacao-nominal-por-canditato-por-eleicao-e-zona;eleicoes-gerais-2014;2014;1;AC;9;01007;10000000003;2018-05-17;1508
votacao_candidato_municipio_zona-municipal-2014;votacao-nominal-por-canditato-por-eleicao-e-zona;eleicoes-gerais-2014;2014;1;AC;9;01007;10000000001;2018-05-17;3027
votacao_candidato_municipio_zona-municipal-2014;votacao-nominal-por-canditato-por-eleicao-e-zona;eleicoes-gerais-2014;2014;1;AC;9;01007;10000000048;2018-05-17;0
votacao_candidato_municipio_zona-municipal-2014;votacao-nominal-por-canditato-por-eleicao-e-zona;eleicoes-gerais-2014;2014;1;AC;9;01007;10000000146;2018-05-17;21
votacao_candidato_municipio_zona-municipal-2014;votacao-nominal-por-canditato-por-eleicao-e-zona;eleicoes-gerais-2014;2014;1;AC;9;01007;10000000152;2018-05-17;2540

I get the following error:

Neo.DatabaseError.General.UnknownError: unknown value: (2014-01-01) of type class java.time.LocalDate)

What might be going on here?

Thanks in advance,

guilherme_junqu · ‎10-26-2018

Guys, I would like to "increase" my suspicion that this is a bug.

Since I am stuck with this error, I started trying different approaches to solve my problem (mainly refactoring my query). When I tried this query, the java.time.LocalDate error vanished!

USING PERIODIC COMMIT 1000 LOAD CSV WITH HEADERS
FROM "file:///tse-votacao_candidato_municipio_zona-municipal-2014-db.07.facts.csv" AS row
FIELDTERMINATOR ';'

MATCH
  (c:City),
  (p:Publication)<-[:is_present_on]-(m:Metric),
  (s:State)<-[:belongs_to]-(z:ElectoralZone),
  (e:Election)<-[:round_of]-(er:ElectionRound)<-[:runs_in]-(cand:Candidate)
WHERE
  c.tse_code      = toInteger(row.cod_municipio_tse)
  and p.auto_name = row.publication
  and m.auto_name = row.cod_metrica
  and s.acronym   = row.sigla_uf
  and z.code      = row.cod_zona_eleitoral
  and e.year      = date(row.ano_eleicao)
  and e.auto_name = row.cod_descricao_eleicao
  and er.number   = row.num_turno
  and cand.code   = row.sq_candidato
CREATE
  (afe:Measurement)
SET
  afe.value   = toInteger(row.total_votos),
  afe.unit    = 'votes',
  afe.date    = date(row.data_arquivo)
WITH
  m, cand, afe, c, z
MERGE
  (afe)-[:taken_from]->(c)
MERGE
  (afe)-[:taken_of]->(m)
MERGE
  (afe)-[:filtered_by]->(cand)
MERGE
  (afe)-[:filtered_by]->(z);

Now I am struggling with an OutOfMemoryError, but at least I know what that one is related to...

stefan_armbrust · ‎10-26-2018

I suspect you're suffering from the well-known "Eager" problem, see https://markhneedham.com/blog/2014/10/23/neo4j-cypher-avoiding-the-eager/

guilherme_junqu · ‎10-26-2018

Stefan,

I profiled my query earlier and I found no 'Eager' in the plan I saw.

It is more likely that my memory constraints are responsible here...

But, the main question remains: why did the LocalDate error vanish with the code refactoring?

stefan_armbrust · ‎10-27-2018

I've just tried your second statement, excluding PERIODIC COMMIT and prefixed with EXPLAIN. The query plan does indeed contain an Eager. So the whole CSV import will run in a single transaction, which is the cause of the OOM.
Split the action into multiple smaller ones not showing eager and iterate multiple times over the large file.
Regarding the date error: I couldn't reproduce this.

guilherme_junqu · ‎10-27-2018

Stefan,

I was able to run the statement without the OOM error by decreasing the batch size of the periodic commit.

As I said, I checked here and I did not find the eager step when I profiled (not explained) the statement.

(If I am not mistaken, PROFILE actually runs the query, while EXPLAIN only shows the plan that would be used, without running it.)

Thank you for the website you sent (good material!!) but I would like to focus on the other error, if possible.
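The difference between the two prefixes can be sketched as follows. This is a reduced, illustrative version of the import (only the City match is kept, and the `WITH row LIMIT 100` cap and `count(*)` return are assumptions added so that the profiled run stays small and read-only); it is not the exact statement either poster ran.

```cypher
// EXPLAIN: the statement is only planned. Nothing is executed or written,
// so the plan can be inspected for an Eager operator at no cost.
EXPLAIN
LOAD CSV WITH HEADERS
FROM "file:///tse-votacao_candidato_municipio_zona-municipal-2014-db.07.facts.csv" AS row
FIELDTERMINATOR ';'
MATCH (c:City {tse_code: toInteger(row.cod_municipio_tse)})
CREATE (afe:Measurement {value: toInteger(row.total_votos)})
CREATE (afe)-[:taken_from]->(c);

// PROFILE: the statement is executed and the plan is annotated with the
// actual rows and db hits. Kept read-only and capped at 100 CSV rows here.
PROFILE
LOAD CSV WITH HEADERS
FROM "file:///tse-votacao_candidato_municipio_zona-municipal-2014-db.07.facts.csv" AS row
FIELDTERMINATOR ';'
WITH row LIMIT 100
MATCH (c:City {tse_code: toInteger(row.cod_municipio_tse)})
RETURN count(*);
```

Because PROFILE really runs the statement, profiling the full import would also perform all of its writes.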

Best regards,

guilherme_junqu · ‎10-27-2018

Just some additional thoughts on why we have different outputs:

I used profile, not explain.
The query optimizer takes a lot of information into account when choosing how to run the query. I guess the presence of my indexes and some statistics play an important role here.
Although the article you sent is really interesting, it is for an older version of Neo4j. I don't know if this eager step is currently as common as it was before.

Regards,

michael_hunger · ‎10-27-2018

It is not as common anymore, but it still shows up, and when it does it effectively disables periodic commit.

Perhaps the LocalDate issue came up b/c it got further in the data?

Do you have a value of (2014-01-01) in your data file (with the parentheses)?

guilherme_junqu · ‎10-27-2018

No Michael,

The only dates with 2014 are related to :Election in the Match part of the statement.
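For reference, the value in the error message matches what `date()` produces for a year-only string such as the `ano_eleicao` column, so it is consistent with coming from the `date(row.ano_eleicao)` expression rather than from a literal in the file. A minimal check, using literals taken from the sample rows above (the column aliases are only illustrative):

```cypher
// A year-only string is accepted by date(); month and day default to 01.
RETURN date('2014')       AS election_year,  // 2014-01-01
       date('2018-05-17') AS file_date;      // 2018-05-17
```

This only shows where a 2014-01-01 value can originate; it does not explain why the first form of the query raised the error.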

They were imported into Neo4j earlier with zero problems. That's why I double-checked the MATCH patterns one by one.

Thanks,

michael_hunger · ‎10-27-2018

you can switch to

call apoc.periodic.iterate(
'LOAD CSV ... AS row RETURN row',
'MATCH ...', 
{batchSize:10000, iterateList:true});

that should get rid of the OOM

guilherme_junqu · ‎10-29-2018

Hi @michael.hunger,

Can you please clarify the differences between the suggested apoc.periodic.iterate and the previous LOAD CSV ?

By the way, should I keep the periodic commit in the inner statement or is it useless in this approach?

Thanks!

michael_hunger · ‎11-21-2018

The OOM happens b/c the transaction's memory footprint gets too large. apoc.periodic.iterate batches it up.

Cypher itself has stricter guarantees about visibility; that's why it's possible to accidentally disable PERIODIC COMMIT.
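Filled in with the second query from this thread, the suggested call might look roughly like the sketch below. It assumes the APOC plugin is installed; the batch size is the one suggested above and may need to be lowered on a memory-constrained machine. USING PERIODIC COMMIT is left out on purpose, since the batching and committing are done by apoc.periodic.iterate itself.

```cypher
CALL apoc.periodic.iterate(
  // Outer statement: only streams the CSV rows, no writes.
  "LOAD CSV WITH HEADERS
   FROM 'file:///tse-votacao_candidato_municipio_zona-municipal-2014-db.07.facts.csv' AS row
   FIELDTERMINATOR ';'
   RETURN row",
  // Inner statement: applied to each row, committed once per batch.
  "MATCH
     (c:City),
     (p:Publication)<-[:is_present_on]-(m:Metric),
     (s:State)<-[:belongs_to]-(z:ElectoralZone),
     (e:Election)<-[:round_of]-(er:ElectionRound)<-[:runs_in]-(cand:Candidate)
   WHERE
     c.tse_code      = toInteger(row.cod_municipio_tse)
     AND p.auto_name = row.publication
     AND m.auto_name = row.cod_metrica
     AND s.acronym   = row.sigla_uf
     AND z.code      = row.cod_zona_eleitoral
     AND e.year      = date(row.ano_eleicao)
     AND e.auto_name = row.cod_descricao_eleicao
     AND er.number   = row.num_turno
     AND cand.code   = row.sq_candidato
   CREATE (afe:Measurement)
   SET
     afe.value = toInteger(row.total_votos),
     afe.unit  = 'votes',
     afe.date  = date(row.data_arquivo)
   MERGE (afe)-[:taken_from]->(c)
   MERGE (afe)-[:taken_of]->(m)
   MERGE (afe)-[:filtered_by]->(cand)
   MERGE (afe)-[:filtered_by]->(z)",
  {batchSize: 10000, iterateList: true}
)
YIELD batches, total, errorMessages
RETURN batches, total, errorMessages;
```

Each batch runs in its own transaction, so an Eager operator inside the inner statement only affects that batch instead of pulling the whole file into one transaction.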

Unknown value error for class java.time.LocalDate when importing data