Neo4j

takuya · ‎02-07-2020

Hi there.
I tried neo4j-etl cli tool V1.3.1.

This is very fast for importing only nodes, but it seems to be very slowly when relations import.

Is it check the connection target table in the process of generating the CSV?

Details are as follows,

Environment

Neo4j 3.5.11
Neo4j ETL Tool V1.3.1
PostgreSQL
Server Memory 8g

Situation

RDB(PostgreSQL) has one table for node and relation table like under ER.
There are 3 sets it same structural tables and actually more columns has it.

I edited the mapping.json like this.

[ {
  "name" : "NODE1_db1",
  "schema" : "db1",
  "graph-object-type" : "Node",
  "sql" : "SELECT id FROM db1.company AS company",
  "mappings" : [ {
    "column" : {
      "type" : "CompositeColumn",
      "table" : "company",
      "schema" : "db1",
      "role" : "PrimaryKey",
      "columns" : [ {
        "type" : "SimpleColumn",
        "role" : "Data",
        "table" : "company",
        "schema" : "db1",
        "name" : "id",
        "alias" : "id",
        "sql-data-type" : "INTEGER",
        "column-value-selection-strategy" : "SelectColumnValue"
      } ]
    },
    "field" : {
      "type" : "Id",
      "name" : "",
      "id-space" : "db1.company"
    }
  }]}, {
  "name" : "Relation_db1",
  "schema" : "db1",
  "graph-object-type" : "Relation",
  "sql" : "SELECT start_id, end_id, relation_type FROM (SELECT start_id, end_id, relation_type FROM db1.company AS start_company JOIN db1.relation AS relation ON start_company.id = relation.start_id JOIN db1.company AS end_company ON end_company.id = relation.end_id) AS company_relation",
  "mappings" : [ {
    "column" : {
      "type" : "SimpleColumn",
      "role" : "Data",
      "table" : "company",
      "schema" : "db1",
      "name" : "start_id",
      "alias" : "start_id",
      "sql-data-type" : "INTEGER",
      "column-value-selection-strategy" : "SelectColumnValue"
    },
    "field" : {
      "type" : "startId",
      "name" : "start_id",
      "id-space" : "db1.company_relation"
    }},{
    "column" : {
        "type" : "SimpleColumn",
        "role" : "Data",
        "table" : "company",
        "schema" : "db1",
        "name" : "end_id",
        "alias" : "end_id",
        "sql-data-type" : "INTEGER",
        "column-value-selection-strategy" : "SelectColumnValue"
    },
    "field" : {
        "type" : "endId",
        "name" : "end_id",
        "id-space" : "db1.company_relation"
    }}, {
      "column" : {
        "type" : "SimpleColumn",
        "role" : "Data",
        "table" : "company",
        "schema" : "db1",
        "name" : "relation_type",
        "alias" : "relation_type",
        "sql-data-type" : "INTEGER",
        "column-value-selection-strategy" : "SelectColumnValue"
      },
      "field" : {
        "type" : "RELATION_TYPE",
        "name" : "relation_type",
        "id-space" : "db1.company_relation"
      }
  }]
  }
]

The number of result records is 700,000 nodes and 2,500,000 relations.

Result

10300s, about 3 hours..

When I deleted the description of Relation and tried to execute it only on the node, it ended in about 20 seconds. Searching the RDB with SQL alone takes only a few seconds, so I think that there is something in the Neo4j ETL logic.

please let me know if you have any informations
Thanks.

Thomas_Silkjaer · ‎02-07-2020

Importing relationships is very very slow. When creating a relationship between two nodes, the database locks both nodes. So it is not even possible to speed it up with parallel processing unless it can be guaranteed that no two processes run at the same time, is making a relationship with an overlapping node.

I don't have any experience with the ETL tool, but if you are injecting data in a new database, I would definitely look into using the neo4j-admin import tool and make exports from the PostgreSQL database to CSV, preprocess it with a scripting language of your choice (make sure there are no duplicates, escape quotes, write csv files as documented …), and then run the import.

View solution in original post

Thomas_Silkjaer · ‎02-07-2020

Importing relationships is very very slow. When creating a relationship between two nodes, the database locks both nodes. So it is not even possible to speed it up with parallel processing unless it can be guaranteed that no two processes run at the same time, is making a relationship with an overlapping node.

I don't have any experience with the ETL tool, but if you are injecting data in a new database, I would definitely look into using the neo4j-admin import tool and make exports from the PostgreSQL database to CSV, preprocess it with a scripting language of your choice (make sure there are no duplicates, escape quotes, write csv files as documented …), and then run the import.

takuya · ‎02-07-2020

Lock!!!! That's the point!

As you say, admin-import is very fast it alone, so I'm going to go that way CSV export from PostgreSQL.

Your explanation was really helpful!
Thanks a lot

Neo4j

I want to understand the logic for neo4j-etl cli export

Environment

Situation

Result