08-13-2019 11:15 AM
I am doing A/B testing to measure the throughput of node creation in Neo4j, and I find that throughput decreases significantly as the number of properties increases.
Setup: Neo4j cluster 3.5.7 (3 core instances, where one is the leader and the other two are followers). I tried the same experiment on a single node as well and observed the same behavior, but all the results below ran on the 3-node cluster.
TestA: measures the throughput for creation of nodes in Neo4j where each node has 20 properties.
TestB: measures the throughput for creation of nodes in Neo4j where each node has 40 properties.
Result: Throughput for TestB = 1/2 * Throughput for TestA
Below is the code I used to generate the load and measure the throughput.
import org.neo4j.driver.v1.*;

import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;

public class UnwindCreateNodes {

    Driver driver;
    static int start;
    static int end;

    public UnwindCreateNodes(String uri, String user, String password) {
        Config config = Config.build()
                .withConnectionTimeout(10, TimeUnit.SECONDS)
                .toConfig();
        driver = GraphDatabase.driver(uri, AuthTokens.basic(user, password), config);
    }

    private void addNodes() {
        List<Map<String, Object>> listOfProperties = new ArrayList<>();
        for (int inner = start; inner < end; inner++) {
            Map<String, Object> properties = new HashMap<>();
            properties.put("name", "Jhon " + inner);
            properties.put("last", "Alan" + inner);
            properties.put("id", 2 + inner);
            properties.put("key", "1234" + inner);
            // field5 .. field40 all carry the same filler string
            for (int f = 5; f <= 40; f++) {
                properties.put("field" + f, "kfhc iahf uheguehuguaeghuszjxcb sd");
            }
            listOfProperties.add(properties);
        }

        int noOfNodes = 0;
        // round up so a final partial batch is not silently dropped
        int batches = (listOfProperties.size() + 4999) / 5000;
        for (int i = 0; i < batches; i++) {
            List<Map<String, Object>> events = new ArrayList<>();
            for (; noOfNodes < (i + 1) * 5000 && noOfNodes < listOfProperties.size(); noOfNodes++) {
                events.add(listOfProperties.get(noOfNodes));
            }
            Map<String, Object> params = new HashMap<>();
            params.put("events", events);
            String query = "UNWIND $events AS event CREATE (a:Label) SET a += event";
            Instant startTime = Instant.now();
            try (Session session = driver.session()) {
                session.writeTransaction(tx -> tx.run(query, params));
            }
            long timeElapsed = Duration.between(startTime, Instant.now()).toMillis();
            System.out.println("no of nodes per batch: " + events.size());
            System.out.println("timeElapsed (ms): " + timeElapsed);
        }
    }

    public void close() {
        driver.close();
    }

    public static void main(String... args) {
        start = 200001;
        end = 400001;
        if (args.length == 2) {
            start = Integer.parseInt(args[0]);
            end = Integer.parseInt(args[1]);
        }
        UnwindCreateNodes unwindCreateNodes = new UnwindCreateNodes("bolt+routing://x.x.x.x:7687", "neo4j", "neo4j");
        unwindCreateNodes.addNodes();
        unwindCreateNodes.close();
    }
}
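As a side note on the batching loop: the manual index arithmetic can be written more compactly with List.subList. The helper below is illustrative (not part of the original benchmark) and uses only plain Java collections, independent of the driver:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchSlicer {
    // Split a list into consecutive batches of at most batchSize elements.
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int from = 0; from < items.size(); from += batchSize) {
            int to = Math.min(from + batchSize, items.size());
            batches.add(items.subList(from, to));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 12_000; i++) data.add(i);
        List<List<Integer>> batches = partition(data, 5000);
        // 12 000 items -> batches of 5000, 5000, 2000
        System.out.println(batches.size());        // 3
        System.out.println(batches.get(2).size()); // 2000
    }
}
```

Note that subList returns views over the backing list, so no element copying happens until the driver serializes each batch.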
Below is the graph.
It takes 3.5 seconds to insert 5000 nodes where each node has 40 properties
It takes 1.8 seconds to insert 5000 nodes where each node has 20 properties
This is a significant slowdown, and 40 is not a big number of properties. I have a requirement for up to 100 properties, but if it does not scale to 40, I am not sure how it can scale to 100.
Other approaches I tried:
- apoc.periodic.iterate
- with and without UNWIND
- plain CREATE
The behavior persists with all of them.
I don't want to store properties in an external store like an RDBMS, because that complicates things for me: I am building a generic application where I have no idea in advance which properties are going to be used.
I cannot use the CSV import tool either, because my data is coming from Kafka and is not structured the way the CSV tool expects. So no CSV tool for me.
Any idea to speed this up?
08-13-2019 07:01 PM
Do you have any indexes on any of those properties for :Label nodes? If so, you'll need to consider that the index needs to be updated with each insert too. Keep in mind also that in a cluster you're dealing with network I/O for Raft transactions; there is latency for consensus commits.
Seems like when you double the data that you're inserting, you're getting roughly double the execution time. That seems like a linear scaling, is this unexpected?
We are always looking to improve, and we are definitely eyeing some changes to our property store with an aim to improve efficiency sometime after the next major 4.0 release. I'm not sure if there's anything else we can spot here, but if we see anything relevant we'll let you know.
08-13-2019 08:52 PM
There are no indexes. Moreover, I tried both with and without indexes, and the difference is negligible.
"Seems like when you double the data that you're inserting, you're getting roughly double the execution time. That seems like a linear scaling, is this unexpected?"
To be precise: when I double the properties per node, the execution time doubles, which means the throughput is halved. This is a significant slowdown, and it is certainly not what I expected, because I thought Neo4j would handle it just fine.
When can we expect the 4.0 release? I don't need an exact date, but will it happen in Fall, Spring, or Winter?
08-15-2019 11:44 AM
The properties often take up the most room in the store, so by effectively doubling the properties on the node you're basically doubling the data that needs to be inserted, when using the same number of nodes. Seeing about double the insertion time seems to be a natural consequence.
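As a rough back-of-envelope sketch of why record writes scale with property count (this assumes the 3.5-era record store, where a property record holds a small fixed number of property blocks; the capacity of 4 below is an assumption for illustration, not an exact store constant):

```java
public class PropertyRecordEstimate {
    // Assumed capacity: property blocks per property record.
    static final int BLOCKS_PER_RECORD = 4;

    // Ceiling division: records needed to hold propertyCount simple properties.
    static int recordsForProperties(int propertyCount) {
        return (propertyCount + BLOCKS_PER_RECORD - 1) / BLOCKS_PER_RECORD;
    }

    public static void main(String[] args) {
        System.out.println(recordsForProperties(20)); // 5
        System.out.println(recordsForProperties(40)); // 10 -> twice the records to write per node
    }
}
```

Under that model, doubling the property count doubles the property records written (and replicated through Raft) per node, which lines up with the roughly doubled insertion time.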
As previously mentioned, there are improvements scheduled on our store files, notably the properties store, that should see some improvement.
The 4.0 release is looking like late Winter 2019. The store improvements won't come with the 4.0 release, we're more likely to see them in 2020.
08-16-2019 02:41 AM
@andrew.bowman OK, I modified my benchmark code. Now I keep the total data size the same when comparing throughput between 10 properties of type long vs 20 properties of type int.
Code for 10 properties of type long:
import org.neo4j.driver.v1.*;

import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;

public class UnwindCreateNodes {

    Driver driver;
    static int start;
    static int end;

    public UnwindCreateNodes(String uri, String user, String password) {
        Config config = Config.build()
                .withConnectionTimeout(10, TimeUnit.SECONDS)
                .toConfig();
        driver = GraphDatabase.driver(uri, AuthTokens.basic(user, password), config);
    }

    private void addNodes() {
        List<Map<String, Object>> listOfProperties = new ArrayList<>();
        for (int inner = start; inner < end; inner++) {
            Map<String, Object> properties = new HashMap<>();
            // field1 .. field10, all of type long
            for (int f = 1; f <= 10; f++) {
                properties.put("field" + f, 2L);
            }
            listOfProperties.add(properties);
        }

        int noOfNodes = 0;
        // round up so a final partial batch is not silently dropped
        int batches = (listOfProperties.size() + 4999) / 5000;
        for (int i = 0; i < batches; i++) {
            List<Map<String, Object>> events = new ArrayList<>();
            for (; noOfNodes < (i + 1) * 5000 && noOfNodes < listOfProperties.size(); noOfNodes++) {
                events.add(listOfProperties.get(noOfNodes));
            }
            Map<String, Object> params = new HashMap<>();
            params.put("events", events);
            String query = "UNWIND $events AS event CREATE (a:Label) SET a += event";
            Instant startTime = Instant.now();
            try (Session session = driver.session()) {
                session.writeTransaction(tx -> tx.run(query, params));
            }
            long timeElapsed = Duration.between(startTime, Instant.now()).toMillis();
            System.out.println("no of nodes per batch: " + events.size());
            System.out.println("timeElapsed (ms): " + timeElapsed);
        }
    }

    public void close() {
        driver.close();
    }

    public static void main(String... args) {
        start = 200001;
        end = 400001;
        if (args.length == 2) {
            start = Integer.parseInt(args[0]);
            end = Integer.parseInt(args[1]);
        }
        UnwindCreateNodes unwindCreateNodes = new UnwindCreateNodes("bolt+routing://x.x.x.x:7687", "neo4j", "neo4j");
        unwindCreateNodes.addNodes();
        unwindCreateNodes.close();
    }
}
Code for 20 properties of type int:
import org.neo4j.driver.v1.*;

import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;

public class UnwindCreateNodes {

    Driver driver;
    static int start;
    static int end;

    public UnwindCreateNodes(String uri, String user, String password) {
        Config config = Config.build()
                .withConnectionTimeout(10, TimeUnit.SECONDS)
                .toConfig();
        driver = GraphDatabase.driver(uri, AuthTokens.basic(user, password), config);
    }

    private void addNodes() {
        List<Map<String, Object>> listOfProperties = new ArrayList<>();
        for (int inner = start; inner < end; inner++) {
            Map<String, Object> properties = new HashMap<>();
            // field1 .. field20, all of type int
            for (int f = 1; f <= 20; f++) {
                properties.put("field" + f, 2);
            }
            listOfProperties.add(properties);
        }

        int noOfNodes = 0;
        // round up so a final partial batch is not silently dropped
        int batches = (listOfProperties.size() + 4999) / 5000;
        for (int i = 0; i < batches; i++) {
            List<Map<String, Object>> events = new ArrayList<>();
            for (; noOfNodes < (i + 1) * 5000 && noOfNodes < listOfProperties.size(); noOfNodes++) {
                events.add(listOfProperties.get(noOfNodes));
            }
            Map<String, Object> params = new HashMap<>();
            params.put("events", events);
            String query = "UNWIND $events AS event CREATE (a:Label) SET a += event";
            Instant startTime = Instant.now();
            try (Session session = driver.session()) {
                session.writeTransaction(tx -> tx.run(query, params));
            }
            long timeElapsed = Duration.between(startTime, Instant.now()).toMillis();
            System.out.println("no of nodes per batch: " + events.size());
            System.out.println("timeElapsed (ms): " + timeElapsed);
        }
    }

    public void close() {
        driver.close();
    }

    public static void main(String... args) {
        start = 200001;
        end = 400001;
        if (args.length == 2) {
            start = Integer.parseInt(args[0]);
            end = Integer.parseInt(args[1]);
        }
        UnwindCreateNodes unwindCreateNodes = new UnwindCreateNodes("bolt+routing://x.x.x.x:7687", "neo4j", "neo4j");
        unwindCreateNodes.addNodes();
        unwindCreateNodes.close();
    }
}
The same behavior persists: throughput drops by half when the number of properties doubles, even though the total data size is the same in both experiments.
I also tried different types, like strings: 10 properties of string length 6 vs 20 properties of string length 3. The same behavior persists. So this clearly says the property store needs to be redesigned, but if that takes until 2020 that's a bit too long.
08-16-2019 07:26 AM
Just to note, Cypher only works with 64-bit numeric types (see the Cypher type system mappings), so what you're doing with Integers vs Longs makes no difference when going through Cypher: both are converted to 64-bit longs. This is why you aren't seeing a difference.
Smaller integer types (and others) can be used with embedded Neo4j via the core API. I think you might be able to use them in custom procedures as well, if you use the core API to write the properties.
08-16-2019 07:13 AM
Did you try to parallelize your work?
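For what it's worth, one way to parallelize is to submit each batch to a fixed thread pool so several transactions are in flight at once. The sketch below uses a stand-in writeBatch method; in the real benchmark that call would open a session and run the UNWIND ... CREATE query from the listings above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelBatches {
    static final AtomicInteger written = new AtomicInteger();

    // Stand-in for the real work: in the benchmark this would open its own
    // Session and run the UNWIND ... CREATE query for one batch.
    static void writeBatch(List<Integer> batch) {
        written.addAndGet(batch.size());
    }

    public static void main(String[] args) throws Exception {
        List<Integer> all = new ArrayList<>();
        for (int i = 0; i < 20_000; i++) all.add(i);

        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<?>> futures = new ArrayList<>();
        for (int from = 0; from < all.size(); from += 5000) {
            List<Integer> batch = all.subList(from, Math.min(from + 5000, all.size()));
            futures.add(pool.submit(() -> writeBatch(batch)));
        }
        for (Future<?> f : futures) f.get(); // wait for all batches to finish
        pool.shutdown();
        System.out.println(written.get()); // 20000
    }
}
```

Note that driver Session objects are not thread-safe, so each task must open its own session (the Driver itself is safe to share across threads).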
08-16-2019 12:03 PM
@andrew.bowman Converting integers to longs doesn't sound efficient at all, but that's another problem in itself. As I said in my previous post, I also ran another experiment: 10 properties of string length 6 vs 20 properties of string length 3. You can see the blue vs red lines; the same behavior persists. Again, this clearly says the property store needs to be redesigned.
What other data types should I use to further demonstrate that the same behavior exists regardless of the data size? Should I run another experiment with 10 properties of type int vs 80 properties of type boolean? (80 because I hear Java reserves 1 byte for a boolean but only uses 1 bit.)