Creating Relationships takes a very long time

Hi,

I am working with Twitter data and creating nodes for tweets. More specifically, each row in my CSV holds 2 tweets, some tweet properties (likes, retweets, and such), and a similarity score computed earlier in Python. I then connect the tweets based on their score (since rows below a certain score were already removed from the dataset, I create a relationship for every row in the CSV).

I do the following:

  1. Delete everything to start clean
  2. Create nodes for every tweet in column TweetA (5,000 nodes, 50k properties, ~20 minutes)
  3. Create nodes for every tweet in column TweetB (very little changes, if anything; another ~20 minutes)
  4. Finally, load the CSV a third time and create a relationship between each TweetA and TweetB based on their tokens.

The very last step takes insanely long. I've tried a few times now and I always end up with a WebSocket error (I think my PC goes to sleep and the Neo4j Browser disconnects). Steps 2 and 3 take about 20 minutes each. The dataset has 230k rows and is 200 MB. Any ideas how to optimize my queries?

 

//Delete Everything
MATCH (n)
DETACH DELETE n;

 

//Create Tweets from TweetA
LOAD CSV WITH HEADERS FROM 'file:///Tweet2Tweet.csv' AS row
WITH row WHERE NOT row.UserA IS NULL
MERGE (tweet:Tweet {
	TweetID: row.TweetTokenA,
	Tweet: row.text_x,
	AuthorToken: row.AuthorTokenA,
	AuthorHandle: row.UserA,
	CreatedAt: row.a_created_at,
	Retweets: row.a_rt_cnt,
	Replies: row.a_reply_cnt,
	Likes: row.a_like_count,
	Quotes: row.a_qt_count,
	Aspect: row.AspectsA
});

 

//Create Tweets from TweetB
LOAD CSV WITH HEADERS FROM 'file:///Tweet2Tweet.csv' AS row
WITH row WHERE NOT row.UserB IS NULL
MERGE (tweet:Tweet {
	TweetID: row.TweetTokenB,
	Tweet: row.text_y,
	AuthorToken: row.AuthorTokenB,
	AuthorHandle: row.UserB,
	CreatedAt: row.b_created_at,
	Retweets: row.b_rt_cnt,
	Replies: row.b_reply_cnt,
	Likes: row.b_like_count,
	Quotes: row.b_qt_count,
	Aspect: row.AspectsB
});

 

//Create Tweet to Tweet relationship
LOAD CSV WITH HEADERS FROM 'file:///Tweet2Tweet.csv' AS row
WITH row WHERE NOT row.UserB IS NULL
MATCH (a:Tweet {TweetID: row.TweetTokenA}), (b:Tweet {TweetID: row.TweetTokenB})
CREATE (a)-[:Similar {Score: row.SimilarityScore}]->(b);

Hello @NewGraphGuy 😊

You must have a UNIQUE CONSTRAINT on the ID property. Without it, every MERGE and MATCH on TweetID has to scan all Tweet nodes, which is what makes the last step so slow.

I also updated your queries:

// Delete Everything
CALL apoc.periodic.iterate("
	MATCH (n) RETURN n
	", "
	DETACH DELETE n
	", {batchSize: 1000, parallel: false});

// Create unique constraints
CREATE CONSTRAINT constraint_Tweet_id IF NOT EXISTS FOR (tweet:Tweet) REQUIRE tweet.TweetID IS UNIQUE;

// Create Tweets from TweetA
LOAD CSV WITH HEADERS FROM 'file:///Tweet2Tweet.csv' AS row 
WITH row 
WHERE NOT row.UserA IS NULL 
MERGE (tweet:Tweet {TweetID: row.TweetTokenA}) 
SET tweet += {
	Tweet: row.text_x, 
	AuthorToken: row.AuthorTokenA, 
	AuthorHandle: row.UserA, 
	CreatedAt: row.a_created_at, 
	Retweets: row.a_rt_cnt, 
	Replies: row.a_reply_cnt, 
	Likes: row.a_like_count, 
	Quotes: row.a_qt_count, 
	Aspect: row.AspectsA
};

// Create Tweets from TweetB
LOAD CSV WITH HEADERS FROM 'file:///Tweet2Tweet.csv' AS row 
WITH row 
WHERE NOT row.UserB IS NULL 
MERGE (tweet:Tweet {TweetID: row.TweetTokenB}) 
SET tweet += {
	Tweet: row.text_y, 
	AuthorToken: row.AuthorTokenB, 
	AuthorHandle: row.UserB, 
	CreatedAt: row.b_created_at, 
	Retweets: row.b_rt_cnt, 
	Replies: row.b_reply_cnt, 
	Likes: row.b_like_count, 
	Quotes: row.b_qt_count, 
	Aspect: row.AspectsB
};

//Create Tweet to Tweet relationship
LOAD CSV WITH HEADERS FROM 'file:///Tweet2Tweet.csv' AS row 
WITH row 
WHERE NOT row.UserB IS NULL 
MATCH (a:Tweet {TweetID: row.TweetTokenA}) 
MATCH (b:Tweet {TweetID: row.TweetTokenB}) 
MERGE (a)-[r:SIMILAR_TO]-(b) 
SET r += {
	Score: row.SimilarityScore
};
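
If the relationship step is still slow on 230k rows, you can also batch the LOAD CSV itself the same way. A sketch using the same file and property names (adjust batchSize to your machine):

// Batch the relationship load; parallel: false avoids lock contention
// when many rows touch the same nodes
CALL apoc.periodic.iterate("
	LOAD CSV WITH HEADERS FROM 'file:///Tweet2Tweet.csv' AS row
	WITH row WHERE NOT row.UserB IS NULL
	RETURN row
	", "
	MATCH (a:Tweet {TweetID: row.TweetTokenA})
	MATCH (b:Tweet {TweetID: row.TweetTokenB})
	MERGE (a)-[r:SIMILAR_TO]-(b)
	SET r.Score = row.SimilarityScore
	", {batchSize: 1000, parallel: false});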

Regards,
Cobra

 

Thanks @Cobra. I got this to work before your response by adding this line in front of each LOAD CSV query:

:auto USING PERIODIC COMMIT 1000
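so each import query starts like this (properties trimmed for brevity, same file and column names as above):

:auto USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///Tweet2Tweet.csv' AS row
WITH row WHERE NOT row.UserA IS NULL
MERGE (tweet:Tweet {TweetID: row.TweetTokenA});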
 
That created the database very quickly. However, we then removed a few duplicate rows from our dataset, and when I reran everything it was incredibly slow. Any idea why that could be? The dataset is smaller and the queries are exactly the same.
 
I tried your solution too (I needed to install APOC first), but it errors on the constraint statement. Any ideas?
 
Neo.ClientError.Statement.SyntaxError: Invalid input 'F': expected whitespace, comment or ON (line 1, column 53 (offset: 52))
"CREATE CONSTRAINT constraint_Tweet_id IF NOT EXISTS FOR (tweet:Tweet) REQUIRE tweet.TweetID IS UNIQUE"

Make sure you have UNIQUE CONSTRAINTS on the ID properties; if no constraint exists after you recreated the database, every MERGE falls back to scanning all Tweet nodes, which would explain the slowdown. Moreover, you should use my queries and add what you did in front of them.

For the error, I don't see anything wrong with the query itself, so maybe it's your Neo4j version: the FOR ... REQUIRE syntax only exists in Neo4j 4.4 and later 😕
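
If you are on an older version, the pre-4.4 syntax should do the same thing (and if IF NOT EXISTS also errors on your version, drop it):

// Same unique constraint in the pre-4.4 syntax
CREATE CONSTRAINT constraint_Tweet_id IF NOT EXISTS
ON (tweet:Tweet) ASSERT tweet.TweetID IS UNIQUE;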