Neo4j

radhika · ‎02-03-2021

Hi,

I have a dataset where we have loaded a large number of user nodes (with userid as a property) and IP address nodes (with the IP address as a property). Multiple users use the same IP address, and we have connected them through the IP relationship, so multiple users are connected to, let's say, one IP address node. However, the users are connected to each other only through the IP node, so I'm not able to get the desired output. Since our dataset was large, we had to load the data through the neo4j-admin bulk import tool.

We want to save this data on S3 for our customer support team who cannot use Neo4j. So we want the output in the following format (table format):

User node (user id) | Connected users (through IP address)
38994772            | [4899244, 800374, 89922774, 28899472]
74999583            | [3902223, 4800243, 59002784, 8900023, 2900139884, 289001342]

Basically, we want to save the relationships (the connected components) on S3 in a table format. Can anybody help me with a query that returns the connected users as a set? Note that these users are not directly connected to each other; they are connected to each other through the IP node.

Attaching screenshot below

radhika · ‎02-03-2021

The output that I want is basically like this:

andrew_bowman · ‎02-03-2021

Cobra's query will work.

However, if id is a unique property of :userid nodes (and if it SHOULD be but isn't, please create a unique constraint here), then it would make more sense to aggregate in a WITH using a as the grouping key, and only after that do your RETURN and project out a.id (this cuts out redundant work). Additionally, it would be better to use a singular, User, for a or a.id, since it's a single user per row, but keep the plural for the other term, as Connected_Users is a list.
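As a rough sketch of that suggestion, reusing the :userid label, the userid_ip relationship type, and the id property from Cobra's query (the constraint statement assumes Neo4j 4.x syntax, so check it against your version, and run it as a separate statement from the query):

```cypher
// Optional: enforce uniqueness of id on :userid nodes (Neo4j 4.x syntax).
// Neo4j 4.4 and later also accept:
//   CREATE CONSTRAINT FOR (u:userid) REQUIRE u.id IS UNIQUE
CREATE CONSTRAINT ON (u:userid) ASSERT u.id IS UNIQUE;

// Aggregate with the node a as the grouping key,
// then project a.id only once per resulting row.
MATCH (a:userid)-[:userid_ip]->()<-[:userid_ip]-(b:userid)
WITH a, collect(b.id) AS Connected_Users
RETURN a.id AS User, Connected_Users;
```

Grouping on the node rather than on a.id means the property is read once per user instead of once per matched path.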

If you're using at least Neo4j 4.1.x, then we can make this more efficient with subqueries, where we perform more aggregations (one per user), but each is doing far less work, as compared to a single aggregation over the entire result set.

MATCH (a:userid)
CALL {
  WITH a
  MATCH (a)-[:userid_ip*2]-(b:userid)
  WITH collect(DISTINCT b) AS users
  WHERE size(users) > 0
  RETURN [user IN users | user.id] as Connected_Users
}
RETURN a.id as User, Connected_Users

I'm using collect(DISTINCT b) here in case one user is connected to the same other user multiple times, via multiple IPs. If that is an impossibility in your graph, then you can remove the DISTINCT from there, and maybe change from using a subquery to using a pattern comprehension instead.
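If the DISTINCT really isn't needed, the pattern comprehension version could look roughly like this (a sketch using the same label and relationship type as the subquery version, not tested against your data):

```cypher
MATCH (a:userid)
// The pattern comprehension builds the list per user without an explicit aggregation.
WITH a, [(a)-[:userid_ip*2]-(b:userid) | b.id] AS Connected_Users
// Drop users that share no IP with anyone else.
WHERE size(Connected_Users) > 0
RETURN a.id AS User, Connected_Users
```

Keep in mind that a pattern comprehension cannot deduplicate, so a user reachable through two shared IPs would appear twice in the list.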

Cobra · ‎02-03-2021

Hello @radhika and welcome to the Neo4j community

This query should give you the desired output:

MATCH (a:userid)-[:userid_ip]->()<-[:userid_ip]-(b:userid)
RETURN a.id AS Users, collect(b.id) AS Connected_Users

Regards,
Cobra

radhika · ‎02-04-2021

@Cobra & @andrew.bowman

Thank you so much!! Both solutions worked. I have one more question.

So I also have a graph like this :

Here we have user nodes, and they are connected to each other through referrals (a user who has referred another user is the parentid of that user). We are trying to identify referral fraud in our company, so we need the details of each user who has referred other users on our app. Those referred users then refer more users, so there is a chain of referrals that starts with one parent id and grows as he refers more users and those same users then refer others.

I want an output like this -

It seems to be extremely complicated but I tried this-

However, what this does is return the parentid, then the first generation (only 2 random users), then the 2nd generation (children of only one of those 2 children).

Is there any way we can get the parent id, then its children, then the children of ALL children, then the children of ALL children of ALL children, and so on, in sets, or will that be too complicated? If getting all the generations is too complicated, can we at least get 3-4 generations?
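One possible shape for such a query, as a sketch only: the :User label, the :REFERRED relationship type (pointing from the referrer to the referred user), the id property, and the $parentId parameter are all hypothetical names, since the actual schema is not shown in the thread.

```cypher
// Hypothetical schema: (:User {id})-[:REFERRED]->(:User {id})
MATCH (root:User {id: $parentId})
// Follow the referral chain up to 4 hops; the path length is the generation number.
MATCH path = (root)-[:REFERRED*1..4]->(descendant:User)
WITH root, length(path) AS generation, collect(DISTINCT descendant.id) AS users
RETURN root.id AS parent, generation, users
ORDER BY generation
```

This would return one row per generation for the chosen parent, each with the full set of users at that depth. Raising the upper bound of the variable-length pattern includes deeper generations, at the cost of more traversal work.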

radhika · ‎02-04-2021

Also, for my first query regarding the IP addresses and users: I have a really large dataset on a Linux instance, which I access through Neo4j Browser. When I run those queries, they take forever to return results, and it crashes too. I just want to check the results, so can I run this for only some users, for example with LIMIT 10?
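A sketch of how that could look with the subquery version from earlier in the thread, assuming Neo4j 4.1 or later (the LIMIT is applied to the users before any relationships are expanded):

```cypher
MATCH (a:userid)
// Restrict the work to 10 arbitrary users before expanding any relationships.
WITH a LIMIT 10
CALL {
  WITH a
  MATCH (a)-[:userid_ip*2]-(b:userid)
  WITH collect(DISTINCT b) AS users
  WHERE size(users) > 0
  RETURN [user IN users | user.id] AS Connected_Users
}
RETURN a.id AS User, Connected_Users
```

Because users with no connections are filtered out inside the subquery, this can return fewer than 10 rows.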

Need to return connected user nodes through IP address in a set to save data in table on S3