Neo4j

luiseduardo · ‎06-02-2022

Hi community!

I have a requirement to build a timeline-like series of events based on timestamps. Currently our model has events, users, and other characteristics associated to the event, including the date and time of the event. There's quite a few events per user, and about 20k users.

So, we need a per user per day timeline of events with one event having at best one relationship to the next event; although, events can happen at the same hour and minute.

I have tried with something like this:

MATCH (u:User)<-[:HAS_USER]-(e:Event)-[:HAS_DATE]->(d:Date),
(u)<-[:HAS_USER]-(e2:Event)-[:HAS_DATE]->(d),
(e)-[:HAS_HOUR]->(h:Time),
(e2)-[:HAS_HOUR]->(h2:Time)
WHERE h2.hour >= h.hour AND h2.minute > h.minute
MERGE (e)-[:PRECEDES]->(e2)

But, of course, events relate to every other one which has greater time than the first one. I can delete the extra relationships, but it takes quite a long time. I have indexes on every used property.

So, is there a better way to do this maybe in only one step?

luiseduardo · ‎07-06-2022

Hi! For anyone wondering I found this code that performed well and created the needed relationships, hope this helps!

CALL apoc.periodic.iterate(
'MATCH (l:Login) RETURN l',
'MATCH (l)-[:DOES]->(e:Event),
(e)-[r:HAS_DATE]->(d:EventDate)
WITH e, d
ORDER BY d.eventDate, r.hour, r.minute
WITH collect(e) AS events
WITH events, events[1..] AS nextEvents
UNWIND range(0, size(nextEvents)-1, 1) AS index
WITH events[index] AS first, nextEvents[index] AS second
CREATE (first)-[:NEXT]->(second)
',{iterateList:false}
);

View solution in original post

glilienfield · ‎06-02-2022

I think the data model depends on how you want to query the data. Can you describe how you will use the data?

The following model may make sense. It allows you to quickly find the events that occurred for a person on a specific date or range of dates. A Date node exists for each day that contains an event (for all users). They have a date property equal to the date. The Event nodes have a timestamp property. Timestamps usually have resolution to a millisecond. This can help with ordering the events, since it will be less probable to have two with the same timestamp. Even so, if you do have less resolution in the timestamp and two events have equal timestamps, you can put the two events in random order. Will that affect your application?

Once you get the events for a given user and date, you can sort the events by timestamp to retrieve them in order (ascending or depending). You wouldn't need the relationships for this purpose, which may improve performance.

We can refine the model if you can describe what querying capabilities you need.

Just a note, the logic in your 'where' clause is not correct. For one time to be greater than another time, it is not necessarily true that the minutes will be greater (as in your query). The order of the minutes is only relevant if the hours are equal. This is why using timestamps is better, then comparing the individual components of time. A proper implementation of timestamp comparison logic will also account for differing time zones. You would have to have something like:

where (h2.hour > h.hour) or ((h2.hour = h.hour) and (h2.minutes >= h.minutes))

luiseduardo · ‎06-03-2022

Thank you, @glilienfield ! Yes, in fact the model looks just like that, with more characteristics around the Events, but for this question I've focused on these elements.

Our queries are quite diverse given the other characteristics, it's sort of a cybersecurity use case, so a lot of different types of queries are going to be neede. But, for this particular use of the graph we would like to know the timeline of events that a user carried on, given the day, relate it to some other user's activity, drill down on the entities around the events. Basically investigate around the details, by providing the ordering of events of a given user, on a given day.

Thank you for that note, you're totally right!

My question goes more about the actual cypher to optimally create the PRECEEDS relationship, if I do the creation with the conditions you proposed I'd get one rel for every event that has a greater or equal timestamp, and if in one day a single user has 2k events I'd have 1999 + 1998 + 1997... relationships at least! So, the creation can take some time, and the deletion of the unused rels can be troublesome, I did it like this:

MATCH (e:Event)-[:PRECEEDS]->(e2:Event),
(e)-[:HAS_HOUR]->(t:Hour),
(e2)-[:HAS_HOUR]->(t2:Hour)
WITH e, MIN(t.hour + t.minute/100) AS MinH
MATCH (e)-[r:PRECEEDS]->(e2:Event),
(e2)-[:HAS_HOUR]->(t:Hour)
WHERE t.hour + t.minute/100 > MinH
DELETE r

I'm asking if there's a better way to do this, since creation and deletion takes, with test data, about 40 minutes, and I would need to do this with every load of data that will be carried on daily. As of now this goes to every user, on every day to create the first rels; and then, we go to every event to gather the first ones on the preceding order.

Hope this clarifies! Thanks!

glilienfield · ‎06-04-2022

If you are looking for the list of ordered events for a specific day, you could at that with the following query with model above. You could change the Date node to an Hour node if you had that relationship and you wanted an hour. That approach seems restricted, as you could not get events that cross hour boundaries. Another approach is to perform a range search on an event timestamp. This should be fast if indexed. The code below gets the users events on a specific date, orders them by time and passed the list with a 'with' clause, so they can be joined with other related data to perform additional analysis around the events.

match(u:User) where u.id = 1000
match(d:Date) where d.date = Date('2022-01-02')
match(u)<-[HAS_USER]-(e:Event)-[OCCURED_ON]->(d)
with e
order by e.timestamp
match(e)-->(x:OTHER_STUFF)
return e, x

The above is just an idea, but let's concentrate on your cypher. Sorry, I am still confused on the data model and your need. It sounds like you have a list of user events that are ingested each day and this process is time consuming. I assumed each event had a time associated with it and that you were adding these events and wanted to form a 'linked list' of the events ordered by time sequence. As such, don't you have a new relationship for each new event? If all the PRECEEDS relationships are an issue, can you do without them? You can get the order of the events with a timestamp (which should be indexed). Having them linked would be useful if you wanted to traverse the list of events from a specific event in either direction; otherwise, it seems an event timestamp would provide the functionality to order a list of events. Do you have a need to traverse a list of connected?

I need to understand the query a little more in order to assist. I am confused at what you are trying to achieve. I assume each event has just one HAS_HOUR relation to one Hour node, as an event happens within a specific hour. If true, the 'with' clause on line 4 will only have one row resulting from the match on line 2; therefore, the 'min' aggregation will be over just one Hour node and the result will be equal to the hour/minute associated with event 'e'. Since the 'with' clause is grouping by 'e' and the aggregation is on 't', the values of 'e2' and 't2' are not used and those matched can be removed. Is my understanding of the relationships between 'e' and 't' correct or incorrect?

The match on line 5 also confuses me. Here you are getting the pairs [e, e2] that are related by a PRECEEDS relationship, then constraining 'e2' to those nodes that have an hour/minute value greater than that of 'e'. I thought the fact that 'e' and 'e2' had a PRECEEDS relationship implied that 'e2' occurred after 'e', thus the 'where' clause would always be true and there is only one possible 'e2' event related to 'e' in this manner.

If this query is run in isolation, from a high level, it seems it gets each pair of [e, e2] nodes connected to each other, then returns the value of 'e' on line 4, then matches it to the same value of 'e2' that would have been matched one line 1, then deletes the relationships between the two. At the end, wouldn't all the PRECEEDS relationships be removed?

I think I am missing some understanding of your data mode. I am glad to try to help if you want to take the time get me to a shared understanding. I am sorry it is taking time and lots of questions.

Just a note, you need to divide by '60' instead of '100' if you want to convert the {hour, minute} pair to an hour value with a fraction that accounts for the minutes.

luiseduardo · ‎06-14-2022

No worries! Thank you for your help!

This is our current model, the Char nodes are characteristics associated to the event, not relevant to this part of the use case. To see the events of the user I have to go from the user to the events, then to the dates and finally extract the data

I do have a requirement to traverse mainly backwards on events, so maybe it would make more sense to have a SUCCEEDS reltaionship lol.

You're right I made a mistake in the Cypher, it should be t2 instead of t in the MIN function, and we could spare the t relationship. The process creates the events but doesn't create the PRECCEDES relationships, and I don't have a FIRST or LAST relationship, either (maybe this is the way). So, currently I want to create only the relevant PRECEDES/SUCCEEDS relationships, i.e only the least greater event related to each event, (or the greatest smaller event, with SUCCEEDS). I have created them with the first query, but as I said, that query creates much more rels that I need, for every event that happened after each event on the same day by the same user. So, I used the second query to delete every non-relevant relationship, taking the minimum sum of hour plus minutes divided by 100 (By 60 it could make more sense, but I didn't wanted to lose decimals by truncating or flooring) for each event and deleting every relationship with an event that has a greater sum than that of the minimum. It's quite symmetric if I try with SUCCEEDS, of course.

So, my question is about the best way to create this relationships, after the initial load, if it is the best way. And if not, how would I do it during load time? I create nodes first and rels after, maybe adding the FIRST and LAST and moving them for each event that happens after the greatest event on the same day...I'm not sure lol.

Thanks a lot! I've been a bit busy on some other things, but I really appreciate your help, bouncing ideas is really helpful.

Edit: Also, "new" events not necessarily succeed the loaded events, i.e, sometimes newly loaded events happened before what is already loaded.

luiseduardo · ‎07-06-2022

Hi! For anyone wondering I found this code that performed well and created the needed relationships, hope this helps!

CALL apoc.periodic.iterate(
'MATCH (l:Login) RETURN l',
'MATCH (l)-[:DOES]->(e:Event),
(e)-[r:HAS_DATE]->(d:EventDate)
WITH e, d
ORDER BY d.eventDate, r.hour, r.minute
WITH collect(e) AS events
WITH events, events[1..] AS nextEvents
UNWIND range(0, size(nextEvents)-1, 1) AS index
WITH events[index] AS first, nextEvents[index] AS second
CREATE (first)-[:NEXT]->(second)
',{iterateList:false}
);

Neo4j

Best way to create timeline from events?