cancel
Showing results for 
Search instead for 
Did you mean: 

Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.

Interval search on partially known dates

Hi,

Let me start with the problem introduction. There is a node that represents a person and multiple associated with that that person nodes that represent events (for example, birth). The exact date and time of the event might not be known exactly, it is possible that we know only part of the exact date (for example, born in Jan 1990). I need help with the query optimization that retrieves all persons that participated in the event of interest and that happened between two given dates. With current approach indexes can not be applied and the query is executed slow .

Now the time information is stored inside relationship connecting a person with some event. See the create query for the format:

    CREATE (p: Person {id: {id}}),
           (e: Event {value: {value}}),
           (p) -[:RELATED_TO {l: {l}, u: {u}, dayFixed: {dayFixed}, monthFixed: {monthFixed}, yearFixed: {yearFixed}}]-> (e)

The template requires the following values:

  • l - The lower bound of the date, datetime
  • u - The upper bound of the date, datetime
  • dayFixed - If day is unknown, boolean
  • monthFixed - If month is unknown, boolean
  • yearFixed - If year is unknown, boolean

The query that does the strict search between two given dates. In square brackets the query does the non-strict search, where the partially known dates are added when there is match by year-month or year:

MATCH (p: Person) -[r:RELATED_TO]-> (e: Event {value: 'birth'})
WHERE datetime('2005-06-14') <= r.l AND r.u <= datetime('2005-06-16') [OR (r.dayFixed = FALSE AND datetime('2005-05-01') <= r.l AND  r.u <= datetime('2005-05-31')) OR (r.monthFixed = FALSE AND datetime('2005-01-01') <= r.l AND r.u <= datetime('2005-12-31') )]
RETURN p
LIMIT 20;

I would be glad to any suggestions on how to optimize the query or the schema of the data.

1 ACCEPTED SOLUTION

Hi @g.emelyanov,

I guess that query optimizer use the fact that p is already known so it's preferred just travers with the relation in order to know d (and then filter) instead of using the index. If you feel like :Date index is preferred, you may try filtering by date before using the full text index.

How's this performing?

CALL db.index.fulltext.queryNodes("testIndex", 'Viurts N') YIELD node
with collect(node) as n
MATCH (node) <-[:HAS_NAME]-(p:Person) -[:DATE_OF_BIRTH]-> (d: Date)
WHERE datetime('2005-06-14') < d.date < datetime('2005-06-16')
and node in n
RETURN DISTINCT p
LIMIT 20

H

View solution in original post

12 REPLIES 12

Bennu
Graph Fellow

Hi @g.emelyanov,

I suggest you to move l and u to the node :Event and Index both properties. I'm not sure abot what you are trying to achieve with the day/month/year fixed boolean (somehow it derived from l and u). About that, you can consider adding a Label to :Event based on this knowledge and eventually querying with multiple labels. This will speed up a lot.

Lemme know how it goes!

H

Thanks for reply!

The problem with indexing both properties that define interval boundaries is that neo4j can not aggregate result of multiple range index lookups. The optimizer will do the lookup on the first property range operation, but the second property range operation will be regular filter. That is better than nothing, but in the worst case that still have complexity O(n).

About dayFixed, etc, the goal is to support different ways to do the search. With partially known dates it is problematic to do the regular search between two dates, for example a person who do the search might know the exact date of birth, but in the data there is only month and year. And that is why there is strict and non-strict modes of search. By default, if there is match by month and year or only year with event that has partially known date, the associated node will be included with low rank to the output, this is non-strict search. If there is no need to include events with not known dates, the search will be done strictly between two dates, this is strict search. The usage of labels might speed up the non-strict search, but that also means it is needed to generate a label for every month-year pair and every year, that seems to be a lot.

Hi @g.emelyanov,

By labels I meant the same Boolean dayFixed, monthFixed, yearFixed you already have. This way you will make a SearchByLabelScan on this typeof search. About the agreggation, you can pipeline the 4 different condition while agreggationg them 'manually' so you have a NodeIndexSeekByRange for each one of them.

i.e.

MATCH (p: Person) -[:RELATED_TO]-> (e: Event {value: 'birth'})
WHERE datetime('2005-06-14') <= e.l AND e.u <= datetime('2005-06-16') 
WITH collect(p) as out
MATCH (p: Person) -[:RELATED_TO]-> (e: Event:NotDayFixed  {value: 'birth'})
WHERE AND datetime('2005-05-01') <= e.l AND  e.u <= datetime('2005-05-31')
WITH apoc.coll.intersection(out,collect(p)) as out
MATCH (p: Person) -[r:RELATED_TO]-> (e: NotMonthFixed {value: 'birth'})
WHERE datetime('2005-01-01') <= e.l AND e.u <= datetime('2005-12-31') 
WITH apoc.coll.intersection(out,collect(p)) as out
UNWIND out as p
RETURN p LIMIT 20;

H

Hello @g.emelyanov and welcome to the Neo4j community

We had a use case with the same problem, in some case the date was not complete so we decided to complete the date, for example:

  • 2020 become 20200101
  • 202001 become 20200101

Regards,
Cobra

Thanks for reply! @Cobra @Bennu

Your ideas helped me a lot! I combined the unknown date completion and a label to indicate unknown part of the date, that seems to work just fine. Now only a single date is stored and labels indicate whether a day or month is unknown.

However, that leads to the other issues with the details I haven't considered. A person node might have several relations to nodes that represent name and before doing the filter by date property of particular event, there is a search by name with full-text index.

It looks like the cypher optimizer does not use index on date property to do filtering of the result matched with full-text index. It seems really weird, but to do a match by name with full-time index and after take intersection with the result of a match by date interval takes significantly less time than do a single query.

Here is an example of that query:

CALL db.index.fulltext.queryNodes("testIndex", 'name') YIELD node
MATCH (p: Person) -[:HAS_NAME]-> (node)
MATCH (p) -[:HAS_DATE_OF_BIRTH]-> (d: Date)
WHERE datetime('2005-06-14') < d.date < datetime('2005-06-16')
RETURN DISTINCT p
LIMIT 20;

I've tried to add a hint to use index on date property, but there was no improvement.

Hi @g.emelyanov!

Why are you creating another node with the name property? Is not better just using it inside :Person?

What happens if you send it this way?

CALL db.index.fulltext.queryNodes("testIndex", 'name') YIELD node
MATCH (p: Person) -[:HAS_NAME]-> (node)
WITH p
MATCH (p) -[:HAS_DATE_OF_BIRTH]-> (d: Date)
WHERE datetime('2005-06-14') < d.date < datetime('2005-06-16')
RETURN DISTINCT p
LIMIT 20;

I'm trying to create a dummy version in order to test this. If you have any script to recreate part of your data may be good

H

Hello @Bennu!

There are several reasons why name is represented as node. A name might have a bunch of associated properties, there can be multiple associated names, etc.

I'm using a python script to generate a sample data. Here is the link to the gist you can use to produce a sample data https://gist.github.com/goshaQ/2af2f55f75414e782f0b6a64246e1b31

An example of the full-text search query is "Вюрц" OR "Viurts"

Maybe problem can be related to a result of procedure call being a stream? I have no ideas what happens with query optimizer, but there is no difference with and without index on Date.

That's because the index cannot be applied on partial data as the data set (d) is limited now.

you could try changing it to traversal

MATCH (d: Date)
WHERE datetime('2005-06-14') < d.date < datetime('2005-06-16')
WITH collect(id(d)) as dates
CALL db.index.fulltext.queryNodes("testIndex", 'name') YIELD node
MATCH (p: Person) -[:HAS_NAME]-> (node)
WITH p, dates
MATCH (p) -[:HAS_DATE_OF_BIRTH]->(d) WHERE id(d) in dates
RETURN DISTINCT p
LIMIT 20;

This avoids the property lookup db hits for the dates.

Hello @g.emelyanov

Could you the show us the result of the PROFILE with your query?
Can you show us the list of indexes you created?

Regards,
Cobra

Hello @Cobra!

Sure

Indexes

| "INDEX ON :Date(date)"                                                                                            | "Unnamed index" | ["Date"]   | ["date"]                                                                                               | "ONLINE" | "node_label_property"  | 100.0    | {version: "1.0", key: "native-btree"} | 31 | ""             |
| "INDEX ON NODE:Test(name)"                                                                                        | "testIndex"     | ["Test"]   | ["name"]                                                                                               | "ONLINE" | "node_fulltext"        | 100.0    | {version: "1.0", key: "fulltext"}     | 21 | ""             |

Profile

+-----------------------------------------------------------------------------------------+
| Plan      | Statement   | Version      | Planner | Runtime       | Time | DbHits | Rows |
+-----------------------------------------------------------------------------------------+
| "PROFILE" | "READ_ONLY" | "CYPHER 3.5" | "COST"  | "INTERPRETED" | 3348 | 493261 | 20   |
+-----------------------------------------------------------------------------------------+


+-----------------+----------------+-------+---------+-----------+--------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Operator        | Estimated Rows | Rows  | DB Hits | Cache H/M | Identifiers              | Other                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
+-----------------+----------------+-------+---------+-----------+--------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| +ProduceResults |              0 |    20 |       0 |       0/0 | anon[100], p, node, r, d |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
| |               +----------------+-------+---------+-----------+--------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| +Limit          |              0 |    20 |       0 |       0/0 | anon[100], p, node, r, d | 20                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
| |               +----------------+-------+---------+-----------+--------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| +Filter         |              0 |    20 |  222335 |       0/0 | anon[100], p, node, r, d | AndedPropertyInequalities(Variable(d),Property(Variable(d),PropertyKeyName(date)),LessThan(Property(Variable(d),PropertyKeyName(date)),ResolvedFunctionInvocation(datetime,Some(datetime(input  =  DEFAULT_TEMPORAL_ARGUMENT :: ANY?) :: DATETIME?),Vector(CoerceTo(Parameter(  AUTOSTRING2,String),Any)))), GreaterThan(Property(Variable(d),PropertyKeyName(date)),ResolvedFunctionInvocation(datetime,Some(datetime(input  =  DEFAULT_TEMPORAL_ARGUMENT :: ANY?) :: DATETIME?),Vector(CoerceTo(Parameter(  AUTOSTRING3,String),Any))))) |
| |               +----------------+-------+---------+-----------+--------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| +Expand(All)    |              0 | 54185 |  108370 |       0/0 | anon[100], p, node, r, d | (p)-[r:DATE_OF_BIRTH]->(d)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
| |               +----------------+-------+---------+-----------+--------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| +Filter         |             60 | 54185 |   54185 |       0/0 | anon[100], node, p       | p:Person                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
| |               +----------------+-------+---------+-----------+--------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| +Expand(All)    |             60 | 54185 |  108370 |       0/0 | anon[100], node, p       | (node)<-[anon[100]:HAS_NAME]-(p)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
| |               +----------------+-------+---------+-----------+--------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| +ProcedureCall  |          10000 | 54185 |       1 |       0/0 | node                     | db.index.fulltext.queryNodes(CoerceTo(Parameter(  AUTOSTRING0,String),String), CoerceTo(Parameter(  AUTOSTRING1,String),String)) :: (node :: Node)                                                                                                                                                                                                                                                                                                                                                                                         |
+-----------------+----------------+-------+---------+-----------+--------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Hi @g.emelyanov,

I guess that query optimizer use the fact that p is already known so it's preferred just travers with the relation in order to know d (and then filter) instead of using the index. If you feel like :Date index is preferred, you may try filtering by date before using the full text index.

How's this performing?

CALL db.index.fulltext.queryNodes("testIndex", 'Viurts N') YIELD node
with collect(node) as n
MATCH (node) <-[:HAS_NAME]-(p:Person) -[:DATE_OF_BIRTH]-> (d: Date)
WHERE datetime('2005-06-14') < d.date < datetime('2005-06-16')
and node in n
RETURN DISTINCT p
LIMIT 20

H

Hello @Bennu,

Indeed, the optimizer will not go to WHERE condition first to fetch related nodes using index (with Date label) and instead would decide to iterate over each node from the stream to filter out not satisfying the condition. That seems to be the right decision, as in the first scenario (my initial though) there is no optimization, it still will need to iterate over each node.

I suppose the only way to achieve proper usage of indexes is to build an index with Solr or Elasticsearch, with neo4j there is a trade off between complexity of the data structure and speed of query execution.

I would mark your last response as solution, because in the end I come up with similar query.
Thanks to everyone who replied!