Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.
08-26-2021 01:37 PM
I'm trying an experiment that consists of ingesting a text file that has numbered questions and answers in the following format:
1.) Question : Answer
2.) Question : Answer
...
I'm able to read the TXT file using the LOAD CSV function just fine. I can separate the answer from the question with the SPLIT function (e.g., split(row, ':') as newrow. The question is how can I do a second split to extract just the question number as well? I tried the following:
unwind(row) as newrow
with split(newrow, ':') as myrow
unwind(myrow) as thisrow
with myrow, split(thisrow, ".)") as q
return q, myrow[0], myrow[1]
But I seem to be looping within myrow constraints as I get q returning with the question number & question, then the answer in the next row, etc.
Any suggestions? I would really like to separate out the actual question number and then the answer to further build the graph. The question text doesn't really matter (consistent number and questions).
Solved! Go to Solution.
09-22-2021 10:45 AM
You have a typo first line : LOAD CSV FROM "file:///mbandor.txt' as line
instead of LOAD CSV FROM 'file:///mbandor.txt' as line
08-26-2021 03:51 PM
Hi @mbandor!
Have you tried without the unwind?
Something like
with split(row, ':') as myrow
with myrow, split(myrow[0], ".)") as q
return q[0], q[1], myrow[1]
Bennu
09-14-2021 06:43 AM
Unfortunately split() only works with string values. At this point, row is a list and an error is thrown about the type mismatch.
09-18-2021 12:50 AM
Hello @mbandor
Here is a try:
LOAD CSV FROM 'file:///test.txt' AS line
UNWIND line AS item
WITH split(item, ":") AS elements
WITH split(elements[0], ".)") AS questions, elements[1] AS answer
RETURN questions[0] AS question_number, questions[1] AS question, answer
Regards,
Cobra
09-22-2021 08:06 AM
I'm still getting errors. This is the code:
LOAD CSV FROM "file:///abc.txt' as line
UNWIND line as item
WITH split(item, ':') as elements
WITH split(elements[0], ".)") as questions, elements[1] as answer
return questions[0] as q_nbr, questions[1] as q, answer
It doesn't like the second split and throws the following error:
Invalid input ')': expected whitespace or a property key name (line 4, column 27 (offset: 133))
"WITH split(elements[0], ".)") as questions, elements[1] as answer"
09-22-2021 08:11 AM
My query was working with my text file, can you send me yours?
09-22-2021 08:17 AM
Unfortunately I can't, however here are the first 3 lines in the text file:
1.) Is this a follow-on report? : Yes
2.) Enter report number of initial report. : TBN-17356-002
3.) Has this information been shared with any other agency? : No
The pattern (question #,), question, : response) are pretty consistent throughout the rest of the file.
Ultimately, I would like to use the apoc.do.case to address each question and perform additional processing based on the question number. That is why I'm trying to extract the question number along with the response. I've managed to separate the response but not the question number.
09-22-2021 09:33 AM
With this data, Cobra's code is working fine. Check line 4 in your .csv file. I used Neo4j 4.3.1 and desktop 1.4.8
09-22-2021 10:44 AM
Still no luck. I created a new txt file with only the 3 lines I posted in it. This is the current code I'm trying to execute:
LOAD CSV FROM "file:///mbandor.txt' as line
UNWIND line as item
WITH split(item, ':') as elements
WITH split(elements[0], '.)') as questions, elements[1] as answer
RETURN questions[0] as q_nbr, questions[1] as q, answer;
The error now is a Lexical error at line 5, col 57 (Encountered after : ""
09-22-2021 10:44 AM
Yeah, same, I don't have any errors.
09-22-2021 10:45 AM
You have a typo first line : LOAD CSV FROM "file:///mbandor.txt' as line
instead of LOAD CSV FROM 'file:///mbandor.txt' as line
09-22-2021 10:46 AM
Thanks! That is what was causing the error! I appreciate everyone's input. Now to parse the rest of the file and create some nodes and relationships.
09-22-2021 11:04 AM
Let's take this a slightly different direction using the apoc.load.csv call. The reason I'm heading that direction is some of the lines in the text file contain commas (throws off the processing). If I use the apoc.load.csv('mbandor.txt', {sep:':'}) it handles the semi colon without the need for the first UNWIND (from previous example) and gives me the additional benefit of automatically handling lines that have commas in it (either as part of the question and/or part of the answer). This is the prototype code that still needs to separate the question number of the result for further processing:
CALL apoc.load.csv('mbandor.txt',
{sep:':'})
YIELD list
RETURN list[0] as Question, list[1] as Response;
I also noticed the first line of the file gets skipped. That isn't really an issue as the real file has 2 lines with the report heading before the actual questions, so I can use the SKIP function to address that.
Alternatively I just discovered the FIELDTERMINATOR option for changing the delimiter as an option.
09-23-2021 06:20 AM
@mbandor this is NOT directed at you. It's a comment on APOC and how it's scope has continued to creep over built in cypher capabilities.
As context: my work project isn't allowed to use APOC, so I'm a little sensitive to "just use apoc"
Comment:
This raises one of the serious issues with APOC. There are now functions that make things more complicated than the original cypher method.
FIELDTERMINATOR
as you mentioned is the better way to do it, apoc really should be for filling in gaps in cypher's capabilities, not recreating the wheel.
09-22-2021 02:46 PM
I have a secondary question related to double-splitting the strings. Is there a limit as to how many values can be checked with apoc.case()? I have about 75 and I'm getting lexical errors, but lesser amounts seem to work fine. Each line in the file is unique and has to be parsed.
09-22-2021 11:02 PM
What is the error? Maybe there is a typo?
09-23-2021 06:09 AM
I tried an experiment and the limit appears to be about 50 conditions. That just means I have to make an additional pass or two to completely parse each file.
All the sessions of the conference are now available online