
Which NLP process to use for full text extraction and best practices?

I'm trying to self-train as a data scientist, so I'm very new at a lot of this.

My current project is getting full news articles from multiple sources into a knowledge graph (KG) to form a fuller picture of who is saying what and when, with the ability to compare sources side by side.

I'm currently using trafilatura as the scraper, both to get links from the site map and to scrape full articles and metadata. I'm getting the data as a dict of dicts; other output formats like JSON are available from the tool. I can get an entire article with its metadata into a CSV on one line.
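
For context, here's roughly what my scraping step looks like (a minimal sketch; the sitemap URL is a placeholder and the output handling is simplified):

```python
import json

import trafilatura
from trafilatura.sitemaps import sitemap_search

# Discover article links from the site map (placeholder URL)
links = sitemap_search("https://www.example-news-site.com")

articles = {}
for url in links:
    downloaded = trafilatura.fetch_url(url)
    if downloaded is None:
        continue
    # JSON output bundles the extracted text with its metadata
    # (title, author, date, hostname, ...)
    result = trafilatura.extract(downloaded, url=url, output_format="json")
    if result:
        articles[url] = json.loads(result)  # the dict of dicts I mentioned
```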

My question is: what is the best way to extract "facts" (or sentences) from the full-text article and get them into the KG, with each fact having the article's metadata attached to it? The long-term goal is to be able to point and say "everyone says there is a glass of water; CNN said it's half full, Fox said it's half empty."
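
To make that concrete, something like this is what I'm picturing for the extraction step (a rough sketch assuming spaCy, with a "fact" being just a sentence plus the named entities it mentions; the metadata key names are my assumptions based on trafilatura's JSON output):

```python
import spacy

# requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def extract_facts(article):
    """Split an article into sentence-level 'facts', each carrying
    the entities it mentions plus the article's metadata."""
    doc = nlp(article["text"])
    facts = []
    for sent in doc.sents:
        facts.append({
            "text": sent.text.strip(),
            "entities": [[ent.text, ent.label_] for ent in sent.ents],
            # attach the article metadata to every fact
            "source": article.get("hostname"),  # e.g. 'cnn.com' (key assumed)
            "author": article.get("author"),
            "date": article.get("date"),
            "url": article.get("url"),
        })
    return facts
```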

I know I will need NLP, but GraphAware, APOC, and probably a few others all do this; I'm looking for best practice. Should I do the fact extraction before, during, or after upload? Which library should I use, and why? How? Code examples are great; tutorials are great too. I've been trying to use this [here] as a starting point, but the actual extraction of facts wasn't as detailed as I would like.
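
For the upload side, this is the kind of Cypher I imagine running per fact with the official Python driver, i.e. extraction before upload (a sketch only; the labels, relationship types, and connection details are my own guesses, not an established schema):

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

UPLOAD_FACT = """
MERGE (s:Source {name: $source})
CREATE (f:Fact {text: $text, date: $date, url: $url})
MERGE (s)-[:PUBLISHED]->(f)
WITH f
UNWIND $entities AS ent
MERGE (e:Entity {name: ent[0], label: ent[1]})
MERGE (f)-[:MENTIONS]->(e)
"""

def upload_facts(facts):
    with driver.session() as session:
        for fact in facts:
            session.run(
                UPLOAD_FACT,
                source=fact["source"] or "unknown",
                text=fact["text"],
                date=fact["date"],
                url=fact["url"],
                entities=fact["entities"],
            )
```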

After that, it's matching like facts together to see the whole story in one place from multiple sources.
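
My naive starting point for that would be "facts from different sources that mention the same entity" (a sketch reusing the schema I guessed above; real fact matching presumably needs some similarity scoring on top):

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Pairs of facts from two different sources that mention the same entity,
# returned side by side for comparison.
SAME_STORY = """
MATCH (s1:Source)-[:PUBLISHED]->(f1:Fact)-[:MENTIONS]->(e:Entity)
      <-[:MENTIONS]-(f2:Fact)<-[:PUBLISHED]-(s2:Source)
WHERE s1.name < s2.name
RETURN e.name AS entity,
       s1.name AS source1, f1.text AS fact1,
       s2.name AS source2, f2.text AS fact2
ORDER BY entity
"""

def find_parallel_facts():
    with driver.session() as session:
        return [record.data() for record in session.run(SAME_STORY)]
```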

Step 37 is to get it to write its own wiki pages and news articles, all while surfacing misinformation and showing which news sources regularly retract full stories less than a week later due to lack of research, lack of credible sources, or just blatant lying.

Well, that's the hope.

