Head's Up! These forums are read-only. All users and content have migrated. Please join us at community.neo4j.com.
01-06-2021 09:02 PM
Trying to import a conference proceedings which is in XML. I've studied the APOC in detail and checked other resources, but just can't get it to do the right thing. At times it's giving results that don't make sense.
It wouldn't be smart to post the data file here, even the stripped down test data is too large.
Solved! Go to Solution.
01-11-2021 06:02 AM
Hello @jeg
In your code you do an unwind which creates N (aka 3) rows
but then you don't use the values from the unwind, but do another filter on the collection
You have to decide which of the two it is:
CALL apoc.load.xml(XML file) YIELD value
WITH [item in value._children WHERE item._type="conference_rec"] as conference_rec
return conference_rec;
or
CALL apoc.load.xml(XML file) YIELD value
UNWIND value._children AS conference
WITH conference WHERE conference._type="conference_rec"
return conference AS conference_rec;
Elaine
01-07-2021 11:28 AM
At least a couple of things could be going on...
Maybe also try it on a very small subset and debug that.
Maybe also try converting XML to CSV but use TABs to separate the values. A lot less can go wrong with CSV and it's usually easier to fix broken CSV than XML or JSON.
It's kind of hard to tell what's wrong without more info. (I haven't tried importing with XML or JSON so I can't vouch for their quality.)
01-07-2021 01:12 PM
Hello @jeg ,
These are the steps / processes I would recommend -
One of the first step to take is to understand the XML Structure (i.e. root, elements, attributes and text).
--> XML -> XML Tree
If the file is too big, there are several tools. My favorite is XML Notepad from Microsoft (FREE!!!) -> XmlNotepad. (Note - if you know the opening and closing of parent tab, then use a small subset, otherwise don't copy and paste to a random lines. )
Go through this detailed documentation -> Load XML - APOC Documentation
Try to load to root, elements , attributes one at a time and slowly build your data model (examples -> Load XML - APOC Documentation
You don't have to convert to JSON/ CSV to check the validity. You have to convert to key:value pair in JSON, or pivot to columns. Instead open in Visual Code editor (with XML formatter).
Note -> apoc.xml.import doesn't have any bugs, but deprected. Instead use apoc.load.xml -> Load XML - APOC Documentation
I (we) are currently using apoc xml in production, and we don't see any bugs.
One of the most popular XML tool used in enterprise is from Altova. They are priced, but there are trial versions available. ( I wouln't recommend you to buy it if your use case is just once. Companies buy it because they build XML Schemas)
Alternate solution used by Data Engineers is to use python with xml.etree.ElementTree module and Minidom. (Python XML Parser Tutorial | ElementTree and Minidom Parsing | Edureka)
01-07-2021 04:39 PM
Instead of posting the test data, here's an overview of the document
<?xml version="1.0" encoding="ISO-8859-1"?>
<proceeding ver="6.0" ts="04/10/2010">
<X></X>
<Y></Y>
<Z></Z>
</proceeding>
X,Y,Z are complex, with several layers of nesting.
This statement:
CALL apoc.load.xml(*XML file*)
YIELD value UNWIND value._children AS conference
WITH [item in value._children WHERE item._type="conference_rec"] as conference_rec
return conference_rec;
returns X three times, where X is "conference_rec", the first chunk of data in the file. The exact same thing, repeated. I don't understand that.
01-07-2021 06:27 PM
I think you really need to provide one example where it's not working.
I presume lots of people have used XML importing for simpler cases without problems, so I'm guessing your simplified example won't demonstrate the problem you are encountering.
(If you want to redact the data, that's OK.)
01-07-2021 07:49 PM
Yeah, definitely. I never thought it was a problem with the APOC, just a problem with saying the right thing to make it do what I need. Here is the test file that produces the results in my last posting:
<?xml version="1.0" encoding="ISO-8859-1"?>
<proceeding ver="6.0" ts="04/10/2010">
<conference_rec>
<conference_date>
<start_date>07/27/1776</start_date>
<end_date>07/31/1776</end_date>
</conference_date>
<conference_loc>
<city><![CDATA[San Diego]]></city>
<state>California</state>
</conference_loc>
</conference_rec>
<proceeding_rec>
<conference_number>2003</conference_number>
<pages>142</pages>
</proceeding_rec>
<content>
<section>
<section_id>965401</section_id>
<section_title><![CDATA[some sectiontitle]]></section_title>
<section_page_from>1</section_page_from>
<article_rec>
<title><![CDATA[Peanuts and other goobers]]></title>
<abstract>
<par><![CDATA[Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Vel turpis nunc eget lorem dolor sed viverra ipsum.]]></par>
</abstract>
<authors>
<au>
<first_name><![CDATA[Mickey]]></first_name>
<last_name><![CDATA[Mouse]]></last_name>
<affiliation><![CDATA[University of Southern North Dakota at Hoople]]></affiliation>
<role><![CDATA[Author]]></role>
</au>
<au>
<first_name><![CDATA[Donald]]></first_name>
<last_name><![CDATA[Duck]]></last_name>
<role><![CDATA[Author]]></role>
</au>
</authors>
</article_rec>
<article_rec>
<title><![CDATA[Efficient extraction of oil from peanuts]]></title>
<authors>
<au>
<first_name><![CDATA[Minnie]]></first_name>
<last_name><![CDATA[Mouse]]></last_name>
<affiliation><![CDATA[D Entertainment]]></affiliation>
<role><![CDATA[Author]]></role>
</au>
<au>
<first_name><![CDATA[Daisy]]></first_name>
<last_name><![CDATA[Duck]]></last_name>
<affiliation><![CDATA[Imagineers]]></affiliation>
<role><![CDATA[Author]]></role>
</au>
</authors>
</article_rec>
</section>
</content>
</proceeding>
01-11-2021 06:02 AM
Hello @jeg
In your code you do an unwind which creates N (aka 3) rows
but then you don't use the values from the unwind, but do another filter on the collection
You have to decide which of the two it is:
CALL apoc.load.xml(XML file) YIELD value
WITH [item in value._children WHERE item._type="conference_rec"] as conference_rec
return conference_rec;
or
CALL apoc.load.xml(XML file) YIELD value
UNWIND value._children AS conference
WITH conference WHERE conference._type="conference_rec"
return conference AS conference_rec;
Elaine
01-12-2021 11:48 PM
Thank you, that has gotten me going.
All the sessions of the conference are now available online