Description
While writing unit tests for XmlAssembler, I ran into a couple of issues. At first I set up a chain that reads a single GML file containing three FeatureMember elements. In my config I wanted to write an etree doc for every two elements, so I expected two documents: one with two elements and one with only the last element. I was surprised that no doc was written (to stdout). Here is my config:
# Config file for unit testing XmlAssembler.
[etl]
chains = input_glob_file|parse_xml_file|xml_assembler|output_std
[input_glob_file]
class = inputs.fileinput.GlobFileInput
file_path = tests/data/dummy.gml
# The source input file producing XML elements
[parse_xml_file]
class = filters.xmlelementreader.XmlElementReader
element_tags = FeatureMember
# Assembles etree docs gml:featureMember elements, each with "max_elements" elements
[xml_assembler]
class = filters.xmlassembler.XmlAssembler
max_elements = 2
container_doc = <?xml version="1.0" encoding="UTF-8"?>
    <gml:FeatureCollectionT10NL
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xmlns:top10nl="http://www.kadaster.nl/schemas/imbrt/top10nl/1.2"
        xmlns:brt="http://www.kadaster.nl/schemas/imbrt/brt-alg/1.0"
        xmlns:gml="http://www.opengis.net/gml/3.2"
        xsi:schemaLocation="http://www.kadaster.nl/schemas/imbrt/top10nl/1.2 http://www.kadaster.nl/schemas/top10nl/vyyyymmdd/TOP10NL_1_2.xsd">
    </gml:FeatureCollectionT10NL >
element_container_tag = FeatureCollectionT10NL
[output_std]
class = outputs.standardoutput.StandardOutput
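To make the expectation concrete (three elements, max_elements = 2, hence two documents), here is a minimal sketch of the batching I have in mind. It is plain Python and does not use Stetl; the function name assemble_batches and the placeholder element names are made up for illustration only.

```python
from typing import Iterable, Iterator, List


def assemble_batches(elements: Iterable[str], max_elements: int) -> Iterator[List[str]]:
    """Group elements into lists of at most max_elements.

    Hypothetical helper, not part of Stetl: it only illustrates the
    expected grouping, including the flush of a partial last batch.
    """
    batch: List[str] = []
    for element in elements:
        batch.append(element)
        if len(batch) >= max_elements:
            # Batch is full: emit it and start a new one.
            yield batch
            batch = []
    if batch:
        # Input exhausted: emit whatever is left over.
        yield batch


# Three placeholder elements standing in for the FeatureMember elements.
docs = list(assemble_batches(["fm1", "fm2", "fm3"], max_elements=2))
print(docs)
```

With three input elements this yields one batch of two followed by one batch of one, which is the output I expected from the chain.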
I suspected this check in XmlAssembler.consume_element:
if element is None or packet.is_end_of_stream() is True:
(Note that the "is True" comparison is redundant, but that doesn't matter.)
It indeed turned out that packet.is_end_of_stream() was true. I think this is already caused by GlobFileInput, an input class I added only yesterday. It could be that I don't properly understand when is_end_of_stream should be set to true, but I'm wondering whether a filter that can return multiple packets for one input packet (for example, when an XML file is parsed using XmlElementReader) should actually reset is_end_of_stream or is_end_of_doc.
When I skip this check, so that I only check for element is None, a new XML document is generated for every XML element, so I got 3 documents instead of the expected 2.
When I read all GML files in my test data directory (currently 3 files) by setting file_path to tests/data/*.gml in input_glob_file, I get either 6 documents (with the packet.is_end_of_stream() check) or 9 documents (without it). With 3 files I'm actually expecting 6 documents (3 x 2), namely a doc with 2 elements followed by a doc with 1 element, three times. However, with the check in place each of the 6 documents contains only one element, and only elements from the first 2 GML files appear. When disabling the aforementioned check I get 9 docs, each with one element.
So, my question is how packet.is_end_of_stream and packet.is_end_of_doc should actually behave. Should they be reset when one input packet results in multiple output packets for a particular component? Or is there more to it?
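As one possible reading of the semantics I'm asking about, the sketch below has a one-to-many filter carry the end-of-doc and end-of-stream flags only on the last packet it emits for a given input packet. This is an assumption for discussion, not a description of how Stetl works: the Packet, read_elements and Assembler names are hypothetical stand-ins written in plain Python, and they do not mirror Stetl's own classes.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class Packet:
    """Hypothetical stand-in for a pipeline packet; not Stetl's Packet class."""
    data: Optional[object] = None
    end_of_doc: bool = False
    end_of_stream: bool = False


def read_elements(file_packet: Packet) -> Iterator[Packet]:
    """One-to-many filter: emit one packet per element of the incoming file.

    Only the last emitted packet carries the end-of-doc / end-of-stream flags.
    """
    elements = list(file_packet.data or [])
    if not elements:
        # Nothing to emit: pass the flags on so they are not lost.
        yield Packet(None, end_of_doc=True, end_of_stream=file_packet.end_of_stream)
        return
    for index, element in enumerate(elements):
        is_last = index == len(elements) - 1
        yield Packet(
            data=element,
            end_of_doc=is_last,
            end_of_stream=is_last and file_packet.end_of_stream,
        )


class Assembler:
    """Hypothetical assembler: flushes when full or when the doc/stream ends."""

    def __init__(self, max_elements: int) -> None:
        self.max_elements = max_elements
        self.pending: List[object] = []

    def consume(self, packet: Packet) -> Optional[List[object]]:
        if packet.data is not None:
            self.pending.append(packet.data)
        flush = (
            len(self.pending) >= self.max_elements
            or packet.end_of_doc
            or packet.end_of_stream
        )
        if flush and self.pending:
            doc, self.pending = self.pending, []
            return doc
        return None


# Three placeholder files with three elements each, as in my test data set-up.
files = [["a1", "a2", "a3"], ["b1", "b2", "b3"], ["c1", "c2", "c3"]]
assembler = Assembler(max_elements=2)
docs = []
for i, elements in enumerate(files):
    last_file = i == len(files) - 1
    file_packet = Packet(data=elements, end_of_doc=True, end_of_stream=last_file)
    for packet in read_elements(file_packet):
        doc = assembler.consume(packet)
        if doc is not None:
            docs.append(doc)

print([len(doc) for doc in docs])
```

Under these assumed semantics the three files produce six documents, alternating between two elements and one element, which matches what I expected from the real chain.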
I've attached my unit test file. The method test_execute is just a work-in-progress.
test_xml_assembler.zip