Adding a subelement iteratively with lxml
I want to add subelements iteratively, e.g. from a list, at a specific position in an XML tree, using an XPath expression and a tag. Since my original code has many subnodes, I don't want to use other functions such as etree.Element or etree.SubElement.
For my provided example, I tried:
from lxml import etree
tree = etree.parse('example.xml')
root = tree.getroot()
new_subelements = ['<year></year>', '<trackNumber></trackNumber>']
destination = root.xpath('/interpretLibrary/interprets/interpretNames/interpret[2]/information')[2]
for element in new_subelements:
    add_element = etree.fromstring(element)
    destination.insert(-1, add_element)
But this doesn't work.
The initial example.xml file:
<interpretLibrary>
<interprets>
<interpretNames>
<interpret>
<information>
<name>Queen</name>
<album></album>
</information>
<interpret>
<interpret>
<information>
<name>Michael Jackson</name>
<album></album>
</information>
<interpret>
<interpret>
<information>
<name>U2</name>
<album></album>
</information>
</interpret>
</interpretNames>
</interprets>
</interpretLibrary>
The output example.xml I want to produce:
<interpretLibrary>
<interprets>
<interpretNames>
<interpret>
<information>
<name>Queen</name>
<album></album>
</information>
<interpret>
<interpret>
<information>
<name>Michael Jackson</name>
<album></album>
</information>
<interpret>
<interpret>
<information>
<name>U2</name>
<album></album>
<year></year>
<trackNumber></trackNumber>
</information>
<interpret>
</interpretNames>
</interprets>
</interpretLibrary>
Is there any better solution?
1 answer
-
answered 2022-04-28 10:52
Jack Fleeting
Your sample XML in the question is not well formed (the closing <interpret> elements should be closed with </interpret>). Assuming you fix that, you are almost there, though some changes are necessary:
new_subelements = ['<year></year>', '<trackNumber></trackNumber>']
for element in new_subelements:
    destination = root.xpath('//interpretNames/interpret[last()]/information/*[last()]')[0]
    add_element = etree.fromstring(element)
    destination.addnext(add_element)
print(etree.tostring(root).decode())
The output should be your sample expected output.
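A self-contained sketch of a closely related approach, assuming the closing tags have been fixed: instead of calling addnext on the last child, it appends directly to the <information> element. The inline XML string is a trimmed, hypothetical copy of example.xml, and the fallback import is an assumption made so the snippet also runs without lxml installed.

```python
# Prefer lxml; fall back to the standard library, which offers the same
# fromstring/find/append calls used below (assumption: only this subset).
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# Trimmed, well-formed stand-in for example.xml (hypothetical inline copy).
XML = """<interpretLibrary><interprets><interpretNames>
<interpret><information><name>Queen</name><album></album></information></interpret>
<interpret><information><name>U2</name><album></album></information></interpret>
</interpretNames></interprets></interpretLibrary>"""

root = etree.fromstring(XML)
new_subelements = ['<year></year>', '<trackNumber></trackNumber>']

# Select the <information> element of the last <interpret>.
destination = root.find('.//interpretNames/interpret[last()]/information')
for element in new_subelements:
    # append() places each parsed element after the current last child.
    destination.append(etree.fromstring(element))

child_tags = [child.tag for child in destination]
print(child_tags)
```

Because append always adds at the end, the list order is preserved without recomputing the destination on every iteration.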
See also questions close to this topic
-
Python File Tagging System does not retrieve nested dictionaries in dictionary
I am building a file tagging system using Python. The idea is simple. Given a directory of files (and files within subdirectories), I want to filter them out using a filter input and tag those files with a word or a phrase.
If I got the following contents in my current directory:
data/ budget.xls world_building_budget.txt a.txt b.exe hello_world.dat world_builder.spec
and I execute the following command in the shell:
py -3 tag_tool.py -filter=world -tag="World-Building Tool"
My output will be:
These files were tagged with "World-Building Tool": data/ world_building_budget.txt hello_world.dat world_builder.spec
My current output isn't exactly like this but basically, I am converting all files and files within subdirectories into a single dictionary like this:
def fs_tree_to_dict(path_):
    file_token = ''
    for root, dirs, files in os.walk(path_):
        tree = {d: fs_tree_to_dict(os.path.join(root, d)) for d in dirs}
        tree.update({f: file_token for f in files})
        return tree
Right now, my dictionary looks like this: key:''. In the following function, I am turning the empty values '' into empty lists (to hold my tags):
def empty_str_to_list(d):
    for k, v in d.items():
        if v == '':
            d[k] = []
        elif isinstance(v, dict):
            empty_str_to_list(v)
When I run my entire code, this is my output:
hello_world.dat ['World-Building Tool'] world_builder.spec ['World-Building Tool']
But it does not see data/world_building_budget.txt. This is the full dictionary:
{'data': {'world_building_budget.txt': []}, 'a.txt': [], 'hello_world.dat': [], 'b.exe': [], 'world_builder.spec': []}
This is my full code:
import os, argparse def fs_tree_to_dict(path_): file_token = '' for root, dirs, files in os.walk(path_): tree = {d: fs_tree_to_dict(os.path.join(root, d)) for d in dirs} tree.update({f: file_token for f in files}) return tree def empty_str_to_list(d): for k, v in d.items(): if v == '': d[k] = [] elif isinstance(v, dict): empty_str_to_list(v) parser = argparse.ArgumentParser(description="Just an example", formatter_class=argparse.ArgumentDefaultsHelpFormatter) parser.add_argument("--filter", action="store", help="keyword to filter files") parser.add_argument("--tag", action="store", help="a tag phrase to attach to a file") parser.add_argument("--get_tagged", action="store", help="retrieve files matching an existing tag") args = parser.parse_args() filter = args.filter tag = args.tag get_tagged = args.get_tagged current_dir = os.getcwd() files_dict = fs_tree_to_dict(current_dir) empty_str_to_list(files_dict) for k, v in files_dict.items(): if filter in k: if v == []: v.append(tag) print(k, v) elif isinstance(v, dict): empty_str_to_list(v) if get_tagged in v: print(k, v)
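The final loop above only looks at the top-level keys, so files inside nested dictionaries such as data are never checked. A minimal sketch of a recursive version follows; the function name tag_matching_files and its prefix parameter are hypothetical, and the dictionary is the one shown in the question.

```python
def tag_matching_files(tree, keyword, tag, prefix=""):
    """Recursively append `tag` to every file whose name contains `keyword`.

    `tree` maps names to either a list of tags (a file) or a nested dict
    (a directory). Returns the relative paths that were tagged.
    """
    tagged = []
    for name, value in tree.items():
        path = prefix + name
        if isinstance(value, dict):
            # Descend into the subdirectory instead of skipping it.
            tagged += tag_matching_files(value, keyword, tag, path + "/")
        elif keyword in name:
            if tag not in value:
                value.append(tag)
            tagged.append(path)
    return tagged


files_dict = {
    'data': {'world_building_budget.txt': []},
    'a.txt': [],
    'hello_world.dat': [],
    'b.exe': [],
    'world_builder.spec': [],
}
tagged = tag_matching_files(files_dict, 'world', 'World-Building Tool')
print(tagged)
```

The same recursion pattern is already used by empty_str_to_list; the difference is that the matching and tagging happen inside the recursive function rather than in a flat loop afterwards.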
-
I am working on a project, and pip fails with an error saying there is no module named pip_internal. I am using PyCharm with a conda interpreter. Please help.
File "C:\Users\pjain\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
File "C:\Users\pjain\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
File "C:\Users\pjain\AppData\Local\Programs\Python\Python310\Scripts\pip.exe\__main__.py", line 4, in <module>
File "C:\Users\pjain\AppData\Local\Programs\Python\Python310\lib\site-packages\pip\_internal\__init__.py", line 4, in <module>
    from pip_internal.utils import _log
-
Looping the function if the input is not string
I'm new to Python. My homework is to write a function that checks whether an item exists in a dictionary.
inventory = {"apple" : 50, "orange" : 50, "pineapple" : 70, "strawberry" : 30}

def check_item():
    x = input("Enter the fruit's name: ")
    if not x.isalpha():
        print("Error! You need to type the name of the fruit")
    elif x in inventory:
        print("Fruit found:", x)
        print("Inventory available:", inventory[x], "KG")
    else:
        print("Fruit not found")

check_item()
I want the function to loop again only if the input is not a string of letters. I tried adding return under print("Error! You need to type the name of the fruit"), but that didn't work.
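One way to repeat the prompt is a while loop that only exits once the entry passes the isalpha() check. This is a sketch rather than the only possible fix; the read parameter is a hypothetical addition that defaults to input so the function can also be driven by canned answers.

```python
inventory = {"apple": 50, "orange": 50, "pineapple": 70, "strawberry": 30}


def check_item(read=input):
    """Keep asking until the entry is alphabetic, then look it up."""
    while True:
        x = read("Enter the fruit's name: ")
        if x.isalpha():
            break
        # Invalid entry: report it and go around the loop again.
        print("Error! You need to type the name of the fruit")
    if x in inventory:
        print("Fruit found:", x)
        print("Inventory available:", inventory[x], "KG")
        return inventory[x]
    print("Fruit not found")
    return None
```

Calling check_item() with no arguments prompts on the console as before. A bare return after the error message cannot work here, because it ends the function instead of restarting it.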
-
Wikipedia scraping problem: how can I get the content of all tags inside one tag?
I have a problem scraping a Wikipedia page. I want to get the definition of the title, but the tag I'm interested in contains many other tags, and I don't understand how to get the content of all of them.
There may be a simpler way to scrape the definition, but I don't know one.
import requests
from lxml import html

def connect(subject: str):
    subject = subject.replace(' ', '_')
    responce = requests.get(f'https://ru.wikipedia.org/wiki/{subject}')
    if responce.status_code == 404:
        print('404 not found')
    with open('Scrab/test.html', 'w+', encoding='utf-8') as f:
        f.write(responce.text)
    tree = html.fromstring(responce.text)
    definition = tree.xpath('//div[@class="mw-parser-output"]/p')
    print(definition[0].text)
This is how I tried to get the content of the <p> tag.
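The .text attribute only returns the text before the first child tag, which is why nested markup cuts the definition short. A small sketch of collecting all nested text with itertext() follows. It uses the standard library on a hypothetical, well-formed snippet so that it runs without network access; lxml elements provide the same itertext() method, and lxml.html elements also offer text_content().

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a Wikipedia paragraph with nested tags.
snippet = (
    '<div class="mw-parser-output">'
    '<p><b>Python</b> is a <a href="#">programming language</a>.</p>'
    '</div>'
)

div = ET.fromstring(snippet)
first_p = div.find('p')

# .text stops at the first child tag; here the paragraph starts with <b>,
# so there is no leading text at all.
print(repr(first_p.text))

# itertext() walks every text node inside the element, in document order.
definition = "".join(first_p.itertext())
print(definition)
```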
-
extract attributes from span
I have an lxml file and I need some content from it. The file structure looks like this:
<span class="ocr_line" id="line_1_1" title="bbox 394 185 1993 247">
  <span class="ocrx_word" id="word_1_1" title="bbox 394 191 535 242; x_entity company_name 0 ; baseline 394 242.21 535 242.21; x_height 208.14; x_style sansSerif bold none">1908</span>
I want to extract everything, but only from the <span class="ocrx_word"> elements. I already get those with:
from bs4 import BeautifulSoup as bs

with open("/home/neichfel/Documents/test.xml", "r") as file:
    # Read each line in the file, readlines() returns a list of lines
    content = file.readlines()
    # Combine the lines in the list into a string
    content = "".join(content)
    bs_content = bs(content, "lxml")
    ocrx_words = bs_content.findAll("span", {"class": "ocrx_word"})
    print(ocrx_words)
I have been struggling with the rest for days. From this list (ocrx_words) I need the x_entity value from each element's title attribute, together with the text inside the span. Sometimes x_entity is empty and sometimes it has a value. I already get the span text with:
lines_structure = []
for line in ocrx_words:
    line_text = line.text.replace("\n", " ").strip()
    lines_structure.append(line_text)
print(lines_structure)
What I want in the end is a list of
x_entity | text
Afterwards I convert it into a DataFrame, which I already know how to do. The only problem is extracting x_entity.
Sorry if this is messy, I'm new to programming. Thanks!
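Since the title attribute is a semicolon-separated string, the x_entity field can be pulled out with plain string handling. The helper name parse_x_entity is hypothetical, and the sketch assumes the field always starts with the literal word x_entity, as in the sample above.

```python
def parse_x_entity(title):
    """Return the value after 'x_entity' in a title string, or '' if empty."""
    for field in title.split(";"):
        parts = field.split()
        if parts and parts[0] == "x_entity":
            # Everything after the keyword is the entity value.
            return " ".join(parts[1:])
    return ""


# Title string copied from the sample span (shortened at the end).
title = ("bbox 394 191 535 242; x_entity company_name 0 ; "
         "baseline 394 242.21 535 242.21; x_height 208.14")

# One (x_entity, text) pair per word; the text here is the sample's "1908".
rows = [(parse_x_entity(title), "1908")]
print(rows)
```

With the BeautifulSoup list from the question, the pairs could be built as [(parse_x_entity(w["title"]), w.text.strip()) for w in ocrx_words].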
-
Iterating through XMLs, making dataframes from nodes and merging them with a master dataframe. How should I optimize this code?
I'm trying to iterate through a lot of xml files that have ~1000 individual nodes that I want to iterate through to extract specific attributes (each node has 15 or so attributes, I only want one). In the end, there should be about 4 million rows. My code is below, but I have a feeling that it's not time efficient. What can I optimize about this?
import os, pandas as pd, xml.etree.ElementTree as xml

#init master df as accumulator of temp dfs
master_df = pd.DataFrame(
    columns = [
        'col1',
        'col2',
        'col3',
        'col4'
    ])
dir = 'C:\\somedir'

#iterate through files
for file in os.listdir(dir):
    #init xml handle and parse
    file = open(str(dir+"{}").format('\\'+file))
    parse = xml.parse(file)
    root = parse.getroot()
    #var assignments with desired data
    parent_node1 = str(root[0][0].get('pn1'))
    parent_node2 = str(root[0][1].get('pn2'))
    #resetting iteration dependent variables
    count = 0
    a_dict = {}
    #iterating through list of child nodes
    for i in list(root[1].iter())[1:]:
        child_node1 = str(i.get('cn1'))
        child_node2 = str(i.get('cn2'))
        a_dict.update({
            count: {
                "col1" : parent_node1,
                'col2': child_node1,
                "col3": parent_node2,
                "col4" : child_node2
            }})
        count = count+1
    temp_df = pd.DataFrame(a_dict).T
    master_df = pd.merge(
        left = master_df,
        right = temp_df,
        how = 'outer'
    )
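A likely cost in the code above is merging a growing master DataFrame once per file. A common alternative is to collect plain row dictionaries in one list and build the DataFrame a single time at the end. The sketch below shows only the collection step with the standard library; the helper rows_from_root and the two inline documents are hypothetical stand-ins for the real files.

```python
import xml.etree.ElementTree as ET


def rows_from_root(root):
    """Build one row dict per child node, matching the question's columns."""
    parent_node1 = str(root[0][0].get('pn1'))
    parent_node2 = str(root[0][1].get('pn2'))
    return [
        {'col1': parent_node1, 'col2': str(node.get('cn1')),
         'col3': parent_node2, 'col4': str(node.get('cn2'))}
        # iter() yields the container itself first, so skip it.
        for node in list(root[1].iter())[1:]
    ]


# Two tiny hypothetical documents standing in for the files on disk.
docs = [
    '<r><p><a pn1="x"/><b pn2="y"/></p>'
    '<c><n cn1="1" cn2="2"/><n cn1="3" cn2="4"/></c></r>',
    '<r><p><a pn1="u"/><b pn2="v"/></p><c><n cn1="5" cn2="6"/></c></r>',
]

all_rows = []
for doc in docs:
    all_rows.extend(rows_from_root(ET.fromstring(doc)))

print(len(all_rows))
# With pandas installed: master_df = pd.DataFrame(all_rows), built once.
```

Appending to a list is cheap, whereas each merge copies all rows accumulated so far, so the per-file merge grows more expensive as the master frame grows.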
-
Writing edited XML file (python)
For a project I want to edit an XML file and save it under a new name. This worked in the past, but now for some reason I get an error. This is the code I wrote:
i = 0 while i < len(dict_Lines["Columns"]): LineNumber = str(dict_Lines["Columns"][i]) LineID = int(LineNumber[1:]) j = 0 while j < len(List_Points_X): Name = 'Row' + str(j + 1) if LineNumber in dict_Rows_Columns[Name]: Spacing = dict_Spacing[Name] j = j + 1 mytree = ET.parse(Pad_XML_Model) myroot = mytree.getroot() myroot[8][0].remove(myroot[8][0][LineID]) m = str(Pad_Structuren_Kolomverlies + '\\' + 'Structuur' + Linenumber + 'ZK' + '.xml') ET.indent(mytree, space="\t", level=0) mytree.write(m, encoding="utf-8") k = 0 while k < len(dict_Points_On_Column[LineNumber]): Point = dict_dict_Points_On_Column[LineNumber][k] PointID = int(Point[1:]) XCoord = int(dict_Points[Point][0]) ZCoord1 = dict_Points[Point][2] Dsplacement = float(0.2 * Spacing) ZCoord = int(ZCoord1 - Displacement) print(myroot[7][0][PointID].attrib) myroot[7][0][PointID].clear() myroot[7][0][PointID].set('id', PointID) myroot[7][0][PointID].set('nm', Point) myroot = ET.SubElement(myroot[7][0][PointID], 'p0') myroot.set('v', Point) myroot = ET.SubElement(myroot[7][0][PointID], 'p1') myroot.set('v', XCoord) myroot = ET.SubElement(myroot[7][0][PointID], 'p2') myroot.set('v', '0') myroot = ET.SubElement(myroot[7][0][PointID], 'p3') myroot.set('v', ZCoord) k = k + 1 m = str(Pad_Pad_Structuren_Knooppuntsverplaatsing + '\\' + 'Structuur' + '.xml') ET.indent(mytree, space="\t", level=0) mytree.write(m, encoding="utf-8") i = i+1
But for some reason I get the error "child index out of range" on the line with myroot[7][0][PointID]. Even if I just write myroot[7] I get the error, although it works fine in my other projects. I checked the XML multiple times and myroot[7][0] should exist. PointID is just an integer from 1 to 15.
It looks like myroot isn't working anymore. When I delete every line that uses myroot, I get the following error:
Traceback (most recent call last):
  File "C:\Users\niels\AppData\Local\Programs\Python\Python310\lib\xml\etree\ElementTree.py", line 762, in _get_writer
    write = file_or_filename.write
AttributeError: 'str' object has no attribute 'write'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\niels\PycharmProjects\ScriptUitrekenen\Script\ ScriptAanpassenStructurenDictionary.py", line 313, in <module>
    mytree.write(m, encoding="utf-8")
  File "C:\Users\niels\AppData\Local\Programs\Python\Python310\lib\xml\etree\ElementTree.py", line 748, in write
    serialize(write, self._root, qnames, namespaces,
  File "C:\Users\niels\AppData\Local\Programs\Python\Python310\lib\xml\etree\ElementTree.py", line 913, in _serialize_xml
    _serialize_xml(write, e, qnames, None,
  File "C:\Users\niels\AppData\Local\Programs\Python\Python310\lib\xml\etree\ElementTree.py", line 913, in _serialize_xml
    _serialize_xml(write, e, qnames, None,
  File "C:\Users\niels\AppData\Local\Programs\Python\Python310\lib\xml\etree\ElementTree.py", line 913, in _serialize_xml
    _serialize_xml(write, e, qnames, None,
  File "C:\Users\niels\AppData\Local\Programs\Python\Python310\lib\xml\etree\ElementTree.py", line 906, in _serialize_xml
    v = _escape_attrib(v)
  File "C:\Users\niels\AppData\Local\Programs\Python\Python310\lib\xml\etree\ElementTree.py", line 1075, in _escape_attrib
    _raise_serialization_error(text)
  File "C:\Users\niels\AppData\Local\Programs\Python\Python310\lib\xml\etree\ElementTree.py", line 1029, in _raise_serialization_error
    raise TypeError(
TypeError: cannot serialize 2 (type int)
I think something is wrong with myroot or mytree but I can't find the solution. The problem occurs after the last while. Before the last while the myroot.write works fine but after the while I get this error. Can someone help me with this?
Thanks in advance, Niels
EDIT The XML file I want to edit looks like this:
<container id="{39A7F468-A0D4-4DFF-8E5C-5843E1807D13}" t="EP_DSG_Elements.EP_StructNode.1">
  <table id="7B376840-9267-4296-B40D-30897D7E33D3" t="EP_DSG_Elements.EP_StructNode.1" name="Node">
    <h>
      <h0 t="Name"/>
      <h1 t="Coord X"/>
      <h2 t="Coord Y"/>
      <h3 t="Coord Z"/></h>
    <obj id="1" nm="K1">
      <p0 v="K1"/>
      <p1 v="0"/>
      <p2 v="0"/>
      <p3 v="0"/></obj>
    <obj id="2" nm="K2">
      <p0 v="K2"/>
      <p1 v="0"/>
      <p2 v="0"/>
      <p3 v="3.6000000000000001"/></obj>
    <obj id="3" nm="K3">
      <p0 v="K3"/>
      <p1 v="0"/>
      <p2 v="0"/>
      <p3 v="7.2000000000000002"/></obj>
It is the 8th container (index 7) and I just want to edit the p1 and p3 of some objects. Which ones I want to edit varies, which is why I used indices to select the right object. I could look up the object by id or nm, but this is not the only place these names are used in the XML. How can I make sure I only change them here?
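Two things in the posted code stand out. The TypeError "cannot serialize 2 (type int)" is what ElementTree raises when an attribute was set to an int, as in set('id', PointID), and the loop rebinds myroot to the new SubElement, so later myroot[7] indexes into that child instead of the document root. The sketch below addresses both on a hypothetical cut-down model; the function name move_node is an assumption, and it selects objects through the table's name attribute so that identical nm values elsewhere are not touched.

```python
import xml.etree.ElementTree as ET

# Hypothetical cut-down model containing only the Node table.
model = ET.fromstring(
    '<project><container><table name="Node">'
    '<obj id="1" nm="K1"><p0 v="K1"/><p1 v="0"/><p2 v="0"/><p3 v="0"/></obj>'
    '<obj id="2" nm="K2"><p0 v="K2"/><p1 v="0"/><p2 v="0"/><p3 v="3.6"/></obj>'
    '</table></container></project>'
)


def move_node(root, name, x, z):
    """Update p1 (X) and p3 (Z) of the named obj inside the Node table only."""
    obj = root.find(f".//table[@name='Node']/obj[@nm='{name}']")
    if obj is None:
        raise KeyError(name)
    # Attribute values must be strings, otherwise serialization fails.
    obj.find('p1').set('v', str(x))
    obj.find('p3').set('v', str(z))
    return obj


# The root variable is never reassigned, so it stays valid for later lookups.
move_node(model, 'K2', 5, 3)
output = ET.tostring(model, encoding='unicode')
print(output)
```

Editing the existing p1 and p3 elements in place also avoids clear() followed by rebuilding every child.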
-
Python - using element tree to get data from specific nodes in xml
I have been looking around and there are a lot of similar questions, but none that solved my issue sadly.
My XML file looks like this
<?xml version="1.0" encoding="utf-8"?>
<Nodes>
  <Node ComponentID="1">
    <Settings>
      <Value name="Text Box (1)"> SettingA </Value>
      <Value name="Text Box (2)"> SettingB </Value>
      <Value name="Text Box (3)"> SettingC </Value>
      <Value name="Text Box (4)"> SettingD </Value>
      <AdvSettings State="On"/>
    </Settings>
  </Node>
  <Node ComponentID="2">
    <Settings>
      <Value name="Text Box (1)"> SettingA </Value>
      <Value name="Text Box (2)"> SettingB </Value>
      <Value name="Text Box (3)"> SettingC </Value>
      <Value name="Text Box (4)"> SettingD </Value>
      <AdvSettings State="Off"/>
    </Settings>
  </Node>
  <Node ComponentID="3">
    <Settings>
      <Value name="Text Box (1)"> SettingG </Value>
      <Value name="Text Box (2)"> SettingH </Value>
      <Value name="Text Box (3)"> SettingI </Value>
      <Value name="Text Box (4)"> SettingJ </Value>
      <AdvSettings State="Yes"/>
    </Settings>
  </Node>
</Nodes>
With Python I'm trying to get the values of text box 1 and text box 2 for each Node that has "AdvSettings" set to On.
So in this case I would like a result like
ComponentID  State  Textbox1  Textbox2
1            On     SettingA  SettingB
3            On     SettingG  SettingH
I have made some attempts but didn't get far. With this I managed to get the AdvSettings tag, but that's as far as I got:
import xml.etree.ElementTree as ET

tree = ET.parse('XMLSearch.xml')
root = tree.getroot()
for AdvSettings in root.iter('AdvSettings'):
    print(AdvSettings.tag, AdvSettings.attrib)
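One way to get from the AdvSettings tag back to the text boxes is to iterate over the Node elements instead and check each node's AdvSettings state. The sketch below uses a shortened, hypothetical copy of the XML. Note that in the posted file Node 3 has State="Yes", so a strict comparison with "On" would not include it.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for XMLSearch.xml (two nodes, two text boxes each).
xml_text = """<Nodes>
  <Node ComponentID="1"><Settings>
    <Value name="Text Box (1)"> SettingA </Value>
    <Value name="Text Box (2)"> SettingB </Value>
    <AdvSettings State="On"/>
  </Settings></Node>
  <Node ComponentID="2"><Settings>
    <Value name="Text Box (1)"> SettingA </Value>
    <Value name="Text Box (2)"> SettingB </Value>
    <AdvSettings State="Off"/>
  </Settings></Node>
</Nodes>"""

root = ET.fromstring(xml_text)
results = []
for node in root.iter('Node'):
    adv = node.find('Settings/AdvSettings')
    if adv is None or adv.get('State') != 'On':
        continue
    # Map each Value's name attribute to its stripped text.
    values = {v.get('name'): (v.text or '').strip() for v in node.iter('Value')}
    results.append((node.get('ComponentID'), adv.get('State'),
                    values.get('Text Box (1)'), values.get('Text Box (2)')))

print(results)
```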
-
For loop can't generate xml elements with varying contents
I was trying to make a ProPresenter 6 presentation file (.pro6) generator that takes either a short or a long string as input. For some reason, the for loop does not do its job correctly. The for loop should generate another XML tag named "RVDisplaySlide" with the same tags and contents parsed from a template.pro6 file, then replace the contents with the string input.
For a short string input it works, as it generates one "slide" tag. However, for long string inputs, which get split into a list of strings that fit the text box element, it generates the same tags WITH THE SAME content as the first one.
The full code can be found in this hastebin link: https://www.toptal.com/developers/hastebin/hidefogizi.py
For now, to simplify how the problem looked like, I commented the codeblock that would generate the content and left the code that should generate different uuid values to the elements. The output is similar still.
Here's an example of what I meant:
>>> a = ToPro6("""[Connection Terminated] I'm sorry to interrupt you, Elizabeth, if you still even remember that name, but I'm afraid you've been misinformed. You are not here to receive a gift, nor have you been called here by the individual you assume, although, you have indeed been called. You have all been called here, into a labyrinth of sounds and smells, misdirection and misfortune. A labyrinth with no exit, a maze with no prize. You don't even realize that you are trapped. Your lust for blood has driven you in endless circles, chasing the cries of children in some unseen chamber, always seeming so near, yet somehow out of reach, but you will never find them. None of you will. This is where your story ends. And to you, my brave volunteer, who somehow found this job listing not intended for you, although there was a way out planned for you, I have a feeling that's not what you want. I have a feeling that you are right where you want to be. I am remaining as well. I am nearby. This place will not be remembered, and the memory of everything that started this can finally begin to fade away, as the agony of every tragedy should. And to you monsters trapped in the corridors, be still and give up your spirits. They don't belong to you. For most of you, I believe there is peace and perhaps more waiting for you after the smoke clears. Although, for one of you, the darkest pit of Hell has opened to swallow you whole, so don't keep the devil waiting, old friend. My daughter, if you can hear me, I knew you would return as well. It's in your nature to protect the innocent. I'm sorry that on that day, the day you were shut out and left to die, no one was there to lift you up into their arms the way you lifted others into yours, and then, what became of you. I should have known you wouldn't be content to disappear, not my daughter. I couldn't save you then, so let me save you now. It's time to rest - for you, and for those you have carried in your arms. 
This ends for all of us. [End Communication]""", "fnaf6_speech")
>>> a.save("..")
<Element 'array' at 0x0000018847E9BA10>
'Succeed'
XML output:
...
<RVSlideGrouping name="" color="1 1 1 0" uuid="709FF810-7A39-46AD-8A4E-03E592A0AFB1">
  <array rvXMLIvarName="slides">
    <RVDisplaySlide backgroundColor="0 0 0 1" highlightColor="" drawingBackgroundColor="false" enabled="true" hotKey="" label="" notes="" UUID="0380C4D4-FD26-4768-9612-28AEE0C05894" chordChartPath="" />
    <RVDisplaySlide backgroundColor="0 0 0 1" highlightColor="" drawingBackgroundColor="false" enabled="true" hotKey="" label="" notes="" UUID="0380C4D4-FD26-4768-9612-28AEE0C05894" chordChartPath="" />
    <RVDisplaySlide backgroundColor="0 0 0 1" highlightColor="" drawingBackgroundColor="false" enabled="true" hotKey="" label="" notes="" UUID="0380C4D4-FD26-4768-9612-28AEE0C05894" chordChartPath="" />
    <RVDisplaySlide backgroundColor="0 0 0 1" highlightColor="" drawingBackgroundColor="false" enabled="true" hotKey="" label="" notes="" UUID="0380C4D4-FD26-4768-9612-28AEE0C05894" chordChartPath="" />
    <RVDisplaySlide backgroundColor="0 0 0 1" highlightColor="" drawingBackgroundColor="false" enabled="true" hotKey="" label="" notes="" UUID="0380C4D4-FD26-4768-9612-28AEE0C05894" chordChartPath="" />
    <RVDisplaySlide backgroundColor="0 0 0 1" highlightColor="" drawingBackgroundColor="false" enabled="true" hotKey="" label="" notes="" UUID="0380C4D4-FD26-4768-9612-28AEE0C05894" chordChartPath="" />
    <RVDisplaySlide backgroundColor="0 0 0 1" highlightColor="" drawingBackgroundColor="false" enabled="true" hotKey="" label="" notes="" UUID="0380C4D4-FD26-4768-9612-28AEE0C05894" chordChartPath="" />
  </array>
</RVSlideGrouping>
...
As you can see, it generated the same uuid for each element. Is there a way to fix this?
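The full code is only available through the link, so the cause cannot be confirmed here. One common way to end up with identical slides is appending the same template element object on every iteration: each later change then shows up in every slide. A minimal sketch of the usual remedy follows, copying the template per slide and assigning a fresh UUID. The one-line template and the chunk list are hypothetical.

```python
import copy
import uuid
import xml.etree.ElementTree as ET

# Hypothetical minimal template; the real one is parsed from a .pro6 file.
template_slide = ET.fromstring('<RVDisplaySlide label="" UUID="TEMPLATE"/>')
slides = ET.Element('array', {'rvXMLIvarName': 'slides'})

chunks = ['first part', 'second part', 'third part']
for chunk in chunks:
    # deepcopy gives an independent element; appending the same object
    # repeatedly would make every slide share one set of attributes.
    slide = copy.deepcopy(template_slide)
    slide.set('UUID', str(uuid.uuid4()).upper())
    slide.set('notes', chunk)
    slides.append(slide)

uuids = [s.get('UUID') for s in slides]
print(uuids)
```

The template itself stays unchanged, so it can be reused for any number of slides.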
-
Extract panda dataframe from .xml file
I have a .xml file with following contents:
<detailedreport xmlns:xsi="http://"false">
  <severity level="5">
    <category categoryid="3" categoryname="Buffer Overflow" pcirelated="false">
      <cwe cweid="121" cwename="Stack-based Buffer Overflow" pcirelated="false" sans="120" certc="1160">
        <description>
          <text text="code."/>
        </description>
        <staticflaws>
          <flaw severity="5" categoryname="Stack-based Buffer Overflow" count="1" issueid="6225" module="Jep" type="strcpy" description="This call to strcpy() contains a buffer overflow. The source string has an allocated size of 80 bytes " note="" cweid="121" remediationeffort="2" exploitLevel="0" categoryid="3" pcirelated="false">
            <exploitability_adjustments>
              <exploitability_adjustment score_adjustment="0">
              </exploitability_adjustment>
            </exploitability_adjustments>
          </flaw>
        </staticflaws>
      </cwe>
    </category>
  </severity>
</detailedreport>
Below is the python program to extract some of the fields from the .xml file under the "flaw" tag. But when I print the fields in python program, they are empty.
import pandas as pd
from lxml import etree

root = etree.parse(r'fps_change.xml')
xroot = root.getroot()
df_cols = ["categoryname", "issueid", "module"]
rows = []
for node in xroot:
    #s_name = node.attrib.get("name")
    s_categoryname = node.find("categoryname")
    s_issueid = node.find("issueid")
    s_module = node.find("module")
    rows.append({"categoryname": s_categoryname, "issueid": s_issueid, "module": s_module})
out_df = pd.DataFrame(rows, columns=df_cols)
print(out_df) #this prints empty.
Expected Output:
Stack-based Buffer Overflow 6225 Jep
What changes should I make to my program to get the expected output?
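In the posted code, find() searches for child elements named categoryname, issueid and module, but those are attributes of the flaw element, and the loop only visits the direct children of the root. A sketch of reading the attributes with get() while iterating over every flaw follows. It uses the standard library and a shortened, hypothetical copy of the report; if the real file declares a default namespace, the tag passed to iter() would need that namespace prefix.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for fps_change.xml with a single flaw.
xml_text = """<detailedreport><severity level="5"><category categoryid="3">
<cwe cweid="121"><staticflaws>
<flaw severity="5" categoryname="Stack-based Buffer Overflow" issueid="6225" module="Jep"/>
</staticflaws></cwe></category></severity></detailedreport>"""

root = ET.fromstring(xml_text)
df_cols = ["categoryname", "issueid", "module"]

# iter() searches the whole tree; get() reads attributes, whereas find()
# looks for child elements and returns None for attribute names.
rows = [{col: flaw.get(col) for col in df_cols} for flaw in root.iter("flaw")]
print(rows)
# With pandas installed: out_df = pd.DataFrame(rows, columns=df_cols)
```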