Extract pandas dataframe from .xml file
I have a .xml file with the following contents:
<detailedreport xmlns:xsi="http://"false">
<severity level="5">
<category categoryid="3" categoryname="Buffer Overflow" pcirelated="false">
<cwe cweid="121" cwename="Stack-based Buffer Overflow" pcirelated="false" sans="120" certc="1160">
<description>
<text text="code."/>
</description>
<staticflaws>
<flaw severity="5" categoryname="Stack-based Buffer Overflow" count="1" issueid="6225" module="Jep" type="strcpy" description="This call to strcpy() contains a buffer overflow. The source string has an allocated size of 80 bytes " note="" cweid="121" remediationeffort="2" exploitLevel="0" categoryid="3" pcirelated="false">
<exploitability_adjustments>
<exploitability_adjustment score_adjustment="0">
</exploitability_adjustment>
</exploitability_adjustments>
</flaw>
</staticflaws>
</cwe>
</category>
</severity>
</detailedreport>
Below is the Python program to extract some of the fields from the .xml file under the "flaw" tag. But when I print the fields in the Python program, they are empty.
import pandas as pd
from lxml import etree

root = etree.parse(r'fps_change.xml')
xroot = root.getroot()

df_cols = ["categoryname", "issueid", "module"]
rows = []

for node in xroot:
    #s_name = node.attrib.get("name")
    s_categoryname = node.find("categoryname")
    s_issueid = node.find("issueid")
    s_module = node.find("module")
    rows.append({"categoryname": s_categoryname,
                 "issueid": s_issueid, "module": s_module})

out_df = pd.DataFrame(rows, columns=df_cols)
print(out_df)  # this prints empty.
Expected Output:
Stack-based Buffer Overflow 6225 Jep
What changes should I make in my program to get the expected output?
1 answer
-
answered 2022-04-26 05:45
onyambu
from bs4 import BeautifulSoup

html_obj = BeautifulSoup(string)
flaw = html_obj.find('flaw')

[flaw[key] for key in df_cols]
# ['Stack-based Buffer Overflow', '6225', 'Jep']

where string is the XML from the question:

string = '''
<detailedreport xmlns:xsi="http://"false">
  <severity level="5">
    <category categoryid="3" categoryname="Buffer Overflow" pcirelated="false">
      <cwe cweid="121" cwename="Stack-based Buffer Overflow" pcirelated="false" sans="120" certc="1160">
        <description>
          <text text="code."/>
        </description>
        <staticflaws>
          <flaw severity="5" categoryname="Stack-based Buffer Overflow" count="1" issueid="6225" module="Jep" type="strcpy" description="This call to strcpy() contains a buffer overflow. The source string has an allocated size of 80 bytes " note="" cweid="121" remediationeffort="2" exploitLevel="0" categoryid="3" pcirelated="false">
            <exploitability_adjustments>
              <exploitability_adjustment score_adjustment="0">
              </exploitability_adjustment>
            </exploitability_adjustments>
          </flaw>
        </staticflaws>
      </cwe>
    </category>
  </severity>
</detailedreport>'''
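For comparison, a sketch of how the original lxml code could be adjusted instead (an assumption about intent, not part of the answer above): the fields are attributes of the <flaw> element rather than child elements, so find() on the top-level children returns nothing; walking the tree for <flaw> tags and reading attrib gives the one-row DataFrame:

import pandas as pd
from lxml import etree

root = etree.parse(r'fps_change.xml')
xroot = root.getroot()

df_cols = ["categoryname", "issueid", "module"]

# iter("flaw") walks the whole tree; attrib.get() reads the attribute values
rows = [{col: node.attrib.get(col) for col in df_cols}
        for node in xroot.iter("flaw")]

out_df = pd.DataFrame(rows, columns=df_cols)
print(out_df)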
See also questions close to this topic
-
Python File Tagging System does not retrieve nested dictionaries in dictionary
I am building a file tagging system using Python. The idea is simple. Given a directory of files (and files within subdirectories), I want to filter them out using a filter input and tag those files with a word or a phrase.
If I have the following contents in my current directory:
data/
    budget.xls
    world_building_budget.txt
a.txt
b.exe
hello_world.dat
world_builder.spec
and I execute the following command in the shell:
py -3 tag_tool.py -filter=world -tag="World-Building Tool"
My output will be:
These files were tagged with "World-Building Tool":
data/world_building_budget.txt
hello_world.dat
world_builder.spec
My current output isn't exactly like this, but basically I am converting all files and files within subdirectories into a single dictionary like this:
def fs_tree_to_dict(path_):
    file_token = ''
    for root, dirs, files in os.walk(path_):
        tree = {d: fs_tree_to_dict(os.path.join(root, d)) for d in dirs}
        tree.update({f: file_token for f in files})
        return tree
Right now, my dictionary looks like this: key: ''. In the following function, I am turning the empty values '' into empty lists (to hold my tags):

def empty_str_to_list(d):
    for k, v in d.items():
        if v == '':
            d[k] = []
        elif isinstance(v, dict):
            empty_str_to_list(v)
When I run my entire code, this is my output:
hello_world.dat ['World-Building Tool']
world_builder.spec ['World-Building Tool']
But it does not see data/world_building_budget.txt. This is the full dictionary:

{'data': {'world_building_budget.txt': []}, 'a.txt': [], 'hello_world.dat': [], 'b.exe': [], 'world_builder.spec': []}
This is my full code:
import os, argparse

def fs_tree_to_dict(path_):
    file_token = ''
    for root, dirs, files in os.walk(path_):
        tree = {d: fs_tree_to_dict(os.path.join(root, d)) for d in dirs}
        tree.update({f: file_token for f in files})
        return tree

def empty_str_to_list(d):
    for k, v in d.items():
        if v == '':
            d[k] = []
        elif isinstance(v, dict):
            empty_str_to_list(v)

parser = argparse.ArgumentParser(description="Just an example",
                                 formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument("--filter", action="store", help="keyword to filter files")
parser.add_argument("--tag", action="store", help="a tag phrase to attach to a file")
parser.add_argument("--get_tagged", action="store", help="retrieve files matching an existing tag")
args = parser.parse_args()

filter = args.filter
tag = args.tag
get_tagged = args.get_tagged

current_dir = os.getcwd()
files_dict = fs_tree_to_dict(current_dir)
empty_str_to_list(files_dict)

for k, v in files_dict.items():
    if filter in k:
        if v == []:
            v.append(tag)
            print(k, v)
    elif isinstance(v, dict):
        empty_str_to_list(v)
        if get_tagged in v:
            print(k, v)
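One possible direction (a sketch, not the asker's code; tag_files, filter_, and tag_ are names introduced here for illustration): recurse into nested dictionaries when tagging, so that files under data/ are also checked against the filter:

def tag_files(d, filter_, tag_, prefix=""):
    # dict values are subdirectories, list values are the tag lists of files
    for k, v in d.items():
        path = prefix + k
        if isinstance(v, dict):
            tag_files(v, filter_, tag_, prefix=path + "/")
        elif filter_ in k:
            v.append(tag_)
            print(path, v)

# usage with the structures from the question (sketch):
# tag_files(files_dict, filter, tag)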
-
Actually, I am working on a project and it is showing "no module named pip_internal". Please help me with this.
File "C:\Users\pjain\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main return _run_code(code, main_globals, None, File "C:\Users\pjain\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code exec(code, run_globals) File "C:\Users\pjain\AppData\Local\Programs\Python\Python310\Scripts\pip.exe\__main__.py", line 4, in <module> File "C:\Users\pjain\AppData\Local\Programs\Python\Python310\lib\site-packages\pip\_internal\__init__.py", line 4, in <module> from pip_internal.utils import _log
I am using PyCharm with the conda interpreter.
-
Looping the function if the input is not a string
I'm new to Python (first of all). I have a homework assignment to write a function that checks whether an item exists in a dictionary or not.
inventory = {"apple" : 50, "orange" : 50, "pineapple" : 70, "strawberry" : 30} def check_item(): x = input("Enter the fruit's name: ") if not x.isalpha(): print("Error! You need to type the name of the fruit") elif x in inventory: print("Fruit found:", x) print("Inventory available:", inventory[x],"KG") else: print("Fruit not found") check_item()
I want the function to loop again only if the input written is not a valid string. I've tried adding return under print("Error! You need to type the name of the fruit") but it didn't work. Help!
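A minimal sketch of one way to do this (assuming the goal is simply to re-prompt until the input is alphabetic): wrap the input in a while loop instead of returning:

def check_item():
    x = input("Enter the fruit's name: ")
    # keep asking until the input is alphabetic
    while not x.isalpha():
        print("Error! You need to type the name of the fruit")
        x = input("Enter the fruit's name: ")
    if x in inventory:
        print("Fruit found:", x)
        print("Inventory available:", inventory[x], "KG")
    else:
        print("Fruit not found")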
-
How do I disable the Debian Python path/recursion limit
As of late, I've been having path length limit and recursion limit issues, so I really need to know how to disable these.
I can't even install modules like discord.py!!!!
-
TypeError: 'float' object cannot be interpreted as an integer on linspace
TypeError                                 Traceback (most recent call last)
d:\website\SpeechProcessForMachineLearning-master\SpeechProcessForMachineLearning-master\speech_process.ipynb Cell 15' in <cell line: 1>()
----> 1 plot_freq(signal, sample_rate)

d:\website\SpeechProcessForMachineLearning-master\SpeechProcessForMachineLearning-master\speech_process.ipynb Cell 10' in plot_freq(signal, sample_rate, fft_size)
      2 def plot_freq(signal, sample_rate, fft_size=512):
      3     xf = np.fft.rfft(signal, fft_size) / fft_size
----> 4     freq = np.linspace(0, sample_rate/2, fft_size/2 + 1)
      5     xfp = 20 * np.log10(np.clip(np.abs(xf), 1e-20, 1e100))
      6     plt.figure(figsize=(20, 5))

File <__array_function__ internals>:5, in linspace(*args, **kwargs)

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\numpy\core\function_base.py:120, in linspace(start, stop, num, endpoint, retstep, dtype, axis)
     23 @array_function_dispatch(_linspace_dispatcher)
     24 def linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None,
     25              axis=0):
     26     """
     27     Return evenly spaced numbers over a specified interval.
    (...)
    118
    119     """
--> 120     num = operator.index(num)
    121     if num < 0:
    122         raise ValueError("Number of samples, %s, must be non-negative." % num)

TypeError: 'float' object cannot be interpreted as an integer
What is the solution to this problem?
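A likely fix (a sketch): np.linspace expects an integer number of samples, and fft_size/2 produces a float in Python 3, so floor division (or an explicit int()) resolves the TypeError:

# use integer division so the third argument (num) is an int, as np.linspace requires
freq = np.linspace(0, sample_rate / 2, fft_size // 2 + 1)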
-
IndexError: list index out of range with api
all_currencies = currency_api('latest', 'currencies')  # {'eur': 'Euro', 'usd': 'United States dollar', ...}
all_currencies.pop('brl')

qtd_moedas = len(all_currencies)
texto = f'{qtd_moedas} Moedas encontradas\n\n'

moedas_importantes = ['usd', 'eur', 'gbp', 'chf', 'jpy', 'rub', 'aud', 'cad', 'ars']

while len(moedas_importantes) != 0:
    for codigo, moeda in all_currencies.items():
        if codigo == moedas_importantes[0]:
            cotacao, data = currency_api('latest', f'currencies/{codigo}/brl')['brl'], currency_api('latest', f'currencies/{codigo}/brl')['date']
            texto += f'{moeda} ({codigo.upper()}) = R$ {cotacao} [{data}]\n'
            moedas_importantes.remove(codigo)
        if len(moedas_importantes) == 0:
            break  # WITHOUT THIS LINE, GIVES ERROR
Why am I getting this error? The list actually runs out of elements, but the code only works with that if statement.
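One possible restructuring (a sketch; currency_api is the asker's own helper): iterating over moedas_importantes directly avoids indexing [0] after the list has been emptied, so neither the break nor the outer while loop is needed:

for codigo in moedas_importantes:
    if codigo in all_currencies:
        moeda = all_currencies[codigo]
        resp = currency_api('latest', f'currencies/{codigo}/brl')  # one call instead of two
        cotacao, data = resp['brl'], resp['date']
        texto += f'{moeda} ({codigo.upper()}) = R$ {cotacao} [{data}]\n'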
-
Any efficient way to compare two dataframes and append new entries in pandas?
I have new files which I want to add to a historical table. Before that, I need to check the new file against the historical table by comparing two columns in particular: one is state and the other is the date column. First, I need to find max(state, date), then check those entries with max(state, date) against the historical table; if they are not in the historical table, append them, otherwise do nothing. I tried to do this in pandas with a group-by on the new file and the historical table and a comparison, adding any new entries from the new file that are not in the historical data. Now I have issues appending the new values to the historical table correctly in pandas. Does anyone have quick thoughts?

My current attempt:
import pandas as pd

src_df = pd.read_csv("https://raw.githubusercontent.com/adamFlyn/test_rl/main/src_df.csv")
hist_df = pd.read_csv("https://raw.githubusercontent.com/adamFlyn/test_rl/main/historical_df.csv")

picked_rows = src_df.loc[src_df.groupby('state')['yyyy_mm'].idxmax()]
I want to check picked_rows against hist_df, comparing by the state and yyyy_mm columns, so I only add entries from picked_rows where state has the max value or the most recent dates. I created the desired output below. I tried an inner join and pandas.concat, but they are not giving me the correct output. Does anyone have any ideas on this?

Here is the desired output that I want to get:
import pandas as pd

desired_output = pd.read_csv("https://raw.githubusercontent.com/adamFlyn/test_rl/main/output_df.csv")
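One possible direction (a sketch, assuming "new" means the (state, yyyy_mm) pair is absent from hist_df): a left merge with indicator=True keeps only the rows of picked_rows with no match in the historical table, which can then be appended:

# keep only picked_rows whose (state, yyyy_mm) pair does not appear in hist_df
merged = picked_rows.merge(hist_df[['state', 'yyyy_mm']],
                           on=['state', 'yyyy_mm'],
                           how='left', indicator=True)
new_rows = merged.loc[merged['_merge'] == 'left_only'].drop(columns='_merge')

hist_df = pd.concat([hist_df, new_rows], ignore_index=True)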
-
How to bring data frame into single column from multiple columns in python
I have data in the multiple-column format below, and I want to bring all 4 pcp columns of data into a single column.
YEAR  Month  pcp1  pcp2  pcp3  pcp4
1984  1      0     0     0     0
1984  2      1.2   0     0     0
1984  3      0     0     0     0
1984  4      0     0     0     0
1984  5      0     0     0     0
1984  6      0     0     0     1.6
1984  7      3     3     9.2   3.2
1984  8      6.2   27.1  5.4   0
1984  9      0     0     0     0
1984  10     0     0     0     0
1984  11     0     0     0     0
1984  12     0     0     0     0
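A sketch of one way to do this with pandas (assuming the table above is a DataFrame named df): melt stacks the four pcp columns into a single value column while keeping YEAR and Month:

import pandas as pd

# stack pcp1..pcp4 into one 'pcp' column, one row per (YEAR, Month, source column)
long_df = df.melt(id_vars=['YEAR', 'Month'],
                  value_vars=['pcp1', 'pcp2', 'pcp3', 'pcp4'],
                  var_name='source', value_name='pcp')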
-
Exclude Japanese Stopwords from File
I am trying to remove Japanese stopwords from a text corpus from twitter. Unfortunately the frequently used nltk does not contain Japanese, so I had to figure out a different way.
This is my MWE:
import urllib
from urllib.request import urlopen
import MeCab
import re

# slothlib
slothlib_path = "http://svn.sourceforge.jp/svnroot/slothlib/CSharp/Version1/SlothLib/NLP/Filter/StopWord/word/Japanese.txt"
sloth_file = urllib.request.urlopen(slothlib_path)

# stopwordsiso
iso_path = "https://raw.githubusercontent.com/stopwords-iso/stopwords-ja/master/stopwords-ja.txt"
iso_file = urllib.request.urlopen(iso_path)
stopwords = [line.decode("utf-8").strip() for line in iso_file]
stopwords = [ss for ss in stopwords if not ss == u'']
stopwords = list(set(stopwords))

text = '日本語の自然言語処理は本当にしんどい、と彼は十回言った。'

tagger = MeCab.Tagger("-Owakati")
tok_text = tagger.parse(text)

ws = re.compile(" ")
words = [word for word in ws.split(tok_text)]
if words[-1] == u"\n":
    words = words[:-1]
ws = [w for w in words if w not in stopwords]

print(words)
print(ws)
This completes successfully: it outputs the original tokenized text as well as the version without stopwords.
['日本語', 'の', '自然', '言語', '処理', 'は', '本当に', 'しんどい', '、', 'と', '彼', 'は', '十', '回', '言っ', 'た', '。']
['日本語', '自然', '言語', '処理', '本当に', 'しんどい', '、', '十', '回', '言っ', '。']
There are still 2 issues I am facing though:
a) Is it possible to have 2 stopword lists regarded, namely iso_file and sloth_file, so that a word is removed if it is a stopword in either iso_file or sloth_file? (I tried to use line 14 as stopwords = [line.decode("utf-8").strip() for line in zip('iso_file','sloth_file')] but received an error because tuple attributes may not be decoded.)

b) The ultimate goal would be to generate a new text file in which all stopwords are removed.
I had created this MWE
### first clean twitter csv
import pandas as pd
import re
import emoji

df = pd.read_csv("input.csv")

def cleaner(tweet):
    tweet = re.sub(r"@[^\s]+", "", tweet)  # Remove @username
    tweet = re.sub(r"(?:\@|http?\://|https?\://|www)\S+|\\n", "", tweet)  # Remove http links & \n
    tweet = " ".join(tweet.split())
    tweet = ''.join(c for c in tweet if c not in emoji.UNICODE_EMOJI)  # Remove Emojis
    tweet = tweet.replace("#", "").replace("_", " ")  # Remove hashtag sign but keep the text
    return tweet

df['text'] = df['text'].map(lambda x: cleaner(x))
df['text'].to_csv(r'cleaned.txt', header=None, index=None, sep='\t', mode='a')

### remove stopwords
import urllib
from urllib.request import urlopen
import MeCab
import re

# slothlib
slothlib_path = "http://svn.sourceforge.jp/svnroot/slothlib/CSharp/Version1/SlothLib/NLP/Filter/StopWord/word/Japanese.txt"
sloth_file = urllib.request.urlopen(slothlib_path)

# stopwordsiso
iso_path = "https://raw.githubusercontent.com/stopwords-iso/stopwords-ja/master/stopwords-ja.txt"
iso_file = urllib.request.urlopen(iso_path)
stopwords = [line.decode("utf-8").strip() for line in iso_file]
stopwords = [ss for ss in stopwords if not ss == u'']
stopwords = list(set(stopwords))

with open("cleaned.txt", encoding='utf8') as f:
    cleanedlist = f.readlines()
cleanedlist = list(set(cleanedlist))

tagger = MeCab.Tagger("-Owakati")
tok_text = tagger.parse(cleanedlist)

ws = re.compile(" ")
words = [word for word in ws.split(tok_text)]
if words[-1] == u"\n":
    words = words[:-1]
ws = [w for w in words if w not in stopwords]

print(words)
print(ws)
While it works for the simple input text in the first MWE, for the MWE I just stated I get the error
in method 'Tagger_parse', argument 2 of type 'char const *'
Additional information:
Wrong number or type of arguments for overloaded function 'Tagger_parse'.
Possible C/C++ prototypes are:
    MeCab::Tagger::parse(MeCab::Lattice *) const
    MeCab::Tagger::parse(char const *)
for this line:
tok_text = tagger.parse(cleanedlist)
So I assume I will need to make amendments to the cleanedlist?

I have uploaded the cleaned.txt on github for reproducing the issue: [txt on github][1]
Also: how would I be able to get the tokenized list that excludes stopwords back into a text format like cleaned.txt? Would it be possible, for this purpose, to create a df of ws? Or might there even be a simpler way?
Sorry for the long request, I tried a lot and tried to make it as easy as possible to understand what I'm driving at :-)
Thank you very much! [1]: https://gist.github.com/yin-ori/1756f6236944e458fdbc4a4aa8f85a2c
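Two sketches that may help here (assumptions, not tested against the linked gist): for (a), the two downloaded lists can simply be merged into one set; and because MeCab's Tagger.parse() accepts a single string rather than a list, cleanedlist could be parsed line by line, which also gives a natural way to write the result back out for (b):

# (a) merge both stopword sources into one set
stopwords_iso = [line.decode("utf-8").strip() for line in iso_file]
stopwords_sloth = [line.decode("utf-8").strip() for line in sloth_file]
stopwords = {ss for ss in stopwords_iso + stopwords_sloth if ss != ''}

# parse each cleaned line separately, since Tagger.parse() takes a str, not a list
tagger = MeCab.Tagger("-Owakati")
filtered_lines = []
for line in cleanedlist:
    words = tagger.parse(line).split()
    filtered_lines.append(" ".join(w for w in words if w not in stopwords))

# (b) write the stopword-free text to a new file
with open("cleaned_no_stopwords.txt", "w", encoding="utf8") as f:
    f.write("\n".join(filtered_lines))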
-
separate datetime column in R while keeping time accurate
4/12/2016 12:00:00 AM

I have dates in the format above and have tried to use separate() to create two columns in the data frame where the data is present. When I do, the columns are created, but AM/PM is lost, so the times just become numbers or, worse, appear as "12H 0M 0S". Can anyone help me out? I'm pretty new to data analysis as a whole, so any help would be much appreciated!
-
How do I implement rank function for nearest values for a column in dataframe?
df.head():

   run_time                    match_datetime  country         league           home_team             away_team
0  2021-08-07 00:04:36.326391  2021-08-06      Russia          FNL 2 - Group 2  Yenisey 2             Lokomotiv-Kazanka
1  2021-08-07 00:04:36.326391  2021-08-07      Russia          Youth League     Ural U19              Krylya Sovetov Samara U19
2  2021-08-07 00:04:36.326391  2021-08-08      World           Club Friendly    Alaves                Al Nasr
3  2021-08-07 00:04:36.326391  2021-08-09      China           Jia League       Chengdu Rongcheng     Shenyang Urban FC
4  2021-08-06 00:04:36.326391  2021-08-06      China           Super League     Wuhan FC              Tianjin Jinmen Tiger
5  2021-08-06 00:04:36.326391  2021-08-07      Czech Republic  U19 League       Sigma Olomouc U19     Karvina U19
6  2021-08-06 00:04:36.326391  2021-08-08      Russia          Youth League     Konoplev Academy U19  Rubin Kazan U19
7  2021-08-06 00:04:36.326391  2021-08-09      World           Club Friendly    Real Sociedad         Eibar
desired df
   run_time                    match_datetime  country         league           home_team          away_team
0  2021-08-07 00:04:36.326391  2021-08-06      Russia          FNL 2 - Group 2  Yenisey 2          Lokomotiv-Kazanka
1  2021-08-07 00:04:36.326391  2021-08-07      Russia          Youth League     Ural U19           Krylya Sovetov Samara U19
4  2021-08-06 00:04:36.326391  2021-08-06      China           Super League     Wuhan FC           Tianjin Jinmen Tiger
5  2021-08-06 00:04:36.326391  2021-08-07      Czech Republic  U19 League       Sigma Olomouc U19  Karvina U19
How do I use the rank function to filter only the 2 nearest match_datetime dates for every run_time value? I.e. the desired dataframe will be a filtered dataframe that has the nearest 2 match_datetime values for every run_time.
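One possible sketch (assuming the DataFrame is named df and both columns can be parsed as datetimes): rank the absolute distance between match_datetime and run_time within each run_time group and keep the two smallest:

import pandas as pd

df['run_time'] = pd.to_datetime(df['run_time'])
df['match_datetime'] = pd.to_datetime(df['match_datetime'])

# distance of each match to its run_time, ranked within each run_time group
dist = (df['match_datetime'] - df['run_time']).abs()
df['rank'] = dist.groupby(df['run_time']).rank(method='first')

result = df[df['rank'] <= 2].drop(columns='rank')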
-
For loop can't generate xml elements with varying contents
I was trying to make a ProPresenter 6 Presentation file (or .pro6) generator from either a short or long string input. For some reason, the for loop did not do its job correctly. The for loop should generate another XML tag named "RVDisplaySlide" with the same tags and contents parsed from a template .pro6 file, then replace the contents with the string input.

For a short string input, it works as expected and generates one "slide" tag. However, for long string inputs, which get split into a list of "fit-able" strings for the textbox element, it generated the same tags WITH THE SAME content as the first one did.
The full code can be found in this hastebin link: https://www.toptal.com/developers/hastebin/hidefogizi.py
For now, to simplify how the problem looks, I commented out the code block that would generate the content and left only the code that should assign different uuid values to the elements. The output is still similar.
Here's an example of what I meant:
>>> a = ToPro6("""[Connection Terminated] I'm sorry to interrupt you, Elizabeth, if you still even remember that name, but I'm afraid you've been misinformed. You are not here to receive a gift, nor have you been called here by the individual you assume, although, you have indeed been called. You have all been called here, into a labyrinth of sounds and smells, misdirection and misfortune. A labyrinth with no exit, a maze with no prize. You don't even realize that you are trapped. Your lust for blood has driven you in endless circles, chasing the cries of children in some unseen chamber, always seeming so near, yet somehow out of reach, but you will never find them. None of you will. This is where your story ends. And to you, my brave volunteer, who somehow found this job listing not intended for you, although there was a way out planned for you, I have a feeling that's not what you want. I have a feeling that you are right where you want to be. I am remaining as well. I am nearby. This place will not be remembered, and the memory of everything that started this can finally begin to fade away, as the agony of every tragedy should. And to you monsters trapped in the corridors, be still and give up your spirits. They don't belong to you. For most of you, I believe there is peace and perhaps more waiting for you after the smoke clears. Although, for one of you, the darkest pit of Hell has opened to swallow you whole, so don't keep the devil waiting, old friend. My daughter, if you can hear me, I knew you would return as well. It's in your nature to protect the innocent. I'm sorry that on that day, the day you were shut out and left to die, no one was there to lift you up into their arms the way you lifted others into yours, and then, what became of you. I should have known you wouldn't be content to disappear, not my daughter. I couldn't save you then, so let me save you now. It's time to rest - for you, and for those you have carried in your arms. This ends for all of us. [End Communication]""", "fnaf6_speech") >>> a.save("..") <Element 'array' at 0x0000018847E9BA10> 'Succeed'
XML output:
...
<RVSlideGrouping name="" color="1 1 1 0" uuid="709FF810-7A39-46AD-8A4E-03E592A0AFB1">
  <array rvXMLIvarName="slides">
    <RVDisplaySlide backgroundColor="0 0 0 1" highlightColor="" drawingBackgroundColor="false" enabled="true" hotKey="" label="" notes="" UUID="0380C4D4-FD26-4768-9612-28AEE0C05894" chordChartPath="" />
    <RVDisplaySlide backgroundColor="0 0 0 1" highlightColor="" drawingBackgroundColor="false" enabled="true" hotKey="" label="" notes="" UUID="0380C4D4-FD26-4768-9612-28AEE0C05894" chordChartPath="" />
    <RVDisplaySlide backgroundColor="0 0 0 1" highlightColor="" drawingBackgroundColor="false" enabled="true" hotKey="" label="" notes="" UUID="0380C4D4-FD26-4768-9612-28AEE0C05894" chordChartPath="" />
    <RVDisplaySlide backgroundColor="0 0 0 1" highlightColor="" drawingBackgroundColor="false" enabled="true" hotKey="" label="" notes="" UUID="0380C4D4-FD26-4768-9612-28AEE0C05894" chordChartPath="" />
    <RVDisplaySlide backgroundColor="0 0 0 1" highlightColor="" drawingBackgroundColor="false" enabled="true" hotKey="" label="" notes="" UUID="0380C4D4-FD26-4768-9612-28AEE0C05894" chordChartPath="" />
    <RVDisplaySlide backgroundColor="0 0 0 1" highlightColor="" drawingBackgroundColor="false" enabled="true" hotKey="" label="" notes="" UUID="0380C4D4-FD26-4768-9612-28AEE0C05894" chordChartPath="" />
    <RVDisplaySlide backgroundColor="0 0 0 1" highlightColor="" drawingBackgroundColor="false" enabled="true" hotKey="" label="" notes="" UUID="0380C4D4-FD26-4768-9612-28AEE0C05894" chordChartPath="" />
  </array>
</RVSlideGrouping>
...
As you can see, it generated the same uuid for each element. Is there a way to fix this?
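Without seeing the full generator code, a common cause of this is appending the same template element object repeatedly instead of copying it, so every slide ends up sharing one UUID. A sketch of the usual pattern (template_slide, text_chunks, and slides_array are hypothetical names standing in for the objects in the linked code):

import copy
import uuid

for chunk in text_chunks:
    slide = copy.deepcopy(template_slide)          # copy the parsed template element
    slide.set("UUID", str(uuid.uuid4()).upper())   # give each copy its own UUID
    # ...replace the copy's text content with `chunk` here...
    slides_array.append(slide)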
-
Adding a subelement iteratively with lxml
I want to add subelements iteratively, e.g. from a list, at a specific position of an XML path using XPath and a tag. Since in my original code I have many subnodes, I don't want to use other functions such as etree.Element or tree.SubElement.
For my provided example, I tried:
from lxml import etree

tree = etree.parse('example.xml')
root = tree.getroot()

new_subelements = ['<year></year>', '<trackNumber></trackNumber>']

destination = root.xpath('/interpretLibrary/interprets/interpretNames/interpret[2]/information')[2]

for element in new_subelements:
    add_element = etree.fromstring(element)
    destination.insert(-1, add_element)
But this doesn't work.
The initial example.xml file:
<interpretLibrary>
  <interprets>
    <interpretNames>
      <interpret>
        <information>
          <name>Queen</name>
          <album></album>
        </information>
      <interpret>
      <interpret>
        <information>
          <name>Michael Jackson</name>
          <album></album>
        </information>
      <interpret>
      <interpret>
        <information>
          <name>U2</name>
          <album></album>
        </information>
      </interpret>
    </interpretNames>
  </interprets>
</interpretLibrary>
The output example.xml I want to produce:
<interpretLibrary>
  <interprets>
    <interpretNames>
      <interpret>
        <information>
          <name>Queen</name>
          <album></album>
        </information>
      <interpret>
      <interpret>
        <information>
          <name>Michael Jackson</name>
          <album></album>
        </information>
      <interpret>
      <interpret>
        <information>
          <name>U2</name>
          <album></album>
          <year></year>
          <trackNumber></trackNumber>
        </information>
      <interpret>
    </interpretNames>
  </interprets>
</interpretLibrary>
Is there any better solution?
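A sketch of one adjustment (assuming the target is the third <interpret>'s <information> element, i.e. the U2 entry, and that each <interpret> in the real file is properly closed so etree.parse succeeds): the XPath already narrows the selection to a single element, so the list index should be [0], and append() adds the new children at the end, matching the desired output:

from lxml import etree

tree = etree.parse('example.xml')
root = tree.getroot()

# the XPath selects one <information> element; take the first (and only) match
destination = root.xpath('/interpretLibrary/interprets/interpretNames/interpret[3]/information')[0]

for element in ['<year></year>', '<trackNumber></trackNumber>']:
    destination.append(etree.fromstring(element))

tree.write('example.xml', pretty_print=True)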