My XPath with double slash always selects the same element from HtmlAgilityPack result
I'm trying to scrape the list of songs for a given date from the Billboard website. After obtaining a list of divs, one per song, I try to get the title and the artist from each div, but my foreach loop always returns the same values; it's as if the loop never moves on to the next div.
using HtmlAgilityPack;
using System;
using System.Collections.Generic;

namespace Billboard_Scraping
{
    internal class Program
    {
        static void Main(string[] args)
        {
            var billBoardMusic = GetMusicList("https://www.billboard.com/charts/hot-100/2000-08-12");
        }

        static List<Music> GetMusicList(string url)
        {
            var musics = new List<Music>();
            HtmlWeb web = new HtmlWeb();
            HtmlDocument document = web.Load(url);
            HtmlNodeCollection linknode = document.DocumentNode.SelectNodes("//div[contains(@class,\"o-chart-results-list-row-container\")]");
            foreach (var link in linknode)
            {
                var music = new Music();
                var titleXPath = "//h3[contains(@class,\"c-title\")]";
                var artistXPath = "//span[contains(@class,\"c-label a-no-trucate\")]";
                music.Title = link.SelectSingleNode(titleXPath).InnerText.Trim();
                music.Autor = link.SelectSingleNode(artistXPath).InnerText.Trim();
                musics.Add(music);
            }
            return musics;
        }
    }

    public class Music
    {
        public string Title { get; set; }
        public string Autor { get; set; }
    }
}
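For anyone comparing notes: the usual cause of this symptom is that an XPath expression starting with // searches from the document root even when SelectSingleNode is called on a child node, so every iteration returns the first match in the whole page. Making the expressions relative with a leading dot confines the search to the current div. A minimal sketch of the adjusted loop body:

foreach (var link in linknode)
{
    var music = new Music();
    // ".//" searches within the current row container; "//" restarts from the document root.
    var titleXPath = ".//h3[contains(@class,\"c-title\")]";
    var artistXPath = ".//span[contains(@class,\"c-label a-no-trucate\")]";
    music.Title = link.SelectSingleNode(titleXPath)?.InnerText.Trim();
    music.Autor = link.SelectSingleNode(artistXPath)?.InnerText.Trim();
    musics.Add(music);
}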
See also questions close to this topic
-
C# - Adding condition to func results in stack overflow exception
I have a Func as part of a specification class which sorts the given IQueryable:
Func<IQueryable<T>, IOrderedQueryable<T>>? Sort { get; set; }
When I add more than one condition to the Func, like below, it results in a stack overflow exception.
spec.OrderBy(sc => sc.Case.EndTime).OrderBy(sc => sc.Case.StartTime);
The OrderBy method is implemented like this
public ISpecification<T> OrderBy<TProperty>(Expression<Func<T, TProperty>> property)
{
    _ = Sort == null
        ? Sort = items => items.OrderBy(property)
        : Sort = items => Sort(items).ThenBy(property);
    return this;
}
Chaining or using separate lines doesn't make a difference.
The problem goes away if I assign a new instance of the specification and set its Func, but I don't want to create a new instance every time. Please suggest what I am missing here and how to reuse the same instance (if possible).
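For comparison, the stack overflow most likely comes from the second lambda capturing the Sort property itself rather than its previous value: after the assignment, Sort(items) ends up calling the very lambda being defined. Capturing the old delegate in a local variable first breaks the self-reference; a sketch of that idea:

public ISpecification<T> OrderBy<TProperty>(Expression<Func<T, TProperty>> property)
{
    // Capture the current delegate so the new lambda closes over the old value,
    // not over the Sort property (which would point back at the new lambda).
    var previous = Sort;
    Sort = previous == null
        ? items => items.OrderBy(property)
        : items => previous(items).ThenBy(property);
    return this;
}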
-
How to project fields for a dictionary (C#, MongoDB)
I am trying my luck here. I have a model like the following:
public class RowData : BaseBsonDefinition
{
    .

    [BsonExtraElements]
    [BsonDictionaryOptions(DictionaryRepresentation.ArrayOfDocuments)]
    public Dictionary<string, object> Rows { get; set; } = new(StringComparer.OrdinalIgnoreCase);

    .
}
As a result, the document in MongoDB looks like:
{ "_id": { "$binary": { "base64": "HiuI1sgyT0OZmcgGUit2dw==", "subType": "03" } }, "c1": "AAA", "c8": "Fully Vac", "c10": "", }
Those c1, c8 and c10 fields are keys from the dictionary; my question is how to dynamically project those fields.
I tried
Builders<RowData>.Projection.Exclude(p => "c1")
It seems the MongoDB driver cannot handle a plain value in the expression.
Could anyone point me in the right direction?
Thanks,
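One direction worth trying, sketched under the assumption that the dictionary keys are serialized as top-level fields (which [BsonExtraElements] does): the Exclude overload that takes a field name as a string sidesteps the expression requirement, so the keys can be supplied at runtime.

// Requires MongoDB.Driver, System.Collections.Generic, System.Linq.
// keysToExclude is a hypothetical runtime list of dictionary keys.
var keysToExclude = new List<string> { "c1", "c8" };

// FieldDefinition<RowData> converts implicitly from string, so plain
// field names work where a member-access expression cannot.
ProjectionDefinition<RowData> projection = Builders<RowData>.Projection.Exclude(keysToExclude[0]);
foreach (var key in keysToExclude.Skip(1))
{
    projection = projection.Exclude(key);
}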
-
How do I add a new DataSource to an already data-bound CheckBoxList
I'm building a web form that shows database items (tables, rows, foreign keys, ...). I have a CheckBoxList of tables (chkListTable) which shows a new CheckBoxList of rows (chkListRow) every time SelectedIndexChanged fires on chkListTable. The problem is that I can show the items when one item of chkListTable is selected, but I don't know how to show chkListRow when multiple items of chkListTable are selected. Here is my code:
aspx:

<div>
    <asp:Label ID="Label2" runat="server" Text="Table: "></asp:Label>
    <asp:CheckBoxList ID="chkListTable" runat="server" DataTextField="name" DataValueFeild="name"
        AutoPostBack="true" OnSelectedIndexChanged="chkListTable_SelectedIndexChanged">
    </asp:CheckBoxList>
</div>
<div>
    <asp:CheckBoxList ID="chkListRow" runat="server" DataTextField="COLUMN_NAME" DataValueField="COLUMN_NAME"
        RepeatDirection="Horizontal">
    </asp:CheckBoxList>
</div>
aspx.cs:

protected void chkListTable_SelectedIndexChanged(object sender, EventArgs e)
{
    tableName.Clear();
    foreach (ListItem item in chkListTable.Items)
    {
        if (item.Selected)
        {
            tableName.Add(item.Text.Trim());
        }
    }
    for (int i = 0; i < tableName.Count; i++)
    {
        String query = "USE " + dbname +
            " SELECT * FROM information_schema.columns" +
            " WHERE table_name = '" + tableName[i] + "'" +
            " AND COLUMN_NAME != 'rowguid'";
        chkListRow.DataSource = Program.ExecSqlDataReader(query);
        chkListRow.DataBind();
        Program.conn.Close();
    }
}
Program.cs:

public static bool Connect()
{
    if (Program.conn != null && Program.conn.State == ConnectionState.Open)
        Program.conn.Close();
    try
    {
        Program.conn.ConnectionString = Program.constr;
        Program.conn.Open();
        return true;
    }
    catch (Exception e)
    {
        return false;
    }
}

public static SqlDataReader ExecSqlDataReader(String query)
{
    SqlDataReader myreader;
    SqlCommand sqlcmd = new SqlCommand(query, Program.conn);
    sqlcmd.CommandType = CommandType.Text;
    if (Program.conn.State == ConnectionState.Closed)
        Program.conn.Open();
    try
    {
        myreader = sqlcmd.ExecuteReader();
        return myreader;
        myreader.Close();
    }
    catch (SqlException ex)
    {
        Program.conn.Close();
        return null;
    }
}
I want my display to be like this:
[x]Table1 [x]Table2 [ ]Table3
[ ]Row1(Table1) [ ]Row2(Table1) [ ]Row3(Table1) [ ]Row1(Table2) [ ]Row2(Table2)
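A hedged observation on the code above: assigning DataSource and calling DataBind() inside the for loop replaces chkListRow's contents on every pass, so only the last selected table's columns survive. One sketch of a fix is to collect all column names first and bind once (this assumes DataTextField/DataValueField are removed from chkListRow's markup, since plain strings are bound):

protected void chkListTable_SelectedIndexChanged(object sender, EventArgs e)
{
    var allColumns = new List<string>();
    foreach (ListItem item in chkListTable.Items)
    {
        if (!item.Selected) continue;
        string table = item.Text.Trim();
        // NOTE: a parameterized query would be safer than string concatenation.
        string query = "USE " + dbname + " SELECT * FROM information_schema.columns" +
                       " WHERE table_name = '" + table + "'" +
                       " AND COLUMN_NAME != 'rowguid'";
        using (SqlDataReader reader = Program.ExecSqlDataReader(query))
        {
            while (reader.Read())
            {
                // Label each column with its table, matching the desired display.
                allColumns.Add(reader["COLUMN_NAME"] + " (" + table + ")");
            }
        }
        Program.conn.Close();
    }
    chkListRow.DataSource = allColumns;
    chkListRow.DataBind();
}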
-
ValueError: All arrays must be of the same length when scraping
I am trying to input different zip codes and scrape information for Target products. However, it results in this error: ValueError: All arrays must be of the same length, and there is nothing in my CSV file. I guess this is because I did not successfully scrape all the information. Can anyone give me some suggestions on how to improve the code? I appreciate any help. Thanks.
Following is my code:
# Imports implied by the snippet:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
from datetime import datetime
from time import sleep
import concurrent.futures
import pandas as pd
import pytz

#Target Url list
urlList = [
    'https://www.target.com/p/pataday-once-daily-relief-extra-strength-drops-0-085-fl-oz/-/A-83775159?preselect=81887758#lnk=sametab',
    'https://www.target.com/p/kleenex-ultra-soft-facial-tissue/-/A-84780536?preselect=12964744#lnk=sametab',
    'https://www.target.com/p/claritin-24-hour-non-drowsy-allergy-relief-tablets-loratadine/-/A-80354268?preselect=14351285#lnk=sametab',
    'https://www.target.com/p/opti-free-pure-moist-rewetting-drops-0-4-fl-oz/-/A-14358641#lnk=sametab',
    'https://www.target.com/p/allegra-24-hour-allergy-relief-tablets-fexofenadine-hydrochloride/-/A-15068699?preselect=14042732#lnk=sametab',
    'https://www.target.com/p/nasacort-allergy-relief-spray-triamcinolone-acetonide/-/A-15143450?preselect=15503329#lnk=sametab',
    'https://www.target.com/p/genexa-dextromethorphan-kids-39-cough-and-chest-congestion-suppressant-4-fl-oz/-/A-80130848#lnk=sametab',
    'https://www.target.com/p/zyrtec-24-hour-allergy-relief-tablets-cetirizine-hcl/-/A-15075280?preselect=79847258#lnk=sametab',
    'https://www.target.com/p/pataday-twice-daily-eye-allergy-itch-and-redness-relief-drops-0-17-fl-oz/-/A-78780978#lnk=sametab',
    'https://www.target.com/p/systane-gel-drops-lubricant-eye-gel-0-33-fl-oz/-/A-14523072#lnk=sametab']

zipCodeList = [3911,4075,4467,96970,96960,49220,49221,49224,48001,49227,48101,48002,48003,48004]

while(True):
    priceArray = []
    nameArray = []
    zipCodeArray = []
    GMTArray = []
    TCIN = []
    UPC = []

    def ScrapingTarget(url):
        wait_imp = 10
        CO = webdriver.ChromeOptions()
        CO.add_experimental_option('useAutomationExtension', False)
        CO.add_argument('--ignore-certificate-errors')
        CO.add_argument('--start-maximized')
        wd = webdriver.Chrome(r'D:\chromedriver\chromedriver_win32new\chromedriver_win32 (2)\chromedriver.exe', options=CO)
        wd.get(url)
        wd.implicitly_wait(wait_imp)

        # needed to click onto the "Show more" to get the tcin and upc
        xpath = '//*[@id="tabContent-tab-Details"]/div/button'
        element_present = EC.presence_of_element_located((By.XPATH, xpath))
        WebDriverWait(wd, 5).until(element_present)
        showMore = wd.find_element(by=By.XPATH, value=xpath)
        sleep(3)
        showMore.click()

        # showMore = wd.find_element(by=By.XPATH, value="//*[@id='tabContent-tab-Details']/div/button")
        # sleep(2)
        # showMore.click()

        soup = BeautifulSoup(wd.page_source, 'html.parser')

        try:
            # gets a list of all elements under "Specifications"
            div = soup.find("div", {"class": "styles__StyledCol-sc-ct8kx6-0 iKGdHS h-padding-h-tight"})
            list = div.find_all("div")
            for a in range(len(list)):
                list[a] = list[a].text
            # locates the elements in the list
            tcin = [v for v in list if v.startswith("TCIN")]
            upc = [v for v in list if v.startswith("UPC")]
        except:
            tcin = "Error"
            upc = "Error"
        TCIN.append(tcin)
        UPC.append(upc)

        for zipcode in zipCodeList:
            try:
                # click the delivery address
                address = wd.find_element(by=By.XPATH, value="//*[@id='pageBodyContainer']/div[1]/div[2]/div[2]/div/div[4]/div/div[1]/button[2]")
                address.click()
                # click the Edit location
                editLocation = wd.find_element(by=By.XPATH, value="//*[@id='pageBodyContainer']/div[1]/div[2]/div[2]/div/div[4]/div/div[2]/button")
                editLocation.click()
            except:
                # directly click the Edit location
                editLocation = wd.find_element(by=By.XPATH, value="//*[@id='pageBodyContainer']/div[1]/div[2]/div[2]/div/div[4]/div[1]/div/div[1]/button")
                editLocation.click()

            # input ZipCode
            inputZipCode = wd.find_element(by=By.XPATH, value="//*[@id='enter-zip-or-city-state']")
            inputZipCode.clear()
            inputZipCode.send_keys(zipcode)

            # click submit
            clickSubmit = wd.find_element(by=By.XPATH, value="//*[@id='pageBodyContainer']/div[1]/div[2]/div[2]/div/div[4]/div/div[2]/div/div/div[3]/div/button[1]")
            clickSubmit.click()

            # start scraping
            name = wd.find_element(by=By.XPATH, value="//*[@id='pageBodyContainer']/div[1]/div[1]/h1/span").text
            nameArray.append(name)
            price = wd.find_element(by=By.XPATH, value="//*[@id='pageBodyContainer']/div[1]/div[2]/div[2]/div/div[1]/div[1]/span").text
            priceArray.append(price)
            currentZipCode = zipcode
            zipCodeArray.append(currentZipCode)
            tz = pytz.timezone('Europe/London')
            GMT = datetime.now(tz)
            GMTArray.append(GMT)

    with concurrent.futures.ThreadPoolExecutor() as executor:
        executor.map(ScrapingTarget, urlList)

    data = {'prod-name': nameArray, 'Price': priceArray, 'currentZipCode': zipCodeArray, "Tcin": TCIN, "UPC": UPC, "GMT": GMTArray}
    df = pd.DataFrame(data, columns=['prod-name', 'Price', 'currentZipCode', "Tcin", "UPC", "GMT"])
    df.to_csv(r'C:\Users\12987\PycharmProjects\python\Network\priceingAlgoriCoding\export_Target_dataframe.csv', mode='a', index=False, header=True)
    sleep(20)
-
Scraping .aspx page with Python yields 404
I'm a web-scraping beginner and am trying to scrape this webpage: https://profiles.doe.mass.edu/statereport/ap.aspx
I'd like to be able to put in some settings at the top (like District, 2020-2021, Computer Science A, Female) and then download the resulting data for those settings.
Here's the code I'm currently using:
import requests
from bs4 import BeautifulSoup

url = 'https://profiles.doe.mass.edu/statereport/ap.aspx'

with requests.Session() as s:
    s.headers['User-Agent'] = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:100.0) Gecko/20100101 Firefox/100.0"
    r = s.get('https://profiles.doe.mass.edu/statereport/ap.aspx')
    soup = BeautifulSoup(r.text, "lxml")
    data = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    data["ctl00$ContentPlaceHolder1$ddReportType"] = "DISTRICT",
    data["ctl00$ContentPlaceHolder1$ddYear"] = "2021",
    data["ctl00$ContentPlaceHolder1$ddSubject"] = "COMSCA",
    data["ctl00$ContentPlaceHolder1$ddStudentGroup"] = "F",
    p = s.post(url, data=data)
When I print out p.text, I get a page whose title is '\t404 - Page Not Found\r\n' and whose message is:

<h2>We are unable to locate information at: <br /><br />http://profiles.doe.mass.edu:80/statereport/ap.aspxp?ASP.NET_SessionId=bxfgao54wru50zl5tkmfml00</h2>
Here's what data looks like before I modify it:

{'__EVENTVALIDATION': '/wEdAFXz4796FFICjJ1Xc5ZOd9SwSHUlrrW+2y3gXxnnQf/b23Vhtt4oQyaVxTPpLLu5SKjKYgCipfSrKpW6jkHllWSEpW6/zTHqyc3IGH3Y0p/oA6xdsl0Dt4O8D2I0RxEvXEWFWVOnvCipZArmSoAj/6Nog6zUh+Jhjqd1LNep6GtJczTu236xw2xaJFSzyG+xo1ygDunu7BCYVmh+LuKcW56TG5L0jGOqySgRaEMolHMgR0Wo68k/uWImXPWE+YrUtgDXkgqzsktuw0QHVZv7mSDJ31NaBb64Fs9ARJ5Argo+FxJW/LIaGGeAYoDphL88oao07IP77wrmH6t1R4d88C8ImDHG9DY3sCDemvzhV+wJcnU4a5qVvRziPyzqDWnj3tqRclGoSw0VvVK9w+C3/577Gx5gqF21UsZuYzfP4emcqvJ7ckTiBk7CpZkjUjM6Z9XchlxNjWi1LkzyZ8QMP0MaNCP4CVYJfndopwFzJC7kI3W106YIA/xglzXrSdmq6/MDUCczeqIsmRQGyTOkQFH724RllsbZyHoPHYvoSAJilrMQf6BUERVN4ojysx3fz5qZhZE7DWaJAC882mXz4mEtcevFrLwuVPD7iB2v2mlWoK0S5Chw4WavlmHC+9BRhT36jtBzSPRROlXuc6P9YehFJOmpQXqlVil7C9OylT4Kz5tYzrX9JVWEpeWULgo9Evm+ipJZOKY2YnC41xTK/MbZFxsIxqwHA3IuS10Q5laFojoB+e+FDCqazV9MvcHllsPv2TK3N1oNHA8ODKnEABoLdRgumrTLDF8Lh+k+Y4EROoHhBaO3aMppAI52v3ajRcCFET22jbEm/5+P2TG2dhPhYgtZ8M/e/AoXht29ixVQ1ReO/6bhLIM+i48RTmcl76n1mNjfimB8r3irXQGYIEqCkXlUHZ/SNlRYyx3obJ6E/eljlPveWNidFHOaj+FznOh264qDkMm7fF78WBO2v0x+or1WGijWDdQtRy9WRKXchYxUchmBlYm15YbBfMrIB7+77NJV+M6uIVVnCyiDRGj+oPXcTYxqSUCLrOMQyzYKJeu8/hWD0gOdKeoYUdUUJq4idIk+bLYy76sI/N2aK+aXZo/JPQ+23gTHzIlyi4Io7O6kXaULPs8rfo8hpkH1qXyKb/rP2VJBNWgyp8jOMx9px+m4/e2Iecd86E4eN4Rk6OIiwqGp+dMdgntXu5ruRHb1awPlVmDw92dL1P0b0XxJW7EGfMzyssMDhs1VT6K6iMUTHbuXkNGaEG1dP1h4ktnCwGqDLVutU6UuzT6i4nfqnvFjGK9+7Ze8qWIl8SYyhmvzmgpLjdMuF9CYMQ2Aa79HXLKFACsSSm0dyiU1/ZGyII2Fvga9o+nVV1jZam3LkcAPaXEKwEyJXfN/DA7P4nFAaQ+QP+2bSgrcw+/dw+86OhPyG88qyJwqZODEXE1WB5zSOUywGb1/Xed7wq9WoRs6v8rAK5c/2iH7YLiJ4mUVDo+7WCKrzO5+Hsyah3frMKbheY1acRmSVUzRgCnTx7jvcLGR9Jbt6TredqZaWZBrDFcntdg7EHd7imK5PqjUld3iCVjdyO+yLKUkMKiFD85G3vEferg/Q/TtfVBqeTU0ohP9d+CsKOmV/dxVYWEtBcfa9KiN6j4N8pP7+3iUOhajojZ8jV98kxT0zPZlzkpqI4SwR6Ys8d2RjIi5K+oQul4pL5u+zZvX0lsLP9Jl7FeVTfBvST67T6ohz8dl9gBfmmbwnT23SyuFSUGd6ZGaKE+9kKYmuImW7w3ePs7C70yDWHpIpxP/IJ4GHb36LWto2g3Ld3goCQ4fXPu7C4iTiN6b5WUSlJJsWGF4eQkJue8=',
 '__VIEWSTATE': '/wEPDwUKLTM0NzY4OTQ4NmRkDwwPzTpuna+yxVhQxpRF4n2+zYKQtotwRPqzuCkRvyU=',
 '__VIEWSTATEGENERATOR': '2B6F8D71',
 'ctl00$ContentPlaceHolder1$btnViewReport': 'View Report',
 'ctl00$ContentPlaceHolder1$hfExport': 'ViewReport',
 'leftNavId': '11241',
 'quickSearchValue': '',
 'runQuickSearch': 'Y',
 'searchType': 'QUICK',
 'searchtext': ''}
Following suggestions from similar questions, I've tried playing around with the parameters, editing data in various ways (to emulate the POST request that I see in my browser when I navigate the site myself), and specifying an ASP.NET_SessionId, but to no avail. How can I access the information from this website?
-
Get specific information from wikipedia on google spreadsheet (not the entire table)
I have a table of "Lead rolling actors" from Wikipedia and I want to add some columns to the table with the date of birth, years active, etc. for every actor.
It's the first time I've used the IMPORTXML formula, but for Robert Downey Jr I am trying the following:
- Born: =IMPORTXML(G1!,"//span[@class='bday']"), which should return <span class="bday">1965-04-04</span>
- Years Active: =IMPORTXML(G1!,"//td[@class='infobox-data']"), which should return <td class="infobox-data">1970–present</td>
In both cases it gives me errors. What am I doing wrong? I looked at https://www.benlcollins.com/spreadsheets/google-sheet-web-scraper/ for guidance but I can't find my error.
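A guess at the immediate problem: IMPORTXML expects a URL (or a cell reference holding one) as its first argument, and G1! is not valid reference syntax in Google Sheets. Assuming cell G1 contains the article URL, the formula would look like:

=IMPORTXML(G1, "//span[@class='bday']")

The second query may also return several matches, since many infobox rows share the infobox-data class; narrowing the XPath (for example via the preceding row label) would pick out just the years active.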
-
Library that supports XPath 2.0 in Python
Is it possible to use XPath 2.0 functions like starts-with(), ends-with() and contains() in Python? I was trying to use lxml and defusedxml, but unfortunately they do not support any of these functions.
I know I can use substring() or matches() as a workaround, but I have a really complicated case, so it would be nicer to deal with more readable functions.
Any lib that supports the XPath 2.0 spec?
-
XPath using random.randint(2,8) always identifies the first item using Python Selenium
I'm working on a random question picker for a CodeChef webpage, but the problem is that even when I use a random value of i, it always clicks the first question.
Code:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
from selenium.webdriver.support.ui import Select
import time
import random

PATH = "C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(PATH)
driver.get("https://www.codechef.com/practice?page=0&limit=20&sort_by=difficulty_rating&sort_order=asc&search=&start_rating=0&end_rating=999&topic=&tags=&group=all")
driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
time.sleep(3)

# element = driver.find_element_by_link_text("Roller Coaster")
i = random.randint(2,8)
try:
    item = WebDriverWait(driver, 25).until(
        EC.presence_of_element_located((By.XPATH,
            "/html/body/div/div/div/div/div[3]/div/div[2]/div/div[3]/div/div/table/tbody/tr['+str(i)+']/td[2]/div/a"))
    )
    item.click()
except:
    driver.quit()
-
VBA Selenium: Selecting a button behind multiple div tags
I am using Selenium with VBA on the webpage shown below. I am trying to get the bot to select the "exportbutton" ID, but I have already tried FindElementByXPath etc., and an error is always returned saying it cannot locate that ID. It looks like the button is behind multiple divs and I am not sure how to select it. I have already looked for iframes and there are none present in the webpage. How can I select this button behind this header?
Button Snapshot:
-
Download JSON string from webpage fails
I am trying to download a JSON Object from this particular URL:
https://www.dhl.de/int-verfolgen/data/search?piececode=00000
I tried using
Dim JSON As String = New System.Net.WebClient().DownloadString(URL)
and also with the HtmlAgilityPack:
Dim CurrWebPage As New HtmlAgilityPack.HtmlWeb
Dim CurrHTMLDoc As HtmlAgilityPack.HtmlDocument
CurrHTMLDoc = CurrWebPage.LoadFromBrowser(URL)
but it does not work either.
The program just stops working here.
What makes this page special? Can someone please help me get this JSON string?
Note: I do not get any exception. It behaves like a deadlock; the debugger just doesn't continue.
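Not an answer from the source, but a hedged guess: endpoints like this often respond only to requests carrying browser-like headers, and WebClient sends almost none. A minimal sketch of that idea, written in C# for brevity (the header set is an assumption, not a verified requirement of the DHL endpoint):

using System;
using System.Net.Http;
using System.Threading.Tasks;

class Program
{
    static async Task Main()
    {
        // Placeholder tracking number from the question; a real piececode is needed for real data.
        var url = "https://www.dhl.de/int-verfolgen/data/search?piececode=00000";

        using var client = new HttpClient();
        // Browser-like headers; assumption that the endpoint filters on these.
        client.DefaultRequestHeaders.UserAgent.ParseAdd("Mozilla/5.0 (Windows NT 10.0; Win64; x64)");
        client.DefaultRequestHeaders.Accept.ParseAdd("application/json");

        string json = await client.GetStringAsync(url);
        Console.WriteLine(json);
    }
}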
-
C#: Remove HTML numeric symbol codes from Html Agility Pack inner text
I would like to know if there is any way to remove numeric symbol codes (such as [) from the inner text retrieved from a site node using Html Agility Pack.
Programming language: C#
Below is the code I used to get the inner text:
HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument htmlDocument = web.Load(result[0] /* Here comes the url of a wikipedia's page */);
result_title.Text = result[1];
result_content.Text = "";
for (int i = 2; i <= 4; i++)
{
    try
    {
        foreach (var item in htmlDocument.DocumentNode.SelectNodes("/html/body/div[3]/div[3]/div[5]/div[1]/p[" + i + "]"))
        {
            result_content.Text += item.InnerText;
        }
    }
    catch (NullReferenceException)
    {
        result_content.Text = "This is a content page. Please refer the url";
    }
    catch (Exception)
    {
        break;
    }
}

System.Windows.Documents.Hyperlink hyperlink = new System.Windows.Documents.Hyperlink
{
    NavigateUri = new Uri("" + result[0]),
};
hyperlink.Inlines.Add("Read More...");
hyperlink.RequestNavigate += Hyperlink_RequestNavigate;
result_content.Inlines.Add(hyperlink);
Search_Progress.Visibility = Visibility.Collapsed;
selected_result_scroll.Visibility = Visibility.Visible;
The output is below:
As you can see in the image, it shows the inner text grabbed from the Wikipedia page's body. I'd like to know if there is any way I can remove those numeric symbol codes from it (the ones marked in red).
The text shown on the Wikipedia site is below (if you want to see what those codes render as on the web):
Ambareesha is a 2014 Indian Kannada-language action film directed and produced by Mahesh Sukhadhare under the Sri Sukhadhare Pictures banner.[1][2] The film stars Darshan, Rachita Ram and Priyamani. Dr.Ambareesh and his wife Sumalatha Ambareesh will be seen in guest roles.[3] The soundtrack and score is composed by V. Harikrishna and the cinematography is by Ramesh Babu.
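Two things that may help here, sketched as assumptions rather than a confirmed fix: HtmlEntity.DeEntitize (built into Html Agility Pack) decodes numeric character references such as &#91; into plain characters, and a small regex can then strip Wikipedia's [1]-style citation markers.

using System.Text.RegularExpressions;
using HtmlAgilityPack;

static string CleanWikipediaText(string innerText)
{
    // Decode numeric character references, e.g. "&#91;" -> "[".
    string decoded = HtmlEntity.DeEntitize(innerText);

    // Strip citation markers like "[1]" or "[23]" (assumes digits in brackets).
    return Regex.Replace(decoded, @"\[\d+\]", "");
}

In the loop above, result_content.Text += CleanWikipediaText(item.InnerText); would then append the cleaned paragraph text.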