What order does Scrapy's CrawlSpider follow when scraping pages?
I am new to Scrapy and am reading Learning Scrapy to study, and I have a question about the crawl order.
The book provides this piece of code:
rules = (
    Rule(LinkExtractor(restrict_xpaths='//*[contains(@class,"next")]')),
    Rule(LinkExtractor(restrict_xpaths='//*[@itemprop="url"]'), callback='parse_item'),
)
It says that Scrapy uses a LIFO strategy to crawl, so I supposed that the first item scraped would be one on the last page, but it turns out the first item is on the first page.
Why? According to the code, I thought Scrapy would keep following the first rule until it found the last page, and only then start to parse items on the last page. I am confused.
And if a website has millions of pages, would Scrapy not parse any items until it reaches the last page?
All of the rules are applied on every page, in the order of the tuple.
For example you have two rules:
- find other pagination pages (no callback)
- find products (with callback)
If you run this spider on the 1st page, it will find other pagination URLs and schedule them, then find products and schedule them with the parse_product callback (or whatever you have set). Afterwards, for any scheduled URL that has the default callback (where you haven't specified the callback argument), it will repeat this until nothing is found anymore.
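This scheduling can be modeled in a few lines of plain Python. The toy model below uses made-up page data and ignores concurrency (real Scrapy fetches several requests at once, so the real order is less strict), but it shows why items from page 1 come back first even with a LIFO queue: both rules run on every page, and the product requests scheduled last are popped first.

```python
# Three made-up pagination pages, each with a "next" link and two products.
pages = {
    'page1': {'next': ['page2'], 'products': ['p1', 'p2']},
    'page2': {'next': ['page3'], 'products': ['p3', 'p4']},
    'page3': {'next': [],       'products': ['p5', 'p6']},
}

parsed_items = []
queue = [('page', 'page1')]
seen = {'page1'}
while queue:
    kind, url = queue.pop()              # LIFO, like Scrapy's default scheduler
    if kind == 'item':
        parsed_items.append(url)         # callback='parse_item' runs here
        continue
    # Rule 1 first: schedule the "next page" link (no callback)
    for nxt in pages[url]['next']:
        if nxt not in seen:
            seen.add(nxt)
            queue.append(('page', nxt))
    # Rule 2 second: schedule product requests with the item callback
    for prod in pages[url]['products']:
        queue.append(('item', prod))

# parsed_items now starts with page 1's products, not the last page's.
```

Because the product requests are scheduled after the pagination request on each page, the LIFO pop reaches them first, so no page's items wait for the last page to be found.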
See also questions close to this topic
No module named 'theano.tensor.signal.downsample' in sklearn-theano
I am working on Google Colab and want to use the sklearn-theano package.
However, when I do:
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
from sklearn_theano.datasets import load_sample_image
from sklearn_theano.feature_extraction import OverfeatLocalizer
from sklearn_theano.feature_extraction import get_all_overfeat_labels
I get an error message :
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-7-f6f4d3330da0> in <module>()
      1 import matplotlib.pyplot as plt
      2 from matplotlib.patches import Rectangle
----> 3 from sklearn_theano.datasets import load_sample_image
      4 from sklearn_theano.feature_extraction import OverfeatLocalizer
      5 from sklearn_theano.feature_extraction import get_all_overfeat_labels

/content/sklearn-theano/sklearn_theano/datasets/__init__.py in <module>()
      6 from .base import load_sample_image
      7 from .base import load_sample_images
----> 8 from .generators import fetch_mnist_generated
      9 from .generators import fetch_cifar_fully_connected_generated
     10

/content/sklearn-theano/sklearn_theano/datasets/generators.py in <module>()
     10 from sklearn.utils import check_random_state
     11
---> 12 from ..base import (Feedforward, fuse)
     13 from ..datasets import get_dataset_dir, download
     14

/content/sklearn-theano/sklearn_theano/base.py in <module>()
      8 import numpy as np
      9 import theano.tensor as T
---> 10 from theano.tensor.signal.downsample import max_pool_2d
     11
     12
ModuleNotFoundError: No module named 'theano.tensor.signal.downsample'
I saw this post: here, but theano-cache purge does not work. I don't know how to proceed to avoid these errors...
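For context, newer Theano releases removed the theano.tensor.signal.downsample module; its max_pool_2d lives on as pool_2d in theano.tensor.signal.pool, while sklearn-theano still imports the old path. One workaround is to register an alias module under the old name before the failing import runs. This is a sketch only: Theano is not imported here, so a stand-in function is passed where the real pool_2d would go.

```python
import sys
import types

def install_downsample_shim(pool_2d):
    """Register theano.tensor.signal.downsample as an alias module.

    pool_2d would normally be theano.tensor.signal.pool.pool_2d; after this
    runs (with Theano installed), the old-style import in sklearn-theano's
    base.py can resolve again.
    """
    shim = types.ModuleType('theano.tensor.signal.downsample')
    shim.max_pool_2d = pool_2d  # expose the new function under its old name
    sys.modules['theano.tensor.signal.downsample'] = shim
```

With Theano available you would call install_downsample_shim(theano.tensor.signal.pool.pool_2d) before importing sklearn_theano. Patching sklearn-theano's base.py to import pool_2d directly is the cleaner permanent fix.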
axes.set_xticklabels breaks datetime format
I'm trying to force my will onto this matplotlib graph. When I set ax1.xaxis.set_major_formatter(myFmt) it works fine, like in the upper graph. However, when I add ax1.set_xticklabels((date), rotation=45) the time format reverts to matplotlib time, like in the lower graph.
Both use the same input time variable. I also tried ax1.plot_date(), but that only changes the look of the graph, not the time format.
date_1 = np.vectorize(dt.datetime.fromtimestamp)(time_data)  # makes datetime objects from unix timestamps
date = np.vectorize(mdates.date2num)(date_1)                 # from datetime makes matplotlib time
myFmt = mdates.DateFormatter('%d-%m-%Y/%H:%M')
ax1 = plt.subplot2grid((10,3), (0,0), rowspan=4, colspan=4)
ax1.xaxis_date()
ax1.plot(date, x)
ax1.xaxis.set_major_formatter(myFmt)
ax1.set_xticklabels((date), rotation=45)  # ignores time format
Any ideas how I can force the custom time format onto the xticklabels? I get that set_xticklabels directly reads and displays the date variable, but shouldn't it be possible to make it stick to the format? Especially if you later want to add xticks at custom date locations.
All ideas appreciated. Cheers
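What likely happens here: set_xticklabels replaces the axis's DateFormatter with a FixedFormatter built from the strings you pass in, which is why the custom format is lost. If the goal is only the 45-degree rotation, rotating the labels without touching the formatter keeps the date format. A minimal sketch with made-up sample data (labelrotation needs matplotlib 2.1 or newer):

```python
import matplotlib
matplotlib.use('Agg')  # no display needed for this sketch
import datetime as dt
import matplotlib.dates as mdates
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
dates = [dt.datetime(2018, 5, 1) + dt.timedelta(hours=i) for i in range(6)]
ax.plot(dates, range(6))

myFmt = mdates.DateFormatter('%d-%m-%Y/%H:%M')
ax.xaxis.set_major_formatter(myFmt)          # custom date format
ax.tick_params(axis='x', labelrotation=45)   # rotate labels, formatter untouched
fig.canvas.draw()
```

plt.setp(ax.get_xticklabels(), rotation=45) is an older equivalent. For ticks at custom date locations, a FixedLocator (or ax.set_xticks) combined with the DateFormatter also leaves the format intact.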
Python 3.7 xlwt write
I again have a problem with writing values into an Excel workbook. I figured out how to use xlrd + xlwt together for my purpose, but since I changed the code for converting datetime values and for handling badly encoded strings, it doesn't work anymore.
Traceback (most recent call last):
  File "D:\rs_al\IdeaProjects\ExcelToSQL\PyXLSSQL\XLS_xlutils.py", line 96, in <module>
    Excel.xls_wrk(filename)
  File "D:\rs_al\IdeaProjects\ExcelToSQL\PyXLSSQL\XLS_xlutils.py", line 89, in xls_wrk
    ws.write(row_idx,col_idx, val)
  File "C:\Python\lib\site-packages\xlwt\Worksheet.py", line 1088, in write
    self.row(r).write(c, label, style)
  File "C:\Python\lib\site-packages\xlwt\Row.py", line 254, in write
    raise Exception("Unexpected data type %r" % type(label))
Exception: Unexpected data type <class 'xlrd.sheet.Cell'>
I've checked all cell types; there are only type 1 (unicode string), type 2 (float) and type 3 (date, stored as a float). When I print the values, everything is OK.
import ftfy
import xlrd
import xlwt
from xlrd import open_workbook

class Excel:
    # converting to xlsx for working with openpyxl
    def xls_wrk(filename):
        # XLRD
        rb = open_workbook('abbcards.xls')
        rs = rb.sheet_by_index(0)
        rows = rs.nrows
        cols = rs.ncols
        wb = xlwt.Workbook()
        ws = wb.add_sheet('Part1')
        # iterate and prepare format for SQL db tables
        for row_idx in range(0, rows):
            for col_idx in range(0, cols):
                cell = rs.cell(row_idx, col_idx)
                ctp_in = cell.ctype
                cval = cell.value
                # Input string value "whitespace" in empty cells
                if ctp_in == xlrd.XL_CELL_EMPTY:
                    ctp_in = xlrd.XL_CELL_TEXT
                    cval = " "
                elif ctp_in == xlrd.XL_CELL_ERROR:
                    ctp_in = xlrd.XL_CELL_TEXT
                    cval = " "
                # Fixing date
                elif ctp_in == xlrd.XL_CELL_DATE:
                    # Manual fix for the negative value
                    if cval == -693594:
                        ctp_in = xlrd.XL_CELL_DATE
                        # BUG was here: rs.cell() returns a Cell object, which
                        # xlwt cannot write; take .value instead
                        cval = rs.cell(row_idx, col_idx - 7).value
                    else:
                        ctp_in = xlrd.XL_CELL_DATE
                        cval = xlrd.xldate.xldate_as_datetime(cval, rb.datemode)
                # fixing negative values
                elif ctp_in == xlrd.XL_CELL_NUMBER and cval < 0:
                    cval = 0
                # Fixing UTF-8 broken as cp1252 letters with the ftfy package
                elif ctp_in == xlrd.XL_CELL_TEXT:
                    # Broken UTF-8 that the ftfy package can't fix:
                    # manual fixes for car plates from "abbcards"
                    if cval == "Ð 869Ð¡Ð—197":
                        cval = "Р869СЗ197"
                    elif cval == "H613Ð'Y":
                        cval = "H613BY"  # important: car plate number is in Latin
                    elif cval == "Ð'509Ð¡Ð'777":
                        cval = "В509СВ177"
                    elif cval == "Ð'674Ð¡Ð¡199":
                        cval = "В674СС199"
                    elif cval == "T357KÐž777":
                        cval = "T357KО777"
                    elif cval == "Ð'010Ð¡Ð¡199":
                        cval = "В010СС199"
                    elif cval == "E174Ð¡Ð 777":
                        cval = "E174СР777"
                    else:
                        cval = ftfy.fix_text(cval)
                # print(cval)
                ws.write(row_idx, col_idx, cval)
        wb.save('text.xls')
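The traceback names the cause directly: xlwt's Row.write accepts only plain values (strings, numbers, dates, booleans, None), and an xlrd Cell object is reaching it. In the code above that happens in the negative-date branch, where cval is assigned rs.cell(...) rather than rs.cell(...).value. A stand-in illustration of the unwrap (FakeCell mimics xlrd.sheet.Cell, since no workbook is opened here):

```python
class FakeCell:
    """Stand-in for xlrd.sheet.Cell: it carries ctype and value."""
    def __init__(self, ctype, value):
        self.ctype = ctype
        self.value = value

def to_writable(obj):
    # Unwrap a Cell before handing it to ws.write(); xlwt raises
    # "Unexpected data type" for anything it does not recognise.
    return obj.value if isinstance(obj, FakeCell) else obj
```

With the real classes, the same rule applies: always pass cell.value, never the cell object itself, to ws.write().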
VBA for extracting from URL
I am trying to extract the last page number from a URL; I want to get the maximum number of pages.
Below is my URL:
Below is the VBA code I tried, but there is a problem somewhere:
Dim sResponse As String, html As HTMLDocument
Dim url As String
Dim N As Long
Dim X As Long

url = ActiveCell.Value
With CreateObject("MSXML2.XMLHTTP")
    .Open "GET", url, False
    .setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
    .send
    sResponse = StrConv(.responseBody, vbUnicode)
End With

Set html = New HTMLDocument
With html
    .body.innerHTML = sResponse
    ' getElementByClass does not exist; use getElementsByClassName(...)(0)
    ' Note: both lines below write to the same cell, so the second overwrites the first
    ActiveCell.Offset(1, 0) = .getElementsByClassName("Jpag")(0).innerText
    ActiveCell.Offset(1, 0) = .getElementById("srchpagination").innerText
    ActiveCell.Offset(0, 1).Select
End With
Please, can anybody help me out?
Json Data Request from Web using Python
I am new to Python. I am currently studying web scraping using requests, but I am stuck on getting a data table in JSON format from the link below.
Can anyone help? Thanks.
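Since the link itself is not included in the post, here is a generic sketch of fetching and decoding a JSON payload with the standard library. The headers are illustrative assumptions: many sites serve JSON only to requests that look like a browser's XHR call, so a User-Agent and Accept header often help. The network call is defined but not run here; the decoding step is shown on a sample payload.

```python
import json
from urllib.request import urlopen, Request

def fetch_json(url):
    """Fetch a URL and decode its body as JSON (headers are assumptions)."""
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0',
                                'Accept': 'application/json'})
    with urlopen(req) as resp:
        return json.loads(resp.read().decode('utf-8'))

# The decoding step alone, on a made-up sample payload:
rows = json.loads('{"data": [{"id": 1}, {"id": 2}]}')["data"]
```

With the requests library the equivalent is requests.get(url, headers=...).json(). Either way, the browser's developer tools (Network tab) show the exact URL and headers the page uses to load its table.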
tracking facebook followers - codeigniter
I'm a hobbyist programmer working on tracking social media followers for my business, and good lord, I don't get the FB SDK at all. I'm simply trying to run a cron job to track how many followers we have on social media sites. What's the simplest way to get this info for Facebook?
I tried scraping with Simple HTML DOM, and it comes back with nothing.
I tried adding the sdk, but there's issues (I think) with making a request from localhost: "Insecure Login Blocked: You can't get an access token or log in to this app from an insecure page. Try re-loading the page as https://".
Uploaded the code to my web server and get: "URL Blocked: This redirect failed because the redirect URI is not whitelisted in the app’s Client OAuth Settings. Make sure Client and Web OAuth Login are on and add all your app domains as Valid OAuth Redirect URIs."
The docs suck, and googling just turns up old code from the Graph API that doesn't work either.
Sorry if this is simple, but I'm just incredibly frustrated and would love the help... like I said, I'm not the best coder, so all help would be appreciated.
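For a cron job that only needs a follower count, one route that may avoid the login-redirect errors entirely is a server-side Graph API call with a page access token (no OAuth redirect involved). This is a hedged sketch: the fan_count field on the Page node is what the Graph API exposes for page likes, but the API version in the path, the page id, and the token below are all placeholders, and a valid page access token is required. Only the URL construction is shown; no request is made here.

```python
from urllib.parse import urlencode

def fan_count_url(page_id, token):
    """Build a Graph API URL requesting a page's fan_count.

    page_id and token are placeholders; v2.12 is an assumed API version.
    """
    qs = urlencode({'fields': 'fan_count', 'access_token': token})
    return 'https://graph.facebook.com/v2.12/%s?%s' % (page_id, qs)
```

A cron script would fetch this URL (e.g. with requests or curl) and store the fan_count value from the JSON response. The "Insecure Login Blocked" and "URL Blocked" errors in the post relate to the OAuth login flow, which a pure server-to-server token call sidesteps.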
How to turn off scrapy's ImagesPipeline automatically creating a "full" folder?
When downloading images using scrapy's ImagesPipeline, I have set the save path, but it still creates a new "full" folder inside that path. I don't want it to create this "full" folder. How can I turn that off?
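The "full" folder comes from ImagesPipeline.file_path, whose default implementation returns a path of the form 'full/<sha1-of-url>.jpg' under IMAGES_STORE ("full" distinguishes original images from generated thumbnails). Overriding file_path on a pipeline subclass drops the prefix. The method is sketched below as a plain function so no Scrapy install is needed; in a project it would be a method on a subclass of scrapy.pipelines.images.ImagesPipeline, registered in ITEM_PIPELINES in place of the default.

```python
import hashlib

def file_path(self, request, response=None, info=None):
    """Save images directly under IMAGES_STORE instead of IMAGES_STORE/full/."""
    image_guid = hashlib.sha1(request.url.encode('utf-8')).hexdigest()
    return '%s.jpg' % image_guid  # the default returns 'full/%s.jpg' % image_guid
```

Any relative path returned here is joined onto IMAGES_STORE, so you can also return per-item subfolders this way.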
Scrapy - unexpected return when there is Chinese character in xpath
I am new, and I know there is a similar question to this one. However, I don't think that problem was solved.
The version of Scrapy I am using is 1.0.3, and the environment is a VirtualBox VM. What I am trying to do is to scrape all information from
which has "西二旗" in the @title. My script is like this:
keywords = u'领秀'
response.xpath('//h2/a[contains(@title,keywrods)]/text()').extract()
and the output is like this:
[u'\u897f\u4e8c\u65d7\u9886\u79c0\u65b0\u7845\u8c37\u81ea\u4f4f\u578b\u8054\u6392\u522b\u5885', u'\u91d1\u5c71\u8f6f\u4ef6 \u5c0f\u7c73 \u4e94\u5f69\u57ce \u897f\u4e8c\u65d7\u8f6f\u4ef6\u56ed', u'\u9f99\u5174\u56ed\u7cbe\u88c5\u4e24\u5c45\u5ba4\uff0c\u9f99\u6cfd\u56de\u9f99\u89c2\u897f\u4e8c\u65d7\u5317\u6e05\u8def\u3002', u'\u878d\u6cfd\u5609\u56ed\u897f\u4e8c\u65d7\u9f99\u6cfd \u7cbe\u88c5\u4e09\u5c45 \u6708\u5e95\u62ce\u5305\u5165\u4f4f', u'\u9f99\u5174\u56ed\u5317\u533a\u53f2\u8bd7\u7ea7\u7cbe\u88c5\u4fee\u4e24\u5c45\u5ba4\uff0c\u9f99\u6cfd\u56de\u9f99\u89c2\u897f\u4e8c\u65d7\u3002', u'\u4e94\u5f69\u57ce \u5c0f\u7c73 \u91d1\u5c71\u8f6f\u4ef6 \u897f\u4e8c\u65d7\u8f6f\u4ef6\u56ed \u4e0a\u5730\u4e09\u8857', u'\u6b63\u89c4\u5357\u5317\u901a\u900f\u5927\u4e24\u5c45\u6708\u5e95\u5230\u671f\u897f\u4e8c\u65d7\u8f6f\u4ef6\u56ed\u767e\u5ea6', u'\u56de\u9f99\u89c2\u9f99\u6cfd\u897f\u4e8c\u65d7\u5317\u4eac\u4eba\u5bb6\u7cbe\u88c5\u4e24\u5c45\u5bbd\u655e\u660e\u4eae\u62ce\u5305\u4f4f', u'\u897f\u4e8c\u65d7\u9f99\u6cfd\u7535\u68af\u697c\u843d\u5730\u7a97\u5317\u4eac\u4eba\u5bb6\u4e24\u5c45\u7cbe\u88c5\u62ce\u5305\u4f4f']
which returns all the elements, no matter whether they contain the keyword or not.
So I really want to know what's happening. I also tried this in Chrome with
$x('//h2/a[contains(@title,"领秀")]') and it works fine (only one element returned).
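The likely cause: inside the XPath string, the bare word keywrods (note the typo, too) is not the Python variable. XPath parses it as a child element named keywrods; no such element exists, its string value is therefore '', and contains(@title, '') is true for every node, so every element comes back. The Chrome test works because there the keyword is a quoted string literal. Building the expression from the Python variable fixes it:

```python
# -*- coding: utf-8 -*-
# Interpolate the Python variable into the XPath string; XPath itself
# has no access to Python names.
keyword = u'西二旗'
query = u'//h2/a[contains(@title, "%s")]/text()' % keyword
# response.xpath(query).extract()  # now returns only matching titles
```

The same works with str.format, and Scrapy 1.x parsel also accepts XPath variables via response.xpath('//h2/a[contains(@title, $kw)]', kw=keyword) in later versions.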
How to get stats value after CrawlerProcess finished, i.e. at line after process.start()
I am using this code somewhere inside the spider:
So, when this exception is raised, eventually my spider finishes working, and I get stats in the console with this string:
But how can I get it from code? I want to run the spider again in a loop, based on info from these stats, something like this:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
import spaida.spiders.spaida_spider
import spaida.settings

you_need_to_rerun = True
while you_need_to_rerun:
    process = CrawlerProcess(get_project_settings())
    process.crawl(spaida.spiders.spaida_spider.SpaidaSpiderSpider)
    process.start(stop_after_crawl=False)  # the script will block here until the crawling is finished
    finish_reason = 'and here I get somehow finish_reason from stats'  # <- how??
    if finish_reason == 'finished':
        print("everything ok, I don't need to rerun this")
        you_need_to_rerun = False
I found this thing in the docs, but can't get it right; where is that "The stats can be accessed through the spider_stats attribute, which is a dict keyed by spider domain name"? https://doc.scrapy.org/en/latest/topics/stats.html#scrapy.statscollectors.MemoryStatsCollector.spider_stats
P.S.: I'm also getting the error twisted.internet.error.ReactorNotRestartable when using process.start(), and recommendations to use process.start(stop_after_crawl=False); but then the spider just stops and does nothing. That is another problem, though...
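One way to reach the stats, sketched under the assumption of Scrapy's default MemoryStatsCollector: keep a reference to the Crawler object and read its stats after start() returns, roughly

    crawler = process.create_crawler(SpaidaSpiderSpider)
    process.crawl(crawler)
    process.start()
    finish_reason = crawler.stats.get_value('finish_reason')

The spider_stats attribute mentioned in the docs lives on the stats collector and is a plain dict keyed by spider name, so the lookup itself is ordinary dict access. Illustrated below with stand-in numbers, since no crawl runs here:

```python
# Shape of MemoryStatsCollector.spider_stats after a crawl (values made up):
spider_stats = {'spaida_spider': {'finish_reason': 'finished',
                                  'item_scraped_count': 42}}

finish_reason = spider_stats['spaida_spider'].get('finish_reason')
```

For the rerun loop, running each crawl in a fresh subprocess is a common way around ReactorNotRestartable, since Twisted's reactor cannot be restarted within one process.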