Why Does My Code Return Blanks? (scraping with Scrapy)

My goal is to scrape the comics, grouped by day of the week, and save them to an Excel spreadsheet. My source is https://comic.naver.com/webtoon/weekday.nhn.

I have had success scraping the data directly through the terminal and would like to write a proper script for the entire process, but have not had much success.

Scraping the data directly through the terminal with response.xpath("//div[@class='list_area daily_all']/div[1]/div/h4/span/text()").extract() yields the data correctly. The weekdays are ordered from div[1] through div[7], and this code returns "Monday."

The following code returns a list of Monday comics: response.xpath("//div[@class='list_area daily_all']/div[1]/div//ul/li/a[@class='title']/text()").extract()

However, the following code does not return the desired results.

def parse(self, response):
    for webtoon in response.xpath("//div[@class='list_area daily_all']/div/div"):
        yield {
            'Day': webtoon.xpath('/h4/span/text()').extract(),
            'Title': webtoon.xpath("/ul/li/a[@class='title']/text()").extract(),
        }

The expected result would be seven lines of the following form, in order of day of the week: {'Day': [day], 'Title': [title1, title2, title3]}

However, my code is returning {'Day': [], 'Title': []}

I hope this all makes sense.

1 answer

  • answered 2019-06-12 14:33 Luiz Rodrigues da Silva

    You need to start your "Day" and "Title" XPath expressions with a . (dot).

    When you do this, it doesn't matter that you are calling webtoon.xpath instead of response.xpath: an expression that starts with / is absolute, so you are still trying to get an h4 element at the root of the document, not an h4 tag inside the list_area daily_all div.

    webtoon.xpath('/h4/span/text()').extract()
    

    The correct way to do this is to add a . before the /h4; the dot refers to the current node of your previous XPath selector. The same applies to the "Title" expression.

    webtoon.xpath('./h4/span/text()').extract()