How would I select a selector in CSS that changes?

I'm trying to scrape movie titles from Tmdb but each title has a different selector. Is there a way for me to get them all in one go?

For example: The css selector for Birdman is .7, Star Wars is .9, and other movies have different ones.

You may ask why I don't just get the titles like this; it is because I need to go to each movie's page in order to get the genre as well.

import scrapy


class PosterSpider(scrapy.Spider):
    name = "movieposter - imgsearch"
    start_urls = [""]

    def parse(self, response):
        url = response.css('.logo ~ li:nth-child(3) > a')
        yield scrapy.Request(url.xpath("@href").extract_first(), self.parse_page)

    def parse_page(self, response):
        """Press the 'next' button and go through each movie poster."""

        for href in response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "view_more", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "result", " " ))]'):
            yield scrapy.Request(href.xpath('@href').extract_first(), self.parse_covers)

        next = response.css('.glyphicons-circle-arrow-right').xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "glyphicons-circle-arrow-right", " " ))]')
        yield scrapy.Request(next.xpath("@href").extract_first(), self.parse_page)

    def parse_covers(self, response):
        img = response.css('.zoom a').xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "zoom", " " ))]//a')

        # what to put for this selector?
        title = response.css().extract_first()

        genre = response.css('.genres a').extract_first()

        yield MoviePoster(title=title, genre = genre, file_urls=[])

2 answers

  • answered 2018-03-13 23:03 Bill Bell

    Not what you asked for but a method for doing what you want, I think.

    The usual preliminaries:

    >>> import requests
    >>> page = requests.get('').text
    >>> import bs4
    >>> soup = bs4.BeautifulSoup(page, 'lxml')

    Now use find_all with a Python function to identify elements whose id attributes begin with 'movie_'.

    >>> import re
    >>> def movie_id(id):
    ...     return id and re.compile('^movie_').match(id)
    ...
    >>> movies = soup.find_all(id=movie_id)

    There are 61 of them in the page you highlighted for consideration.

    >>> len(movies)
    61

    Here's the content of the first item.

    >>> movies[0]
    <a alt="Inside Out" class="result" href="/movie/150540?language=en" id="movie_150540" title="Inside Out">
    <img alt="Inside Out" class="poster lazyload fade" data-sizes="auto" data-src="" data-srcset=" 1x, 2x"/>
    <div class="meta">
    <span class="hide popularity_rank_value" id="popularity_50cdfd9c19c2957b79385f6e_value">
    <div class="tooltip_popup popularity">
    <h3>Popularity Rank</h3>
    <p>Today: 42</p>
    <p>Last Week: 132</p>
    <span class="glyphicons glyphicons-cardio x1 popularity_rank" id="popularity_50cdfd9c19c2957b79385f6e"></span>
    <span class="right">

    You can dig out the title in this way.

    >>> movies[0].attrs['title']
    'Inside Out'
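
    To collect every title in one pass, along with the link needed to visit each movie page for its genre, the same id-prefix test can be applied to all anchors. What follows is only a minimal sketch: it uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra installs, and SAMPLE_HTML, MovieLinkParser, and the second movie entry are made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical sample markup shaped like the anchors on the listing page.
# The first entry mirrors the element shown above; the second is invented.
SAMPLE_HTML = """
<div class="results">
  <a id="movie_150540" class="result" href="/movie/150540?language=en" title="Inside Out"></a>
  <a id="movie_999999" class="result" href="/movie/999999?language=en" title="Example Movie"></a>
  <a id="logo" href="/" title="Home"></a>
</div>
"""


class MovieLinkParser(HTMLParser):
    """Collect (title, href) for every <a> whose id starts with 'movie_'."""

    def __init__(self):
        super().__init__()
        self.movies = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # Anchors without an id are skipped, like the `id and ...` guard above.
        element_id = attributes.get("id") or ""
        if element_id.startswith("movie_"):
            self.movies.append((attributes.get("title"), attributes.get("href")))


parser = MovieLinkParser()
parser.feed(SAMPLE_HTML)
for title, href in parser.movies:
    print(title, href)
```

    Each href collected this way is relative, so it would still need to be joined with the site's base URL before requesting the movie page.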

  • answered 2018-03-14 00:09 G_M

    I think when a site has an API (and it has the information you are looking for), you should use it instead of web scraping. TheMovieDB API seems to allow 4 requests per second, and it took only a minute to sign up.

    The script below (written with Python 3.6.4) uses total_pages=100 (you could set it as high as 1000, the API's maximum), and each page returns 20 movies as JSON. I had to make a separate API call to get the human-readable genres, but everything seems to work fine. For 100 pages, this code took about 40 seconds to run, and all the results are then saved to a file for you to work with later.

    import json
    import time

    import requests


    class PopularMovies:
        API_KEY = 'YOUR_API_KEY'
        BASE_URL = ''

        def __init__(self):
            self.session = requests.Session()
            self.genres = self._get_genres()
            self.popular_movies = []

        def _get_genres(self):
            params = {'api_key': self.API_KEY}
            # The genre-list endpoint path is missing from this post;
            # append it to BASE_URL before running.
            r = self.session.get(self.BASE_URL, params=params)
            result = {}
            for genre in r.json()['genres']:
                result[genre['id']] = genre['name']
            return result

        def _add_readable_genres(self):
            # Translate each movie's numeric genre ids into sorted names.
            for i in range(len(self.popular_movies)):
                current = self.popular_movies[i]
                genre_ids = current['genre_ids']
                current.update({
                    'genres': sorted(self.genres[g_id] for g_id in genre_ids)
                })

        def _get_popular_movies_page(self, *, page_num):
            params = {
                'api_key': self.API_KEY,
                'page': page_num,
                'sort_by': 'popularity.desc'
            }
            # The movie-listing endpoint path is missing here as well.
            r = self.session.get(self.BASE_URL, params=params)
            return r.json()

        def get_popular_movie_pages(self, *, total_pages=1):
            if not (1 <= total_pages <= 1000):
                raise ValueError('total_pages must be between 1-1000')
            for page_num in range(1, total_pages + 1):
                movies = self._get_popular_movies_page(page_num=page_num)
                # Assumes each page holds its movies under the 'results' key.
                self.popular_movies.extend(movies['results'])
                time.sleep(0.25)  # 40 requests every 10 seconds, 1 every 0.25sec
            self._add_readable_genres()

        def write_to_file(self, *, filename='popular_movies.json'):
            with open(filename, 'w') as f:
                json.dump(self.popular_movies, f, indent=4)


    if __name__ == '__main__':
        movies = PopularMovies()
        movies.get_popular_movie_pages(total_pages=100)
        movies.write_to_file()

        # just to show that you can easily pick out the data you want
        with open('popular_movies.json', 'r') as f:
            movies = json.load(f)
            for i, movie in enumerate(movies, start=1):
                # Assumes the movie name is stored under the 'title' key.
                print(i, movie['title'])
                for genre in movie['genres']:
                    print('    ' + genre)
                print('-' * 20)

    The console output of this script was too long to put in this answer but here is a link to it.

    Also, here is a link to popular_movies.json to show how much extra information you get for each movie (allowing you to expand in the future to more than just titles and genres).
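
    The least obvious step in the approach above is the genre lookup: each movie arrives with numeric genre_ids, and a separate genre list is needed to turn them into names. Here is a small offline sketch of that step. The sample data, the id values, and the helper names build_genre_lookup and add_readable_genres are all made up for illustration and are not taken from the API.

```python
# Hypothetical sample data shaped like the two responses described above.
# The ids and names are invented; real values come from the API.
genre_list = {
    "genres": [
        {"id": 1, "name": "Comedy"},
        {"id": 2, "name": "Animation"},
        {"id": 3, "name": "Drama"},
    ]
}
movies = [
    {"title": "Inside Out", "genre_ids": [1, 2]},
    {"title": "Example Movie", "genre_ids": [3]},
]


def build_genre_lookup(genre_list):
    """Map each numeric genre id to its human-readable name."""
    return {genre["id"]: genre["name"] for genre in genre_list["genres"]}


def add_readable_genres(movies, lookup):
    """Attach a sorted 'genres' list of names to every movie dict, in place."""
    for movie in movies:
        movie["genres"] = sorted(lookup[g_id] for g_id in movie["genre_ids"])
    return movies


lookup = build_genre_lookup(genre_list)
add_readable_genres(movies, lookup)
for movie in movies:
    print(movie["title"], movie["genres"])
```

    Note that an id missing from the lookup raises KeyError in this sketch; using lookup.get(g_id, "Unknown") would be one way to tolerate that.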