How do I web scrape this dynamic page?

Issue

I am trying to web scrape movie reviews from Rotten Tomatoes. An example would be for the following movie.

If I'm correct, this is a dynamic webpage: when I go to the next page of reviews, the URL does not change and the page does not refresh. As a result, when I scrape it the usual way with Scrapy, I only get the reviews from the first page.

I am a beginner at both web scraping and Selenium. I tried the following code after following an online tutorial (Scraping a JS-Rendered Page):

from selenium import webdriver

browser = webdriver.Chrome(executable_path="/Users/me/Downloads/chromedriver")

url = "https://www.rottentomatoes.com/m/notebook/reviews?type=user"

browser.get(url)

innerHTML = browser.execute_script("return document.body.innerHTML")

print(innerHTML)

I expected to see the reviews on the second page, but it still only displays the first page reviews. What should I do to be able to scrape beyond the first page for all the reviews?

Solution

If you are not familiar with web scraping in Python, I recommend this book:

Web Scraping with Python, 2nd Edition

I also think requests is a lighter and more elegant choice than Selenium here, because the page loads its reviews from a JSON endpoint that you can call directly.

The following code may help:

import time
import requests

headers = {
    'Referer': 'https://www.rottentomatoes.com/m/notebook/reviews?type=user',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}

url = 'https://www.rottentomatoes.com/napi/movie/00d1dd5b-5a41-3248-9080-3ef553dd9015/reviews/user'

payload = {
    'direction': 'next',
    'endCursor': '',
    'startCursor': '',
}

sess = requests.Session()

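while True:
    r = sess.get(url, headers=headers, params=payload)
    r.raise_for_status()
    data = r.json()

    # Print this page's reviews before checking for a next page,
    # so the last page is not skipped.
    for x in data['reviews']:
        user = x['user']['displayName']
        review = x['review']
        print(user, review)

    if not data['pageInfo']['hasNextPage']:
        break

    # Pass the cursors from this response to request the next page.
    payload['endCursor'] = data['pageInfo']['endCursor']
    payload['startCursor'] = data['pageInfo']['startCursor']

    # Pause between requests to avoid hammering the server.
    time.sleep(1)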
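
The cursor loop above can also be written as a generator that yields each page's reviews before checking hasNextPage, so the final page is included. This is only a sketch: iter_reviews, fetch_page, FAKE_PAGES and fake_fetch are names I made up for illustration, and the fake pages just imitate the pageInfo shape used above so the logic can be tried without hitting the site.

```python
from typing import Callable, Dict, Iterator


def iter_reviews(fetch_page: Callable[[Dict[str, str]], dict]) -> Iterator[dict]:
    """Yield every review, following the cursor until the last page.

    fetch_page is a hypothetical callable: it takes the query parameters
    and returns the decoded JSON for one page.
    """
    params = {"direction": "next", "endCursor": "", "startCursor": ""}
    while True:
        data = fetch_page(params)
        # Yield this page's reviews first so the last page is not dropped.
        yield from data.get("reviews", [])
        page_info = data.get("pageInfo", {})
        if not page_info.get("hasNextPage"):
            break
        # Carry the cursors from this response into the next request.
        params = {
            "direction": "next",
            "endCursor": page_info.get("endCursor", ""),
            "startCursor": page_info.get("startCursor", ""),
        }


# Stand-in for the network call: two made-up pages keyed by endCursor.
FAKE_PAGES = {
    "": {
        "reviews": [{"review": "first"}],
        "pageInfo": {"hasNextPage": True, "endCursor": "abc", "startCursor": ""},
    },
    "abc": {
        "reviews": [{"review": "second"}],
        "pageInfo": {"hasNextPage": False},
    },
}


def fake_fetch(params: Dict[str, str]) -> dict:
    return FAKE_PAGES[params["endCursor"]]


reviews = [r["review"] for r in iter_reviews(fake_fetch)]
print(reviews)
```

To use it against the real endpoint, fetch_page could be something like lambda p: sess.get(url, headers=headers, params=p).json(), with a time.sleep(1) between pages as in the loop above.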

Answered By – taseikyo

This answer, collected from Stack Overflow, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
