Issue
I am trying to web scrape movie reviews from Rotten Tomatoes. An example would be for the following movie.
If I’m correct, this is a dynamic webpage: when I go to the next page of reviews, the URL does not change and the page does not refresh. As a result, when I scrape it the usual way with Scrapy, I only get the reviews from the first page.
I am a beginner at both web scraping and Selenium. I tried the following code after following an online tutorial (Scraping a JS-Rendered Page):
from selenium import webdriver
browser = webdriver.Chrome(executable_path="/Users/me/Downloads/chromedriver")
url = "https://www.rottentomatoes.com/m/notebook/reviews?type=user"
browser.get(url)
innerHTML = browser.execute_script("return document.body.innerHTML")
print(innerHTML)
I expected to see the reviews on the second page, but it still only displays the first page reviews. What should I do to be able to scrape beyond the first page for all the reviews?
Solution
If you are not familiar with web scraping in Python, I recommend the book Web Scraping with Python, 2nd Edition.
I also think using requests instead of selenium is more lightweight and elegant here: the reviews are served as JSON from an endpoint that can be queried directly, without a browser.
The following code may help:
import time
import requests

# Headers mimicking the browser's own XMLHttpRequest call
headers = {
    'Referer': 'https://www.rottentomatoes.com/m/notebook/reviews?type=user',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}

# JSON endpoint that serves the user reviews for this movie
url = 'https://www.rottentomatoes.com/napi/movie/00d1dd5b-5a41-3248-9080-3ef553dd9015/reviews/user'

# Cursor-based pagination parameters; empty cursors request the first page
payload = {
    'direction': 'next',
    'endCursor': '',
    'startCursor': '',
}

sess = requests.Session()

while True:
    r = sess.get(url, headers=headers, params=payload)
    data = r.json()

    # Print this page's reviews before checking for a next page,
    # so the last page is not skipped
    for x in data['reviews']:
        user = x['user']['displayName']
        review = x['review']
        print(user, review)

    if not data['pageInfo']['hasNextPage']:
        break

    # Move the cursors forward to request the next page
    payload['endCursor'] = data['pageInfo']['endCursor']
    payload['startCursor'] = data['pageInfo']['startCursor']

    time.sleep(1)  # pause between requests to avoid hammering the server
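The same cursor loop can also be wrapped in a generator, which keeps the paging logic separate from the HTTP call and lets it be tried without touching the network. This is only a sketch: iter_reviews, fetch_page, and fake_fetch are hypothetical names, and the canned pages merely assume the JSON shape used in the code above (reviews, pageInfo.hasNextPage, pageInfo.endCursor, pageInfo.startCursor).

```python
import time


def iter_reviews(fetch_page, delay=1.0):
    """Yield every review from a cursor-paginated endpoint.

    fetch_page is a hypothetical callable: it receives the query
    parameters (a dict) and returns the decoded JSON page (a dict).
    """
    params = {'direction': 'next', 'endCursor': '', 'startCursor': ''}
    while True:
        data = fetch_page(params)
        # Yield this page's reviews before deciding whether to stop,
        # so the final page is included.
        yield from data.get('reviews', [])
        page_info = data.get('pageInfo', {})
        if not page_info.get('hasNextPage'):
            break
        # Advance the cursors to request the following page.
        params['endCursor'] = page_info.get('endCursor', '')
        params['startCursor'] = page_info.get('startCursor', '')
        time.sleep(delay)  # pause between requests


def fake_fetch(params):
    """Stand-in for the real HTTP call: serves two canned pages (made-up data)."""
    pages = {
        '': {
            'reviews': [{'user': {'displayName': 'a'}, 'review': 'good'}],
            'pageInfo': {'hasNextPage': True, 'endCursor': 'c1', 'startCursor': 'c0'},
        },
        'c1': {
            'reviews': [{'user': {'displayName': 'b'}, 'review': 'bad'}],
            'pageInfo': {'hasNextPage': False},
        },
    }
    return pages[params['endCursor']]


for item in iter_reviews(fake_fetch, delay=0):
    print(item['user']['displayName'], item['review'])
```

With the real endpoint, fetch_page could be something like lambda p: sess.get(url, headers=headers, params=p).json(), reusing sess, url, and headers from the code above.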
Answered By – taseikyo
This answer was collected from Stack Overflow and is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.