Webscraping Multiple Pages in Python with Selenium – loop not working

Issue

I’m quite new to Python and have written a script using Selenium to scrape a website. I’ve tried everything but can’t get the loop to cycle through the pages; it currently just repeats the data from the first page five times. I want to scrape all the pages for ‘BR1’ – any help would be great.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

with open('rightmove.csv', 'w') as file:
    file.write('PropertyCardcontent \n')

PATH = ("/usr/local/bin/chromedriver")
driver = webdriver.Chrome(PATH)

driver.get("https://www.rightmove.co.uk/house-prices.html")
print(driver.title)

elem = driver.find_element(By.NAME, 'searchLocation')  # Find the search box
elem.send_keys('BR1' + Keys.RETURN)

try:
    content = WebDriverWait(driver, 15).until(
            EC.presence_of_element_located((By.ID,'content'))
            )

finally:
    time.sleep(3)

for p in range(5):
    sold = content.find_elements(By.CLASS_NAME, 'sold-prices-content-wrapper ')
    for solds in sold:
        address = solds.find_elements(By.CLASS_NAME, 'sold-prices-content ')
        for addresses in address:
            result = addresses.find_elements(By.CLASS_NAME, 'results ')
            for results in result:
                card = results.find_elements(By.CLASS_NAME,'propertyCard')
                for propertyCard in card:
                    header = propertyCard.find_elements(By.CLASS_NAME,'propertyCard-content')
                    for propertyCardcontent in header:
                        road = propertyCardcontent.find_elements(By.CLASS_NAME,'title')
                    for propertyCardcontent in header:
                        road = propertyCardcontent.find_elements(By.CLASS_NAME,'subTitle')
                        for subtitle in road:
                            bed = subtitle.find_elements(By.CLASS_NAME, 'propertyType')
    with open('rightmove.csv', 'a') as file:
        for i in range(len(result)):
            file.write(header[i].text + '\n')
        
        button = driver.find_element(By.XPATH, '//*[@id="content"]/div[2]/div[2]/div[4]/div[27]/div[3]/div')
        button.click()
    file.close()

time.sleep(3)
driver.quit()

Solution

Since the page number appears in the URL, I recommend using "https://www.rightmove.co.uk/house-prices/br1.html?page=1" as the base URL and looping through the pages by changing the page number at the end of the URL, e.g. with an f-string.

One other thing: you don’t need all those nested for loops. You can simply assign each variable directly, since everything you need sits inside one HTML block that is easy to navigate.
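
In other words, something like the rough sketch below. It reuses the driver and the WebDriverWait/EC/By imports from your script; number_of_pages is just a placeholder for however many result pages the search returns, and the class names are the ones used in the full script further down.

for page in range(1, number_of_pages + 1):
    # the page number is part of the URL, so build each URL with an f-string
    driver.get(f"https://www.rightmove.co.uk/house-prices/br1.html?page={page}")
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'propertyCard'))
    )
    # no nested loops needed: grab each property card and read its fields directly
    for card in driver.find_elements(By.CLASS_NAME, 'propertyCard'):
        title = card.find_element(By.CLASS_NAME, 'title').text
        property_type = card.find_element(By.CLASS_NAME, 'propertyType').text
        print(title, property_type)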

Update:

I’m sorry for being late, I had some unexpected things come up (…).

I’ve made some changes because I use Brave, so make sure you point the options at your own browser (Chrome, I believe); the chromedriver (v102) stays the same, or use whichever version matches your Chrome.

I’ve also grabbed the Price and Date and stored them as a tuple.
Every record is stored as a list: [Title, propertyType, tuples of (Price, Date)].

At the end, it creates a CSV and stores everything in it with ";" as the delimiter.

If you prefer, you can split the price and date into separate columns for later use; up to you.
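
So a single record ends up looking something like the example below (the values are invented, just to show the shape; the semicolon delimiter also avoids clashing with the commas inside the address).

record = ["5, Example Road, Bromley, BR1 1AA", "Terraced", ("£350,000", "15 Mar 2022"), ("£250,000", "2 Jun 2015")]
# and the corresponding CSV line:
# 5, Example Road, Bromley, BR1 1AA;Terraced;[('£350,000', '15 Mar 2022'), ('£250,000', '2 Jun 2015')]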

Note: This looping method only applies to websites where the page number is included in the URL. In this case, both the search key and the page number are part of the URL.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
import time
import random
import itertools


options = Options()
options.binary_location = r'C:\Program Files\BraveSoftware\Brave-Browser\Application\brave.exe'  # Brave install path; remove this line to use the default Chrome binary
driver = webdriver.Chrome(options=options, service=Service("chromedriver.exe"))

key_word = "BR1".lower()
base_url = f"https://www.rightmove.co.uk/house-prices/{key_word}.html?page=1"
driver.get(base_url)

#Number of pages
pages = driver.find_element(By.XPATH, '//span[@class="pagination-label"][2]').text  # label text like "of 12"
pages = int(pages.strip('of'))  # strip the letters, keep the page count


WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'results '))
    )

data = []
pc = 0
for p in range(1,pages+1):
    driver.get(f"https://www.rightmove.co.uk/house-prices/{key_word}.html?page={p}")

    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, '//div//div[@class="propertyCard"]'))
    )
    propertyCards = driver.find_elements(By.XPATH, '//div//div[@class="propertyCard"]')

    for propertyCard in propertyCards:
        title = propertyCard.find_element(By.CLASS_NAME, 'title').text
        propertyType = propertyCard.find_element(By.CLASS_NAME, 'propertyType').text        

        price_list = propertyCard.find_elements(By.CLASS_NAME, 'price')
        date_list = propertyCard.find_elements(By.CLASS_NAME, 'date-sold')

        data.append([title,propertyType])
        
        # pair up prices and dates; zip_longest pads with None if the lists differ in length
        for price_el, date_el in itertools.zip_longest(price_list, date_list, fillvalue=None):
            try:
                price = price_el.text
                date = date_el.text
                data[pc].append((price, date))
            except Exception as e:
                print(e)
        pc += 1
        
    time.sleep(random.randint(1,4))
print(data)

with open('rightmove.csv', 'w') as file:
    header = "Title;propertyType;Price_Date\n"
    file.write(header)
    for record in data:
        # record[2:] is the list of (price, date) tuples collected for this property
        file.write("{};{};{}\n".format(record[0], record[1], record[2:]))

driver.quit()
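
If you want to work with the file afterwards, here is a minimal sketch of reading it back with the standard csv module (assuming the rightmove.csv produced above; ast.literal_eval is only needed if you want the price/date string turned back into a list of tuples):

import csv
import ast

with open('rightmove.csv', newline='') as f:
    reader = csv.reader(f, delimiter=';')
    next(reader)  # skip the Title;propertyType;Price_Date header
    for title, property_type, price_dates in reader:
        # price_dates was written with str(), so parse it back into a list of (price, date) tuples
        print(title, property_type, ast.literal_eval(price_dates))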

Answered By – Shodai Thox

This answer was collected from Stack Overflow and is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
