Python Selenium: loading the next page's URL always loads the first page before switching over

Issue

I'm currently trying to scrape listing data from this website. My issue right now is that although the loop advances to the next pages, it can't seem to retrieve data from those pages (it keeps retrieving data from page 1).

I have tried using an implicit wait and time.sleep to let the pages load, but it isn't working.

I notice that whenever I load a page (for example, the second page: https://www.zoocasa.com/toronto-on-sold-listings?page=2), it first loads page 1 and then switches over to the second page.

Is there any way I can wait until the page is fully loaded and no more modifications are being made before I fetch the data?

Below is the code I currently have.

import time

import pandas as pd
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from tqdm import tqdm

def get_reference_links(self, page=5):

    addresses = []
    reference_links = []

    for page_number in tqdm(range(1, page + 1)):

        search_url = "https://www.zoocasa.com/toronto-on-sold-listings?page=" + str(page_number)
        self.driver.get(search_url)

        self.driver.implicitly_wait(20)
        time.sleep(5)
        # Need to fix and let page load completely first

        test = self.wait.until(EC.presence_of_all_elements_located(
            (By.XPATH, '//a[@itemprop="streetAddress"]')))
        for address in test:
            addresses.append(address.text)
            reference_links.append(address.get_attribute('href'))

        df = pd.DataFrame(list(zip(addresses, reference_links)),
                          columns=['Address', 'Reference Link'])

        self.dfs.append(df)

    merged_dfs = pd.concat(self.dfs)

    return merged_dfs

And here is a snapshot of the results.
[Sample result screenshot: the address 129 Davenport Rd, shown on the first page, is repeated every time.]

Solution

The algorithm here can be as follows:

1. On the first page, grab some anchor element and keep a reference to it.
2. For every subsequent page, wait until the previously kept element no longer exists (goes stale). At that point you know the page has been reloaded. Grab a new anchor element and continue.

Something like this:

import time

import pandas as pd
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from tqdm import tqdm

def get_reference_links(self, page=5):

    addresses = []
    reference_links = []
    content = None  # anchor element kept from the previously rendered page
    anchor_xpath = "//img[contains(@class,'style_component')]"
    timeout = 20  # seconds to wait for the old anchor to go stale

    for page_number in tqdm(range(1, page + 1)):

        search_url = "https://www.zoocasa.com/toronto-on-sold-listings?page=" + str(page_number)
        self.driver.get(search_url)

        time.sleep(5)  # give the initial render time to attach the anchor element

        if page_number == 1:
            # First page: just keep an anchor element for later
            content = self.driver.find_element(By.XPATH, anchor_xpath)
        else:
            # Wait until the anchor from the previous page is detached
            # from the DOM -- only then has the new page actually rendered
            WebDriverWait(self.driver, timeout).until(EC.staleness_of(content))
            content = self.driver.find_element(By.XPATH, anchor_xpath)

        test = self.wait.until(EC.presence_of_all_elements_located(
            (By.XPATH, '//a[@itemprop="streetAddress"]')))
        for address in test:
            addresses.append(address.text)
            reference_links.append(address.get_attribute('href'))

    df = pd.DataFrame(list(zip(addresses, reference_links)),
                      columns=['Address', 'Reference Link'])

    self.dfs.append(df)

    merged_dfs = pd.concat(self.dfs)

    return merged_dfs
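
Note that the anchor XPath above targets an image by a partial CSS-module class name, which may break whenever the site is rebuilt. A variation on the same staleness idea, shown here only as a rough sketch (wait_for_page_swap is a hypothetical helper, not part of the original answer), is to use the first street-address link itself as the anchor:

def wait_for_page_swap(self, old_anchor, timeout=20):
    """Wait for the previous page's anchor to be detached from the DOM,
    then return a fresh anchor from the newly rendered page."""
    if old_anchor is not None:
        # Fires once the old element is detached, i.e. the listing
        # grid from the previous page has been torn down
        WebDriverWait(self.driver, timeout).until(EC.staleness_of(old_anchor))
    # Re-locate the anchor on the freshly rendered page
    return WebDriverWait(self.driver, timeout).until(
        EC.presence_of_element_located(
            (By.XPATH, '//a[@itemprop="streetAddress"]')))

With this helper, the loop body would reduce to content = self.wait_for_page_swap(content) right after each self.driver.get(search_url) call, with content starting out as None on the first page.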

All of the above relies on explicit waits and expected conditions. That is the approach you should use here, not implicit waits.
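
To make the distinction concrete, here is a minimal sketch of the two wait styles (the 20-second timeouts are arbitrary, and old_element stands for any element captured before navigating):

# Implicit wait: a global polling timeout applied to every element lookup.
# It only waits for an element to EXIST, so it happily returns the page-1
# listings that are still sitting in the DOM.
driver.implicitly_wait(20)
stale_results = driver.find_elements(By.XPATH, '//a[@itemprop="streetAddress"]')

# Explicit wait: polls for a specific condition -- here, an old element
# being detached from the DOM, which proves the old page was torn down.
WebDriverWait(driver, 20).until(EC.staleness_of(old_element))
fresh_results = driver.find_elements(By.XPATH, '//a[@itemprop="streetAddress"]')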

Answered By – Prophet

This answer was collected from Stack Overflow and is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
