Scraping multiple web pages at once with Selenium

Issue

I am using Selenium and Python for a big project. I have to go through 320,000 web pages (320K) one by one, scrape details, then sleep for a second and move on.

Like below:

links = ["https://www.thissite.com/page=1","https://www.thissite.com/page=2", "https://www.thissite.com/page=3"]

for link in links:
    browser.get(link )
    scrapedinfo = browser.find_elements_by_xpath("*//div/productprice").text
    open("file.csv","a+").write(scrapedinfo)
    time.sleep(1)

The biggest problem: it is too slow!

With this script it will take days or maybe weeks.

  • Is there a way to increase the speed, such as by visiting multiple
    links at the same time and scraping them all at once?

I have spent hours searching Google and Stack Overflow and only found information about multiprocessing.

But I am unable to apply it in my script.

Solution

Threading approach

  • You should start with threading.Thread; it will give you a considerable performance boost (explained here). Threads are also lighter than processes. You can use a futures.ThreadPoolExecutor with each thread using its own webdriver. Consider also adding the headless option for your webdriver. Example below using a Chrome webdriver:
from concurrent import futures
from selenium import webdriver

def selenium_work(url):
    chromeOptions = webdriver.ChromeOptions()
    chromeOptions.add_argument("--headless")
    driver = webdriver.Chrome(options=chromeOptions)
    driver.get(url)
    # <actual work that needs to be done by selenium>
    driver.quit()

# the default number of threads is based on the number of CPU cores,
# but you can set it explicitly with `max_workers`, e.g. futures.ThreadPoolExecutor(max_workers=...)
with futures.ThreadPoolExecutor() as executor:
    # store the url for each thread in a dict, so we know which thread fails
    future_results = {url: executor.submit(selenium_work, url) for url in links}
    for url, future in future_results.items():
        try:
            future.result()  # can pass `timeout` to wait a maximum number of seconds for each thread
        except Exception as exc:  # a thread may raise an exception
            print('url {0} generated an exception: {1}'.format(url, exc))

  • Consider also storing the chrome-driver instance initialized on each thread using threading.local(), so each thread creates its driver once and reuses it for every URL it handles. A reasonable performance improvement has been reported with this approach. A sketch is shown below.
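
A minimal sketch of that idea, assuming the same links list as above (the helper names are illustrative, not from the original answer):

import threading
from concurrent import futures
from selenium import webdriver

thread_local = threading.local()

def get_driver():
    # create one driver per thread and reuse it for every URL that thread handles
    if not hasattr(thread_local, "driver"):
        chromeOptions = webdriver.ChromeOptions()
        chromeOptions.add_argument("--headless")
        thread_local.driver = webdriver.Chrome(options=chromeOptions)
    return thread_local.driver

def selenium_work(url):
    driver = get_driver()
    driver.get(url)
    # <actual scraping work>

with futures.ThreadPoolExecutor(max_workers=8) as executor:
    executor.map(selenium_work, links)
# note: the thread-local drivers should still be quit once all work is done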

  • Consider whether using BeautifulSoup directly on the page source from selenium gives you a further speed-up. It is a very fast and well-established package. Something like driver.get(url) ... soup = BeautifulSoup(driver.page_source, "lxml") ... result = soup.find('a'); a slightly fuller sketch follows.
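
For example, reusing the get_driver() helper from the sketch above (the CSS selector here is only a placeholder, not taken from the original question):

from bs4 import BeautifulSoup

def selenium_work(url):
    driver = get_driver()
    driver.get(url)
    # parse the rendered HTML with BeautifulSoup instead of selenium's own locators
    soup = BeautifulSoup(driver.page_source, "lxml")
    prices = [node.get_text(strip=True) for node in soup.select("div.productprice")]  # placeholder selector
    with open("file.csv", "a+") as f:
        f.write(",".join(prices) + "\n")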

Other approaches

  • Although I personally did not see much benefit from using concurrent.futures.ProcessPoolExecutor(), you could experiment with it. In fact, it was slower than threads in my experiments on Windows. On Windows you also have many limitations for the Python Process; see the sketch below for the extra boilerplate it needs.
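
A minimal sketch of that variant, assuming the same selenium_work and links as above; on Windows (spawn start method) the pool must be created under an if __name__ == "__main__": guard and the worker function must be defined at module level:

from concurrent import futures

if __name__ == "__main__":
    # each worker process starts its own Python interpreter and its own webdriver
    with futures.ProcessPoolExecutor(max_workers=4) as executor:
        executor.map(selenium_work, links)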

  • Consider whether your use case can be satisfied by arsenic, an asynchronous webdriver client built on asyncio. It sounds really promising, though it has many limitations.
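
Very roughly, the asyncio-based approach would look like the sketch below; the session methods are assumptions based on arsenic's documented usage, so check the project docs for the exact API:

import asyncio
from arsenic import get_session, browsers, services

async def fetch(url):
    # one Chrome session per task (assumed API, see the arsenic docs)
    async with get_session(services.Chromedriver(), browsers.Chrome()) as session:
        await session.get(url)
        return await session.get_page_source()

async def main(links):
    return await asyncio.gather(*(fetch(url) for url in links))

asyncio.run(main(links))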

  • Consider whether Requests-HTML solves your problem with JavaScript loading, since it claims full JavaScript support. In that case you could use it together with BeautifulSoup in a standard data-scraping workflow.
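
For instance, a small sketch of that idea (the selector is a placeholder; r.html.render() downloads Chromium on first use and executes the page's JavaScript):

from requests_html import HTMLSession

session = HTMLSession()
r = session.get(url)
r.html.render()  # execute the JavaScript on the page
prices = [el.text for el in r.html.find("div.productprice")]  # placeholder selector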

Answered By – eusoubrasileiro

This answer was collected from Stack Overflow and is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
