Python BeautifulSoup & Selenium not scraping full html

Issue

Beginner web-scraper here. My practice task is simple: collect and count a player’s Pokemon usage over their last 50 games, using this page as an example. To do this, I planned to use the image URL of each Pokemon, which contains the Pokemon’s name (in an <img> tag, wrapped in <span></span>). Inspecting from Chrome looks like this: <img alt="Played pokemon" srcset="/_next/image?url=%2FSprites%2Ft_Square_Snorlax.png&amp;w=96&amp;q=75 1x, /_next/image?url=%2FSprites%2Ft_Square_Snorlax.png&amp;w=256&amp;q=75 2x" ...
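Since the name is embedded in the sprite filename, a small helper can pull it out of a URL like the one above. This is a sketch that assumes the `t_Square_<Name>.png` naming convention seen in the snippet holds for every sprite:

```python
import re
from urllib.parse import urlparse, parse_qs

def pokemon_name(img_url):
    """Extract the Pokemon name from a sprite URL such as
    /_next/image?url=%2FSprites%2Ft_Square_Snorlax.png&w=96&q=75
    (assumes the t_Square_<Name>.png naming convention)."""
    # parse_qs percent-decodes the value, giving /Sprites/t_Square_Snorlax.png
    sprite_path = parse_qs(urlparse(img_url).query)['url'][0]
    match = re.search(r't_Square_(\w+)\.png', sprite_path)
    return match.group(1) if match else None

print(pokemon_name('/_next/image?url=%2FSprites%2Ft_Square_Snorlax.png&w=96&q=75'))  # Snorlax
```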

1) Using BeautifulSoup alone doesn’t get the HTML of the images that I need:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://uniteapi.dev/p/%E3%81%BB%E3%81%B0%E3%81%A1')
wp_player = bs(r.content, 'html.parser')
wp_player.select('span img')

2) Using Selenium picks up some of what BeautifulSoup missed:

from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = "https://uniteapi.dev/p/%E3%81%BB%E3%81%B0%E3%81%A1"
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(options=options)
driver.get(url)
page = driver.page_source
driver.quit()

soup = bs(page, 'html.parser')
soup.select('span img')

But it gives me src values that look like this: <img alt="Played pokemon" data-nimg="fixed" decoding="async" src="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"

What am I misunderstanding here? The website I’m interested in does not have a public API, despite its name. Any help is much appreciated.

Solution

This is a common issue when scraping websites before they have loaded completely. What you have to do is wait for the page to fully load the images you need. You have two options: an implicit wait, or an explicit wait for the image elements to finish loading.

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options

url = r"https://uniteapi.dev/p/%E3%81%BB%E3%81%B0%E3%81%A1"
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(options=options)  # Selenium 4 locates the driver itself; executable_path is removed
driver.get(url)
WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '[alt="Played pokemon"]'))) # EXPLICIT WAIT
driver.implicitly_wait(10) # IMPLICIT WAIT

pokemons = driver.find_elements(By.CSS_SELECTOR, '[alt="Played pokemon"]')  # find_elements_by_* was removed in Selenium 4
for element in pokemons:
    print(element.get_attribute("src"))

You have to choose one or the other, but it’s better to explicitly wait for the element(s) to be rendered before you try to access their values.

OUTPUT:
https://uniteapi.dev/_next/image?url=%2FSprites%2Ft_Square_Tsareena.png&w=256&q=75
https://uniteapi.dev/_next/image?url=%2FSprites%2Ft_Square_Tsareena.png&w=256&q=75
https://uniteapi.dev/_next/image?url=%2FSprites%2Ft_Square_Snorlax.png&w=256&q=75
https://uniteapi.dev/_next/image?url=%2FSprites%2Ft_Square_Snorlax.png&w=256&q=75
https://uniteapi.dev/_next/image?url=%2FSprites%2Ft_Square_Snorlax.png&w=256&q=75
https://uniteapi.dev/_next/image?url=%2FSprites%2Ft_Square_Snorlax.png&w=256&q=75
https://uniteapi.dev/_next/image?url=%2FSprites%2Ft_Square_Snorlax.png&w=256&q=75
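Each URL in the output corresponds to one game, so the original counting goal is a one-liner with `collections.Counter`. A sketch, assuming the src values follow the pattern shown above (the `srcs` list here is a shortened stand-in for the real scraped output):

```python
import re
from collections import Counter

# Stand-in for the list of src attributes collected by the Selenium loop above
srcs = [
    'https://uniteapi.dev/_next/image?url=%2FSprites%2Ft_Square_Tsareena.png&w=256&q=75',
    'https://uniteapi.dev/_next/image?url=%2FSprites%2Ft_Square_Snorlax.png&w=256&q=75',
    'https://uniteapi.dev/_next/image?url=%2FSprites%2Ft_Square_Snorlax.png&w=256&q=75',
]

# Pull the name out of each sprite path and tally per-Pokemon usage
usage = Counter(re.search(r't_Square_(\w+)\.png', src).group(1) for src in srcs)
print(usage)  # Counter({'Snorlax': 2, 'Tsareena': 1})
```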

Your first approach wasn’t working because a plain GET request returns the page’s HTML in its initial state, before the JavaScript that renders the image elements has run.

Answered By – SaC-SeBaS

This answer, collected from Stack Overflow, is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
