Selenium does not load <li> inside <ul> inside <div>

Issue

I am new to Selenium, Python, and programming in general, but I am trying to write a small web scraper. I have encountered a website with multiple links whose HTML does not show up when I parse the page source with

soup = bs4.BeautifulSoup(html, "lxml")

The HTML code is:

<div class="content">
    <div class="vertical_page_list is-detailed">
        <div infinite-nodes="true" up-data="{"next":1,"url":"/de/pressemitteilungen?container_contenxt=lg%2C1.0"}">[event]
            <ul class="has-no-bottom-margin list-unstyled infinite-nodes--list">
                <li class="vertical-page-list--item is-detailed infite-nodes--list-item" style="display: list-item;">
                <li class="...>
                ...
            </ul>
        </div>
    </div>
</div>

But soup only contains this part, missing the <li> elements:

<div class="content">
    <div class="vertical_page_list is-detailed">
        <div infinite-nodes="true" up-data="{"next":1,"url":"/de/pressemitteilungen?container_contenxt=lg%2C1.0"}">
            <ul class="has-no-bottom-margin list-unstyled infinite-nodes--list">
            </ul>
        </div>
    </div>
</div>

It has something to do with the [event] after the div, but I can't figure out what to do. My guess was that it is some lazy-loaded content, but using

driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

or directly moving to the element

actions = ActionChains(driver)
actions.move_to_element(driver.find_element_by_xpath("//div[@infinite-nodes]")).perform()

did not yield any results. This is the code I am using:

# Enable headless Firefox for Selenium
options = Options()
#options.headless = True
options.add_argument("--headless")
options.page_load_strategy = 'normal'
driver = webdriver.Firefox(options=options, executable_path=r'C:\bin\geckodriver.exe')
print ("Headless Firefox Initialized")

# Load html source code from webpage
# Note: this reassignment replaces the headless Firefox driver created above
driver = webdriver.PhantomJS(executable_path=r'C:\phantomjs\phantomjs-2.1.1-windows\bin\phantomjs.exe')
driver.get("https://www.volkswagen-newsroom.com/de/pressemitteilungen?container_context=lg%2C1.0")

SCROLL_PAUSE_TIME = 2

# Scroll down to bottom
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

# Wait to load page
time.sleep(SCROLL_PAUSE_TIME)
print("Scrolled down to bottom")

# Extract html code
driver.find_element_by_xpath("//div[@infinite-nodes]").click()  # just testing
time.sleep(SCROLL_PAUSE_TIME)
html = driver.page_source.encode('utf-8')
soup = bs4.BeautifulSoup(html, "lxml")

Could anyone help me please?

Solution

When you visit the page in a browser and log your network traffic, you can see that every time the page loads (or you press the "Mehr Pressemitteilungen anzeigen" button), an XHR (XMLHttpRequest) request is made to what appears to be an API. The response is JSON, which in turn contains HTML, and it is this HTML that contains the list-item elements you're looking for. You don't need Selenium for this:

def get_article_titles():
    import requests
    from bs4 import BeautifulSoup as Soup

    url = "https://www.volkswagen-newsroom.com/de/pressemitteilungen"

    params = {
        "container_context": "lg,1.0",
        "next": "1"
    }

    headers = {
        "accept": "application/json",
        "accept-encoding": "gzip, deflate",
        "user-agent": "Mozilla/5.0",
        "x-requested-with": "XMLHttpRequest"
    }

    while True:

        response = requests.get(url, params=params, headers=headers)
        response.raise_for_status()

        data = response.json()
        
        params["next"] = data["next"]
        soup = Soup(data["html"], "html.parser")

        for tag in soup.select("h3.page-preview--title > a"):
            yield tag.get_text().strip()


def main():
    from itertools import islice

    for num, title in enumerate(islice(get_article_titles(), 10), start=1):
        print("{}.) {}".format(num, title))
    
    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())

Output:

1.) Volkswagen Konzern, BASF, Daimler AG und Fairphone starten Partnerschaft für nachhaltigen Lithiumabbau in Chile
2.) Verkehrsausschuss-Vorsitzender Cem Özdemir informiert sich über Transformation im Elektro-Werk in Zwickau
3.) Astypalea: Start der Transformation zur smarten, nachhaltigen Insel
4.) Vor 60 Jahren: Fußball-Legende Pelé zu Besuch im Volkswagen Werk Wolfsburg
5.) Novum unter den Kompakten: Neuer Polo ist mit „IQ.DRIVE Travel Assist“ teilautomatisiert unterwegs
6.) Der neue Tiguan Allspace – ab sofort bestellbar
7.) Volkswagen startet Vertriebsoffensive im deutschen Markt
8.) Vor 70 Jahren: Volkswagen erhält ersten Beirat
9.) „Experience our Volkswagen Way to Zero“ – neue Ausstellung im DRIVE. Volkswagen Group Forum für Gäste geöffnet
10.) Jetzt bestellbar: Der neue ID.4 GTX
>>> 
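
The generator above loops with while True and relies on the caller (islice) to stop it. What follows is a minimal sketch of a bounded variant, assuming the endpoint eventually returns a missing or null "next" value; that behavior is an assumption and has not been checked against the site. The names iter_html_fragments, fetch_page, and max_pages are hypothetical, and FAKE_PAGES is made-up data standing in for the real responses.

```python
from typing import Callable, Dict, Iterator, Optional


def iter_html_fragments(
    fetch_page: Callable[[str], Dict],
    first_page: str = "1",
    max_pages: int = 100,
) -> Iterator[str]:
    """Yield the "html" field of each page until the cursor runs out."""
    page: Optional[str] = first_page
    seen = set()
    for _ in range(max_pages):
        # Stop on a missing cursor or one that was already requested.
        # (Assumption: the last page signals its end this way.)
        if page is None or page in seen:
            return
        seen.add(page)
        data = fetch_page(page)
        yield data.get("html", "")
        next_page = data.get("next")
        page = None if next_page is None else str(next_page)


# Stand-in for the real HTTP call, using made-up payloads.
FAKE_PAGES = {
    "1": {"next": 2, "html": "<li>first</li>"},
    "2": {"next": None, "html": "<li>second</li>"},
}

fragments = list(iter_html_fragments(FAKE_PAGES.__getitem__))
print(fragments)
```

In the real script, fetch_page would wrap the requests.get(...).json() call and set params["next"] to the page value it receives. The max_pages cap is only a safety net in case the cursor never ends.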

Answered By – Paul M.

This answer, collected from Stack Overflow, is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
