How to scrape sites that use a template engine?

Issue

I am trying to scrape a site with Scrapy and Selenium.
At first, the result I got was the unrendered template text: [ {{ certificant.FirstName }} {{ certificant.LastName }} ]

I thought the page might still be loading, so I added a WebDriverWait for a button to appear before extracting data, but I still get the same result.

I believe the result comes from a template engine that renders the page dynamically. If so, what should I do to make the scrape work with it?

This is what I have at the moment:

import scrapy

from scrapy import Request

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By



class PjFx110Spider(scrapy.Spider):
    name = "pj_fx110"

    ROOT_URL = 'https://aplanner.ca'

    start_urls = [
        ROOT_URL
    ]

    def __init__(self):
        options = Options()
#         options.add_argument("--headless")
        self.driver = webdriver.Chrome('./chromedriver', options=options)

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse)


    def parse(self, response):
        self.driver.get(response.url)
        WebDriverWait(self.driver, 3600).until(EC.presence_of_element_located((By.ID, 'btnShowResults')))

        lists = response.css('.list-group')
        name = lists.xpath('//*[@id="FPlist"]/div/ul[1]/li/span[1]/text()').extract()
        print(name, '---------lists----------')

Thank you so much for any suggestions and advice.

Solution
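
A note on why the wait did not help: in the spider above, `response` is the HTML that Scrapy downloaded itself, before any JavaScript ran, so the `{{ ... }}` placeholders are still in it no matter how long the browser waits. The markup rendered by Selenium would be in `self.driver.page_source` instead. The sketch below only illustrates how to tell the two apart; the helper name `has_unrendered_placeholders` and both HTML snippets are made up for illustration and are not taken from the site.

```python
import re

# Matches simple template placeholders such as {{ certificant.FirstName }}.
PLACEHOLDER = re.compile(r"\{\{\s*[\w.]+\s*\}\}")


def has_unrendered_placeholders(html: str) -> bool:
    """Return True if the HTML still contains {{ ... }} template markers."""
    return PLACEHOLDER.search(html) is not None


# Hypothetical snippets: the first mimics what Scrapy's raw response holds,
# the second what the browser might hold after the template has been rendered.
raw_html = "<span>{{ certificant.FirstName }} {{ certificant.LastName }}</span>"
rendered_html = "<span>Jane Doe</span>"

print(has_unrendered_placeholders(raw_html))
print(has_unrendered_placeholders(rendered_html))
```

If a check like this returns True for the HTML you are parsing, you are most likely looking at the unrendered source rather than the page the browser displays.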

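I will assume you want to obtain the full list of planners (you did not confirm this). Since you asked for an alternative, here is one, although it is probably quite far from what you initially planned: skip the browser and request the data from the web service that the page appears to load it from.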

import io

import pandas as pd
import requests

# Headers mirroring the browser's own request to the endpoint.
headers = {
    'authority': 'aplanner.ca',
    'path': '/WebServices/AptifyToolsServices.asmx/GetAll',
    'scheme': 'https',
    'accept': 'application/json, text/plain, */*',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'content-length': '0',
    'content-type': 'application/json;charset=utf-8',
    'cookie': 'ASP.NET_SessionId=345345345345',
    'origin': 'https://aplanner',
    'referer': 'https://aplanner/findaplanner',
    'sec-ch-ua': '"Chromium";v="103", ".Not/A)Brand";v="99"',
    'sec-fetch-dest': 'empty',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-origin',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.53 Safari/537.36',
    # This was fused into the user-agent string; it is a separate header.
    'x-requested-with': 'XMLHttpRequest',
}
r = requests.post('https://aplanner.ca/WebServices/AptifyToolsServices.asmx/GetAll', headers=headers)
r.raise_for_status()  # stop early on an HTTP error
# The 'd' field holds the data as a JSON string; wrap it in StringIO because
# newer pandas versions deprecate passing a literal JSON string to read_json.
df = pd.read_json(io.StringIO(r.json()['d']))
df.to_csv('aplanner.csv')
print(df.head())

This writes a CSV file and prints the head of the dataframe, which shows the format of the CSV file. It takes a minute or so to run.
If you run the script in a Jupyter notebook, you may need to start the notebook with a higher output rate limit:

jupyter notebook --NotebookApp.iopub_data_rate_limit=1.0e10
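
The `r.json()['d']` step in the script deserves a word: ASP.NET `.asmx` services typically wrap their result in an object with a single `d` key, and here the value of `d` is itself a JSON-encoded string, so it has to be decoded a second time. A minimal sketch with the standard library only, assuming the records carry the `FirstName` and `LastName` fields seen in the template from the question; the sample body and the helper name `decode_envelope` are hypothetical:

```python
import json

# Hypothetical sample shaped like the service response: the outer object has
# a single key "d" whose value is a JSON-encoded string, not a list.
sample_body = json.dumps({
    "d": json.dumps([
        {"FirstName": "Jane", "LastName": "Doe"},
        {"FirstName": "John", "LastName": "Smith"},
    ])
})


def decode_envelope(body: str) -> list:
    """Decode the outer object, then decode the JSON string stored under 'd'."""
    outer = json.loads(body)
    return json.loads(outer["d"])


records = decode_envelope(sample_body)
# Field names are assumed from the {{ certificant.* }} template in the question.
full_names = [f"{rec['FirstName']} {rec['LastName']}" for rec in records]
print(full_names)
```

With the real response, `decode_envelope(r.text)` would give a list of dictionaries that can be filtered or written out without pandas, provided the payload has the shape assumed here.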

Answered By – platipus_on_fire_333

This answer was collected from Stack Overflow and is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
