Selenium check if a hostname with HTTPS exists

Issue

I have the Alexa Top 1M list and want to crawl the sites on it. Before crawling a website I want to check whether it supports https:// and whether I have to visit it with or without the www. part.

For example, https://cambridge.org is not reachable and I get a time-out error; in this case I have to visit https://www.cambridge.org to start crawling correctly.

Are there any strategies?

Solution

I wrote the following code block, which may help you.

import urllib.request
from urllib.request import Request


def findHost(url):
    https = 'https://'
    http = 'http://'
    www = 'www.'

    if hostExists(https + url):
        return https
    elif hostExists(https + www + url):
        return https + www
    elif hostExists(http + www + url):
        return http + www
    else:
        # Fall back to plain http:// without probing it again.
        return http


def hostExists(url):
    # Browser-like headers; some bot-detection rules (e.g. mod_security)
    # block requests that identify themselves as a library.
    custom_header = {
        'Connection': 'close',
        'sec-ch-ua': '"Chromium";v="89", ";Not A Brand";v="99"',
        'sec-ch-ua-mobile': '?0',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Sec-Fetch-Site': 'none',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-User': '?1',
        'Sec-Fetch-Dest': 'document',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7'
    }
    req = Request(url, headers=custom_header)
    try:
        urllib.request.urlopen(req, timeout=10).getcode()
        return True
    except Exception as e:
        print(e)
        # An HTTP error response (4xx/5xx) still means the URL exists.
        if 'HTTP Error' in str(e):
            return True
        else:
            return False


url = 'google.com'
scheme = findHost(url)
print(scheme + url)

It checks in the following sequence:

  • https:// + url
  • https://www. + url
  • http://www. + url
  • http:// + url (the fallback, returned without a further check)

It waits up to 10 seconds for a response; if it gets none, it tries the next option.
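If you only need to know whether a host answers on HTTPS at all, a lighter alternative to a full HTTP request is attempting a bare TLS handshake on port 443. This is a minimal sketch, not the method from the answer above; the helper name https_supported is hypothetical:

```python
import socket
import ssl


def https_supported(host, timeout=10):
    # Hypothetical helper: open a TCP connection to port 443 and attempt
    # a TLS handshake. Success means the host speaks HTTPS; DNS failures,
    # refused connections, timeouts, or TLS errors all count as "no".
    ctx = ssl.create_default_context()
    try:
        with socket.create_connection((host, 443), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host):
                return True
    except (OSError, ssl.SSLError):
        return False
```

This avoids downloading any page body, which matters when probing a list as large as the Alexa Top 1M, but it cannot tell you whether the www. variant is the one that actually serves content.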

Edit: You also need to add some header parameters; bot-detection methods (such as mod_security rules) will notice if you make the request via a library. I updated the code accordingly, and it now works well.
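To illustrate why the headers matter: without them, urllib announces itself in the User-Agent header, which such filters can match on. A quick way to inspect the default (the exact version string depends on your Python installation):

```python
import urllib.request

# urllib's default headers identify the library itself, e.g.
# [('User-agent', 'Python-urllib/3.11')] -- easy for bot filters to spot.
opener = urllib.request.build_opener()
print(opener.addheaders)
```

Passing a browser-like User-Agent in the Request, as the code above does, overrides this default for that request.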

Answered By – Nurullah

This answer was collected from Stack Overflow and is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
