Issue
I have the Alexa Top 1M list and want to crawl the sites in it. Before crawling a website, I want to check whether it supports https:// and whether I have to visit it with or without the www. part.
For example, https://cambridge.org is not available and I get a time-out error. In this case I have to visit https://www.cambridge.org to start crawling correctly.
Are there any strategies for this?
Solution
I wrote the following code block, which may help you.
import urllib.error
from urllib.request import Request, urlopen


def findHost(url):
    """Return the prefix (scheme plus optional 'www.') under which url responds."""
    https = 'https://'
    http = 'http://'
    www = 'www.'
    if hostExists(https + url):
        return https
    elif hostExists(https + www + url):
        return https + www
    elif hostExists(http + www + url):
        return http + www
    # elif hostExists(http + url):
    #     return http
    else:
        # Nothing responded: fall back to plain http:// without probing it.
        return http


def hostExists(url):
    """Return True if the server at url answers at all, even with an HTTP error."""
    # Browser-like headers, so simple bot detection does not reject the request.
    custom_header = {
        'Connection': 'close',
        'sec-ch-ua': '"Chromium";v="89", ";Not A Brand";v="99"',
        'sec-ch-ua-mobile': '?0',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Sec-Fetch-Site': 'none',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-User': '?1',
        'Sec-Fetch-Dest': 'document',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7'
    }
    req = Request(url, headers=custom_header)
    try:
        with urlopen(req, timeout=10):
            return True
    except urllib.error.HTTPError as e:
        # An HTTP error status (403, 404, ...) still means the host answered.
        print(e)
        return True
    except Exception as e:
        # Timeouts, DNS failures, TLS errors, refused connections, etc.
        print(e)
        return False


url = 'google.com'
scheme = findHost(url)
print(scheme + url)
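It checks the following variants in sequence:
- https:// + url
- https://www. + url
- http://www. + url
If none of them responds, it falls back to http:// + url without checking it (that check is commented out in the code).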
Each check waits up to 10 seconds for a response; if there is none, the next variant is tried.
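The order of variants can also be written as data, which makes it easy to change which combinations are tried. The following is only a minimal sketch: `PREFIXES`, `first_reachable`, and the `probe` parameter are hypothetical names, and it assumes you want to probe plain http:// as well instead of using it as an unchecked fallback.

```python
from typing import Callable, Optional

# Assumed preference: HTTPS before HTTP, bare domain before "www.".
PREFIXES = ("https://", "https://www.", "http://", "http://www.")


def first_reachable(domain: str, probe: Callable[[str], bool]) -> Optional[str]:
    """Return the first prefix + domain that probe() accepts, or None."""
    for prefix in PREFIXES:
        candidate = prefix + domain
        if probe(candidate):
            # Stop at the first variant that answers; later ones are not tried.
            return candidate
    return None
```

With the answer's checker this could be called as first_reachable('cambridge.org', hostExists). Passing the checker as an argument also lets you test the ordering with a fake function instead of real network requests.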
Edit: You also need to send some browser-like header parameters, because bot detection methods (like those in mod_security) can detect requests made via a library. I updated the code accordingly, and it now works well.
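Since the check only needs to know whether the host answers, one possible variation is to build the request once with a smaller header set and ask for headers only. This is a sketch under assumptions, not part of the original answer: `BROWSER_HEADERS` and `build_probe_request` are hypothetical names, the reduced header set may not satisfy every bot filter, and some servers reject HEAD requests (which urllib reports as an HTTP error).

```python
from urllib.request import Request

# Hypothetical reduced header set; the values are examples, not requirements.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.8",
}


def build_probe_request(url: str, method: str = "HEAD") -> Request:
    """Build (but do not send) a request that carries browser-like headers."""
    return Request(url, headers=BROWSER_HEADERS, method=method)
```

The returned object can be passed to urlopen(req, timeout=10) exactly like the request in hostExists. Use method="GET" if a site does not handle HEAD.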
Answered By – Nurullah
This answer was collected from Stack Overflow and is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.