Walmart scraper gets blocked

Issue

I’m trying to scrape a Walmart category from pages 1-100. I’ve implemented random headers and random wait times before requesting pages, but I still get hit with a captcha after scraping the first few pages. Is Walmart super good at detecting scrapers, or am I doing something wrong?

I’m using selenium, bs4, and random_user_agent.

code:

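import time
from random import randint

from random_user_agent.params import OperatingSystem, SoftwareName
from random_user_agent.user_agent import UserAgent
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

# Randomize User Agents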
software_names = [SoftwareName.CHROME.value]
operating_systems = [OperatingSystem.WINDOWS.value]

user_agent_rotator = UserAgent(
    software_names=software_names, operating_systems=operating_systems, limit=1000)

user_agents = user_agent_rotator.get_user_agents()

################################################

# Selenium
options = webdriver.ChromeOptions()
options.add_argument('--profile-directory=Profile 1')
options.add_argument('use-fake-ui-for-media-stream')
options.add_argument(
    'load-extension=' + r'ad blocker path here')
options.add_argument("window-size=900,1080")

driver = webdriver.Chrome(
    ChromeDriverManager().install(), options=options)

driver.execute_cdp_cmd('Network.setUserAgentOverride', {
    "userAgent": user_agent_rotator.get_random_user_agent()})
driver.get(url)

################################################

# Randomize time between requests
time.sleep(randint(5, 15))

This is what I’ve tried to do so I don’t get blocked. Are there better methods? Thanks.

Solution

Your IP address is still the same for every request, so randomizing headers and wait times does not change where the requests appear to come from.
You could look into using Python requests with Tor, which is slower because each request gets routed over the Tor network. I am not familiar with proxying Selenium over Tor, but I bet there are plenty of tutorials you can find.
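
As a rough illustration of the Tor idea, here is a minimal sketch. It assumes a Tor client is already running locally on its default SOCKS port (127.0.0.1:9050) and that requests was installed with SOCKS support (`pip install requests[socks]`). The helper names (`tor_proxy_url`, `requests_proxies`, `fetch_via_tor`, `chrome_proxy_argument`) are made up for this example and are not part of any library.

```python
# Assumption: a local Tor client is listening on its default SOCKS port.
TOR_SOCKS_HOST = "127.0.0.1"
TOR_SOCKS_PORT = 9050


def tor_proxy_url(host=TOR_SOCKS_HOST, port=TOR_SOCKS_PORT, remote_dns=True):
    """Build a SOCKS5 proxy URL pointing at the local Tor client.

    With requests, the socks5h scheme asks the proxy to resolve hostnames,
    so DNS lookups also go through Tor instead of the local resolver.
    """
    scheme = "socks5h" if remote_dns else "socks5"
    return f"{scheme}://{host}:{port}"


def requests_proxies():
    """Proxy mapping in the shape accepted by the `proxies` argument of requests."""
    url = tor_proxy_url()
    return {"http": url, "https": url}


def fetch_via_tor(url, timeout=30):
    """Fetch a page through Tor and return its body as text."""
    import requests  # imported lazily so the helpers above work without it

    response = requests.get(url, proxies=requests_proxies(), timeout=timeout)
    response.raise_for_status()
    return response.text


def chrome_proxy_argument():
    """Chrome command-line flag that sends browser traffic through the same proxy."""
    # Chrome's --proxy-server flag takes the plain socks5 scheme.
    return f"--proxy-server={tor_proxy_url(remote_dns=False)}"
```

For Selenium, the same proxy could probably be applied to the setup in the question with `options.add_argument(chrome_proxy_argument())` before creating the driver. Keep in mind that Tor exit addresses are public, so a site may still challenge or block them.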

Walmart probably has this captcha mechanism in place for a reason though, so maybe look for another option of getting the data.

Answered By – lightstack

This answer was collected from Stack Overflow and is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
