Dynamic web scraping using Selenium: scrape protected websites

Siddhesh Shivdikar · Published in Geek Culture · 3 min read · Jun 22, 2022

Designed by upklyak / Freepik (www.freepik.com)

But we already covered everything last time! 😤

In my last article we used Beautiful Soup to scrape information from webpages, and we covered almost every important aspect and methodology: the different page types, crawling, bot protection, automation, even scraping with NLP. But Beautiful Soup can only handle static webpages, not dynamic ones. *Mic Drop* 🎤

If you are new to scraping or need to get through the advanced concepts, here is the article to check out:

https://medium.com/geekculture/getting-started-with-web-scraping-for-data-analysis-fe55cc0b8d61

What’s the difference between static and dynamic webpages then? 🥲

Awesome question you’ve asked! Keeping it simple: a static webpage serves content that is embedded directly in its HTML, whereas a dynamic webpage generates the data with JavaScript and injects it into DOM elements after the page loads.
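To see the difference in action, here’s a minimal sketch (the URL and the product class are hypothetical stand-ins for any JavaScript-rendered page): requests happily returns the initial HTML, but the interesting bits just aren’t in it yet.

import requests
from bs4 import BeautifulSoup

# hypothetical JavaScript-rendered page, used only for illustration
html = requests.get("https://example.com/dynamic-page").text
soup = BeautifulSoup(html, "lxml")

# on a dynamic page this often prints [], because the products are
# injected into the DOM by JavaScript after the initial HTML arrives
print(soup.find_all("div", class_="product"))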

Okay Scenario time 🫣

Consider you are scraping a website whose elements use pagination, but, you guessed it right, when we click the next link the URL doesn’t change. Which means we can’t use crawling! 😣 Which means we try every possible thing with Beautiful Soup but it doesn’t work! 😖 Which means you found no HTML fragments that load the data on request to scrape! 😫 Which means no more Nutella 😩 … wait, Nutella is love

Why Selenium though 🤔

Well, glad you asked 🤗 Not just Selenium, but Selenium + Beautiful Soup. *old love still got some kick in it* In simple terms, Selenium mimics a user interacting with a browser. In broader terms, Selenium is a framework that runs your scripts and controls your web browser by sending and receiving method calls and data to and from the WebDriver, which bridges your browser with Selenium. So yes, I had to install Chrome; if you don’t have Chrome or Firefox, just install one of them at this point. Fun fact: various automation apps that you know use Selenium under the hood.
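To make “mimics a user” concrete, here’s a minimal sketch (assuming chromedriver is already on your PATH; the page is just example.com):

# a minimal sketch of Selenium acting like a user
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()           # assumes chromedriver is on your PATH
driver.get("https://example.com")     # navigate like a user typing a URL
print(driver.title)                   # read what the browser actually rendered
driver.find_element(By.TAG_NAME, "a").click()  # click a link like a user would
driver.quit()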

Quiz time before we start 😌

If you’ve visited planet Earth you probably know Australia, where there’s a chemist who has a warehouse. Congratulations, you know the website we are scraping! 🥳

Installing the prerequisites 🛠

# Run the commands below once 
!pip install beautifulsoup4
!pip install requests
!pip install selenium
!pip install webdriver-manager
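If you want to double-check that everything installed cleanly, a quick one-off sanity check:

# optional sanity check that the installs worked
import bs4, requests, selenium, webdriver_manager
print(selenium.__version__)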

Getting our Imports 😌

import requests
from bs4 import BeautifulSoup
import pandas
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

#dynamic scraping using selenium and beautiful soup
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless')

# either point Selenium at a chromedriver you downloaded yourself...
# driver = webdriver.Chrome("*insert_your_directory*/chromedriver", options=options)
# ...or let webdriver-manager fetch the right driver for your Chrome version
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)

Make sure you have the correct path to the webdriver. I prefer downloading the webdriver manually from the web and keeping it in the present working directory.

Conjuring 3: Selenium made me do it 👻

We are going to make the browser click the next button through Selenium and fetch the page source from Selenium instead of requests like we did last time. This works because we are driving a real browser: if there are 63 pages, then for each page we scrape, click the next button, wait while the JavaScript loads the new content, and scrape again. Repeat until done. Selenium will take over your browser, and now it’s possessed.

driver.get("The_link_that_you_found_via_quiz")
final_list = []
for i in range(63):  # one iteration per page
    src = driver.page_source
    soup = BeautifulSoup(src, 'lxml')
    products = soup.find_all('div', class_='product')
    l = []
    for item in products:
        d = {}
        d["Name"] = item.find('div', class_='product__title').text.replace("[", "").replace("]", "")
        d["Price"] = item.find('span', class_='product__price-current').text.replace("[", "").replace("]", "")
        discount = item.find('em', class_='product__price-discount')
        d["Discount"] = discount.text.strip() if discount else None  # .find() returns a Tag, not text
        l.append(d)
    final_list.extend(l)

    next_button = driver.find_elements(by=By.CLASS_NAME, value="pager__button--next")
    try:
        if next_button[0].is_displayed():
            driver.execute_script("arguments[0].click();", next_button[0])
            time.sleep(1)  # give the JavaScript a moment to load the next page
    except IndexError:
        break  # no next button left, we're on the last page

df = pandas.DataFrame(final_list)
df.to_csv("xoxo.csv")
df.to_excel("xoxo.xlsx")
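One thing worth knowing: time.sleep(1) just hopes the next page has loaded within a second. A more robust option (a sketch, not part of the original script) is Selenium’s explicit waits, which block until the content itself shows up:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for at least one product card to be present,
# instead of hoping a fixed sleep was long enough
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "product"))
)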

See ya 👋

To conclude, there is not much left to explain, but this is just a start before diving deeper into Selenium, since Selenium is not limited to web scraping. Remember every time the GPUs got sold out in a second? Even the NFTs? Selenium did you dirty, my friend. But every day is a chance to learn new stuff. Happy coding 🫡


Hello! I'm Siddhesh, a software engineer based in Mumbai, India. I'm passionate about the fields of AI and Data Science.