Web scraping
Extracting data from the web with Python's Selenium
Collecting data from the Indeed job platform
You can see the scraper in action below: the program automatically opens each page, scrapes the data and stores it in a database, as shown in DB Browser on the right side of the screen. The record count at the bottom increases as the scraper moves from page to page and the data gets updated.
If the video doesn’t start, you may need to click on it.
I have broken the process into three standalone steps, which allows for easier code monitoring, debugging and maintenance; the whole program consists of just under 300 lines of code. The IDE of choice is PyCharm by JetBrains, the database of choice is the SQLite3 database that ships with Python, and the library used for scraping is Selenium 3.14. You may also notice in the video below that I am not using a standard Chrome session; instead I attach to a separate Chrome instance started through the Chrome debugger console. I find this works perfectly with Selenium, because each section of the code can be tested without rerunning the whole process and burdening the site’s server.
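For reference, the attach step itself is only a few lines. The sketch below assumes Chrome has already been started from a separate terminal with the remote debugging port open (the exact commands I use for this appear again as comments in Step 3; the profile folder path here is just an example):

# In a separate terminal, start the dedicated Chrome instance first, e.g.:
#   chrome.exe --remote-debugging-port=9222 --user-data-dir="C:\my_debug_profile"
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

opt = Options()
opt.add_experimental_option("debuggerAddress", "localhost:9222")  # attach to the running Chrome instead of launching a new one
browser = webdriver.Chrome(chrome_options=opt)  # chrome_options is the Selenium 3.x keyword
print(browser.current_url)  # quick check that the session is attached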
Step 1
Create a SQLite3 Database & table with required attributes.
Step 2
Uses the earlier saved links and opens each individual job page.
Step 3
Scrapes each individual page and its data before storing them in a database.
Step 1:
Set up two databases: one for storing the individual job page links, and the other to store the individual job data scraped using those links.
Here we import the sqlite3 library and create the database class called DB_Indeed, where we make a connection with the database and create a table with the required fields and their data types.
As you can see in the code below, there are 11 fields, with Job_ID_Indeed as the primary key with autoincrement. Autoincrement takes care of the indexing and, if needed, the field can be used as the key when joining other tables; in this case we are only using one table. I am looking forward to building relational tables, as the whole concept of relational databases sounds very intriguing. The second section of the snippet then creates the table with the fields defined before it.
import sqlite3
# Test DB
class DB_Indeed:
    def __init__(self, db_name, db_table_name):
        # Connect with Database
        self.databaseName = db_name
        self.databaseName = f'{self.databaseName}.db'
        self.tableName = db_table_name
        self.conn = sqlite3.connect(self.databaseName)
        self.c = self.conn.cursor()
        self.connect_DB()
        self.create_Table()

    def connect_DB(self):
        """ This creates a database if one has not been created, or
        connects to an already existing one if one exists."""
        self.conn = sqlite3.connect(self.databaseName, detect_types=sqlite3.PARSE_DECLTYPES | sqlite3.PARSE_COLNAMES)
        self.c = self.conn.cursor()
        self.conn.commit()
        print('database created')

    def create_Table(self):
        """ This function creates a new table
        if one has not been created"""
        self.field1 = 'Job_ID_Indeed'
        self.field1Type = 'INTEGER'
        self.field2 = 'Job_Name'
        self.field2Type = 'TEXT'
        self.field3 = 'Company_Name'
        self.field3Type = 'TEXT'
        self.field4 = 'Salary'
        self.field4Type = 'TEXT'
        self.field5 = 'Other_Details'
        self.field5Type = 'TEXT'
        self.field6 = 'Job_Description'
        self.field6Type = 'TEXT'
        self.field7 = 'Job_Posted_Date'
        self.field7Type = 'TEXT'
        self.field8 = 'Job_Scraped_Date'
        self.field8Type = 'TIMESTAMP'
        self.field9 = 'SearchKey'
        self.field9Type = 'TEXT'
        self.field10 = 'JobPlatform'
        self.field10Type = 'TEXT'
        self.field11 = 'Page_Url'
        self.field11Type = 'TEXT'
        self.c.execute(f'CREATE TABLE IF NOT EXISTS {self.tableName}'
                       f'({self.field1} {self.field1Type} PRIMARY KEY autoincrement,'
                       f'{self.field2} {self.field2Type},'
                       f'{self.field3} {self.field3Type},'
                       f'{self.field4} {self.field4Type},'
                       f'{self.field5} {self.field5Type},'
                       f'{self.field6} {self.field6Type},'
                       f'{self.field7} {self.field7Type},'
                       f'{self.field8} {self.field8Type},'
                       f'{self.field9} {self.field9Type},'
                       f'{self.field10} {self.field10Type},'
                       f'{self.field11} {self.field11Type})')
        self.conn.commit()
        print('Table Created')

    def insert_Data(self, job_name, company_name, job_salary,
                    job_other_details, job_description,
                    posted_date, scraped_date, searchkey,
                    job_platform, page_url):
        self.c.execute(f'''INSERT INTO {self.tableName}
                           ({self.field2},
                           {self.field3},
                           {self.field4},
                           {self.field5},
                           {self.field6},
                           {self.field7},
                           {self.field8},
                           {self.field9},
                           {self.field10},
                           {self.field11})
                           VALUES(?,?,?,?,?,?,?,?,?,?)''',
                       (job_name, company_name, job_salary,
                        job_other_details, job_description,
                        posted_date, scraped_date, searchkey,
                        job_platform, page_url))
        print('Data Updated')
        self.conn.commit()
# dbtest1 = DB_Indeed('db_Indeed_testing1','db_table_Indeed_testing1')
# dbtest1.insert_Data('Python Dev')
#
# dbtest1.connect_DB()
# dbtest1.create_Table()
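As a quick sanity check, the class can be exercised on its own. The snippet below is a minimal sketch, assuming the class above is saved as Indeed_database_engine.py (the module name used by the scraper in Step 3); the database name, table name and inserted values are all throwaway test data:

from datetime import datetime
from Indeed_database_engine import DB_Indeed

# Dummy test data, not scraped from Indeed
test_db = DB_Indeed('db_Indeed_sanity_check', 'db_table_Indeed_sanity_check')
test_db.insert_Data('Python Developer', 'Example Company', 'RM 5,000 a month',
                    'Full-time', 'Builds data pipelines.', 'Posted 3 days ago',
                    datetime.now(), 'data', 'Indeed',
                    'https://example.com/job/123')
# Confirm what SQLite actually created and stored
test_db.c.execute("SELECT sql FROM sqlite_master WHERE type='table'")
print(test_db.c.fetchall())  # the CREATE TABLE statement for the table
test_db.c.execute(f'SELECT COUNT(*) FROM {test_db.tableName}')
print(test_db.c.fetchall())  # row count; Job_ID_Indeed is filled in automatically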
# Selenium
# Open Dedicated Browser:
# cd c:\\Program Files\\Google\\Chrome\\Application
# chrome.exe --remote-debugging-port=9222 --user-data-dir="C:\\Users\\user\\PycharmProjects\\RedBubble1\\currentBrowser"
# this program extracts the links from the link database for Indeed job platform
# Step 1: From Database 'indeed_data_28feb_1_database_links.db'
import sqlite3
from Indeed_selenium_page_scraper_1_from_linkdb import selenium_Indeed_link_looper
import os
"""
# this module has the function that extracts all the links from the link_db which is then used by the link_looper function
# that contains the individual page scraper, which takes in a link as an input
# 0--> (Output to)
# link_scraper() 0--> link_db
# link_db 0--> db_link extractor() 0--> link_looper() 0--> feeds links to individual_page_scraper()
# ind_page_scraper() 0--> data_db
"""
# Get list of files in the Data/Links_Indeed directory (all databases with Indeed job links for this function to run on)
db_in_Links_directory = os.listdir('Data/Links_Indeed')
db_in_Scraped_Data_directory = os.listdir('Data/Scraped_Data_Indeed')
print(db_in_Links_directory)
print(db_in_Scraped_Data_directory)
database_name = 'indeed_data_28feb_1_database_links.db'
# database_name = db_in_Links_directory[0]
table_name = 'indeed_data_28feb_1_table_links'
print(f'Extracting links from {database_name}')
# Get list of tables in DB
con = sqlite3.connect(f'Data/Links_Indeed/{database_name}')
cur = con.cursor()
cur.execute("SELECT name FROM sqlite_master WHERE type='table';")
print(cur.fetchall())
def indeed_link_extractor(database_name, table_name):
    con = sqlite3.connect(f'Data/Links_Indeed/{database_name}')
    cur = con.cursor()
    cur.execute(f"""SELECT * FROM {table_name};""")
    links_from_db_table = cur.fetchall()
    print(len(links_from_db_table))
    links_to_scrape_extracted_from_db = []
    i = 7000
    while i < 8000:  # 15837
        # while i < len(links_from_db_table):  # should consider scraping in batches of 1000 maybe
        link = links_from_db_table[i]
        print(link[3])
        links_to_scrape_extracted_from_db.append(link[3])
        i += 1
    print(links_to_scrape_extracted_from_db)
    print(f'Total no of links: {len(links_to_scrape_extracted_from_db)} / Total links in table: {len(links_from_db_table)}')
    selenium_Indeed_link_looper(links_to_scrape_extracted_from_db)  # something very strange being caused because we are trying to import a function from a module (scraping page function)

indeed_link_extractor(database_name, table_name)
Step 2:
Extracting the links from the database and opening each individual page.
This section of the code extracts the URL links of each page stored in a database; these are then used by the scraper function in Step 3 to extract the data points from each individual job page.
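The link database itself was built in an earlier step that is not shown here. The extractor above assumes the job-page URL sits at index 3 of each row, so a quick way to confirm the column layout of the link table before running the loop is a sketch like this:

import sqlite3

con = sqlite3.connect('Data/Links_Indeed/indeed_data_28feb_1_database_links.db')
cur = con.cursor()
# PRAGMA table_info returns one row per column: (cid, name, type, notnull, dflt_value, pk),
# which shows exactly which column index holds the job-page URL before relying on link[3]
cur.execute("PRAGMA table_info(indeed_data_28feb_1_table_links)")
for column in cur.fetchall():
    print(column)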
Step 3:
Scraping a set of data points from each individual page and storing the data in the database.
This section of the code extracts multiple data points from each page and stores them in a database. There are two main functions here: one loops through all the links extracted from the link database, and the other extracts the data from each page that the loop function opens.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from datetime import date
from random import randint
from time import sleep
from Indeed_database_engine import DB_Indeed
opt = Options()
opt.add_experimental_option("debuggerAddress", "localhost:9222")
browser = webdriver.Chrome(chrome_options=opt)
# link = 'https://malaysia.indeed.com/viewjob?jk=e9b5a43e60863b29&tk=1ft1k3gktk3ap803&from=serp&vjs=3'
def scrape_individual_page_data_Indeed(link):
    # i = 8064  # 5639 + current total in db_9889_all db_file
    # while i < len(link):
    browser.get(link)
    # XPath
    job_name_xpath = "//h1[@class='icl-u-xs-mb--xs icl-u-xs-mt--none jobsearch-JobInfoHeader-title']"
    company_name_xpath1 = '//a[@target="_blank"]'
    company_name_xpath2 = '//div[@class="icl-u-lg-mr--sm icl-u-xs-mr--xs"]'
    salary_xpath = '//span[@class="icl-u-xs-mr--xs attribute_snippet"]'
    other_details_xpath = '//span[@class="jobsearch-JobMetadataHeader-item icl-u-xs-mt--xs"]'
    description_xpath = '//div[@id="jobDescriptionText"]'
    meta_data_xpath = '//div[@class="jobsearch-JobMetadataFooter"]'
    # Elements with Waits
    wait = WebDriverWait(browser, 5)
    try:
        job_name = wait.until(EC.presence_of_element_located((By.XPATH, job_name_xpath)))
        job_name = job_name.text
        print(job_name)
    except:
        job_name = 'NA'
        print('Job Name: NA')
    try:
        company_name = wait.until(EC.presence_of_all_elements_located((By.XPATH, company_name_xpath2)))
        company_name = company_name[1].text
        print(company_name)
    except:
        company_name = 'NA'
        print('Company Name: NA')
    try:
        job_salary = wait.until(EC.presence_of_element_located((By.XPATH, salary_xpath)))
        job_salary = job_salary.text
        print(job_salary)
    except:
        job_salary = 'NA'
        print('Job Salary: NA')
    try:
        job_other_details = wait.until(EC.presence_of_element_located((By.XPATH, other_details_xpath)))
        job_other_details = job_other_details.text
        print(job_other_details)
    except:
        job_other_details = 'NA'
        print('Other Details: NA')
    try:
        job_description = wait.until(EC.presence_of_element_located((By.XPATH, description_xpath)))
        job_description = job_description.text
        print(job_description)
    except:
        job_description = 'NA'
        print('Job Description: NA')
    try:
        meta_data = wait.until(EC.presence_of_element_located((By.XPATH, meta_data_xpath)))
        meta_data = meta_data.text
        posted_date = meta_data
    except:
        posted_date = 'NA'
        print('Metadata: NA')
    try:
        scraped_date = date.today()
        print(scraped_date)
    except:
        scraped_date = 'NA'
        print("Today's date: NA")
    try:
        searchkey = 'data'
        job_platform = 'Indeed'
        job_location = 'Malaysia'
        page_url = str(link)
    except:
        print('Error at page_url')
    try:
        # job_name = 'job_name'
        # company_name = 'company_name'
        # job_salary = 'job_salary'
        # job_other_details = 'job_other_details'
        # job_description = 'job_description'
        # posted_date = 'posted_date'
        # scraped_date = 'scraped_date'
        # search_parameters = 'search_parameters'
        # job_platform = 'job_platform'
        # page_url = 'page_url'
        # job_location = 'job_location'
        # job_total_applications = 'job_total_applications'
        # job_total_views = 'job_total_views'
        #
        # this is where you set up the database and table for each db
        # Output Database Details
        database_name = f'Data/Scraped_Data_Indeed/db_{searchkey}_{job_platform}_scraped_data_1'
        table_name = f'db_{searchkey}_{job_platform}_scraped_data_1_table'
        full_insert_data = DB_Indeed(database_name, table_name)
        full_insert_data.insert_Data(job_name, company_name, job_salary,
                                     job_other_details, job_description,
                                     posted_date, scraped_date, searchkey,
                                     job_platform, page_url)
    except:
        print('There is an error updating the database')
    # print(f'Job Count: {i}/ 12510')
    # print(f'Job Count: {i}/{len(list_main)}')
    # i += 1
    browser.back()
# scrape_individual_page_data_Indeed(link)
def selenium_Indeed_link_looper(links_to_scrape_extracted_from_db):
    # this function is used by the database_link_extractor to feed in the list of links that were extracted from the db
    total_links_in_db = len(links_to_scrape_extracted_from_db)  # this is the list produced by the indeed_link_extractor function in the database_link_extractor module
    i = 0
    for link in links_to_scrape_extracted_from_db:
        print(link)
        # print(platform)
        # print(what_to_search)
        sleep(randint(1, 3))
        scrape_individual_page_data_Indeed(link)
        i += 1
        print(f'Scraped page {i} / {total_links_in_db} total pages.')
def db_tester():
    # to test when there is an issue when updating data to the database
    try:
        job_name = 'job_name'
        company_name = 'company_name'
        job_salary = 'job_salary'
        job_other_details = 'job_other_details'
        job_description = 'job_description'
        posted_date = 'posted_date'
        scraped_date = 'scraped_date'
        searchkey = 'search_parameters'
        job_platform = 'job_platform'
        page_url = 'page_url'
        # this is where you set up the database and table for each db
        # Output Database Details
        database_name = 'Data/Scraped_Data_Indeed/db_data_scraped_data_1'
        table_name = f'db_{searchkey}_scraped_data_1_table'
        full_insert_data = DB_Indeed(database_name, table_name)
        full_insert_data.insert_Data(job_name, company_name, job_salary,
                                     job_other_details, job_description,
                                     posted_date, scraped_date, searchkey,
                                     job_platform, page_url)
        print('Line just after insertdata')
    except:
        print('There is an error updating the database')
# db_tester()
Here you can see the database with all the stored data properly organised; it can easily be exported to any other format preferred.
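For example, one way to export the scraped table to a CSV file using only the standard library is sketched below; the database and table names match the ones created in Step 3, while the output filename is just an example:

import csv
import sqlite3

con = sqlite3.connect('Data/Scraped_Data_Indeed/db_data_Indeed_scraped_data_1.db')
cur = con.cursor()
cur.execute('SELECT * FROM db_data_Indeed_scraped_data_1_table')
rows = cur.fetchall()

with open('indeed_scraped_data.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow([description[0] for description in cur.description])  # column headers
    writer.writerows(rows)  # one line per scraped job
con.close()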