Scraping data with Scrapy and PostgreSQL and execute with scheduler

Photo by Emile Perron on Unsplash

Introduction

In the following tutorial, I will use Scrapy to retrieve the stock information from alphavantage in Python with a scheduler.

Prerequisite

The following software/ Technologies will be used.

  1. Scrapy
  2. Apscheduler
  3. PostgreSQL
  4. Psycopg2
  5. Pandas
  6. Sqlalchemy
  7. Docker

Version listed below

$ python3 --version
Python 3.8.6
$ pip3 freeze | grep APScheduler
APScheduler==3.7.0
$ pip3 freeze | grep psycopg2-binary
psycopg2-binary==2.8.6
$ pip3 freeze | grep pandas
pandas==1.2.3
$ scrapy version
Scrapy 2.4.1
$ pip3 show sqlalchemy
Name: SQLAlchemy
Version: 1.3.23

Before you begin

  1. Install Docker in your workstation from here.
  2. Clone and navigate the github repo and run the commands below
# install the python packages
pip3 install -r requirements.txt
# Start up local PostgreSQL server
docker run --name myPostgresSqlServer -p 5432:5432 -e POSTGRES_PASSWORD=weakpassword -d postgres

3. Go to Alphavantage to get your API key and paste it into the repo file create-ini.py .

Directory Explanation

.
├── README.md
├── docker-compose # Ignore if you are not using docker compose
│ ├── docker-compose.yml
│ └── script.sh
├── scrapy
│ ├── config.ini # Paste your API key in this file
│ ├── myProject
│ │ ├── database.py
│ │ ├── pipeline.py
│ │ └── trim_data_gen_2.py
│ ├── requirements.txt
│ └── scheduler_gen_2.py # The main program entry
└── validate
└── test.py

Run the program

Please remember to update the alphavantage API key in config.ini before you run the program.

  • Execute the following command to run the program
cd scrapy
python3 scheduler_gen_2.py

Validation

  • Execute the following command with pandas to check the result
$ python3 validate/test.py
id transaction_date symbol volume open_value high_value low_value price dividend_amount
....
60 61 2009-09-03 WCIMQ 12962030.0 0.0290 0.0051 0.0160 0.0053 0.0
61 62 2009-08-31 WCIMQ 23353802.0 0.0550 0.0051 0.0053 0.0220 0.0
62 63 2009-07-31 WCIMQ 1171142.0 0.0390 0.0220 0.0220 0.0300 0.0
63 64 2009-06-30 WCIMQ 4127518.0 0.0575 0.0200 0.0375 0.0420 0.0
64 65 2009-05-29 WCIMQ 1824121.0 0.0700 0.0400 0.0420 0.0400 0.0
65 66 2009-04-30 WCIMQ 3051707.0 0.0500 0.0200 0.0400 0.0300 0.0
66 67 2009-03-31 WCIMQ 1959119.0 0.0700 0.0250 0.0275 0.0500 0.0
67 68 2009-02-27 WCIMQ 784978.0 0.0700 0.0400 0.0550 0.0600 0.0
68 69 2009-01-30 WCIMQ 2228852.0 0.0750 0.0390 0.0600 0.0390 0.0
69 70 2008-12-31 WCIMQ 6643793.0 0.1000 0.0220 0.0390 0.0750 0.0
70 71 2008-11-28 WCIMQ 4419475.0 0.1700 0.0500 0.0800 0.1300 0.0

The data is successfully inserted into PostgreSQL!

Conclusion

This program may come in handy when you need to do repeated tasks/ fetch the latest data and insert it into Database.

By using Sqlalchemy makes the developer’s life easier without you to check the database/ table exists in prior.

You can clone the github repo here.

If folks you have question or suggestion, feel free comment it below!

See you next time!

References

  1. https://www.jitsejan.com/scraping-with-scrapy-and-postgres.html
  2. http://newcoder.io/scrape/part-1/
  3. https://medium.com/@sachin93/web-scraping-using-scrapy-with-json-apis-10ce1b5c0fa1
  4. https://github.com/djchie/webreg_scrapy/blob/master/clock.py
  5. https://stackoverflow.com/questions/44228851/scrapy-on-a-schedule

--

--

--

https://www.youracclaim.com/users/jia-hao-chu/

Love podcasts or audiobooks? Learn on the go with our new app.

Recommended from Medium

Spring I/O Barcelona 2019

Update : Distribution of Pet NFTs

Be an OpenMMLab Contributor

Demystifying scale for new engineers in Start-ups

1 Semester into Georgia Tech’s OMSA as a Full-time Working Mom

Flutter vs Xamarin Cross-Platform Mobile Development Compared

Github CLI — bring Github to the command line

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Daniel Chu

Daniel Chu

https://www.youracclaim.com/users/jia-hao-chu/

More from Medium

Using Selenium in Docker/Cloud/WSL2 with Python[2022]

Connect to Oracle Autonomous Database from Python without using the Instance Wallet

Create modularized and scalable flask application using Blueprint

Working with WTForms and Oracle REST Database Services (ORDS) APIs