Scraping data with Scrapy and PostgreSQL, executed with a scheduler

Daniel Chu
3 min read · Mar 13, 2021
Photo by Emile Perron on Unsplash

Introduction

In the following tutorial, I will use Scrapy in Python to retrieve stock information from Alpha Vantage, store it in PostgreSQL, and run the job with a scheduler.

Prerequisite

The following software/technologies will be used:

  1. Scrapy
  2. APScheduler
  3. PostgreSQL
  4. psycopg2
  5. pandas
  6. SQLAlchemy
  7. Docker

The versions used are listed below:

$ python3 --version
Python 3.8.6
$ pip3 freeze | grep APScheduler
APScheduler==3.7.0
$ pip3 freeze | grep psycopg2-binary
psycopg2-binary==2.8.6
$ pip3 freeze | grep pandas
pandas==1.2.3
$ scrapy version
Scrapy 2.4.1
$ pip3 show sqlalchemy
Name: SQLAlchemy
Version: 1.3.23

Before you begin

  1. Install Docker on your workstation from here.
  2. Clone the GitHub repo, navigate into it, and run the commands below:
# install the python packages
pip3 install -r requirements.txt
# Start up local PostgreSQL server
docker run --name myPostgresSqlServer -p 5432:5432 -e POSTGRES_PASSWORD=weakpassword -d postgres

3. Go to Alpha Vantage to get your API key and paste it into the repo file create-ini.py.

Directory Explanation

.
├── README.md
├── docker-compose # Ignore if you are not using docker compose
│ ├── docker-compose.yml
│ └── script.sh
├── scrapy
│ ├── config.ini # Paste your API key in this file
│ ├── myProject
│ │ ├── database.py
│ │ ├── pipeline.py
│ │ └── trim_data_gen_2.py
│ ├── requirements.txt
│ └── scheduler_gen_2.py # The main program entry
└── validate
└── test.py
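The scraping and trimming logic lives under myProject/ (trim_data_gen_2.py and pipeline.py). As an illustration of the general shape only — the function name and the repo's exact logic are my assumptions — here is a minimal sketch of flattening an Alpha Vantage `TIME_SERIES_MONTHLY_ADJUSTED` JSON payload into rows matching the columns shown in the validation output later:

```python
# A sketch of the transformation step; the repo's trim_data_gen_2.py /
# pipeline.py are the real implementation. Assumes the Alpha Vantage
# TIME_SERIES_MONTHLY_ADJUSTED JSON layout.
def parse_monthly(payload):
    """Flatten an Alpha Vantage monthly-adjusted payload into row dicts."""
    symbol = payload["Meta Data"]["2. Symbol"]
    series = payload["Monthly Adjusted Time Series"]
    rows = []
    for date, values in series.items():
        rows.append({
            "transaction_date": date,
            "symbol": symbol,
            "volume": float(values["6. volume"]),
            "open_value": float(values["1. open"]),
            "high_value": float(values["2. high"]),
            "low_value": float(values["3. low"]),
            "price": float(values["4. close"]),
            "dividend_amount": float(values["7. dividend amount"]),
        })
    return rows
```

A Scrapy spider would yield these dicts as items, and the pipeline would hand them to the database layer.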

Run the program

Please remember to update the Alpha Vantage API key in config.ini before you run the program.

  • Execute the following command to run the program
cd scrapy
python3 scheduler_gen_2.py

Validation

  • Execute the following command with pandas to check the result
$ python3 validate/test.py
id transaction_date symbol volume open_value high_value low_value price dividend_amount
....
60 61 2009-09-03 WCIMQ 12962030.0 0.0290 0.0051 0.0160 0.0053 0.0
61 62 2009-08-31 WCIMQ 23353802.0 0.0550 0.0051 0.0053 0.0220 0.0
62 63 2009-07-31 WCIMQ 1171142.0 0.0390 0.0220 0.0220 0.0300 0.0
63 64 2009-06-30 WCIMQ 4127518.0 0.0575 0.0200 0.0375 0.0420 0.0
64 65 2009-05-29 WCIMQ 1824121.0 0.0700 0.0400 0.0420 0.0400 0.0
65 66 2009-04-30 WCIMQ 3051707.0 0.0500 0.0200 0.0400 0.0300 0.0
66 67 2009-03-31 WCIMQ 1959119.0 0.0700 0.0250 0.0275 0.0500 0.0
67 68 2009-02-27 WCIMQ 784978.0 0.0700 0.0400 0.0550 0.0600 0.0
68 69 2009-01-30 WCIMQ 2228852.0 0.0750 0.0390 0.0600 0.0390 0.0
69 70 2008-12-31 WCIMQ 6643793.0 0.1000 0.0220 0.0390 0.0750 0.0
70 71 2008-11-28 WCIMQ 4419475.0 0.1700 0.0500 0.0800 0.1300 0.0

The data is successfully inserted into PostgreSQL!
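A script like validate/test.py essentially reads the table back with pandas. A minimal sketch of that check might look like the following — the table name `stock_data` is an assumption, so use the name the repo actually creates:

```python
import pandas as pd

def read_back(engine, table="stock_data"):
    """Read the scraped rows back from the database to eyeball them."""
    return pd.read_sql_table(table, engine)

# Against the Dockerized server started earlier it would be used like:
#   from sqlalchemy import create_engine
#   engine = create_engine(
#       "postgresql+psycopg2://postgres:weakpassword@localhost:5432/postgres")
#   print(read_back(engine).tail(10))
# (the table name "stock_data" is an assumption -- use the repo's actual name)
```

Printing the tail of the DataFrame gives output in the shape shown above.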

Conclusion

This program may come in handy when you need to perform a repeated task, or fetch the latest data and insert it into a database.

Using SQLAlchemy makes the developer's life easier, since you do not have to check beforehand whether the database/table exists.
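For instance, writing a DataFrame through a SQLAlchemy engine with `to_sql(..., if_exists="append")` creates the table on the first insert and appends on later runs, with no separate existence check. A small self-contained sketch (an in-memory SQLite engine stands in for the project's PostgreSQL one, and the table name is illustrative):

```python
import pandas as pd
from sqlalchemy import create_engine, inspect

# In-memory SQLite for demonstration; the same to_sql call works
# unchanged against the PostgreSQL engine used in the project.
engine = create_engine("sqlite://")

df = pd.DataFrame({"symbol": ["WCIMQ"], "price": [0.13]})
# if_exists="append" creates the table when missing, appends otherwise --
# no manual "does this table exist?" check is needed.
df.to_sql("stock_data", engine, if_exists="append", index=False)

print("stock_data" in inspect(engine).get_table_names())  # True
```

Running the same insert repeatedly just keeps appending rows, which is exactly what the scheduled crawl relies on.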

You can clone the GitHub repo here.

If you have any questions or suggestions, feel free to comment below!

See you next time!
