For the past three weeks, I have been developing and refining a web scraping script that fetches content from a certain blog and archives it, images included, in my AWS S3 bucket. I serve the archived content through a simple Angular website. S3 has proved very useful for this kind of content.
The scraper is just a simple Ruby script made up of a couple of files. A JSON file holds the list of URLs to crawl. These URLs are organized the same way the blog posts are, that is, by year and by month. I’m sure the script will break if the URL structure is different.
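To make the URL list concrete, here is a minimal sketch of how such a file could be read. The year/month nesting, the `blog.example.com` URLs, and the `each_post` helper are all assumptions for illustration, not the actual file or script.

```ruby
require "json"

# Hypothetical layout for the URL list: year -> month -> array of post URLs.
URLS_JSON = <<~JSON
  {
    "2015": {
      "01": ["https://blog.example.com/2015/01/first-post.html"],
      "02": ["https://blog.example.com/2015/02/second-post.html",
             "https://blog.example.com/2015/02/third-post.html"]
    }
  }
JSON

# Flatten the nested year/month structure into one list of posts to crawl.
def each_post(url_index)
  url_index.flat_map do |year, months|
    months.flat_map do |month, urls|
      urls.map { |url| { year: year, month: month, url: url } }
    end
  end
end

posts = each_post(JSON.parse(URLS_JSON))
posts.each { |post| puts "#{post[:year]}/#{post[:month]} #{post[:url]}" }
```

Keeping the year and month next to each URL means the later steps do not have to parse them back out of the URL string.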
For each page, the script fetches only the target content via a CSS selector, since I don’t want the sidebar or the comments. Images are copied, and every image reference is rewritten to point to my local copy. Where a thumbnail links to its original image, the script also fetches the image behind the parent link, so I can serve the full-size image locally too.
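The image handling can be sketched with two small helpers. This is only an illustration using Ruby's standard library: the helper names, the list of image extensions, and the `images/<host>/...` layout are assumptions, and the CSS-selector lookup itself is left out.

```ruby
require "uri"

# Assumed set of extensions that mark a link target as an image.
IMAGE_EXTENSIONS = %w[.jpg .jpeg .png .gif .webp].freeze

# Prefer the parent link's target when it points at an image (thumbnail case);
# otherwise keep the <img> source as it is.
def full_size_url(img_src, parent_href)
  return img_src if parent_href.nil?

  ext = File.extname(URI.parse(parent_href).path.to_s).downcase
  IMAGE_EXTENSIONS.include?(ext) ? parent_href : img_src
end

# Map a (possibly relative) image URL to a path inside the local archive.
def local_image_path(image_url, page_url)
  absolute = URI.join(page_url, image_url)
  File.join("images", absolute.host, absolute.path)
end

page = "https://blog.example.com/2015/01/first-post.html"
chosen = full_size_url("/uploads/thumb.jpg", "/uploads/full.jpg")
puts local_image_path(chosen, page)
```

The path returned by `local_image_path` would then replace the original `src` in the archived HTML.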
All the scraped pages are then indexed in a JSON file that acts as a table of contents. This JSON file, the HTML pages, and the images are then uploaded to a dedicated AWS S3 bucket, which is publicly readable and ready to serve at any time.
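A table-of-contents file along these lines could be generated as follows. The field names (`title`, `path`) and the year/month grouping are assumptions made for the sketch; the real file may look different.

```ruby
require "json"

# Group archived posts by year and month into a nested hash
# that serializes directly to the table-of-contents JSON.
def build_toc(posts)
  posts.each_with_object({}) do |post, toc|
    months = (toc[post[:year]] ||= {})
    (months[post[:month]] ||= []) << { "title" => post[:title], "path" => post[:path] }
  end
end

# Hypothetical sample data standing in for the scraped pages.
archived = [
  { year: "2015", month: "01", title: "First post",  path: "2015/01/first-post.html" },
  { year: "2015", month: "01", title: "Second post", path: "2015/01/second-post.html" },
  { year: "2016", month: "03", title: "Third post",  path: "2016/03/third-post.html" }
]

puts JSON.pretty_generate(build_toc(archived))
```

Storing the relative `path` of each HTML file lets the website turn a table-of-contents entry into a bucket URL without any extra lookup.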
Tools and libraries used:
- Mechanize and HTTParty Ruby gems
- s3cmd tool for uploading to S3
- A couple of Ruby scripts to complete the package
Updating contents is done simply by re-running the script and re-uploading the contents/assets.
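The re-upload step could be wrapped in a small Ruby helper like the one below. This is a sketch under assumptions: the `archive` directory, the bucket name, the `blog/` prefix, and the `RUN_UPLOAD` switch are all made up for the example; `s3cmd sync` and its `--acl-public` option are the only parts taken from s3cmd itself.

```ruby
# Build the s3cmd invocation that mirrors a local directory into a bucket.
# The trailing slash on the source makes s3cmd upload the directory's
# contents rather than the directory itself.
def s3cmd_sync_command(local_dir, bucket, prefix = "")
  source = File.join(local_dir, "")
  ["s3cmd", "sync", "--acl-public", source, "s3://#{bucket}/#{prefix}"]
end

# Hypothetical directory, bucket, and prefix.
command = s3cmd_sync_command("archive", "my-archive-bucket", "blog/")
puts command.join(" ")

# Only run the upload when explicitly requested (assumed opt-in switch).
system(*command) if ENV["RUN_UPLOAD"] == "1"
```

Because `sync` only transfers files that changed, re-running the scraper and then this command keeps the bucket up to date without re-uploading everything.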
For the website that serves the content, I considered using an AWS S3 bucket too, but since I wanted to use Angular, I decided to just create a simple Angular Universal website that serves content from AWS S3. The Angular app is very simple:
- Have a route/page to show the table of contents
  - Table of contents is taken straight from the JSON file in my S3 bucket
- Have a route/page to show the article content
  - Since the URL is organized like a blog, it is easy to match the URL with the URL in my S3 bucket
  - Just fetch the HTML file then load it into the content page
  - Caching can be a problem though
With this setup, adding a new blog source just means repeating the process and probably adding two more routes.
I can’t share the website, as it is for personal use only.