Long-running Serverless Web Scraping – an AWS Well-Architected Solution

What does this AWS Solution do?

Many people want to crawl data from public websites for a variety of purposes. Nowadays, there are many tools that can help you implement a web scraping solution. But crawling websites with large datasets and complex sitemaps takes longer and requires servers to run the scraping process.

With web scraping, you won’t need the workload running 24/7, and data storage is not a critical aspect. Running web scraping on a dedicated server or virtual machine requires an administrator to manage and monitor the servers. You also need to keep the target site from blocking your process, for example by allocating a dynamic IP address each time the process runs. All of this can be difficult to manage, cause delays, and increase compute costs.

To help you overcome those challenges and decrease the cost of web scraping, we provide the Long-running Serverless Web Scraping solution: an open-source web scraping solution built on Puppeteer that enables cost-effective scraping on the AWS Cloud. The solution uses the AWS CDK framework to automatically deploy and configure a serverless architecture optimized for web scraping. Running in a serverless environment, with the deployment automated as code, also supports the operational excellence pillar and the other pillars of the AWS Well-Architected Framework.

AWS solution overview

The diagram below presents the Long-running Serverless Web Scraping architecture, which you can deploy in minutes using the AWS CDK framework.

Figure: Puppeteer on AWS Fargate – Serverless Web Scraping

An Amazon CloudWatch Events rule triggers and starts the ECS Fargate task(s); the number of tasks depends on your web scraping specification, and you can start hundreds of tasks in parallel at scale. Each ECS task runs a Docker container that starts a Puppeteer process in headless mode and begins scraping data from the websites. The solution creates an S3 bucket to archive the web scraping output. Archiving to a database is optional, and we won’t cover the details of how to do that in this solution.
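To make the container’s role concrete, here is a minimal sketch of what its entry point could look like: launch Puppeteer in headless mode, scrape a page, and archive the result to the output bucket. The environment variable names (TARGET_URL, OUTPUT_BUCKET), the selectors, and the object key layout are illustrative assumptions, not the solution’s exact code.

```typescript
import puppeteer from 'puppeteer';
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';

const s3 = new S3Client({});

async function main() {
  const browser = await puppeteer.launch({
    headless: true,
    args: ['--no-sandbox', '--disable-dev-shm-usage'], // common flags when running in Docker
  });
  const page = await browser.newPage();
  await page.goto(process.env.TARGET_URL!, { waitUntil: 'networkidle2' });

  // Extract whatever your scraping specification calls for; headings are just an example
  const headings = await page.$$eval('h1, h2', (els) => els.map((el) => el.textContent));
  await browser.close();

  // Archive the output to the solution's S3 bucket
  await s3.send(new PutObjectCommand({
    Bucket: process.env.OUTPUT_BUCKET!,
    Key: `scrapes/${Date.now()}.json`,
    Body: JSON.stringify(headings),
  }));
}

main().catch((err) => { console.error(err); process.exit(1); });
```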

Additionally, the AWS CDK app deploys and configures an Amazon ECS Fargate cluster and an ECR repository, then builds the Docker image and pushes it to that repository.

Features

  • Automatically builds a serverless architecture optimized for the web scraping process on the AWS Cloud.
  • Scrapes the web using Puppeteer.
  • Automatically archives the web scraping output to an S3 bucket or to a database such as MySQL.

Cost

When running the AWS services in your AWS account, you are responsible for the cost of this solution. With the defaults of our open-source code:

  • 1,000 ECS tasks, each with 1 GB of RAM and 1 vCPU.
  • Each task runs for around 3 minutes on average.
  • 1 GB of S3 storage.
  • 1 GB of data transfer, with the default settings in the US East (Ohio) Region.

The estimated cost for running this solution is shown in the table below:

AWS Service          Total Cost
Amazon ECS Fargate   $2.25
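As a rough cross-check against the published Fargate rates for US East (Ohio) at the time of writing (about $0.04048 per vCPU-hour and $0.004445 per GB-hour): 1,000 tasks × 3 minutes is 50 task-hours, so 50 vCPU-hours ≈ $2.02 plus 50 GB-hours ≈ $0.22 comes to roughly $2.25; 1 GB of S3 storage and 1 GB of data transfer add only a few cents more.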

If you choose to deploy into your own VPC and run tasks in a private subnet, you are responsible for the variable charges incurred by the VPC service. For full details, see the pricing webpage for each AWS service you will be using in this solution.

Implementation Considerations

The Serverless Web Scraping solution provides an important parameter, useDefaultVpc, that lets you decide whether to run ECS tasks in your default VPC in a public subnet, or to create a new VPC and select the subnet type (private or public) for your ECS tasks, as sketched below. Be careful before deploying and running tasks to avoid adding to your AWS bill, for example through hourly NAT instance or NAT gateway charges.
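To illustrate how the flag could be wired up in CDK (only the useDefaultVpc name comes from the solution; the helper below is a hypothetical sketch):

```typescript
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import { Construct } from 'constructs';

// Sketch: resolve the VPC for the ECS tasks from the useDefaultVpc flag.
export function resolveVpc(scope: Construct, useDefaultVpc: boolean): ec2.IVpc {
  if (useDefaultVpc) {
    // Reuse the account's default VPC and its public subnets (no extra VPC cost)
    return ec2.Vpc.fromLookup(scope, 'DefaultVpc', { isDefault: true });
  }
  // Create a new VPC; zero NAT gateways avoids the hourly NAT charge,
  // but tasks must then run in public subnets to reach the internet
  return new ec2.Vpc(scope, 'CrawlerVpc', {
    natGateways: 0,
    subnetConfiguration: [{ name: 'public', subnetType: ec2.SubnetType.PUBLIC }],
  });
}
```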

This solution is designed to run web scraping in a serverless environment to reduce the infrastructure and operations costs of a long-running web scraping process; long-running means the process cannot be broken into multiple shorter processes. If your web scraping process takes less than 15 minutes, another approach is to run it on AWS Lambda.
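For such shorter jobs, a handler along the following lines, using the chrome-aws-lambda package to bundle a Lambda-compatible Chromium, could replace the Fargate task. This is a sketch of the general approach, not part of this solution:

```typescript
import chromium from 'chrome-aws-lambda';

// Sketch of a Lambda handler for scraping jobs that finish within the
// 15-minute Lambda timeout; the event shape is an illustrative assumption.
export const handler = async (event: { url: string }) => {
  const browser = await chromium.puppeteer.launch({
    args: chromium.args,
    executablePath: await chromium.executablePath,
    headless: chromium.headless,
  });
  try {
    const page = await browser.newPage();
    await page.goto(event.url, { waitUntil: 'networkidle2' });
    return { title: await page.title() }; // return or archive the scraped data
  } finally {
    await browser.close();
  }
};
```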

Template

This solution uses the AWS CDK framework to automate the deployment of the Serverless Web Scraping solution on the AWS Cloud. It includes an AWS CDK construct, which you can clone and customize before deployment:

  • puppeteer-crawler-fargate-stack.ts – use this construct to deploy the Serverless Web Scraping process and all associated components. The default construct deploys an Amazon ECS cluster, a task definition, an Amazon ECR repository, an Amazon VPC, an Amazon S3 bucket, an Amazon CloudWatch Events rule, and a CloudWatch log group; you can also customize it for your specific requirements. A condensed sketch follows below.
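The sketch shows roughly what such a stack provisions. Construct IDs, the ./crawler Dockerfile path, and the daily schedule are illustrative assumptions rather than the repository’s exact code:

```typescript
import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as events from 'aws-cdk-lib/aws-events';
import * as targets from 'aws-cdk-lib/aws-events-targets';
import { Construct } from 'constructs';

export class PuppeteerCrawlerFargateStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // New VPC with public subnets only; zero NAT gateways keeps the VPC free
    const vpc = new ec2.Vpc(this, 'CrawlerVpc', {
      maxAzs: 2,
      natGateways: 0,
      subnetConfiguration: [{ name: 'public', subnetType: ec2.SubnetType.PUBLIC }],
    });

    const cluster = new ecs.Cluster(this, 'CrawlerCluster', { vpc });

    // Bucket that archives the scraping output
    const outputBucket = new s3.Bucket(this, 'CrawlerOutputBucket');

    // 1 vCPU task; note Fargate's smallest valid memory for 1 vCPU is 2 GB
    const taskDef = new ecs.FargateTaskDefinition(this, 'CrawlerTaskDef', {
      cpu: 1024,
      memoryLimitMiB: 2048,
    });

    taskDef.addContainer('crawler', {
      // Builds the local Dockerfile and pushes the image to an ECR repository
      image: ecs.ContainerImage.fromAsset('./crawler'),
      logging: ecs.LogDrivers.awsLogs({ streamPrefix: 'crawler' }),
      environment: { OUTPUT_BUCKET: outputBucket.bucketName },
    });
    outputBucket.grantWrite(taskDef.taskRole);

    // CloudWatch Events rule that starts the Fargate task(s) on a schedule
    new events.Rule(this, 'CrawlSchedule', {
      schedule: events.Schedule.rate(cdk.Duration.days(1)),
      targets: [new targets.EcsTask({
        cluster,
        taskDefinition: taskDef,
        taskCount: 1, // raise to fan out many scraping tasks in parallel
        subnetSelection: { subnetType: ec2.SubnetType.PUBLIC },
        assignPublicIp: true, // needed to pull images from public subnets without a NAT
      })],
    });
  }
}
```

Here ContainerImage.fromAsset is what builds the local Dockerfile and pushes the image to an ECR repository during cdk deploy, matching the automated build-and-push behavior described earlier.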

Deployment

Follow the step-by-step guide in this section to configure and deploy the Serverless Web Scraping solution into your AWS account.

Time to deploy: 10 minutes

Prerequisites

Before continuing, you need an AWS account. If you do not already have one, you can sign up and use the AWS Free Tier.

The Steps

See the implementation guide for detailed steps.

Security

Under the AWS Shared Responsibility Model, security responsibilities are shared between you and AWS. In the architecture diagram, if you connect to your own database to archive output data, we suggest using AWS Systems Manager Parameter Store to configure and retrieve your credentials at runtime.
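For example, the task could fetch a SecureString parameter at startup using the AWS SDK; the parameter name below is only an example:

```typescript
import { SSMClient, GetParameterCommand } from '@aws-sdk/client-ssm';

const ssm = new SSMClient({});

// Sketch: retrieve a database password stored as a SecureString parameter.
export async function getDbPassword(): Promise<string> {
  const res = await ssm.send(new GetParameterCommand({
    Name: '/crawler/db-password', // example name; store yours under your own path
    WithDecryption: true,         // decrypts the SecureString value via KMS
  }));
  return res.Parameter!.Value!;
}
```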

Resources

AWS services: this solution is built on the following AWS services: AWS Fargate, Amazon ECS, Amazon ECR, Amazon S3, Amazon CloudWatch, and Amazon VPC.

Web scraping: Puppeteer.

Source Code

You can visit our GitHub repository to clone or download the project setup and scripts. You are free to share it with others and customize it based on your needs.

Q&A

Can I deploy the Serverless Web Scraping solution in any AWS Region?

Yes, you can deploy this solution in any AWS Region that supports AWS Fargate.

Can I run the AWS ECS Fargate tasks in my existing VPC?

Yes, you can run your AWS ECS tasks within any existing VPC in your AWS account; you can also use the default VPC or create a new one.

Can I run the AWS ECS Fargate tasks in a public subnet?

Yes. By default, to avoid extra cost, we recommend running ECS tasks in a public subnet for workloads that don’t contain any sensitive information. If your workload contains sensitive information, you should run it in a private subnet.

Can I change the web scraping tool used in this solution to another one?

Yes, you can run any web scraping tool that supports headless mode and can run in a Docker container environment.

About the Author

Co-Founder, CEO of InnomizeTech | AWS Certified Solutions Architect | Passionate about #cloudcomputing #aws #serverless #devops #machinelearning #iot #startup
