What does this AWS Solution do?
Many people want to crawl data from public websites for a variety of purposes, and many tools now exist to help you implement a web scraping solution. But crawling websites with large datasets and complex sitemaps takes a long time and requires servers to run the scraping process.
With web scraping, you don’t need the workload to run 24/7, and data storage is not a critical aspect. Running web scraping on a dedicated server or virtual machine requires an administrator to manage and monitor the servers. You also need to keep the target site from blocking your process, for example by allocating a new IP address each time the process runs. This can be difficult to manage, cause delays, and increase compute costs.
To help you overcome these challenges and decrease the cost of web scraping, we provide the Long-running Serverless Web Scraping solution, an open-source web scraping solution built on Puppeteer that is cost-effective to run on the AWS Cloud. The solution uses the AWS CDK framework to automatically deploy and configure a serverless architecture that is optimized for web scraping. Running in a serverless environment also supports the operational excellence pillar of the AWS Well-Architected Framework, because your deployment process is automated as code, as well as the framework’s other pillars.
AWS solution overview
The diagram below presents the Long-running Serverless Web Scraping architecture, which you can deploy in minutes using the AWS CDK framework.
An Amazon CloudWatch Events rule triggers and starts the ECS Fargate tasks. The number of tasks depends on the number of web scraping specifications, and you can start hundreds of tasks in parallel. Each ECS task runs a Docker container that starts a Puppeteer process in headless mode and scrapes data from the websites. The solution creates an S3 bucket to archive the web scraping output. Archiving to a database is optional, and this solution does not cover how to do that.
Additionally, the AWS CDK deployment creates and configures an Amazon ECS Fargate cluster and an ECR repository, then builds the Docker image and pushes it to the ECR repository.
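The fan-out described above, one Fargate task per scraping specification, can be sketched as a pure function that prepares one RunTask-style request per specification. This is only an illustration under assumptions: the `ScrapeSpec` shape, the container name `crawler`, the environment variable names `TARGET_URL` and `OUTPUT_PREFIX`, and the cluster and task definition names are hypothetical and are not taken from the project. The object layout follows the `overrides.containerOverrides[].environment` shape of the ECS RunTask API, and it is only a subset: a real Fargate request also needs a network configuration.

```typescript
// Hypothetical description of one scraping job; the real project may use a different shape.
interface ScrapeSpec {
  url: string;
  outputPrefix: string;
}

// Subset of the ECS RunTask request fields used in this sketch.
interface RunTaskRequest {
  cluster: string;
  taskDefinition: string;
  launchType: "FARGATE";
  count: number;
  overrides: {
    containerOverrides: {
      name: string;
      environment: { name: string; value: string }[];
    }[];
  };
}

// Build one request per specification so that each task scrapes exactly one target.
function buildRunTaskRequests(
  specs: ScrapeSpec[],
  cluster: string,
  taskDefinition: string,
): RunTaskRequest[] {
  return specs.map((spec): RunTaskRequest => ({
    cluster,
    taskDefinition,
    launchType: "FARGATE",
    count: 1,
    overrides: {
      containerOverrides: [
        {
          name: "crawler", // assumed container name in the task definition
          environment: [
            { name: "TARGET_URL", value: spec.url },
            { name: "OUTPUT_PREFIX", value: spec.outputPrefix },
          ],
        },
      ],
    },
  }));
}

// Example input with placeholder URLs and names.
const exampleSpecs: ScrapeSpec[] = [
  { url: "https://example.com/catalog", outputPrefix: "catalog/" },
  { url: "https://example.org/news", outputPrefix: "news/" },
];

const runTaskRequests = buildRunTaskRequests(exampleSpecs, "scraper-cluster", "scraper-task");
console.log(`${runTaskRequests.length} task requests prepared`);
```

Passing the target through environment variables keeps a single Docker image reusable across all specifications; the container would read the variable at startup and hand it to Puppeteer.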
- Automatically builds a serverless architecture that is optimized for web scraping on the AWS Cloud.
- Scrapes websites using Puppeteer.
- Automatically archives web scraping output to an S3 bucket or a database such as MySQL.
You are responsible for the cost of the AWS services used while running this solution in your AWS account. The estimate below assumes the defaults of our open-source code:
- Run 1000 ECS tasks with 1 GB of RAM and 1 vCPU.
- The duration of each task is around 3 minutes (average).
- 1 GB of S3 storage.
- 1 GB of data transfer, with the default settings in the US East (Ohio) Region (us-east-2).
The estimated cost of running this solution is shown in the table below:
|AWS Service|Total Cost|
|Amazon ECS Fargate|$2.25|
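The Fargate figure can be reproduced with a short calculation. The sketch below assumes on-demand rates of $0.04048 per vCPU-hour and $0.004445 per GB-hour of memory (Linux/x86, US East (Ohio)). These rates are assumptions that are not stated in this article, so check the AWS Fargate pricing page for current values. The calculation ignores S3 storage and data transfer.

```typescript
// Assumed on-demand Fargate rates in USD per hour; verify against the current pricing page.
const VCPU_HOUR_USD = 0.04048;
const GB_HOUR_USD = 0.004445;

interface FargateWorkload {
  taskCount: number;
  vcpuPerTask: number;
  memoryGbPerTask: number;
  minutesPerTask: number;
}

// Cost = total task-hours * (vCPU rate * vCPUs + memory rate * GB).
function estimateFargateCostUsd(w: FargateWorkload): number {
  const taskHours = (w.taskCount * w.minutesPerTask) / 60;
  const hourlyPerTask = w.vcpuPerTask * VCPU_HOUR_USD + w.memoryGbPerTask * GB_HOUR_USD;
  return taskHours * hourlyPerTask;
}

// Defaults listed above: 1000 tasks, 1 vCPU, 1 GB of RAM, about 3 minutes each.
const defaultWorkload: FargateWorkload = {
  taskCount: 1000,
  vcpuPerTask: 1,
  memoryGbPerTask: 1,
  minutesPerTask: 3,
};

const estimatedCostUsd = estimateFargateCostUsd(defaultWorkload);
console.log(`Estimated Fargate cost: $${estimatedCostUsd.toFixed(2)}`);
```

With the defaults, 1000 tasks of 3 minutes each add up to 50 task-hours, which under the assumed rates comes to about $2.25, in line with the table.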
If you choose to deploy into your own VPC and run in a private subnet, you are responsible for the variable charges incurred from the VPC service. For full details, see the pricing webpage for each AWS service you use in this solution.
The Serverless Web Scraping solution provides an important parameter, useDefaultVpc, that lets you choose whether to run ECS tasks in a public subnet of your default VPC or to create a new VPC and select the subnet type (private or public) for your ECS tasks. Review this setting carefully before deploying and running tasks to avoid unexpected charges on your AWS bill, such as NAT instance hourly charges.
This solution is designed to run web scraping in a serverless environment to reduce infrastructure and operations costs for a long-running web scraping process. A long-running process is one that you cannot break into multiple shorter processes. If your web scraping process takes less than 15 minutes, another approach is to run it on AWS Lambda.
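How this parameter might translate into networking choices can be sketched as follows. Only `useDefaultVpc` comes from the solution; the `subnetType` option, the `NetworkPlan` type, and the function name are hypothetical, and the real CDK stack may be organized differently. The sketch encodes the point behind the cost warning: tasks in a private subnet need a NAT gateway or NAT instance to reach public websites, and that adds hourly charges.

```typescript
type SubnetType = "public" | "private";

interface NetworkOptions {
  useDefaultVpc: boolean;  // parameter provided by the solution
  subnetType?: SubnetType; // hypothetical option, only read when a new VPC is created
}

interface NetworkPlan {
  vpc: "default" | "new";
  subnetType: SubnetType;
  assignPublicIp: boolean;
  needsNat: boolean;
}

// Derive the networking consequences of the chosen options.
function planNetwork(options: NetworkOptions): NetworkPlan {
  if (options.useDefaultVpc) {
    // Default VPC: run in a public subnet with a public IP, so no NAT is required.
    return { vpc: "default", subnetType: "public", assignPublicIp: true, needsNat: false };
  }
  const subnetType: SubnetType = options.subnetType ?? "public";
  const isPrivate = subnetType === "private";
  // Private subnets have no direct internet route, so outbound scraping traffic goes through NAT.
  return { vpc: "new", subnetType, assignPublicIp: !isPrivate, needsNat: isPrivate };
}

const cheapestPlan = planNetwork({ useDefaultVpc: true });
const privatePlan = planNetwork({ useDefaultVpc: false, subnetType: "private" });
console.log(cheapestPlan, privatePlan);
```

A check like `needsNat` before deployment makes the extra charge visible instead of leaving it to show up on the bill.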
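The rule of thumb above can be written as a small decision helper. The function name and the `ScrapeJob` shape are illustrative assumptions; the 900-second constant is the maximum AWS Lambda timeout of 15 minutes.

```typescript
// Maximum AWS Lambda function timeout: 15 minutes.
const LAMBDA_MAX_TIMEOUT_SECONDS = 900;

type ScrapeRuntime = "lambda" | "fargate";

// Hypothetical description of a scraping job for this sketch.
interface ScrapeJob {
  estimatedSeconds: number; // expected duration of one uninterrupted run
  canSplit: boolean;        // true if the job can be broken into shorter independent runs
}

// Lambda fits when a run finishes within the timeout, or when the job can be split
// into pieces (assuming each piece then fits the timeout). Otherwise use Fargate.
function chooseRuntime(job: ScrapeJob): ScrapeRuntime {
  if (job.estimatedSeconds < LAMBDA_MAX_TIMEOUT_SECONDS || job.canSplit) {
    return "lambda";
  }
  return "fargate";
}

console.log(chooseRuntime({ estimatedSeconds: 180, canSplit: false }));
console.log(chooseRuntime({ estimatedSeconds: 3600, canSplit: false }));
```

In practice you would probably leave a safety margin below the timeout, since page load times on the target site can vary from run to run.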
This solution uses the AWS CDK framework to automate the deployment of the Serverless Web Scraping solution on the AWS Cloud. It includes an AWS CDK construct, which you can clone and customize before deployment:
- puppeteer-crawler-fargate-stack.ts: use this construct to deploy the Serverless Web Scraping process and all associated components. The default construct template deploys an Amazon ECS cluster, a task definition, an Amazon ECR repository, an Amazon VPC, an Amazon S3 bucket, a CloudWatch Events rule, and a CloudWatch log group. You can also customize it based on your specific requirements.
Follow the step-by-step guide in this section to configure and deploy the Serverless Web Scraping solution into your AWS account.
Time to deploy: 10 minutes
Before continuing, you need an AWS account. If you do not already have one, you can sign up and use the AWS Free Tier.
- Prepare your AWS credentials to interact with the AWS API via the CLI.
- Install the AWS CLI and Docker on your machine.
See the implementation guide for detailed steps.
According to the AWS Shared Responsibility Model, security responsibilities are shared between you and AWS. If you connect to a database to archive output data, as shown in the architecture diagram, we suggest using AWS Systems Manager Parameter Store to store your credentials and retrieve them at runtime.
AWS services: this solution is built on the following AWS services:
- AWS Fargate
- Amazon Elastic Container Service
- Amazon VPC
- AWS CloudFormation
- Amazon S3
- AWS Identity and Access Management
You can visit our GitHub repository to clone or download the project setup and scripts. You are free to share them with others and customize them based on your needs.
You can deploy this solution in any AWS Region that supports AWS Fargate.
You can run your Amazon ECS tasks within any existing VPC in your AWS account. You can also use the default VPC or create a new one.
By default, to avoid additional cost, we recommend running ECS tasks in a public subnet for workloads that don’t contain any sensitive information. If your workload contains sensitive information, you should run it in a private subnet.
You can run any web scraping tool that supports headless mode and can run in a Docker container.