{"id":21425436,"url":"https://github.com/utrechtuniversity/ia-webscraping","last_synced_at":"2025-07-13T21:09:02.701Z","repository":{"id":40789208,"uuid":"329035317","full_name":"UtrechtUniversity/ia-webscraping","owner":"UtrechtUniversity","description":"An AWS workflow for collecting webpages from the Internet Archive","archived":false,"fork":false,"pushed_at":"2024-11-05T12:25:38.000Z","size":22477,"stargazers_count":3,"open_issues_count":5,"forks_count":4,"subscribers_count":8,"default_branch":"main","last_synced_at":"2025-04-12T15:19:23.296Z","etag":null,"topics":["aws","internet-archive","python","terraform","web-scraping"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/UtrechtUniversity.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2021-01-12T15:49:39.000Z","updated_at":"2024-11-05T12:25:42.000Z","dependencies_parsed_at":"2024-11-22T21:41:23.491Z","dependency_job_id":null,"html_url":"https://github.com/UtrechtUniversity/ia-webscraping","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/UtrechtUniversity/ia-webscraping","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UtrechtUniversity%2Fia-webscraping","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UtrechtUniversity%2Fia-webscraping/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UtrechtUniversity%2Fia-webscraping/releases","manifests_url":"h
ttps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UtrechtUniversity%2Fia-webscraping/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/UtrechtUniversity","download_url":"https://codeload.github.com/UtrechtUniversity/ia-webscraping/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UtrechtUniversity%2Fia-webscraping/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265205779,"owners_count":23727513,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aws","internet-archive","python","terraform","web-scraping"],"created_at":"2024-11-22T21:28:32.249Z","updated_at":"2025-07-13T21:09:02.673Z","avatar_url":"https://github.com/UtrechtUniversity.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ia-webscraping\n\nThis repository provides code to set up an AWS workflow for collecting webpages from the Internet Archive.\nIt was developed for the Crunchbase project to assess the sustainability of European startup-companies by analyzing their websites.\n\nThe [workflow](#architecture) is set up to scrape large numbers (millions) of Web pages. With large numbers of http requests from a single location, \nthe Internet Archive's response becomes slow and less reliable. 
We use serverless computing to distribute the process as much as possible.\nIn addition, we use queueing services to manage the logistics and a data streaming service to process the large amounts of individual files.\n\nPlease note that this software is designed for users with prior knowledge of Python, AWS and infrastructure.\n\n\n## Table of contents\n\n- [Getting started](#getting-started)\n  - [Prerequisites](#prerequisites)\n  - [Installation](#installation)\n- [Running the pipeline](#running-the-pipeline)\n  - [Uploading URLs](#upload-urls-to-be-scraped)\n  - [Monitoring progress](#monitor-progress)\n- [Results](#results)\n  - [Processing Parquet files](#processing-parquet-files)\n- [Cleaning up](#cleaning-up)\n  - [Deleting the infrastructure](#deleting-the-infrastructure)\n  - [Deleting buckets](#deleting-buckets)\n- [About the project](#about-the-project)\n  - [Architecture](#architecture)\n  - [Built with](#built-with)\n  - [License and citation](#license-and-citation)\n  - [Team](#team)\n\n## Getting started\n\n  - [Prerequisites](#prerequisites)\n  - [Installation](#installation)\n\n### Prerequisites\nThe process includes multiple bash-files that only run on Linux or a Mac.\nTo run this project you need to take the following steps:\n- [Install AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html)\n- [Configure AWS CLI credentials](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html); create a local profile with the name 'crunch'\n- Install [Python3](https://www.python.org/downloads/), [pip3](https://pypi.org/project/pip/), [pandas](https://pandas.pydata.org/), [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html#installation)\n- Install [Terraform](https://www.terraform.io/downloads.html)\n- Create a personal S3 bucket in AWS (region: 'eu-central-1'; for other settings defaults suffice). This is the bucket the code for your Lambda-functions will be stored in. 
Another bucket for the results will be created automatically.\n\n### IAM Developer Permissions Crunchbase\nIf you are going to use an IAM account for the pipeline, make sure it has the proper permissions to create buckets, queues and policies, and to create, read and write to log groups and streams. The following [AWS managed policies](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_managed-vs-inline.html) were given to all developers of the original Crunchbase project:\n- AmazonEC2FullAccess\n- AmazonSQSFullAccess\n- IAMFullAccess\n- AmazonEC2ContainerRegistryFullAccess\n- AmazonS3FullAccess\n- CloudWatchFullAccess\n- AWSCloudFormationFullAccess\n- AWSBillingReadOnlyAccess\n- AWSLambda_FullAccess\n\nNote that these policies are broader than required for the deployment of Crunchbase. Giving more access than required does not follow the best practice of least privilege; see the [IAM best practices](https://docs.aws.amazon.com/IAM/latest/UserGuide/best-practices.html) for more information.\n\n\n### Installation\nCheck out [this repository](https://github.com/UtrechtUniversity/ia-webscraping), and make sure you check out the 'main' branch. Open a terminal window and navigate to the `code` directory.\n```bash\n# Go to code folder\n$ cd code\n```\n\n#### Configuring Lambda functions and Terraform\nThe `build.sh` script in this folder does the following for each of the Lambda functions:\n- install all requirements from the 'requirements.txt' file in the function's folder\n- create a zip file\n- calculate a hash of this zipfile\n- upload all relevant files to the appropriate S3 bucket\n\nYou can run `build.sh` with the following parameters to configure your scrape job:\n\n- `-c \u003ccode bucket name\u003e`: the name of the S3 bucket you've created for the code (see 'Prerequisites'). Use just the bucket's name,\nomitting the scheme (for example: 'my-code-bucket', *not* 's3://my-code-bucket').\n- `-r \u003cresult bucket name\u003e`: the name of the S3 bucket for your results. 
This will be created automatically. Again, specify just\nthe name (for example: 'my-result-bucket').\n- `-l \u003clambda prefix\u003e`: prefix that is added to the Lambda functions' names. Useful for keeping different functions apart, if you\nare running more Lambdas on the same account.\n- `-a \u003cAWS profile\u003e`: the name of your local AWS profile (see 'Prerequisites'; for example: 'crunch').\n- `-f \u003cformats to save\u003e`: the scraper can save all readable text from an html-page (for text analysis), as well as a list of\nlinks present in each page (useful for network analysis). The default is text and links (`-f \"txt,links\"`). Saving\nfull html-pages has been disabled.\n- `-s \u003cstart year\u003e`: start year of the time window for which the scraper will retrieve stored pages. This value affects all domains.\nIt can be overridden with a specific value _per domain_ during the [URL upload process](#upload-urls-to-be-scraped). Same for `-e`.\n- `-e \u003cend year\u003e`: end year.\n- `-m`: switch for exact URL match. By default, the program will retrieve all available pages whose URL *starts with* the domain\nor URL you provide. By using the `-m` switch, the program will only retrieve exact matches of the provided domain or URL. Note that\nwhile matching, the presence or absence of a 'www'-subdomain prefix is ignored (so you can provide either).\n- `-x \u003cmaximum number of scraped pages per provided URL; 0 for unlimited\u003e`: maximum number of pages to retrieve for each provided\ndomain (or URL). If a domain's number of URLs exceeds this value, all the URLs are first sorted by their length (shortest first)\nand subsequently truncated to `-x` URLs.\n- `-n`: Switch to skip re-install of third party packages.\n- `-h`: Show help.\n\n\n#### Building Lambda functions and uploading to AWS\nSave the file and close the text editor. 
Then run the build script with the correct parameters, for instance:\n```bash\n$ ./build.sh \\\n  -c my_code_bucket \\\n  -r my_result_bucket \\\n  -l my_lambda \\\n  -a crunch \\\n  -s 2020 \\\n  -e 2022 \\\n  -x 1000\n```\nThe script creates relevant Terraform-files, checks whether the code-bucket exists, installs all required Python-packages,\nzips the functions, and uploads them to the code bucket.\n\nIf you run the build-script repeatedly within a short time period (for instance when modifying the code), you\ncan run subsequent builds with the `new_code_only.sh` script, which skips the re-installation of the Python dependencies and saves time:\n```\n$ ./new_code_only.sh\n```\nThis will repackage your code, and upload it to the appropriate bucket. The lambda will eventually pick up the new code\nversion; to make sure that the new code is used, go to the appropriate function in the Lambda-section of the AWS console,\nand in the section 'Source', click 'Upload from'. Choose 'Amazon S3 location', and enter the S3-path of the uploaded zip-file.\nYou can find the paths at the end of the output of the `new_code_only.sh` script (e.g. `s3://my_code_bucket/code/my_lambda-cdx.zip`).\n\n\n#### Additional Terraform configuration (optional)\nAll relevant Terraform-settings are set by the build-script. 
There are, however, some defaults that can be changed.\nThese are in the file [terraform.tfvars](terraform/terraform.tfvars), below the line '--- Optional parameters ---':\n\n```php\n[...]\n\n# ------------- Optional parameters -------------\n# Uncomment if you would like to use these parameters.\n# When nothing is specified, defaults apply.\n\n# cdx_logging_level = [CDX_DEBUG_LEVEL; DEFAULT=error]\n\n# scraper_logging_level = [SCRAPER_DEBUG_LEVEL; DEFAULT=error]\n\n# sqs_fetch_limit = [MAX_MESSAGES_FETCH_QUEUE; DEFAULT=1000]\n\n# sqs_cdx_max_messages = [MAX_CDX_MESSAGES_RECEIVED_PER_ITERATION; DEFAULT=10]\n\n# cdx_lambda_n_iterations = [NUMBER_ITERATIONS_CDX_FUNCTION=2]\n\n# cdx_run_id = [CDX_RUN_METRICS_IDENTIFIER; DEFAULT=1]\n```\n\nSee the [variables file](/code/terraform/variables.tf) for more information on each of these variables.\n\nPlease note that [terraform.tfvars](terraform/terraform.tfvars) is automatically generated when you run the build-script,\noverwriting any manual changes you may have made. If you wish to modify any of the variables in the file, do so _after_\nyou've successfully run `build.sh`.\n\n\n#### Initializing Terraform\n_init_\n\nThe `terraform init` command is used to initialize a working directory containing Terraform configuration files.\nThis is the first command that should be executed after writing a new Terraform configuration or cloning an\nexisting one from version control. This command needs to be run only once, but is safe to run multiple times.\n```bash\n# Go to terraform folder\n$ cd terraform\n\n# Initialize terraform\n$ terraform init\n```\nOptionally, if you have made changes to the backend configuration:\n```bash\n$ terraform init -reconfigure\n```\n_plan_\n\nThe `terraform plan` command is used to create an execution plan. Terraform performs a refresh, unless explicitly\ndisabled, and then determines what actions are necessary to achieve the desired state specified in the configuration\nfiles. 
The optional -out argument is used to save the generated plan to a file for later execution with\n`terraform apply`.\n```bash\n$ terraform plan -out './plan'\n```\n_apply_\n\nThe `terraform apply` command is used to apply the changes required to reach the desired state of the configuration,\nor the pre-determined set of actions generated by a terraform plan execution plan. By using the “plan” command before\n“apply,” you’ll be aware of any unforeseen errors or unexpected resource creation/modification!\n```bash\n$ terraform apply \"./plan\"\n```\n\nFor convenience, all the Terraform-steps can also be run from a single bash-file:\n```bash\n$ ./terraform.sh\n```\n\n## Running the pipeline\n\n### Upload URLs to be scraped\nScraping is done in two steps:\n1. After uploading a list of domains to be scraped to a queue, the 'CDX' Lambda-function queries the API of the [Internet\nArchive](https://archive.org/web/) (IA) and retrieves all archived URLs for each domain. These include all available\ndifferent (historical) versions of each page for the specified time period. After filtering out irrelevant URLs (images,\nJavaScript-files, stylesheets etc.), the remaining links are sent to a second queue for scraping.\n2. The 'scrape' function reads links from the second queue, retrieves the corresponding pages from the Internet Archive,\nand saves the contents to the result bucket. The contents are saved as Parquet datafiles.\n\nThe `fill_sqs_queue.py` script adds domains to be scraped to the initial queue (script is located in the [code folder](code/)):\n```bash\n# Fill sqs queue\n$ python fill_sqs_queue.py [ARGUMENTS]\n```\n```\nArguments:\n  --infile, -f       CSV-file with domains. The appropriate column should have 'Website' as\n                     header. If you're using '--year-window' there should also be a column\n                     'Year'.\n  --job-tag, -t      Tag to label a batch of URLs to be scraped. 
This tag is repeated in all\n                     log files and in the result bucket, and is intended to keep track of all\n                     data and files of one run. Max. 32 characters.\n  --queue, -q        name of the appropriate queue; the correct value has been set by 'build.sh'\n                     (optional).\n  --profile, -p      name of your local AWS profile; the correct value has been set by 'build.sh'\n                     (optional).\n  --author, -a       author of queued messages; the correct value has been set by 'build.sh'\n                     (optional).\n  --first-stage-only switch to use only the first step of the scraping process (optional, default\n                     false). When used, the domains you queue are looked up in the IA, and the\n                     resulting URLs are filtered and, if appropriate, capped, and logged, but not\n                     passed on to the scrape-queue. \n  --year-window, -y  number of years to scrape from a domain's start year (optional). Requires\n                     the presence of a column with start years in the infile.\n\n\nExample:\n$ python fill_sqs_queue.py -f example.csv -t \"my first run (2022-07-11)\" -y 5\n```\nFor each domain, the script creates a message and loads it into the CDX-queue, after which processing automatically\nstarts.\n\n### Monitor progress\nEach AWS service in the workflow can be monitored in the AWS console. The CloudWatch logs provide additional information\non the Lambda functions. Set the logging level to 'info' to get verbose information on the progress.\n\n#### CDX-queue\nFor each domain a message is created in the CDX-queue, which can be monitored through the\n'Simple Queue Service' in the AWS Console. The CDX-queue is called `my-lambda-cdx-queue` (`my-lambda` being the value\nof `LAMBDA_NAME` you configured; see 'Configuring Lambda functions and Terraform'). 
The column 'Messages available'\ndisplays the number of remaining messages in the queue, while 'Messages in flight' shows the number of messages\ncurrently being processed. Please note that this process can be relatively slow; if you uploaded thousands of links,\nexpect the entire process to take several hours or longer.\n\n#### Scrape-queue\nThe scrape-queue (`my-lambda-scrape-queue`) contains a message for each URL to be scraped. Depending on the size of the\ncorresponding website, this can be anything between a few and thousands of links per domain (and occasionally none).\nTherefore, the number of messages to be processed from the scrape-queue is usually many times larger than the number\nloaded into the CDX-queue. This number is further increased by the availability of multiple versions of the same page.  \n\n#### Stopping a run\nIf you need to stop a run, first go to the details of the CDX-queue (by clicking its name), and choose 'purge'. This\nwill delete all messages from the queue that are not in flight yet. Then do the same for the scrape-queue.\n\n\n#### CloudWatch (logfiles)\nWhile running, both functions log metrics to their own log group. These can be accessed through the CloudWatch module \nof the AWS Console.\n\nEach function logs to its own log group, `/aws/lambda/my_lambda-cdx` and `/aws/lambda/my_lambda-scrape`. These logs\ncontain mostly technical feedback, and are useful for debugging errors.\nBesides standard process info, the Lambdas write project-specific log lines to their respective log streams. 
These\ncan be identified by their labels.\n\n_CDX metrics_  (Label: **[CDX_METRIC]**)\n\nMetrics per domain\n+ job tag\n+ domain\n+ start year of the retrieval window\n+ end year of the retrieval window\n+ number of URLs retrieved from the IA\n+ number that remains after filtering out filetypes that carry no content (such as JS-files, and style sheets)\n+ number sent to the scrape-queue, after filtering and possible capping.\nThe last two numbers are usually the same, unless you have specified a maximum number of scraped pages per provided domain, and a domain\nhas more pages than that maximum.\n\n_Scrape metrics_  (Label **[SCRAPE_METRIC]**)\n\nMetrics per scraped URL\n+ job tag\n+ domain for which the CDX-function retrieved the URL.\n+ full URL that was scraped.\n+ size saved txt (in bytes)\n+ size saved links (in bytes)\n\n#### Browsing, querying and downloading log lines\nAll log lines can be browsed through the Log Groups of the CloudWatch section of the AWS Console, and, up to a point, queried via the \nLog Insights function. To download them locally, install and run [saw](https://github.com/TylerBrock/saw). A typical command would be:\n\n```bash\n$ saw get /aws/lambda/my_lambda-cdx --start 2022-06-01 --stop 2022-06-05 | grep CDX_METRIC\n```\nThis tells `saw` to get all log lines from the `/aws/lambda/my_lambda-cdx` stream for the period `2022-06-01 \u003c= date \u003c 2022-06-05`.\nThe output is passed on to `grep` to filter out only lines containing 'CDX_METRIC'\n\n\n## Results\nAll scraped content is written to the S3 result bucket (the name of which you specified in the `RESULT_BUCKET` variable) as\nParquet-files, a type of data file ([Apache Parquet docs](https://parquet.apache.org/docs/)). The files are written by\nKinesis Firehose, which controls when results are written, and to which file. 
Firehose flushes records to file, and rolls\nover to a new Parquet-file, whenever it deems this necessary. This makes the number of Parquet-files, as well as the number\nof records within each file, somewhat unpredictable (see below for downloading and processing of Parquet-files,\nincluding compiling them into fewer, larger files). The files are ordered in subfolders representing the year, month and day\nthey were created. For instance, a pipeline started on December 7th 2022 will generate a series of Parquet-files such as:\n\n```bash\ns3://my_result_bucket/2022/12/07/scrape-kinesis-firehose-9-2022-12-07-09-23-43-00efb47c-021f-475f-a119-1aecf2b15ed9.parquet\n```\n\n### Processing Parquet-files\n\n#### Downloading Parquet files from S3\nDownload the generated Parquet-files from the appropriate S3 bucket using the `sync_s3.sh` bash file in the \n[scripts-folder](code/scripts/). The script's first parameter is the address of the appropriate bucket and,\noptionally, the path within it. The second parameter is the path of the local folder to sync to.\n\nThe script uses the `sync` command of the AWS-client, which mirrors the remote contents to a\nlocal folder. This means that if run repeatedly, only new files will be downloaded each time.\nThe command works recursively so subfolder structure is maintained.\n\nThe example command below syncs all the files and subfolders in the folder `/2022/12/` in the\nbucket `my-result-bucket` to the local folder `/my_data/parquet_files/202211/`.\n\n```bash\n$ sync_s3.sh s3://my-result-bucket/2022/12/ /my_data/parquet_files/202211/\n```\n\nBe aware that there will be some time between completion of the final invocation of the\nscraping lambda, and the writing of its data by the Kinesis Firehose (usually no more than\n15 minutes).\n\n#### Quick check of downloaded files\nTo make sure all files were downloaded, compare the number of files you just downloaded with the\nnumber in the S3 bucket. 
The latter can be calculated by accessing the AWS Console. Navigate\ntowards the S3-module, and then the appropriate S3 bucket, and select the appropriate folders.\nNext, click the 'Actions'-button and select 'Calculate total size'; this will give you the\ntotal number of objects and their collective size.\n\n#### Split files based on job_tag\nIf the Parquet-files contain data of multiple runs with different job tags, they can be split\naccordingly. Run the following command to recursively process all Parquet-files in the folder\n`/my_data/parquet_files/202211/` and write them to `/my_data/parquet_files/jobs/`.\n\n```bash\n$ python parquet_file_split.py \\\n    --input '/my_data/parquet_files/202211/' \\\n    --outdir '/my_data/parquet_files/jobs/' \n```\nThis will result in a subfolder per job tag in the output folder. Within each subfolder, there\nwill be a series of Parquet-files containing only the records for that job tag.\n\n#### Combining split files into larger files\nOptionally, to combine many small Parquet-files into larger files, run the following command:\n\n```bash\n$ python parquet_file_join.py \\\n    --indir '/my_data/parquet_files/jobs/my_job/' \\\n    --outdir '/my_data/parquet_files/jobs/my_job_larger/' \\\n    --basename 'my_job' \\\n    --max-file-size 100 \\\n    --delete-originals \n```\n+ `basename`: the basename of the new, larger files. 
Incremental numbers and '.parquet'\nextensions are added automatically.\n+ `max-file-size`: optional parameter to set the approximate maximum file size in MB of the resulting files (default: 25)\n+ `delete-originals`: optional, default False.\n\n#### Reading Parquet files\nIf you are using Python, you can read the Parquet files into a Pandas or Polars DataFrame, \nor use the [pyarrow](https://pypi.org/project/pyarrow/) package.\nFor R you can use the [arrow](https://arrow.apache.org/docs/r/reference/read_parquet.html) package.\n\nEach Parquet-file contains a number of rows, each one corresponding to one scraped URL, with the following columns:\n+ job_tag: job tag of the run that produced the record.\n+ domain: domain for which the CDX-function retrieved the URL.\n+ url: full URL that was scraped.\n+ page_text: full page text\n+ page_links: list of page links\n+ timestamp: timestamp of creation of the record\n\n\n## Cleaning up\n### Deleting the infrastructure\nAfter finishing scraping, run the following [command](https://www.terraform.io/docs/commands/destroy.html) to\nclean up the AWS resources that were deployed by Terraform:\n```bash\n# Go to terraform folder\n$ cd terraform\n\n# Clean up AWS resources\n$ terraform destroy\n```\n\nThis leaves the code and result buckets intact.\n\n### Deleting buckets\nYou can delete S3 buckets through the AWS management interface. When there\nare a lot of files in a bucket, the removal process in the management interface sometimes hangs before it finishes.\nIn that case it is advisable to use the AWS client. 
Example command:\n```bash\n$ aws s3 rb s3://my_result_bucket --force\n```\nThis will delete all files from the bucket, and subsequently the bucket itself.\n\n## About the Project\n\n### Architecture\nThe ia-webscraping repository utilizes the following AWS services:\n- **Simple Queue Service**: manage distribution of tasks among Lambda functions and give insight into results\n    - queue with initial urls\n    - queue with scraping tasks\n- **AWS Lambda**: run code without the need for provisioning or managing servers\n    - lambda to retrieve cdx records for initial urls, filter these and send tasks to scraping queue\n    - lambda to retrieve webpages for cdx records and send these to s3 bucket\n- **S3**: for storage of the HTML pages\n- **CloudWatch**: monitor and manage AWS services\n   - CloudWatch to monitor the metrics of the SQS queue and Lambda functions\n   - CloudWatch to trigger the Lambda function at regular intervals; the interval can be changed to throttle the process\n- **Kinesis Data Firehose**: delivery streams\n   - data from the scraping lambda is pushed to S3 using the Kinesis Data Firehose delivery system.\n   - stores data in Apache Parquet files.\n\nConfiguration of the necessary AWS infrastructure and deployment of the Lambda functions is done using the\n“infrastructure as code” tool Terraform.\n\nDeploying this solution will result in the following scrape pipeline in the AWS Cloud.\n\n![Alt text](docs/architecture_overview.png?raw=true \"Architecture Overview\")\n\n(n.b. schema lacks Kinesis Data Firehose component)\n\n### Built with\n\n- [Terraform](https://www.terraform.io/)\n- [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html)\n- [asyncio](https://docs.aiohttp.org/en/stable/glossary.html#term-asyncio)\n\n### License and citation\n\nThe code in this project is released under [MIT](LICENSE).\n\nPlease cite this repository as \n\nSchermer, M., Bood, R.J., Kaandorp, C., \u0026 de Vos, M.G. (2023). 
\"Ia-webscraping: An AWS workflow for collecting webpages from the Internet Archive \"  (Version 1.0.0) [Computer software]. https://doi.org/10.5281/zenodo.7554441\n\n[![DOI](https://zenodo.org/badge/329035317.svg)](https://zenodo.org/badge/latestdoi/329035317)\n\n\n### Team\n\n**Researcher**:\n\n- Jip Leendertse (j.leendertse@uu.nl)\n\n**Research Software Engineer**:\n\n- Casper Kaandorp (c.s.kaandorp@uu.nl)\n- Martine de Vos (m.g.devos@uu.nl)\n- Robert Jan Bood (robert-jan.bood@surf.nl)\n- Maarten Schermer (m.d.schermer@uu.nl)\n\nThis project is part of the Public Cloud call of [SURF](https://www.surf.nl/en/)\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Futrechtuniversity%2Fia-webscraping","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Futrechtuniversity%2Fia-webscraping","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Futrechtuniversity%2Fia-webscraping/lists"}