{"id":42222181,"url":"https://github.com/blacksound1/concordia-web-crawler","last_synced_at":"2026-01-27T02:12:17.288Z","repository":{"id":215148057,"uuid":"718780437","full_name":"BlackSound1/Concordia-Web-Crawler","owner":"BlackSound1","description":"Crawls the Concordia.ca domain, clusters the text into categories, and performs sentiment analysis","archived":false,"fork":false,"pushed_at":"2023-12-05T16:26:19.000Z","size":78,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-01-29T11:21:29.179Z","etag":null,"topics":["clustering","crawling","machine-learning","python","sentiment-analysis","web"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/BlackSound1.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2023-11-14T19:35:50.000Z","updated_at":"2024-01-02T17:53:14.000Z","dependencies_parsed_at":null,"dependency_job_id":"c6d5cc72-d7f6-4e10-b3d8-597d1de61406","html_url":"https://github.com/BlackSound1/Concordia-Web-Crawler","commit_stats":{"total_commits":42,"total_committers":1,"mean_commits":42.0,"dds":0.0,"last_synced_commit":"f1966501ceaf5b0353f23c95e73d1f2547762ef2"},"previous_names":["blacksound1/concordia-web-crawler"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/BlackSound1/Concordia-Web-Crawler","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BlackSound1%2FConcordia-Web-Crawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BlackSound1%2FConcordia-Web-Crawle
r/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BlackSound1%2FConcordia-Web-Crawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BlackSound1%2FConcordia-Web-Crawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/BlackSound1","download_url":"https://codeload.github.com/BlackSound1/Concordia-Web-Crawler/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BlackSound1%2FConcordia-Web-Crawler/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28796962,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-27T01:07:07.743Z","status":"online","status_checked_at":"2026-01-27T02:00:07.755Z","response_time":168,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["clustering","crawling","machine-learning","python","sentiment-analysis","web"],"created_at":"2026-01-27T02:12:16.681Z","updated_at":"2026-01-27T02:12:17.283Z","avatar_url":"https://github.com/BlackSound1.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# COMP479P4\n\n\u003chr /\u003e\n\nThe `AFINN-111.txt` lexicon was sourced from: http://corpustext.com/reference/sentiment_afinn.html.\n\nThe strategy for clustering was sourced from: https://scikit-learn.org/stable/auto_examples/text/plot_document_clustering.html.\n\nThe Afinn 0.1 library was sourced 
from: https://github.com/fnielsen/afinn.\n\nThe Scrapy 2.8.0 library was sourced from: https://github.com/scrapy/scrapy. Basic Scrapy usage was inspired by:\nhttps://docs.scrapy.org/en/latest/intro/tutorial.html.\nUsing Scrapy within Python was inspired by: https://stackoverflow.com/a/31374345.\n\nThe Scikit-learn 1.3.0 library was sourced from: https://anaconda.org/conda-forge/scikit-learn.\n\n## Setup\n\n1. Navigate to this directory.\n\n2. Create a virtual environment with:\n\n    ```shell\n    $ python -m venv COMP479\n    ```\n\n3. Activate it with:\n\n    ```shell\n    $ source COMP479/Scripts/activate\n    ```\n\n4. Install the dependencies with:\n\n    ```shell\n    $ pip install -r requirements.txt\n    ```\n\n## To Run\n\nFirst run the `crawl.py` module with:\n\n```shell\n$ python crawl.py -n 1000\n```\n\nThe `-n` flag determines how many documents to crawl and download. Because clustering can produce errors on a \nsmall number of files, please set `-n` to a large number.\n\nThen, run the `cluster.py` module with:\n\n```shell\n$ python cluster.py\n```\n\nFinally, run the `sentiment.py` module with:\n\n```shell\n$ python sentiment.py\n```\n\n## Sequences of Calls\n\n### Crawling\n\nIn `crawl.py`, I use the external library Scrapy (https://github.com/scrapy/scrapy). \nI create a `CrawlerProcess` object that lets me use Scrapy from within Python, not the command line.\nI pass the parameter `max_files` to it, which is set to the `-n` argument passed in via the `$ python crawl.py -n 1000`\ncall above. 
If none is provided, `max_files` is set to 100 by default.\n\nThe `MainSpiders` parameters include:\n\n- `name = 'test'`: The name of the spider.\n- `allowed_domains = ['www.concordia.ca']`: Should keep the\ncrawler on Concordia websites.\n- `start_urls = ['https://www.concordia.ca']`: Makes the crawler start on the Concordia homepage.\n- `num_files = 0`: Keeps track of how many files have been downloaded.\n\nIn the `update_settings()` method of the spider, I set it to obey `robots.txt` using the line:\n\n```python\nsettings.set('ROBOTSTXT_OBEY', 'True', priority='spider')\n```\n\n### Parsing\n\nThe BeautifulSoup library is used to parse the crawled web text. It uses the following parameters:\n\n- `features=\"html.parser\"`: Makes the BeautifulSoup parser treat each document as HTML.\n- `from_encoding='utf-8'`: Forces the documents to\nbe interpreted using UTF-8 encoding. Some encoding errors were noticed when this wasn't set. \n\n### Vectorizing\n\nIn `cluster.py`, the `TfidfVectorizer` vectorizer is used to vectorize the documents.\nIts parameters include:\n\n- `max_df=0.5`: Sets the vectorizer to ignore terms that occur in more than 50% of the documents.\n- `min_df=0.1`: Makes it ignore all terms that occur in fewer than 10% of all documents.\n- `stop_words=stopwords`: Lets me set a custom set \nof stopwords to ignore when vectorizing. I made a custom set composed of all English and French stopwords, plus several\nothers found through experimentation.\n- `strip_accents='unicode'`: Strips all accents according to Unicode (not ASCII) standards.\n- `input='filename'`: Sets the input for the `fit_transform()` method to take a list of filenames.\n- `encoding=\"utf-8\"`: Forces UTF-8 encoding.\n\n### K-Means\n\nIn `cluster.py`, the `KMeans` estimator from `sklearn` is used. Its parameters include:\n\n- `max_iter=100`: Sets the maximum number of iterations for a single K-Means run to 100.\n- `n_clusters`: Sets the number of clusters (and centroids) to create. 
For the `k=3` run, this number is 3,\n                and for the `k=6` run it is 6.\n- `random_state`: Sets a seed for randomness to hopefully generate reproducible results. Set to 3 when `k=3`, likewise\n                  for when `k=6`.\n- `n_init=1`: Sets the number of times the K-Means algorithm is run with different centroid seeds to 1.\n\n### Sentiment Analysis\n\nIn `sentiment.py`, sentiment analysis is done on the discovered clusters two times. The first time is via the\nalgorithm I came up with. The second is via the algorithm found in the Python `afinn` library.\nAs it turns out, these are the same algorithm, but I worked on my algorithm before even downloading the `afinn` library,\nso I wouldn't have known that. After I did know they were the same algorithm, I decided to keep mine the way it is\nbecause I knew I had something to say about their differing performance for some clusters.\n\nFor my algorithm, I use the `AFINN-111.txt` lexicon. For each cluster, I score each word according to this lexicon and\nadd up each word's score into a final cluster score.\n\nFor the library algorithm, for each cluster, I feed the entire cluster as one string to `afinn.score()`.\nThe library algorithm uses the `AFINN-en-165.txt` lexicon by default. Since these are different lexicons, this likely\nexplains any minor differences in scores.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fblacksound1%2Fconcordia-web-crawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fblacksound1%2Fconcordia-web-crawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fblacksound1%2Fconcordia-web-crawler/lists"}