{"id":16431822,"url":"https://github.com/drkostas/hgn","last_synced_at":"2025-09-20T21:38:40.670Z","repository":{"id":40959661,"uuid":"262106011","full_name":"drkostas/HGN","owner":"drkostas","description":"[Algorithms '19] Official code for the paper \"A Distributed Hybrid Community Detection Methodology for Social Networks\" paper.","archived":false,"fork":false,"pushed_at":"2023-07-06T21:58:06.000Z","size":224,"stargazers_count":27,"open_issues_count":6,"forks_count":1,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-07-27T11:51:56.706Z","etag":null,"topics":["apache-spark","community-detection","distributed","girvan-newman","graphframes","paper-implementations","papers-with-code","social-networks","spark"],"latest_commit_sha":null,"homepage":"https://www.mdpi.com/1999-4893/12/8/175","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/drkostas.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-05-07T16:44:45.000Z","updated_at":"2025-01-06T16:40:12.000Z","dependencies_parsed_at":"2024-10-28T15:29:32.708Z","dependency_job_id":"0cb4f8e9-34ee-4965-a2d3-710cb6410a5b","html_url":"https://github.com/drkostas/HGN","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":"drkostas/template_python_project","purl":"pkg:github/drkostas/HGN","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/drkostas%2FHGN","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/drkostas%2FHGN/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/drkostas%2FHGN/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/drkostas%2FHGN/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/drkostas","download_url":"https://codeload.github.com/drkostas/HGN/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/drkostas%2FHGN/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":267867803,"owners_count":24157357,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-30T02:00:09.044Z","response_time":70,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-spark","community-detection","distributed","girvan-newman","graphframes","paper-implementations","papers-with-code","social-networks","spark"],"created_at":"2024-10-11T08:32:50.884Z","updated_at":"2025-09-20T21:38:35.617Z","avatar_url":"https://github.com/drkostas.png","language":"Python","readme"
:"# Hybrid Girvan Newman\n[![CircleCI](https://circleci.com/gh/drkostas/HGN/tree/master.svg?style=svg)](https://circleci.com/gh/drkostas/HGN/tree/master)\n[![GitHub license](https://img.shields.io/badge/license-GNU-blue.svg)](https://raw.githubusercontent.com/drkostas/HGN/master/LICENSE)\n\n## Table of Contents\n\n+ [About](#about)\n+ [Getting Started](#getting_started)\n    + [Prerequisites](#prerequisites)\n    + [Environment Variables](#env_variables)\n+ [Installing, Testing, Building](#installing)\n    + [Available Make Commands](#check_make_commamnds)\n    + [Clean Previous Builds](#clean_previous)\n    + [Venv and Requirements](#venv_requirements)\n    + [Run the tests](#tests)\n    + [Build Locally](#build_locally)\n+ [Running locally](#run_locally)\n\t+ [Configuration](#configuration)\n\t+ [Execution Options](#execution_options)\t\n+ [Deployment](#deployment)\n+ [Continuous Ιntegration](#ci)\n+ [Todo](#todo)\n+ [Built With](#built_with)\n+ [License](#license)\n+ [Acknowledgments](#acknowledgments)\n\n## About \u003ca name = \"about\"\u003e\u003c/a\u003e\n\nHybrid Girvan Newman. Code for the paper \"[A Distributed Hybrid Community Detection Methodology for Social Networks.](https://www.mdpi.com/1999-4893/12/8/175)\"\n\u003cbr\u003e\u003cbr\u003e\nThe proposed methodology is an iterative, divisive community detection process that combines the network topology features \nof loose similarity and local edge betweenness measure, along with the user content information in order to remove the \ninter-connection edges and thus unravel the subjacent community structure. Even if this iterative process might sound \ncomputationally over-demanding, its application is certainly not prohibitive, since it can be safely concluded \nfrom the experimentation results that the aforementioned measures are that well-informative and highly representative, \nso merely few iterations are required to converge to the final community hierarchy at any case.\n\u003cbr\u003e\u003cbr\u003e\nImplementation last tested with [Python 3.6](https://www.python.org/downloads/release/python-36), \n[Apache Spark 2.4.5](https://spark.apache.org/docs/2.4.5/) \nand [GraphFrames 0.8.0](https://github.com/graphframes/graphframes/tree/v0.8.0)\n\n## Getting Started \u003ca name = \"getting_started\"\u003e\u003c/a\u003e\n\nThese instructions will get you a copy of the project up and running on your local machine for development \nand testing purposes. See deployment for notes on how to deploy the project on a live system.\n\n### Prerequisites \u003ca name = \"prerequisites\"\u003e\u003c/a\u003e\n\nYou need to have a machine with Python = 3.6, Apache Spark = 2.4.5, GraphFrames = 0.8.0 \nand any Bash based shell (e.g. zsh) installed. 
## Getting Started <a name = "getting_started"></a>

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See [Deployment](#deployment) for notes on how to deploy the project on a live system.

### Prerequisites <a name = "prerequisites"></a>

You need a machine with Python 3.6, Apache Spark 2.4.5, GraphFrames 0.8.0 and any Bash-based shell (e.g. zsh) installed. For Apache Spark 2.4.5 you will also need Java 8.

```
$ python3.6 -V
Python 3.6.9

$ echo $SHELL
/usr/bin/zsh
```

### Set the required environment variables <a name = "env_variables"></a>

In order to run [main.py](main.py) or the tests you will need to set the following environment variables in your system (or in the [spark.env file](spark.env)):

```bash
$ export SPARK_HOME="<Path to Spark Home>"
$ export PYSPARK_SUBMIT_ARGS="--packages graphframes:graphframes:0.8.0-spark2.4-s_2.11 pyspark-shell"
$ export JAVA_HOME="<Path to Java 8>"

$ cd $SPARK_HOME
/usr/local/spark

$ ./bin/pyspark --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.5
      /_/

Using Scala version 2.11.12, OpenJDK 64-Bit Server VM, 1.8.0_252
Branch HEAD
Compiled by user centos on 2020-02-02T19:38:06Z
Revision cee4ecbb16917fa85f02c635925e2687400aa56b
Url https://gitbox.apache.org/repos/asf/spark.git
Type --help for more information.
```
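Before installing the project, you can optionally sanity-check that Spark, Java 8 and the GraphFrames package resolve correctly. The snippet below is only an illustrative smoke test (the file name and the toy graph are made up, not part of the repository); run it with `spark-submit` or inside the `pyspark` shell so the `graphframes` package declared in `PYSPARK_SUBMIT_ARGS` is picked up.

```python
# smoke_test.py -- illustrative check that PySpark and GraphFrames are usable
# (hypothetical file, not part of the repo; requires the variables exported above)
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.master("local[*]").appName("hgn-smoke-test").getOrCreate()

# A 3-node toy graph: a -- b -- c
vertices = spark.createDataFrame([("a",), ("b",), ("c",)], ["id"])
edges = spark.createDataFrame([("a", "b"), ("b", "c")], ["src", "dst"])

g = GraphFrame(vertices, edges)
g.degrees.show()  # expect degree 2 for "b", degree 1 for "a" and "c"

spark.stop()
```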
## Installing, Testing, Building <a name = "installing"></a>

All the installation steps are handled by the [Makefile](Makefile).

<i>If you don't want to go through the setup steps one by one and just want to finish the installation and run the tests, execute the following command:</i>

```bash
$ make install server=local
```

<i>If you executed the previous command, you can skip through to the [Running locally](#run_locally) section.</i>

### Check the available make commands <a name = "check_make_commands"></a>

```bash
$ make help

-----------------------------------------------------------------------------------------------------------
                                              DISPLAYING HELP
-----------------------------------------------------------------------------------------------------------
make delete_venv
       Delete the current venv
make create_venv
       Create a new venv for the specified python version
make requirements
       Upgrade pip and install the requirements
make run_tests
       Run all the tests from the specified folder
make setup
       Call setup.py install
make clean_pyc
       Clean all the pyc files
make clean_build
       Clean all the build folders
make clean
       Call delete_venv clean_pyc clean_build
make install
       Call clean create_venv requirements run_tests setup
make help
       Display this message
-----------------------------------------------------------------------------------------------------------
```

### Clean any previous builds <a name = "clean_previous"></a>

```bash
$ make clean server=local
make delete_venv
make[1]: Entering directory '/home/drkostas/Projects/HGN'
Deleting venv..
rm -rf venv
make[1]: Leaving directory '/home/drkostas/Projects/HGN'
make clean_pyc
make[1]: Entering directory '/home/drkostas/Projects/HGN'
Cleaning pyc files..
find . -name '*.pyc' -delete
find . -name '*.pyo' -delete
find . -name '*~' -delete
make[1]: Leaving directory '/home/drkostas/Projects/HGN'
make clean_build
make[1]: Entering directory '/home/drkostas/Projects/HGN'
Cleaning build directories..
rm --force --recursive build/
rm --force --recursive dist/
rm --force --recursive *.egg-info
make[1]: Leaving directory '/home/drkostas/Projects/HGN'
```

### Create a new venv and install the requirements <a name = "venv_requirements"></a>

```bash
$ make create_venv server=local
Creating venv..
python3.6 -m venv ./venv

$ make requirements server=local
Upgrading pip..
venv/bin/pip install --upgrade pip wheel setuptools
Collecting pip
.................
```

### Run the tests <a name = "tests"></a>

The tests are located in the `tests` folder. To run all of them, execute the following command:

```bash
$ make run_tests server=local
source venv/bin/activate && \
.................
```

### Build the project locally <a name = "build_locally"></a>

To build the project locally using setup.py, execute the following command:

```bash
$ make setup server=local
venv/bin/python setup.py install '--local'
running install
.................
```

## Running the code locally <a name = "run_locally"></a>

To run the code, place the graph whose communities you want to identify under [data/input_graphs](data/input_graphs).<br>
You also need to create a yml configuration file for any new graph before executing [main.py](main.py).

### Modifying the Configuration <a name = "configuration"></a>

There are two already configured yml files, [confs/quakers.yml](confs/quakers.yml) and [confs/hamsterster.yml](confs/hamsterster.yml), with the following structure:

```yaml
tag: dev  # Required
spark:
  - config:  # The spark settings
      spark.master: local[*]  # Required
      spark.submit.deployMode: client  # Required
      spark_warehouse_folder: data/spark-warehouse  # Required
      spark.ui.port: 4040
      spark.driver.cores: 5
      spark.driver.memory: 8g
      spark.driver.memoryOverhead: 4096
      spark.driver.maxResultSize: 0
      spark.executor.instances: 2
      spark.executor.cores: 3
      spark.executor.memory: 4g
      spark.executor.memoryOverhead: 4096
      spark.sql.broadcastTimeout: 3600
      spark.sql.autoBroadcastJoinThreshold: -1
      spark.sql.shuffle.partitions: 4
      spark.default.parallelism: 4
      spark.network.timeout: 3600s
    dirs:
      df_data_folder: data/dataframes  # Folder to store the DataFrames as parquets
      spark_warehouse_folder: data/spark-warehouse
      checkpoints_folder: data/checkpoints
      communities_csv_folder: data/csv_data  # Folder to save the computed communities as csvs
input:
  - config:  # All properties required
      name: Quakers
      nodes:
        path: data/input_graphs/Quakers/quakers_nodelist.csv2  # Path to the nodes file
        has_header: true  # Whether it has a header with the attribute names
        delimiter: ','
        encoding: ISO-8859-1
        feature_names:  # You can rename the attributes (the count should match the original)
          - id
          - Historical_Significance
          - Gender
          - Birthdate
          - Deathdate
          - internal_id
      edges:
        path: data/input_graphs/Quakers/quakers_edgelist.csv2  # Path to the edges file
        has_header: true  # Whether it has a header with the source and dest
        has_weights: false  # Whether it has a weight column
        delimiter: ','
    type: local
run_options:  # All properties required
  - config:
      cached_init_step: false  # Whether the cosine similarities and edge betweenness have already been computed
      # See the paper for info regarding the following attributes
      feature_min_avg: 0.33
      r_lvl1_thres: 0.50
      r_lvl2_thres: 0.85
      max_edge_weight: 0.50
      betweenness_thres: 10
      max_sp_length: 2
      min_comp_size: 2
      max_steps: 30  # Max steps for the algorithm to run if it doesn't converge
      features_to_check:  # Which attributes to take into consideration for the cosine similarities
        - id
        - Gender
output:  # All properties required
  - config:
      logs_folder: data/logs
      save_communities_to_csvs: false  # Whether to save the computed communities in csvs or not
      visualizer:
        dimensions: 3  # Dimensions of the scatter plot (2 or 3)
        save_img: true
        folder: data/plots
        steps:  # The steps to plot
          - 0   # The step before entering the main loop
          - -1  # The last step
```

The `!ENV` flag indicates that an environment variable follows. For example, you can set:<br>`logs_folder: !ENV ${LOGS_FOLDER}`<br>
You can change the values and environment variable names as you wish.
If a yaml variable is renamed, added, or deleted, the corresponding changes should also be reflected in the [Configuration class](configuration/configuration.py) and the [yml_schema.json](configuration/yml_schema.json).
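For illustration only, here is a minimal sketch of how such an `!ENV` tag can be resolved with PyYAML; the project's own [Configuration class](configuration/configuration.py) and [yml_schema.json](configuration/yml_schema.json) handle the actual parsing and validation, so this is just to show the idea.

```python
# Illustrative only -- a minimal PyYAML constructor for an !ENV tag.
import os
import re
import yaml

ENV_VAR = re.compile(r"\$\{([^}]+)\}")

def env_constructor(loader, node):
    """Replace every ${VAR} in the tagged scalar with its value from the environment."""
    raw = loader.construct_scalar(node)
    return ENV_VAR.sub(lambda m: os.environ.get(m.group(1), ""), raw)

yaml.SafeLoader.add_constructor("!ENV", env_constructor)

os.environ.setdefault("LOGS_FOLDER", "data/logs")
conf = yaml.safe_load("logs_folder: !ENV ${LOGS_FOLDER}")
print(conf["logs_folder"])  # -> data/logs (or whatever $LOGS_FOLDER is set to)
```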
### Execution Options <a name = "execution_options"></a>

First, make sure you are in the created virtual environment:

```bash
$ source venv/bin/activate
(venv)

$ which python
/home/drkostas/Projects/HGN/venv/bin/python
(venv)
```

Now, in order to run the code you can either call `main.py` directly or use the `hgn` console script.

```bash
$ python main.py -h
usage: main.py -c CONFIG_FILE [-d] [-h]

A Distributed Hybrid Community Detection Methodology for Social Networks.

Required Arguments:
  -c CONFIG_FILE, --config-file CONFIG_FILE
                        The configuration yml file

Optional Arguments:
  -d, --debug           Enables the debug log messages
  -h, --help            Show this help message and exit


# Or

$ hgn --help
usage: hgn -c CONFIG_FILE [-d] [-h]

A Distributed Hybrid Community Detection Methodology for Social Networks.

Required Arguments:
  -c CONFIG_FILE, --config-file CONFIG_FILE
                        The configuration yml file

Optional Arguments:
  -d, --debug           Enables the debug log messages
  -h, --help            Show this help message and exit
```
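For example, `python main.py -c confs/quakers.yml` runs the algorithm on the Quakers graph. If you set `save_communities_to_csvs: true` in the output config, the detected communities are written under `data/csv_data`; a short PySpark snippet like the one below (an illustrative sketch, the exact file layout and column names depend on the run) can be used to inspect them afterwards.

```python
# Illustrative sketch: peek at the exported communities.
# Assumes save_communities_to_csvs was true; adjust the path if the run
# writes into per-step subfolders, and note the schema depends on the run.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("inspect-communities").getOrCreate()

communities = spark.read.csv("data/csv_data", header=True, inferSchema=True)
communities.printSchema()
communities.show(10, truncate=False)

spark.stop()
```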
\u003ca name = \"ci\"\u003e\u003c/a\u003e\n\nFor the continuous integration, the \u003cb\u003eCircleCI\u003c/b\u003e service is being used. \nFor more information you can check the [setup guide](https://circleci.com/docs/2.0/language-python/). \n\nAgain, you should set the [above-mentioned environmental variables](#env_variables) ([reference](https://circleci.com/docs/2.0/env-vars/#setting-an-environment-variable-in-a-context))\nand for any modifications, edit the [circleci config](/.circleci/config.yml).\n\n## TODO \u003ca name = \"todo\"\u003e\u003c/a\u003e\n\nRead the [TODO](TODO.md) to see the current task list.\n\n## Built With \u003ca name = \"built_with\"\u003e\u003c/a\u003e\n\n* [Apache Spark 2.4.5](https://spark.apache.org/docs/2.4.5/) - Fast and general-purpose cluster computing system\n* [GraphFrames 0.8.0](https://github.com/graphframes/graphframes/tree/v0.8.0) - A package for Apache Spark which provides DataFrame-based Graphs.\n* [CircleCI](https://www.circleci.com/) - Continuous Integration service\n\n\n## License \u003ca name = \"license\"\u003e\u003c/a\u003e\n\nThis project is licensed under the GNU License - see the [LICENSE](LICENSE) file for details.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdrkostas%2Fhgn","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdrkostas%2Fhgn","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdrkostas%2Fhgn/lists"}