{"id":17793008,"url":"https://github.com/evcu/nalwe","last_synced_at":"2025-04-02T02:18:06.261Z","repository":{"id":79244275,"uuid":"65491252","full_name":"evcu/Nalwe","owner":"evcu","description":"Results of various trained w2v models.","archived":false,"fork":false,"pushed_at":"2016-12-13T23:40:13.000Z","size":465,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-02-07T17:15:38.376Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/evcu.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-08-11T18:11:59.000Z","updated_at":"2016-12-13T23:40:15.000Z","dependencies_parsed_at":"2023-04-18T11:57:05.276Z","dependency_job_id":null,"html_url":"https://github.com/evcu/Nalwe","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/evcu%2FNalwe","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/evcu%2FNalwe/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/evcu%2FNalwe/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/evcu%2FNalwe/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/evcu","download_url":"https://codeload.github.com/evcu/Nalwe/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246741117
,"owners_count":20826067,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-27T11:03:43.146Z","updated_at":"2025-04-02T02:18:06.163Z","avatar_url":"https://github.com/evcu.png","language":"C","funding_links":[],"categories":[],"sub_categories":[],"readme":"# The importance of normalisation while learning word embeddings \n\nThis code was written by [Utku Evci](https://github.com/evcu/) during an internship at ILPS, UvA in Summer 2016.\nThe scripts provided aim to facilitate training word2vec and GloVe vectors on SurfSara's Lisa cluster. \nCurrently the project only includes scripts for _word2vec_ training. However, extending it to GloVe requires a little work. \nTo submit multiple jobs to the cluster, a grouping approach is taken: you provide a corpus, a vocabulary and a set of training parameters. The script then automatically submits multiple jobs covering all combinations of the parameter set to the cluster under the name of a *JOB_GROUP* and tracks them.\n\n### Quick Take: _How To Run Project_\n\nThe steps needed to use the code are demonstrated in the `./demo.sh` script and summarized below. \n\n1. __Download project to your home folder__\n * All the folders (`output/`, `nalwe/`, etc.) and files (`genJob.sh`, etc.) of the project should be in the $HOME folder of your SurfSara account for proper execution. (TODO: update the code to use git clone directly.)  \n2. __Get the corpus__\n * Download the wiki-corpus and process it with Matt Mahoney's perl script. Make sure that the clean corpus is in the `corpuss/` folder.  
\n * You can submit the download and processing of the latest Wikipedia dump as a job to Lisa with `corpuss/getWikiCorpusJob.sh`, or use `corpuss/getWikiCorpusOnPlace.sh` for immediate execution.  \n3. __Generate Vocabularies__ \n * Open the `submitVocJobs.sh` file and set the parameter space by updating the _PARAMETER SET_ section in the script.  \n * Change the `CALL_PER_JOB` variable to set how many vocabularies to generate in one job. Since vocabulary generation is fast relative to training, you can submit all vocabularies in 1 job.  \n * Change the `MAX_TIME_PER_JOB` variable to set the maximum time. A value as small as 10 minutes is usually enough.  \n * Execute the `submitVocJobs.sh` script, passing it the name of the corpus file located in the `corpuss/` folder.  \n * The results are saved to `output/vocs/`.  \n4. __Train Models__\n * Open the `submitw2vJobs.sh` file and set the parameter space by updating the _PARAMETER SET_ section in the script.\n * Change the `CALL_PER_JOB` variable to set how many models to train in one job. \n * Change the `MAX_TIME_PER_JOB` variable to set the maximum time. You can refer to the results of previous trainings (links) to estimate the maximum time.\n * Submit the w2v training with the vocabulary file and corpus path. The vocabulary path is relative to `output/vocs/` and the corpus path is relative to `corpuss/`.\n * The results are saved to `output/vecs/`.\n * The training results are printed to the standard output of the Lisa job and saved to the _$HOME_ folder when the job is finished. \n\n5. __Process Output__\n * A next direction for the project is to present the evaluation of models in an online JavaScript page. 
Therefore the project includes a post-processing script that combines the log files, which contain the hyper-parameter set-up of each model and the results of evaluation and training, into a _JSON_ array.\n * You just need to execute the script `pp_`*JOBGROUP*`.sh` in the _$HOME_ folder; the resulting JSON file is saved in the same folder as the resulting vectors, i.e. `/output/vecs/_JOBGROUP_/`. If the JSON generation is successful, the logs and the job-output files are removed.\n\n### Details \n\nThe code has the following organization: \n\nFolder \t| Description\n-\t\t|-\ncorpuss\t|Various common corpora are downloaded into this directory. It contains two scripts for downloading the wiki-corpus and Matt Mahoney's perl script to process it.\njobs\t|Includes the scripts and tar archives of submitted jobs. The `logs/` folder includes the hyperparameter information of submitted job groups.\nnalwe\t|Includes the various scripts for training and evaluation along with the C code of word2vec and GloVe. \noutput \t|Contains the vocabulary and vector outputs: `vocs/` for vocabulary files, `vecs/` for trained word embeddings, and `cooc/` for co-occurrence files used by the GloVe algorithm.\noldScripts | Scripts used for corpus download.\n\nScript \t| Description \t| Usage\n- \t\t| -\t\t\t\t| -\n`./genJob.sh`\t|Prepares a tar snapshot of the `nalwe/` folder and generates a script with the given name (_e.g. job1_) to submit to the cluster. After the script (*./script_local*) is executed in the `nalwe/` folder, the files in `nalwe/out/` are copied to the output folder provided (*output_vecs/*).  |`./genJob.sh job1 output/vecs/ ./script_local 3:00:00`\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fevcu%2Fnalwe","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fevcu%2Fnalwe","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fevcu%2Fnalwe/lists"}