{"id":20329527,"url":"https://github.com/raypereda/good-urls","last_synced_at":"2025-07-22T03:04:36.613Z","repository":{"id":69360391,"uuid":"212445635","full_name":"raypereda/good-urls","owner":"raypereda","description":null,"archived":false,"fork":false,"pushed_at":"2019-10-24T14:05:14.000Z","size":4438,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-04-21T02:09:26.728Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/raypereda.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-10-02T21:34:00.000Z","updated_at":"2024-06-19T10:21:03.231Z","dependencies_parsed_at":"2023-06-03T20:30:07.990Z","dependency_job_id":null,"html_url":"https://github.com/raypereda/good-urls","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/raypereda/good-urls","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raypereda%2Fgood-urls","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raypereda%2Fgood-urls/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raypereda%2Fgood-urls/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raypereda%2Fgood-urls/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/raypereda","download_url":"https://codeload.github.com/raypereda/good-urls/tar.gz/refs/heads/master","sb
om_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raypereda%2Fgood-urls/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266417042,"owners_count":23925300,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-22T02:00:09.085Z","response_time":66,"last_error":null,"robots_txt_status":null,"robots_txt_updated_at":null,"robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-14T20:11:37.589Z","updated_at":"2025-07-22T03:04:36.586Z","avatar_url":"https://github.com/raypereda.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Good URLs  \u003c!-- omit in toc --\u003e\n\n- [Introduction](#introduction)\n- [Setting up fastText](#setting-up-fasttext)\n- [Background Preprocessing the Input Text](#background-preprocessing-the-input-text)\n- [Estimating Accuracy on Unseen](#estimating-accuracy-on-unseen)\n- [Making Predictions Using the Neural Network](#making-predictions-using-the-neural-network)\n- [Assembling the Report](#assembling-the-report)\n\n## Introduction\n\nGiven a URL's text, can a neural network predict if it is going to be a\ngood URL?\n\nWe're going to find out. What is good or bad does not need to be specified.\nAll that is needed is examples, hundreds or thousands of them. 
What Good URL means is between\nyou and the trained neural network.\n\nWe use a neural network to predict the status of a URL: good or bad.\nWe retrieve the visible text from a URL and use that as the input for the network.\nThe library used is [fastText](https://fasttext.cc/). We use its text\nclassification feature and follow the dataflow shown in this\n[tutorial](https://fasttext.cc/docs/en/supervised-tutorial.html) from the project.\n\n## Setting up fastText\n\nThe input files need to be UTF-8 with Unix-style line endings, LF.\nThis tutorial assumes that you're running on Windows, but the steps could also be run on\nLinux. We are going to be working at the terminal. Here are some common options:\n\n1. `PowerShell` - not recommended, it writes UTF-16 by default.\n2. `DOS Command Prompt` - not recommended, it writes Windows-style line endings, CRLF.\n3. `Git Bash` - recommended, it writes UTF-8 and Unix-style line endings.\n\nCopy the files in the `bin` directory to a directory listed in your `$PATH`\nenvironment variable. I use `/c/Users/rpereda/bin`, which\nis on my `$PATH`.\n\nHere are the files:\n\n1. `fasttext.exe` - the command-line interface to fastText\n2. `fasttext.dll` - a supporting dynamic library for the .exe\n3. `normalize.exe` - a line-by-line, character-by-character text normalizer\n4. `shuffle.exe` - shuffler of lines, required for the learning process\n5. `rowpaste.exe` - pastes two CSV files together row-wise\n6. `rowcut.exe` - cuts columns out of a CSV file row-wise\n\n`fasttext.exe` and `fasttext.dll` together provide the command-line interface to the library.\n\n```bash\n$ ls -1sh\n... 
fill in later\n```\n\n| file                          | purpose                                                 |\n| ----------------------------- | ------------------------------------------------------- |\n| training.xlsx                 | unformatted training data used from the MLR 2018 Report |\n| model_goodness.bin            | neural network created during training                  |\n| model_goodness.vec            | dictionary created during training                      |\n| prediction-model??.txt        | prediction on the training data; should be 99% accurate |\n\n1. Start with the initial human-vetted training data: training.xlsx\n2. Use Excel to narrow it down to this file: training.prn\n3. `cat training.prn | normalize \u003e training-n.prn`\n4. `shuffle training-n.prn \u003e training-ns.prn`\n5. `fasttext supervised -input training-ns.prn -output model_goodness -lr 1.0 -epoch 25 -wordNgrams 2`\n   That creates two files: model_goodness.{bin, vec}.\n6. `fasttext test model_goodness.bin training-ns.prn`\n   This measures the accuracy of the model on the training data.\n\n## Background Preprocessing the Input Text\n\nHere is an excerpt of the Go program that normalizes the text. The text is\ntransformed line by line. The label, in this case the country, is prefixed with `__label__`.\nThis prefix is how fastText distinguishes labels in a large amount of text.\nSpaces are added around some punctuation marks. Double quotes are deleted. Semicolons and\ncolons are replaced by a space. Each run of white space is compressed to a single space.\n\nDigits are mapped to the symbol @. The reason for doing this is for the neural network\nto focus on learning the grammar of an address based on the shape of numbers, not a specific\nnumber. By shape of numbers, we mean the number of digits. So an address that\nends with 5 digits is represented by @@@@@, and the network sees the pattern more readily than if it focused\non a specific zip code. 
In effect, 90803 and 92705 are treated as a single word, namely @@@@@.\n\n```go\nline := scanner.Text()\nline = strings.ToLower(line)\n// each input line starts with its label, so prepend the fastText label prefix\nline = \"__label__\" + line\n// surround punctuation with spaces so each mark becomes its own token\nline = strings.ReplaceAll(line, \"'\", \" ' \")\nline = strings.ReplaceAll(line, `\"`, \"\")\nline = strings.ReplaceAll(line, \".\", \" . \")\nline = strings.ReplaceAll(line, \"\u003cbr /\u003e\", \" \")\nline = strings.ReplaceAll(line, \",\", \" , \")\nline = strings.ReplaceAll(line, \"(\", \" ( \")\nline = strings.ReplaceAll(line, \")\", \" ) \")\nline = strings.ReplaceAll(line, \"!\", \" ! \")\nline = strings.ReplaceAll(line, \"?\", \" ? \")\nline = strings.ReplaceAll(line, \";\", \" \")\nline = strings.ReplaceAll(line, \":\", \" \")\n// compress each run of white space to a single space\nspace := regexp.MustCompile(`\\s+`)\nline = space.ReplaceAllString(line, \" \")\n// map every digit to @ so only the shape of a number remains\nline = strings.ReplaceAll(line, \"0\", \"@\")\nline = strings.ReplaceAll(line, \"1\", \"@\")\nline = strings.ReplaceAll(line, \"2\", \"@\")\nline = strings.ReplaceAll(line, \"3\", \"@\")\nline = strings.ReplaceAll(line, \"4\", \"@\")\nline = strings.ReplaceAll(line, \"5\", \"@\")\nline = strings.ReplaceAll(line, \"6\", \"@\")\nline = strings.ReplaceAll(line, \"7\", \"@\")\nline = strings.ReplaceAll(line, \"8\", \"@\")\nline = strings.ReplaceAll(line, \"9\", \"@\")\n\n```\n\n## Estimating Accuracy on Unseen\n\nThe standard way to do this is to split off part of the training\ndata and save it for testing. 
\n\n`P@1` and `R@1` are precision and recall at one, that is, for the single top-ranked label, namely country.\nBecause each example has exactly one label, precision and recall are the same,\nso in this case we can simply call the value accuracy.\n\n```bash\n$ head -13000 training-ns.prn \u003e training-ns-head.prn\n$ tail -358 training-ns.prn \u003e training-ns-tail.prn\n\n$ time fasttext test model_goodness.bin training-ns-head.prn\nreplace later\n\nN       13000\nP@1     0.996\nR@1     0.996\nNumber of examples: 13000\n\nreal    0m1.908s\nuser    0m0.015s\nsys     0m0.094s\n\n$ time fasttext test model_goodness-.bin training-ns-tail358.prn\nN       357\nP@1     0.969\nR@1     0.969\nNumber of examples: 357\n\nreal    0m1.604s\nuser    0m0.015s\nsys     0m0.094s\n\n$ time fasttext predict-prob model_goodness.bin training-ns-predict.prn \u003e prediction-prod-on-model.txt\n\nreal    0m2.153s\nuser    0m0.062s\nsys     0m0.108s\n\n```\n\n## Making Predictions Using the Neural Network\n\n```bash\nfasttext predict-prob model_goodness.bin goodness.prn \u003e ???\n```\n\nHere is a sample of the output file:\n\n```bash\ntodo\n```\n\nIf the probability is larger than one, as in 1.00001, that is a minor rounding\nerror in fastText. Don't worry about it.\n\n## Assembling the Report\n\nWe will simply append two columns to the input file. 
\n\n```bash\n$ rowcut -c=1 urls.csv |                        # cuts out the 2nd column, url text\n  normalize -l=0 |                              # normalizes the text\n  fasttext predict-prob model_goodness.bin - |  # runs fasttext\n  cut -c 10- |                                  # removes the __label__ prefix\n  tr \" \" \",\" \u003e predictions.csv                  # makes a valid csv\n\n# this combines the input CSV and the predictions CSV\n$ rowpaste urls.csv predictions.csv \u003e urls-with-predictions.csv\n\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fraypereda%2Fgood-urls","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fraypereda%2Fgood-urls","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fraypereda%2Fgood-urls/lists"}