{"id":13754454,"url":"https://github.com/dragnet-org/dragnet","last_synced_at":"2025-05-15T19:08:36.529Z","repository":{"id":3663924,"uuid":"4732636","full_name":"dragnet-org/dragnet","owner":"dragnet-org","description":"Just the facts -- web page content extraction","archived":false,"fork":false,"pushed_at":"2024-07-03T20:45:24.000Z","size":348329,"stargazers_count":1264,"open_issues_count":24,"forks_count":181,"subscribers_count":129,"default_branch":"master","last_synced_at":"2025-05-10T03:32:36.166Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dragnet-org.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGES.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2012-06-20T23:18:10.000Z","updated_at":"2025-05-07T15:50:36.000Z","dependencies_parsed_at":"2023-07-05T19:16:16.037Z","dependency_job_id":"08a01a30-d20d-4fdc-b516-d7f470ccd78d","html_url":"https://github.com/dragnet-org/dragnet","commit_stats":{"total_commits":290,"total_committers":23,"mean_commits":"12.608695652173912","dds":0.6827586206896552,"last_synced_commit":"4a1649d9b29bf64ccc5a86200e415e8b04cd257b"},"previous_names":["seomoz/dragnet"],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dragnet-org%2Fdragnet","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dragnet-org%2Fdragnet/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dragnet-org%2Fdragnet/releases","manifests_url":"https://repos.ec
osyste.ms/api/v1/hosts/GitHub/repositories/dragnet-org%2Fdragnet/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dragnet-org","download_url":"https://codeload.github.com/dragnet-org/dragnet/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254404357,"owners_count":22065641,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T10:00:20.053Z","updated_at":"2025-05-15T19:08:36.508Z","avatar_url":"https://github.com/dragnet-org.png","language":"Python","funding_links":[],"categories":["Python","Content extraction"],"sub_categories":[],"readme":"Dragnet\n=======\n\n[![Build Status](https://travis-ci.com/dragnet-org/dragnet.svg?branch=master)](https://travis-ci.com/dragnet-org/dragnet)\n\nDragnet isn't interested in the shiny chrome or boilerplate dressing\nof a web page. It's interested in... 'just the facts.'  The machine\nlearning models in Dragnet extract the main article content and\noptionally user generated comments from a web page.  
They provide\nstate-of-the-art performance on a variety of test benchmarks.\n\nFor more information on our approach, check out:\n\n* Our paper [_Content Extraction Using Diverse Feature Sets_](dragnet_www2013.pdf?raw=true), published\nat WWW in 2013, gives an overview of the machine learning approach.\n* [A comparison](https://moz.com/devblog/benchmarking-python-content-extraction-algorithms-dragnet-readability-goose-and-eatiht/) of Dragnet and alternative content extraction packages.\n* [This blog post](https://moz.com/devblog/dragnet-content-extraction-from-diverse-feature-sets/) explains the intuition behind the algorithms.\n\nThis project was originally inspired by\nKohlschütter et al., [Boilerplate Detection using Shallow Text Features](http://www.l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf) and\nWeninger et al., [CETR -- Content Extraction with Tag Ratios](https://www3.nd.edu/~tweninge/cetr/#main-content-area), and more recently by [Readability](https://github.com/buriy/python-readability).\n\n# GETTING STARTED\n\nDepending on your use case, we provide two separate functions to extract\njust the main article content or the content and any user-generated\ncomments.  Each function takes an HTML string and returns the content string.\n\n```python\nimport requests\nfrom dragnet import extract_content, extract_content_and_comments\n\n# fetch HTML\nurl = 'https://moz.com/devblog/dragnet-content-extraction-from-diverse-feature-sets/'\nr = requests.get(url)\n\n# get main article without comments\ncontent = extract_content(r.content)\n\n# get article and comments\ncontent_comments = extract_content_and_comments(r.content)\n```\n\nWe also provide a sklearn-style extractor class (complete with `fit` and\n`predict` methods). 
You can either train an extractor yourself or load a\npre-trained one:\n```python\nfrom dragnet.util import load_pickled_model\n\ncontent_extractor = load_pickled_model(\n            'kohlschuetter_readability_weninger_content_model.pkl.gz')\ncontent_comments_extractor = load_pickled_model(\n            'kohlschuetter_readability_weninger_comments_content_model.pkl.gz')\n\ncontent = content_extractor.extract(r.content)\ncontent_comments = content_comments_extractor.extract(r.content)\n```\n\n## A note about encoding\n\nIf you know the encoding of the document (e.g. from HTTP headers),\nyou can pass it down to the parser:\n\n```python\ncontent = content_extractor.extract(html_string, encoding='utf-8')\n```\n\nOtherwise, we try to guess the encoding from a `meta` tag or a specified\n`\u003c?xml encoding=\"..\"?\u003e` tag.  If that fails, we assume \"UTF-8\".\n\n## Installing\n\nDragnet is written in Python (developed with 2.7, with support recently\nadded for 3) and built on the numpy/scipy/Cython numerical computing\nenvironment.\nIn addition, we use [lxml](http://lxml.de/) (libxml2)\nfor HTML parsing.\n\nWe recommend installing from the master branch to ensure you have the latest\nversion.\n\n### Installing with Docker\n\nThis is the easiest way to install Dragnet. It builds a Docker\ncontainer with Dragnet and its dependencies.\n\n1. Install [Docker](https://docs.docker.com/get-docker/).\n2. Clone the master branch: `git clone https://github.com/dragnet-org/dragnet.git`\n3. Build the Docker container: `docker build -t dragnet .`\n4. Run the tests: `docker run dragnet make test`\n\nYou can also run an interactive Python session:\n```bash\ndocker run -ti dragnet python3\n```\n\n### Installing without Docker\n\n1.  Install the dependencies needed for Dragnet. The build depends on\nGCC, numpy, Cython and lxml (which in turn depends on `libxml2`). 
We\nuse `provision.sh` to set up the dependencies in the Docker container,\nso you can use it as a template and modify it as appropriate for your\noperating system.\n2.  Clone the master branch: `git clone https://github.com/dragnet-org/dragnet.git`\n3.  Install the requirements: `cd dragnet; pip install -r requirements.txt`\n4.  Build dragnet:\n\n```bash\n$ cd dragnet\n$ make install\n# these should now pass\n$ make test\n```\n\n# Contributing\n\nWe love contributions! Open an issue, or fork/create a pull\nrequest.\n\n# More details about the code structure\n\nThe `Extractor` class encapsulates a blockifier, some feature extractors and a machine learning model.\n\nA blockifier implements `blockify`, which takes an HTML string and returns a list\nof block objects.  A feature extractor is a callable that takes a list\nof blocks and returns a numpy array of features with shape `(len(blocks), nfeatures)`.\nThere is some additional optional functionality\nto \"train\" the features (e.g. estimate parameters needed for centering)\nspecified in `features.py`.  The machine learning model implements\nthe [scikit-learn](http://scikit-learn.org/stable/) interface (`predict` and `fit`) and is used to compute\nthe content/no-content prediction for each block.\n\n# Training/test data\n\nThe training and test data are available at [dragnet_data](https://github.com/seomoz/dragnet_data).\n\n# Training content extraction models\n\n0.  Download the training data (see above).  In what follows `ROOTDIR` contains\n    the root of the `dragnet_data` repo, or another directory with a similar\n    structure (`HTML` and `Corrected` sub-directories).\n1.  
Create the block-corrected files needed to do supervised learning on the block level.\n    First make a sub-directory `$ROOTDIR/block_corrected/` for the output files, then run:\n\n    ```python\n    from dragnet.data_processing import extract_all_gold_standard_data\n    rootdir = '/path/to/dragnet_data/'\n    extract_all_gold_standard_data(rootdir)\n    ```\n\n    This solves the longest common subsequence problem to determine\n    which blocks were extracted in the gold standard.\n    Occasionally this will fail if lxml (libxml2) cannot parse\n    an HTML document.  In this case, remove the offending document and restart\n    the process.\n2.  Use k-fold cross-validation on the training set to do model selection\n    and set any hyperparameters.  Make decisions about the following:\n\n    * Whether to use just article content or content and comments.\n    * The features to use.\n    * The machine learning model to use.\n\n    For example, to train the randomized decision tree classifier from\n    sklearn using the shallow text features from Kohlschuetter et al.,\n    the CETR features from Weninger et al., and the `readability` features:\n\n    ```python\n    from dragnet.extractor import Extractor\n    from dragnet.model_training import train_model\n    from sklearn.ensemble import ExtraTreesClassifier\n\n    rootdir = '/path/to/dragnet_data/'\n\n    features = ['kohlschuetter', 'weninger', 'readability']\n\n    to_extract = ['content', 'comments']   # or ['content']\n\n    model = ExtraTreesClassifier(\n        n_estimators=10,\n        max_features=None,\n        min_samples_leaf=75\n    )\n    base_extractor = Extractor(\n        features=features,\n        to_extract=to_extract,\n        model=model\n    )\n\n    extractor = train_model(base_extractor, rootdir)\n    ```\n\n    This trains the model and, if a value is passed to `output_dir`, writes a\n    pickled version of it along with some *block level* classification\n    errors to a file in the specified `output_dir`. 
If no `output_dir` is\n    specified, the block-level performance is printed to stdout.\n3.  Once you have decided on a final model, train it on the entire training\n    set using `dragnet.model_training.train_model`.\n4.  As a last step, test the performance of the model on the test set (see\n    below).\n\n## Evaluating content extraction models\n\nUse `evaluate_model_predictions` in `model_training` to compute the token-level\naccuracy, precision, recall, and F1.  For example, to evaluate a trained model,\nrun:\n\n```python\nfrom dragnet.compat import train_test_split\nfrom dragnet.data_processing import prepare_all_data\nfrom dragnet.model_training import evaluate_model_predictions\n\nrootdir = '/path/to/dragnet_data/'\ndata = prepare_all_data(rootdir)\ntraining_data, test_data = train_test_split(data, test_size=0.2, random_state=42)\n\ntest_html, test_labels, test_weights = extractor.get_html_labels_weights(test_data)\ntrain_html, train_labels, train_weights = extractor.get_html_labels_weights(training_data)\n\nextractor.fit(train_html, train_labels, weights=train_weights)\npredictions = extractor.predict(test_html)\nscores = evaluate_model_predictions(test_labels, predictions, test_weights)\n```\n\nNote that this is the same evaluation that is run/printed in `train_model`.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdragnet-org%2Fdragnet","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdragnet-org%2Fdragnet","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdragnet-org%2Fdragnet/lists"}