{"id":15633160,"url":"https://github.com/ogrisel/pignlproc","last_synced_at":"2025-10-26T15:03:25.718Z","repository":{"id":1245857,"uuid":"1184317","full_name":"ogrisel/pignlproc","owner":"ogrisel","description":"Apache Pig utilities to build training corpora for machine learning / NLP out of public Wikipedia and DBpedia dumps.","archived":false,"fork":false,"pushed_at":"2022-11-08T12:52:10.000Z","size":885,"stargazers_count":158,"open_issues_count":6,"forks_count":64,"subscribers_count":20,"default_branch":"master","last_synced_at":"2025-05-08T21:12:58.332Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ogrisel.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2010-12-20T13:53:53.000Z","updated_at":"2024-04-02T17:40:32.000Z","dependencies_parsed_at":"2023-01-11T16:03:27.168Z","dependency_job_id":null,"html_url":"https://github.com/ogrisel/pignlproc","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ogrisel/pignlproc","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ogrisel%2Fpignlproc","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ogrisel%2Fpignlproc/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ogrisel%2Fpignlproc/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ogrisel%2Fpignlproc/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ogrisel","download_url":"https://codeload.github.com/ogrisel/pignlproc/tar.gz/refs/heads/master","sbom_
url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ogrisel%2Fpignlproc/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":281121746,"owners_count":26447229,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-26T02:00:06.575Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-03T10:47:11.842Z","updated_at":"2025-10-26T15:03:25.692Z","avatar_url":"https://github.com/ogrisel.png","language":"Java","funding_links":[],"categories":["Artificial Intelligence"],"sub_categories":["Natural Language Processing"],"readme":"# pignlproc\n\n**This project is archived.**\n\nApache Pig utilities to build training corpora for machine learning /\nNLP out of public Wikipedia and DBpedia dumps.\n\n## Project status\n\nThis project is alpha / experimental code. 
Features are implemented when needed.\n\nSome preliminary results are available in this blog post:\n\n  * [Mining Wikipedia with Hadoop and Pig for Natural Language Processing](http://www.nuxeo.com/blog/mining-wikipedia-with-hadoop-and-pig-for-natural-language-processing/)\n\n## Building from source\n\nInstall Maven (tested with 2.2.1) and Java JDK 6, then:\n\n    $ mvn assembly:assembly\n\nThis should download the dependencies, build a jar in the target/\nsubfolder and run the tests.\n\n## Usage\n\nThe following introduces some sample scripts that demonstrate the User Defined\nFunctions provided by pignlproc for some practical Wikipedia mining tasks.\n\nThese examples show how to use pig on your local machine on sample\nfiles. In production (with complete dumps) you might want to start up a\nreal Hadoop cluster, upload the dumps into HDFS, adjust the above paths\nto match your setup and remove the '-x local' command line parameter to\ntell pig to use your Hadoop cluster.\n\nThe [pignlproc wiki](https://github.com/ogrisel/pignlproc/wiki) provides\ncomprehensive documentation on where to download the dumps from and how\nto set up a Hadoop cluster on EC2 using [Apache Whirr](\nhttp://incubator.apache.org/whirr).\n\n### Extracting links from a raw Wikipedia XML dump\n\nYou can start from the extract_links.pig example script:\n\n    $ pig -x local \\\n      -p PIGNLPROC_JAR=target/pignlproc-0.1.0-SNAPSHOT.jar \\\n      -p LANG=fr \\\n      -p INPUT=src/test/resources/frwiki-20101103-pages-articles-sample.xml \\\n      -p OUTPUT=/tmp/output \\\n      examples/extract_links.pig\n\n### Building a NER training / evaluation corpus from Wikipedia and DBpedia\n\nThe goal of these sample scripts is to extract a pre-formatted corpus\nsuitable for the training of sequence labeling algorithms such as MaxEnt\nor CRF models with [OpenNLP](http://incubator.apache.org/opennlp),\n[Mallet](http://mallet.cs.umass.edu/) or\n[crfsuite](http://www.chokkan.org/software/crfsuite/).\n\nTo achieve 
this you can run the following scripts (split into somewhat\nindependent parts that store intermediate results, to avoid recomputing\neverything from scratch when you change the source files or some parameters).\n\nThe first script parses a Wikipedia dump and extracts occurrences of\nsentences with outgoing links along with some ordering and positioning\ninformation:\n\n    $ pig -x local \\\n      -p PIGNLPROC_JAR=target/pignlproc-0.1.0-SNAPSHOT.jar \\\n      -p LANG=en \\\n      -p INPUT=src/test/resources/enwiki-20090902-pages-articles-sample.xml \\\n      -p OUTPUT=workspace \\\n      examples/ner-corpus/01_extract_sentences_with_links.pig\n\nThe parser has been measured to process about 1 MB/s in local\nmode on a 2009 MacBook Pro.\n\nThe second script parses DBpedia dumps assumed to be in the folder\n/home/ogrisel/data/dbpedia:\n\n    $ pig -x local \\\n      -p PIGNLPROC_JAR=target/pignlproc-0.1.0-SNAPSHOT.jar \\\n      -p LANG=en \\\n      -p INPUT=/home/ogrisel/data/dbpedia \\\n      -p OUTPUT=workspace \\\n      examples/ner-corpus/02_dbpedia_article_types.pig\n\nThis step should complete in a couple of minutes in local mode.\n\nThis script could be adapted or replaced to use other typed-entity\nknowledge bases linked to Wikipedia with downloadable dumps in NT\nor TSV formats, for instance [Freebase](http://freebase.com) or\n[Uberblic](http://uberblic.org).\n\nThe third script merges the partial results of the first two scripts and\norders the results by grouping the sentences of the same article back\ntogether, so as to build annotated sentences suitable for OpenNLP\nfor instance:\n\n    $ pig -x local \\\n      -p PIGNLPROC_JAR=target/pignlproc-0.1.0-SNAPSHOT.jar \\\n      -p INPUT=workspace \\\n      -p OUTPUT=workspace \\\n      -p LANG=en \\\n      -p TYPE_URI=http://dbpedia.org/ontology/Person \\\n      -p TYPE_NAME=person \\\n      examples/ner-corpus/03bis_filter_join_by_type_and_convert.pig\n\n    $ head -3 
workspace/opennlp_person/part-r-00000\n    The Table Talk of \u003cSTART:person\u003e Martin Luther \u003cEND\u003e contains the story of a 12-year-old boy who may have been severely autistic .\n    The New Latin word autismus ( English translation autism ) was coined by the Swiss psychiatrist \u003cSTART:person\u003e Eugen Bleuler \u003cEND\u003e in 1910 as he was defining symptoms of schizophrenia .\n    Noted autistic \u003cSTART:person\u003e Temple Grandin \u003cEND\u003e described her inability to understand the social communication of neurotypicals , or people with normal neural development , as leaving her feeling \"like an anthropologist on Mars \" .\n\n\n### Building a document classification corpus\n\nTODO: Explain how to extract bag-of-words or n-gram and document frequency\nfeatures suitable for document classification using an SGD model from\n[Mahout](http://mahout.apache.org) for instance.\n\n\n## License\n\nCopyright 2010 Nuxeo and contributors:\n\n  Licensed under the Apache License, Version 2.0 (the \"License\");\n  you may not use this file except in compliance with the License.\n  You may obtain a copy of the License at\n\n  http://www.apache.org/licenses/LICENSE-2.0\n\n  Unless required by applicable law or agreed to in writing, software\n  distributed under the License is distributed on an \"AS IS\" BASIS,\n  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n  See the License for the specific language governing permissions and\n  limitations under the License.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fogrisel%2Fpignlproc","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fogrisel%2Fpignlproc","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fogrisel%2Fpignlproc/lists"}