{"id":19333888,"url":"https://github.com/arne-cl/ppi_graphkernel","last_synced_at":"2026-01-10T08:04:56.941Z","repository":{"id":16276789,"uuid":"19025166","full_name":"arne-cl/ppi_graphkernel","owner":"arne-cl","description":"all-paths graph kernel for protein-protein interaction extraction","archived":false,"fork":false,"pushed_at":"2014-04-22T10:08:27.000Z","size":4520,"stargazers_count":12,"open_issues_count":0,"forks_count":4,"subscribers_count":2,"default_branch":"master","last_synced_at":"2023-10-20T17:31:27.158Z","etag":null,"topics":["graph-kernel","natural-language-processing","nlp","ppi","protein-protein-interaction","python"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/arne-cl.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2014-04-22T10:06:48.000Z","updated_at":"2021-12-09T01:58:52.000Z","dependencies_parsed_at":"2022-09-24T11:53:07.041Z","dependency_job_id":null,"html_url":"https://github.com/arne-cl/ppi_graphkernel","commit_stats":null,"previous_names":[],"tags_count":0,"template":null,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arne-cl%2Fppi_graphkernel","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arne-cl%2Fppi_graphkernel/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arne-cl%2Fppi_graphkernel/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arne-cl%2Fppi_graphkernel/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/arne-cl","download_url":"https://codeload.g
ithub.com/arne-cl/ppi_graphkernel/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223906265,"owners_count":17223046,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["graph-kernel","natural-language-processing","nlp","ppi","protein-protein-interaction","python"],"created_at":"2024-11-10T02:55:30.935Z","updated_at":"2024-11-10T02:55:31.570Z","avatar_url":"https://github.com/arne-cl.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"PPI-learning with all-dependency-paths kernel\n=============================================\n\nThis is my \"fork\" of the graph kernel implemented in Python by\nAirola et al. 
(2008), which is described\n`on their website \u003chttp://mars.cs.utu.fi/PPICorpora/GraphKernel.html\u003e`_.\nMy intention in playing with the original code is to better understand graph\nkernels on dependency graphs.\n\nSo far, I have made the following changes:\n\n- structured the codebase into a package ``ppi_graphkernels`` and subpackages\n- added a setup.py to install the package system-wide with dependencies\n- added documentation to some methods/functions\n- replaced some Java-isms with more 'pythonic' code\n\nThe rest of this document contains the original README (converted to\nreStructuredText).\n\n\nSOFTWARE FOR PPI-EXTRACTION WITH GRAPH KERNELS\n----------------------------------------------\n\nThis package contains an implementation of the all-dependency-paths graph kernel\ndescribed in the paper \"A Graph Kernel for Protein-Protein Interaction\nExtraction\", presented at the ACL 2008 BioNLP Workshop. In addition, many of the\nscripts used in experiments done with the kernel are provided, including software\nfor preprocessing the data, an implementation of the Sparse RLS learning\nalgorithm, and software for doing efficient cross-validation using the algorithm.\n\nThe graph kernel is based on calculating the similarities of dependency graph\nrepresentations of sentences. Thus, prior to running the system, you should parse\nyour sentences with a parser capable of supplying such an analysis, and supply\nthis information in the XML format used by the system. The system has been\ndeveloped based on the collapsed Stanford format.\n\nThe analysis XML format that the graph kernel software processes is derived from\nthe XML format used for the transformations introduced in the paper \"Comparative\nAnalysis of Five Protein-protein Interaction Corpora\" presented at the LBM07\nconference. 
The transformation software for five publicly available corpora is\navailable at\n\nhttp://mars.cs.utu.fi/PPICorpora/\n\nThese are the same corpora for which the test results for the graph kernel are\nreported. The fold split used when cross-validating on the full corpora\nis also provided.\n\nNote that many of the scripts assume that the input and output files are compressed\nin the gzip format.\n\nThis is research software, provided as is, without express or implied warranties;\nsee licence.txt for more details. We have tried to make it reasonably usable and\nhave provided help options, but adapting the system to new environments or transforming\na corpus to the format used by the system may require significant effort.\nContact us if you run into problems or need further information about the\nexperimental procedure used when testing the kernel.\n\nQUICKSTART\n----------\n\nNote that most of the scripts used here have a -h (help) option you can use\nto check the available options.\n\nExample files, such as the binarized version of the BioInfer corpus, are provided in\nthe XML format processed by the system. To train a system on a file CORPUS.XML, which\ncontains a parse produced by MYPARSER and a tokenization produced by MYTOKENIZER,\nproceed as follows (in the example file MYPARSER = split_parse and\nMYTOKENIZER = split).\n\nThe script ConvertCorpus.py can be used to transform the XML format used\nin the aforementioned LBM07 paper to the format used by the graph kernel\nsoftware. 
The resulting file will still need parses and tokenizations.\nOnce they are in, the script HyphenSplitter.py can be used to modify the\ntokenization and parse so that tokens such as \"actin-binding\" are split in\ntwo. This way, words such as \"binding\" in this case will not be blinded when protein\nnames are blinded.\n\nFirst, build a dictionary that maps the possible features to a running index.\n\n::\n\n    python BuildDictionaryMapping.py -i CORPUS.XML.gz -p MYPARSER -t MYTOKENIZER \\\n    -o dictionary.txt.gz\n\nSecond, compute the graph kernels for your data, producing a linearized feature\nrepresentation corresponding to the graph kernels.\n\n::\n\n    python LinearizeAnalysis.py -i CORPUS.XML.gz -o LinearCorpus.gz -p MYPARSER \\\n    -t MYTOKENIZER\n\nThe software can be run in two modes, which affect how the G matrix is\nconstructed. \"-m max\" is the default option which corresponds to how\n\n\nShould you wish to convert a data file containing separate test\ndata, linearize it using the dictionary produced from the training\ndata. Still, there is no harm in creating the dictionary from a file that\ncontains both the training and test data: the features that appear only in\nthe test data will not affect the learned hypothesis for the RLS learner.\nThus, for cross-validation, the dictionary doesn't have to be rebuilt for\neach split.\n\nThird, you can normalize the data vectors to unit length. Sometimes this\nimproves the results, sometimes it makes them worse.\n\n::\n\n    python NormalizeData.py -i LinearCorpus.gz -o NormalizedCorpus.gz\n\nFrom now on, let us assume that you have created two data files linearized\nusing a dictionary created from the training data. 
One of them is the\nTRAIN_SET, and one the TEST_SET.\n\nTo choose optimal parameters (according to F-score), you can do leave-one-\ndocument-out cross-validation on the TRAIN_SET.\n\n::\n\n    python CrossValidate.py -i TRAIN_SET -o CV_predictions.o -p Parameters.p \\\n    -r -10_10 -b 500\n\nThis command will run cross-validation with the sparse RLS algorithm using\n500 basis vectors (or fewer, if your data set is smaller than that).\nThe predictions for each data point are written to the file\nCV_predictions.o, and the F-score results for the different parameter values\nto the file Parameters.p. The searched grid for the regularization parameter is\nin this example 2^-10 ... 2^10.\n\nTo build a model using these learned parameters, you can run\n\n::\n\n    python TrainLinearized.py -i TRAIN_SET -p Parameters.p -b 500 \\\n    -o Model.m\n\nAlternatively, you can supply the value of the regularization\nparameter directly with the -r option.\n\nTo make predictions with this model, run\n\n::\n\n    python TestLinearized.py -i TEST_SET -m Model.m -o Predictions\n\nWhen calculating the performance, use the threshold selected\nin cross-validation, if your performance metric needs one.\nSeparating the classes at zero can produce quite bad results.\nBe aware that selecting the threshold on the training data can\nalso fail if the \"identically distributed\" assumption does not\nhold between the training and test data.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farne-cl%2Fppi_graphkernel","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Farne-cl%2Fppi_graphkernel","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farne-cl%2Fppi_graphkernel/lists"}