{"id":20776424,"url":"https://github.com/ad115/icgc-data-parser","last_synced_at":"2025-09-01T03:04:54.417Z","repository":{"id":62570192,"uuid":"58873703","full_name":"Ad115/ICGC-data-parser","owner":"Ad115","description":"To automate data collection from ICGC database.","archived":false,"fork":false,"pushed_at":"2018-09-03T19:21:56.000Z","size":18346,"stargazers_count":6,"open_issues_count":1,"forks_count":0,"subscribers_count":1,"default_branch":"develop","last_synced_at":"2025-07-31T16:41:09.122Z","etag":null,"topics":["ensembl","icgc","perl"],"latest_commit_sha":null,"homepage":"https://icgc-data-parser.readthedocs.io","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Ad115.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2016-05-15T17:03:49.000Z","updated_at":"2023-05-28T12:37:12.000Z","dependencies_parsed_at":"2022-11-03T17:15:39.728Z","dependency_job_id":null,"html_url":"https://github.com/Ad115/ICGC-data-parser","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Ad115/ICGC-data-parser","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ad115%2FICGC-data-parser","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ad115%2FICGC-data-parser/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ad115%2FICGC-data-parser/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ad115%2FICGC-data-parser/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Ad115","download_url":"https://codeload.github.com/Ad1
15/ICGC-data-parser/tar.gz/refs/heads/develop","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Ad115%2FICGC-data-parser/sbom","scorecard":{"id":8416,"data":{"date":"2025-08-11","repo":{"name":"github.com/Ad115/ICGC-data-parser","commit":"d509a3469d27773876f146764dd42dbe9ff6ef64"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":3,"checks":[{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Dangerous-Workflow","score":-1,"reason":"no workflows found","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Code-Review","score":0,"reason":"Found 0/30 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Token-Permissions","score":-1,"reason":"No tokens found","details":null,"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"SAST","score":0,"reason":"no SAST tool detected","details":["Warn: no pull requests merged into dev branch"],"documentation":{"short":"Determines if the project uses static 
code analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Vulnerabilities","score":10,"reason":"0 existing vulnerabilities detected","details":null,"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: 
no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"License","score":10,"reason":"license file detected","details":["Info: project has a license file: LICENSE:0","Info: FSF or OSI recognized license: MIT License: LICENSE:0"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 'develop'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"Pinned-Dependencies","score":-1,"reason":"no dependencies found","details":null,"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build 
process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}}]},"last_synced_at":"2025-08-14T14:04:02.449Z","repository_id":62570192,"created_at":"2025-08-14T14:04:02.449Z","updated_at":"2025-08-14T14:04:02.449Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":273068841,"owners_count":25039911,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-01T02:00:09.058Z","response_time":120,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ensembl","icgc","perl"],"created_at":"2024-11-17T13:08:09.807Z","updated_at":"2025-09-01T03:04:54.388Z","avatar_url":"https://github.com/Ad115.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"\nWhat is the ICGC-data-parser?\n=============================\n\n|Documentation Status|\n\n.. 
|Documentation Status| image:: https://readthedocs.org/projects/icgc-data-parser/badge/?version=develop\n   :target: http://icgc-data-parser.readthedocs.io/en/develop/?badge=develop\n\nA library to ease the parsing of data from the International Cancer Genome \nConsortium data releases, in particular, the simple somatic mutation \naggregates.\n\n\nTutorial\n========   \n\nInstallation\n------------\n\nInstall via `PyPI \u003chttps://pypi.org/project/ICGC-data-parser/\u003e`__:\n\n::\n\n    $ pip install ICGC_data_parser\n\n    \nData download\n-------------\n\nThe base data for the scripts is the ICGC's aggregate of the simple\nsomatic mutation data, which can be downloaded using\n\n::\n\n    wget https://dcc.icgc.org/api/v1/download?fn=/current/Summary/simple_somatic_mutation.aggregated.vcf.gz\n\nTo learn more about this file, please read `About the ICGC's simple\nsomatic mutations\nfile \u003chttps://icgc-data-parser.readthedocs.io/en/master/icgc-ssm-file.html\u003e`__\n\n**WARNING**: The current release of the data contains a malformed\nheader that causes the library to crash with a ``ValueError``::\n\n    ---------------------------------------------------------------------------\n    ValueError                                Traceback (most recent call last)\n    ~/.local/lib/python3.6/site-packages/vcf/parser.py in _parse_info(self, info_str)\n        389                 try:\n    ...\n    ...\n    ...\n    362     def _parse_info(self, info_str):\n\n    ValueError: could not convert string to float: 'PCAWG'\n    \nThis is caused by a bad type specification in the header of the \nVCF file. To work around it, use the following line after creating the \n``SSM_Reader`` object (assuming the reader is in the ``reader`` \nvariable):\n\n.. 
code-block:: python\n\n    # Fix weird bug due to malformed description headers\n    reader.infos['studies'] = reader.infos['studies']._replace(type='String')\n    \nIn the future this will be solved in a more elegant way, but for \nnow this is what we've got.\n\n\nUsage\n-----\n\nThe main class in the project is the `SSM_Reader \n\u003chttps://icgc-data-parser.readthedocs.io/en/master/api-documentation.html#ICGC_data_parser.SSM_Reader\u003e`__. \nIt makes the ICGC mutations file easy to read:\n\n.. code:: python\n\n\n    \u003e\u003e\u003e from ICGC_data_parser import SSM_Reader\n        \n    # Also reads compressed files!\n    \u003e\u003e\u003e reader = SSM_Reader(open('data/simple_somatic_mutations.aggregated.vcf.gz'))\n        \n    # or...\n    \u003e\u003e\u003e reader = SSM_Reader(filename='data/simple_somatic_mutations.aggregated.vcf.gz')\n    #                       ^^^^^^^^\n    # The filename keyword argument is important, otherwise we get an IndexError\n    \n\nThe `SSM_Reader.parse \n\u003chttps://icgc-data-parser.readthedocs.io/en/master/api-documentation.html#ICGC_data_parser.SSM_Reader.parse\u003e`__ \nmethod lets you iterate through the records of the file and access the parts \nof each record. You can also specify regular expressions to keep only the \nlines you want:\n\n.. code:: python\n\n\n    # Print only the mutations that are in the\n    # European Union Breast Cancer project (BRCA-EU).\n\n    \u003e\u003e\u003e for record in reader.parse(filters=['BRCA-EU']):\n    ...    print(record.ID, record.CHROM, record.POS)\n\n    MU66865518 1 100141201\n    MU65487875 1 100160548\n    MU66281118 1 100638179\n    MU66254120 1 101352655\n    ...\n\nThe INFO field is special in the sense that it contains several\nsubfields, and those subfields may be list-like entries with more\nsubfields themselves (in particular the CONSEQUENCE and OCCURRENCE\nsubfields):\n\n.. 
code:: python\n\n\n    # The subfields of the INFO field:\n    \u003e\u003e\u003e next(reader).INFO\n\n    {'CONSEQUENCE': [\n        '||||||intergenic_region||', \n        'CD1A|ENSG00000158477|+|CD1A-001|ENST00000289429||upstream_gene_variant||'\n        ], \n     'OCCURRENCE': [\n         'ESAD-UK|1|301|0.00332', \n         'EOPC-DE|1|202|0.00495', \n         'BRCA-EU|1|569|0.00176'\n        ],\n     'affected_donors': 3, \n     'mutation': 'T\u003eA', \n     'project_count': 3, \n     'studies': None, \n     'tested_donors': 12068}\n\n.. code:: python\n\n\n    # The description of the CONSEQUENCE subfield\n    \u003e\u003e\u003e print(reader.infos['CONSEQUENCE'].desc)\n\n    Mutation consequence predictions annotated by SnpEff \n    (subfields: gene_symbol|gene_affected|gene_strand|transcript_name|transcript_affected|protein_affected|consequence_type|cds_mutation|aa_mutation)\n    \n\n.. code:: python\n\n\n    # The description of the OCCURRENCE subfield\n    \u003e\u003e\u003e print(reader.infos['OCCURRENCE'].desc)\n\n    Mutation occurrence counts broken down by project \n    (subfields: project_code|affected_donors|tested_donors|frequency)\n\n\nSometimes we want to also parse the information in those subfields. For\nthis purpose, the ``SSM_Reader.subfield_parser`` factory method is\nuseful. This method creates a parser of the specified subfield that\nallows easy access to the data:\n\n.. code:: python\n\n\n    # Create the subfield parser for the CONSEQUENCE subfield\n    \u003e\u003e\u003e consequences = reader.subfield_parser('CONSEQUENCE')\n\n\n    \u003e\u003e\u003e for record in reader.parse():\n    ...    # Which genes are affected?\n    ...    genes_affected = {c.gene_symbol \n    ...                          for c in consequences(record)\n    ...                          if c.gene_affected}\n    ...\n    ...    print(f'Mutation: {record.ID}')\n    ...    
print('\\t', \", \".join(genes_affected))\n\n    Mutation: MU93246178\n         TPM3\n    Mutation: MU66962994\n         RP11-350G8.9, SHE\n    Mutation: MU93246498\n         DCST1, ADAM15, RP11-307C12.11\n    Mutation: MU66377106\n         EFNA3, ADAM15, EFNA4\n    ...\n\nThe library also contains some helper scripts to manipulate VCF files\n(like the ICGC mutations file): \n\n- ``vcf_map_assembly.py``: Creates a new VCF with the positions mapped to \n  another genome assembly. This is useful because currently the positions \n  reported by ICGC are in the human genome assembly GRCh37, while the most recent\n  (and the one the rest of the world uses) is the GRCh38 assembly. \n\n- ``vcf_sample.py``: Creates a new VCF with a fraction of the mutations in the\n  original. The mutations are randomly sampled but maintain the order they had in\n  the original file. This is useful when one wants to run small test analyses on\n  the data, but still wants the results to be representative of all the \n  mutations. \n\n- ``vcf_split.py``: Splits the input VCF into several (also valid) VCFs.\n  This is useful in case one wants to split the analyses into processes\n  that receive one file each.\n\nThe specific documentation of the scripts can be obtained by executing:\n\n::\n\n    $ python3 \u003cscript name\u003e.py --help\n\nAlso, the library is shipped with some Jupyter Notebooks that elaborate\non the examples. The notebooks also demonstrate ways\nto handle common parsing errors caused by malformed input\nfiles.\n\nMeta\n----\n\n**Author**: \n`Ad115 \u003chttps://agargar.wordpress.com/\u003e`__ -\n`GitHub \u003chttps://github.com/Ad115/\u003e`__ - \na.garcia230395@gmail.com\n\n\n**Project pages**: \n`Docs \u003chttps://icgc-data-parser.readthedocs.io\u003e`__ - `@GitHub \u003chttps://github.com/Ad115/ICGC-data-parser/\u003e`__ - `@PyPI \u003chttps://pypi.org/project/ICGC-data-parser/\u003e`__\n\nDistributed under the MIT license. 
See\n`LICENSE \u003chttps://github.com/Ad115/ICGC_data_parser/blob/master/LICENSE\u003e`__ for\nmore information.\n\nContributing\n------------\n\n1. Check for open issues or open a fresh issue to start a discussion\n   around a feature idea or a bug.\n2. Fork `the repository \u003chttps://github.com/Ad115/ICGC-data-parser/\u003e`__\n   on GitHub to start making your changes to a feature branch, derived\n   from the **master** branch.\n3. Write a test which shows that the bug was fixed or that the feature\n   works as expected.\n4. Send a pull request and bug the maintainer until it gets merged and\n   published.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fad115%2Ficgc-data-parser","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fad115%2Ficgc-data-parser","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fad115%2Ficgc-data-parser/lists"}