{"id":17030325,"url":"https://github.com/vsoch/zenodo-ml","last_synced_at":"2025-03-22T20:29:08.702Z","repository":{"id":141668099,"uuid":"134999575","full_name":"vsoch/zenodo-ml","owner":"vsoch","description":"dinosaur dataset to help understand domain-specific software","archived":false,"fork":false,"pushed_at":"2018-12-20T18:53:33.000Z","size":10813,"stargazers_count":1,"open_issues_count":1,"forks_count":2,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-01-28T00:32:01.616Z","etag":null,"topics":["database","image-processing","open-source","software","zenodo"],"latest_commit_sha":null,"homepage":"https://vsoch.github.io/datasets/2018/zenodo/#software-in-the-context-of-image-analysis","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/vsoch.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-05-26T22:45:15.000Z","updated_at":"2023-05-30T03:23:00.000Z","dependencies_parsed_at":null,"dependency_job_id":"dddd8377-0496-47f4-ac2d-817eabcecfd6","html_url":"https://github.com/vsoch/zenodo-ml","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vsoch%2Fzenodo-ml","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vsoch%2Fzenodo-ml/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vsoch%2Fzenodo-ml/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vsoch%2Fzenodo-ml/manifests","
owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/vsoch","download_url":"https://codeload.github.com/vsoch/zenodo-ml/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245018643,"owners_count":20548049,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["database","image-processing","open-source","software","zenodo"],"created_at":"2024-10-14T08:06:12.926Z","updated_at":"2025-03-22T20:29:08.682Z","avatar_url":"https://github.com/vsoch.png","language":"Jupyter Notebook","readme":"# Zenodo ML\n\nThis is part of the [Dinosaur Dataset](https://vsoch.github.io/datasets) series. I'll parse a dataset for you, show you how to use it, and you can do awesome research with it. Instructions for \ngeneration are below; see the Analysis Ideas section if you are looking for ideas of what to do with the dataset.\nThe link below provides instructions for downloading the dataset releases, along with links to example analyses.\n\n - [Instructions](https://vsoch.github.io/datasets/2018/zenodo/#what-can-i-learn-from-this-dataset) for download and use are provided on the Dinosaur Dataset site\n - [Schema.org Dataset](https://vsoch.github.io/zenodo-ml/) metadata is generated on the GitHub Pages associated with this repository. 
See the [.github/main.workflow](.github/main.workflow) for how this is done.\n - [Validate Google Dataset](https://search.google.com/structured-data/testing-tool/u/0/#url=https://vsoch.github.io/zenodo-ml) using the Google Dataset Testing Tool\n\nIf you use this dataset in your work, please cite the Zenodo DOI:\n\n[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.1286417.svg)](https://doi.org/10.5281/zenodo.1286417)\n\n\n## Assumptions\n\n 1. We use an \"image size\" of 80 by 80, under the assumption that the typical editor / programming language prefers lines of max length 80 (see Python's PEP 8 specification) and most machine learning algorithms prefer square images.\n 2. We filter the files down to those less than or equal to 100,000 bytes (100KB --\u003e 0.1 MB). This still yields on the order of a few thousand images (each 80x80) for one small script.\n 3. We filter the Zenodo repos down to the first 10K within the bucket called \"software.\"\n 4. I filtered out repos related to \"gcube,\" which were strangely common.\n 5. We take a greedy approach in parsing files - if a single file produces some special error, we skip it and continue processing the rest.\n\n## Generation\n\nThe entire code to generate the dataset is provided in a Docker container. 
If you don't\nwant to use the version on Docker Hub, you can build it locally.\n\n```bash\n# Build the image locally with Docker\ndocker build -t vanessa/zenodo-ml .\n\n# Or pull the Docker Hub version into a Singularity image\nsingularity pull --name zenodo-ml docker://vanessa/zenodo-ml\n```\n\n### Sherlock at Stanford\n\n```bash\nmodule load singularity\nsingularity pull --name zenodo-ml docker://vanessa/zenodo-ml\nmkdir -p $SCRATCH/zenodo-ml \u0026\u0026 cd $SCRATCH/zenodo-ml\nmv /scratch/users/vsochat/.singularity/zenodo-ml $SCRATCH/zenodo-ml/\n```\n\nNow you have a container binary called `zenodo-ml` in your `$SCRATCH/zenodo-ml` folder.\n\n\n### Download from Zenodo\n\n**optional**\n\nThe first step is to produce a file called \"records.pkl\" that should contain about 10K\ndifferent records from the Zenodo API. You should [create an API key](https://zenodo.org/account/settings/applications/tokens/new/), save the key to a file called `.secrets` in the directory where you are going to run\nthe container, and then run the container and map your present working directory to it. \nThat looks like this:\n\n```bash\ndocker run -v $PWD:/code -it vanessa/zenodo-ml python /code/download/0.download_records.py\n```\n\nYou don't actually need to do this, because the `records.pkl` is already provided in the container.\n\n### Parse Records\n\n**optional**\n\nOnce you have the `records.pkl` you can load it in for parsing! This will generate a data\nfolder in your present working directory with subfolders, each corresponding to a Zenodo identifier.\n\n```bash\n# With Docker\ndocker run -v $PWD:/data -it vanessa/zenodo-ml python /code/download/1.parse_records.py\n\n# Or with Singularity\nsingularity exec zenodo-ml python /code/download/1.parse_records.py\n```\n\n### Loading Data\nLet's take a look at the contents of one of the subfolders under the data folder:\n\n```bash\ntree data/1065022/\n    metadata_1065022.pkl    \n    images_1065022.pkl    \n```\n\nThe filenames speak for themselves! Each is a Python pickle, which means that you can\nload them with `pickle` in Python. 
The file `images_*.pkl` contains a dictionary data structure\nwith keys corresponding to files in the repository; each value is a list of file segments.\nA file segment is an 80x80 section of the file (the key) that has had its characters converted\nto ordinals. You can convert between characters and ordinals in Python as follows:\n\n```python\n# Character to ordinal (number)\nchar = 'a'\nnumber = ord(char)\nprint(number)  # 97\n\n# Ordinal back to character\nprint(chr(number))  # 'a'\n```\n\n#### Images\nHere is how you would load and look at an image.\n\n```python\nimport os\nimport pickle\n\nimage_pkl = os.path.abspath('data/1065022/images_1065022.pkl')\nwith open(image_pkl, 'rb') as f:\n    images = pickle.load(f)\n```\n\nRemember, this entire pickle is for just one repository that is found in a record from Zenodo! If you\nlook at the image \"keys\" you will see that each one corresponds to a file in the repository.\n\nFor complete usage and tutorial, see the page on [dinosaur datasets](https://vsoch.github.io/datasets/2018/zenodo/#what-can-i-learn-from-this-dataset).\n\n## Analysis Ideas\n\nSoftware, whether compiled or not, is a collection of scripts. A script is a stack of lines,\nand you might be tempted to relate it to a document or a page in a book. Actually, I think\nwe can conceptualize scripts more like images. A script is like an image in that it is a grid\nof pixels, each of which has some pixel value. In medical imaging we may think of these\nvalues in some range of grayscale, and with any kind of photograph we might imagine having\ndifferent color channels. A script is no different - the empty spaces are akin to values of zero,\nand the various characters akin to different pixel values. While it might be tempting to use\nmethods like Word2Vec to assess code in terms of lines, I believe that the larger context of the\nprevious and next lines is important too. 
Thus, my specific aims are the following:\n\n## Input Data\nThe first step above is the following:\n\n\u003e identify a corpus of scripts and metadata\n\nZenodo, specifically the \"software\" bucket, is ideal for this case. From this database\nI could extract about 10,000 software records, most of which are linked to a GitHub repository\nand carry various amounts of metadata (from keywords to entire manuscript publications).\n\n## Preprocessing\nThen I preprocessed this data: downloading each repository, reading the script files,\nand representing each file as an image. We enforced equivalent dimensions (80x80) regardless\nof the language.\n\n## Relationship Extraction\nThe relationship between two files in terms of their location in the repository\nmight be meaningful, so this could be of further interest to extract.\n\n## Deep Learning\nThe first suggestion is to use convolutional neural networks to generate features of the scripts.\nI'm not experienced in doing this, so I don't know what kinds of questions / analyses I'd like to try; \nplease reach out if you would like to work together on this! Here are some overall ideas for goals:\n\n# Goals\n\n## Software Development\nHere are some early goals that I think this work could help with:\n\n - **comparison of software**: it follows logically that if we can make an association between features of software and some meaningful metadata tag, we can take unlabeled code and better organize it.\n - **comparison of containers**: since our \"unit of understanding\" is not an entire container, if we are able to identify features of software, and then identify groups of software in containers, we can again better label the purpose / domain of the container.\n - **optimized / automated script generation**: If we have features of software, the next step is to make an association between features and what constitutes \"good\" software. 
For example, if I can measure the resource usage or system calls of a particular piece of software and I can also extract (human-interpretable) features about it, I can use this information to make predictions about other software without running it.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvsoch%2Fzenodo-ml","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvsoch%2Fzenodo-ml","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvsoch%2Fzenodo-ml/lists"}