awesome-msr
A curated repository of software engineering repository mining data sets
https://github.com/dspinellis/awesome-msr
- SIR - Software-artifact infrastructure repository; Java, C, C++, and C# software together with test suites and fault data.
- PROMISE - About 20 datasets related to software engineering research.
- FLOSSmole - Collaborative collection and analysis of free/libre/open source project data.
- Zenodo - Software data collections in CERN's open-access repository.
- Software Engineering Artifacts Can Really Assist Future Tasks
- Empirical Software Engineering
- Mining Software Repositories
- AndroidTimeMachine - Graph-based dataset of commit history of 8,431 real-world Android apps.
- AndroZoo - Collection of Android Applications.
- Bug Prediction Dataset - Collection of models and metrics from Eclipse JDT Core, PDE UI, Equinox Framework, Lucene, Mylyn, and their histories.
- Code Reviews - Code reviews of OpenStack, LibreOffice, AOSP, Qt, Eclipse.
- CoREBench - Collection of 70 realistically Complex Regression Errors that were systematically extracted from the repositories and bug reports of four open-source software projects: Make, Grep, Findutils, and Coreutils.
- Cryptocurrency GitHub Activity and Market Cap Dataset - Activity such as commits, stars, prices, and market cap of over 200 cryptocurrency projects on GitHub over time. Raw, historic data is also [available](https://zenodo.org/record/2595588#.XRuzuBNKhSM).
- Defects4J - Collection of 395 reproducible bugs collected with the goal of advancing software testing research.
- Eclipse AERI stacktraces - Collection of stacktraces of Exceptions encountered by users of the Eclipse IDE, as retrieved by the AERI reporting system.
- Enron Spreadsheets and Emails - All the spreadsheets and emails used in the paper 'Enron's Spreadsheets and Related Emails: A Dataset and Analysis'.
- Findbugs-maven - Set of FindBugs reports for the Java projects of the [Maven repository](https://maven.apache.org).
- GHTorrent - Scalable, queriable, offline mirror of data offered through the GitHub REST API.
- GitHub Bug Dataset - Bug Dataset of 15 Java open-source projects characterized by static source code metrics.
- GitHub on Google BigQuery - GitHub data accessible through Google's BigQuery platform.
- Grammar Zoo - Collection of grammars of DSLs and GPLs, some extracted from metamodels and document schemata.
- KaVE - Developer tool interaction data.
- Linux Kernel 4.21 Call Graphs - The Linux Kernel 4.21 Call Graphs produced using [CScout](https://github.com/dspinellis/cscout/).
- Maven metrics - Collection of software complexity & sizing metrics for the [Maven Repository](https://maven.apache.org).
- Maven Dependency Graph - Snapshot of the whole Maven Central taken on September 6, 2018, stored in a graph database.
- mzdata - Multi-extract and multi-level dataset of Mozilla issue tracking history.
- npm-miner - Analysis results of five open source software quality tools (eslint, escomplex, nsp, jsinspect, and sonarjs) for 2000 popular (in terms of stars and downloads) npm packages.
- OCL Expressions on GitHub - Data set of 9188 OCL expressions originating from 504 EMF meta-models in 245 systematically selected GitHub repositories.
- RepoReapers Data Set - Data set containing a collection of _engineered software projects_ from GHTorrent.
- Software Heritage Graph Dataset - Graph of the development history and file metadata of >80 million software projects from various forges (GitHub, GitLab, Debian, PyPI, Google Code, etc.) in a deduplicated and unified representation ([paper here](https://dl.acm.org/citation.cfm?id=3341907)).
- STAMINA - (STAte Machine INference Approaches) data are used to benchmark techniques for learning deterministic finite state machines (FSMs).
- Stack Exchange - Anonymized dump of all user-contributed content on the Stack Exchange network.
- TravisTorrent - Provides free and easy-to-use Travis CI build analyses.
- Ultimate Debian Database (UDD) - Data about various aspects of Debian (e.g. packages, bugs, maintainers) in the same SQL database.
- Unified Bug Dataset - Static source-code-based datasets, including the Bugcatchers Bug Dataset, the [Bug Prediction Dataset](http://bug.inf.usi.ch/index.php), the [Eclipse Bug Dataset](https://www.st.cs.uni-saarland.de/softevo/bug-data/eclipse/), the [GitHub Bug Dataset](http://www.inf.u-szeged.hu/~ferenc/papers/GitHubBugDataSet/), and some datasets from the [PROMISE](http://promise.site.uottawa.ca/SERepository/datasets-page.html) repository.
- Unix history - Git repository with 46 years of Unix history evolution.
- astminer - Library and tool for mining of path-based representations of code and other data derived from ASTs.
- Boa - Domain-specific language and infrastructure that eases mining software repositories.
- buckwheat - Multi-language tokenizer for extracting identifiers from source code.
- ckjm - Chidamber and Kemerer Java Metrics.
- Coming - A Java framework for analyzing code changes and mining instances of change patterns from Git repositories.
- CryptOSS - Mine GitHub activity and market cap data for cryptocurrency projects.
- DbDeo - Extract embedded SQL statements and detect database schema smells.
- Designite - Compute source code metrics and detect a variety of implementation, design, and architecture smells for C#.
- DesigniteJava - Compute source code metrics and detect a variety of implementation and design smells for Java.
- Diggit - Agile Ruby Tool to analyze Git repositories.
- GrimoireLab - Free/Libre/Open Source tools for Software Development Analytics.
- MetricMiner - Lean Java DSL to mine and extract data (e.g. commits, developers, modifications, diffs) from Git and SVN repositories.
- Maven-miner - Java tools and infrastructure to resolve the whole Maven dependency graph, hosted in Maven Central, in the form of a [Neo4j](https://neo4j.com/) Graph.
- Perceval - Fetch repository data from tens of back-ends.
- Puppeteer - Detect configuration smells in Puppet code.
- PyDriller - Python Framework to analyse Git repositories.
- qmcalc - Calculate quality metrics from C source code.
- reaper - Python tool to compute a score for a repository from GHTorrent. The score quantifies the extent to which the project contained within the repository is _engineered_.
- RefactoringMiner - Library/API for detection of refactorings in changes of Java code.
- VulData7 - Java framework enabling the automated collection of commits fixing vulnerabilities that are reported in NVD (links NVD with Git).
- Empirical Software Engineering journal
- MSR: Mining Software Repositories conference
- PROMISE: Predictive Models and Data Analytics in Software Engineering conference
- ACM Transactions on Software Engineering and Methodology (TOSEM)
- ESEC/FSE: ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering
- ICSE: International Conference on Software Engineering
- IEEE Software magazine
- IEEE Transactions on Software Engineering
- Journal of Systems and Software
- SANER: IEEE International Conference on Software Analysis, Evolution and Reengineering
- CC0
- Diomidis Spinellis