https://github.com/nicbet/infozilla

The infoZilla unstructured software engineering data mining tool. It can find and extract source code regions, patches, stack traces, enumerations and itemizations from discussion threads.
bugreport bugzilla data-mining data-science tools unstructured-data


# infoZilla
`infoZilla` is a library and command-line tool for extracting structural software engineering data from unstructured data sources such as e-mails, discussions, bug reports and wiki pages.

Out of the box, the `infoZilla` tool finds the following artefacts:

## Java Source Code Regions
Small- to medium-sized source code examples are often used to illustrate a problem, describe the programming context in which a problem occurred, or represent a sample fix to the problem described in the text.

![source code](docs/annotated_source.png)
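
To give a feel for why code regions stand out from prose, here is a minimal Python sketch that groups consecutive "code-like" lines into regions. It is only an illustration under assumed heuristics (line endings such as `;`, `{`, `}` and a few keywords); the names `CODE_CUES` and `code_regions` are hypothetical, and this is not how infoZilla itself detects source code.

```python
import re

# Rough, assumed cues that a line is Java code rather than prose.
CODE_CUES = re.compile(
    r"(;\s*$"                                   # statement terminator
    r"|[{}]\s*$"                                # block open/close
    r"|^\s*(if|for|while|try|catch)\b.*\("      # control-flow keyword with "("
    r"|^\s*(public|private|protected)\s)"       # access modifier
)


def code_regions(text, min_lines=2):
    """Return (start, end) line-index pairs (end exclusive) of code-like runs."""
    regions = []
    start = None
    lines = text.splitlines()
    # The trailing empty string acts as a sentinel that flushes the last run.
    for index, line in enumerate(lines + [""]):
        if CODE_CUES.search(line):
            if start is None:
                start = index
        else:
            if start is not None and index - start >= min_lines:
                regions.append((start, index))
            start = None
    return regions


sample = """The import fails here:
if (isJavaProject) {
IJavaProject jProject = JavaCore.create(project);
modelIds.add(model.getPluginBase().getId());
}
Any idea why?
"""
print(code_regions(sample))
```

The `min_lines` threshold is there to reduce false positives from single prose lines that happen to end in a semicolon or brace.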

## Patches

A patch is a small piece of software designed to update or fix problems with a computer program or its supporting data. The most commonly used patch format is the unified diff format.

![patches](docs/annotated_patch.png)
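
Unified diffs are comparatively easy to spot in free text because every hunk starts with a header of the form `@@ -start,count +start,count @@`. The following Python sketch locates such headers; it is a hypothetical helper (`find_hunk_headers` is a name made up for this example) and does not claim to mirror infoZilla's own patch detection.

```python
import re

# Hunk header of a unified diff, e.g. "@@ -12,7 +12,8 @@".
# The line counts are optional in the format and default to 1.
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def find_hunk_headers(text):
    """Return (line_index, old_start, old_count, new_start, new_count) per header."""
    hunks = []
    for index, line in enumerate(text.splitlines()):
        match = HUNK_HEADER.match(line)
        if match:
            old_start, old_count, new_start, new_count = match.groups()
            hunks.append((
                index,
                int(old_start),
                int(old_count) if old_count is not None else 1,
                int(new_start),
                int(new_count) if new_count is not None else 1,
            ))
    return hunks


# Made-up discussion text containing a small patch.
sample = """I think this fixes it:
--- a/Foo.java
+++ b/Foo.java
@@ -10,2 +10,3 @@
 int x = 0;
+int y = 1;
 return x;
"""
print(find_hunk_headers(sample))
```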

## Enumerations and Itemizations
Enumerations and itemizations are used to list items, describe a chain of causality, or present a sequence of actions the developer has to take in order to be able to reproduce or fix a problem. They add structure to the description of a problem and ease understanding.

![enumerations](docs/annotated_enum.png)
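
As an illustration of what extracting an enumeration can involve, the Python sketch below collects numbered items and folds wrapped continuation lines into the item they belong to. The function name `find_numbered_items` and the rule that a blank line ends an item are assumptions made for this example, not a description of infoZilla's implementation.

```python
import re

# A numbered item starts with digits followed by "." or ")" and whitespace.
ITEM_START = re.compile(r"^\s*(\d+)[.)]\s+(\S.*)$")


def find_numbered_items(text):
    """Return (number, body) tuples; non-blank lines after an item continue it."""
    items = []
    in_item = False
    for line in text.splitlines():
        match = ITEM_START.match(line)
        if match:
            items.append((int(match.group(1)), match.group(2).strip()))
            in_item = True
        elif in_item and line.strip():
            # Wrapped line: append it to the body of the current item.
            number, body = items[-1]
            items[-1] = (number, body + " " + line.strip())
        else:
            # Blank line (or prose before any item) closes the current item.
            in_item = False
    return items


# Made-up sample text in the style of a bug report comment.
sample = """Steps to reproduce:
1. If autobuilding is on, we turn it off.

2. We import all the plug-ins selected
in the import wizard.

Thanks!
"""
print(find_numbered_items(sample))
```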

## Java Stacktraces
Stack traces list the active stack frames on the call stack at a particular point during the execution of a program. They are widely used to aid bug fixing because they hint at the origin of a problem.

![stacktraces](docs/annotated_trace.png)
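
Java stack frames follow a regular `at package.Class.method(File.java:line)` shape, which makes them amenable to pattern matching. The Python sketch below splits such frame lines into class, method, and source location. The regular expression and the name `parse_frames` are assumptions for illustration only and are not taken from infoZilla's code.

```python
import re

# One Java stack frame, e.g. "at pkg.Class.method(File.java:42)".
FRAME = re.compile(r"^\s*at\s+([\w.$]+)\.([\w$<>]+)\(([^()]*)\)\s*$")


def parse_frames(text):
    """Return (qualified_class, method, location) for each frame line in text."""
    frames = []
    for line in text.splitlines():
        match = FRAME.match(line)
        if match:
            frames.append(match.groups())
    return frames


# Sample trace assembled for this example in the usual Java format.
sample = """org.eclipse.core.internal.resources.ResourceException: Resource is not local.
    at org.eclipse.core.internal.resources.Resource.checkLocal(Resource.java:313)
    at org.eclipse.core.internal.resources.File.getContents(File.java:213)
"""
for frame in parse_frames(sample):
    print(frame)
```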

## Talkback Traces
Talkback traces record the crash context and environment details at the time a problem is detected.

![talkback](docs/annotated_talkback.png)

## Usage
```
Usage: infozilla [-clps] [--charset=] FILE...
      FILE...                  File(s) to process.
      --charset=               Character Set of Input (default=ISO-8859-1)
  -c, --with-source-code       Process and extract source code regions (default=true)
  -l, --with-lists             Process and extract lists (default=true)
  -p, --with-patches           Process and extract patches (default=true)
  -s, --with-stacktraces       Process and extract stacktraces (default=true)
```

## Quickstart
Run the tool against a sample bug report:

```
gradle run --args="demo/demo0001.txt"

Task :run
Extracted Structural Elements from demo/demo0001.txt
0 Patches
2 Stack Traces
4 Source Code Fragments
1 Enumerations
Writing Cleaned Output
Writing XML Output
```

This will produce two files:
- `demo/demo0001.txt.cleaned`, which contains the natural language text with all structural elements removed. This output is useful when the goal is to apply NLP algorithms that would otherwise also run over large chunks of non-natural-language data.
- `demo/demo0001.txt.result.xml`, which contains a machine-parseable representation of all structural elements that were found in the input.
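
One way to get a first overview of a result file is to tally its element names without assuming anything about the schema. The Python sketch below does this with the standard library's `xml.etree.ElementTree`. The sample document and its element names (`report`, `trace`, `frame`) are placeholders invented for this example; the names infoZilla actually emits may differ.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_elements(xml_text):
    """Count how often each element name occurs in an XML document."""
    root = ET.fromstring(xml_text)
    # root.iter() walks the root and all of its descendants.
    return Counter(element.tag for element in root.iter())


# Placeholder document; real element names may differ.
sample = "<report><trace><frame/><frame/></trace><trace><frame/></trace></report>"
counts = count_elements(sample)
print(counts.most_common())
```

For a file on disk, the same tally can be obtained by reading the file contents and passing them to `count_elements`.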

### Example XML Output
```xml




org.eclipse.core.internal.resources.ResourceException
Resource /org.eclipse.debug.core/.classpath is not local.

org.eclipse.core.internal.resources.Resource.checkLocal(Resource.java:313)
org.eclipse.core.internal.resources.File.getContents(File.java:213)
org.eclipse.jdt.internal.core.Util.getResourceContentsAsByteArray(Util.java:671)
org.eclipse.jdt.internal.core.JavaProject.getSharedProperty(JavaProject.java:1793)
org.eclipse.jdt.internal.core.JavaProject.readClasspathFile(JavaProject.java:2089)
org.eclipse.jdt.internal.core.JavaProject.getRawClasspath(JavaProject.java:1579)
org.eclipse.jdt.internal.core.search.indexing.IndexAllProject.execute(IndexAllProject.java:77)
org.eclipse.jdt.internal.core.search.processing.JobManager.run(JobManager.java:371)



org.eclipse.core.internal.resources.ResourceException
Resource /org.eclipse.jdt.launching/.classpath is not local.

org.eclipse.core.internal.resources.Resource.checkLocal(Resource.java:307)
org.eclipse.core.internal.resources.File.getContents(File.java:213)
org.eclipse.jdt.internal.core.Util.getResourceContentsAsByteArray(Util.java:677)
org.eclipse.jdt.internal.core.JavaProject.getSharedProperty(JavaProject.java:1809)
org.eclipse.jdt.internal.core.JavaProject.readClasspathFile(JavaProject.java:2105)
org.eclipse.jdt.internal.core.JavaProject.getRawClasspath(JavaProject.java:1593)
org.eclipse.jdt.internal.core.JavaProject.getRawClasspath(JavaProject.java:1583)
org.eclipse.jdt.internal.core.JavaProject.getOutputLocation(JavaProject.java:1375)
org.eclipse.jdt.internal.core.search.indexing.IndexAllProject.execute(IndexAllProject.java:90)
org.eclipse.jdt.internal.core.search.processing.JobManager.run(JobManager.java:375)
java.lang.Thread.run(Thread.java:536)






(monitor,1));



if (isJavaProject) {
/*IJavaProject jProject = JavaCore.create(project);
if (jProject.getRawClasspath() != null
&& jProject.getRawClasspath().length > 0)
jProject.setRawClasspath(new IClasspathEntry[0], monitor);*/
modelIds.add(model.getPluginBase().getId());
}




if (isJavaProject) {
IJavaProject jProject = JavaCore.create(project);
jProject.setRawClasspath(new IClasspathEntry[0], project.getFullPath(),
monitor);
modelIds.add(model.getPluginBase().getId());
}




if (isJavaProject) {
IJavaProject jProject = JavaCore.create(project);
jProject.setRawClasspath(new IClasspathEntry[0], project.getFullPath(),
monitor);
modelIds.add(model.getPluginBase().getId());
}






1. If autobuilding is on, we turn it off.

2. We import all the plug-ins selected in the import wizard and create a Java
project for each plug-in that contains libraries. Note that at this step, we
used to clear the classpath of the freshly created Java project because we
will correctly set it at a later step. However, just before we released 2.1,
Philippe suggested in bug report 34574 that we do not flush the classpath
completely. So we stopped flushing the classpath at this point, and this
introduced the transient error markers that we now see in the Problems view in
the middle of the operation. Since these error markers go away later in step
3 when we set the classpath, we regarded them as benign, yet still annoying,
intermediary entities. This step is done in an IWorkspace.run
(IWorkspaceRunnable, IProgressMonitor) operation.

3. We set the classpath of all the projects that were succesfully imported
into the workspace. This step has to be done in a subsequent IWorkspace.run
(IWorkspaceRunnable, IProgressMonitor) operation for an accurate classpath
computation. i.e. the Java projects from step 2 have to become part of the
workspace before we set their classpath.

4. If we had turned autobuilding off in step 1, we turn it back on and invoke
a build via PDEPlugin.getWorkspace().build
(IncrementalProjectBuilder.INCREMENTAL_BUILD,new SubProgressMonitor


```

## Building
Building the infoZilla tool requires `gradle` 4 or newer. Run `gradle tasks` for an overview.

## Citing
If you like the tool and find it useful, feel free to cite the original research work:
```
@conference {972,
title = {Extracting structural information from bug reports},
booktitle = {Proceedings of the 2008 international workshop on Mining software repositories - MSR {\textquoteright}08},
year = {2008},
month = {05/2008},
pages = {27-30},
publisher = {ACM Press},
organization = {ACM Press},
address = {New York, New York, USA},
abstract = {In software engineering experiments, the description of bug reports is typically treated as natural language text, although it often contains stack traces, source code, and patches. Neglecting such structural elements is a loss of valuable information; structure usually leads to a better performance of machine learning approaches. In this paper, we present a tool called infoZilla that detects structural elements from bug reports with near perfect accuracy and allows us to extract them. We anticipate that infoZilla can be used to leverage data from bug reports at a different granularity level that can facilitate interesting research in the future.},
keywords = {bug reports, eclipse, enumerations, infozilla, natural language, patches, source code, stack trace},
isbn = {9781605580241},
doi = {10.1145/1370750.1370757},
attachments = {https://flosshub.org/sites/flosshub.org/files/p27-bettenburg.pdf},
author = {Premraj, Rahul and Zimmermann, Thomas and Kim, Sunghun and Bettenburg, Nicolas}
}

```