{"id":18507936,"url":"https://github.com/nicbet/infozilla","last_synced_at":"2025-10-13T05:09:29.695Z","repository":{"id":109836234,"uuid":"165971709","full_name":"nicbet/infozilla","owner":"nicbet","description":"The infoZilla unstructured software engineering data mining tool. It can find and extract source code regions, patches, stack traces, enumerations and itemizations from discussion threads.","archived":false,"fork":false,"pushed_at":"2019-01-24T00:53:51.000Z","size":543,"stargazers_count":15,"open_issues_count":0,"forks_count":2,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-04-09T20:48:20.814Z","etag":null,"topics":["bugreport","bugzilla","data-mining","data-science","tools","unstructured-data"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nicbet.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-01-16T04:09:53.000Z","updated_at":"2025-01-13T19:40:44.000Z","dependencies_parsed_at":"2023-03-12T00:45:34.158Z","dependency_job_id":null,"html_url":"https://github.com/nicbet/infozilla","commit_stats":null,"previous_names":["nicbet/infozilla"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/nicbet/infozilla","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nicbet%2Finfozilla","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nicbet%2Finfozilla/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nicbet%2Finfozilla/releases","manifests_u
rl":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nicbet%2Finfozilla/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nicbet","download_url":"https://codeload.github.com/nicbet/infozilla/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nicbet%2Finfozilla/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279013695,"owners_count":26085390,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-13T02:00:06.723Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bugreport","bugzilla","data-mining","data-science","tools","unstructured-data"],"created_at":"2024-11-06T15:12:53.020Z","updated_at":"2025-10-13T05:09:29.635Z","avatar_url":"https://github.com/nicbet.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# infoZilla\nThe `infoZilla` tool is a library and tool for extracting structural software engineering data from unstructured data sources such as e-mails, discussions, bug reports and wiki pages.\n\nOut of the box, the `infoZilla` tool finds the following artefacts:\n\n## Java Source Code Regions\nSmall- to medium-sized code examples are often used to illustrate a problem, describe the programming context in which a problem occurred, or represent a sample fix to the problem described 
in the text.\n\n![source code](docs/annotated_source.png)\n\n## Patches\n\nA patch is a small piece of software designed to update or fix problems with a computer program or its supporting data. The most commonly used patch format is the unified diff format.\n\n![patches](docs/annotated_patch.png)\n\n## Enumerations and Itemizations\nEnumerations and itemizations are used to list items, describe a chain of causality, or present a sequence of actions the developer has to take to reproduce or fix a problem. They add structure to the description of a problem and ease understanding.\n\n![enumerations](docs/annotated_enum.png)\n\n## Java Stacktraces\nStack traces list the active stack frames in the call stack during execution of a program. They are widely used to aid bug fixing by giving hints at the origin of a problem.\n\n![stacktraces](docs/annotated_trace.png)\n\n## Talkback Traces\nTalkback traces record the crash context and environment details at the time a problem is detected.\n\n![talkback](docs/annotated_talkback.png)\n\n## Usage\n```\nUsage: infozilla [-clps] [--charset=\u003cinputCharset\u003e] FILE...\n      FILE...              
File(s) to process.\n      --charset=\u003cinputCharset\u003e\n                           Character Set of Input (default=ISO-8859-1)\n  -c, --with-source-code   Process and extract source code regions (default=true)\n  -l, --with-lists         Process and extract lists (default=true)\n  -p, --with-patches       Process and extract patches (default=true)\n  -s, --with-stacktraces   Process and extract stacktraces (default=true)\n```\n\n## Quickstart\nRun the tool against a sample bug report:\n\n```\ngradle run --args=\"demo/demo0001.txt\"\n\nTask :run\nExtracted Structural Elements from demo/demo0001.txt\n0        Patches\n2        Stack Traces\n4        Source Code Fragments\n1        Enumerations\nWriting Cleaned Output\nWriting XML Output\n```\n\nThis will produce two files:\n- `demo/demo0001.txt.cleaned` which contains the natural language text with all structural elements removed. This output is useful when applying NLP algorithms, which would otherwise have to process large chunks of non-natural-language data.\n- `demo/demo0001.txt.result.xml` which contains a machine-parseable representation of all structural elements that were found in the input.\n\n### Example XML Output\n```xml\n\u003c?xml version=\"1.0\" encoding=\"UTF-8\"?\u003e\n\u003cinfozilla-output\u003e\n  \u003cPatches amount=\"0\" /\u003e\n  \u003cStacktraces amount=\"2\"\u003e\n    \u003cStacktrace timestamp=\"1547858607503\"\u003e\n      \u003cException\u003eorg.eclipse.core.internal.resources.ResourceException\u003c/Exception\u003e\n      \u003cReason\u003eResource /org.eclipse.debug.core/.classpath is not local.\u003c/Reason\u003e\n      \u003cFrames\u003e\n        \u003cFrame depth=\"0\"\u003eorg.eclipse.core.internal.resources.Resource.checkLocal(Resource.java:313)\u003c/Frame\u003e\n        \u003cFrame depth=\"1\"\u003eorg.eclipse.core.internal.resources.File.getContents(File.java:213)\u003c/Frame\u003e\n        \u003cFrame 
depth=\"2\"\u003eorg.eclipse.jdt.internal.core.Util.getResourceContentsAsByteArray(Util.java:671)\u003c/Frame\u003e\n        \u003cFrame depth=\"3\"\u003eorg.eclipse.jdt.internal.core.JavaProject.getSharedProperty(JavaProject.java:1793)\u003c/Frame\u003e\n        \u003cFrame depth=\"4\"\u003eorg.eclipse.jdt.internal.core.JavaProject.readClasspathFile(JavaProject.java:2089)\u003c/Frame\u003e\n        \u003cFrame depth=\"5\"\u003eorg.eclipse.jdt.internal.core.JavaProject.getRawClasspath(JavaProject.java:1579)\u003c/Frame\u003e\n        \u003cFrame depth=\"6\"\u003eorg.eclipse.jdt.internal.core.search.indexing.IndexAllProject.execute(IndexAllProject.java:77)\u003c/Frame\u003e\n        \u003cFrame depth=\"7\"\u003eorg.eclipse.jdt.internal.core.search.processing.JobManager.run(JobManager.java:371)\u003c/Frame\u003e\n      \u003c/Frames\u003e\n    \u003c/Stacktrace\u003e\n    \u003cStacktrace timestamp=\"1547858607503\"\u003e\n      \u003cException\u003eorg.eclipse.core.internal.resources.ResourceException\u003c/Exception\u003e\n      \u003cReason\u003eResource /org.eclipse.jdt.launching/.classpath is not local.\u003c/Reason\u003e\n      \u003cFrames\u003e\n        \u003cFrame depth=\"0\"\u003eorg.eclipse.core.internal.resources.Resource.checkLocal(Resource.java:307)\u003c/Frame\u003e\n        \u003cFrame depth=\"1\"\u003eorg.eclipse.core.internal.resources.File.getContents(File.java:213)\u003c/Frame\u003e\n        \u003cFrame depth=\"2\"\u003eorg.eclipse.jdt.internal.core.Util.getResourceContentsAsByteArray(Util.java:677)\u003c/Frame\u003e\n        \u003cFrame depth=\"3\"\u003eorg.eclipse.jdt.internal.core.JavaProject.getSharedProperty(JavaProject.java:1809)\u003c/Frame\u003e\n        \u003cFrame depth=\"4\"\u003eorg.eclipse.jdt.internal.core.JavaProject.readClasspathFile(JavaProject.java:2105)\u003c/Frame\u003e\n        \u003cFrame depth=\"5\"\u003eorg.eclipse.jdt.internal.core.JavaProject.getRawClasspath(JavaProject.java:1593)\u003c/Frame\u003e\n        \u003cFrame 
depth=\"6\"\u003eorg.eclipse.jdt.internal.core.JavaProject.getRawClasspath(JavaProject.java:1583)\u003c/Frame\u003e\n        \u003cFrame depth=\"7\"\u003eorg.eclipse.jdt.internal.core.JavaProject.getOutputLocation(JavaProject.java:1375)\u003c/Frame\u003e\n        \u003cFrame depth=\"8\"\u003eorg.eclipse.jdt.internal.core.search.indexing.IndexAllProject.execute(IndexAllProject.java:90)\u003c/Frame\u003e\n        \u003cFrame depth=\"9\"\u003eorg.eclipse.jdt.internal.core.search.processing.JobManager.run(JobManager.java:375)\u003c/Frame\u003e\n        \u003cFrame depth=\"10\"\u003ejava.lang.Thread.run(Thread.java:536)\u003c/Frame\u003e\n      \u003c/Frames\u003e\n    \u003c/Stacktrace\u003e\n  \u003c/Stacktraces\u003e\n  \u003cSourceCodeRegions amount=\"4\"\u003e\n    \u003csource_code type=\"functioncall\"\u003e\n      \u003clocation start=\"3466\" end=\"3479\" /\u003e\n      \u003ccode\u003e(monitor,1));\u003c/code\u003e\n    \u003c/source_code\u003e\n    \u003csource_code type=\"ifstatement\"\u003e\n      \u003clocation start=\"6059\" end=\"6327\" /\u003e\n      \u003ccode\u003eif (isJavaProject) {\n\t/*IJavaProject jProject = JavaCore.create(project);\n\tif (jProject.getRawClasspath() != null\n\t\t\u0026amp;\u0026amp; jProject.getRawClasspath().length \u0026gt; 0)\n\t\tjProject.setRawClasspath(new IClasspathEntry[0], monitor);*/\n\tmodelIds.add(model.getPluginBase().getId());\n}\u003c/code\u003e\n    \u003c/source_code\u003e\n    \u003csource_code type=\"ifstatement\"\u003e\n      \u003clocation start=\"6335\" end=\"6538\" /\u003e\n      \u003ccode\u003eif (isJavaProject) {\n\tIJavaProject jProject = JavaCore.create(project);\n\tjProject.setRawClasspath(new IClasspathEntry[0], project.getFullPath(),\nmonitor);\n\tmodelIds.add(model.getPluginBase().getId());\n}\u003c/code\u003e\n    \u003c/source_code\u003e\n    \u003csource_code type=\"ifstatement\"\u003e\n      \u003clocation start=\"7483\" end=\"7686\" /\u003e\n      \u003ccode\u003eif (isJavaProject) 
{\n\tIJavaProject jProject = JavaCore.create(project);\n\tjProject.setRawClasspath(new IClasspathEntry[0], project.getFullPath(),\nmonitor);\n\tmodelIds.add(model.getPluginBase().getId());\n}\u003c/code\u003e\n    \u003c/source_code\u003e\n  \u003c/SourceCodeRegions\u003e\n  \u003cEnumerations amount=\"1\"\u003e\n    \u003cEnumeration lines=\"23\"\u003e\n      \u003cLines\u003e\n        \u003cLine\u003e1. If autobuilding is on, we turn it off.\u003c/Line\u003e\n        \u003cLine /\u003e\n        \u003cLine\u003e2. We import all the plug-ins selected in the import wizard and create a Java\u003c/Line\u003e\n        \u003cLine\u003eproject for each plug-in that contains libraries.  Note that at this step, we\u003c/Line\u003e\n        \u003cLine\u003eused to clear the classpath of the freshly created Java project because we\u003c/Line\u003e\n        \u003cLine\u003ewill correctly set it at a later step.  However, just before we released 2.1,\u003c/Line\u003e\n        \u003cLine\u003ePhilippe suggested in bug report 34574 that we do not flush the classpath\u003c/Line\u003e\n        \u003cLine\u003ecompletely.  So we stopped flushing the classpath at this point, and this\u003c/Line\u003e\n        \u003cLine\u003eintroduced the transient error markers that we now see in the Problems view in\u003c/Line\u003e\n        \u003cLine\u003ethe middle of the operation.  Since these error markers go away later in step\u003c/Line\u003e\n        \u003cLine\u003e3 when we set the classpath, we regarded them as benign, yet still annoying,\u003c/Line\u003e\n        \u003cLine\u003eintermediary entities.  This step is done in an IWorkspace.run\u003c/Line\u003e\n        \u003cLine\u003e(IWorkspaceRunnable, IProgressMonitor) operation.\u003c/Line\u003e\n        \u003cLine /\u003e\n        \u003cLine\u003e3. We set the classpath of all the projects that were succesfully imported\u003c/Line\u003e\n        \u003cLine\u003einto the workspace. 
This step has to be done in a subsequent IWorkspace.run\u003c/Line\u003e\n        \u003cLine\u003e(IWorkspaceRunnable, IProgressMonitor) operation for an accurate classpath\u003c/Line\u003e\n        \u003cLine\u003ecomputation.  i.e. the Java projects from step 2 have to become part of the\u003c/Line\u003e\n        \u003cLine\u003eworkspace before we set their classpath.\u003c/Line\u003e\n        \u003cLine /\u003e\n        \u003cLine\u003e4.  If we had turned autobuilding off in step 1, we turn it back on and invoke\u003c/Line\u003e\n        \u003cLine\u003ea build via PDEPlugin.getWorkspace().build\u003c/Line\u003e\n        \u003cLine\u003e(IncrementalProjectBuilder.INCREMENTAL_BUILD,new SubProgressMonitor\u003c/Line\u003e\n      \u003c/Lines\u003e\n    \u003c/Enumeration\u003e\n  \u003c/Enumerations\u003e\n\u003c/infozilla-output\u003e\n```\n\n## Building\nBuilding the infoZilla tool requires `gradle` 4 or newer. Run `gradle tasks` for an overview.\n\n\n## Citing\nIf you like the tool and find it useful, feel free to cite the original research work:\n```\n@conference {972,\n\ttitle = {Extracting structural information from bug reports},\n\tbooktitle = {Proceedings of the 2008 international workshop on Mining software repositories  - MSR {\\textquoteright}08},\n\tyear = {2008},\n\tmonth = {05/2008},\n\tpages = {27-30},\n\tpublisher = {ACM Press},\n\torganization = {ACM Press},\n\taddress = {New York, New York, USA},\n\tabstract = {In software engineering experiments, the description of bug reports is typically treated as natural language text, although it often contains stack traces, source code, and patches. Neglecting such structural elements is a loss of valuable information; structure usually leads to a better performance of machine learning approaches. In this paper, we present a tool called infoZilla that detects structural elements from bug reports with near perfect accuracy and allows us to extract them. 
We anticipate that infoZilla can be used to leverage data from bug reports at a different granularity level that can facilitate interesting research in the future.},\n\tkeywords = {bug reports, eclipse, enumerations, infozilla, natural language, patches, source code, stack trace},\n\tisbn = {9781605580241},\n\tdoi = {10.1145/1370750.1370757},\n\tattachments = {https://flosshub.org/sites/flosshub.org/files/p27-bettenburg.pdf},\n\tauthor = {Premraj, Rahul and Zimmermann, Thomas and Kim, Sunghun and Bettenburg, Nicolas}\n}\n\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnicbet%2Finfozilla","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnicbet%2Finfozilla","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnicbet%2Finfozilla/lists"}