https://github.com/stefangt44/concurrent-word-distribution-tool

A JavaFX desktop application for concurrently computing the distribution of specified words in large files/directories and plotting the results on a graph.

component-based concurrent-programming desktop-application graph javafx

Host: GitHub
URL: https://github.com/stefangt44/concurrent-word-distribution-tool
Owner: stefanGT44
Created: 2020-09-02T13:55:48.000Z (over 4 years ago)
Default Branch: master
Last Pushed: 2020-09-15T23:09:54.000Z (over 4 years ago)
Last Synced: 2024-11-25T03:22:06.847Z (3 months ago)
Topics: component-based, concurrent-programming, desktop-application, graph, javafx
Language: Java
Homepage:
Size: 21.3 MB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# Concurrent-Word-Distribution-Tool
A JavaFX desktop application for concurrently computing the distribution of specified words in large directories/files and plotting the results on a graph. Here, distribution means the number of times each word appears in a text file.

## Overview
The application functions as a pipeline made up of several types of components that run concurrently.

There are three types of components:
1. Input component - data entry point
2. Cruncher component - data processing
3. Output component - storing and visualizing results

The user can create multiple instances of each component and link (connect) them in any way they see fit.

Every component instance runs in its own thread and every component type has a dedicated thread pool.

Input components provide input to cruncher components, which provide input to the output components.

Component communication (data flow) is based on shared blocking queues.

The architecture of the system makes it easy to integrate new types of components.

The application is optimized to use as little RAM as possible.

Components and the main app follow the MVC design pattern.

#### Component pipeline example:
![Alt text](images/wdt.png?raw=true "")

## Usage example

![Alt text](images/de4.png?raw=true "")

Input 0 is linked to Cruncher 0, which is automatically linked to the default output component.

Input 0 is active and currently reading one text file (see the bottom blue label).

Cruncher 0 is currently computing the distribution in three files that Input 0 has provided.

Cruncher progress can also be monitored in the output component: if an item in the list has the prefix \*, its results are not ready yet (the cruncher is still working on that file).

![Alt text](images/de5.png?raw=true "")

In this image the Input and Cruncher components have finished their work from the previous image.

The output component is showing the distribution of words in the file wiki-7.txt.

It is also currently computing the sum distribution that the user specified.

![Alt text](images/de6.png?raw=true "")

In this example the output component is computing the specified distribution sum (aggregation) and is waiting for the final file results to become available, in order to finish the computation.

## Component details:
Every component instance runs in its own thread and every component type has a dedicated thread pool for completing its main tasks.

Components communicate among each other using blocking queues. Every component has a blocking queue that its predecessors can write to.
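The queue-based hand-off described above can be sketched roughly as follows. This is an illustration only, not the project's code: the class `QueuePipelineSketch`, the method `runPipeline`, and the `POISON` end marker are hypothetical names, and a plain string stands in for the contents of a text file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: an "input" (the calling thread) hands texts to a
// "cruncher" thread through the cruncher's own blocking queue.
class QueuePipelineSketch {
    // Assumed end-of-stream marker; the real project may stop components differently.
    static final String POISON = "__STOP__";

    static List<Integer> runPipeline(List<String> texts) {
        BlockingQueue<String> cruncherQueue = new LinkedBlockingQueue<>();
        List<Integer> wordCounts = new ArrayList<>();

        Thread cruncher = new Thread(() -> {
            try {
                while (true) {
                    String text = cruncherQueue.take(); // blocks until a predecessor writes
                    if (POISON.equals(text)) {
                        break;
                    }
                    String trimmed = text.trim();
                    wordCounts.add(trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        cruncher.start();

        try {
            for (String text : texts) {
                cruncherQueue.put(text); // the predecessor writes into the successor's queue
            }
            cruncherQueue.put(POISON);
            cruncher.join(); // after join, the cruncher's writes to wordCounts are visible
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while running the sketch", e);
        }
        return wordCounts;
    }
}
```

Calling `QueuePipelineSketch.runPipeline(Arrays.asList("a b c", "hello world"))` would return the word count of each text in order. The consumer owns the queue, which matches the description that predecessors write into a component's queue.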

### Input components:
Every input component can be linked to one or more cruncher components.

The main objective of input components is to scan directories for text files which are then read and supplied to linked crunchers.

The reading of text files is done in a separate task within the input thread pool.

Input components are tied to a disk (drive) that the user specifies when creating a new instance.

Only directories on the specified disk can be scanned, and only one reading task can be active in the thread pool per disk.

After one scan cycle is finished, the component pauses for a duration specified in the config file before starting the next cycle.

The user can manually pause and resume input components.

The last modified value of scanned directories is tracked, so if a directory has been modified, it is scanned again (the text files are read again).
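One way to picture the last-modified tracking is the sketch below. It is an assumption about how such a check might look, not the project's implementation: the class `DirectoryScanSketch` and the method `changedDirectories` are hypothetical names.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: remember each directory's last-modified timestamp and
// report only the directories that are new or changed since the previous scan.
class DirectoryScanSketch {
    private final Map<String, Long> lastSeen = new HashMap<>();

    List<File> changedDirectories(List<File> directories) {
        List<File> changed = new ArrayList<>();
        for (File dir : directories) {
            long modified = dir.lastModified();
            // put() returns the previously stored timestamp, or null on the first scan
            Long previous = lastSeen.put(dir.getAbsolutePath(), modified);
            if (previous == null || previous != modified) {
                changed.add(dir);
            }
        }
        return changed;
    }
}
```

In a scan cycle, only the directories returned by `changedDirectories` would need their text files read again.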

### Cruncher components:
In the current implementation, cruncher components are automatically linked to one default output component, but the code supports multiple output components.

The main objective of cruncher components is to compute the word distribution of the text supplied by linked input components and to pass the results to linked output components.

Upon receiving input text, a new RecursiveTask is created within the cruncher thread pool and a Future object is forwarded to all linked output components.

The task recursively splits the job (text) into smaller chunks by creating new subtasks (the chunk size is specified in the config file). The distribution is computed for each chunk, and the partial results are then combined.
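The split, count, and combine steps can be sketched with the standard fork/join classes. This is a minimal sketch under stated assumptions, not the project's code: the class `WordCountTask`, its `chunkLimit` parameter (a size in characters), and the helper `submit` are hypothetical, and it counts single words only. `ForkJoinPool.submit` returns a `ForkJoinTask`, which implements `Future`, so a handle like the one returned here could be forwarded to consumers.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinTask;
import java.util.concurrent.RecursiveTask;

// Hypothetical sketch: count words in text[start, end), splitting recursively
// until a chunk is no longer than chunkLimit characters.
class WordCountTask extends RecursiveTask<Map<String, Long>> {
    private final String text;
    private final int start;
    private final int end;        // exclusive
    private final int chunkLimit; // assumed to be a size in characters

    WordCountTask(String text, int start, int end, int chunkLimit) {
        this.text = text;
        this.start = start;
        this.end = end;
        this.chunkLimit = Math.max(1, chunkLimit);
    }

    @Override
    protected Map<String, Long> compute() {
        if (end - start <= chunkLimit) {
            return countDirectly();
        }
        int mid = start + (end - start) / 2;
        // Move the split point forward to whitespace so that no word is cut in half.
        while (mid < end && !Character.isWhitespace(text.charAt(mid))) {
            mid++;
        }
        if (mid >= end) {
            return countDirectly(); // no safe split point in the second half
        }
        WordCountTask left = new WordCountTask(text, start, mid, chunkLimit);
        WordCountTask right = new WordCountTask(text, mid, end, chunkLimit);
        left.fork();                                  // run the left half asynchronously
        Map<String, Long> result = right.compute();   // compute the right half in this thread
        left.join().forEach((word, n) -> result.merge(word, n, Long::sum));
        return result;
    }

    private Map<String, Long> countDirectly() {
        Map<String, Long> counts = new HashMap<>();
        for (String word : text.substring(start, end).split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1L, Long::sum);
            }
        }
        return counts;
    }

    // Submits the whole text to the common pool and returns a Future-like handle.
    static ForkJoinTask<Map<String, Long>> submit(String text, int chunkLimit) {
        return ForkJoinPool.commonPool().submit(new WordCountTask(text, 0, text.length(), chunkLimit));
    }
}
```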
Every cruncher instance has a specified arity.

If arity = 1, the cruncher counts the number of times every single word appears in a text. If arity = 3, it counts the number of times every sequence of three consecutive words, in exactly that order, appears in a text, and so on.
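The effect of arity can be illustrated with a short sliding-window count. This is a simplified sketch, not the project's code: the class `ArityCountSketch` and the method `countSequences` are hypothetical, words are assumed to be separated by whitespace, and chunking is ignored.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: count every run of `arity` consecutive words, in order.
class ArityCountSketch {
    static Map<String, Long> countSequences(String text, int arity) {
        if (arity < 1) {
            throw new IllegalArgumentException("arity must be at least 1");
        }
        String trimmed = text.trim();
        String[] words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        Map<String, Long> counts = new LinkedHashMap<>();
        // Slide a window of `arity` words over the text, one word at a time.
        for (int i = 0; i + arity <= words.length; i++) {
            String key = String.join(" ", Arrays.copyOfRange(words, i, i + arity));
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }
}
```

For the text `a b a b a`, arity 1 counts `a` three times and `b` twice, while arity 2 counts `a b` twice and `b a` twice.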

### Output components:
Output components store results provided by the linked crunchers.

The results can be aggregated (this is done within the output component thread pool), sorted, and plotted on the graph.

Output components are aware of all created jobs, even unfinished ones (active jobs have \* as a prefix).

The component offers get (blocking) and poll (non-blocking) methods for retrieving results.

Single-result plotting uses the poll method and notifies the user if the results are not ready yet.

The aggregation task uses the get method and waits (is blocked) if some results are not ready yet.
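The difference between the two retrieval methods can be sketched on top of `Future`. The names below (`ResultStoreSketch`, `register`) are hypothetical, and returning `null` from `poll` for an unfinished job is an assumption made for this sketch, not something the project documents.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

// Hypothetical sketch: an output component keeps one Future per job name.
class ResultStoreSketch {
    private final Map<String, Future<Map<String, Long>>> results = new ConcurrentHashMap<>();

    void register(String jobName, Future<Map<String, Long>> future) {
        results.put(jobName, future);
    }

    // Non-blocking: returns null if the job is unknown or its result is not ready yet.
    Map<String, Long> poll(String jobName) {
        Future<Map<String, Long>> future = results.get(jobName);
        if (future == null || !future.isDone()) {
            return null;
        }
        return get(jobName);
    }

    // Blocking: waits until the job's result becomes available.
    Map<String, Long> get(String jobName) {
        Future<Map<String, Long>> future = results.get(jobName);
        if (future == null) {
            throw new IllegalArgumentException("unknown job: " + jobName);
        }
        try {
            return future.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for " + jobName, e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("job failed: " + jobName, e.getCause());
        }
    }
}
```

With this shape, a plotting action could call `poll` and show a "not ready" message on `null`, while an aggregation task could call `get` for each job and simply wait.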

All types of results (single or aggregated) are sorted before plotting.

## System quality:
The application is optimized to use as little RAM as possible. If it runs out of memory anyway, the user is notified and the application shuts down.

GUI buttons, lists, and labels are kept up to date and are enabled only when that makes sense.

When an error occurs, the user is notified with an error message alert.

While the application is exiting, no new jobs can be started, and all unfinished jobs must complete first (reading a text file, a cruncher working on a file, the output component aggregating, sorting, or plotting results). If unfinished jobs exist, the user is shown a modal dialog stating that the application is in the process of exiting.

## Configuration file (app.properties):
Parameters are read during app start and cannot be changed during app operation.

File structure:

file_input_sleep_time=5000 - pause duration for the input component

disks=data/disk1/;data/disk2 - list of disks for the input component

counter_data_limit=10000000 - size limit, in characters, for a single counting task

sort_progress_limit=10000 - number of comparisons after which the progress bar is updated during sorting
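Reading these parameters once at startup could look roughly like the sketch below, which uses the standard `java.util.Properties` class. The class `AppConfigSketch`, its field names, and the `parse` helper are hypothetical. Treating `disks` as a semicolon-separated list follows the example value above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical sketch: an immutable view of the four parameters listed above.
class AppConfigSketch {
    final long fileInputSleepTime;
    final String[] disks;
    final int counterDataLimit;
    final int sortProgressLimit;

    AppConfigSketch(Properties props) {
        fileInputSleepTime = Long.parseLong(props.getProperty("file_input_sleep_time").trim());
        disks = props.getProperty("disks").trim().split(";");
        counterDataLimit = Integer.parseInt(props.getProperty("counter_data_limit").trim());
        sortProgressLimit = Integer.parseInt(props.getProperty("sort_progress_limit").trim());
    }

    // Parses properties-file content; a real application would read app.properties from disk.
    static AppConfigSketch parse(String content) {
        Properties props = new Properties();
        try (StringReader reader = new StringReader(content)) {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new AppConfigSketch(props);
    }
}
```

Because all fields are `final`, the values cannot change after construction, which mirrors the statement that parameters are fixed while the app is running.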

## Sidenote
This project was an assignment for the course Concurrent and Distributed Systems, taken during the 8th semester at the Faculty of Computer Science in Belgrade. All system functionalities were defined in the assignment specifications.

## Download
You can download the .jar files [here](download/Concurrent-Distribution-Tool.zip).

## Contributors
- Stefan Ginic