{"id":15056823,"url":"https://github.com/hurence/logisland","last_synced_at":"2025-04-07T14:13:40.502Z","repository":{"id":38983965,"uuid":"50912874","full_name":"Hurence/logisland","owner":"Hurence","description":"Scalable stream processing platform for advanced realtime analytics on top of Kafka and Spark. LogIsland also supports MQTT and Kafka Streams (Flink being in the roadmap). The platform does complex event processing and is suitable for time series analysis. A large set of valuable ready to use processors, data sources and sinks are available.","archived":false,"fork":false,"pushed_at":"2023-10-16T11:39:33.000Z","size":126100,"stargazers_count":109,"open_issues_count":183,"forks_count":29,"subscribers_count":19,"default_branch":"master","last_synced_at":"2025-04-07T14:12:58.574Z","etag":null,"topics":["analytics","big-data","cassandra","complex-event-processing","elasticsearch","influxdb","kafka","kafka-streams","pattern-recognition","solr","spark","stream-processing"],"latest_commit_sha":null,"homepage":"https://logisland.github.io","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Hurence.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.rst","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2016-02-02T10:27:21.000Z","updated_at":"2025-01-22T20:30:38.000Z","dependencies_parsed_at":"2023-11-13T06:45:47.210Z","dependency_job_id":null,"html_url":"https://github.com/Hurence/logisland","commit_stats":{"total_commits":2131,"total_committers":36,"mean_commits":59.19444444444444,"dds":0.7550445800093852,"last_synced_commit":"9866d5ebedbb9820c1fcab7058787b02b343cd48"},"previous_names":[],"tags_count":28,"tem
plate":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Hurence%2Flogisland","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Hurence%2Flogisland/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Hurence%2Flogisland/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Hurence%2Flogisland/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Hurence","download_url":"https://codeload.github.com/Hurence/logisland/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247666015,"owners_count":20975788,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["analytics","big-data","cassandra","complex-event-processing","elasticsearch","influxdb","kafka","kafka-streams","pattern-recognition","solr","spark","stream-processing"],"created_at":"2024-09-24T21:56:44.253Z","updated_at":"2025-04-07T14:13:40.479Z","avatar_url":"https://github.com/Hurence.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"Logisland\n=========\n\n.. image:: https://travis-ci.org/Hurence/logisland.svg?branch=master\n   :target: https://travis-ci.org/Hurence/logisland\n\n\n.. 
image:: https://badges.gitter.im/Join%20Chat.svg\n   :target: https://gitter.im/logisland/logisland?utm_source=share-link\u0026utm_medium=link\u0026utm_campaign=share-link\n   :alt: Gitter\n\n\nDownload the `latest release build \u003chttps://github.com/Hurence/logisland/releases\u003e`_ and\nchat with us on `gitter \u003chttps://gitter.im/logisland/logisland\u003e`_\n\n\n**LogIsland is a scalable event mining platform designed to handle a high throughput of events.**\n\nIt is heavily inspired by DataFlow programming tools such as Apache NiFi, but with a highly scalable architecture.\n\nLogIsland is completely open source and free even for commercial use. Hurence provides support if required.\n\n\nEvent mining Workflow\n---------------------\nHere is an example of a typical event mining pipeline.\n\n1. Raw events (sensor data, logs, user click stream, ...) are sent to Kafka topics by a NiFi / Logstash / *Beats / Flume / Collectd (or any other) agent.\n2. Raw events are structured in Logisland Records, then processed and eventually pushed back to another Kafka topic by a Logisland streaming job.\n3. Records are sent to external short-lived storage (Elasticsearch, Solr, Couchbase, ...) for online analytics.\n4. Records are sent to external long-lived storage (HBase, HDFS, ...) for offline analytics (aggregated reports or ML models).\n5. Logisland Processors handle Records to produce Alerts and Information from ML models.\n\n\nOnline documentation\n--------------------\nYou can find the latest Logisland documentation, including a programming guide,\non the `project web page \u003chttp://logisland.readthedocs.io/en/latest/index.html\u003e`_.\n\nIt is also available on this `website 
\u003chttps://logisland.github.io/docs/\u003e`_\n\nThis README file only contains basic setup instructions.\n\nBrowse the `Java API documentation \u003chttp://logisland.readthedocs.io/en/latest/_static/apidocs/\u003e`_ for more information.\n\n\nYou can follow a getting-started guide in the\n`apache log indexing tutorial \u003chttp://logisland.readthedocs.io/en/latest/tutorials/index-apache-logs.html\u003e`_.\n\n\nBuilding Logisland\n------------------\nTo build from source, clone the repository and package it with Maven (Logisland requires **Maven 3.5.2** or later)\n\n.. code-block:: sh\n\n    git clone https://github.com/Hurence/logisland.git\n    cd logisland\n    mvn clean package\n\nThe final package is available at `logisland-assembly/target/logisland-1.4.1-full-bin.tar.gz`\n\nYou can also download the `latest release build \u003chttps://github.com/Hurence/logisland/releases\u003e`_\n\n\nIf you want to build with OpenCV support, install OpenCV first and then run::\n\n     mvn clean package -Dopencv\n\nQuick start\n-----------\n\nLocal Setup\n+++++++++++\nYou can deploy **logisland** on any Linux server on which Kafka and Spark are available.\n\nReplace all version placeholders in the code below with the versions you need (Spark version, Logisland version for a specific HDP version, Kafka Scala version, Kafka version, etc.).\n\nThe Kafka distributions are available at this address: \u003chttps://kafka.apache.org/downloads\u003e\n\nThe last tested Scala version for Kafka is **2.11**, with Kafka release **0.10.2.2** preferred.\n\nThe last tested Spark version is **2.3.1**, on Hadoop **2.7**.\n\nHowever, choose the Spark version that is compatible with your environment and Hadoop installation, if you have one (for example Spark **2.1.0** on Hadoop **2.7**). Note that Hadoop 2.7 can run Spark 2.4.x, 2.3.x, 2.2.x, 2.1.x. Check what is available at this URL: http://d3kbcqa49mib13.cloudfront.net/\n\n.. 
code-block:: sh\n\n    # install Kafka \u0026 start a zookeeper node + a broker\n    curl -s https://www-us.apache.org/dist/kafka/\u003ckafka_release\u003e/kafka_\u003cscala_version\u003e-\u003ckafka_version\u003e.tgz | tar -xz -C /usr/local/\n    cd /usr/local/kafka_\u003cscala_version\u003e-\u003ckafka_version\u003e\n    nohup bin/zookeeper-server-start.sh config/zookeeper.properties \u003e zookeeper.log 2\u003e\u00261 \u0026\n    JMX_PORT=10101 nohup bin/kafka-server-start.sh config/server.properties \u003e kafka.log 2\u003e\u00261 \u0026\n\n    # install Spark (choose the spark version compatible with your hadoop distrib if you have one)\n    curl -s http://d3kbcqa49mib13.cloudfront.net/spark-\u003cspark-version\u003e-bin-hadoop\u003chadoop-version\u003e.tgz | tar -xz -C /usr/local/\n    export SPARK_HOME=/usr/local/spark-\u003cspark-version\u003e-bin-hadoop\u003chadoop-version\u003e\n\n    # install Logisland 1.4.1 (-L follows the redirect that GitHub release downloads return)\n    curl -sL https://github.com/Hurence/logisland/releases/download/v1.0.0-RC2/logisland-1.0.0-RC2-bin.tar.gz  | tar -xz -C /usr/local/\n    cd /usr/local/logisland-1.4.1\n\n    # launch a logisland job\n    bin/logisland.sh --conf conf/index-apache-logs.yml\n\nYou can find some **logisland** job configuration samples under the `$LOGISLAND_HOME/conf` folder.\n\n\nDocker setup\n++++++++++++\nThe easiest way to start is to launch a Docker Compose stack\n\n.. code-block:: sh\n\n    # launch logisland environment\n    cd /tmp\n    curl -s https://raw.githubusercontent.com/Hurence/logisland/master/logisland-framework/logisland-resources/src/main/resources/conf/docker-compose.yml \u003e docker-compose.yml\n    docker-compose up\n\n    # sample execution of a logisland job\n    docker exec -i -t logisland bin/logisland.sh --conf conf/index-apache-logs.yml\n\n\nHadoop distribution setup\n+++++++++++++++++++++++++\nLaunching Logisland streaming apps is as easy as unarchiving the Logisland distribution on an edge node, editing a config with YARN parameters and submitting the job.\n\n.. 
code-block:: sh\n\n    # install Logisland 1.4.1 (-L follows the redirect that GitHub release downloads return)\n    curl -sL https://github.com/Hurence/logisland/releases/download/v0.10.0/logisland-1.4.1-bin-hdp2.5.tar.gz  | tar -xz -C /usr/local/\n    cd /usr/local/logisland-1.4.1\n    bin/logisland.sh --conf conf/index-apache-logs.yml\n\n\nStart a stream processing job\n-----------------------------\n\nA Logisland stream processing job is made of several components:\nat least one streaming engine and one or more stream processors. You set them up with a YAML configuration file.\n\nPlease note that events are serialized against an Avro schema while transiting through any Kafka topic.\nEvery `spark.streaming.batchDuration` (time window), each processor handles its batch of Records and may\ngenerate new Records on the output topic.\n\nThe following `configuration.yml` file contains a sample job that parses raw Apache logs and sends them to Elasticsearch.\n\n\nThe first part is the `ProcessingEngine` configuration (here a Spark streaming engine).\n\n.. 
code-block:: yaml\n\n    version: 1.4.1\n    documentation: LogIsland job config file\n    engine:\n      component: com.hurence.logisland.engine.spark.KafkaStreamProcessingEngine\n      type: engine\n      documentation: Index some apache logs with logisland\n      configuration:\n        spark.app.name: IndexApacheLogsDemo\n        spark.master: yarn-cluster\n        spark.driver.memory: 1G\n        spark.driver.cores: 1\n        spark.executor.memory: 2G\n        spark.executor.instances: 4\n        spark.executor.cores: 2\n        spark.yarn.queue: default\n        spark.yarn.maxAppAttempts: 4\n        spark.yarn.am.attemptFailuresValidityInterval: 1h\n        spark.yarn.max.executor.failures: 20\n        spark.yarn.executor.failuresValidityInterval: 1h\n        spark.task.maxFailures: 8\n        spark.serializer: org.apache.spark.serializer.KryoSerializer\n        spark.streaming.batchDuration: 4000\n        spark.streaming.backpressure.enabled: false\n        spark.streaming.unpersist: false\n        spark.streaming.blockInterval: 500\n        spark.streaming.kafka.maxRatePerPartition: 3000\n        spark.streaming.timeout: -1\n        spark.streaming.kafka.maxRetries: 3\n        spark.streaming.ui.retainedBatches: 200\n        spark.streaming.receiver.writeAheadLog.enable: false\n        spark.ui.port: 4050\n      controllerServiceConfigurations:\n\nThen comes a list of `ControllerService` components, which are the shared components that interact with the outside world (Elasticsearch, HBase, ...).\n\n.. 
code-block:: yaml\n\n        - controllerService: datastore_service\n          component: com.hurence.logisland.service.elasticsearch.Elasticsearch_6_6_2_ClientService\n          type: service\n          documentation: elasticsearch service\n          configuration:\n            hosts: sandbox:9200\n            batch.size: 5000\n\nThen comes a list of `RecordStream`; each of them routes the input batch of `Record` through a pipeline of `Processor`\nto the output topic.\n\n.. code-block:: yaml\n\n      streamConfigurations:\n        - stream: parsing_stream\n          component: com.hurence.logisland.stream.spark.KafkaRecordStreamParallelProcessing\n          type: stream\n          documentation: a processor that converts raw apache logs into structured log records\n          configuration:\n            kafka.input.topics: logisland_raw\n            kafka.output.topics: logisland_events\n            kafka.error.topics: logisland_errors\n            kafka.input.topics.serializer: none\n            kafka.output.topics.serializer: com.hurence.logisland.serializer.KryoSerializer\n            kafka.error.topics.serializer: com.hurence.logisland.serializer.JsonSerializer\n            kafka.metadata.broker.list: sandbox:9092\n            kafka.zookeeper.quorum: sandbox:2181\n            kafka.topic.autoCreate: true\n            kafka.topic.default.partitions: 4\n            kafka.topic.default.replicationFactor: 1\n\nThen comes the configuration of the `Processor` pipeline. Each Record will go through these components.\nHere we first parse raw Apache logs and then add those records to Elasticsearch. Please note that the datastore processor makes\nuse of the previously defined ControllerService.\n\n.. 
code-block:: yaml\n\n          processorConfigurations:\n\n            - processor: apache_parser\n              component: com.hurence.logisland.processor.SplitText\n              type: parser\n              documentation: a parser that produces records from an apache log REGEX\n              configuration:\n                record.type: apache_log\n                value.regex: (\\S+)\\s+(\\S+)\\s+(\\S+)\\s+\\[([\\w:\\/]+\\s[+\\-]\\d{4})\\]\\s+\"(\\S+)\\s+(\\S+)\\s*(\\S*)\"\\s+(\\S+)\\s+(\\S+)\n                value.fields: src_ip,identd,user,record_time,http_method,http_query,http_version,http_status,bytes_out\n\n            - processor: es_publisher\n              component: com.hurence.logisland.processor.datastore.BulkPut\n              type: processor\n              documentation: a processor that indexes processed events in elasticsearch\n              configuration:\n                datastore.client.service: datastore_service\n                default.collection: logisland\n                default.type: event\n                timebased.collection: yesterday\n                collection.field: search_index\n                type.field: record_type\n\n\n\nOnce you've edited your configuration file, you can submit it to the execution engine with the following command:\n\n.. 
code-block:: bash\n\n    bin/logisland.sh --conf conf/job-configuration.yml\n\n\nNext, go to the `tutorials section \u003chttp://logisland.readthedocs.io/en/latest/tutorials/index.html\u003e`_ of the documentation,\nand then continue with the `components documentation \u003chttp://logisland.readthedocs.io/en/latest/components.html\u003e`_.\n\nContributing\n------------\n\nTo contribute, please follow the HubFlow Git workflow: https://datasift.github.io/gitflow/TheHubFlowTools.html\n\nPlease review the `Contribution to Logisland guide \u003chttp://logisland.readthedocs.io/en/latest/developer.html\u003e`_ for information on how to get started contributing to the project.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhurence%2Flogisland","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhurence%2Flogisland","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhurence%2Flogisland/lists"}