An open API service indexing awesome lists of open source software.

https://github.com/dmuth/splunk-lab

Learn Splunk by creating a lab instance in seconds. Includes Eventgen and Splunk's Machine Learning app!
https://github.com/dmuth/splunk-lab

Last synced: about 1 year ago
JSON representation

Learn Splunk by creating a lab instance in seconds. Includes Eventgen and Splunk's Machine Learning app!

Awesome Lists containing this project

README

          

# Splunk Lab

This project lets you stand up a Splunk instance in Docker on a quick and dirty basis.

But what is Splunk? Splunk is a platform for big data collection and analytics. You feed your events from syslog, webserver logs, or application logs into Splunk, and can use queries to extract meaningful insights from that data.

## Quick Start!

Paste either of these on the command line:

`bash <(curl -s https://raw.githubusercontent.com/dmuth/splunk-lab/master/go.sh)`

`bash <(curl -Ls https://bit.ly/splunklab)`

...and the script will print up what directory it will ingest logs from, your password, etc. Follow the on-screen
instructions for setting environment variables and you'll be up and running in no time! Whatever logs you had sitting in your `logs/` directory will be searchable in Splunk with the search `index=main`.

If you want to see neat things you can do in Splunk Lab, check out the Cookbook section.

Also, the script will craete a directory called `bin/` with some helper scripts in it. Be sure to check them out!

### Useful links after starting

- [https://localhost:8000/](https://localhost:8000/) - Default port to log into the local instance. Username is `admin`, password is what was set when starting Splunk Lab.
- [Splunk Dashboard Examples](https://localhost:8000/en-US/app/simple_xml_examples/contents) - Wanna see what you can do with Splunk? Here are some example dashboards.

## Features

- App databoards can be stored in the local filesystem (they don't dissappear when the container exits)
- Ingested data can be stored in the local filesystem
- Multiple REST and RSS endpoints "built in" to provide sources of data ingestion
- Integration with REST API Modular Input
- Splunk Machine Learning Toolkit included
- `/etc/hosts` can be appended to with local ip/hostname entries
- Ships with Eventgen to populate your index with fake webserver events for testing.

## Screenshots

These are screenshots with actual data from production apps which I built on top of Splunk Lab:













## Splunk Lab Cookbook

What can you do with Splunk Lab? Here are a few examples of ways you can use Splunk Lab:

### Ingest some logs for viewing, searching, and analysis

- Drop your logs into the `logs/` directory.
- `bash <(curl -Ls https://bit.ly/splunklab)`
- Go to https://localhost:8000/
- Ingsted data will be written to `data/` which will persist between runs.

### Ingest some logs for viewing, searching, and analysis but DON'T keep ingested data between runs

- `SPLUNK_DATA=no bash <(curl -Ls https://bit.ly/splunklab)`
- Note that `data/` will not be written to and launching a new container will cause `logs/` to be indexed again.
- This will increase ingestion rate on Docker for OS/X, as there are some issues with the filesystem driver in OS/X Docker.

### Play around with synthetic webserver data

- `SPLUNK_EVENTGEN=1 bash <(curl -Ls https://bit.ly/splunklab)`
- Fake webserver logs will be written every 10 seconds and can be viewed with the query `index=main sourcetype=nginx`. The logs are based on actual HTTP requests which have come into the webserver hosting my blog.

### Adding Hostnames into /etc/hosts

- Edit a local hosts file
- `ETC_HOSTS=./hosts bash <(curl -Ls https://bit.ly/splunklab)`
- This can be used in conjunction with something like Splunk Network Monitor to ping hosts that don't have DNS names, such as your home's webcam. :-)

### Get the Docker command line for any of the above

- Run any of the above with `PRINT_DOCKER_CMD=1` set, and the Docker command line that's used will be written to stdout.

### Run Splunk Lab in Development Mode with a bash Shell

This would normally be done with the script `./bin/devel.sh` when running from the repo,
but if you're running Splunk Lab just with the Docker image, here's how to do it:

`docker run -p 8000:8000 -e SPLUNK_PASSWORD=password1 -v $(pwd)/data:/data -v $(pwd)/logs:/logs --name splunk-lab --rm -it -v $(pwd):/mnt -e SPLUNK_DEVEL=1 dmuth1/splunk-lab bash`

This is useful mainly if you want to poke around in Splunk Lab while it's running. Note that you
could always just run `docker exec splunk-lab bash` instead of doing all of the above. :-)

## Splunk Apps Included

The following Splunk apps are included in this Docker image:

- Eventgen
- Splunk Dashboard Examples

- REST API Modular Input (requires registration)
- Wordcloud Custom Visualization
- Slack Notification Alert
- Splunk Machine Learning Toolkit
- Python for Scientific Computing (for Linux 64-bit)
- NLP Text Analytics
- Halo - Custom Visualization
- Sankey Diagram - Custom Visualization

All apps are covered under their own license. Please check the Apps page
for more info.

Splunk has its own license. Please abide by it.

## Free Sources of Data

I put together this curated list of free sources of data which can be pulled into Splunk
via one of the included apps:

- RSS
- Recent questions posted to Splunk Answers
- CNN RSS feeds
- Flickr's Public feed
- Public Photos
- Public photos tagged "cheetah"
- REST (you will need to set `$REST_KEY` when starting Splunk Lab)
- Non-streaming
- Philadelphia Public Transit API
- Regional Rail Train Data
- Coinbase API
- National Weather Service
- Philadelphia Forecast
- Philadelphia Hourly Forecast
- Alpha Vantage - Free stock quotes
- Streaming
- Meetup RSVPs
- RSVP Endpoint

## Apps Built With Splunk Lab

Since building Splunk Lab, I have used it as the basis for building other projects:

- SEPTA Stats
- Website with real-time stats on Philadelphia Regional Rail.
- Pulled down over 60 million train data points over 4 years using Splunk.
- Splunk Twint
- Splunk dashboards for Twitter timelines downloaded by Twint. This now a part of the TWINT Project.
- Splunk Yelp Reviews
- This project lets you pull down Yelp reviews for venues and view visualizations and wordclouds of positive/negative reviews in a Splunk dashboard.
- Splunk Glassdoor Reviews
- Similar to Splunk Yelp, this project lets you pull down company reviews from Glassdoor and Splunk them
- Splunk Telegram
- This app lets you run Splunk against messages from Telegram groups and generate graphs and word clouds based on the activity in them.
- Splunk Network Health Check
- Pings 1 or more hosts and graphs the results in Splunk so you can monitor network connectivity over time.
- Splunk Fitbit
- Analyzes data from your Fitbit
- Splunk for AWS S3 Server Access Logs
- App to analyize AWS S3 Access Logs

Here's all of the above, presented as a graph:

## Building Your Own Apps Based on Splunk Lab

A sample app (and instructions on how to use it) are in the
sample-app directory.
Feel free to expand on that app for your own apps.

## A Word About Security

HTTPS is turned on by default. Passwords such as `password` and 12345 are not permitted.

Please, for the love of god, use a strong password if you are deploying
this on a public-facing machine.

## FAQ

### How do I get a valid SSL cert on localhost?

Yes, you can!

First, install mkcert and then run `mkcert -install && mkcert localhost 127.0.0.1 ::1` to generate a local CA and a cert/key combo for localhost.

Then, when you run Splunk Lab, set the environment variables `SSL_KEY` and `SSL_CERT` and those files will be pulled into Splunk Lab.

Example: `SSL_KEY=./localhost.key SSL_CERT=./localhost.pem ./go.sh`

### How do I get this to work in Vagrant?

TL;DR If you're on a Mac, use OrbStack.

If you're running Docker in Vagrant, or just plain Vagrant, you'll run into issues because Splunk does some low-level stuff with its Vagrant directory that will result in errors in `splunkd.log` that look like this:

```
11-15-2022 01:45:31.042 +0000 ERROR StreamGroup [217 IndexerTPoolWorker-0] - failed to drain remainder total_sz=24 bytes_freed=7977 avg_bytes_per_iv=332 sth=0x7fb586dfdba0: [1668476729, /opt/splunk/var/lib/splunk/_internaldb/db/hot_v1_1, 0x7fb587f7e840] reason=st_sync failed rc=-6 warm_rc=[-35,1]
```

To work around this, disable sharing of Splunk's data directory by setting `SPLUNK_DATA=no`, like this:

`SPLUNK_DATA=no SPLUNK_EVENTGEN=yes ./go.sh`

By doing this, any data ingested into Spunk will not persist between runs. But to be fair, Splunk Lab is meant for development usage of Splunk, not long-term usage.

### Does this work on Macs?

Sure does! I built this on a Mac. :-)

For best results, run under OrbStack.

## Development

I wrote a series of helper scripts in `bin/` to make the process easier:

- `./bin/download.sh` - Download tarballs of various apps and splits some of them into chunks
- If downloading a new version of Splunk, edit `bin/lib.sh` and bump the `SPLUNK_VERSION` and `SPLUNK_BUILD` variables.
- `./bin/build.sh [ --force ]` - Build the containers.
- Note that this downloads packages from an AWS S3 bucket that I created. This bucket is set to "requestor pays", so you'll need to make sure the `aws` CLI app set up.
- If you are (re)building Splunk Lab, you'll want to use `--force`.
- `./bin/upload-file-to-s3.sh` - Upload a specific file to S3. For rolling out new versions of apps
- `./bin/devel.sh` - Build and tag the container, then start it with an interactive bash shell.
- This is a wrapper for the above-mentioned `go.sh` script. Any environment variables that work there will work here.
- **To force rebuilding a container during development** touch the associated Dockerfile in `docker/`. E.g. `touch docker/1-splunk-lab` to rebuild the contents of that container.
- `./bin/push.sh` - Tag and push the container.
- `./bin/create-1-million-events.py` - Create 1 million events in the file `1-million-events.txt` in the current directory.
- If not in `logs/` but reachable from the Docker container, the file can then be oneshotted into Splunk with the following command: `/opt/splunk/bin/splunk add oneshot ./1-million-events.txt -index main -sourcetype oneshot-0001`
- `./bin/kill.sh` - Kill a running `splunk-lab` container.
- `./bin/attach.sh` - Attach to a running `splunk-lab` container.
- `./bin/clean.sh` - Remove `logs/` and/or `data/` directories.
- `./bin/tarsplit` - Local copy of my pacakge from https://github.com/dmuth/tarsplit

### Building a New Version of Splunk

- Bump version number and build number in `bin/lib.sh`
- Run `./bin/build.sh`, use `--force` if necessary
- This can take several MINUTES, especially if no apps are cached locally
- Run `SPLUNK_EVENTGEN=yes SPLUNK_ML=yes ./bin/devel.sh`
- This will build and tag the container, and spawn an interactive shell
- Run `/opt/splunk/bin/splunk version` inside the container to verify the version number
- Go to https://localhost:8000/ and verify you can log into Splunk
- Run the query `index=main earliest=-1d` and verify Eventgen events are coming in
- Go to https://localhost:8000/en-US/app/Splunk_ML_Toolkit/contents and verify that the ML Toolkit has been installed.
- Type `exit` in the shell to shut down the server
- Run `./bin/push.sh` to deploy the image. This will take awhile.

### Building Container Internals

- Here's the layout of the `cache/` directory
- `cache/` - Where tarballs for Splunk and its apps hang out. These are downloaded when `bin/download.sh` is run for the first time.
- `cache/deploy/` - When creating a specific Docker image, files are copied here so the Dockerfile can ingest them. (Or rather hardlinked to the files in the parent directory.)
- `cache/build/` - 0-byte files are written here when a specific container is built, and on future builds, the age of that file is checked against the Dockerfile. If the Dockerfile is newer, then the container is (re-)built. Otherwise, it is skipped. This shortens a run of `bin/devel.sh` where no containers need to be built from 12 seconds on my 2020 iMac to 0.2 seconds.

### A word on default/ and local/ directories

I had to struggle with this for awhile, so I'm mostly documenting it here.

When in devel mode, `/opt/splunk/etc/apps/splunk-lab/` is mounted to `./splunk-lab-app/` via `go.sh`
and the entrypoint script inside of the container symlinks `local/` to `default/`.
This way, any changes that are made to dashboards will be propagated outside of
the container and can be checked in to Git.

When in production mode (e.g. running `./go.sh` directly), no symlink is created,
instead `local/` is mounted by whatever `$SPLUNK_APP` is pointing to (default is `app/`), so that any
changes made by the user will show up on their host, with Splunk Lab's `default/`
directory being untouched.

## Additional Reading

- Splunk Network Health Check

## Notes/Bugs

- The Docker containers are **dmuth1/splunk-lab** and **dmuth1/splunk-lab-ml**. The latter has all of the Machine Learning apps built in to the image. Feel free to extend those for your own projects.
- If I run `./bin/create-test-logfiles.sh 10000` and then start Splunk Lab on a Mac, all of the files will be Indexed without any major issues, but then the CPU will spin, and not from Splunk.
- The root cause is that the filesystem code for Docker volume mappings on OS/X's Docker implementation is VERY inefficient in terms of both CPU and memory usage, especially when there are 10,000 files involved. The overhead is just crazy. When reading events from a directory mounted through Docker, I see about 100 events/sec. When the directory is local to the container, I see about 1,000 events/sec, for a 10x difference.
- The HTTPS cert is self-signed with Splunk's own CA. If you're tired of seeing a Certificate Error every time you try connecting to Splunk, you can follow the instructions at https://stackoverflow.com/a/31900210/196073 to allow self-signed certificates for `localhost` in Google Chrome.
- Please understand the implications before you do this.

## Credits

- Splunk N' Box - Splunk N' Box is used to create entire Splunk clusters in Docker. It was the first actual use of Splunk I saw in Docker, and gave me the idea that hey, maybe I could run a stand-alone Splunk instance in Docker for ad-hoc data analysis!
- Splunk, for having such a fantastic product which is also a great example of Operational Excellence!
- Eventgen is a super cool way of generating simulating real data that can be used to generate dashboards for testing and training purposes.
- This text to ASCII art generator, for the logo I used in the script.
- The logo was made over at https://www.freelogodesign.org/
- Lars Wirzenius for a review of this README.

## Copyrights

- Splunk is copyright by Splunk, Inc. Please stay within the confines of the 500 MB/day free license when using Splunk Lab, unless you brought your own license along.
- The various apps are copyright by the creators of those apps.

## Contact

My email is doug.muth@gmail.com. I am also @dmuth on Twitter
and Facebook!