https://github.com/phact/dsefs-demo
https://github.com/phact/dsefs-demo
Last synced: 12 months ago
JSON representation
- Host: GitHub
- URL: https://github.com/phact/dsefs-demo
- Owner: phact
- Created: 2017-07-05T19:47:01.000Z (almost 9 years ago)
- Default Branch: master
- Last Pushed: 2018-10-18T13:25:11.000Z (over 7 years ago)
- Last Synced: 2025-06-20T16:11:28.382Z (12 months ago)
- Language: Shell
- Size: 28 MB
- Stars: 1
- Watchers: 2
- Forks: 4
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# dsefs-demo
This is a guide for how to use the power tools dsefs asset brought to you by the Vanguard team.
### Motivation
DSEFS is a relatively new part of DSE with particularly useful capabilities within the analytics space. The purpose of DSEFS is to prevent the need for DSE users with needs for a basic distributed file system to satisfy some of their real time DSE powered workloads from having to go install HDFS / call in a hadoop vendor. This asset is meant to demonstrate the basic capabilities of DSEFS.
### What is included?
This field asset includes sample usage of DSEFS in the following contexts:
* Puts and gets (without Spark)
* Puts and gets with Spark
* DSEFS loading, unloading, and joining to, from, and with Cassandra
* DSEFS for checkpointing of Spark Streaming jobs
### Business Take Aways
When a use case arises that requires a distributed filesystem, customers do not necesarlily have to get a third party hadoop vendor involved. Most of these cases can be covered by DSEFS.
Our intention is not to try to enter the hadoop market, but rather to ease the use cases that are mostly DSE powered but require a distributed file system as well.
### Technical Take Aways
Unlike it's precursor CFS, DSEFS stores the file data (sblocks) in the file system of the nodes in the cluster. It stores the file metatdata (inodes) in cassandra. CFS used to store both inodes and sblocks in cassandra leading to undesirable performance issues related to compactions, tombstones, and read and write throughput. The file system locations used by DSEFS are configurable in the dse.yaml and should be separate volumes than the cassandra data directory. DSEFS volumes can achieve significantly higher densities than cassandra since predictable latencies are less of an option. Up to 20TB per node is a good place to start.
DSEFS was designed to run with comparable (ballpark) performance to HDFS. As expected with a distributed file system, DSEFS performs better in therms of throughput with larger files than smaller files. When interacting with DSEFS programatically (today) we need to import denendencies from `dse.jar`. This will not always be the case as they will be added to the spark dependencies in artifactory, stay tuned.
These docs will dive deeper on the following functionality:
- Puts and Gets
- Spark dsefs interactions
- Checkpointing
## Startup Script
This Asset leverages
[simple-startup](https://github.com/jshook/simple-startup). To start the entire
asset run `./startup all` for other options run `./startup`