https://github.com/sunilsoni/cassandra-data-modeling
Basic Rules of Cassandra Data Modeling
https://github.com/sunilsoni/cassandra-data-modeling
cassandra cassandra-cql data-modeling time-series-database
Last synced: 10 days ago
JSON representation
Basic Rules of Cassandra Data Modeling
- Host: GitHub
- URL: https://github.com/sunilsoni/cassandra-data-modeling
- Owner: sunilsoni
- Created: 2018-02-11T07:04:47.000Z (about 7 years ago)
- Default Branch: master
- Last Pushed: 2018-02-15T17:34:09.000Z (about 7 years ago)
- Last Synced: 2025-03-26T08:06:50.770Z (28 days ago)
- Topics: cassandra, cassandra-cql, data-modeling, time-series-database
- Homepage: https://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling
- Size: 72 MB
- Stars: 56
- Watchers: 7
- Forks: 37
- Open Issues: 0
-
Metadata Files:
- Readme: README.adoc
Awesome Lists containing this project
README
= Cassandra Data Modeling
== https://wiki.apache.org/cassandra/DataModel[Introduction:]
Cassandra is a partitioned row store, where rows are organized into tables with a required primary key.
The first component of a table's primary key is the partition key; within a partition, rows are clustered by the remaining columns of the PK. Other columns may be indexed independent of the PK.
This allows pervasive denormalization to "pre-build" resultsets at update time, rather than doing expensive joins across the cluster.
== https://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling[Basic Rules of Cassandra Data Modeling]
== https://youtu.be/B_HTdrTgGNs[Introduction To Apache Cassandra with Patrick McFadin]
== https://www.youtube.com/watch?v=tg6eIht-00M&t=2s[ Tech Talk: Cassandra Data Modeling TimeSeries with Patrick McFadin]
== https://www.youtube.com/watch?v=N2zIlVhKXTc&t=29s[ Introduction to Cassandra Data Model | Edureka]
== https://www.youtube.com/watch?v=Vv3QJxAdjic[Midwest.io 2014 - Time Series with Apache Cassandra - Patrick McFadin]
== https://www.youtube.com/watch?v=-zyZ35YyT_8[cassandra data modeling - Practical considerations @ netflix]
https://github.com/jaegertracing/jaeger/blob/master/plugin/storage/cassandra/schema/v001.cql.tmpl
Stackoverflow::
https://stackoverflow.com/questions/17945677/cassandra-uuid-vs-timeuuid-benefits-and-disadvantages[Cassandra UUID vs TimeUUID benefits and disadvantages?]
Sample Tables:CREATE TABLE sensor_readings (
sensorID uuid,
time_bucket int,
timestamp bigint,
reading decimal,
PRIMARY KEY ((sensorID, time_bucket), timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);SELECT * FROM sensor_readings
WHERE sensorID = 53755080-4676-11e4-916c-0800200c9a66
AND time_bucket IN (1411840800, 1411844400)
AND timestamp >= 1411841700
AND timestamp <= 1411845300;CREATE TABLE IF NOT EXISTS ${keyspace}.traces (
trace_id blob,
span_id bigint,
span_hash bigint,
parent_id bigint,
operation_name text,
flags int,
start_time bigint,
duration bigint,
tags list>,
logs list>,
refs list>,
process frozen,
PRIMARY KEY (trace_id, span_id, span_hash)
)
WITH compaction = {
'compaction_window_size': '1',
'compaction_window_unit': 'HOURS',
'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy'
}
AND dclocal_read_repair_chance = 0.0
AND default_time_to_live = ${trace_ttl}
AND speculative_retry = 'NONE'
AND gc_grace_seconds = 10800; -- 3 hours of downtime acceptable on nodes
CREATE TABLE IF NOT EXISTS ${keyspace}.duration_index (
service_name text, // service name
operation_name text, // operation name, or blank for queries without span name
bucket timestamp, // time bucket, - the start_time of the given span rounded to an hour
duration bigint, // span duration, in microseconds
start_time bigint,
trace_id blob,
PRIMARY KEY ((service_name, operation_name, bucket), duration, start_time, trace_id)
) WITH CLUSTERING ORDER BY (duration DESC, start_time DESC)
AND compaction = {
'compaction_window_size': '1',
'compaction_window_unit': 'HOURS',
'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy'
}
AND dclocal_read_repair_chance = 0.0
AND default_time_to_live = ${trace_ttl}
AND speculative_retry = 'NONE'
AND gc_grace_seconds = 10800; -- 3 hours of downtime acceptable on nodes
Sequential writes can cause hot spots: If the application tends to write or
update a sequential block of rows at a time, the writes will not be distributed
across the cluster. They all go to one node. This is frequently a problem for
applications dealing with timestamped data