https://github.com/mjz86/string

A C++20 string implementation , with constexpr friendly design, SSO optimized, COW optimized, unicode and encoding awareness, string viewer, and with custom allocators
https://github.com/mjz86/string
cow encoding implementation optimized-memory sso unicode
Last synced: 3 months ago
JSON representation
A C++20 string implementation , with constexpr friendly design, SSO optimized, COW optimized, unicode and encoding awareness, string viewer, and with custom allocators
Host: GitHub
URL: https://github.com/mjz86/string
Owner: Mjz86
License: mit
Created: 2025-03-14T18:56:06.000Z (4 months ago)
Default Branch: main
Last Pushed: 2025-03-30T18:02:24.000Z (3 months ago)
Last Synced: 2025-03-30T19:19:52.648Z (3 months ago)
Topics: cow, encoding, implementation, optimized-memory, sso, unicode
Language: C++
Homepage:
Size: 567 KB
Stars: 12
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project

README

        # Reconsidering COW, a Modern C++20 String Implementation

**tl;dr:**

Less lifetime management (dangling views) while maintaining view (and value)

semantics and performance characteristics (no copying or allocation), also , the wrappers allow for tuning the sso size to any desired size , and all of them are compatible with each other. And

planned Unicode support.

# Abstract

This paper presents a custom string class designed for memory efficiency.

The paper is mainly focused on the 64-bit little-endian implementation, but other platforms also have equivalent functionality. The string class provides a way to store various encodings in different string types. It's allocator-aware and uses SSO and COW features while adding others with minimal overhead. The class is as `constexpr` friendly as the standard string, mostly comparable to the GCC implementation.

# Introduction

Many C++ programmers have worked with strings. Several have been forced to use either immutable value semantics with views or reference semantics and allocations for mutable strings. This string class bridges these, providing an owning tagged string view as its storage.

# Implementation Details

note that in the 32bit platform, there are two options,

one is big address space ( 56 bit) and one is small address space ( 24bit) , in the small case , the object size is 16bytes and in the large case 32bytes.

The basic layout can be thought of as the following (not valid C++, but its equivalent) (total of 32 bytes in my own implementation):

```c++

struct {

  (allocator reference as empty base);

  const char* begin_ptr;

  size_t  length;

  union{

    char  buffer[15];

    struct referencal_t{// the string is a reference to something

      char* data_block;

      size_t capacity:56;// in actuality only 7 bytes

    };

  };

  uint8_t  control_byte;

};

```

The control byte can be thought of as:

```c++

struct {

uint8_t  /*the negation of this is actually stored*/is_threaded:1;

  uint8_t  is_sharable:1;// this  indicates  that we are in a heap or litteral view , vs , sso or stack buffer or copying view.

 uint8_t is_ownerized:1;   /*this flag is not propagated by copy or share, only by move,in the non-move case,  if any side (dest or src) has this flag set to true, a memcpy and a potential allocation occurs (if sso or stack buffer is not large enough) */

 // to always disable cow and viewer for a specific string ,

 // to remove the reference_count checks,

 //controled  with  always_ownerize(bool flag_state),

 //  (is_sharable&&!is_ownerized) determines sharability

 // both of these flags are essential and they do not correlate.

  uint8_t has_null:1;// needed to share substrings

  uint8_t:1;//unused_for_now_

  uint8_t  encoding:3; // we are not a bag of bytes

};

```

The encoding flags are for knowing the encoding of the stored data.

The heap block can be thought of as:

```c++

struct {// this is just a 8 or 16 byte aligned char array with this layout

  size_t  reference_count;

  char heap_buffer[capacity];

};

```

# Invariants

1.  **At the string view state:**

    - `begin != buffer`.

    - `active union member == referencal_t`.

    - `begin != nullptr || 0 == length`.

    - `data_block == nullptr`.

    - `capacity == 0`.

    - `is_owner() == false`.

    - `is_ownerized == false`.

      the view optimization is essentially a cow string with static lifetime, therfore its not ownerized.

2.  **At the SSO string state:**

    - `begin == buffer`.

    - `active union member == buffer`.

    - `capacity == 15`.

    - `is_owner() == true`.

    - `is_sharable == false`.

    - `has_null == (length != 15)`.

3.  **Heap string state:**

    - `begin != buffer`.

    - `active union member == referencal_t`.

    - `begin != nullptr`.

    - `data_block == &heap_buffer`.

    - `capacity != 0`.

    - (`capacity` is almost always bigger than 15, but no guarantees are made)

    - `is_owner() ==(is_ownerized ||  (reference_count < 2))`.

    - ` !is_ownerized || reference_count < 2`.

    - `is_sharable == true`.

4.  **Stack buffer string state:**

    - `begin != buffer`.

    - `active union member == referencal_t`.

    - `begin != nullptr`.

    - `data_block != nullptr`.

    - `capacity != 0`.

    - `is_owner() == true`.

    - `is_sharable == false`.

Also, `[begin, end)` is a continuous sub-range of

`[data_block, data_block + capacity)` if and only if `data_block` is not

null and alive.

# Addressing COW and the Drawbacks

- note , if a string is showing bottlenecks on cow, try `ownerize()` or `always_ownerize(true)` or the `ownerized_string`.

- on the main string:

  All of the iterators, the methods, and functions only give constant

  references to the data.

- for mutatable iteration, consider the following:

  the `ownerized_string` wrapper is an equivalent of `std::string` ( null termination requirement is given via template flag to provide `c_str`).

  users can convert between these easily without any lifetime issues,

  and because its a wrapper, it wouldn't have code bloat.

  i think rust had a cow and non cow string as well.

```c++

struct ownerized_string /* in actuality,  this is a using statement of the real template wrapper*/{

private:

  //....

  mjz::string value;// is_ownerized being true is an invariant.

  // the standard string interface can be used( other than c_str )

};

```

i dont think the null terminator is important enough for a wrapper,

it shouldn't be dependent upon at all , its purely legacy.

but , its easy enough to add another template flag for it , so why not.

# How to do This Without My Help

The string design presented may be complex for readers to implement on their

own, but they can make it easily if they don't care about the object size or

indirections and allocation strategy (I needed the object size to be small,

but if not):

```c++

struct simpler_version{

  std::variant  // cow

                ,std::array         // sso

                ,std::span             // stack buffer

                > data;

  std::string_view view;

  encodings_t encoding;

  bool has_null;

  bool is_ownerized;

  // functionality....

};

```

# Thread-Safety (Opt-Out)

The string ensures to use atomic operations if `is_threaded` is true. For a

brief summary, the thread-safety grantees are similar to a `shared_ptr` of a

`std::string` if the flag is true.

# Constexpr Compatibility

The `reference_count` variable is stored as 8 bytes and is bitcasted (no

thread-safety in `constexpr` mode because of obvious reasons). Only the static

string view can be stored as a static constant expression value, but the

string is fully functional in a `constexpr` context. No `reinterpret_cast` is

used (the only time we do use it is for the thread-safe atomic operations

which are not `constexpr` anyways). A `constexpr` friendly allocator is provided

by the implemented library.

# C Interoperability

use the `c_string` wrapper for null terminated strings.

```c++

struct c_string {

//.... another using statement of the real template wrapper,

// null termination is an invariant

private:

 //....

 mjz::string value;

};

```

this has the `c_str` function.

# Features and Implementation

The built-in string viewer and shared substrings: the string is accessed via

the begin and length pair, they provide the minimal functionality of a string

viewer. A substring function may share the underlying data if and only if `!is_ownerized&&is_sharable`.

# Mutable Strings

- the main string:

  The string manages its resources and can be modified using the other part of

  the object. The functions ensure correct COW semantics, and they allocate when

  necessary. Almost all equipment functionality of `std::string` can be

  supported, except the following: the value of `operator[](i)`, `at(i)`,

  `*data()`, `front()`, `back()` cannot be mutated (see the historical COW

  reference for why). `c_str()` does not exist (`as_c_str()` does, but it can

  mutate and it only gives a pointer to `const`). I cannot prove a `const`

  alternative; I do not want pointers to temporaries nor relying on `has_null`

  being true. The "as" means that it modified it to be "as" requested (that's

  what the prefix in the name means). We can also use another name, but the

  function name isn't important in the design (`add_null_c_str`).

  note that the `data()` function will have output a null terminated `const char*` if and only if `has_null()`.

  also , if you called `as_c_str()` without failing, you know that `has_null()` is true.

  

 -for null terminated strings , we recommend the `c_string` wrapper.

  

- for mutatable strings we recommend `ownerized_string` wrapper .

# COW Overhead

- can be turned off for a specific string, or by using the (`ownerized_string`) .

Other than the destruction and construction, which may need a branch to the

non-view path if the optimizer doesn't realize triviality, the string const

overhead is similar to a view, which is minimal. We can easily make a view

out of this, no branching required. Actually, I had my previous string layout

with an SSO size of 31, but the downside was a branch in the const view path. I

discarded that and made all of the view information accessible without

branching. The only time where the overhead is felt is in modification of the

string content. I tried my best to get the modification function as efficient

as possible, but in these worse cases, I can't do much else: the string was

cold, the reference count was cold or has false sharing (contention when

modification trashes the cache line) (heap strings), the data was cold, the

data needed deallocating or reallocation.

But at last, any function that is marked const doesn't even think about the

storage location or strategy nor lifetime; it's as if it was a view all along.

And the ones that are not marked const are the ones who need to know about

other stuff. Also, there are some functions that are not const

(`remove_suffix` or `prefix`, `as_substring`) that only address the SSO part

and treat the other parts as views; these don't even need to know about COW nor

ownership.

# Built-In Stack Buffer Optimization (Advanced Users Only)

- use a safe wrapper of this feature for a better quality of life (`implace_string`) .

By using a stack buffer, you ensure that no allocation occurs as long as the

buffer is big enough. If not, allocation may occur. The users must ensure

that the buffer outlives the string object and the objects that it moved to or

a view that it was converted to, but unlike the Allocators, they don't need to

guarantee outliving a copy of it. Notice that copies are allowed to outlive

the buffer; this is because COW doesn't apply to stack buffers because of

obvious reasons. Also, this is not checked; it's raw performance of a span of

chars, and most users won't ever need such performance (lifetimes are hard;

this is discouraged), but some places (in the internals of my rope

implementation) may need it, so it's there.

# what is the type of the owner? 

### ( standard and custom string compatibility outside of the mjz library)( next experimental release)( another wrapper):

 * this feature is  currently not implemented, but after the implementation,  this should be a safe to use feature. 

 

   a wrapper  of our string type and another string type as a compatibility layer can be made.

   this is how it  works:

   the external string type acts as a storage mechanism, 

   the main string is stored alongside the external one .

   the data buffer of the external string is used with the "stack" buffer feature, 

   the external string only manages the buffer lifetime.

   the main string provides compatible api with the library, while in actuality, 

   the data was stored in another string type all along .

   when the external string is needed, 

   the string is shifted and the length is reduced( if necessary,  because of how we use a subrange of the data , we need to  make the subrange subrange the actual range again),

   then the external string is provided. 

   ( actually,  the `implace_string` is kinda just this , but with a standard array) 

# tunable sso , no code bloat, no big types:

- this will be provided with the name (`implace_string`) .

  ( a safe wrapper of the stack buffer )

  we may provide a safe wrapper ( the string would be a private member ,the buffer would be a private member) class that has a bigger sso buffer ,

  while also reusing all the code of the string , think of it like an implace vector ,

  this wouldn't need lifetime knowledge, so it would be for intermediate users .

  even if they never use the unsafe stack buffer directly.

  this is like the game industry's sso strategy, but with minimal code bloat,

  as a bonus, you can seamlessly pass this around without lifetime issues ( the is_sharable flag disables cow for the private sso buffer, so no dangling references)

  it can be converted to other wrappers of different sizes and to the string itself .

  also , if you remember, from the copy construction section,

  we would de-share automatically for these strings , essentially,

  the template size argument and type incompatibility dissappears (but does its job).

# The Optimizations of Remove Prefix/Suffix, Push Back, Pop Back, Push Front and Pop Front

- note that the null terminator retirement makes this harder for the `c_string`, mainly , sharing substings will break cow more with null terminator requirements.

Because the begin pointer is not limited to the beginning of the buffer, we

can use the view semantics to remove the extra character without memmove.

We also have 3 mods of the first position alignment of the range:

- Central buffer (`begin = buffer + (cap - len) / 2`)

- Front buffer (`begin = buffer`)

- Back buffer (`begin = buffer + cap - len`)

After that, the position may change, but we could append and prepend to the

range without memmove in many cases if we want to.

# false sharing consideration :

  when the `is_threaded` flag is true , we make the   buffer padded with `(std::hardware_destructive_interference_size-sizeof(size_t))=56`bytes at the beginning. ( the address of the  `heap_buffer` member is aligned to 64bits )

  

  the alignment requirement of the allocated data char array also goes up to 64bytes.

  this does increase overhead to about 64bytes per heap string with threading enabled, but the benefit would be that the string data itself is not in the same cache line as the reference count ,

  

 by doing this we ensure that the constant shared data stored in the string , is never affected by the reference count modifications that take place on that string.

 this is especially important when each string is a leaf of the rope , because the rope is often read concurrently and the substring sharing  is very important in the ropes design. 

 sadly because of layout compatability requirements,  the ownerized_string is also going to have the 64byte overhead in the thread-safe case , even tho its not technically necessary, the need for very efficient convention from owerized to main mandates this.

 

 the thread-safety option would be the default,  but considering that almost all the strings are at least max(16,N) (N of tunable sso) bytes when in heap ,

 and that the heap size exponentially grows , in normal standard strings and vectors ,

 i dont see why 64bytes is bad , especially considering the amout of subtle false sharing it reduces in the rope ( which is a crucial factor in the library,  because the rope performance is very important).

 

 

# Small String Optimization 

 * technicality,  tunable sso is the stack buffer optimization , but both of them have the same outcome, so they have the same name in this documentation. 

The 15 bytes of SSO capacity allows us to not allocate anything for small

strings.

# Copy on Write Optimization

- not available in (`ownerized_string`) .

Allows us to share most of the data, even sharing the substrings, reducing

fragmentation, allocations, and improving performance.

if the user suspected that a peice of code had false sharing ( thread contention on reference count) , we recommend the `ownerize()` method,

it should make the string the owner of the data , note that this does mostly nothing if we are the owner .

note that sharing is not applied in the copy constructor or assignment if the destination buffer ia large enough to hold the data , and is an owner , this is because we dont want to deallocate a hot buffer for no reason.

then we could make the strings with the most contention, force ownerized,using , `always_ownerize(true)` so , no one would ever change their reference count to false share,

this effectively kills cow for things that would suffer from it.

# Built-In String View Optimization

- not available in (`ownerized_string`) .

When initializing a string from a literal, no allocation is performed. For

example, in the following case, we do not allocate, but `std::string` does:

```c++

void fn(std::string str);

void mjz_fn(mjz::string str);

fn("im too long too fit in sso ............"s);

mjz_fn("im too long too fit in sso ............"_str);

```

# Unicode Support

While I haven't made that part in the library, we can easily support Unicode

or any other encoding just by using one of the 8 states of encoding flags (if

they were too small, we could use 1 bits ( `unused_for_now_` ) to add

support for 16 separate encodings, but I don't see any reason for supporting

more than 8 encodings at the same time). Strings with different encodings

may not interact; if they do, that's an error and will throw if you allow it.

# Exception Safety

In my library, almost everything is `noexcept`. I mainly wanted everything to be

testable in `constexpr` mode; therefore, I added a custom error encoding for

making the string an error value. But, if anyone wants exceptions, that's easy

to do with a wrapper (or a different class, but I currently really like

`noexcept`, so I won't do that for now).

# Allocators (Advanced Users Only)

While a generic implementation could allow any allocators, because of the

amount of templates in it, I made my library with an optional `constexpr`

friendly pmr-like allocator. The string would be 8 bytes more with it, but

it's beneficial for some contexts. Everything is `noexcept` in its API, and a

failure is a simple `nullptr` return. The Allocator object (memory resource

like) needs to outlive the string object, its copies, and its views,

i have a safe way of providing such allocators (  the Allocators are refrence counted, which is opt-out, but the benefit is that if a allocator is destroyed in debug mode,  the assertion ensure that no polymorphic refrences are alive , if not , the debug will crash on assertion,  this is better than a security vulnerability,  but , the reference counting is opt out if the user suspected overhead with doing a reference counted allocator).

i have both constexpr friendly memory resources , and a standard pmr adaptor if anyone is interested.

# Value Semantics

The string is a value type. In my library, all of the move and copy functions

are implemented and are `noexcept`. There's also a third one for const r-values,

but I won't touch on that implementation detail because this is more of a

nice-to-have thing. But as a summary, move moves everything. Share (const r-value)

shares (no alloc) sharables and copies (alloc) the non-sharables based on `(is_sharable&&!is_ownerized)`. Copy

does a memcpy (no alloc) if an allocation doesn't occur; if not, calls share.

# future versions:

- the string wrapper is available as an experimental feature, the wrapper accepts a template pram for the properties to ensure ( `sso_cap,has_null,is_ownerized`), the stack buffer with the appropriate length will be provided ( if cap is more than sso), the string would be null terminated if required, and the string would be easily mutable if required, the 3 different string wrappers can be made by using a specific template pram, or a combination of them if needed ( mutable and big sso for example),

`ownerized_string` is that wrapper with `is_ownerized=true`, `c_string` is that wrapper with `has_null=true` , and `implace_string` is that wrapper with  `sso_cap=N` , also , this way , the traits can be combined,  for example an owned implace string with size of 1024, note that the size of the `implace_string` object is around 32+N ( for 15
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/mjz86/string

Awesome Lists containing this project

README