https://github.com/ElliotKillick/operating-system-design-review
Operating System Design Review: A systemic analysis of modern systems architecture
- Host: GitHub
- URL: https://github.com/ElliotKillick/operating-system-design-review
- Owner: ElliotKillick
- License: cc-by-sa-4.0
- Created: 2023-12-11T09:39:51.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2025-02-07T01:11:48.000Z (over 1 year ago)
- Last Synced: 2025-02-07T01:54:56.768Z (over 1 year ago)
- Language: HTML
- Homepage: https://elliotonsecurity.com
- Size: 2.31 MB
- Stars: 300
- Watchers: 6
- Forks: 24
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Operating System Design Review
Operating System Design Review is a modern exploration of operating system architecture focusing primarily on user-mode, starting at its origin: [the loader](#defining-loader-and-linker-terminology), and investigating other subsystems from there.
The intentions of this write-up are to:
1. Compare the Windows, Linux, and MacOS user-mode environments
- Providing perspective on architectural and ecosystem differences and how they coincide with the loader and the broader system, then drawing conclusions and creating solutions based on our findings
2. Focus on the [concurrent](#what-is-concurrency-and-parallelism) design and properties of subsystems
- Including formal documentation on how the modern Windows loader functions in contrast to current open source Windows implementations, including Wine and ReactOS (they lack support for the "parallel loading" ability present in a modern Windows loader)
3. Educate, satisfy curiosity, and help fellow reverse engineers
- If you're looking for information on anything in particular, give this document a Ctrl + F or ⌘ + F
All of the information contained here covers Windows 10 22H2 and glibc 2.38 on Linux. In certain cases, facts were also verified on a fully up-to-date release of Windows 11. Some sections of this document additionally touch on MacOS, and now other operating systems, too.
**Author:** Elliot Killick (@ElliotKillick)
## Table of Contents
- [Operating System Design Review](#operating-system-design-review)
- [Table of Contents](#table-of-contents)
- [Parallel Loader Overview](#parallel-loader-overview)
- [High-Level Loader Synchronization](#high-level-loader-synchronization)
- [Windows Loader Module State Transitions Overview](#windows-loader-module-state-transitions-overview)
- [Constructors and Destructors Overview](#constructors-and-destructors-overview)
- [C# and .NET](#c-and-net)
- [The Root of `DllMain` Problems](#the-root-of-dllmain-problems)
- [The Problem with How Windows Uses DLLs](#the-problem-with-how-windows-uses-dlls)
- [Problem Solved?](#problem-solved)
- [Solution #1: API Sets Extension](#solution-1-api-sets-extension)
- [Solution #2: Organize Subsystems](#solution-2-organize-subsystems)
- [Solution #3: Reimplementation](#solution-3-reimplementation)
- [Summary](#summary)
- [Dependency Breakdown](#dependency-breakdown)
- [Further Research on Windows' Usage of DLLs](#further-research-on-windows-usage-of-dlls)
- [The DLL Host](#the-dll-host)
- [DLL Procurement](#dll-procurement)
- [One DLL, One Base Address](#one-dll-one-base-address)
- [DLLs as Data](#dlls-as-data)
- [Library Loading Locations Across Operating Systems](#library-loading-locations-across-operating-systems)
- [`LoadLibrary` vs `dlopen` Return Type](#loadlibrary-vs-dlopen-return-type)
- [Investigating the Idea of MT-Safe Library Initialization](#investigating-the-idea-of-mt-safe-library-initialization)
- [The Problem with How Windows Uses Threads](#the-problem-with-how-windows-uses-threads)
- [Problem Solved](#problem-solved-1)
- [Process Meltdown](#process-meltdown)
- [In-Process Inconsistencies](#in-process-inconsistencies)
- [Process Hanging Open](#process-hanging-open)
- [Crash](#crash)
- [Out-of-Process Inconsistencies](#out-of-process-inconsistencies)
- [Performance Degradation and Resource Inefficiency](#performance-degradation-and-resource-inefficiency)
- [Summary](#summary-1)
- [Further Research on Windows' Usage of Threads](#further-research-on-windows-usage-of-threads)
- [Securable Threads](#securable-threads)
- [Expensive Threads](#expensive-threads)
- [Multithreading is Insecure](#multithreading-is-insecure)
- [DLL Thread Routines Anti-Feature](#dll-thread-routines-anti-feature)
- [Synchronization Requirements](#synchronization-requirements)
- [Flimsy Thread-Local Data](#flimsy-thread-local-data)
- [The PEB Problem](#the-peb-problem)
- [Enforces Dynamic Initialization](#enforces-dynamic-initialization)
- [Adds Process Startup Overhead](#adds-process-startup-overhead)
- [Promotes Centralization](#promotes-centralization)
- [Elevates Backward Compatibility Risk](#elevates-backward-compatibility-risk)
- [Weakens Security](#weakens-security)
- [Accessing the PEB is Slow](#accessing-the-peb-is-slow)
- [Summary](#summary-2)
- [Procedure/Symbol Lookup Comparison (Windows `GetProcAddress` vs POSIX `dlsym` GNU Implementation)](#proceduresymbol-lookup-comparison-windows-getprocaddress-vs-posix-dlsym-gnu-implementation)
- [ELF Flat Symbol Namespace (GNU Namespaces and `STB_GNU_UNIQUE`)](#elf-flat-symbol-namespace-gnu-namespaces-and-stb_gnu_unique)
- [How Does `GetProcAddress`/`dlsym` Handle Concurrent Library Unload?](#how-does-getprocaddressdlsym-handle-concurrent-library-unload)
- [Lazy Linking Synchronization](#lazy-linking-synchronization)
- [Library Lazy Loading and Lazy Linking Overview](#library-lazy-loading-and-lazy-linking-overview)
- [GNU Loader Lock Hierarchy and Synchronization Strategy](#gnu-loader-lock-hierarchy-and-synchronization-strategy)
- [A Concurrency Bug in the Windows Loader!](#a-concurrency-bug-in-the-windows-loader)
- [`GetProcAddress` Can Perform Module Initialization](#getprocaddress-can-perform-module-initialization)
- [Windows Loader Initialization Locking Requirements](#windows-loader-initialization-locking-requirements)
- [Investigating the COM Server Deadlock from `DllMain`](#investigating-the-com-server-deadlock-from-dllmain)
- [On Making COM from `DllMain` Safe](#on-making-com-from-dllmain-safe)
- [Avoiding ABBA Deadlock](#avoiding-abba-deadlock)
- [Other Deadlock Possibilities](#other-deadlock-possibilities)
- [Conclusion](#conclusion)
- [Loader Enclaves](#loader-enclaves)
- [Module Information Data Structures](#module-information-data-structures)
- [Loader Components](#loader-components)
- [Locks](#locks)
- [State](#state)
- [Atomic State](#atomic-state)
- [Component Model Technology Overview](#component-model-technology-overview)
- [Microsoft Component Object Model (COM)](#microsoft-component-object-model-com)
- [Common Object Request Broker Architecture (CORBA)](#common-object-request-broker-architecture-corba)
- [GNU/Linux Component Frameworks and History](#gnulinux-component-frameworks-and-history)
- [MacOS Distributed Objects and NSXPCConnection](#macos-distributed-objects-and-nsxpcconnection)
- [Fun Facts](#fun-facts)
- [COMplications](#complications)
- [Computer History Perspective](#computer-history-perspective)
- [MS-DOS](#ms-dos)
- [Microsoft and UNIX History](#microsoft-and-unix-history)
- [An Alternate Reality](#an-alternate-reality)
- [Graphical User Interface](#graphical-user-interface)
- [Virtual Address Spaces](#virtual-address-spaces)
- [POSIX](#posix)
- [Microsoft Windows Complaints](#microsoft-windows-complaints)
- [The Process Lifetime](#the-process-lifetime)
- [Defining Loader and Linker Terminology](#defining-loader-and-linker-terminology)
- [What is Concurrency and Parallelism?](#what-is-concurrency-and-parallelism)
- [ABBA Deadlock](#abba-deadlock)
- [ABA Problem](#aba-problem)
- [Dining Philosophers Problem](#dining-philosophers-problem)
- [Reverse Engineered Windows Loader Functions](#reverse-engineered-windows-loader-functions)
- [`LdrpDrainWorkQueue`](#ldrpdrainworkqueue)
- [`LdrpDecrementModuleLoadCountEx`](#ldrpdecrementmoduleloadcountex)
- [`LdrpDropLastInProgressCount`](#ldrpdroplastinprogresscount)
- [`LdrpProcessWork`](#ldrpprocesswork)
- [License](#license)
## Parallel Loader Overview
When a library load contains more than one work item (i.e. a library with at least one dependency that is not already loaded into the process), the Windows loader will use its parallel loading ability to speed up library loading. The first work item of a load will always happen in series, on the same thread that called `LoadLibrary`, because the loader must begin to map and snap one library before it can find dependencies that it also needs to map and snap. [To start, see what a trace of one library with no new dependencies looks like.](data/windows/loadlibrary-trace.log)
Put simply, the parallel loader is a layer on top of the regular loader that calls `ntdll!LdrpQueueWork` to offload library loading work to other loader threads:
```
# "call ntdll!LdrpQueueWork" L9999999
ntdll!LdrpSignalModuleMapped+0x54:
00007ffa`56b208e0 e83bebffff call ntdll!LdrpQueueWork (00007ffa`56b1f420)
ntdll!LdrpMapAndSnapDependency+0x20d:
00007ffa`56b27b9d e87e78ffff call ntdll!LdrpQueueWork (00007ffa`56b1f420)
ntdll!LdrpLoadDependentModule+0xd63:
00007ffa`56b28943 e8d86affff call ntdll!LdrpQueueWork (00007ffa`56b1f420)
ntdll!LdrpLoadContextReplaceModule+0x126:
00007ffa`56b718e2 e839dbfaff call ntdll!LdrpQueueWork (00007ffa`56b1f420)
```
Everything else is infrastructure to support this work offloading mechanism.
The `ntdll!LdrpQueueWork` function is how modules are added to the `ntdll!LdrpWorkQueue` linked list data structure. Work processors (i.e. callers of `ntdll!LdrpProcessWork`) such as loader worker threads or the `ntdll!LdrpDrainWorkQueue` function, for instance in a [concurrent `LoadLibrary`](code/windows/loadlibrary-concurrent-work/README.md), access the `ntdll!LdrpWorkQueue` list to pick up a work item. Access to the `ntdll!LdrpWorkQueue` shared data structure is protected by the `ntdll!LdrpWorkQueueLock` critical section lock.
Each list entry in the `ntdll!LdrpWorkQueue` data structure is a `LDRP_LOAD_CONTEXT` structure. This structure is undocumented by Microsoft because its contents are not in the public debug symbols. Each `LDRP_LOAD_CONTEXT` structure relates directly to one module because a module's `LDR_DATA_TABLE_ENTRY` structure is allocated at the same time as its `LDRP_LOAD_CONTEXT` structure in the `LdrpAllocatePlaceHolder` function. In addition, the first member of each `LDRP_LOAD_CONTEXT` structure is a `UNICODE_STRING` of the `BaseDllName` according to the module that it relates to.
Loader worker threads are dedicated threads that are part of a thread pool for parallelizing loader work. These threads can be identified by checking whether the `LoaderWorker` flag is present in the `TEB.SameTebFlags` of a thread.
Only mapping and snapping work can be offloaded for parallelized processing because [module initialization routines must execute in series](#what-is-concurrency-and-parallelism).
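To make the queueing mechanism concrete, here is a minimal portable sketch of a lock-protected work queue. It is a model and not the loader's code: `list_entry`, `work_item`, `queue_work`, and `take_work` are hypothetical names, a POSIX mutex stands in for the `ntdll!LdrpWorkQueueLock` critical section, and first-in, first-out ordering is an assumption.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a circular doubly linked list node. */
struct list_entry {
    struct list_entry *flink; /* next entry */
    struct list_entry *blink; /* previous entry */
};

/* Hypothetical stand-in for one queued load context (one per module). */
struct work_item {
    struct list_entry link;    /* first member, so an entry pointer can be cast back */
    const char *base_dll_name; /* name of the module this work item relates to */
};

/* Stand-ins for LdrpWorkQueue and LdrpWorkQueueLock. */
static struct list_entry work_queue = { &work_queue, &work_queue };
static pthread_mutex_t work_queue_lock = PTHREAD_MUTEX_INITIALIZER;

/* Append a work item to the tail of the queue while holding the lock. */
void queue_work(struct work_item *item)
{
    assert(item != NULL);
    pthread_mutex_lock(&work_queue_lock);
    item->link.flink = &work_queue;
    item->link.blink = work_queue.blink;
    work_queue.blink->flink = &item->link;
    work_queue.blink = &item->link;
    pthread_mutex_unlock(&work_queue_lock);
}

/* Remove and return the oldest work item, or NULL if the queue is empty. */
struct work_item *take_work(void)
{
    struct work_item *item = NULL;

    pthread_mutex_lock(&work_queue_lock);
    /* An empty circular list has a head that points at itself. */
    if (work_queue.flink != &work_queue) {
        struct list_entry *entry = work_queue.flink;
        entry->flink->blink = &work_queue;
        work_queue.flink = entry->flink;
        item = (struct work_item *)entry;
    }
    pthread_mutex_unlock(&work_queue_lock);
    return item;
}
```

In this model, any work processor (a worker thread or the thread draining the queue) would call `take_work` in a loop until it returns `NULL`.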
## High-Level Loader Synchronization
The high-level loader synchronization mechanisms responsible for controlling the loader are the `LdrpLoadCompleteEvent` and `LdrpWorkCompleteEvent` loader events in NTDLL.
When the loader sets the `LdrpLoadCompleteEvent` event, it is signalling the completion of a full library load or unload, or the completion of loader thread initialization. When `LdrpLoadCompleteEvent` is signalled, it directly correlates with `ntdll!LdrpWorkInProgress` equalling zero and the decommissioning of the current thread as the load owner (`LoadOwner` flag in `TEB.SameTebFlags`). Here is a minimal reverse engineering of the `ntdll!LdrpDropLastInProgressCount` function showing this:
```c
NTSTATUS LdrpDropLastInProgressCount()
{
    // Remove thread's load owner flag
    PTEB CurrentTeb = NtCurrentTeb();
    CurrentTeb->SameTebFlags &= ~LoadOwner; // 0x1000

    // Load/unload is now complete
    RtlEnterCriticalSection(&LdrpWorkQueueLock);
    LdrpWorkInProgress = 0;
    RtlLeaveCriticalSection(&LdrpWorkQueueLock);

    // Signal completion of load/unload to any waiting threads
    return NtSetEvent(LdrpLoadCompleteEvent, NULL);
}
```
When the loader sets the `LdrpWorkCompleteEvent` event, it is signalling that the loader has completed the mapping and snapping work on the entire work queue across all of the currently processing loader worker threads. When a loader worker thread starts, it atomically increments `ntdll!LdrpWorkInProgress` (in the `ntdll!LdrpWorkCallback` function) and when a loader worker thread ends, it atomically decrements `ntdll!LdrpWorkInProgress` (at the end of the `ntdll!LdrpProcessWork` function). This means that every increment to the `ntdll!LdrpWorkInProgress` reference counter past `1`, since that is the value `ntdll!LdrpDrainWorkQueue` initially sets `ntdll!LdrpWorkInProgress` to, indicates another loader worker thread processing a work item in parallel. Here is a minimal reverse engineering of where the `ntdll!LdrpProcessWork` function returns showing this:
```c
// Second argument of LdrpProcessWork: isCurrentThreadLoadOwner
// If the current thread is a loader worker (i.e. not a load owner)
if (!isCurrentThreadLoadOwner)
{
    RtlEnterCriticalSection(&LdrpWorkQueueLock);
    // If the work queue is empty AND we are the last loader worker thread processing work
    // There were some double negatives I had to sort out here in the reverse engineering
    BOOL doSetEvent = &LdrpWorkQueue == LdrpWorkQueue.Flink && --LdrpWorkInProgress == 1;
    Status = RtlLeaveCriticalSection(&LdrpWorkQueueLock);
    if (doSetEvent)
        return NtSetEvent(LdrpWorkCompleteEvent, NULL);
}

return Status;
```
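The counting scheme described above can be modelled in a few lines of portable C. This is a sketch under stated assumptions: `owner_begin`, `worker_begin`, and `worker_end` are hypothetical names, a plain integer stands in for the work queue, and the counter is decremented on every worker exit as the prose describes (the sketch makes no claim about what the real function does when the queue is still non-empty).

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER; /* models LdrpWorkQueueLock */
static int work_in_progress; /* models LdrpWorkInProgress */
static int pending_items;    /* models the number of entries left in LdrpWorkQueue */

/* The load owner starts draining: the counter starts at 1 for the owner itself. */
void owner_begin(int queued_items)
{
    pthread_mutex_lock(&queue_lock);
    work_in_progress = 1;
    pending_items = queued_items;
    pthread_mutex_unlock(&queue_lock);
}

/* A worker picks up one item. Returns 1 if it got work, 0 if the queue was empty. */
int worker_begin(void)
{
    int got_work = 0;

    pthread_mutex_lock(&queue_lock);
    if (pending_items > 0) {
        --pending_items;
        ++work_in_progress; /* every value above 1 is one more worker in flight */
        got_work = 1;
    }
    pthread_mutex_unlock(&queue_lock);
    return got_work;
}

/* A worker finishes its item. Returns 1 if this worker is the one that must
 * signal work completion: the queue is empty and only the owner's count remains. */
int worker_end(void)
{
    int do_set_event;

    pthread_mutex_lock(&queue_lock);
    --work_in_progress;
    do_set_event = (pending_items == 0 && work_in_progress == 1);
    pthread_mutex_unlock(&queue_lock);
    return do_set_event;
}
```

With two queued items and two workers, only the second call to `worker_end` reports that the completion event should be set.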
Here are all the loader's usages of `LdrpLoadCompleteEvent` and `LdrpWorkCompleteEvent`:
```
0:000> # "ntdll!LdrpLoadCompleteEvent" L9999999
ntdll!LdrpDropLastInProgressCount+0x38:
00007ffd`2896d9c4 488b0db5e91000 mov rcx,qword ptr [ntdll!LdrpLoadCompleteEvent (00007ffd`28a7c380)]
ntdll!LdrpDrainWorkQueue+0x2d:
00007ffd`2896ea01 4c0f443577d91000 cmove r14,qword ptr [ntdll!LdrpLoadCompleteEvent (00007ffd`28a7c380)]
ntdll!LdrpCreateLoaderEvents+0x12:
00007ffd`2898e182 488d0df7e10e00 lea rcx,[ntdll!LdrpLoadCompleteEvent (00007ffd`28a7c380)]
```
```
0:000> # "ntdll!LdrpWorkCompleteEvent" L9999999
ntdll!LdrpDrainWorkQueue+0x18:
00007ffd`2896e9ec 4c8b35bdd91000 mov r14,qword ptr [ntdll!LdrpWorkCompleteEvent (00007ffd`28a7c3b0)]
ntdll!LdrpProcessWork+0x1e4:
00007ffd`2896ede0 488b0dc9d51000 mov rcx,qword ptr [ntdll!LdrpWorkCompleteEvent (00007ffd`28a7c3b0)]
ntdll!LdrpCreateLoaderEvents+0x35:
00007ffd`2898e1a5 488d0d04e20e00 lea rcx,[ntdll!LdrpWorkCompleteEvent (00007ffd`28a7c3b0)]
ntdll!LdrpProcessWork$fin$0+0x7c:
00007ffd`289b5ad7 488b0dd2680c00 mov rcx,qword ptr [ntdll!LdrpWorkCompleteEvent (00007ffd`28a7c3b0)]
```
The `ntdll!LdrpCreateLoaderEvents` function creates both events. Only the `ntdll!LdrpDrainWorkQueue` function can wait (calling `ntdll!NtWaitForSingleObject`) on the `LdrpLoadCompleteEvent` or `LdrpWorkCompleteEvent` loader events. Only the `ntdll!LdrpDropLastInProgressCount` function sets `LdrpLoadCompleteEvent`. Only the `ntdll!LdrpProcessWork` function sets `LdrpWorkCompleteEvent`.
At event creation (`ntdll!NtCreateEvent`), `LdrpLoadCompleteEvent` and `LdrpWorkCompleteEvent` are configured to be [auto-reset events](https://learn.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-resetevent#:~:text=Auto%2Dreset%20event%20objects%20automatically%20change%20from%20signaled%20to%20nonsignaled%20after%20a%20single%20waiting%20thread%20is%20released.).
The loader never manually resets the `LdrpLoadCompleteEvent` and `LdrpWorkCompleteEvent` events (with `ntdll!NtResetEvent`).
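For readers unfamiliar with auto-reset semantics, the following portable sketch models them with a POSIX mutex and condition variable. It is an illustration and not how NT events are implemented: the `auto_reset_event` type and the `event_set`, `event_wait`, and `event_try_wait` functions are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical model of an auto-reset event. */
struct auto_reset_event {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    bool signaled;
};

#define AUTO_RESET_EVENT_INIT \
    { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false }

/* Signal the event. Setting an already signaled event changes nothing. */
void event_set(struct auto_reset_event *ev)
{
    pthread_mutex_lock(&ev->lock);
    ev->signaled = true;
    pthread_cond_signal(&ev->cond); /* wake a waiter, if there is one */
    pthread_mutex_unlock(&ev->lock);
}

/* Block until the event is signaled, then consume the signal. */
void event_wait(struct auto_reset_event *ev)
{
    pthread_mutex_lock(&ev->lock);
    while (!ev->signaled)
        pthread_cond_wait(&ev->cond, &ev->lock);
    ev->signaled = false; /* auto-reset: only one waiter gets through per set */
    pthread_mutex_unlock(&ev->lock);
}

/* Non-blocking variant: returns true and consumes the signal if it was set. */
bool event_try_wait(struct auto_reset_event *ev)
{
    bool was_signaled;

    pthread_mutex_lock(&ev->lock);
    was_signaled = ev->signaled;
    ev->signaled = false;
    pthread_mutex_unlock(&ev->lock);
    return was_signaled;
}
```

Because a successful wait consumes the signal, no separate reset call is needed in this model, which is consistent with the loader never calling `ntdll!NtResetEvent` on these events.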
The `ntdll!LdrpDrainWorkQueue` function takes one argument. This argument is a boolean indicating whether the function should wait on the `LdrpLoadCompleteEvent` or the `LdrpWorkCompleteEvent` loader event before draining the work queue. **Please see my [reverse engineering of the `ntdll!LdrpDrainWorkQueue` function](#ldrpdrainworkqueue).**
What follows documents the parts of the loader that call `ntdll!LdrpDrainWorkQueue` (data gathered by searching disassembly for calls to the `ntdll!LdrpDrainWorkQueue` function) as either a load owner or a load worker:
```
ntdll!LdrUnloadDll+0x80: OWNER
ntdll!RtlQueryInformationActivationContext+0x43c: OWNER
ntdll!LdrShutdownThread+0x98: OWNER
ntdll!LdrpInitializeThread+0x86: OWNER
ntdll!LdrpLoadDllInternal+0xbe: OWNER
ntdll!LdrpLoadDllInternal+0x144: WORKER
ntdll!LdrpLoadDllInternal$fin$0+0x38: WORKER
ntdll!LdrGetProcedureAddressForCaller+0x270: OWNER
ntdll!LdrEnumerateLoadedModules+0xa7: OWNER
ntdll!RtlExitUserProcess+0x23: OWNER or WORKER
- Depends on `TEB.SameTebFlags`, typically `OWNER` if `LoadOwner` or `LoaderWorker` flags are absent, `WORKER` if either of these flags is present
ntdll!RtlPrepareForProcessCloning+0x23: OWNER
ntdll!LdrpFindLoadedDll+0x9127a: OWNER
ntdll!LdrpFastpthReloadedDll+0x9033a: OWNER
ntdll!LdrpInitializeImportRedirection+0x46d44: OWNER
ntdll!LdrInitShimEngineDynamic+0x3c: OWNER
ntdll!LdrpInitializeProcess+0x130a: OWNER
ntdll!LdrpInitializeProcess+0x1d0d: OWNER
ntdll!LdrpInitializeProcess+0x1e22: WORKER
ntdll!LdrpInitializeProcess+0x1f33: OWNER
ntdll!RtlCloneUserProcess+0x71: OWNER
```
Calls to the `ntdll!LdrpDrainWorkQueue` function do not always result in synchronizing on the relevant loader event.
Notably, there are many more instances of the loader potentially synchronizing on the entire load's completion rather than just the completion of mapping and snapping work. For example, thread initialization (`ntdll!LdrpInitializeThread`) always synchronizes on the `LdrpLoadCompleteEvent` loader event. The only parts of the loader that may synchronize on `LdrpWorkCompleteEvent` are `ntdll!LdrpLoadDllInternal`, `ntdll!LdrpInitializeProcess`, and `ntdll!RtlExitUserProcess`.
Here are the places where the loader completes all loader work (`ntdll!LdrpDropLastInProgressCount` function), which is where the `LdrpLoadCompleteEvent` is set. Many of these are edge cases; the most common are the invocations by `ntdll!LdrpLoadDllInternal` and by loader thread initialization/deinitialization in the `ntdll!LdrpInitializeThread` and `ntdll!LdrShutdownThread` functions:
```
0:000> # "call ntdll!LdrpDropLastInProgressCount" L9999999
ntdll!LdrUnloadDll+0x99:
00007ffa`56b1fc89 e8eef10400 call ntdll!LdrpDropLastInProgressCount (00007ffa`56b6ee7c)
ntdll!RtlQueryInformationActivationContext+0x463:
00007ffa`56b23243 e834bc0400 call ntdll!LdrpDropLastInProgressCount (00007ffa`56b6ee7c)
ntdll!LdrShutdownThread+0x20b:
00007ffa`56b2765b e81c780400 call ntdll!LdrpDropLastInProgressCount (00007ffa`56b6ee7c)
ntdll!LdrpInitializeThread+0x218:
00007ffa`56b27950 e827750400 call ntdll!LdrpDropLastInProgressCount (00007ffa`56b6ee7c)
ntdll!LdrpLoadDllInternal+0x24b:
00007ffa`56b2fc5f e818f20300 call ntdll!LdrpDropLastInProgressCount (00007ffa`56b6ee7c)
ntdll!LdrGetProcedureAddressForCaller+0x275:
00007ffa`56b40035 e842ee0200 call ntdll!LdrpDropLastInProgressCount (00007ffa`56b6ee7c)
ntdll!LdrEnumerateLoadedModules+0xae:
00007ffa`56b6ee6e e809000000 call ntdll!LdrpDropLastInProgressCount (00007ffa`56b6ee7c)
ntdll!LdrShutdownThread$fin$2+0x1e:
00007ffa`56bb4f95 e8e29efbff call ntdll!LdrpDropLastInProgressCount (00007ffa`56b6ee7c)
ntdll!LdrpInitializeThread$fin$2+0x15:
00007ffa`56bb4ff4 e8839efbff call ntdll!LdrpDropLastInProgressCount (00007ffa`56b6ee7c)
ntdll!LdrpLoadDllInternal$fin$0+0x47:
00007ffa`56bb526e e8099cfbff call ntdll!LdrpDropLastInProgressCount (00007ffa`56b6ee7c)
ntdll!LdrEnumerateLoadedModules$fin$0+0x1b:
00007ffa`56bb5ee9 e88e8ffbff call ntdll!LdrpDropLastInProgressCount (00007ffa`56b6ee7c)
ntdll!LdrpFindLoadedDll+0x917ae:
00007ffa`56bbf2ce e8a9fbfaff call ntdll!LdrpDropLastInProgressCount (00007ffa`56b6ee7c)
ntdll!LdrpFastpthReloadedDll+0x90862:
00007ffa`56bc04e2 e895e9faff call ntdll!LdrpDropLastInProgressCount (00007ffa`56b6ee7c)
ntdll!LdrpInitializeImportRedirection+0x464cf:
00007ffa`56bd89b3 e8c464f9ff call ntdll!LdrpDropLastInProgressCount (00007ffa`56b6ee7c)
ntdll!LdrInitShimEngineDynamic+0xe8:
00007ffa`56be0528 e84fe9f8ff call ntdll!LdrpDropLastInProgressCount (00007ffa`56b6ee7c)
ntdll!LdrpInitializeProcess+0x183c:
00007ffa`56be358c e8ebb8f8ff call ntdll!LdrpDropLastInProgressCount (00007ffa`56b6ee7c)
ntdll!LdrpInitializeProcess+0x1eda:
00007ffa`56be3c2a e84db2f8ff call ntdll!LdrpDropLastInProgressCount (00007ffa`56b6ee7c)
ntdll!LdrpInitializeProcess+0x1f8e:
00007ffa`56be3cde e899b1f8ff call ntdll!LdrpDropLastInProgressCount (00007ffa`56b6ee7c)
```
Here are the few places where the loader processes mapping and snapping work (`ntdll!LdrpProcessWork` function), which is where the `LdrpWorkCompleteEvent` is set:
```
0:000> # "call ntdll!LdrpProcessWork" L9999999
ntdll!LdrpLoadDependentModule+0x184c:
00007ffa`56b2942c e8bb6c0400 call ntdll!LdrpProcessWork (00007ffa`56b700ec)
ntdll!LdrpLoadDllInternal+0x13a:
00007ffa`56b2fb4e e899050400 call ntdll!LdrpProcessWork (00007ffa`56b700ec)
ntdll!LdrpDrainWorkQueue+0x17f:
00007ffa`56b70043 e8a4000000 call ntdll!LdrpProcessWork (00007ffa`56b700ec)
ntdll!LdrpWorkCallback+0x6e:
00007ffa`56b700ce e819000000 call ntdll!LdrpProcessWork (00007ffa`56b700ec)
```
## Windows Loader Module State Transitions Overview
`LDR_DDAG_NODE.State` or `LDR_DDAG_STATE` tracks a module's **entire lifetime** from beginning to end. With this analysis, I intend to extrapolate information based on the [known types given to us by Microsoft](https://www.geoffchappell.com/studies/windows/km/ntoskrnl/inc/api/ntldr/ldr_ddag_state.htm) (`dt _LDR_DDAG_STATE` command in WinDbg).
Each state represents a stage of loader work on a module. This table comprehensively documents where these state changes occur throughout the loader and which locks are present.
A typical library load ranges from `LdrModulesPlaceHolder` to `LdrModulesReadyToRun` (may also include `LdrModulesMerged`), and a typical library unload ranges from `LdrModulesUnloading` to `LdrModulesUnloaded`.

| `LDR_DDAG_STATE` States | State Changing Function(s) | Remarks |
| --- | --- | --- |
| `LdrModulesMerged` (-5) | `LdrpMergeNodes` | `LdrpModuleDatatableLock` is held during this state change. See `LdrModulesCondensed` state for more information. |
| `LdrModulesInitError` (-4) | `LdrpInitializeGraphRecurse` | During `DLL_PROCESS_ATTACH`, if a module's `DllMain` returns `FALSE` for failure then this module state is set (any other return value counts as success). `LdrpLoaderLock` is held here. |
| `LdrModulesSnapError` (-3) | `LdrpCondenseGraphRecurse` | This function may set this state on a module if a snap error occurs. See the `LdrModulesCondensed` state for more information. |
| `LdrModulesUnloaded` (-2) | `LdrpUnloadNode` | Before setting this state, `LdrpUnloadNode` may walk `LDR_DDAG_NODE.Dependencies`, holding `LdrpModuleDataTableLock` to call `LdrpDecrementNodeLoadCountLockHeld`, thus decrementing the `LDR_DDAG_NODE.LoadCount` of dependencies and recursively calling `LdrpUnloadNode` to unload dependencies. Loader lock (`LdrpLoaderLock`) is held here. |
| `LdrModulesUnloading` (-1) | `LdrpUnloadNode` | Set near the start of this function. This function checks for `LdrModulesInitError`, `LdrModulesReadyToInit`, and `LdrModulesReadyToRun` states before setting this new state. After setting state, this function calls `LdrpProcessDetachNode`. Loader lock (`LdrpLoaderLock`) is held here. |
| `LdrModulesPlaceHolder` (0) | `LdrpAllocateModuleEntry` | The loader directly calls `LdrpAllocateModuleEntry` until parallel loader initialization (`LdrpInitParallelLoadingSupport`) occurs at process startup. From that point on (with the exception of directly calling `LdrpAllocateModuleEntry` once more soon after parallel loader initialization to allocate a module entry for the EXE), the loader calls `LdrpAllocatePlaceHolder` (this function first allocates a `LDRP_LOAD_CONTEXT` structure), which calls through to `LdrpAllocateModuleEntry` (this function places a pointer to this module's `LDRP_LOAD_CONTEXT` structure at `LDR_DATA_TABLE_ENTRY.LoadContext`). The `LdrpAllocateModuleEntry` function, along with creating the module's `LDR_DATA_TABLE_ENTRY` structure, allocates its `LDR_DDAG_NODE` structure with zero-initialized heap memory. The module's data structures have been allocated with basic initialization. |
| `LdrModulesMapping` (1) | `LdrpMapCleanModuleView` | I've never seen this function get called; the state typically jumps from 0 to 2. Only the `LdrpGetImportDescriptorForSnap` function may call this function which itself may only be called by `LdrpMapAndSnapDependency` (according to a disassembly search). `LdrpMapAndSnapDependency` typically calls `LdrpGetImportDescriptorForSnap`; however, `LdrpGetImportDescriptorForSnap` doesn't typically call `LdrpMapCleanModuleView`. This state is set before mapping a memory section (`NtMapViewOfSection`). Mapping is the process of loading a file from disk into memory. |
| `LdrModulesMapped` (2) | `LdrpProcessMappedModule` | `LdrpModuleDatatableLock` is held during this state change. Mapping is complete. |
| `LdrModulesWaitingForDependencies` (3) | `LdrpLoadDependentModule` | This state isn't typically set, but during a trace, I was able to observe the loader set it by launching a web browser (Google Chrome) under WinDbg, which triggered the watchpoint in this function when loading app compatibility DLL `C:\Windows\System32\ACLayers.dll`. Interestingly, the `LDR_DDAG_STATE` decreases by one here from `LdrModulesSnapping` to `LdrModulesWaitingForDependencies`; the only time I've observed this. `LdrpModuleDatatableLock` is held during this state change. |
| `LdrModulesSnapping` (4) | `LdrpSignalModuleMapped` or `LdrpMapAndSnapDependency` | In the `LdrpMapAndSnapDependency` case, a jump from `LdrModulesMapped` to `LdrModulesSnapping` may happen. `LdrpModuleDatatableLock` is held during state change in `LdrpSignalModuleMapped`, but not in `LdrpMapAndSnapDependency`. Snapping is the process of resolving the library’s import address table (module imports and exports) to addresses in memory. |
| `LdrModulesSnapped` (5) | `LdrpSnapModule` or `LdrpMapAndSnapDependency` | In the `LdrpMapAndSnapDependency` case, a jump from `LdrModulesMapped` to `LdrModulesSnapped` may happen, which indicates the loader doesn't always bother recording the in-between `LdrModulesSnapping` state transition. `LdrpModuleDatatableLock` isn't held here in either case. Snapping is complete. |
| `LdrModulesCondensed` (6) | `LdrpCondenseGraphRecurse` | This function receives a `LDR_DDAG_NODE` as its first argument and recursively calls itself to walk `LDR_DDAG_NODE.Dependencies`. On every recursion, this function checks whether it can remove the passed `LDR_DDAG_NODE` from the graph. If so, this function acquires `LdrpModuleDataTableLock` to call the `LdrpMergeNodes` function, which receives the same first argument, then releases `LdrpModuleDataTableLock` after it returns. `LdrpMergeNodes` discards the unneeded node from the `LDR_DDAG_NODE.Dependencies` and `LDR_DDAG_NODE.IncomingDependencies` DAG adjacency lists of any modules starting from the given parent node (first function argument), decrements `LDR_DDAG_NODE.LoadCount` to zero, and calls `RtlFreeHeap` to deallocate `LDR_DDAG_NODE` DAG nodes. After `LdrpMergeNodes` returns, `LdrpCondenseGraphRecurse` calls `LdrpDestroyNode` to deallocate any DAG nodes in the `LDR_DDAG_NODE.ServiceTagList` list of the parent `LDR_DDAG_NODE`, then deallocates the parent `LDR_DDAG_NODE` itself. `LdrpCondenseGraphRecurse` sets the state to `LdrModulesCondensed` before returning. Note: The `LdrpCondenseGraphRecurse` function and its callees rely heavily on all members of the `LDR_DDAG_NODE` structure, which needs further reverse engineering to fully understand the inner workings and "whys" of what's occurring here. Condensing is the process of discarding unnecessary nodes from the dependency graph. |
| `LdrModulesReadyToInit` (7) | `LdrpNotifyLoadOfGraph` | This state is set immediately before this function calls `LdrpSendPostSnapNotifications` to run post-snap DLL notification callbacks. As the loader initializes nodes (i.e. modules) in the dependency graph (while loader lock is held), each node's state will transition to `LdrModulesInitializing` then `LdrModulesReadyToRun` (or `LdrModulesInitError` if initialization fails). The module is mapped and snapped but pending initialization (which includes any form of running code from the module). |
| `LdrModulesInitializing` (8) | `LdrpInitializeNode` | Set at the start of this function, immediately before linking a module into the `InInitializationOrderModuleList` list. After linking the module into the initialization order list, the loader calls the module's `LDR_DATA_TABLE_ENTRY.EntryPoint`. Loader lock (`LdrpLoaderLock`) is held here. Initializing is the process of running a module's initialization routines (i.e. module initializer including Windows `DllMain`). |
| `LdrModulesReadyToRun` (9) | `LdrpInitializeNode` | Set at the end of this function, before it returns. Loader lock (`LdrpLoaderLock`) is held here. The module is ready for use. |

Findings were gathered by [tracing all `LDR_DDAG_NODE.State` values at load-time](analysis-commands.md#ldr_ddag_node-analysis) and tracing a library unload, as well as searching disassembly. See what a [`LDR_DDAG_STATE` trace log](data/windows/load-all-modules-ldr-ddag-node-state-trace.txt) looks like ([be aware of the warnings](analysis-commands.md#ldr_ddag_node-analysis)).
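For quick reference when reading traces, the states and values from the table above can be written as a C enumeration. The enumerator names and values come from the table; the helper functions `ddag_state_name`, `ddag_state_is_error`, and `ddag_state_is_loading` are hypothetical conveniences added for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* State names and values as listed in the table above. */
typedef enum {
    LdrModulesMerged                 = -5,
    LdrModulesInitError              = -4,
    LdrModulesSnapError              = -3,
    LdrModulesUnloaded               = -2,
    LdrModulesUnloading              = -1,
    LdrModulesPlaceHolder            = 0,
    LdrModulesMapping                = 1,
    LdrModulesMapped                 = 2,
    LdrModulesWaitingForDependencies = 3,
    LdrModulesSnapping               = 4,
    LdrModulesSnapped                = 5,
    LdrModulesCondensed              = 6,
    LdrModulesReadyToInit            = 7,
    LdrModulesInitializing           = 8,
    LdrModulesReadyToRun             = 9
} LDR_DDAG_STATE;

/* Hypothetical helper: true for the two error states. */
bool ddag_state_is_error(LDR_DDAG_STATE s)
{
    return s == LdrModulesInitError || s == LdrModulesSnapError;
}

/* Hypothetical helper: true from placeholder allocation until just before ready-to-run. */
bool ddag_state_is_loading(LDR_DDAG_STATE s)
{
    return s >= LdrModulesPlaceHolder && s < LdrModulesReadyToRun;
}

/* Hypothetical helper: map a numeric state from a trace log back to its name. */
const char *ddag_state_name(LDR_DDAG_STATE s)
{
#define STATE_CASE(x) case x: return #x
    switch (s) {
    STATE_CASE(LdrModulesMerged);
    STATE_CASE(LdrModulesInitError);
    STATE_CASE(LdrModulesSnapError);
    STATE_CASE(LdrModulesUnloaded);
    STATE_CASE(LdrModulesUnloading);
    STATE_CASE(LdrModulesPlaceHolder);
    STATE_CASE(LdrModulesMapping);
    STATE_CASE(LdrModulesMapped);
    STATE_CASE(LdrModulesWaitingForDependencies);
    STATE_CASE(LdrModulesSnapping);
    STATE_CASE(LdrModulesSnapped);
    STATE_CASE(LdrModulesCondensed);
    STATE_CASE(LdrModulesReadyToInit);
    STATE_CASE(LdrModulesInitializing);
    STATE_CASE(LdrModulesReadyToRun);
    }
#undef STATE_CASE
    return "unknown";
}
```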
## Constructors and Destructors Overview
Constructors and destructors exist to facilitate dynamic initialization. Dynamic initialization is custom code that runs before accessing a resource. In the module scope, this code executes before the `main()` function or when a module is loaded.
Module constructors and destructors are the operating system and language agnostic terms for describing this feature. On Unix, these may be referred to as initialization and finalization or termination routines/functions. In Windows DLLs, the functionally equivalent idea exists as `DLL_PROCESS_ATTACH` and `DLL_PROCESS_DETACH` calls to the `DllMain` function. Initialization and deinitialization/uninitialization routines or simply initializer and finalizer is also common terminology.
In addition to module load and unload, the Windows loader may call each module's `DllMain` at `DLL_THREAD_ATTACH` and `DLL_THREAD_DETACH` times. The Windows loader only calls these routines at thread start and exit. Windows doesn't run the `DLL_THREAD_ATTACH` of a DLL following `DLL_PROCESS_ATTACH`. Additionally, a [DLL loaded after thread start won't preempt that thread to run its `DllMain` with `DLL_THREAD_ATTACH`](https://learn.microsoft.com/en-us/windows/win32/dlls/dllmain#parameters). These calls can be disabled per-library as a performance optimization by calling `DisableThreadLibraryCalls` at `DLL_PROCESS_ATTACH` time.
Compilers commonly provide access to module initialization/deinitialization functions through compiler-specific syntax. In GCC or Clang, a programmer can create module constructors/destructors using the `__attribute__((constructor))` and `__attribute__((destructor))` function attributes or the [`_init` and `_fini` functions, historically](https://man7.org/linux/man-pages/man3/dlopen.3.html#NOTES). Modern GCC or Clang module constructors and destructors support specifying a priority like `__attribute__((constructor(101)))` or `__attribute__((destructor(101)))` (priorities of 100 and below are reserved for use by the operating system) in case a particular execution order is desired.
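As a small illustration of the GCC/Clang syntax, the sketch below defines two prioritized module constructors and one destructor. The function and variable names are hypothetical; the ordering relies on the documented behavior that a constructor with a smaller priority number runs before one with a larger number.

```c
#include <assert.h>
#include <stdio.h>

/* Records the order in which the constructors ran (zero-initialized statically). */
static int init_order[2];
static int init_count;

/* Runs first: 101 is the lowest priority not reserved for the implementation. */
__attribute__((constructor(101)))
static void init_first(void)
{
    init_order[init_count++] = 101;
}

/* Runs second because its priority number is larger. */
__attribute__((constructor(102)))
static void init_second(void)
{
    init_order[init_count++] = 102;
}

/* Runs at normal process exit or when the module is unloaded. */
__attribute__((destructor))
static void module_fini(void)
{
    fputs("module destructor ran\n", stderr);
}
```

By the time `main()` starts, both constructors have already run, so `init_order` holds `101` followed by `102`.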
In C++, the constructor of an object is invoked whenever an instance of a class is created. Creating an instance of a class returns an object pointing to that instance. If an object is created in the global scope (C++ terminology) or the module scope (OS terminology) then its constructor is called during program or library initialization ([code example](code/windows/dll-init-order-test/dll-test.cpp)). If an object is created in a local scope like in a function, its constructor is called when program execution creates that object in the function. A constructor or class itself is neither inherently global nor local, it entirely depends on what context the object is created in.
Common use cases for dynamic initialization can include: [Communication with](https://github.com/reactos/reactos/blob/f10d40f9122b926bf01b5409a6d3c3d9d06806c3/dll/win32/kernel32/client/dllmain.c#L138) [another process](https://github.com/reactos/reactos/blob/3ecd2363a6d045a38aa68a1b5f17bb53ffaad3e4/win32ss/user/user32/misc/dllmain.c#L510) (for instance, the Windows API relies on a system-wide [`csrss.exe`](https://en.wikipedia.org/wiki/Client/Server_Runtime_Subsystem) [server](https://en.wikipedia.org/wiki/Client%E2%80%93server_model), which requires dynamic initialization on the side of the client), [creating an inter-process synchronization mechanism](https://learn.microsoft.com/en-us/windows/win32/sync/interprocess-synchronization) (Windows commonly uses inter-process event synchronization objects even when predominantly or only intra-process synchronization is or should be required), or initializing an implementation-dependent data structure such as a [critical section](https://learn.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-initializecriticalsection) (rather than storing the internal [POD](https://stackoverflow.com/a/146464) directly in your module, which would necessitate that ABI remain backward compatible forever or be versioned). The apparent reason for Microsoft not providing a method for statically initializing a Windows critical section is that developers [butchered the original POD definition](https://devblogs.microsoft.com/oldnewthing/20160826-00/?p=94185#:~:text=CRITICAL_SECTION) and when they wanted to go back and change it to something more sensible (i.e. 
[simply initializing to all zeros by default like GNU does](https://elixir.bootlin.com/glibc/glibc-2.38/source/nptl/pthread_mutex_init.c#L142-L147)), they couldn't without breaking [bug compatibility](https://en.wikipedia.org/wiki/Bug_compatibility) (there is also the kernel view on needing to keep track of mutex objects in its memory, which is obsolete ever since registrationless [futex](https://en.wikipedia.org/wiki/Futex) and futex-like mechanisms became a thing). In the common case where default mutex attributes are appropriate, [POSIX mutexes can be statically initialized with the `PTHREAD_MUTEX_INITIALIZER` macro](https://pubs.opengroup.org/onlinepubs/7908799/xsh/pthread_mutex_init.html); otherwise or when creating a mutex in dynamically allocated memory, dynamic initialization with the `pthread_mutex_init` function is necessary. A POSIX mutex is equivalent to a Windows critical section, whereas a Windows mutex object differs due to being an inter-process synchronization mechanism. Other dynamic initialization operations could include: Setting up a thread pool, background thread, or event loop to prepare early for concurrent operations. Reading configuration data from some persistent data source (e.g. an environment variable, a file, or a registry key). Tracing or logging events or setting it up. Controlling resource lifecycle (at initialization and destruction time). In addition, various domain-specific initialization and registration tasks. Generally, constructors effectively address [cross-cutting concerns](https://en.wikipedia.org/wiki/Cross-cutting_concern#Examples) in initialization (especially in the module scope when functionality is split across many tightly coupled libraries like in Windows).
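The difference between static and dynamic mutex initialization can be sketched as follows (assuming a POSIX system; the function and variable names are hypothetical and only the `pthread_*` calls and `PTHREAD_MUTEX_INITIALIZER` come from POSIX):

```cpp
#include <cassert>
#include <pthread.h>

// Static initialization: the mutex is constant data in the module image, so
// no module constructor has to run before the lock is usable.
static pthread_mutex_t g_static_lock = PTHREAD_MUTEX_INITIALIZER;
static int g_counter = 0;

int increment_shared_counter() {
    int rc = pthread_mutex_lock(&g_static_lock);
    assert(rc == 0);
    (void)rc;
    int value = ++g_counter;
    pthread_mutex_unlock(&g_static_lock);
    return value;
}

// Dynamic initialization: required for a mutex living in dynamically
// allocated memory. Returns 0 on success or a pthread error number.
int heap_mutex_roundtrip() {
    pthread_mutex_t *lock = new pthread_mutex_t;
    int rc = pthread_mutex_init(lock, nullptr);  // nullptr selects default attributes
    if (rc == 0) {
        rc = pthread_mutex_lock(lock);
        if (rc == 0) rc = pthread_mutex_unlock(lock);
        pthread_mutex_destroy(lock);
    }
    delete lock;
    return rc;
}
```

The statically initialized lock needs no setup code at all, which is the property a Windows critical section lacks since it always goes through `InitializeCriticalSection`.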
Due to the useful position of constructors and destructors when run in the global scope, they may sometimes be used outside of dynamic initialization like for auditing or hooking purposes. The GNU loader specially provides the `LD_AUDIT` and `LD_PRELOAD` mechanisms for these purposes (with the latter having broad support across Unix-like systems). The GNU loader calls into an `LD_AUDIT` library at library load/unload and symbol resolution times to allow for hooking or monitoring. `LD_PRELOAD` allows easily hooking global scope symbol resolution. Windows allows for DLL notification registration, offering similar functionality, though it is more limited. The always and early execution style of constructors when in the module scope also makes them an [attractive target for attackers](https://man.openbsd.org/dlopen.3#CAVEATS).
Constructors and destructors originate from object-oriented programming (OOP), a programming paradigm first introduced by the [Simula 67](https://en.wikipedia.org/wiki/Simula) language in 1967. C++, a modern object-oriented language, was originally designed in the early 1980s as an extension of C and received initial standardization in 1998. Constructors and destructors do not exist in the C standard. On Unix systems, the concept of code that runs when a module loads and unloads goes back to the 1990 [System V Application Binary Interface Version 4](https://www.bitsavers.org/pdf/att/unix/System_V_Release_4/0-13-933706-7_Unix_System_V_Rel4_Programmers_Guide_ANSI_C_and_Programming_Support_Tools_1990.pdf) (`DT_INIT` and `DT_FINI` dynamic array tags, as well as `.init` and `.fini` special section names).
In the ELF executable format, module constructors and destructors are standardized by the System V ABI to be in the `.init` and `.fini` sections. Modern systems use the non-standard but common and generally agreed-upon [`.init_array`/`.fini_array` sections](https://maskray.me/blog/2021-11-07-init-ctors-init-array), or before that the deprecated `.ctors`/`.dtors` sections. Modern GCC-built binaries only include `.init_array`/`.fini_array` and `.init`/`.fini` sections; they don't include the `.ctors`/`.dtors` sections (verified with `objdump -h` and `readelf --sections`). Individually exposing each routine in an array within the ELF file grants more control over initialization and finalization routine execution to a Unix-like loader over calling an opaque function for handling all initialization/finalization. This control and transparency lends itself to a pluggable interface that is useful in concepts such as constructor and destructor priority control ([the glibc loader does not use this per-routine knowledge to compensate for circular dependencies during module initialization](code/glibc/dlopen-init-interruption/README.md)). [A Unix-like loader loops through these routines contained in the ELF file.](https://elixir.bootlin.com/glibc/glibc-2.38/source/elf/dl-init.c#L58-L71)
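The per-routine loop can be modeled in a few lines. This is a simplified sketch, not glibc's actual code: the `Module` structure and every function name here are hypothetical stand-ins for the data a loader reads from an ELF file, and the ordering (legacy `.init` first, then `.init_array` in order; `.fini_array` in reverse, then legacy `.fini`) reflects my understanding of the glibc behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Routine = void (*)();

// Hypothetical stand-in for the per-module information a loader extracts.
struct Module {
    Routine legacy_init;        // models DT_INIT (.init); may be null
    const Routine *init_array;  // models DT_INIT_ARRAY (.init_array)
    std::size_t init_count;
    const Routine *fini_array;  // models DT_FINI_ARRAY (.fini_array)
    std::size_t fini_count;
    Routine legacy_fini;        // models DT_FINI (.fini); may be null
};

void run_initializers(const Module &m) {
    assert(m.init_array != nullptr || m.init_count == 0);
    if (m.legacy_init) m.legacy_init();
    for (std::size_t i = 0; i < m.init_count; ++i) m.init_array[i]();
}

void run_finalizers(const Module &m) {
    assert(m.fini_array != nullptr || m.fini_count == 0);
    for (std::size_t i = m.fini_count; i-- > 0;) m.fini_array[i]();  // reverse order
    if (m.legacy_fini) m.legacy_fini();
}

// Demo routines that log an identifier so the call order is observable.
static std::vector<int> g_trace;
static void legacy_init() { g_trace.push_back(10); }
static void init_a() { g_trace.push_back(11); }
static void init_b() { g_trace.push_back(12); }
static void fini_a() { g_trace.push_back(21); }
static void fini_b() { g_trace.push_back(22); }
static void legacy_fini() { g_trace.push_back(20); }

std::vector<int> run_demo() {
    g_trace.clear();
    static const Routine inits[] = {init_a, init_b};
    static const Routine finis[] = {fini_a, fini_b};
    Module m{legacy_init, inits, 2, finis, 2, legacy_fini};
    run_initializers(m);
    run_finalizers(m);
    return g_trace;
}
```

Because each routine is a separate array entry, a loader written this way could in principle stop, skip, or reorder individual routines, which is not possible when all initialization hides behind one opaque entry function.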
The PE (Windows) executable format standard [does not define any sections specific to module initialization](https://learn.microsoft.com/en-us/windows/win32/debug/pe-format#special-sections); instead, a `DllMain` function or any module constructors/destructors are included with the rest of the program code in the `.text` section. MSVC optionally provides the [`init_seg`](https://learn.microsoft.com/en-us/cpp/preprocessor/init-seg) pragma to specify a section name with module constructors to run first when compiling C++ code. However, such a section is only used if this pragma is explicitly specified by the programmer (unlikely) or in the niche cases MSVC will generate one itself (as stated by the documentation). The granularity this pragma provides is low with only `compiler`, `lib`, and `user` options. In contrast, the `.init_array`/`.fini_array` sections and `__attribute__((constructor(priority)))`/`__attribute__((destructor(priority)))` on Unix-like systems serve as a modular and robust means for controlling dynamic initialization order.
The Windows loader calls a module's `LDR_DATA_TABLE_ENTRY.EntryPoint` at module initialization or deinitialization with the respective `fdwReason` argument (`DLL_PROCESS_ATTACH` or `DLL_PROCESS_DETACH`); it has no knowledge of `DllMain` or C++ constructors/destructors in the module scope. Merging these into one callable `EntryPoint` is the job of a compiler. For instance, [MSVC compiles a stub into your DLL (`dllmain_dispatch`) that calls any module constructors followed by `DllMain` with the `DLL_PROCESS_ATTACH` argument](code/windows/dll-init-order-test/exe-test.c) (and destructors, of course, in the reverse order). Constructors other than `DllMain`, of course, initialize in the order they are laid out in code. The word `Main` in `DllMain` indicates that `DllMain` will run as the last constructor in the module similar to how the `main` function of a program runs after all constructors. Still, I find `DllMain` to generally be a bad name because it may lead people to use constructors in ways that one might use the `main` function of a program due to the similar name (like `DllMain` is just `main` but in a DLL, which is not the case). I also find Microsoft's use of the term "entry point" (e.g. in `LDR_DATA_TABLE_ENTRY.EntryPoint`) to describe calling a module's constructor and destructor routines bad because an [entry point has a specific definition that refers to the start of program execution](https://en.wikipedia.org/wiki/Entry_point). The reason for this name stems from both an EXE and its DLLs having a `LDR_DATA_TABLE_ENTRY`, especially since the Windows loader does accurately set the EXE's `EntryPoint` to the program's main function (then [just above `EntryPoint` is the `DllBase` member of `LDR_DATA_TABLE_ENTRY`](https://www.geoffchappell.com/studies/windows/km/ntoskrnl/inc/api/ntldr/ldr_data_table_entry.htm), which conflates EXEs with DLLs but the other way around).
So, a good question is posed by asking why the `LDR_DATA_TABLE_ENTRY` structure definition should be shared between an EXE and its DLLs at all seeing as [the GNU loader does not conflate these concepts](#analysis-commands.md#link_map-analysis) because, besides both being some code with data that is mapped into memory, these are completely different things. Up until one point in Windows history, [the `LDR_DATA_TABLE_ENTRY` structure definition was even shared between kernel and user-mode modules until separating into the `KLDR_DATA_TABLE_ENTRY` structure](https://www.geoffchappell.com/studies/windows/km/ntoskrnl/inc/api/ntldr/ldr_data_table_entry/index.htm): "The `LDR_DATA_TABLE_ENTRY` structure is NTDLL’s record of how a DLL is loaded into a process. In early Windows versions, this structure is similarly the kernel’s record of each module that is loaded for kernel-mode execution. The different demands of kernel and user modes eventually led to the separate definition of a `KLDR_DATA_TABLE_ENTRY`." The GNU loader calls legacy [`init` before going through the `init_array` functions](https://elixir.bootlin.com/glibc/glibc-2.38/source/elf/dl-init.c#L56) (the opposite of Windows where `DllMain` comes last after all other constructors, similar to how a `main` function would). All of these facts come together to paint a picture of Windows being too kernel-centric and monolithic, not considering the unique requirements of user-mode and correctly distinguishing between execution environments.
The name `DllMain` is inherited from [`LibMain`](https://learn.microsoft.com/en-us/archive/msdn-magazine/2000/july/under-the-hood-happy-10th-anniversary-windows#dlls-and-module-management), which along with Windows Exit Procedure (WEP) for exit, was its name in the 16-bit DLLs used by Windows 3.x (non-NT). When Windows was built for 16-bit applications (before Windows NT 3.1 and Windows 95, non-NT), [multitasking was cooperative not preemptive](https://web.archive.org/web/20150619005446/https://support.microsoft.com/en-us/kb/117567) so predictable scheduling meant there was no need for synchronization mechanisms such as loader lock. System libraries were also typically already loaded in the [single shared address space](#virtual-address-spaces) and that was the level tasks (what we now call "processes" were widely referred to as "tasks" before each application had an independent address space and execution context) would reference-count them on (**Nitpick:** Through the use of a [virtual machine](https://en.wikipedia.org/wiki/Virtual_DOS_machine), early Windows versions on MS-DOS could run a given application in protected mode, albeit with no true privilege separation, and multithreading, but applications had to specifically be written to support this functionality and [DOS was borked](https://retrocomputing.stackexchange.com/a/26228), that is why it got abandoned). Still, the MS-DOS EXE format allowed for a [pseudo-DLL](data/windows/timeline-verification) to specify whether module initialization should be "Global" or "Per-Process" (this information can be gathered using the old `exehdr` tool), which would have only lessened the room for module initialization issues in the global case. There was no dynamic linker in MS-DOS because the pseudo-DLLs of that time did not support imports as they did starting with Windows NT. 
This fact would have made it unnecessary to hold a lock while running a module initialization routine due to having no other DLLs depend on the initialization code being complete, at least not that the loader could have been aware of (a flag likely existed to ensure `FreeLibrary` could not unload a pseudo-DLL before it was done loading, but that is it). Obviously then, delay loading did not exist in Windows 3.x (so, the loader was naturally at the top of the lock hierarchy). Delayed DLL loading was added to [Visual C++ 6.0](https://winworldpc.com/product/visual-c/6x) (1998) as the public header `delayimp.h` (although, the MSVC compiler Microsoft internally used to build Windows may have supported delay loading earlier). Delay loading a DLL was and still is done by the [`/DELAYLOAD` linker option](https://github.com/reactos/reactos/blob/1ea3af8959da6fcf34d3eb92885fe01ce18de83c/sdk/cmake/msvc.cmake#L302-L317), which under the hood uses `LoadLibrary`/`GetProcAddress` with address caching to implement the functionality. Later, delay loading was also integrated into the native loader (*When?*). DLL thread initializers/deinitializers `DLL_THREAD_ATTACH` and `DLL_THREAD_DETACH` didn't exist [until Windows NT 3.1](http://web.archive.org/web/20240308195249/http://bytepointer.com/resources/pietrek_peering_inside_pe.htm#:~:text=DllCharacteristics). `NtTerminateProcess` was also introduced [with Windows NT 3.1](https://www.geoffchappell.com/studies/windows/win32/ntdll/history/names310.htm#:~:text=NtTerminateProcess) seemingly as an incredibly poor and hasty but deliberate "design" decision. Anyway, the MS-DOS API likely was not spawning threads all the time like Win32 does since it was originally designed for use in a cooperatively multitasking system.
COM (which tightly couples with the loader by placing itself at the top of the lock hierarchy in `CoFreeUnusedLibraries` and potentially other places) wasn't a foundational Windows technology used pervasively within the Windows API [until Windows NT 4.0](https://bitsavers.computerhistory.org/pdf/microsoft/windows_NT_4.0/Solomon_-_Inside_Windows_NT_2ed_1998.pdf#:~:text=Component%20Object%20Model) (released in 1996). These properties of older systems largely mitigated issues arising especially from `LibMain` on Windows versions prior to Windows NT 3.1 and `DllMain` in later Windows versions. Official [Windows 3.x (non-NT) books](https://bitsavers.computerhistory.org/pdf/microsoft/windows_3.1/) at the time (specifically "Windows Programmers Reference Volume 2 Functions" released in 1992) provided no guidance on `LibMain` besides that the "`LibMain` function is called by the system to initialize a dynamic-link library (DLL)". However, there was a note for WEP that explicitly stated "The `FreeLibrary` function should not be called from within a WEP function". Additionally, we know from [Matt Pietrek's Windows Internals book](https://bitsavers.computerhistory.org/pdf/microsoft/windows_3.1/Pietrek_-_Windows_Internals_1993.pdf) (released in 1993, shortly before Windows NT 3.1 came out and long before the author [later became a Microsoft employee](https://en.wikipedia.org/wiki/Matt_Pietrek)) that "A common problem programmers encounter is that functions like `MessageBox()` won't work inside the `LibMain()` of an implicitly-linked DLL". The reason is that creating a window to [show a message box](https://elliotonsecurity.com/perfect-dll-hijacking/offlinescannershell-mpclient-dll-missing-export-error.png) requires initialization of the USER application message queue by the `InitApp()` function in USER.
This message queue is not initialized in the `LibMain` of USER but by some setup work done before calling `WinMain` in the EXE (the book provides the relevant reverse engineered pseudocode of `C0W.ASM` to prove this): "For EXEs, the important parts of the startup code involves calling `InitTask()` and then `InitApp()`, which we cover momentarily. After those functions have been called, the EXE is completely initialized and ready to start its work as a Windows program." The book notes that initialization is done this way because a DLL "cannot own things that Windows associates with a task, like message queues" (i.e. a DLL may not exist for the full application lifetime) so it cannot own the application message queue. However, the core issue here is that each task had a single, global application message queue and a DLL couldn't create and tear down its own, independent message queue instance to perform a GUI operation detached from the application (obviously, this is no longer the case in modern Windows). Instead of the application lifetime (that of the EXE), the message box can live [in the instance lifetime](#the-process-lifetime) (from birth when the call to `MessageBox` is made to death when it returns, since `MessageBox` is a synchronous function), or for more complex GUI operations that continue in the DLL outside of its module initializer, [in the lifetime of the DLL](#the-process-lifetime) (from birth at `DLL_PROCESS_ATTACH` to death at `DLL_PROCESS_DETACH` for modern `DllMain`, since our DLL depends on the GUI subsystem). Thus, GUI operations not working from the `LibMain` of implicitly-linked DLLs was a consequence of tight coupling between the GUI subsystem and the operating system. [See here for information on Windows and Windows NT history.](#computer-history-perspective)
### C# and .NET
The [CLR loader](https://www.oreilly.com/library/view/essential-net-volume/0201734117/0201734117_ch02lev1sec5.html) uses a module's [`.cctor` section](https://web.archive.org/web/20170317220947/https://msdn.microsoft.com/en-us/library/aa290048(VS.71).aspx#vcconmixeddllloadingproblemanchor6) to initialize .NET assemblies. A .NET assembly is a layer of abstraction over an underlying native library. Each module `.cctor` section is the "managed module initializer" (i.e. assembly initializer). Microsoft uses the [managed module initializer to work around Windows issues surrounding loader lock](https://learn.microsoft.com/en-us/cpp/dotnet/initialization-of-mixed-assemblies#code) in .NET applications.
A static constructor in C# differs from its C++ counterpart because [C# specifies](https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/language-specification/classes#1512-static-constructors) that a static constructor, even when instance creation happens at the module scope, will initialize on-demand instead of at the start of a program or library:
> The static constructor for a closed class executes at most once in a given application domain. The execution of a static constructor is triggered by the first of the following events to occur within an application domain:
>
> - An instance of the class is created.
> - Any of the static members of the class are referenced.
>
> If a class contains the `Main` method (§7.1) in which execution begins, the static constructor for that class executes before the `Main` method is called.
C# also has finalizers (historically referred to as destructors in C#). The finalizer of an object will run if the garbage collector decides it can destroy the given object. Unlike low-level languages with manual memory management like C++, finalization is not typically necessary because the garbage collector traces memory allocations to do clean up. Garbage collectors delay resource cleanup, like freeing memory, as a function of how they work; it is a trade-off they make in exchange for easier programming. This delay extends to finalizers or destructors where these routines will not run until the garbage collector destroys the object. For unmanaged or system resources such as "windows, files, and network connections" (e.g. closing a database connection) [Microsoft documentation](https://learn.microsoft.com/en-us/dotnet/csharp/programming-guide/classes-and-structs/finalizers#using-finalizers-to-release-resources) condones the use of finalizers saying "you should use finalizers to free those resources". However, [starting with .NET 5, finalizers are not run at application exit](https://learn.microsoft.com/en-us/dotnet/csharp/programming-guide/classes-and-structs/finalizers#:~:text=.NET%205%20(including%20.NET%20Core)%20and%20later%20versions%20don%27t%20call%20finalizers%20as%20part%20of%20application%20termination.). The decision not to call destructors or finalizers at .NET runtime exit appears to have come down to an issue with [reachable objects + unjoined background threads](https://github.com/dotnet/runtime/issues/16028) still using those objects (Java appears to have fixed this issue by [replacing finalizers with cleaners](https://openjdk.org/jeps/421#Alternative-techniques), which will only call the cleanup action of an object once it becomes unreachable). The root issue, in the described case, is the unjoined thread that is still running when .NET shutdown occurs (similar to what we explored in "The Problem with How Windows Uses Threads").
Also for releasing unmanaged resources (i.e. external to the .NET runtime so they won't be garbage collected, like a Windows API file handle), an application can register for the [`AppDomain.ProcessExit`](https://learn.microsoft.com/en-us/dotnet/api/system.appdomain.processexit) event to perform cleanup before the .NET runtime exits in the process and a library assembly can use the [`AppDomain.DomainUnload`](https://learn.microsoft.com/en-us/dotnet/api/system.appdomain.domainunload) event to get the same functionality for its lifetime (this works because [a .NET assembly cannot unload without unloading the entire domain](https://learn.microsoft.com/en-us/dotnet/standard/assembly/load-unload)). Starting with .NET 5, an assembly can be dynamically loaded into an `AssemblyLoadContext`, which on `Unload`, frees all the assemblies in that load context and calls [`Unloading` events](https://learn.microsoft.com/en-us/dotnet/api/system.runtime.loader.assemblyloadcontext.unloading) for cleanup. Assemblies in the [default assembly load context](https://learn.microsoft.com/en-us/dotnet/api/system.runtime.loader.assemblyloadcontext.default) cannot be unloaded. Due to the nature of garbage collected languages, the cleanup of especially expensive or contested system resources is best performed by prescribing that users of your subsystem call a `Shutdown`, `Close`, `Disconnect`, etc. method on the relevant object when they are done using it, if possible. Although this approach cannot scale with libraries since they depend on each other and must be destructed in the reverse order they were constructed (even if you hack it by employing expensive reference counting on the individual resource-level, this approach falls apart with circular references or reference cycles), applications can use this technique.
If you find your application consuming lots of limited or contended system resources though, then you may want to reconsider using a garbage collected language since so-called [two-phase initialization (or cleanup) is an anti-pattern](https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#Rnr-two-phase-init).
C# supports the [`ModuleInitializer` attribute](https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/attributes/general#moduleinitializer-attribute) for initialization code that is to run when the assembly loads even when that assembly is a library (like traditional static constructors). Presumably, C# module initializers require protection from a global CLR initialization lock. In C# 9 and .NET 5 (released together in 2020), [module initializers were added to the language and runtime out of necessity](https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/proposals/csharp-9.0/module-initializers#summary).
The unexpected initialization time of C# static constructors can cause unforeseen problems similar to how Windows delay loading does for operating system initializers. For instance, a static constructor ["call is made in a locked region based on the specific type of the class"](https://learn.microsoft.com/en-us/dotnet/csharp/programming-guide/classes-and-structs/static-constructors#remarks). So, if creating an instance of a class for the first time happens at an unexpected time (perhaps via proxy through another call) like when the thread is holding a lock, and there exists a static constructor (of the same class type) that acquires the same external lock in the reverse order, then lock order inversion and consequently ABBA deadlock can occur. CLR lazy loading/initialization does have a couple significant mitigating factors that make it safer than native library lazy loading, namely: Firstly, lazy initialization can only occur upon instance creation which is necessarily more expected because it's already known that typical per-instance object constructors will run at instance creation time (unlike native library lazy loading where the initialization can potentially happen on every call to a DLL import). Though, this does leave the other, less common, static constructor trigger of referencing a member in a static class somewhat up in the air as to its safety at the given time. Secondly, static constructors are split into their own routines and initialize with granular, per-instance MT-safe synchronization instead of a broadly serializing "CLR static constructor lock", thus decreasing the chance of trying to reenter initialization or deadlocks. Lazy initialization can still become problematic if your lazy initializer routine accidentally tries to lazily initialize itself again (this issue is typically an artifact of circular dependencies). In regard to library loading, a synchronized, lazily initializing global type (e.g.
a C# static constructor) should never load or unload libraries (or higher level .NET assemblies) to ensure that the OS loader (also CLR loader for .NET assemblies) sensibly remains at the top of the lock hierarchy. This steadfast rule must be in place to maintain lock hierarchy. If some data is only accessed from a single thread, though, then lazy initialization may not require synchronization (synchronization is mandatory for C# static constructors and is the [default for `Lazy` types](https://learn.microsoft.com/en-us/dotnet/api/system.lazy-1?view=net-9.0#thread-safety)). Note that Microsoft documentation breaks this sensible idea on lock hierarchy by [recommending programmers call `LoadLibrary` from lazy static constructors](https://learn.microsoft.com/en-us/dotnet/csharp/programming-guide/classes-and-structs/static-constructors#usage). Regardless of synchronization, modules with significant [cross-cutting concerns](https://en.wikipedia.org/wiki/Cross-cutting_concern#Examples) should [never lazily initialize](https://devblogs.microsoft.com/oldnewthing/20070815-00/?p=25573), instead initializing at module load-time, or preferably initializing at compile-time if possible while having little to no dependencies. From purely a performance point of view, lazy initializers could introduce ["measurable overhead"](https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/proposals/csharp-9.0/module-initializers#motivation) because the language or runtime must internally perform atomic or synchronized checks to decide whether or not the initializer needs to be run on every pass (this cost is at odds with the benefit of potentially never having to run the initializer if, for example, the application goes down a different code path or errors out early... I'm looking at you, Rust `lazy_static` and `once_cell`).
With all these factors in mind, it can generally be safe to use a synchronized, lazily initializing global type as long as an application has a clear structure that ensures a lazy initializer routine will not depend on itself through some means (directly or indirectly), and that this thinking extends to subsystems that your code depends on (e.g. the OS loader).
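For readers more familiar with native code, the closest C++ analog to a synchronized, lazily initializing global type is a function-local static, which the language guarantees to initialize once in a thread-safe manner. The sketch below is illustrative only (`Config`, `load_config`, and `get_config` are hypothetical names) and shows both the on-demand timing and the hidden per-call check:

```cpp
#include <cassert>

struct Config {
    int value;
};

int g_init_runs = 0;

// The lazy initializer. To stay safe it must not call get_config() itself
// (directly or indirectly) and should not load or unload libraries.
static Config load_config() {
    assert(g_init_runs == 0);  // would fire if the initializer ever ran twice
    ++g_init_runs;
    return Config{42};
}

const Config &get_config() {
    // Initialized on first use, not at module load. The compiler emits a
    // synchronized guard check here that every call has to pass through.
    static const Config config = load_config();
    return config;
}
```

Nothing runs until the first `get_config()` call, after which further calls return the same object without rerunning `load_config`. The guard check on each call is the native equivalent of the "measurable overhead" noted above.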
The CLR loader, particularly the fact that it intentionally runs outside the OS loader, is a hack because only one of these two components can be at the top of the lock hierarchy and since the OS loader starts first, it should take precedence. By the CLR loader placing itself higher in the lock hierarchy than the OS loader, the CLR becomes tightly coupled with the OS loader. Ideally, the CLR under C# should be able to, as a modular subsystem, safely abstract from the OS without worrying about low-level concerns within the native loader. In particular, it should ideally be possible for C# to use the same constructors and destructors as C++ because Microsoft has tightly integrated .NET into Windows thus making it possible to accidentally utilize the technology when the programmer didn't intend to, such as via [COM interop](https://en.wikipedia.org/wiki/COM_Interop) (there are likely some cases where the Windows API internally uses .NET through COM interop in an in-process server).
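The notion of a lock hierarchy can be made concrete with a small checker. This is a generic sketch of the idea rather than anything the Windows or CLR loaders implement: `HierarchicalMutex`, its levels, and `t_current_level` are hypothetical, with a higher level meaning "closer to the top of the hierarchy" (acquired first).

```cpp
#include <cassert>
#include <climits>
#include <mutex>

// Level of the most recently acquired lock on this thread.
// INT_MAX means the thread currently holds no hierarchical lock.
static thread_local int t_current_level = INT_MAX;

class HierarchicalMutex {
public:
    explicit HierarchicalMutex(int level)
        : level_(level), previous_level_(INT_MAX) {}

    // Locks and returns true only if this lock sits strictly below every
    // lock the thread already holds; otherwise refuses and returns false.
    bool acquire() {
        if (level_ >= t_current_level) return false;  // would invert the hierarchy
        mutex_.lock();
        previous_level_ = t_current_level;
        t_current_level = level_;
        return true;
    }

    void release() {
        assert(t_current_level == level_);  // locks must be released in reverse order
        t_current_level = previous_level_;
        mutex_.unlock();
    }

private:
    std::mutex mutex_;
    const int level_;
    int previous_level_;
};
```

With a hypothetical `loader` lock at level 2 and an `app` lock at level 1, taking `loader` then `app` succeeds, while attempting `loader` when already holding `app` is refused. Rejecting the second ordering on every thread is what rules out the interleaving that produces ABBA deadlock.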
## The Root of `DllMain` Problems
The Windows loader, in contrast to Unix-like loaders, is more vulnerable to correctness issues, and deadlock or crash scenarios for a variety of architectural reasons. "The Root of `DllMain` Problems" (or, more casually, "`DllMain` Rules Rewritten") provides a fundamental understanding of `DllMain` hurdles and why they exist. It improves on Microsoft's ["DLL Best Practices"](https://learn.microsoft.com/en-us/windows/win32/dlls/dynamic-link-library-best-practices#general-best-practices), often referred to as the "`DllMain` rules" informally. "DLL Best Practices" originally dates back to a technologically ancient [2006 Microsoft document](data/windows/dll-best-practices/README.md) for providing guidance on what actions are safe to perform from `DllMain`, as well as other module initializers and finalizers, or constructors and destructors running in the module scope of a library. The architectural reasons for `DllMain` issues on Windows ([specifically Windows NT](#computer-history-perspective)) include:
- Windows uniquely positions the loader at the bottom of any external lock hierarchy
- This placement is completely backwards because the loader is the first thing to start in a process
- Indeed, it is not a deliberate design decision, but rather one made as an afterthought due to [Windows' misuse of DLLs](#the-problem-with-how-windows-uses-dlls)
- As a result, the threat of [ABBA deadlock](#abba-deadlock) makes it unsafe to acquire any external lock (not used by NTDLL) from inside the loader without knowing how that lock is implemented
- Windows is the ultimate monolith
- The [broadness of the Windows API](https://en.wikipedia.org/wiki/Criticism_of_Microsoft#Vendor_lock-in) (thousands of DLLs in `C:\Windows\System32`, including everything from file creation to WinHTTP) in combination with its [lack of a clear separation between components](#the-problem-with-how-windows-uses-dlls) leads to operating-system-wide [dependency breakdown](#dependency-breakdown)
- Despite Windows prioritizing libraries and shared processes over programs and small processes at the operating-system-level, its library dependency infrastructure is significantly less robust and more tightly coupled than its Unix counterpart
- The Windows threading implementation [meshes with the loader at thread startup and exit](#dll-thread-routines-anti-feature) (`DLL_THREAD_ATTACH` and `DLL_THREAD_DETACH`)
- The synchronization requirement this added to threads broke the library subsystem lifetime, which led to [Microsoft condoning thread termination](https://devblogs.microsoft.com/oldnewthing/20150814-00/?p=91811) as a synchronization model and [Windows leaving the process in an inconsistent state at process exit thus breaking module destructors](#process-meltdown)
- Despite Windows prioritizing multithreading over multiprocessing at the operating-system-level, its threading implementation is significantly less robust and more prone to deadlocks than its Unix counterpart
- The monolithic architecture of the Windows API may cause the loader's lock hierarchy to become nested within the lock hierarchy of a separate subsystem; if this nesting interleaves with another thread nesting in the opposite order, ABBA deadlock is the result
- The [COM and loader subsystems exhibit tight coupling](#on-making-com-from-dllmain-safe) whereby Microsoft's implementation of COM may interact with the loader while holding the COM lock, an issue that becomes increasingly problematic due to the Windows API's extensive use of COM behind the scenes (including much of the Windows [User API](https://learn.microsoft.com/en-us/windows/win32/api/winuser/), [Windows Shell](https://learn.microsoft.com/en-us/windows/win32/api/_shell/), and [WinHTTP AutoProxy](https://learn.microsoft.com/en-us/windows/win32/winhttp/autoproxy-issues-in-winhttp#security-risk-mitigation) to name a few)
- The heavy use of [thread-local data](#flimsy-thread-local-data) throughout the Windows API can lock its users to the unspecified thread that loaded the library
- Windows kernel mode and user mode closely integrate (NT and NTDLL), whereas [Unix began with modularity as a core value](https://en.wikipedia.org/wiki/Unix_philosophy)
- This value carried through to the formalization of Unix in the POSIX and C standards, and the System V ABI specification
- Windows overrelies on dynamic initialization and dynamic operations in general
- It is always best practice for robustness and performance to initialize statically (i.e. at compile time) over dynamically (using module initializers and finalizers including Windows `DllMain`) if feasible
- Windows commonly requires dynamic initialization even for core system functionality, such as [initializing a critical section](https://learn.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-initializecriticalsection)
- The Windows Process Environment Block (PEB), along with its common use by getter functions like `GetProcessHeap`, artificially [enforces dynamic initialization](#the-peb-problem)
- The multithreading-first design of Windows can increase contention when accessing popular shared resources such as the process heap, which may require some Windows components [dynamically create their own resources](https://web.archive.org/web/20140805104223/https://blogs.msdn.com/b/oleglv/archive/2003/10/28/56142.aspx#:~:text=static%20CRT%20allocates%20its%20heap%20in%20DllMain%20of%20the%20owning%20DLL)
- In contrast, POSIX data structures commonly provide a [static initialization option](#constructors-and-destructors-overview), and Unix prioritizes multiprocessing through minimal and pluggable processes
- Unexpected library loading
- Inherently, delay loading may unexpectedly cause library loading when a programmer didn't intend, thus leading to [an array of potential issues that could deadlock or crash a process](#library-lazy-loading-and-lazy-linking-overview)
- MacOS previously supported lazy loading until Apple removed it, likely due to scenarios where it becomes an anti-feature and any performance gains not being worth the trade-off
- Windows institutes that [creating a process can load libraries into the existing process](#library-loading-locations-across-operating-systems)
- Windows runtime libraries commonly provide poor [thread-safe](https://en.wikipedia.org/wiki/Thread_safety) implementations that restrict concurrency
- Notable runtime components in Windows such as [`atexit` registration and its callbacks](code/glibc/atexit/README.md) are not designed with deadlock-free thread safety in mind (while runtime components are not directly part of the loader, their implementations may use or integrate with it)
- Historical library loader issues
- Poor ability for reentrancy during module initialization
- The Windows API heavily relying on dynamic library loading, especially past its intended use case of loading extension libraries, requires a loader with robust reentrancy capabilities
- Microsoft mostly built the legacy loader to be reentrant; however, how it performed module initialization was subject to crashes or correctness issues due to the loader's poor ability to enforce the correct order of operations when initializing modules upon being reentered (**this issue was at the heart of the previous "`DllMain` Best Practices"**)
- [Starting with Windows 8](https://www.geoffchappell.com/studies/windows/km/ntoskrnl/inc/api/ntldr/ldr_ddag_node.htm), the loader maintains a dependency graph throughout its operation thus *significantly* resolving out-of-order module initialization problems that can occur when loading a library from `DllMain`, but there could potentially still be issues due to circular dependencies (["dependency loops"](https://learn.microsoft.com/en-us/windows/win32/dlls/dllmain#remarks:~:text=dependency%20loops))
- The legacy loader could only [walk the dependency graph](https://learn.microsoft.com/en-us/archive/blogs/mgrier/the-nt-dll-loader-dll_process_attach-reentrancy-step-1-loadlibrary) while immediately collapsing it into a linked list thus giving the module initialization order (this list was [only formed by walking the import address tables (IATs)](https://github.com/reactos/reactos/blob/513f3d179cff234821c359db034409e94a278320/dll/ntdll/ldr/ldrpe.c#L369-L371) at the start of a load and [was not able to dynamically adjust](https://github.com/reactos/reactos/blob/513f3d179cff234821c359db034409e94a278320/dll/ntdll/ldr/ldrinit.c#L694-L698) to the reentrant case of someone calling `LoadLibrary` from `DllMain`)
- The Windows `GetModuleHandle` function was broken
- [`GetModuleHandle` from `DllMain`](#getprocaddress-can-perform-module-initialization) can be problematic because it assumes a DLL is already loaded when it may not be yet or has only partially loaded (since the loader cannot know in advance that a given DLL depends on another DLL if it does not dynamically link to it)
- With the release of an `Ex` function and other patchwork, Microsoft has mostly fixed this issue, but it is still not as robust as its GNU loader counterpart, or as POSIX, which only defines `dlopen` with no flag for this functionality
- Backward and forward compatibility
- The loader runs `DllMain` code under that DLL's activation context (`LDR_DATA_TABLE_ENTRY.EntryPointActivationContext`) for application compatibility, if it has one, which [could cause unforeseen issues](https://devblogs.microsoft.com/oldnewthing/20080910-00/?p=20933) due to conflicting requirements laid out by another activation context
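The difference between keeping a dependency graph and collapsing it into a list up front (the "Historical library loader issues" point above) can be sketched as follows. This is only a minimal illustration, not the loader's actual algorithm; the module names and the `init_order` helper are hypothetical.

```python
def init_order(deps, root):
    """Return modules in initialization order: dependencies before dependents.

    deps maps a module name to the list of modules it imports. A depth-first
    post-order walk emits a module only after everything it depends on.
    A module that is already being visited is skipped, which cuts cycles.
    """
    order, seen = [], set()

    def visit(module):
        if module in seen:
            return
        seen.add(module)
        for dep in deps.get(module, []):
            visit(dep)
        order.append(module)

    visit(root)
    return order


# Hypothetical modules: app.exe imports a.dll and b.dll; both import c.dll.
deps = {"app.exe": ["a.dll", "b.dll"], "a.dll": ["c.dll"], "b.dll": ["c.dll"]}
static_order = init_order(deps, "app.exe")

# A list computed once at the start of a load knows nothing about an edge that
# only appears at run time (say, a.dll loading d.dll from its DllMain).
# With the graph retained, the order can be recomputed with the new edge.
deps["a.dll"] = ["c.dll", "d.dll"]
deps["d.dll"] = ["c.dll"]
dynamic_order = init_order(deps, "app.exe")

print(static_order)
print(dynamic_order)
```

In the recomputed order, `d.dll` still comes after its own dependency `c.dll` and before `a.dll`, which is the property a pre-flattened list cannot guarantee once it is reentered.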
The result of these architectural facts and faults, among other consequences, is that subsystems are often unable to construct and destruct safely. Outcomes of this simple fact are far-reaching with impact that can be seen throughout all facets of the operating system. Subsystems are often left employing delicate hacks (including, if you are Microsoft, making subsystems more reliant on the kernel and broader system leading to tighter coupling), introducing anti-patterns, and falling back on design decisions that are deficient in performance to work around what is a fundamental failing of the operating system. Alternatively, a subsystem could unknowingly perform an unsafe action in its module routines, which may work until a rare but possible race condition is met, a single point of failure that constantly challenges the soundness and robustness of Windows with every additional module. On an ad-hoc basis, Microsoft adds blockers to ensure that common actions that can fail or may be risky from a Windows module routine cannot proceed. However, the checks necessary to implement these blockers are only possible due to the tight coupling that is prevalent in Windows, can add up to have a negative effect on performance when done at run-time, and can never create a fully correct system as long as root issues persist. It is unfortunate that while a DLL is the executable unit Windows is architected around, `DllMain` exists as one of the most fragile parts of Windows.
With a newfound understanding of `DllMain` woes and the greater perspective gained from admiring how Unix-like operating systems get it right, you can reason about the safety of performing a given action from `DllMain`, module initializers and finalizers, or other constructors and destructors running in the module scope of a library.
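One mechanical way to reason about such safety is to record the order in which locks are taken and flag any pair seen in both orders, which is the precondition for an ABBA deadlock. The sketch below is a simplified illustration; the `LockOrderChecker` class and the lock names `loader` and `subsystem` are hypothetical, and it only catches direct two-lock inversions, not longer cycles.

```python
class LockOrderChecker:
    """Record which locks are held when another is acquired and flag inversions."""

    def __init__(self):
        self.edges = set()  # (held, acquired) pairs observed so far

    def acquire(self, held, new):
        """Note that `new` is taken while every lock in `held` is owned.

        Returns the locks in `held` for which the opposite order
        (`new` first, then the held lock) was recorded earlier.
        """
        inversions = [h for h in held if (new, h) in self.edges]
        self.edges.update((h, new) for h in held)
        return inversions


checker = LockOrderChecker()
# Thread 1: runs a module initializer (loader lock held), then takes a subsystem lock.
first = checker.acquire(held=["loader"], new="subsystem")
# Thread 2: holds the subsystem lock, then calls into the loader.
second = checker.acquire(held=["subsystem"], new="loader")
print(first, second)
```

The first acquisition reports nothing because no order has been recorded yet; the second reports `subsystem` because the two threads nest the same pair of locks in opposite orders.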
**Alpha Notice**: This work is currently considered to be of alpha quality. The document, including this section, is incomplete: there are still strong arguments I need to add and some sections of this document could probably be written better.
## The Problem with How Windows Uses DLLs
A DLL or library is [modular](https://learn.microsoft.com/en-us/troubleshoot/windows-client/setup-upgrade-and-drivers/dynamic-link-library#dll-advantages) code that processes can load to use the contained functionality. A linker can connect libraries together to create dependencies between them. Defining dependencies between libraries requires careful management of the dependency tree to avoid creating conflicts such as circular dependencies.
How the Windows operating system scatters functionality across multiple libraries leads to the uncontrolled creation of dependencies. In particular, DLLs on Windows lack a clear separation of components causing nearly *everything to depend on everything else* (if not directly, then by proxy through a dependent DLL). It's this lack of organization between Windows libraries that dooms what a library is supposed to be and transforms the Windows API into a monolithic beast.
As a hack to work around this root issue, Microsoft (ab)uses the "delay loading" Windows feature to stop dependency loops. However, [delay loading, or library lazy loading, is an inherently broken feature at the operating system level](#library-lazy-loading-and-lazy-linking-overview). Thus, delay loading only turns the issue into an equally bad but manageable problem. This delay loading hack is pervasive throughout virtually all parts of the Windows API. We will now give a quick walkthrough of common DLLs, core to Windows' functioning, which exhibit the described hack:
```
> dumpbin /imports C:\Windows\System32\kernel32.dll
...
Section contains the following delay load imports:
RPCRT4.dll
00000001 Characteristics
00000001800B7A48 Address of HMODULE
00000001800BF000 Import Address Table
000000018009D0E0 Import Name Table
000000018009D268 Bound Import Name Table
0000000000000000 Unload Import Name Table
0 time date stamp
0000000180025D2D 16C RpcAsyncCompleteCall
0000000180025D09 211 RpcStringBindingComposeW
0000000180025CF7 176 RpcBindingFromStringBindingW
0000000180025C6C 16E RpcAsyncInitializeHandle
0000000180025D1B 2E I_RpcExceptionFilter
0000000180025D3F 186 RpcBindingSetAuthInfoExW
0000000180025D87 94 Ndr64AsyncClientCall
0000000180025D63 16B RpcAsyncCancelCall
0000000180025D75 174 RpcBindingFree
0000000180025D51 215 RpcStringFreeW
...
```
The most common Windows DLL after `NTDLL.dll`, `KERNEL32.dll`, contains one of these hacks for loading `RPCRT4.dll`, the RPC runtime. `RPCRT4.dll` immediately depends on `KERNEL32.dll`, and Microsoft chose `KERNEL32.dll` as the DLL to break the immediate dependency loop. Additionally, `KERNEL32.dll` delays the loading of its `RPCRT4.dll` dependency to ensure the RPC runtime and its dependencies aren't unnecessarily loaded into all processes that load `KERNEL32.dll` (which is all standard Windows processes, not including pico processes).
Worse, `KERNEL32.dll` immediately depends on `KernelBase.dll`, which in turn depends on `ntdll.dll` starting with Windows 7. In modern Windows, we can see `KernelBase.dll` is stuffed with delay loading hacks that lead back to an astounding 19 DLLs: `KERNEL32.dll` (a direct circular dependency), `advapi32.dll`, `apisethost.appexecutionalias.dll`, `appxdeploymentclient.dll`, `bcryptPrimitives.dll`, `capauthz.dll`, `daxexec.dll`, `deviceaccess.dll`, `efswrt.dll`, `feclient.dll`, `gpapi.dll`, `mrmcorer.dll`, `ntdsapi.dll`, `sechost.dll`, `twnapi.appcore.dll`, `user32.dll`, `windows.staterepositoryclient.dll`, `windows.staterepositorycore.dll`, and `windows.storage.dll`.
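To make the loops concrete, here is a small sketch that searches for a cycle in a dependency graph, first with immediate imports only and then with delay-load imports added. It uses only the handful of edges named in the text above (a tiny subset of the real graph), and the `find_cycle` helper is my own illustration rather than anything Windows provides.

```python
def find_cycle(graph):
    """Return one dependency cycle as a list of nodes, or None if acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for dep in graph.get(node, []):
            state = color.get(dep, WHITE)
            if state == GREY:  # back edge: dep is already on the current path
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in sorted(graph):
        if color.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Edges named in the text (names lowercased for consistency).
immediate = {
    "kernel32.dll": ["kernelbase.dll"],
    "kernelbase.dll": ["ntdll.dll"],
    "rpcrt4.dll": ["kernel32.dll"],
}
delay = {
    "kernel32.dll": ["rpcrt4.dll"],
    "kernelbase.dll": ["kernel32.dll"],
}
combined = {
    name: immediate.get(name, []) + delay.get(name, [])
    for name in sorted(set(immediate) | set(delay))
}

print(find_cycle(immediate))  # immediate imports alone
print(find_cycle(combined))   # immediate plus delay-load imports
```

With these edges, the immediate imports form no cycle, while adding the delay loads exposes the `KERNEL32.dll` and `KernelBase.dll` loop described above.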
Here is the same hack in a couple more DLLs central to the Windows API, including the core DLL to the [User API](https://learn.microsoft.com/en-us/windows/win32/api/winuser/) (which encompasses many other Windows APIs):
```
user32.dll Delay Loads:
api-ms-win-power-setting-l1-1-0.dll -> powrprof.dll
api-ms-win-power-base-l1-1-0.dll -> powrprof.dll
api-ms-win-service-private-l1-1-0.dll -> sechost.dll
MSIMG32.dll
WINSTA.dll
ext-ms-win-edputil-policy-l1-1-0.dll -> edputil.dll
```
And some more, this time in the Advanced Windows 32 Base API DLL, used for [security calls](code/windows/library-init-lazy-load/lib1.c) and to [provide access to the Windows Registry](https://en.wikipedia.org/wiki/Windows_Registry#Programs_or_scripts):
```
advapi32.dll Delay Loads:
CRYPTSP.dll
WINTRUST.dll
CRYPTBASE.dll
SspiCli.dll
USER32.dll
CRYPT32.dll
bcrypt.dll
api-ms-win-security-lsalookup-l1-1-0.dll -> sechost.dll
api-ms-win-security-credentials-l1-1-0.dll -> sechost.dll
api-ms-win-security-credentials-l2-1-0.dll -> sechost.dll
api-ms-win-security-provider-l1-1-0.dll -> ntmarta.dll
api-ms-win-devices-config-l1-1-1.dll -> cfgmgr32.dll
```
Practically every DLL you look at in the Windows API is swamped with these delay loading hacks. Specifically, there are ~3000 DLLs (`.dll` files) in `C:\Windows\System32` (not including subdirectories). Of those approximately 3000 DLLs, some are ["resource-only DLLs"](https://learn.microsoft.com/en-us/cpp/build/creating-a-resource-only-dll) (e.g. `imageres.dll`), which can be excluded (I have not bothered though, since the figure is already staggering). By my measurement, this means **over half** of Windows DLLs (1663 exactly, within `C:\Windows\System32` not including subdirectories) exhibit a delay loading hack. Note that this figure only includes DLLs that directly include a delay load, not DLLs that immediately depend on another DLL that includes a delay load. For a comprehensive list of affected DLLs, see the [final output](data/windows/dll-deps-research/delay-loads.txt) of the [`dumpbin-delay-loads.ps1` script](data/windows/dll-deps-research/dumpbin-delay-loads.ps1).
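As a rough illustration of the kind of check involved, the sketch below pulls delay-loaded DLL names out of `dumpbin /imports` text. It is not the linked PowerShell script; the `delay_loaded_dlls` helper is hypothetical and assumes the output layout shown in the `kernel32.dll` excerpt earlier (a "delay load imports" header followed by one DLL name per import block).

```python
import re

DELAY_HEADER = "Section contains the following delay load imports:"
# Assumption: a DLL name appears alone on its line and ends in ".dll".
DLL_LINE = re.compile(r"^[\w.\-]+\.dll$", re.IGNORECASE)


def delay_loaded_dlls(dumpbin_output):
    """Return the DLL names listed under the delay-load section of the text."""
    names = []
    in_section = False
    for raw in dumpbin_output.splitlines():
        line = raw.strip()
        if line == DELAY_HEADER:
            in_section = True
        elif line.startswith("Section contains") or line == "Summary":
            in_section = False  # a different section has started
        elif in_section and DLL_LINE.match(line):
            names.append(line)
    return names


sample = """\
Section contains the following delay load imports:
RPCRT4.dll
00000001 Characteristics
00000001800B7A48 Address of HMODULE
0000000180025D2D 16C RpcAsyncCompleteCall
"""

print(delay_loaded_dlls(sample))
```

Running such a function over every DLL's output and counting the non-empty results would give a figure comparable to the one above.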
The vast quantity of circular dependencies throughout the DLLs that make up the Windows API breaks the vital and commonly ascribed modularity benefit of the DLL. This tight coupling leads to a variety of poor outcomes for the operating system and software running on it. For obvious reasons, it would be undesirable to load so many libraries for even the simplest "Hello, World!" class of applications. As such, Microsoft needed a remedy and decided to move the problem, instead of fixing it, with [delay loading](#library-lazy-loading-and-lazy-linking-overview).
**NOTE:** Work on the [Dependency Breakdown](#dependency-breakdown) section is pending to separate the definition of lazy library loading from arguments against it in this document. The work here is INCOMPLETE, ALPHA QUALITY, and I still have my strongest arguments to add.
For further research on Windows' misuse of DLLs, [see here](#more-research-on-windows-usage-of-dlls).
### Problem Solved?
Identifying issues is important, but it's even more valuable to pair that with ideas for solutions. So, let's come up with some actionable solutions for the root issue we explored here!
#### Solution #1: API Sets Extension
With Windows 7 came the introduction of [API Sets](https://www.geoffchappell.com/studies/windows/win32/apisetschema/index.htm). API sets are an application compatibility mechanism designed as an alternative to activation contexts for finding the correctly versioned DLL to load (i.e. to help in the fight against DLL Hell). API sets are promising because they neatly sort the Windows API into smaller and more modular units.
As it stands, an API set is merely an alias that maps to a real DLL on disk. In this solution, we propose that API set DLL names become or more closely imitate real DLLs. Perhaps they could be called "virtual DLLs". With enough granularity, the hope is that Windows DLLs would naturally lose their circular dependencies because they were using separate parts of the same DLL.
A caveat to this solution exists if the circular dependency is formed because a specific API in one "real DLL A" requires functionality from "real DLL B" while "real DLL B" also requires functionality from "real DLL A" within the scope of that API call. In this case, no level of API granularity could break the circular dependency. Before reaching per-API granularity, there could also be other practical refactoring limitations due to the underlying implementation.
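Both the hope and the caveat can be shown with a toy model. In this sketch, the APIs `a1`, `a2`, `b1` and the DLLs `A.dll` and `B.dll` are hypothetical, and dependencies are tracked per API, then collapsed to the DLL level for comparison. It uses Python's standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import CycleError, TopologicalSorter


def is_acyclic(graph):
    """True if the dependency graph (node -> set of dependencies) has no cycle."""
    try:
        tuple(TopologicalSorter(graph).static_order())
        return True
    except CycleError:
        return False


def dll_level(api_deps, owner):
    """Collapse API-level dependencies into DLL-level ones via the owner map."""
    graph = {}
    for api, deps in api_deps.items():
        for dep in deps:
            if owner[api] != owner[dep]:
                graph.setdefault(owner[api], set()).add(owner[dep])
    return graph


# Hypothetical layout: A.dll exports a1 and a2, B.dll exports b1.
owner = {"a1": "A.dll", "a2": "A.dll", "b1": "B.dll"}

# a1 needs b1, and b1 needs a2: a loop between DLLs, but not between APIs.
separable = {"a1": {"b1"}, "b1": {"a2"}, "a2": set()}

# The caveat: a1 needs b1 and b1 needs a1 within the same call.
inseparable = {"a1": {"b1"}, "b1": {"a1"}, "a2": set()}

print(is_acyclic(dll_level(separable, owner)), is_acyclic(separable))
print(is_acyclic(dll_level(inseparable, owner)), is_acyclic(inseparable))
```

In the first case the cycle exists only at the DLL level and disappears at finer granularity; in the second it survives at every level, so it needs one of the other solutions.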
#### Solution #2: Organize Subsystems
In cases where increasing the granularity in the set of APIs provided by a library fails to remove circular dependencies, it may be warranted to reorganize by merging multiple libraries into one. Generally, it is always possible to remove a dependency cycle by decoupling subsystems and employing [cycle breaking strategies](https://en.wikipedia.org/wiki/Acyclic_dependencies_principle#Cycle_breaking_strategies).
#### Solution #3: Reimplementation
With the realization that the Windows API more closely resembles a dog chasing its own tail, planets orbiting an NT kernel star, or simply a web of libraries more than it does a directed acyclic graph (DAG) of modular subsystems, reimplementing large parts of the Windows API to use a different backend becomes a viable solution.
Wine is already making progress here by translating DirectX into Vulkan (DXVK) and implementing Windows audio/video APIs on top of GStreamer or FFmpeg. This solution also comes with other benefits such as typically improving performance and efficiency over the Microsoft Windows alternatives.
#### Summary
Solving the problem we explored here will likely require a multifaceted resolution involving all three solutions. Once the Windows API is organized, it will be up to Microsoft Windows developers to remain conscientious about the dependencies their subsystems create. The vastness of the Windows API doesn't make coming up with the best solution to its circular dependency problem easy, but I maintain that it is possible and that application compatibility can come along for the ride.
## Dependency Breakdown
**NOTE:** Work in progress.
## Further Research on Windows' Usage of DLLs
### The DLL Host
A DLL or library is modular code that processes can load to use the contained functionality. If this were the extent of how Windows, like any other operating system, utilized DLLs, then it would be correct. However, Windows' usage of DLLs goes far beyond their intended use. Introducing: the DLL host.
In Windows, DLL hosts are programs that serve only to host other DLLs that provide the core functionality of an application or service. Common DLL hosts include `rundll32.exe`, `svchost.exe`, `taskhostw.exe`, and [COM surrogates](https://learn.microsoft.com/en-us/windows/win32/com/dll-surrogates) such as `dllhost.exe`.
DLL hosts are prevalent throughout Windows, with `svchost.exe` alone accounting for **over half** (55% or 70/126 processes by [my measurement](data/windows/dll-deps-research/process-count.ps1)) of all processes on the system upon booting up Windows.
Clearly, Windows really likes DLL hosts and specifically [shared service processes](https://en.wikipedia.org/wiki/Svchost.exe). But why? No other operating system has the concept of a DLL host and they seem to get along just fine.
Well, for a start, we know processes are more expensive on Windows than on Unix systems. Looking at the `Private Bytes` consumed by even the most minimal of processes in Process Explorer verifies this to be the case:
```
AggregatorHost.exe | 912K
smss.exe | 1072K
svchost.exe | 1284K
svchost.exe | 1292K
svchost.exe | 1384K
```
Even the smallest processes are eating up about 1 MiB or more of memory each! The plethora of highly interconnected DLLs making up the Windows API would also certainly contribute to slower process start times.
Going further, another reason for shared services could be that being in the same process allows for faster communication between similar services (especially since the base overhead of a system call, as well as the cost of the system calls themselves, are generally known to be higher on Windows than on Unix systems). This was the same motivation for in-process COM servers. By running `tasklist /svc | findstr ,`, we can find shared service hosts containing multiple services:
```
lsass.exe 860 KeyIso, SamSs, VaultSvc
svchost.exe 984 BrokerInfrastructure, DcomLaunch, PlugPlay,
Power, SystemEventsBroker
svchost.exe 928 RpcEptMapper, RpcSs
svchost.exe 2672 BFE, mpssvc
svchost.exe 456 OneSyncSvc_59a4b,
PimIndexMaintenanceSvc_59a4b,
UnistoreSvc_59a4b, UserDataSvc_59a4b
```
That's interesting, out of all the shared service processes, **only five (including `lsass.exe`) are actually hosting multiple services in one process**. This is in stark contrast to [the large number of unrelated services that previous Windows versions packed into one process](https://web.archive.org/web/20190428105316/https://blogs.msdn.microsoft.com/larryosterman/2005/09/09/shared-services/):
```
1280 svchost.exe Svcs: AudioSrv,BITS,CryptSvc,Dhcp,dmserver,ERSvc,EventSystem,helpsvc,lanmanserver,lanmanworkstation,Netman,Nla,RasMan,Schedule,seclogon,SENS,SharedAccess,ShellHWDetection,srservice,TapiSrv,Themes,W32Time,winmgmt,wuauserv,WZCSVC
```
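The multi-service check behind the `tasklist /svc | findstr ,` one-liner can also be done more precisely by joining wrapped continuation lines before counting. The sketch below is a hypothetical helper, and it assumes the column layout seen in the modern output above: image name, PID, then a comma-separated service list that may wrap onto following lines.

```python
import re

# Assumption: a record starts with "<image name> <PID> <services...>".
RECORD = re.compile(r"^(.+?)\s+(\d+)\s+(\S.*)$")


def services_by_process(tasklist_output):
    """Map (image name, PID) to the list of services hosted by that process."""
    result = {}
    current = None
    for raw in tasklist_output.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = RECORD.match(line)
        if match:
            current = (match.group(1), int(match.group(2)))
            result[current] = []
            line = match.group(3)
        if current is None:
            continue  # header or separator before the first record
        names = [s.strip() for s in line.split(",")]
        result[current] += [s for s in names if s and s != "N/A"]
    return result


def shared_hosts(tasklist_output):
    """Return only the processes that host more than one service."""
    hosts = services_by_process(tasklist_output)
    return {key: svcs for key, svcs in hosts.items() if len(svcs) > 1}


sample = """\
svchost.exe 984 BrokerInfrastructure, DcomLaunch, PlugPlay,
Power, SystemEventsBroker
svchost.exe 928 RpcEptMapper, RpcSs
svchost.exe 1100 Schedule
"""
for (image, pid), services in shared_hosts(sample).items():
    print(image, pid, len(services))
```

The third record in the sample is made up to show a single-service host being filtered out.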
The simplest explanation for Microsoft no longer packing many services into one process like they used to is valuing the robustness of a separate virtual address space for each service over the expense that comes with it. One megabyte of memory, while not nothing, isn't nearly as valuable as it was when the average system was sporting fewer gigabytes of RAM than it is today. As a result, Windows shared services mostly appear to be a relic of the past and I wouldn't be surprised if Microsoft does away with them entirely at some point. Said another way, using a separate process for each service brings Windows closer to [microservice architecture](https://azure.microsoft.com/en-ca/solutions/microservice-applications) because "one component’s failure won’t break the whole app" (broadly speaking; the term microservice can take on more meaning in the cloud context).
Beyond robustness, multiple DLLs operating independently in a process with their own threads could actually hurt performance by causing unnecessary contention on in-demand resources like the process heap lock. Windows DLLs or threads sometimes use a private or local heap to help with this issue (see heaps in WinDbg with the `!heap` command). However, Windows API calls that create heap allocations implicitly often make full heap separation unattainable in practice. Concerns regarding process heap lock contention are especially pertinent because the Windows NT Heap implementation doesn't implement any measure to reduce blocking like the [glibc heap does with per-thread arenas](https://elixir.bootlin.com/glibc/glibc-2.38/source/malloc/malloc.c#L1) (and Microsoft's attempts at implementing a more concurrent and performant heap, [like the Segment Heap](https://github.com/microsoft/Windows-Dev-Performance/issues/39#issuecomment-729313323), [have not worked out](https://github.com/microsoft/Windows-Dev-Performance/issues/106)).
Another victim of the DLL host that should not be understated is ease of debugging. There will always be bugs, so it's crucial to be proactive in maximizing correctness and minimizing complexity so bugs can be fixed as quickly as they're spotted. A DLL host stands in the way of debugging for multiple reasons. Most obviously, a shared address space makes determining the source of a memory corruption bug challenging when multiple components or services operate in it. Additionally, [Microsoft won't be able to send Windows Error Reporting (WER) reports for crashes](https://devblogs.microsoft.com/oldnewthing/20130104-00/?p=5643) because that's tracked by the EXE hosting the DLL (as well as other notable concerns like making application compatibility more difficult).
Shared service processes use service DLLs. Since a service DLL exists solely for the purpose of allowing multiple services to exist in one process, one would not expect DLLs to take a dependency on a service DLL. A service DLL is more like an EXE in that `svchost.exe` delegates control of the application lifetime to it. So, depending on a DLL that works like it's an EXE is surely a recipe for circular dependencies, which are bad. Alas, upon searching, [I did find some DLLs depending on service DLLs](data/windows/dll-deps-research/dlls-depending-on-service-dlls.txt) (this search only being in `C:\Windows\System32`, not including subdirectories).
Once again, DLLs provide a false promise of modularity. Bad Microsoft—bad.
### DLL Procurement
Windows will load and execute a DLL from practically anywhere, which, as you can imagine, does not bode well for the security of the operating system and frequently introduces security vulnerabilities that could never exist on other systems.
[See here for more information.](#library-loading-locations-across-operating-systems)
### One DLL, One Base Address
Today, Windows still does not support per-process address space layout randomization (ASLR) of libraries. Its absence effectively makes this crucial exploit mitigation useless for defending against privilege escalation, including sandbox escape (e.g. from a web browser), on Windows. This weakness markedly tips the scales in favor of the attacker (e.g. in a ROP attack).
This limitation exists because Windows mandates that all mappings of an image be at the same address in virtual memory across processes, I believe due to the operating system's heavy usage of shared memory and for historical reasons.
[See here for more information.](https://cloud.google.com/blog/topics/threat-intelligence/six-facts-about-address-space-layout-randomization-on-windows/#:~:text=Fact%202%3A%20Windows%20loads%20multiple%20instances%20of%20images%20at%20the%20same%20location%20across%20processes%20and%20even%20across%20users%3B%20only%20rebooting%20can%20guarantee%20a%20fresh%20random%20base%20address%20for%20all%20images)
### DLLs as Data
Microsoft confused memory-mapped files with libraries thus giving us the [resource-only DLL](https://learn.microsoft.com/en-us/cpp/build/creating-a-resource-only-dll).
Turning a pointer into a search through a lookup table for that pointer is a diabolical level of bloat.
[See here for more information.](#loadlibrary-vs-dlopen-return-type)
## Library Loading Locations Across Operating Systems
The Windows loader searches for DLLs to load in a [vast (and growing) number of places](https://learn.microsoft.com/en-us/windows/win32/dlls/dynamic-link-library-search-order#standard-search-order-for-unpackaged-apps). Strangely, Windows uses the `PATH` environment variable for locating programs (similar to Unix-like systems) as well as DLLs. Microsoft's decision to retain (since this one originates from backward compatibility with CP/M DOS) the current working directory ("the current folder") in this list of places is an accident (or worse, a security incident) waiting to happen, particularly when running applications from untrusted CWDs in a shell (e.g. CMD or PowerShell). This Microsoft documentation still doesn't cover all the possible locations, though, because while debugging the loader during a `LoadLibrary`, I saw that `LdrpSendPostSnapNotifications` eventually calls through to `SbpRetrieveCompatibilityManifest` (this *isn't* part of a notification callback). This `Sbp`-prefixed function searches for [application compatibility shims](https://doxygen.reactos.org/da/d25/dll_2appcompat_2apphelp_2apphelp_8c.html) in SDB files which [may result in a compat DLL loading](https://pentestlab.blog/2019/12/16/persistence-application-shimming/). Also to do with application compatibility, [WinSxS and activation contexts](https://learn.microsoft.com/en-us/windows/win32/sbscs/activation-contexts) (DLLs in `C:\Windows\WinSxS`) exist to load versioned DLLs typically based on the [application's manifest](https://learn.microsoft.com/en-us/windows/win32/sbscs/application-manifests) (these are usually embedded in the binary). A process calling the `CreateProcess` family of functions or `WinExec` is subject to loading [AppCert DLLs](https://attack.mitre.org/techniques/T1546/009/). When secure boot is disabled in Windows 8 or greater, [AppInit DLLs](https://learn.microsoft.com/en-us/windows/win32/dlls/secure-boot-and-appinit-dlls) can load DLLs into any process.
The plethora of possible search locations contributes to [DLL Hell](https://en.wikipedia.org/wiki/DLL_Hell) and DLL hijacking (also known as DLL preloading or DLL sideloading) problems in Windows, the latter of which makes vulnerabilities due to a privileged process *accidentally* loading an attacker-controlled library more likely (I've personally seen how common these and similar Windows-specific vulnerabilities are especially in LOB applications that enterprises use).
The GNU/Linux ecosystem differs due to system package managers (e.g. APT or DNF). All programs are built against the same system libraries (this is possible because all the packages are open source). Proprietary apps are generally statically linked (typically using musl and not glibc as the libc implementation) or come with all their necessary libraries. The *trusted directories* for loading libraries and the configuration file for adding directories to the search path can be found in the [`ldconfig`](https://man7.org/linux/man-pages/man8/ldconfig.8.html) manual. Beyond that, you can set the `LD_LIBRARY_PATH` environment variable to choose other places the loader should search for libraries and `LD_PRELOAD` or `LD_AUDIT` to specify libraries to load before any other library (including `libc`) with the difference being that libraries specified by the latter run first and can receive callbacks to monitor the loader's actions. Loading libraries based on environment variables is a default feature that may optionally be turned off during compilation (and is always disabled for `setuid` binaries). Binaries can include an `rpath` to specify additional run-time library search paths.
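For comparison, the glibc search order for a bare library name (one without a slash) is short enough to write down. The sketch below is a simplification based on my reading of the `ld.so(8)` manual; the `search_order` function is hypothetical, and it leaves out details such as `$ORIGIN` expansion, hardware capability subdirectories, and the `lib64` default paths used on some 64-bit architectures.

```python
def search_order(rpath=(), runpath=(), ld_library_path="", secure=False):
    """List, in order, where a glibc-style loader looks for a bare library name.

    Simplified rules: DT_RPATH is honoured only when DT_RUNPATH is absent,
    and LD_LIBRARY_PATH is ignored in secure-execution mode (for example,
    when running a setuid binary).
    """
    places = []
    if not runpath:
        places += list(rpath)  # DT_RPATH of the binary
    if ld_library_path and not secure:
        # Empty entries are dropped here for simplicity.
        places += [d for d in ld_library_path.split(":") if d]
    places += list(runpath)  # DT_RUNPATH of the binary
    places.append("/etc/ld.so.cache")  # cache built by ldconfig
    places += ["/lib", "/usr/lib"]  # default trusted directories
    return places


print(search_order(rpath=["/opt/app/lib"], ld_library_path="/tmp/libs"))
print(search_order(rpath=["/opt/app/lib"], ld_library_path="/tmp/libs", secure=True))
```

Note that nothing in this list depends on the current working directory unless the user or the binary explicitly asks for it.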
On Windows, statically linking system DLLs is unsupported and a copyright infringement because the Windows software license doesn't permit bundling Microsoft's libraries with your own application. Bringing your own system DLLs (e.g. from a different Windows version) is also unsupported because the internals of how they interact with the operating system and other tightly coupled DLLs can change. Microsoft keeps userland backward compatibility by ensuring Windows system libraries stay the same in their APIs and relevant internals. Since static linking and bringing your own libraries are strong suits of Unix systems, I suggest capitalizing on that advantage by employing this linking or library distribution approach for third-party, business, enterprise, or proprietary applications (of course, some things like an audio client and server pair still need to communicate compatibly, but that should easily be solved by versioning the protocol and because Linux has now converged on PipeWire for audio and video... also great libraries like [SDL](https://www.libsdl.org) have existed for a long time now). User-mode API stability on GNU/Linux has historically been a problem for adoption (e.g. since glibc has no problem with breaking backward compatibility) but it is a non-issue when taking a typical Unix system's other strengths into account. Linus Torvalds is very adamant about the kernel [not breaking userland](https://unix.stackexchange.com/a/235532)