Windows 10 Parallel Loading Breakdown

One of the less noticed improvements in Windows 10 is the parallel library loading support in ntdll.dll. This feature decreases process startup time by using multiple threads to load libraries from disk into memory.

How Windows 10 Implements Parallel Loading

Windows 10 implements parallel loading by creating a thread pool of worker threads when the process initializes. The parent process sets the number of worker threads in the PEB->ProcessParameters->LoaderThreads (ULONG) field. ntdll!LdrpInitializeExecutionOptions can then override LoaderThreads by querying the MaxLoaderThreads value under the Image File Execution Options (IFEO) registry key HKLM\Software\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\<image.exe>.

Interestingly, Windows 10 contains a default entry for chrome.exe with MaxLoaderThreads set to 1 to disable parallel loading.

Figure 1: Querying the IFEO registry key for MaxLoaderThreads
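The IFEO override can also be inspected from user mode. The following is a minimal sketch, assuming Python's standard winreg module on Windows; the helper names ifeo_key_path and read_max_loader_threads are hypothetical and only read the value described above. It does not reproduce what ntdll!LdrpInitializeExecutionOptions does internally.

```python
import sys

# HKLM-relative path of the IFEO key named in the text above.
IFEO_ROOT = r"Software\Microsoft\Windows NT\CurrentVersion\Image File Execution Options"


def ifeo_key_path(image_name):
    """Build the HKLM-relative IFEO subkey path for an image such as chrome.exe."""
    return IFEO_ROOT + "\\" + image_name


def read_max_loader_threads(image_name):
    """Return the MaxLoaderThreads override for image_name, or None if absent.

    Hypothetical helper: returns None on non-Windows platforms, when the key
    or value is missing, or when the value is not an integer.
    """
    if sys.platform != "win32":
        return None
    import winreg  # Windows-only standard library module

    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, ifeo_key_path(image_name)) as key:
            value, _value_type = winreg.QueryValueEx(key, "MaxLoaderThreads")
    except OSError:
        # Key or value not present, or access denied.
        return None
    return value if isinstance(value, int) else None


if __name__ == "__main__":
    print(read_max_loader_threads("chrome.exe"))
```

On a system that carries the chrome.exe entry mentioned above, the call would be expected to return 1; on other systems it returns None.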

The initial thread in the process executing ntdll!LdrInitializeThunk will be referred to as the master thread. Threads created by the master thread in the thread pool will be referred to as worker threads.

ntdll!LdrpInitParallelLoadingSupport and ntdll!LdrpCreateLoaderEvents are called to initialize the following structures:

• ntdll!LdrpWorkQueue (LIST_ENTRY)
• ntdll!LdrpWorkQueueTail (LIST_ENTRY)
• ntdll!LdrpWorkQueueLock (CRITICAL_SECTION)
• ntdll!LdrpRetryQueue (LIST_ENTRY)
• ntdll!LdrpRetryQueueTail (LIST_ENTRY)
• ntdll!LdrpLoadCompleteEvent (HANDLE)
• ntdll!LdrpWorkCompleteEvent (HANDLE)

Figure 2: Initializing the work queue structures

Figure 3: Creating the synchronization events

After ntdll loads kernel32.dll and kernelbase.dll, ntdll!LdrpEnableParallelLoading is called to set up the necessary events and worker pool. Note that kernel32.dll and kernelbase.dll are loaded even if the process does not require them.

How Windows 10 Mitigates Parallel Loading Hazards

Parallel loading introduces a number of hazards when combined with code hooking. To avoid memory corruption and compatibility issues, Windows checks whether a process is hooked before enabling parallel loading.

ntdll!LdrpEnableParallelLoading calls ntdll!LdrpDetectDetour to determine whether the process is being hooked. If a hook is detected, ntdll!LdrpDetourExist is set to true and the thread pool is drained and released.

Hooks are detected by examining the first 16 bytes of the functions defined in ntdll!LdrpCriticalLoaderFunctions:

• ntdll!NtOpenFile
• ntdll!NtCreateSection
• ntdll!ZwQueryAttributesFile
• ntdll!NtOpenSection
• ntdll!ZwMapViewOfSection

The first 16 bytes of these functions are compared to ntdll!LdrpThunkSignature.

This data is an array of the first 16 bytes of each function, copied by ntdll!LdrpCaptureCriticalThunks, which is called near the start of ntdll!LdrpInitializeProcess.
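The capture-then-compare idea can be sketched in a few lines. This is an illustrative model only: the function bodies are placeholder byte strings rather than real ntdll code, and the names capture_critical_thunks and detect_detour are hypothetical stand-ins for the roles the text assigns to ntdll!LdrpCaptureCriticalThunks and ntdll!LdrpDetectDetour.

```python
THUNK_SIZE = 16  # number of leading bytes compared per function, per the text


def capture_critical_thunks(code, names):
    """Snapshot the first THUNK_SIZE bytes of each named function."""
    return {name: bytes(code[name][:THUNK_SIZE]) for name in names}


def detect_detour(code, signature):
    """Return True if any function's leading bytes differ from the snapshot."""
    return any(
        bytes(code[name][:THUNK_SIZE]) != saved
        for name, saved in signature.items()
    )


# Placeholder "function bodies" (NOT real ntdll bytes), kept mutable so that
# a hook can be simulated by patching them in place.
code = {
    "NtOpenFile": bytearray(range(0x10, 0x30)),
    "NtCreateSection": bytearray(range(0x40, 0x60)),
}

# Early in process initialization: record the pristine prologues.
signature = capture_critical_thunks(code, ["NtOpenFile", "NtCreateSection"])
clean = detect_detour(code, signature)

# Simulate an inline hook: overwrite the first byte with 0xE9, the x86
# opcode for a relative jump.
code["NtOpenFile"][0] = 0xE9
hooked = detect_detour(code, signature)

if __name__ == "__main__":
    print(clean, hooked)
```

Because only the leading bytes are compared, a hook placed deeper in a function body would not be noticed by this kind of check.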

ntdll!LdrpEnableParallelLoading clamps the number of loader threads to the range [1, 16] and creates a thread pool with LoaderThreads - 1 worker threads, since the master thread also performs the work of loading DLLs. If LoaderThreads is 0, it is set to the default value of 4; if LoaderThreads is larger than 16, it is set to the maximum value of 16.
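The thread count arithmetic described above is small enough to write out. The sketch below only restates the rule from the text; the function and constant names are hypothetical.

```python
DEFAULT_LOADER_THREADS = 4   # used when LoaderThreads is 0
MAX_LOADER_THREADS = 16      # upper bound applied to LoaderThreads


def effective_loader_threads(requested):
    """Apply the default and the upper bound to a requested LoaderThreads value."""
    if requested == 0:
        return DEFAULT_LOADER_THREADS
    return min(requested, MAX_LOADER_THREADS)


def pool_worker_count(requested):
    """Workers in the pool: one fewer than the total, since the master also loads."""
    return effective_loader_threads(requested) - 1


if __name__ == "__main__":
    for requested in (0, 1, 8, 64):
        print(requested, effective_loader_threads(requested), pool_worker_count(requested))
```

With a value of 1, as in the chrome.exe IFEO entry, the pool has no workers and the master thread does all of the loading, which is how that setting disables parallel loading.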

The worker thread idle timeout is set to 30 seconds. Programs that run for less than 30 seconds will appear to hang because ntdll!TppWorkerThread waits for the idle timeout before the process terminates.

Figure 4: Creating the thread pool

ntdll!LdrpWorkCallback is registered as the thread pool work callback function. When work is available, a worker thread calls ntdll!LdrpWorkCallback, which in turn calls ntdll!LdrpProcessWork.

The thread will either map (ntdll!LdrpMapDllSearchPath or ntdll!LdrpMapDllFullPath) a DLL or snap (ntdll!LdrpSnapModule) a DLL based on the value of _LDR_DDAG_NODE.State.

Mapping is the process of loading a file from disk into memory. Snapping is the process of resolving the library’s import address table.

At the end of every mapping procedure, ntdll!LdrpSignalModuleMapped is called, which queues the snap action by calling ntdll!LdrpQueueWork.

The work queue is defined by ntdll!LdrpWorkQueue, a doubly linked list (LIST_ENTRY) of an opaque structure, LDRP_LOAD_CONTEXT.

This structure is allocated by ntdll!LdrpAllocatePlaceHolder and contains a variety of information such as the DLL name, a _LDR_DATA_TABLE_ENTRY structure, a pointer to the import address table (IAT), the activation context, and the control flow guard (CFG) function pointer [1][2].

Figure 5: Partially documented LDRP_LOAD_CONTEXT structure

At this point, worker threads in the thread pool will pull work off the queue and perform the appropriate action (mapping or snapping). If a worker finds a new dependency, it will queue up more work. Work is added to and removed from the queue in a last-in first-out (LIFO) manner.
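To make the map, snap, and LIFO ordering concrete, here is a single-threaded toy model. It is a sketch under stated assumptions: the IMPORTS table and DLL names are invented, WorkItem is a hypothetical stand-in for LDRP_LOAD_CONTEXT, and the real loader's bookkeeping (dependency graph states, retry queue) is omitted.

```python
from collections import namedtuple

# Hypothetical stand-in for a queued load context; action is "map" or "snap".
WorkItem = namedtuple("WorkItem", ["dll", "action"])

# Invented import table: each module lists the DLLs it imports from.
IMPORTS = {
    "app.exe": ["a.dll", "b.dll"],
    "a.dll": ["c.dll"],
    "b.dll": ["c.dll"],
    "c.dll": [],
}


def process_all(root):
    """Process the root's imports with a LIFO work queue; return the action log."""
    queue = [WorkItem(dll, "map") for dll in IMPORTS[root]]
    seen = set(IMPORTS[root])
    log = []
    while queue:
        item = queue.pop()  # LIFO: take the most recently queued item
        log.append((item.action, item.dll))
        if item.action == "map":
            # Once a DLL is mapped, its snap step is queued.
            queue.append(WorkItem(item.dll, "snap"))
            # Newly discovered dependencies become additional map work.
            for dep in IMPORTS[item.dll]:
                if dep not in seen:
                    seen.add(dep)
                    queue.append(WorkItem(dep, "map"))
    return log


if __name__ == "__main__":
    for action, dll in process_all("app.exe"):
        print(action, dll)
```

In this model the most recently discovered dependency (c.dll) is mapped and snapped before the snap of the DLL that imported it, which is the practical effect of the LIFO discipline.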

Once the thread pool has been initialized, the master thread continues on with ntdll!LdrpMapAndSnapDependency which will map the first level of explicit imports with a call to ntdll!LdrpLoadDependentModule. As the master thread loads imports, the work queue is filled up with secondary library dependencies for worker threads to process.

The master thread will perform the same map and snap work actions as the worker thread by calling ntdll!LdrpDrainWorkQueue.

Figure 6: Overview of master thread enabling parallel loading, mapping and snapping first level of dependencies, and joining with worker threads

ntdll!LdrpDrainWorkQueue serves as a synchronization point for the master thread as it joins in performing work added to ntdll!LdrpWorkQueue and returns when there is no more work to be completed. At this point, all of the dependencies have been resolved and loaded.
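The pattern of a master thread that seeds the queue, helps drain it, and then waits for outstanding work can be sketched with Python's standard queue and threading modules. This is an analogy rather than the loader's implementation: the DEPENDENCIES table and DLL names are invented, load_in_parallel is a hypothetical name, and queue.LifoQueue.join() plays the role of the completion events.

```python
import queue
import threading

# Invented dependency table used only for this illustration.
DEPENDENCIES = {
    "a.dll": ["c.dll", "d.dll"],
    "b.dll": ["d.dll"],
    "c.dll": ["e.dll"],
    "d.dll": [],
    "e.dll": [],
}


def load_in_parallel(first_level, worker_count):
    """Load first_level and all transitive dependencies; return {dll: thread name}."""
    work = queue.LifoQueue()
    lock = threading.Lock()
    seen = set(first_level)
    loaded_by = {}

    def process(dll):
        # "Load" the DLL and queue any dependencies not seen before.
        with lock:
            loaded_by[dll] = threading.current_thread().name
            new_deps = [d for d in DEPENDENCIES.get(dll, []) if d not in seen]
            seen.update(new_deps)
        for dep in new_deps:
            work.put(dep)

    def worker():
        while True:
            dll = work.get()
            try:
                if dll is None:  # sentinel: no more work will arrive
                    return
                process(dll)
            finally:
                work.task_done()

    threads = [
        threading.Thread(target=worker, name="worker-%d" % i)
        for i in range(worker_count)
    ]
    for t in threads:
        t.start()

    # The master queues the first level of imports...
    for dll in first_level:
        work.put(dll)

    # ...then joins in, draining whatever work it can grab itself.
    while True:
        try:
            dll = work.get_nowait()
        except queue.Empty:
            break
        try:
            process(dll)
        finally:
            work.task_done()

    # Synchronization point: returns once every queued item has been processed.
    work.join()

    for _ in threads:
        work.put(None)
    for t in threads:
        t.join()
    return loaded_by


if __name__ == "__main__":
    print(sorted(load_in_parallel(["a.dll", "b.dll"], worker_count=3)))
```

A worker always queues newly found dependencies before marking its current item done, so the outstanding-work count cannot reach zero while more work is still on its way. Which thread loads which DLL varies from run to run; only the final set of loaded DLLs is deterministic.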

After all of the dependencies are mapped, ntdll!LdrpPrepareModuleForExecution is called, which condenses the dependency graph with a call to ntdll!LdrpCondenseGraph. As the graph is traversed, ntdll!LdrpSendPostSnapNotifications executes any callbacks registered with AppCompat (Shim Engine) or Application Verifier.

Once the callbacks are completed, ntdll!LdrpInitializeNode is called, which initializes thread local storage (TLS) with a call to ntdll!LdrpCallTlsInitializers. Finally, every library’s entry point (typically DllMain) is called by ntdll!LdrpCallRoutine.

File: C:\Windows\System32\ntdll.dll
Version: 10.0.15063.447
SHA256: 2B8D65907A2811121EA75DB44BC540D0AF198C1991C30886A365001123F16B7D

References:

[1] https://stackoverflow.com/questions/42789199/why-there-are-three-unexpected-worker-threads-when-a-win32-console-application-s

[2] https://conference.hitb.org/hitbsecconf2017ams/materials/D2T1%20-%20Bing%20Sun%20and%20Chong%20Xu%20-%20Bypassing%20Memory%20Mitigation%20Using%20Data-Only%20Exploitation%20Techniques.pdf