Inference Request Queue
The Inference Request Queue is a critical component of The Domain that manages and prioritizes incoming inference requests. It ensures efficient utilization of computational resources and maintains a smooth flow of inference tasks.
Overview
When an inference request is received, either directly from the Inference DB or through the Packet DB, it is added to the Inference Request Queue. This queue is then processed at regular intervals, with requests being dequeued and assigned to available LLM workers for inference processing.
The queue is designed to handle and prioritize requests based on various factors, such as the time remaining until the request's deadline, the similarity between requests, and the availability of compatible LLM models and workers.
The Inference Request Queue plays a crucial role in ensuring that inference requests are processed in a timely and efficient manner, while also optimizing resource utilization and preventing the system from becoming overloaded.
Request Processing
The processing of the Inference Request Queue is handled by the processInferenceRequestQueue function, which is debounced to improve performance and prevent unnecessary computations. This function is executed at regular intervals, as determined by the inferenceRequestQueueDebounceMs setting.
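As an illustration, a debounced queue processor might be wired up roughly like the sketch below. Only the processInferenceRequestQueue and inferenceRequestQueueDebounceMs names come from the description above; the debounce helper, the settings object, and the 500 ms default are assumptions made for the example.

```typescript
// Minimal trailing-edge debounce wrapper (assumed implementation, not the actual one).
function debounce<T extends (...args: any[]) => void>(fn: T, waitMs: number) {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return (...args: Parameters<T>) => {
    if (timer) clearTimeout(timer);
    timer = setTimeout(() => fn(...args), waitMs);
  };
}

// Hypothetical settings object; only the setting name appears in the docs.
const settings = { inferenceRequestQueueDebounceMs: 500 };

async function processInferenceRequestQueue(): Promise<void> {
  // ...dequeue and dispatch requests (see the steps below)...
}

// Every trigger (new request, freed worker, timer) funnels through the debounced
// wrapper, so bursts of triggers collapse into a single processing pass.
const processQueueDebounced = debounce(
  () => void processInferenceRequestQueue(),
  settings.inferenceRequestQueueDebounceMs
);
```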
Here's a high-level overview of the request processing flow:
Step 1
Retrieve available inference requests from the Inference DB that meet the following criteria:
- The request has not expired (i.e., endingAt is in the future)
- The request is not already being processed
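A sketch of this selection step, assuming a simple request shape and an in-memory inferenceIdsInProcess list (the actual Inference DB query API is not shown here):

```typescript
interface InferenceRequest {
  requestId: string;
  acceptedModels: string[];
  endingAt: Date;  // deadline for the request
  createdAt: Date;
}

const inferenceIdsInProcess: string[] = [];

// Keep only requests that have not expired and are not already being worked on.
function selectAvailableRequests(allRequests: InferenceRequest[]): InferenceRequest[] {
  const now = Date.now();
  return allRequests.filter(
    (req) =>
      req.endingAt.getTime() > now &&
      !inferenceIdsInProcess.includes(req.requestId)
  );
}
```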
Step 2
Determine the required LLM models based on the acceptedModels property of each inference request.
Step 3
Check the availability of LLM workers for the required models using the LLM Engine's getWorkerAvailability method.
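Steps 2 and 3 together might look like the following sketch. The getWorkerAvailability name comes from the description above, but its parameters and return shape are assumptions.

```typescript
interface InferenceRequest {
  requestId: string;
  acceptedModels: string[];
  endingAt: Date;
}

// Assumed shape of the LLM Engine's availability report.
interface WorkerAvailability {
  modelName: string;
  count: number; // free workers currently loaded with this model
}

declare const llmEngine: {
  getWorkerAvailability(models: string[]): WorkerAvailability[];
};

function getModelsWithFreeWorkers(requests: InferenceRequest[]): Set<string> {
  // Union of every model that any pending request will accept.
  const requiredModels = [...new Set(requests.flatMap((r) => r.acceptedModels))];

  // Ask the LLM Engine which of those models currently have free workers.
  return new Set(
    llmEngine
      .getWorkerAvailability(requiredModels)
      .filter((availability) => availability.count > 0)
      .map((availability) => availability.modelName)
  );
}
```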
Step 4
Filter the available inference requests to include only those that can be processed by the available LLM workers.
Step 5
If there are no available requests, exit the processing loop.
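Steps 4 and 5 reduce to a filter followed by an early exit, for example (with the same assumed request shape as above):

```typescript
interface InferenceRequest {
  requestId: string;
  acceptedModels: string[];
  endingAt: Date;
}

// Keep only requests for which at least one accepted model has a free worker.
// Callers exit the processing pass when this returns an empty array.
function filterProcessable(
  requests: InferenceRequest[],
  modelsWithFreeWorkers: Set<string>
): InferenceRequest[] {
  return requests.filter((req) =>
    req.acceptedModels.some((model) => modelsWithFreeWorkers.has(model))
  );
}
```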
Step 6
Sort the available requests based on their endingAt time, prioritizing requests with earlier deadlines.
Step 7
Group similar requests based on the requestSimilarityTimeWindowMs setting, which determines the time window within which requests are considered similar.
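One plausible reading of steps 6 and 7, anchoring the similarity window on the most urgent request, is sketched below. The createdAt field and the 5-second window value are assumptions for illustration.

```typescript
interface InferenceRequest {
  requestId: string;
  endingAt: Date;
  createdAt: Date;
}

const settings = { requestSimilarityTimeWindowMs: 5_000 }; // illustrative value

function groupMostUrgent(requests: InferenceRequest[]): InferenceRequest[] {
  // Earliest deadline first.
  const sorted = [...requests].sort(
    (a, b) => a.endingAt.getTime() - b.endingAt.getTime()
  );

  // Treat requests created within the similarity window of the most urgent
  // request as one group; one member is later picked at random.
  const anchor = sorted[0];
  return sorted.filter(
    (req) =>
      Math.abs(req.createdAt.getTime() - anchor.createdAt.getTime()) <=
      settings.requestSimilarityTimeWindowMs
  );
}
```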
Step 8
Randomly select a request from the group of similar requests.
Step 9
Mark the selected request as being processed by adding its requestId to the inferenceIdsInProcess array.
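Selecting one request at random from the group and claiming it could look like this; the inferenceIdsInProcess array matches the description above, while everything else is illustrative.

```typescript
interface InferenceRequest {
  requestId: string;
}

const inferenceIdsInProcess: string[] = [];

function claimRandomRequest(group: InferenceRequest[]): InferenceRequest {
  // Uniform random pick keeps similar requests from starving each other.
  const selected = group[Math.floor(Math.random() * group.length)];

  // Record the claim so the next queue pass skips this request.
  inferenceIdsInProcess.push(selected.requestId);
  return selected;
}
```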
Step 10
Transmit a peerStatusUpdate packet indicating that inference is in progress, along with the selected model name.
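The exact packet schema is not documented in this section; the field names and the transmitPacket helper below are assumptions used purely to illustrate the broadcast.

```typescript
// Assumed packet shape; only the peerStatusUpdate type name comes from the docs.
interface PeerStatusUpdatePacket {
  type: "peerStatusUpdate";
  status: "inferencing";
  modelName: string;
}

declare function transmitPacket(packet: PeerStatusUpdatePacket): Promise<void>;

async function announceInferenceStart(modelName: string): Promise<void> {
  // Let other peers know this node is busy running the selected model.
  await transmitPacket({ type: "peerStatusUpdate", status: "inferencing", modelName });
}
```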
Step 11
Initiate the inference process by invoking the LLM Engine's runInferenceNonStreaming method with the selected request and model.
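Invoking the engine might look roughly like this; the runInferenceNonStreaming name comes from the description above, but its parameter and result shapes are assumptions.

```typescript
interface InferenceRequest {
  requestId: string;
  prompt: string; // assumed field
  acceptedModels: string[];
}

// Assumed result shape for a non-streaming inference run.
interface InferenceResult {
  success: boolean;
  output?: string;
  tokensPerSecond?: number;
}

declare const llmEngine: {
  runInferenceNonStreaming(
    request: InferenceRequest,
    modelName: string
  ): Promise<InferenceResult>;
};

async function runSelectedInference(
  request: InferenceRequest,
  modelName: string
): Promise<InferenceResult> {
  // The engine picks a free worker for modelName and runs the request to completion.
  return llmEngine.runInferenceNonStreaming(request, modelName);
}
```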
Step 12
Upon completion of the inference process, handle the result as follows:
- If successful, transmit a peerStatusUpdate packet with the inference completion status and tokens-per-second (TPS) metric.
- Save the inference result to the Inference DB.
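Result handling could then look like the following sketch; the packet fields and the Inference DB save call are assumptions.

```typescript
interface InferenceResult {
  success: boolean;
  output?: string;
  tokensPerSecond?: number;
}

// Assumed transport and persistence interfaces.
declare function transmitPacket(packet: {
  type: "peerStatusUpdate";
  status: "completed_inference";
  tps: number;
}): Promise<void>;

declare const inferenceDB: {
  saveInferenceResult(requestId: string, result: InferenceResult): Promise<void>;
};

async function handleInferenceResult(
  requestId: string,
  result: InferenceResult
): Promise<void> {
  if (result.success) {
    // Broadcast completion along with the observed tokens-per-second metric.
    await transmitPacket({
      type: "peerStatusUpdate",
      status: "completed_inference",
      tps: result.tokensPerSecond ?? 0,
    });
  }
  // Persist the outcome so it can be picked up downstream.
  await inferenceDB.saveInferenceResult(requestId, result);
}
```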
Step 13
Remove the processed requestId from the inferenceIdsInProcess array.
Step 14
Recursively call processInferenceRequestQueue to continue processing the queue.
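Steps 13 and 14 amount to releasing the claim and immediately looking for more work, as in this sketch:

```typescript
const inferenceIdsInProcess: string[] = [];

declare function processInferenceRequestQueue(): Promise<void>;

async function finishRequest(requestId: string): Promise<void> {
  // Drop the id so the request is no longer considered "in process".
  const index = inferenceIdsInProcess.indexOf(requestId);
  if (index !== -1) inferenceIdsInProcess.splice(index, 1);

  // Immediately look for more work instead of waiting for the next trigger.
  await processInferenceRequestQueue();
}
```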
Request Prioritization
Prioritization of inference requests is achieved through a combination of techniques:
- Deadline-based Prioritization: Requests with earlier endingAt times are prioritized over those with later deadlines.
- Request Similarity Detection: Requests that are similar in terms of their creation time (within the requestSimilarityTimeWindowMs window) are grouped together. One request is then randomly selected from this group for processing. This helps maintain fairness and prevents starvation of similar requests.
- Model Availability: Requests are filtered based on the availability of compatible LLM workers. Requests requiring models without available workers are temporarily skipped.
This prioritization strategy ensures that critical requests with tight deadlines are processed first, while also accounting for fairness and efficient resource utilization.
Examples
Let's consider an example scenario where three inference requests are received within a short time window.
In this scenario, the Inference Request Queue would process Request 3 first, as it has the earliest deadline (09:08). Request 1 would be processed next, followed by Request 2.
However, if Request 1 and Request 2 are considered similar (based on the requestSimilarityTimeWindowMs setting), they would be grouped together, and one of them would be randomly selected for processing after Request 3.
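To make the ordering concrete, here is a small runnable sketch. Only Request 3's 09:08 deadline is taken from the scenario above; the creation times, the other deadlines, and the 5-second similarity window are hypothetical values chosen for illustration.

```typescript
interface InferenceRequest {
  requestId: string;
  createdAt: Date;
  endingAt: Date;
}

// Hypothetical requests; only request-3's 09:08 deadline matches the scenario text.
const requests: InferenceRequest[] = [
  { requestId: "request-1", createdAt: new Date("2024-01-01T09:00:00Z"), endingAt: new Date("2024-01-01T09:10:00Z") },
  { requestId: "request-2", createdAt: new Date("2024-01-01T09:00:02Z"), endingAt: new Date("2024-01-01T09:12:00Z") },
  { requestId: "request-3", createdAt: new Date("2024-01-01T09:00:30Z"), endingAt: new Date("2024-01-01T09:08:00Z") },
];

// Deadline-based ordering: earliest endingAt first.
const byDeadline = [...requests].sort(
  (a, b) => a.endingAt.getTime() - b.endingAt.getTime()
);
console.log(byDeadline.map((r) => r.requestId)); // ["request-3", "request-1", "request-2"]

// Similarity check: requests 1 and 2 were created within a 5-second window,
// so they would be grouped and one of them picked at random after request 3.
const requestSimilarityTimeWindowMs = 5_000; // hypothetical setting value
const similar =
  Math.abs(requests[0].createdAt.getTime() - requests[1].createdAt.getTime()) <=
  requestSimilarityTimeWindowMs;
console.log(similar); // true
```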
The Inference Request Queue's prioritization strategy ensures that urgent requests are handled promptly, while also maintaining fairness and preventing starvation of similar requests.
Performance Considerations
The performance of the Inference Request Queue is critical for maintaining a smooth inference processing pipeline. To optimize performance, the following techniques are employed:
- Debouncing: The processInferenceRequestQueue function is debounced using the inferenceRequestQueueDebounceMs setting. This prevents excessive computations and allows the system to batch and prioritize requests more efficiently.
- Efficient Data Structures: The inferenceIdsInProcess array is used to keep track of requests currently being processed. This allows for quick lookups and prevents duplicate processing of the same request.
- Asynchronous Processing: Inference processing is performed asynchronously, allowing the system to continue handling other tasks while inferences are being computed (see the sketch after this list).
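As a small illustration of the asynchronous pattern, the queue processor can dispatch an inference without awaiting it, so the processing pass continues while the work runs in the background. The function names below are carried over from the earlier sketches and remain assumptions.

```typescript
declare function runSelectedInference(
  request: { requestId: string },
  modelName: string
): Promise<{ success: boolean }>;

declare function handleInferenceResult(
  requestId: string,
  result: { success: boolean }
): Promise<void>;

function dispatchInference(request: { requestId: string }, modelName: string): void {
  // Fire-and-forget: the queue pass is not blocked while the worker computes.
  void runSelectedInference(request, modelName)
    .then((result) => handleInferenceResult(request.requestId, result))
    .catch((err) => console.error("inference failed", err));
}
```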
Using these techniques, the Inference Request Queue can handle a high volume of inference requests while maintaining optimal performance and resource utilization.