LLM Engine
The LLM Engine is the core component of Rakis responsible for managing language model inference tasks. It handles the loading, scaling, and execution of inference tasks across multiple workers and models.
Overview
The LLM Engine is built around the `LLMEngine` class, which serves as the central point for managing language model workers and processing inference requests. It provides a unified interface for interacting with various language models and enables efficient utilization of available resources.
The key responsibilities of the LLM Engine include:
- Worker Management: The LLM Engine handles the creation, scaling, and termination of language model workers. It dynamically adjusts the number of workers based on the workload and available resources.
- Inference Task Processing: Inference requests are received and distributed across available workers, ensuring efficient utilization of resources and load balancing.
- Model Support: The LLM Engine supports a wide range of language models, enabling users to leverage the most suitable model for their needs.
- Scaling and Load Balancing: The engine automatically scales the number of workers based on demand, ensuring optimal performance and resource utilization.
Architecture
The LLM Engine follows a modular architecture, separating concerns and enabling easy extensibility. The core components of the LLM Engine architecture are:
- `LLMEngine`: The central class that manages the overall lifecycle of language model workers and handles inference requests.
- `LLMWorker`: Represents an individual worker responsible for executing inference tasks using a specific language model.
- `MLCEngine`: A wrapper around the `@mlc-ai/web-llm` library, which provides the underlying language model capabilities.
The `LLMEngine` class orchestrates the creation and management of `LLMWorker` instances. Each `LLMWorker` encapsulates a specific language model and handles inference tasks for that model. The `MLCEngine` class acts as a bridge between the `LLMWorker` and the underlying language model API provided by the `@mlc-ai/web-llm` library.
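To make these relationships concrete, the sketch below pictures how the three classes relate. The member names and constructor shapes are illustrative assumptions for explanation only, not the actual class definitions.

```ts
// Illustrative sketch of the architecture described above; member names and
// shapes are assumptions, not the real code.

// Wraps the @mlc-ai/web-llm API for a single loaded model.
class MLCEngine {
  constructor(readonly modelName: string) {}
}

// One worker per loaded model instance; runs inference through its MLCEngine.
class LLMWorker {
  constructor(
    readonly modelName: string,
    private readonly mlcEngine: MLCEngine,
  ) {}
}

// Central manager: owns the worker pool and routes inference requests to it.
class LLMEngine {
  private readonly workers = new Map<string, LLMWorker>(); // keyed by worker id
}
```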
Key Features
Worker Management
The LLM Engine provides a robust worker management system that enables dynamic scaling of language model workers based on demand. The `scaleLLMWorkers` method allows you to specify the desired number of workers for a particular language model, and the engine automatically handles the creation or termination of workers as needed.
The LLM Engine automatically scales the number of workers based on the workload, ensuring efficient resource utilization and optimal performance.
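For example, you might scale a model up before a burst of work and back down afterwards. This is an illustrative sequence only; whether a target of 0 releases all workers for the model is an assumption based on the description above.

```ts
const engine = new LLMEngine();

// Bring up two workers for the model ahead of a burst of inference tasks.
await engine.scaleLLMWorkers("gemma-2b-it-q4f16_1", 2);

// ... submit inference requests ...

// Scale back down when the burst is over; a target of 0 is assumed to
// terminate the remaining workers for this model.
await engine.scaleLLMWorkers("gemma-2b-it-q4f16_1", 0);
```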
Inference Task Processing
The `runInference` and `runInferenceNonStreaming` methods of the `LLMEngine` class let you submit inference requests. The engine distributes incoming requests across the available workers, balancing load and making efficient use of resources.
```ts
const response = await engine.runInferenceNonStreaming({
  modelName: "gemma-2b-it-q4f16_1",
  messages: [{ role: "user", content: "What is the capital of France?" }],
});

console.log(response.result); // Prints the inference result
```
Model Support
The LLM Engine supports a wide range of language models, allowing you to leverage the most suitable model for your use case. The list of available models is defined in the `availableModels` constant within the `types.ts` file.
You can easily extend the LLM Engine to support additional language models by updating the `availableModels` constant and providing the necessary integration code.
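As a rough illustration of what that constant might look like: only `gemma-2b-it-q4f16_1` is taken from the examples in this document, and the declaration style and placeholder entry are assumptions rather than the actual contents of `types.ts`.

```ts
// Illustrative sketch of the model list in types.ts; the placeholder entry
// and exact declaration style are assumptions.
export const availableModels = [
  "gemma-2b-it-q4f16_1",
  // "my-new-model-q4f16_1", // hypothetical: add new model identifiers here
] as const;

// Deriving the model name type from the constant means new entries are
// picked up everywhere LLMModelName is used.
export type LLMModelName = (typeof availableModels)[number];
```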
Logging and Monitoring
The LLM Engine provides comprehensive logging and monitoring capabilities, allowing you to track the engine's activities and performance. The `LLMEngineLogEntry` type defines the structure of log entries, which capture events such as worker loading, inference start/completion, and errors.
You can access the engine logs using the `getEngineLogs` method, which retrieves the specified number of recent log entries.
```ts
const recentLogs = engine.getEngineLogs(100);
console.log(recentLogs);
```
Usage
To use the LLM Engine, follow these steps:
Step 1
Import the necessary classes and types:
```ts
import { LLMEngine, LLMModelName } from "./llm/types";
```
Step 2
Create an instance of the `LLMEngine` class:
```ts
const engine = new LLMEngine();
```
Step 3
Scale the number of workers for the desired language models using the `scaleLLMWorkers` method:
```ts
await engine.scaleLLMWorkers("gemma-2b-it-q4f16_1", 2);
```
Step 4
Submit inference requests using the `runInference` or `runInferenceNonStreaming` methods:
```ts
const response = await engine.runInferenceNonStreaming({
  modelName: "gemma-2b-it-q4f16_1",
  messages: [{ role: "user", content: "What is the capital of France?" }],
});

console.log(response.result); // Prints the inference result
```
Advanced Usage
The LLM Engine provides additional functionality and customization options for advanced use cases:
- Worker Monitoring: The `getWorkerStates` method returns the current state of all workers, including their model name, inference status, and loading progress. This information can be used for monitoring and resource management purposes (see the sketch after this list).
- Worker Availability: The `getWorkerAvailability` method returns the availability of workers for specific language models, allowing you to make informed decisions about scaling and load balancing.
- Inference Cancellation: The `abortWorkerInference` method enables you to cancel an ongoing inference task for a specific worker, providing fine-grained control over inference execution.
- Logging and Monitoring: The LLM Engine provides comprehensive logging and monitoring capabilities through the `LLMEngineLogEntry` type and the `searchEngineLogs` and `getEngineLogs` methods, allowing you to track engine activities and performance.
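A rough sketch of how these methods could be combined for lightweight monitoring follows. The argument passed to `getWorkerAvailability` and the shape of the returned objects are assumptions, not documented signatures.

```ts
// Illustrative monitoring snippet; return shapes and the argument to
// getWorkerAvailability are assumptions for this sketch.
const workerStates = engine.getWorkerStates();
console.log("Current workers:", workerStates);

// Assumed to accept the model names you care about.
const availability = engine.getWorkerAvailability(["gemma-2b-it-q4f16_1"]);
console.log("Worker availability:", availability);

// Correlate worker state with recent engine activity.
const recentActivity = engine.getEngineLogs(20);
console.log("Recent engine activity:", recentActivity);
```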