The demand for deploying large language models (LLMs) on mobile devices is rising, driven by the need for privacy, reduced latency, and efficient bandwidth usage. However, the extensive memory and computational requirements of LLMs pose significant challenges. Enter LinguaLinked, a new system developed by researchers at UC Irvine that enables decentralized, distributed LLM inference across multiple mobile devices, leveraging their collective capabilities to perform complex tasks efficiently.
The Challenge
Deploying LLMs like GPT-3 or BLOOM on mobile devices is challenging due to:
- Memory Constraints: LLMs require substantial memory, often exceeding the capacity of individual mobile devices.
- Computational Limitations: Mobile devices typically have limited processing power, making it difficult to run large models.
- Privacy Concerns: Sending data to centralized servers for processing raises privacy issues.
LinguaLinked's Solution
LinguaLinked addresses these challenges with three key strategies:
- Optimized Model Assignment:
- The system segments LLMs into smaller subgraphs using linear optimization to match each segment with a device's capabilities.
- This ensures efficient use of resources and minimizes inter-device data transmission.
- Runtime Load Balancing:
- LinguaLinked actively monitors device performance and redistributes tasks to prevent bottlenecks (see the sketch after this list).
- This dynamic approach ensures efficient use of all available resources, enhancing overall system responsiveness.
- Optimized Communication:
- Efficient data transmission maps guide the flow of information between devices, maintaining the model's structural integrity.
- This method reduces latency and ensures timely data processing across the network of mobile devices.
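The paper describes the load balancer only at a high level, but the core idea can be sketched in a few lines of Python. The snippet below assumes each device holds a contiguous block of transformer layers and reports a rough per-layer latency; when one device becomes the bottleneck, a boundary layer migrates to a faster neighbor. The Device class, its field names, and the one-layer-at-a-time migration rule are illustrative stand-ins, not LinguaLinked's actual algorithm.

```python
from dataclasses import dataclass

@dataclass
class Device:
    """A participating phone, tracked by the contiguous block of layers it owns."""
    name: str
    layers: list            # contiguous layer indices, in pipeline order
    ms_per_layer: float     # rough latency measured at runtime

def rebalance(devices):
    """Shift one boundary layer off the slowest device onto a faster neighbor."""
    times = [len(d.layers) * d.ms_per_layer for d in devices]
    slow = times.index(max(times))
    neighbors = [i for i in (slow - 1, slow + 1) if 0 <= i < len(devices)]
    fast = min(neighbors, key=lambda i: times[i])
    if times[fast] >= times[slow] or len(devices[slow].layers) <= 1:
        return  # nothing to gain, or the slowest device is already minimal
    if fast < slow:
        # Hand the first layer of the slow segment to the previous device ...
        devices[fast].layers.append(devices[slow].layers.pop(0))
    else:
        # ... or the last layer to the next device, keeping segments contiguous.
        devices[fast].layers.insert(0, devices[slow].layers.pop())

# Hypothetical cluster: phone_b is the bottleneck and sheds a layer to phone_c.
phones = [
    Device("phone_a", list(range(0, 8)),  ms_per_layer=12.0),
    Device("phone_b", list(range(8, 20)), ms_per_layer=30.0),
    Device("phone_c", list(range(20, 24)), ms_per_layer=10.0),
]
rebalance(phones)
print([(d.name, len(d.layers)) for d in phones])
```

Keeping segments contiguous means a layer can only cross a segment boundary, so a rebalancing step only moves a small amount of weight data between neighboring devices, which matters on mobile links.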
In LinguaLinked, a single LLM is split into parts (or segments) and distributed across multiple mobile devices. Each device then handles only a fraction of the total computation and storage, making it feasible to run complex models even on hardware with limited resources. Here's a breakdown of how this works:
Model Segmentation and Distribution
- Model Segmentation:
- The large language model is transformed into a computational graph where each operation within the network is represented as a node.
- This graph is then partitioned into smaller subgraphs, each capable of functioning independently.
- Optimized Model Assignment:
- Using linear optimization, these subgraphs (or model segments) are assigned to different mobile devices (a simplified sketch follows this list).
- The assignment considers each device's computational and memory capabilities, ensuring efficient resource use and minimizing data transmission overhead between devices.
- Collaborative Inference Execution:
- Each mobile device processes its assigned segment of the model.
- Devices communicate with each other to exchange intermediate results as needed, ensuring the overall inference task is completed correctly.
- Optimized communication strategies are employed to maintain the integrity of the original model structure and ensure efficient data flow.
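LinguaLinked formulates the segment-to-device assignment as a linear optimization problem. As a much simpler stand-in for that step, the sketch below walks through the layers in pipeline order and opens a new segment whenever the next device's memory budget would be exceeded. The device names, budgets, and layer sizes are hypothetical, and a real assignment would also weigh compute speed and link bandwidth, not just memory.

```python
def segment_layers(layer_mem_mb, device_budgets_mb):
    """Greedily assign contiguous layer ranges to devices in pipeline order.

    A toy stand-in for LinguaLinked's linear-optimization assignment:
    it only checks memory, ignoring compute speed and link bandwidth.
    """
    assignment = {name: [] for name in device_budgets_mb}
    devices = iter(device_budgets_mb.items())
    name, budget = next(devices)
    used = 0.0
    for idx, mem in enumerate(layer_mem_mb):
        if used + mem > budget:              # current device is full,
            name, budget = next(devices)     # move on to the next one
            used = 0.0
        if mem > budget:
            raise ValueError(f"layer {idx} does not fit on a single device")
        assignment[name].append(idx)
        used += mem
    return assignment

# Hypothetical 12-layer model (~300 MB per layer) spread over three phones.
plan = segment_layers([300.0] * 12,
                      {"phone_a": 1500, "phone_b": 1200, "phone_c": 1500})
print(plan)  # phone_a: layers 0-4, phone_b: 5-8, phone_c: 9-11
```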
Example Scenario
Imagine a large language model like GPT-3 being split into several parts. One mobile device might handle the initial token embeddings and the first few layers of the model, while another device processes the middle layers, and a third device completes the final layers and generates the output. Throughout this process, devices share intermediate outputs to ensure the complete model inference is executed seamlessly.
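To make that hand-off concrete, here is a minimal sketch of the scenario, with tiny random numpy matrices standing in for real model weights and plain function calls standing in for the network transfers between the three phones. The layer counts, dimensions, and activation choices are arbitrary; only the division of work mirrors the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, HIDDEN = 1000, 64   # toy sizes, nothing like GPT-3's real dimensions

# Stand-in weights for each device's slice of the model.
embedding = rng.normal(size=(VOCAB, HIDDEN))                           # device 1
early_layers = [rng.normal(size=(HIDDEN, HIDDEN)) for _ in range(2)]   # device 1
middle_layers = [rng.normal(size=(HIDDEN, HIDDEN)) for _ in range(2)]  # device 2
final_layers = [rng.normal(size=(HIDDEN, HIDDEN)) for _ in range(2)]   # device 3
lm_head = rng.normal(size=(HIDDEN, VOCAB))                             # device 3

def device_1(token_ids):
    """First phone: token embeddings plus the first few layers."""
    hidden = embedding[token_ids]
    for w in early_layers:
        hidden = np.tanh(hidden @ w)
    return hidden                      # intermediate activations sent over the network

def device_2(hidden):
    """Second phone: the middle layers."""
    for w in middle_layers:
        hidden = np.tanh(hidden @ w)
    return hidden

def device_3(hidden):
    """Third phone: the final layers and next-token prediction."""
    for w in final_layers:
        hidden = np.tanh(hidden @ w)
    logits = hidden @ lm_head
    return int(np.argmax(logits[-1]))  # id of the predicted next token

tokens = np.array([17, 42, 256])       # a toy prompt
next_token = device_3(device_2(device_1(tokens)))
print("predicted next token id:", next_token)
```

In the real system, the returns from device_1 and device_2 correspond to intermediate activations serialized and sent over the local network, which is why LinguaLinked's optimized communication maps focus on keeping those transfers small and well ordered.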