LinguaLinked: Empowering Mobile Devices with Distributed Large Language Models
The demand for deploying large language models (LLMs) on mobile devices is rising, driven by the need for privacy, reduced latency, and efficient bandwidth usage. However, the extensive memory and computational requirements of LLMs pose significant challenges. Enter LinguaLinked, a new system developed by researchers at UC Irvine that enables decentralized, distributed LLM inference across multiple mobile devices, leveraging their collective capabilities to run complex models efficiently.
The Challenge
Deploying LLMs like GPT-3 or BLOOM on mobile devices is challenging due to:
- Memory Constraints: LLMs require substantial memory, often exceeding the capacity of individual mobile devices.
- Computational Limitations: Mobile devices typically have limited processing power, making it difficult to run large models.
- Privacy Concerns: Sending data to centralized servers for processing raises privacy issues.
LinguaLinked's Solution
LinguaLinked addresses these challenges with three key strategies:
- Optimized Model Assignment:
  - The system segments the LLM into smaller subgraphs and uses linear optimization to match each segment to a device's capabilities (see the first sketch after this list).
  - This makes efficient use of resources and minimizes inter-device data transmission.
- Runtime Load Balancing:
  - LinguaLinked actively monitors device performance and redistributes tasks to prevent bottlenecks (second sketch below).
  - This dynamic approach keeps all available resources busy and improves overall system responsiveness.
- Optimized Communication:
  - Data-transmission maps guide the flow of information between devices, preserving the model's structural integrity (third sketch below).
  - This reduces latency and ensures timely data processing across the network of mobile devices.
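To give a flavor of the assignment step, the sketch below casts segment-to-device placement as a small integer linear program using the PuLP library: place each segment on exactly one device, respect memory budgets, and minimize the most loaded device's compute time. The segment costs, device specs, and makespan objective are hypothetical stand-ins; the paper's actual linear optimization also models transmission cost and is more involved.

```python
import pulp

# Hypothetical per-segment compute cost (GFLOPs) and memory footprint (GB).
seg_cost = {"seg0": 4.0, "seg1": 6.0, "seg2": 5.0}
seg_mem  = {"seg0": 1.2, "seg1": 1.8, "seg2": 1.5}

# Hypothetical device memory budgets (GB) and relative speeds.
dev_mem   = {"phoneA": 3.0, "phoneB": 4.0}
dev_speed = {"phoneA": 1.0, "phoneB": 1.5}

prob = pulp.LpProblem("segment_assignment", pulp.LpMinimize)

# x[s][d] == 1 iff segment s is placed on device d.
x = pulp.LpVariable.dicts("x", (seg_cost, dev_mem), cat="Binary")
makespan = pulp.LpVariable("makespan", lowBound=0)

prob += makespan  # minimize the most heavily loaded device's time

for s in seg_cost:
    # Every segment is placed on exactly one device.
    prob += pulp.lpSum(x[s][d] for d in dev_mem) == 1

for d in dev_mem:
    # Respect each device's memory budget...
    prob += pulp.lpSum(seg_mem[s] * x[s][d] for s in seg_mem) <= dev_mem[d]
    # ...and bound its compute time by the makespan variable.
    prob += pulp.lpSum(seg_cost[s] / dev_speed[d] * x[s][d] for s in seg_cost) <= makespan

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for s in seg_cost:
    for d in dev_mem:
        if pulp.value(x[s][d]) > 0.5:
            print(f"{s} -> {d}")
```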
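The runtime load balancer can be approximated by a simple monitor: keep a sliding window of recent step latencies per device and flag a migration when the slowest device falls too far behind the fastest. The class name, window size, and threshold below are assumptions for illustration, not LinguaLinked's actual policy.

```python
from collections import defaultdict

class LoadBalancer:
    """Tracks per-device step latency and flags rebalancing (hypothetical)."""

    def __init__(self, threshold=1.5):
        self.threshold = threshold          # slowest/fastest ratio that triggers a move
        self.latencies = defaultdict(list)  # device -> recent step times (seconds)

    def record(self, device, seconds):
        window = self.latencies[device]
        window.append(seconds)
        if len(window) > 20:                # keep only a sliding window of samples
            window.pop(0)

    def rebalance_target(self):
        avg = {d: sum(w) / len(w) for d, w in self.latencies.items() if w}
        if len(avg) < 2:
            return None
        slowest = max(avg, key=avg.get)
        fastest = min(avg, key=avg.get)
        if avg[slowest] / avg[fastest] > self.threshold:
            return slowest, fastest          # shift one segment from slowest to fastest
        return None

lb = LoadBalancer()
for t in [0.10, 0.11, 0.12]:
    lb.record("phoneA", t)
for t in [0.31, 0.29, 0.33]:
    lb.record("phoneB", t)
print(lb.rebalance_target())  # ('phoneB', 'phoneA'): move work off phoneB
```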
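Finally, a data-transmission map boils down to a routing table: for each segment, it records where the output activations go next. The device names and endpoint addresses below are invented for illustration.

```python
# Hypothetical transmission map: for each model segment, where its output
# activations are sent next. LinguaLinked derives such maps so data follows
# the model's layer order; the names and addresses here are made up.
TRANSMISSION_MAP = {
    "seg0": {"device": "phoneB", "endpoint": "tcp://10.0.0.12:5555"},
    "seg1": {"device": "phoneC", "endpoint": "tcp://10.0.0.13:5555"},
    "seg2": {"device": "phoneA", "endpoint": None},  # last segment: return result
}

def next_hop(segment_id: str) -> tuple[str, str | None]:
    """Look up which device (and address) receives a segment's output."""
    hop = TRANSMISSION_MAP[segment_id]
    return hop["device"], hop["endpoint"]

print(next_hop("seg0"))  # ('phoneB', 'tcp://10.0.0.12:5555')
```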
A single LLM is split into different parts (or segments) and distributed across multiple mobile devices. Each device then handles only a fraction of the total computation and storage, making it feasible to run complex models even on devices with limited resources.
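To make that concrete, here is a minimal sketch of pipeline-style execution over model shards. It is written in PyTorch, with plain Python objects standing in for remote phones and network transport; the `DeviceShard` class, layer sizes, and three-way split are illustrative assumptions, not LinguaLinked's actual implementation.

```python
import torch
import torch.nn as nn

class DeviceShard:
    """Stand-in for one phone holding a contiguous slice of the model."""

    def __init__(self, layers):
        self.module = nn.Sequential(*layers)

    def forward(self, activations):
        # On a real device this would deserialize incoming activations,
        # run the local layers, and ship the result to the next hop.
        with torch.no_grad():
            return self.module(activations)

# A toy six-layer "model" split into three shards of two layers each.
layers = [nn.Linear(64, 64) for _ in range(6)]
shards = [DeviceShard(layers[i:i + 2]) for i in range(0, 6, 2)]

x = torch.randn(1, 64)   # stand-in for token embeddings
for shard in shards:     # activations hop from shard to shard in order
    x = shard.forward(x)
print(x.shape)           # torch.Size([1, 64])
```

Here's a breakdown of how this works: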