How Multi‑Task LLM Services Slash Deployment Costs by 90% in Industrial NLP
This article summarizes Quincy Qu's COLING 2025 industry‑track paper that proposes a three‑stage multi‑task LLM framework for large‑scale NLP services, achieving performance comparable to single‑task models while reducing overall service cost by up to 90.9%.
Paper Overview
The work, authored by Quincy Qu, a senior algorithm engineer at Ctrip, was accepted to the COLING 2025 industry track, a top conference in natural language processing and computational linguistics. It addresses the challenge of deploying numerous NLP tasks in industrial settings where each request must be timely and accurate, leading to linear growth in compute, memory, and operational costs.
Motivation and Challenges
Traditional single‑task services require a separate model and serving pipeline for each task, causing heavy development and maintenance effort, increased latency, and excessive resource consumption, especially when scaling to large language models (LLMs). Multi‑task approaches, meanwhile, can suffer from data imbalance, task heterogeneity, negative transfer, and the difficulty of applying early‑stopping strategies uniformly across tasks.
Proposed Three‑Stage Framework
The authors introduce a three‑stage framework:
1. Filter out tasks that are dissimilar to the rest of the pool or too low‑resource, to prevent negative transfer.
2. Fine‑tune high‑resource tasks individually.
3. Fine‑tune on the combined task mixture, letting each task run for an appropriate number of training epochs; this enables early stopping for low‑resource tasks while avoiding under‑fitting of high‑resource ones.
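The three stages above can be sketched as a simple planning routine. This is a minimal sketch: the task names, similarity scores, and every threshold below are illustrative assumptions, not values published in the paper.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    n_examples: int
    similarity: float  # similustrative similarity to the rest of the task pool, in [0, 1]

def three_stage_plan(tasks, min_similarity=0.3, min_examples=100,
                     high_resource=10_000, base_epochs=3, epoch_cap=10):
    """Plan the three stages. All cutoffs are illustrative assumptions;
    the paper does not publish its exact thresholds."""
    # Stage 1: drop tasks that are too dissimilar or too low-resource,
    # since they risk negative transfer in the joint mixture.
    kept = [t for t in tasks
            if t.similarity >= min_similarity and t.n_examples >= min_examples]

    # Stage 2: high-resource tasks get an individual fine-tuning pass first.
    solo = [t.name for t in kept if t.n_examples >= high_resource]

    # Stage 3: joint fine-tuning on the task mixture with a per-task epoch
    # budget, so low-resource tasks stop early while high-resource tasks
    # train long enough to avoid under-fitting.
    budgets = {}
    for t in kept:
        scale = max(1, t.n_examples // high_resource)
        budgets[t.name] = min(epoch_cap, base_epochs * scale)
    return kept, solo, budgets
```

The key design point is that the per‑task epoch budget is decided per task rather than globally, which is what lets early stopping and long training coexist in one joint run.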
This strategy leverages a shared multi‑task LLM service, reducing deployment workload and memory usage compared with independent single‑task services.
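One simple way a shared multi‑task service can dispatch requests is by mapping each task to an instruction template for the same underlying model. The task names and template wording below are hypothetical; the article does not describe Ctrip's production prompts.

```python
# Hypothetical task-to-instruction routing for a shared multi-task LLM
# service; task names and templates are illustrative, not the paper's.
TEMPLATES = {
    "sentiment": "Classify the sentiment of this review as positive or negative:\n{text}",
    "ner": "List the named entities in this text:\n{text}",
    "summarize": "Summarize this text in one sentence:\n{text}",
}

def route(task: str, text: str) -> str:
    """Build the prompt the single shared model would receive for a task."""
    if task not in TEMPLATES:
        raise KeyError(f"unknown task: {task}")
    return TEMPLATES[task].format(text=text)
```

Because every task is just a different prompt to one model, adding a task changes a template rather than adding a deployment, which is where the memory and operational savings come from.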
Experimental Results
Extensive experiments demonstrate that the proposed method attains performance close to dedicated single‑task baselines. Gains stem primarily from the sampling strategy, task filtering, and domain‑specific continual pre‑training.
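The article credits the sampling strategy but does not specify it. A common choice in multi‑task fine‑tuning is temperature‑scaled sampling, where task i is drawn with probability proportional to n_i^(1/T); the sketch below shows that scheme under the assumption it resembles the paper's, with T = 1 giving proportional sampling and larger T flattening the mixture toward low‑resource tasks.

```python
def mixture_weights(sizes, temperature=2.0):
    """Temperature-scaled sampling weights: p_i proportional to n_i**(1/T).
    T=1 reproduces size-proportional sampling; larger T flattens the
    distribution, giving low-resource tasks more exposure per step."""
    scaled = {name: n ** (1.0 / temperature) for name, n in sizes.items()}
    total = sum(scaled.values())
    return {name: s / total for name, s in scaled.items()}
```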
Key Contributions
Introduced a multi‑task service framework that utilizes LLMs to handle multiple NLP tasks with performance comparable to single‑task models.
Conducted comprehensive experiments showing the framework’s practicality across various benchmarks and evaluating the importance of each component (task selection, sampling strategy, etc.).
Deployed the model in production to serve 11 downstream tasks, achieving up to a 90.9% reduction in total service cost relative to single‑task deployments, a figure consistent with consolidating 11 separate services into one shared one (1 − 1/11 ≈ 90.9%).
For full details, the original paper can be downloaded from the COLING 2025 proceedings.