
Optimizing Request Concurrency for LLM Workflows: Rationale, Implementation, and Results

By breaking iterable inputs into parallel LLM calls and batching 20 items across three languages within Dify’s platform limits, the workflow achieves 43‑64% average runtime reductions and markedly higher success rates, demonstrating that request‑level concurrency dramatically improves throughput for large‑scale translation tasks.

37 Interactive Technology Team

In Dify workflow definitions, when an LLM needs to receive an iterable (e.g., an array) as a parameter, sending a single request for the whole array can be slow. By treating each element as an independent request, the single request becomes n parallel requests, potentially speeding up generation.
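The element-level fan-out described above can be sketched as follows. This is a minimal illustration, not Dify's actual API: `translateOne` is a hypothetical stand-in for a single LLM call.

```javascript
// Hypothetical stand-in for one LLM call on a single element.
async function translateOne(item) {
  // A real implementation would call the LLM API here.
  return `translated:${item}`;
}

// Instead of one request carrying the whole array, issue n independent
// requests (one per element) and await them in parallel.
async function translateAll(items) {
  return Promise.all(items.map((item) => translateOne(item)));
}
```

With `Promise.all`, total latency approaches the slowest single call rather than the sum of all calls.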

Background: Dify executes iterations synchronously, so total time T1 ≈ n × t where t is the average time per iteration. Coze offers a batch‑processing mode with a “Parallel runs” setting, allowing concurrent execution of up to the configured number of parallel LLM calls. When the total number of calls does not exceed the parallel limit, the total time becomes T2 ≈ max(t1, t2, …) ≈ t, usually much smaller than T1.
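A bounded-concurrency runner in the spirit of Coze's "Parallel runs" setting could look like the sketch below (an assumption about the mechanism, not Coze's implementation): at most `limit` tasks run at once, and the rest queue until a worker frees up.

```javascript
// Run async task factories with at most `limit` in flight at any moment.
async function runWithLimit(tasks, limit) {
  const results = new Array(tasks.length);
  let next = 0;
  // Spawn up to `limit` workers; each repeatedly claims the next unstarted
  // task. Claiming (next++) is synchronous, so no two workers get the same index.
  const workers = Array.from(
    { length: Math.min(limit, tasks.length) },
    async () => {
      while (next < tasks.length) {
        const i = next++;
        results[i] = await tasks[i]();
      }
    }
  );
  await Promise.all(workers);
  return results;
}
```

When the number of tasks is at or below `limit`, every task starts immediately and total time collapses toward max(t1, t2, …), matching the T2 estimate above.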

Because Dify does not provide a built‑in concurrent execution mechanism, concurrency must be handled on the request side. The platform imposes limits such as a maximum of 30 strings or objects per array, a maximum string length of 1,000 characters, and a maximum of 50 iteration rounds. To work around these limits, large variables can be serialized with JSON.stringify (up to 80,000 characters for a single serialized string) and deserialized with JSON.parse when needed.
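The serialization workaround can be sketched as a pack/unpack pair. The 80,000-character guard mirrors the limit cited above; the helper names are illustrative, not part of Dify.

```javascript
// Pack a large structured variable into one string so it fits Dify's
// variable model (arrays capped at 30 elements / 1,000 chars per string),
// trading those limits for the larger 80,000-character string budget.
function packItems(items) {
  const json = JSON.stringify(items);
  if (json.length > 80000) {
    throw new Error("Serialized payload exceeds the 80,000-character limit");
  }
  return json;
}

// Restore the original structure on the receiving side.
function unpackItems(json) {
  return JSON.parse(json);
}
```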

Optimization strategy: split the workload into two main flows, S1 (obtain unique identifiers and pre‑translation) and S2 (obtain translation optimization suggestions). For S1, requests are divided into batches of 20 documents × 3 languages, keeping identifiers consistent across languages and adding an "identifier‑missing" check. For S2, the same language granularity (3) is kept, while the content granularity is set to 20 items per batch.
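The batching scheme can be sketched as below: chunk the documents into groups of 20, then cross each chunk with the target languages so that every (chunk, language) pair becomes one independent request. The function name and default sizes are illustrative.

```javascript
// Split documents into batches of `batchSize`, then pair each batch with
// each target language; each resulting entry is one independent request.
function buildBatches(documents, languages, batchSize = 20) {
  const batches = [];
  for (let i = 0; i < documents.length; i += batchSize) {
    const chunk = documents.slice(i, i + batchSize);
    for (const language of languages) {
      batches.push({ language, documents: chunk });
    }
  }
  return batches;
}
```

For example, 45 documents across 3 languages yield 9 requests (chunks of 20 + 20 + 5, each requested once per language), each comfortably inside the platform's array and iteration limits.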

Concurrency constraints include browser limits of six concurrent requests per domain and the need to support 16 target languages. Additional API constraints involve variable length limits and potential token overflow when the LLM processes too many items at once.

Testing methodology: each scenario was executed three times and the average time was recorded using console.time() and console.timeEnd(). Results show that S1's concurrent execution saves about 43% of time on average, up to 75% with more languages; S2 saves about 64% on average, up to 73% in the best case. Non‑concurrent requests often hit step limits and fail, whereas concurrent requests complete successfully.
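A minimal timing harness in the shape described above might look like this; the function name and labels are illustrative, but console.time()/console.timeEnd() are the standard console API used for the measurements.

```javascript
// Run a scenario `repetitions` times, printing the elapsed time of each run
// via console.time()/console.timeEnd(); averaging is then done by hand.
async function timeScenario(label, run, repetitions = 3) {
  for (let i = 0; i < repetitions; i++) {
    console.time(`${label} run ${i + 1}`);
    await run();
    console.timeEnd(`${label} run ${i + 1}`); // prints e.g. "S1 run 1: ...ms"
  }
}
```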

Conclusion: Introducing request‑level concurrency into LLM‑driven workflows significantly improves throughput, especially when handling large volumes of data. The chosen batch sizes (20 items × 3 languages) provide a good trade‑off between parallelism and platform limits, and the approach can be further tuned as usage patterns evolve.

Tags: LLM, Performance Testing, Dify, Parallel Processing, Coze, Request Concurrency, Workflow Optimization