Hybrid Cloud Architecture for Efficient and Cost-Effective Large Language Model Deployment

  • Qi Xin, University of Pittsburgh, United States
Keywords: Large Language Models, Cloud Computing, Hybrid Deployment, Edge Computing, Cost Optimization

Abstract

Large Language Models (LLMs) have achieved remarkable success across natural language tasks, but their enormous computational requirements pose challenges for practical deployment. This paper proposes a hybrid cloud–edge architecture for deploying LLMs cost-effectively and efficiently. The proposed system employs a lightweight on-premise LLM to handle the bulk of user requests and dynamically offloads complex queries to a powerful cloud-hosted LLM only when necessary. We implement a confidence-based routing mechanism to decide when to invoke the cloud model. Experiments on a question-answering use case demonstrate that our hybrid approach can match the accuracy of a state-of-the-art LLM while reducing cloud API usage by over 60%, resulting in significant cost savings and a roughly 40% reduction in average latency. We also discuss how the hybrid strategy enhances data privacy by keeping sensitive queries on-premise. These results highlight a promising direction for organizations to leverage advanced LLM capabilities without prohibitive expense or risk by intelligently combining local and cloud resources.
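The confidence-based routing described in the abstract can be illustrated with a minimal sketch. This is not the paper's published implementation: the function names (route_query, local_generate, cloud_generate), the mean token log-probability used as a confidence proxy, and the 0.5-probability threshold are all assumptions made here for illustration.

    import math
    from dataclasses import dataclass
    from typing import Callable, List, Tuple

    @dataclass
    class RoutedAnswer:
        text: str
        source: str        # "local" or "cloud"
        confidence: float  # mean token log-probability of the local draft

    def mean_logprob(token_logprobs: List[float]) -> float:
        # Average per-token log-probability: a cheap confidence proxy.
        return sum(token_logprobs) / max(len(token_logprobs), 1)

    def route_query(query: str,
                    local_generate: Callable[[str], Tuple[str, List[float]]],
                    cloud_generate: Callable[[str], str],
                    threshold: float = math.log(0.5)) -> RoutedAnswer:
        # Always try the inexpensive on-premise model first.
        draft, token_logprobs = local_generate(query)
        confidence = mean_logprob(token_logprobs)
        if confidence >= threshold:
            return RoutedAnswer(draft, "local", confidence)
        # Low confidence: escalate to the cloud LLM (one paid API call).
        return RoutedAnswer(cloud_generate(query), "cloud", confidence)

    # Toy stand-ins for the two models; a real deployment would call an
    # on-premise inference server and a cloud API here.
    def local_model(q):
        return "Paris.", [-0.05, -0.10]  # confident local draft

    def cloud_model(q):
        return "Paris is the capital of France."

    if __name__ == "__main__":
        answer = route_query("What is the capital of France?",
                             local_model, cloud_model)
        print(answer.source, answer.text)  # -> local Paris.

Under this scheme, the threshold controls the cost/accuracy trade-off: raising it escalates more queries to the cloud model, while lowering it keeps more traffic (and more data) on-premise.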



Published
2025-09-22
How to Cite
Xin, Q. (2025). Hybrid Cloud Architecture for Efficient and Cost-Effective Large Language Model Deployment. Journal of Information Systems and Informatics, 7(3), 2182-2195. https://doi.org/10.51519/journalisi.v7i3.1170