Hybrid Cloud Architecture for Efficient and Cost-Effective Large Language Model Deployment

  • Qi Xin, University of Pittsburgh, United States
Keywords: Large Language Models, Cloud Computing, Hybrid Deployment, Edge Computing, Cost Optimization

Abstract

Large Language Models (LLMs) have achieved remarkable success across natural language tasks, but their enormous computational requirements pose challenges for practical deployment. This paper proposes a hybrid cloud–edge architecture for deploying LLMs cost-effectively and efficiently. The proposed system employs a lightweight on-premise LLM to handle the bulk of user requests and dynamically offloads complex queries to a powerful cloud-hosted LLM only when necessary. We implement a confidence-based routing mechanism to decide when to invoke the cloud model. Experiments on a question-answering use case demonstrate that our hybrid approach can match the accuracy of a state-of-the-art LLM while reducing cloud API usage by over 60%, resulting in significant cost savings and a roughly 40% reduction in average latency. We also discuss how the hybrid strategy enhances data privacy by keeping sensitive queries on-premise. These results highlight a promising direction for organizations to leverage advanced LLM capabilities without prohibitive expense or risk by intelligently combining local and cloud resources.
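The confidence-based routing described in the abstract can be illustrated with a minimal sketch. This is not the paper's published implementation: the function names (route_query, local_generate, cloud_generate), the mean token log-probability used as a confidence proxy, and the 0.5-probability threshold are all assumptions made here for illustration.

    import math
    from dataclasses import dataclass
    from typing import Callable, List, Tuple

    @dataclass
    class RoutedAnswer:
        text: str
        source: str        # "local" or "cloud"
        confidence: float  # mean token log-probability of the local draft

    def mean_logprob(token_logprobs: List[float]) -> float:
        # Average per-token log-probability: a cheap confidence proxy.
        return sum(token_logprobs) / max(len(token_logprobs), 1)

    def route_query(query: str,
                    local_generate: Callable[[str], Tuple[str, List[float]]],
                    cloud_generate: Callable[[str], str],
                    threshold: float = math.log(0.5)) -> RoutedAnswer:
        # Always try the inexpensive on-premise model first.
        draft, token_logprobs = local_generate(query)
        confidence = mean_logprob(token_logprobs)
        if confidence >= threshold:
            return RoutedAnswer(draft, "local", confidence)
        # Low confidence: escalate to the cloud LLM (one paid API call).
        return RoutedAnswer(cloud_generate(query), "cloud", confidence)

    # Toy stand-ins for the two models; a real deployment would call an
    # on-premise inference server and a cloud API here.
    def local_model(q):
        return "Paris.", [-0.05, -0.10]  # confident local draft

    def cloud_model(q):
        return "Paris is the capital of France."

    if __name__ == "__main__":
        answer = route_query("What is the capital of France?",
                             local_model, cloud_model)
        print(answer.source, answer.text)  # -> local Paris.

Under this scheme, the threshold controls the cost/accuracy trade-off: raising it escalates more queries to the cloud model, while lowering it keeps more traffic (and more data) on-premise.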



Published
2025-09-22
How to Cite
Xin, Q. (2025). Hybrid Cloud Architecture for Efficient and Cost-Effective Large Language Model Deployment. Journal of Information Systems and Informatics, 7(3), 2182-2195. https://doi.org/10.51519/journalisi.v7i3.1170