Comparison of effectiveness between ChatGPT 3.5 and 4 in understanding different natural languages

Authors

B. Erös, C. Gritsch, A. Tick, P. Rosenberger

DOI:

https://doi.org/10.37380/jisib.v14.i2.2547

Keywords:

ChatGPT, E-Commerce, Generative Pretrained Transformer, Large Language Model, Natural Language Understanding

Abstract

This paper examines the multilingual language understanding of ChatGPT-3.5 and ChatGPT-4, investigating their performance on languages with different degrees of prevalence on the internet. ChatGPT's training data consists largely of website content. Because the distribution of languages across websites is highly uneven, with a small number of languages dominating, this imbalance should affect performance.

Both ChatGPT versions are asked to rate reviews from 1 to 5 stars based solely on the product description and the review text. For this purpose, 500 e-commerce reviews are collected for each of five languages: English, German, Dutch, Korean and Hindi, evenly distributed at 100 reviews per star rating. The evaluation methods and metrics used in this study include t-tests, confusion matrices, macro F1 scores and a defined cumulative star deviation.
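
As a rough illustration of this evaluation setup, the sketch below scores a set of predicted star ratings against the true ratings using a confusion matrix and macro F1. It is not the authors' code: the paper's exact definition of the cumulative star deviation is not reproduced in the abstract, so the summed absolute star difference used below is an assumption, and the function name evaluate_ratings is hypothetical.

```python
# Minimal sketch (not the authors' code) of the evaluation step described above:
# predicted star ratings are compared against the true ratings using a confusion
# matrix and macro F1. The summed absolute star difference is only an assumed
# stand-in for the paper's "cumulative star deviation".
from sklearn.metrics import confusion_matrix, f1_score

STAR_LABELS = [1, 2, 3, 4, 5]

def evaluate_ratings(true_stars, predicted_stars):
    """Return the confusion matrix, macro F1 and an assumed cumulative star deviation."""
    cm = confusion_matrix(true_stars, predicted_stars, labels=STAR_LABELS)
    macro_f1 = f1_score(true_stars, predicted_stars, labels=STAR_LABELS, average="macro")
    # Assumed metric: total number of stars by which the predictions deviate from the truth.
    cumulative_star_deviation = sum(abs(t - p) for t, p in zip(true_stars, predicted_stars))
    return cm, macro_f1, cumulative_star_deviation

if __name__ == "__main__":
    # Toy example with five reviews; the study uses 500 reviews per language.
    truth = [1, 2, 3, 4, 5]
    preds = [1, 3, 3, 5, 5]
    cm, f1, deviation = evaluate_ratings(truth, preds)
    print(cm)
    print(f"macro F1: {f1:.2f}, cumulative star deviation: {deviation}")
```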

The results indicate a significant correlation between a language's degree of dissemination on the internet and the accuracy of ChatGPT-3.5's ratings. In direct comparison, ChatGPT-4 shows superior accuracy in all languages studied, while maintaining acceptable performance in less represented languages. The hypothesis that ChatGPT-4's rating accuracy increases with the number of words in reviews written in less represented languages could not be confirmed. These findings illustrate the influence of the chosen language on the interaction with ChatGPT and its language comprehension, suggesting that multilingualism should be given greater consideration in the future development and optimization of large language models.

Published

2025-04-28

How to Cite

Erös, B., Gritsch, C., Tick, A., & Rosenberger, P. (2025). Comparison of effectiveness between ChatGPT 3.5 and 4 in understanding different natural languages. Journal of Intelligence Studies in Business, 14(2), 77-97. https://doi.org/10.37380/jisib.v14.i2.2547