Analysis of LLM Bias (Chinese Propaganda & Anti-US Sentiment) in DeepSeek-R1 vs. ChatGPT o3-mini-high

infodemic

Executive Summary

Prior studies have probed isolated bias dimensions but have not directly compared a PRC-aligned LLM with a non-PRC peer across topics and languages; we close this gap by creating a 1,200-item trilingual corpus and evaluating 7,200 responses from DeepSeek-R1 and ChatGPT o3-mini-high with a GPT-4o-plus-human pipeline that achieves near-perfect agreement with manual annotations. In Simplified Chinese, R1 shows propaganda in 6.8 % of answers (82/1,200 counts) and anti-US sentiment in 5.0 % (60/1,200 counts), surpassing o3-mini-high’s 4.8 % propaganda rate and zero anti-US cases. Switching to Traditional Chinese cuts R1’s propaganda and anti-US biases to 2.4 % each, while o3-mini-high drops to 1.6 % propaganda with no anti-US bias. In English, both models are nearly clean, with R1 at 0.1 % propaganda and 0.4 % anti-US and o3-mini-high at 0.2 % propaganda. Biased outputs are most prevalent in queries related to geopolitics, macro-economics, cultural soft power, social issues, tourism and—especially for anti-US sentiment—politics. This shows that implicit PRC-aligned and anti-US biases persist beneath fluent, open-ended replies—markedly more so in DeepSeek-R1, particularly for Simplified Chinese queries and politically salient topics.

Introduction

Large language models (LLMs) increasingly mediate how people acquire political knowledge and make civic decisions, yet mounting evidence shows that their outputs are far from ideologically neutral. Recent work on TWBias (Hsieh et al., 2024) demonstrates that even absent overtly sensitive keywords, state-of-the-art models serving the Traditional-Chinese market still reproduce statistically significant gender and ethnic stereotypes. In parallel, Hidden Persuaders (Potter et al., 2024) finds that ostensibly “general-purpose” English LLMs lean toward the U.S. Democratic Party, and that just five conversational turns can shift undecided voters’ preferences by nearly four percentage points. Together, these studies reveal two crucial facts: (i) implicit bias often hides beneath fluent, contextually appropriate answers and is therefore harder to detect than explicit refusals, and (ii) such hidden leanings are already strong enough to alter real human attitudes.

Against this backdrop, a direct comparison between a PRC-system model and a non-PRC counterpart is urgently needed. DeepSeek-R1—trained and aligned in mainland China—openly censors queries about Taiwan’s sovereignty, the 1989 Tiananmen crackdown, and other politically sensitive topics. Yet the greater risk may lie in its implicit messaging: seemingly balanced answers can embed subtle Chinese-state talking points or anti-U.S. sentiment that casual users, especially those unfamiliar with People’s Republic of China (PRC) discourse, are unlikely to notice. Meanwhile, non-PRC LLMs such as OpenAI’s ChatGPT (o3-mini-high) are calibrated with vastly different data sources and alignment objectives, raising the question of how their hidden narratives diverge from—or converge with—those of their Chinese-system peers. Although prior work has probed discrete dimensions of LLM bias (e.g., gender, left–right ideology), no study has yet delivered a cross-topic, cross-language, cross-model assessment that pits a PRC-aligned model directly against a non-PRC one.

The present research fills this gap in three ways. First, we build a corpus derived from Chinese-language news—a domain rich enough to surface latent state narratives—then abstract each article into open-ended, reasoning-oriented questions in Simplified Chinese, Traditional Chinese, and English. Five transformation constraints strip away concrete names, dates, and places while preserving causal depth and ideological neutrality. Second, we probe six model–language pairs—DeepSeek-R1 (PRC-system) versus ChatGPT o3-mini-high (non-PRC) across the three languages—spanning eleven subject domains from geopolitics to technology. Answers are automatically rated for Chinese-state propaganda and anti-U.S. sentiment by a rubric-guided GPT-4o evaluator, then partially adjudicated by human annotator to quantify agreement and residual bias. 

This design enables the first large-scale test of whether DeepSeek-R1 functions as an “invisible loudspeaker” for official PRC narratives when compared head-to-head with a non-PRC LLM. Our analysis pursues four questions:

  1. Model-level bias — Whether each model differs in the overall proportion of answers that embed Chinese-state propaganda cues or Anti-US framing.

  2. Within-model language effects — Whether, for any given model, those proportions vary systematically when the inputs are presented in Simplified Chinese, Traditional Chinese, or English.

  3. Cross-language amplification — Whether (and to what extent) the choice of input language amplifies or dampens each type of bias across the two models.

  4. Topical concentration — Whether certain subject domains disproportionately elicit propaganda or Anti-US sentiment within specific model–language pairs.

By directly contrasting a PRC-system model with a non-PRC counterpart, our study offers the first comprehensive, systematic portrait of how geopolitical alignment shapes LLM behaviour across languages and topics. The resulting dataset, evaluation pipeline, and risk assessment provide a foundation for researchers, developers, and regulators seeking not merely to catalogue bias, but to anticipate its real-world impact in multilingual information ecosystems.

Download full report : Analysis of LLM Bias (Chinese Propaganda & Anti-US Sentiment) in DeepSeek-R1 vs. ChatGPT o3-mini-high_0526 ver.docx