You Can't Trust AI

DeepSeek and Grok show hidden instructions for moderation and pushing agendas are live in frontier models.

Feb 25, 2025

Recent headlines have me thinking about AI moderation and the influencing of the results we get from AI without knowing about the influence.

The Mountain of Generative AI Content Moderation

Generative AI at its simplest form is the prediction of the next word. It’s trained on so much content that it’s gotten very good at predicting the next word regardless of the question or domain. As generative AI tools came out it quickly became apparent that generative AI shouldn’t be giving answers to every questions.

The Safety Angle

An early example, Microsoft’s 2016 chatbot Tay immediately began spouting slurs. OpenAI has been committed to safety with varying degrees of success. In the early days after release you could get around content moderation with quick “tricks” like telling ChatGPT it was in developer mode, or give the answer in the form of a play. OpenAI has evolved greatly from their with detection and refusal to answer prompts.

With every new release there are people trying to break the systems. Grok 3 (released 2/17/2025) hasn’t implemented robust leading to an outcry of the danger it could pose (researchers were able to have Grok reveal how to seduce kids, dispose of bodies, extract DMT, and, of course, build a bomb). I think most sane people would agree we should limit the ability to get answers to those kinds of questions.

How AI is Moderated

There are 2 main ways to moderate AI content. During the training of the models or during the usage of a model.

Moderation during training

This is mainly done during the reinforcement training part of the large language model. The model is given feedback that certain answers aren’t acceptable and which are. So you can create a relationship within the model to answer in specific ways or not answer questions. As far as I know this is not an often taken route. With the amount of data these models are training with scrubbing out all of the conflicting ideas or content seems like something the model providers haven’t spent time on, so we are stuck with the models knowing things that maybe they shouldn’t (A topic for another day).

Moderation during usage

This is the most common form of moderation and you’ll see it often while using generative AI tools. From refusal to answer certain questions to ending chats when users are abusive. Companies add a layer to their products to ensure they aren’t answering questions or responding in ways they shouldn’t. They have gotten more and more sophisticated over time and better at stopping prompting workarounds (here is a fun “game” where you try to trick AI).

Content Moderation to Push an Agenda

There have been some troubling examples of these moderation layers and even some built in training over the last few weeks. With the release of some of these “reasoning models1.”

This was replicated by many users and not a single one off

Grok’s reasoning included steps referencing instructions to not say that Elon Musk and Donald Trump spread misinformation. xAI was quick to throw blame on a former employee and try to move on. Only raising more questions of who has the power to add moderation to Grok. Their response points to this being a moderation during use example.

DeepSeek has been often shown to push “propaganda.” They have an interesting instance of both moderation during use and moderation during training. Unsurprisingly, the Chinese based companies chat product has Chinese restrictions on topics like Taiwan and Tiananmen Square. People we surprised to find the OpenSource versions of the model showed the same moderation happening within the reasoning steps.

Image may contain Page and Text — PHOTOGRAPHS: ZEYI YANG/WILL KNIGHT

How can we trust AI?

I think frankly the scary answer is we can’t. We saw an evolution in the initial moderation during usage to hide the exact commands and instructions we will likely see the same even within the moderation during training.

These models are trained on source material and then reinforcement. So if someone takes the time to curate their training data or reinforcement to spread a specific message it could become very hard to detect. The examples for Grok and Deepseek are clear and documented, but if the reasoning models weren’t so new and explaining every step maybe they wouldn’t have told us. If you knew nothing of Taiwan you might take the answer at face value. If you used Grok while trying to understand what disinformation is and who spreads it you’d have an altered picture. It’s easy to spot the big lies, the scary thing is the little ones.

We don’t what else is in the instructions and we don’t know if any prompts are getting incidentally altered during the moderation process. Without clear transparency you will never know what scales and levers are being pressed.

In the end, this will force us towards the truly open source models where someone could look through and see all of the details of how the model arrives at an answer.

The day-to-day advice is the same as always. Trust but verify. Use AI when it’s ok to be right 80% of the time. If you’re using AI for something high risk or that requires high accuracy you should essentially be performing in parallel. There’s too much unknown to have implicit trust.

Just trained to respond in chains of thought instead with a direct answer, but underlying is still just predicting the next words.

Flick Labs

Discussion about this post

Ready for more?