hpr_transcripts/hpr4394.txt

Episode: 4394
Title: HPR4394: Digital Steganography Intro
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr4394/hpr4394.mp3
Transcribed: 2025-10-26 00:12:01

---

This is Hacker Public Radio Episode 4394 for Thursday the 5th of June 2025.
Today's show is entitled Digital Steganography Intro.
It is hosted by Mightby Mike and is about 33 minutes long.
It carries a clean flag.
The summary is, I take a very high-level look at digital steganography.
What is it?
Digital steganography?
Why digital?
Well, steganography goes back thousands of years and it's got quite a long and storied
history that would totally deserve its own episode, honestly.
That's one reason.
For another thing, I don't really know that much about it.
I know some of the common ciphers that were used, great stories, there's lots of cool stories.
Check the Wikipedia page for that.
It's a long page.
It's got a lot of examples.
This goes back, as I said, who knows how long.
But I just want to talk about digital steganography because I know it better.
It's a more interesting topic in some ways because of the proliferation of text data
on the internet.
So I want to talk about digital steganography.
I think it's far more interesting because there's plenty of opportunity for innovation.
That's not always a good thing.
A lot of the innovation comes from bad actors.
It's also a rich field for study and it will be for some time to come.
Not that I want to get way in the weeds here today.
I just want to cover the basics of digital steganography.
Starting with what is it?
Steganography is hiding things in plain sight.
That's the view from 5,000 feet.
The first thing I usually say about it is in response to the question that I hear
most often, which is what is the difference between cryptography and steganography?
Honestly, what they mean to ask is what is the difference between encryption and steganography?
And it's a great question because they're similar in a way, but they are distinct.
So encryption, here comes the hand-wavy definitions to stay out of the weeds.
Encryption is making data unreadable.
So you're going to obfuscate the data somehow such that I could literally take a secret
and do something to it and put it in a file and hand it to you and you can take that file
and go off and work for weeks and not be able to figure out what the message was because
math.
So making it unreadable to you, obfuscating it, that's encryption.
Steganography is hiding it in plain sight.
So I can hide data in very public places like social media using steganography.
Hiding in plain sight sometimes also includes encryption, though, for example, and this is
where the confusion comes in to some extent.
When you're hiding a text message in some sort of media, then of course you have the opportunity
to instead of hiding the text, you can hide the encrypted text to do so.
So it's pretty common.
It's built into some of the software, for example, an optional step to encrypt the data before
you embed it into a file, very common.
And there can be a lot of things that you would hide with steganography.
A text message is just one, but it's a perfect example to use a lot of times.
But to sum up the distinction, encryption is about making it unreadable and steganography
is about hiding it, like right under your nose.
So in digital steganography, you're hiding a message in something.
And that something is usually a file.
It can be a text file, an image, an audio file, a video file, and more.
Any type of file, really, those that I mentioned are very common, though.
In particular, image files are probably the most common, because there are well-known
and very effective techniques for doing it out there, and it works well enough.
I'll come back to that later, because I also want to mention that there's no need to limit
our thinking here to files per se.
You can hide information in anything that emits a signal or communicates.
A network protocol, for example, has very formal ways of communicating.
You can hide things in the digital world almost anywhere.
My main point was that it's not limited to files.
That's just the most common use case, by far.
So when it comes to hiding, I keep using that word hiding, trying to keep things simple.
There are many approaches to hiding things in a file.
The way you should think of these, I guess, is in terms of evaluating them.
They have properties, right?
There's three basic properties that you want to maximize.
You want to consider.
There's integrity, which means the message integrity of the secret that I've put into
the carrier file, let's just say, for example, was it encoded accurately?
That's important, right?
Or maybe it's not.
Was it encoded accurately?
This should not be a lossy process of hiding the message in the carrier.
And there's stealth, that's important.
You want to be able to know that it wasn't tampered with.
I'll go back to that same example case.
I'm hiding a text message inside an image, let's say.
Changes to that image may or may not destroy my secret, or the ability of me to read that secret that's in there.
It often does, but that's what I mean by robustness.
A lot of these techniques are going to be brittle, and it's a bit misleading.
It's not always a matter of the technique per se, but the decisions that the human operator has made about how to do this.
For example, the relative size of the secret that you're hiding with respect to the size of the, let's say, the file that you're embedding it in.
Anyway, where do we see this in the real world?
Individuals can use it, and that's really the examples that I'm citing most.
For example, to communicate in a hidden way.
You can use it, for example, to post a message in a very public place, like on social media.
Companies can use steganography.
Companies commonly use steganography, most often, I should say,
in ways that are not obvious by the way I defined it.
Watermarking, for example, is literally steganography.
It's hiding something in something else.
So digital rights management is the typical use case for companies to use steganography.
And, of course, there are those malevolent actors as well that have classically used steganography.
Threat actors are where we learn about new techniques, but not that often, because, frankly, who's going to catch them, and how are you going to catch them?
The basic idea is that you're hiding data in plain sight.
So who's going to notice if you do a good job, they won't.
And so it makes perfect sense that in the instances of threat actors using this, they use it for just a few basic things, as far as people have been able to detect anyway.
One obvious one is for communications.
Between maybe a malicious program you've managed to sneak inside a company's network,
communicating with the outside world where a command and control server is located.
This can be done by just sending, well, I'll give you an example.
The Ocean Lotus group famously did that for years, probably still does.
They were using it to exfiltrate data in images.
Images because they are so commonly sent around, back and forth through firewalls, and innocently often a lot of employees sending pictures back and forth, sending it through email servers, and so on.
Exfiltrating data is a common use of steganography for that reason, communicating with C2 servers as well.
And the third reason, for which there are not too many examples, it's not that common, is smuggling in malicious code.
Going in the opposite direction, bringing in a malicious payload, there are actually only a few famous examples of this.
There was Powload about five years ago, where PNG image files were used to bring in malicious PowerShell scripts.
There was another example about five years ago where the headers of JPEG files were used to smuggle in PHP code.
This was a Brazilian bank being robbed.
Anyway, that's the idea that is not caught too often, I have a feeling it happens more often.
And this PHP code, it basically opened up a web browser, a text browser, links for those of you that are old enough to know about links.
But those are the common examples, those three.
The most famous example of communicating with C2 servers was probably LokiBot, that was about five or six years ago, plenty of other examples though.
Typically, it's things that are commonly going through the firewall, like through port 80, HTML files, images, emails, that sort of thing, the most common kinds of traffic.
So let's talk about file formats.
An integral part of understanding steganography is knowing about the thing you're embedding your secret in, in order to figure out an effective way to do it.
So at the end of the day, of course, computers just have a bunch of ones and zeros, logical zeros and ones inside the machine.
And what it means is dependent on how you interpret it.
For example, text files, and I'm saying this in the computer science kind of way, not like my dad talks about Word documents: with a text file, you're typically interpreting all those zeros and ones to be data that's been encoded using the ASCII character set encoding or Unicode, something like that.
So you're interpreting what those ones and zeros mean. Binary files are most files, and we're talking about hiding data in both types of files here.
It's worth knowing the difference between lossy and lossless file formats that are used to hide data in. Also, take the example I mentioned of hiding a secret and posting it on social media.
It's important to know the platform in that case and what they do to data that's coming in.
If you upload an image, how is it changed? Because it almost always is on the major social media platforms.
And is that particular change going to mess up the method or the technique that you've chosen to use?
So it's important to know about the file formats. We're not going to dive into it here, but generally speaking, the features of the file formats are what is exploited.
So for images, for example, there's often file markers. The simplest example is poor man's steganography for images.
If you simply echo your message onto the end of a JPEG file, append it right onto the end of there, like it was a text file, that's poor man's steganography, because a JPEG viewer, the viewer software, is going to display only what it thinks is the content.
Which is delimited by that end of content marker. So anything past there, it just won't display. It's still in the file though. That's the simplest example.
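To make that append trick concrete, here is a minimal Python sketch. The function names and the tiny fake byte string standing in for a JPEG are made up for illustration; the only real detail is the JPEG end-of-image marker, the two bytes FF D9. It also assumes the secret itself does not contain those two bytes.

```python
JPEG_EOI = b"\xff\xd9"  # JPEG end-of-image marker


def append_secret(jpeg_bytes: bytes, secret: bytes) -> bytes:
    """Return the image bytes with the secret tacked on after the end."""
    return jpeg_bytes + secret


def extract_trailing(data: bytes) -> bytes:
    """Return whatever follows the last end-of-image marker."""
    pos = data.rfind(JPEG_EOI)
    if pos == -1:
        return b""  # no marker found, so nothing to report
    return data[pos + len(JPEG_EOI):]


# A stand-in for a real file: start marker, some "image data", end marker.
fake_jpeg = b"\xff\xd8" + b"pretend pixel data" + JPEG_EOI
stego = append_secret(fake_jpeg, b"meet at noon")
print(extract_trailing(stego))  # b'meet at noon'
```

The same check works in the other direction: any bytes at all after the last end-of-image marker are worth a second look.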
So back to the serious stuff. Hiding text in text is really much more useful these days because of the proliferation of text on the internet.
And I'm not even talking about large language models here. I'm talking about the proliferation of social media, the proliferation of network protocols.
There's a lot more text-based content on the internet than you realize and new types all the time.
Also, conceptually, it's very simple to explain and to understand how it's working in the case of hiding text in text.
Even to the point where you can easily pick your own method or mix and match.
There's a few common methods that people use with text so far. Feel free to come up with a new one.
The first one I ever heard of, I'm pretty sure it was the first. I think it was from about 2011.
Probably should have looked this up before I started recording.
But this guy published a really simple C program. It doesn't even include any libraries, just standard IO.
It's a one-page program, not even one K. I don't think it compiles every time on every platform. So simple.
It uses a series of spaces and tabs to encode the data, much like dots and dashes in a Morse code of the pre-digital age.
And that's how it hides the message in the text file. It pads the end of sentences with extra characters, tabs and spaces.
And also empty lines that you may have left between paragraphs. If you put in a couple new lines there, there's space to add lots of spaces and tabs there.
And text file viewers are almost never going to show you the extra spaces and tabs.
If there's one or more, you usually see a gap between words or characters.
But at the end of a sentence, it's not going to typically show you that.
The exception to that is usually IDEs that programmers use, because in some programming languages the whitespace matters, like in Python, a very common programming language.
So it will literally show you those. And you can see that. There are other ways to see it, of course, too.
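Here is a rough Python sketch of the spaces-and-tabs idea. It is not the C program mentioned above and does not follow that program's exact scheme; it simply assumes one secret byte per line of cover text, written as eight trailing characters where a space means 0 and a tab means 1. The function names are hypothetical.

```python
def hide_in_whitespace(cover: str, secret: str) -> str:
    """Append 8 trailing spaces/tabs (one secret byte) to each cover line."""
    data = secret.encode("utf-8")
    lines = cover.split("\n")
    if len(data) > len(lines):
        raise ValueError("cover text has too few lines for this secret")
    out = []
    for i, line in enumerate(lines):
        line = line.rstrip(" \t")  # drop any whitespace already there
        if i < len(data):
            bits = format(data[i], "08b")
            line += bits.replace("0", " ").replace("1", "\t")
        out.append(line)
    return "\n".join(out)


def reveal_from_whitespace(stego: str) -> str:
    """Read back every line that ends in exactly 8 spaces/tabs."""
    data = bytearray()
    for line in stego.split("\n"):
        tail = line[len(line.rstrip(" \t")):]
        if len(tail) == 8:
            data.append(int(tail.replace(" ", "0").replace("\t", "1"), 2))
    return data.decode("utf-8")


cover = "The weather is nice.\nSee you at lunch.\n\nBye."
stego = hide_in_whitespace(cover, "hi!")
print(reveal_from_whitespace(stego))  # hi!
```

Note that the empty line between paragraphs carries a byte too, which matches the point about blank lines being usable space.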
That was the first approach. A more modern version of that is to use Unicode, which has some characters that are zero width.
Meaning that when they're displayed, they don't take up any width. You don't see them visually, but they're there.
They're definitely there. If you look in a hex editor, you'll see them.
But to the naked eye, they're just not there. And there are more than two of them. So you don't have to stick to a scheme like Morse code.
You can more efficiently encode your data. And that's why they're popular.
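As an illustration of why more than two invisible characters helps, here is a small Python sketch that assumes four of them (zero width space, zero width non-joiner, zero width joiner, and word joiner), so each hidden character carries two bits instead of one. The choice of these four, the placement after the first visible character, and the function names are all assumptions for the example, not a standard scheme.

```python
# Four invisible code points, so each one encodes a value 0..3 (two bits).
ZW = ["\u200b", "\u200c", "\u200d", "\u2060"]


def hide_zero_width(cover: str, secret: str) -> str:
    """Insert the secret as a run of zero-width characters."""
    payload = []
    for byte in secret.encode("utf-8"):
        for shift in (6, 4, 2, 0):  # four 2-bit groups per byte
            payload.append(ZW[(byte >> shift) & 0b11])
    # Tuck the invisible run in after the first visible character.
    return cover[:1] + "".join(payload) + cover[1:]


def reveal_zero_width(stego: str) -> str:
    """Collect the zero-width characters and rebuild the bytes."""
    values = [ZW.index(ch) for ch in stego if ch in ZW]
    data = bytearray()
    for i in range(0, len(values) - len(values) % 4, 4):
        a, b, c, d = values[i:i + 4]
        data.append(a << 6 | b << 4 | c << 2 | d)
    return data.decode("utf-8")


stego = hide_zero_width("Nice weather today.", "hi")
print(reveal_zero_width(stego))  # hi
print(len(stego))                # 27: 19 visible characters + 8 hidden ones
```

With only two symbols the same two-letter secret would need sixteen hidden characters rather than eight.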
Now, even though I talked about the distinction between text files and binary files, I still got to mention this.
When it comes to binary documents, like PDF files or Word documents, a common way of hiding information is with fonts.
You can mix very similar fonts, like encoding some of the characters using a certain italic thin font and the other characters using a bold thin font of the same font.
Or different sizes or something like that. In a way, that's not going to be easily even seen.
So speaking of which, I've mentioned these technical methods. And the truth is there are linguistic and semantic methods that you can use as well.
Think about an example where you hide a text message in a text post to a social media platform, like tweets.
You might spread it across a number of tweets, even using capitalization or using position, the fourth word in each tweet or a certain letter.
You know, the first letter of the second sentence in every tweet. Maybe it's every third tweet.
But these are not technical ways of hiding. They're more like semantic ways or linguistic techniques.
And they're very effective. I don't want to give them short shrift. But I am emphasizing the technical stuff because there's more to say about it.
And there's no end to the number of techniques with linguistic techniques because basically they're the same steganographic methods that have been used for thousands of years just brought over into the digital realm.
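A positional scheme like the ones just described can be sketched in a few lines of Python. This example assumes the agreed rule is "take the first letter of the fourth word of every post"; the posts, the rule, and the function name are invented for illustration.

```python
def reveal_positional(posts, word_index=3):
    """Take the first letter of the word at word_index in each post."""
    letters = []
    for post in posts:
        words = post.split()
        if len(words) > word_index:
            letters.append(words[word_index][0])
    return "".join(letters)


posts = [
    "Lunch was great, hope yours was too",  # fourth word: hope
    "I think that even Mondays are fine",   # fourth word: even
    "We saw a lovely sunset last night",    # fourth word: lovely
    "Can you bring pizza tonight",          # fourth word: pizza
]
print(reveal_positional(posts))  # help
```

Nothing in the posts is technically unusual, which is why this family of methods is so hard to spot without knowing the rule.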
So back to the technical stuff. We have the SNOW method, it's called, the first one I mentioned with the spaces and tabs.
Then there are the modern zero-width characters, using those Unicode characters. You see that a lot these days. And you can mix and match these things as well.
So I would be remiss if I didn't talk a little bit also about images. And I don't want to go too long here, so I won't get into the format-specific methods of hiding data. They all have format-specific ways of hiding data, like the end of content marker that I mentioned.
Or actually that PHP smuggling example that I used, that was PHP commands hidden in the JFIF headers of a JPEG file.
So it always has to do with the file format or the network protocol specifics, or the specifications of whatever the media is that the message is being hidden in.
So the most common technique for encoding your data in an image file is called LSB encoding. Least significant bit, LSB, that sounds familiar to the old computer guys.
It's from this old idea about which are the most important bits in a byte, for example. Sometimes precision is needed and all the bits are significant.
And sometimes it's not. When the byte is describing a color of a pixel in an image, it does not have to be exact, because your eye can't distinguish that closely if it's very close.
That is to say if you change the least significant bit or bits as opposed to the most significant bit which will noticeably change the color.
So this LSB encoding typically is like the least significant one to three bits but your mileage may vary.
It depends on the relative sizes of the content you're embedding in the carrier and the size of the image that you're embedding it in.
And this could be a text message or an image or anything else that you're embedding in there. You could literally hide a video inside an image. There's no reason you can't.
So there's a lot of things happening in a file format like a JPEG or a PNG and changing the least significant bits of the color values is just one technique.
It's pretty easy to understand conceptually which is the reason it's mentioned.
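Here is a minimal Python sketch of LSB encoding using one bit per carrier byte. To stay self-contained it works on a plain sequence of bytes standing in for decoded pixel color values rather than on a real image file, and it assumes the receiver already knows how many bytes to read back. The function names are hypothetical.

```python
def lsb_embed(pixels: bytes, secret: bytes) -> bytearray:
    """Write each bit of the secret into the lowest bit of one carrier byte."""
    bits = [(byte >> shift) & 1 for byte in secret for shift in range(7, -1, -1)]
    if len(bits) > len(pixels):
        raise ValueError("carrier too small for this secret")
    out = bytearray(pixels)
    for i, bit in enumerate(bits):
        out[i] = (out[i] & 0xFE) | bit  # clear the low bit, then set it
    return out


def lsb_extract(pixels: bytes, n_bytes: int) -> bytes:
    """Rebuild n_bytes from the low bits of the first n_bytes * 8 values."""
    data = bytearray()
    for i in range(n_bytes):
        byte = 0
        for value in pixels[i * 8:(i + 1) * 8]:
            byte = (byte << 1) | (value & 1)
        data.append(byte)
    return bytes(data)


carrier = bytes(range(100, 164))  # 64 pretend color values
stego = lsb_embed(carrier, b"hi")
print(lsb_extract(stego, 2))                            # b'hi'
print(max(abs(a - b) for a, b in zip(carrier, stego)))  # 1
```

At one bit per byte the carrier has to be at least eight times the size of the secret, which is the relative-size tradeoff mentioned above. Using two or three low bits raises capacity but makes each color value move further from the original.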
I really glossed over the details there because in fact a good way to hide the data is in this DCT transform where there's a bunch of coefficients just a big table of coefficients.
And I didn't want to get in the weeds except to mention that and to mention another common technique that I don't want to forget is color palettes.
If you can define the color palette that's being used in the image, perhaps you have an opportunity to define 65,536 different colors and you don't actually need so many.
Plenty of space there for extra stuff.
And some images have an alpha channel in addition to the colors. You could hide information in there.
But I like the example of LSB encoding because conceptually it's easy. If I just barely change the color on some bits visually you can't detect it in an image with millions of pixels.
Am I really going to notice a tiny difference in the color of blue in that patch of sky? Probably not.
So LSB encoding is a very popular technique. There are plenty of others when it comes to images like I said. But this one is the most commonly used also in audio and video files.
And it makes sense when you think about video files conceptually there's an easy way to think about this which is to realize that the MPEG video file format which is a pretty old file format.
It basically encodes the changes from frame to frame as a way of being efficient instead of describing the entirety of each frame of the video.
So if you think of a video as a series of frames or still images, LSB encoding is a natural fit.
Now I mentioned hiding it in other places. I want to give examples of that too. Even though really what you see a lot of in the real world is hiding data in images.
Let's take the example of hiding data in DNS records. Any network protocol is going to define very specifically the communication that takes place between the nodes on the network.
And if you look at that specification of how they communicate and identify areas where arbitrary data can go, that is where you can hide data.
But back to my example that I was trying to give of DNS records, if you think about them, this is a bunch of text.
You literally have TXT records that you can use. You have SPF records. And there are places you can hide data. And it's not that common.
I don't know any security programs that are out there or anyone that's actually looking for something like this.
So network protocols, I think there are probably plenty of examples out there in the world and not a lot of use cases that you can study because it just works.
And my example of DNS records here, this one is chosen because I like it. Not because it's a better example than most.
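As a sketch of the DNS idea, the Python below only prepares and reads back the text values; it does not publish or query anything. It assumes the data is base64-encoded and split into pieces that each fit within the 255-byte limit of a single TXT character-string, and it dresses the pieces up with a made-up "site-verification" label. The label, chunk size, and function names are all hypothetical.

```python
import base64


def to_txt_values(secret, label="site-verification", chunk=200):
    """Split base64-encoded data into numbered TXT-sized strings."""
    encoded = base64.b64encode(secret).decode("ascii")
    pieces = [encoded[i:i + chunk] for i in range(0, len(encoded), chunk)]
    return [f"{label}-{n}={piece}" for n, piece in enumerate(pieces)]


def from_txt_values(values, label="site-verification"):
    """Pick out the labelled values, put them in order, and decode."""
    found = {}
    for value in values:
        name, _, piece = value.partition("=")
        if name.startswith(label + "-"):
            found[int(name[len(label) + 1:])] = piece
    encoded = "".join(found[n] for n in sorted(found))
    return base64.b64decode(encoded)


secret = bytes(range(256))
records = to_txt_values(secret)
# Mix in an ordinary-looking SPF value, as a real zone would have.
zone_txt = ["v=spf1 include:_spf.example.com ~all"] + records[::-1]
print(from_txt_values(zone_txt) == secret)  # True
```

Numbering the pieces matters because a resolver does not promise to return multiple TXT records in any particular order.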
If you think about it, everything on the internet almost is text. If it's not, it's a binary file that is easy to hide stuff in as well.
So how do you detect these? Detection is a big part of it and honestly one of the most interesting parts of it.
It's a two step process is the answer. Step one is make assumptions and hope to get lucky.
Assume that they've used common software to encode data. And you can use that same common software for those same techniques to try to detect it.
For example, there's common software used to do LSB encoding of text messages in images. And that same software will tell you if there's a message in there.
So you can kind of bundle these things together, cobble them together into tools. And there are toolkits like that. There's zsteg and there's some others that just go through and check to see if you can get lucky.
If there's an obvious message that was encoded by this software, so assume they used freely available software to hide the data, and you simply use that same software or you write your own using the same algorithm to check it.
So if you're a firewall that wants to prevent, let's say you're an email server that wants to keep people from exfiltrating data in images, you know how to check for the common things.
And that's easy to do. If you're a WAF, same thing. If you're a firewall of some sort.
This is the go-to technique to try to get lucky. Just assume it's something easy to detect. Because honestly, after that, it's going to get very difficult.
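For text, "check for the common things" can be as simple as the Python sketch below. It looks for two of the text techniques covered earlier, zero-width characters and trailing spaces or tabs, and reports what it finds. The list of code points and the function name are assumptions for illustration, and a finding here is only a hint, since plenty of innocent files have trailing whitespace.

```python
# A few invisible code points worth flagging (not an exhaustive list).
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def quick_text_checks(text: str) -> list:
    """Return a list of easy-to-detect hiding techniques seen in text."""
    findings = []
    if any(ch in ZERO_WIDTH for ch in text):
        findings.append("zero-width characters")
    if any(line != line.rstrip(" \t") for line in text.split("\n")):
        findings.append("trailing whitespace")
    return findings


print(quick_text_checks("hello\nworld"))      # []
print(quick_text_checks("hello \t\nworld"))   # ['trailing whitespace']
print(quick_text_checks("he\u200bllo"))       # ['zero-width characters']
```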
There are various statistical methods of analyzing files, in particular binary files that are used to detect this sort of thing.
It's a bit trickier with text in a lot of cases. Currently, I think because the amount of text you're dealing with is not sufficient.
In the case of social media posts, for example, a lot of those are very short. Old tweets, for example, notoriously 140 characters.
So very short amounts of text are not going to be amenable to statistical methods.
And when you do have plenty of content like a video, it's still very tricky. The math is tricky and so it's difficult to write good software that'll analyze the files for you.
Usually, all the effort goes into step one.
So what I've been talking about here is Steganalysis. Steganalysis is basically just inspecting what you've got. I was going to say the file.
Steganalysis is just an inspection technique that I'm talking about to try to determine whether or not there's something hidden in there.
So if you just have an image on your computer and the common software does not indicate that there is something hidden in there, what would you do?
So the common approach is, if you run Linux, you probably know about strings, and binwalk is another one.
Strings will go through a file and try to identify text strings that may be words or maybe something that's not actually binary.
That could be interpreted as a string, as text. And binwalk similarly allows you to walk through, I guess that's why they call it binwalk, to walk through a binary file.
Looking for things. Those are very useful. And of course, the hex editor is essential. The text steganography methods I was mentioning before, none of those are going to hide the data from a hex editor.
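To show roughly what a tool like strings is doing, here is a minimal Python imitation. It is not the real utility and has none of its options; it just assumes that a "string" is a run of at least four printable ASCII bytes. The function name and the sample bytes are made up.

```python
import re


def find_strings(data: bytes, min_len: int = 4) -> list:
    """Return runs of at least min_len printable ASCII bytes."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]


# Pretend binary file with a message appended after the end marker.
blob = b"\xff\xd8\x00\x01binary\x00\x02\xff\xd9meet at noon"
print(find_strings(blob))  # ['binary', 'meet at noon']
```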
And then there are more OSINT kind of techniques, open source intelligence. For example, if you have an image, you could do a reverse search on the internet
to find the original image and compare it to the one that you have. Even if they're not visually distinct, there are software ways that you can evaluate the difference between the two images that look the same, that are conceptually a lot like diff, if any of you are familiar with that,
that old Unix text tool for programmers comparing two text files and literally showing you the difference, so that's called diff.
So a binary version of that, for a particular image format, is something you could do if the images are visually the same, but not the same, and you suspect that a message is hidden in there, and yet the tools that you use to check for LSB encoding don't seem to give you any results.
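A bare-bones version of that comparison might look like the Python below. It assumes you have already decoded both images into equal-length sequences of color values (the decoding step depends on the image format and is left out), and then reports which values differ and whether every difference is confined to the lowest bits, which would be consistent with LSB encoding. The function names are hypothetical.

```python
def diff_values(original: bytes, suspect: bytes) -> list:
    """List (position, original value, suspect value) for every mismatch."""
    if len(original) != len(suspect):
        raise ValueError("inputs must be the same length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(original, suspect)) if a != b]


def only_low_bits_changed(original: bytes, suspect: bytes, n_bits: int = 1) -> bool:
    """True if there are differences and all of them sit in the low n_bits."""
    mask = 0xFF & ~((1 << n_bits) - 1)  # keeps everything except the low bits
    changes = diff_values(original, suspect)
    return bool(changes) and all((a & mask) == (b & mask) for _, a, b in changes)


original = bytes([10, 20, 30, 40])
suspect = bytes([11, 20, 31, 40])
print(diff_values(original, suspect))            # [(0, 10, 11), (2, 30, 31)]
print(only_low_bits_changed(original, suspect))  # True
```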
So some of these tool sets, I mentioned zsteg before, there's Stegsolve, Stegdetect, and zsteg, which I mentioned, I think is just for PNG files.
There are other toolkits though, like I said, I don't know what the latest and greatest is, you should probably try them all, honestly, because there's not that many.
There really is a distinct lack of tools to check for the less common things as well, like audio and video.
There's really nothing out there that you can use.
So let's see, I'm sure I left out some very important basics, because I was too lazy to really get organized ahead of this, and to make sure I had a bunch of bullet points to avoid talking about the obvious basics that I should have mentioned.
But whatever it is I missed, for the three of you that listened all the way to the end bravely here,
I guess I'll just need a follow-up episode.
Feel free to leave comments, and tell me what I got wrong, tell me what I forgot that definitely should have been mentioned.
Obviously, you know, I only scratched the surface here, this is the tip of the iceberg, because you get into the weeds really quickly when you talk about this.
And like I said before, it's always related to the file formats, or the specific protocols, or the technical details always matter.
So it is difficult to talk about in a general way.
Hiding stuff in plain sight, like hiding a message in an image that's posted on a social media account, is what it's all about, though.
So thanks for listening, let me know in the comments, see you next time.
You have been listening to Hacker Public Radio at HackerPublicRadio.org.
Today's show was contributed by an HPR listener like yourself.
If you ever thought of recording a podcast, then click on our contribute link to find out how easy it really is.
Hosting for HPR has been kindly provided by AnHonestHost.com, the Internet Archive, and rsync.net.
Unless otherwise stated, today's show is released under a Creative Commons Attribution 4.0 International License.