Episode: 2637
Title: HPR2637: Convert it to Text
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr2637/hpr2637.mp3
Transcribed: 2025-10-19 06:53:05

---

This is HPR Episode 2637 entitled Convert It To Text.
It is hosted by b-yeezi and is about 16 minutes long and carries a clean flag.
The summary is, this episode will make you want to de-exe all the things.
This episode of HPR is brought to you by AnHonestHost.com.
Get 15% discount on all shared hosting with the offer code HPR15.
That's HPR15.
Better web hosting that's honest and fair at AnHonestHost.com.
Hello Hacker Public Radio fans, this is b-yeezi once again bringing you another episode.
This time I'm going to talk about a subject that's near and dear to my heart.
But before I get there, I want to just say that I am recording this on a new microphone
that I got on Amazon that was on sale with a kit that had a whole bunch of stuff. It's an
Audio-Technica ATR2100-USB, which has both a USB and an XLR connection.
And it also came with a second XLR connection and an anti-popping screen for like $70.
So I don't know why it was like that, but I got it.
And my wife gave me a little side eye when I got it because it's completely unnecessary,
but I couldn't let the deal pass me up.
Anyway, let's get to the episode.
So this episode is about converting it to text: whatever it is, convert it to text.
And you might wonder, why would you want text?
You have spent the last, I don't know, 20, 30 years of computer science making all these
different formats for being able to see and visualize information in different ways.
Why would you want it in text, and plain text?
And I have a couple of reasons.
The main reason is portability.
We already know that Microsoft has changed their standard from DOC to DOCX, and they've
changed DOCX several different times.
And ODS is a good standard, but it's not portable everywhere because not everyone supports
it.
But pretty much anything that is at all text-based has some type of ASCII characters associated
with it.
It's also very useful to use text with the basic Unix tool set, because the Unix philosophy is
treating everything as a file, and being plain text means that you can chain tools together and
make really interesting and complex systems out of very simple technologies underneath.
And another reason, and the main reason that I'm going to focus on right now, is that because
of what you can do with the Unix tools, there are tools that are built on top of those
tools for visualizing things.
My favorite one of which is called Ranger.
And Ranger is, from their website which is on savannah.nongnu.org, a free console
file manager, which is what I use it for.
It's a console file manager that gives you greater flexibility and a good overview of
your files without having to leave your *nix console.
It visualizes the directory tree in two dimensions, the directory hierarchy on one, lists of files
on the other, with a preview to the right so that you know where you'll be going.
And so the idea is that the entire file system appears in three panes, and those three panes,
like it describes: the first pane is the context of where you've been, the middle pane is
where you are, and to the right is either where you are going if you go into the next level
of the tree or a preview of the file that you're on.
So if what you're currently selecting is a file, it's a preview of the file, and if it
is a directory, then it's a list of what is in that directory.
And so usually you would think it would be limited to just being able to look at plain
text files.
And I, maybe I'll include a screenshot of what it looks like.
You just have to see it to believe it.
You can go through your entire file system and see what every file has in it, pretty much
without ever having to open up a GUI file manager and then having to double click on the
file and wait for whatever software it usually takes to load and look at that file.
It's just, it's just really amazing.
And then you can always click over or hit enter one more time and actually edit that file
or open it with whatever your session manager natively opens it with, which I think uses xdg-open
to open things.
So let me just talk about how I use it.
So the biggest draw for me, the function of it that is really powerful, is the scope
functionality.
And it's all wrapped up in a file called scope.sh, which is just bash.
And it's in your home directory at ~/.config/ranger/scope.sh, and it comes out of the box
with a bunch of different things.
If you read the documentation, it has a bunch of different things.
If you install additional programs, it works natively.
But since all it really takes is the ability to take
whatever file and convert it to text and to output that text as a dump, like a dump to
standard out, if you have that ability, no matter what you use, besides what they have
given you out of the box, you can build upon those scopes.
And like I said, the scope is basically a big switch statement based on either file extension
or MIME type.
And if it's a MIME type of text, it'll try to use, if you have the program called highlight
installed, it'll try to do highlighting of that file based on file extension.
So if it's a .pl file, it'll do Perl highlighting.
If it's a .py file, it'll do Python.
If it's a .sh file, it'll do bash.
So just for that reason, it's amazing.
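As a rough illustration of the "big switch statement" idea, here is a minimal Python sketch. It is not ranger's scope.sh: the lookup tables, the function names, and the exact command arguments are assumptions chosen for illustration, and only the tool names come from the episode.

```python
import mimetypes
import shutil
import subprocess
from pathlib import Path

# Hypothetical lookup tables. "{path}" is a placeholder for the file name.
# The command arguments are assumptions; check each tool's manual before use.
BY_EXTENSION = {
    ".pdf": ["pdftotext", "{path}", "-"],
    ".zip": ["atool", "--list", "{path}"],
    ".tar": ["atool", "--list", "{path}"],
    ".doc": ["catdoc", "{path}"],
    ".html": ["w3m", "-dump", "{path}"],
}

BY_MIME_FAMILY = {
    "text": ["highlight", "--out-format=ansi", "{path}"],
    "audio": ["mediainfo", "{path}"],
    "video": ["mediainfo", "{path}"],
}


def choose_command(path):
    """Pick a text-dumping command: file extension first, then MIME family."""
    extension = Path(path).suffix.lower()
    template = BY_EXTENSION.get(extension)
    if template is None:
        mime, _encoding = mimetypes.guess_type(str(path))
        family = mime.split("/")[0] if mime else None
        template = BY_MIME_FAMILY.get(family)
    if template is None:
        return None
    return [str(path) if part == "{path}" else part for part in template]


def preview(path):
    """Run the chosen command and return its standard output, or None."""
    command = choose_command(path)
    if command is None or shutil.which(command[0]) is None:
        return None  # no rule for this file, or the tool is not installed
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    return result.stdout
```

Adding a new file type in this sketch means adding one entry to a table, which mirrors the point made above: any program that can dump a file to standard out as text can become a preview rule.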
But say you don't deal with those types of files all the time; maybe you
have tar files and zip files.
Well, if you install atool, atool standing for archive tool, that will make it so that
it will automatically preview the contents inside of a zip file.
So without having to open the zip file or the tar file, you'll be able to see it.
Or if you're like me, and you do like to have plain text files, but they're
really big, and you gzip them, it'll gunzip them and put them on standard out,
right on the screen for you.
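The gzip case can be sketched in a few lines of Python. This is only an illustration of the idea of decompressing on the fly and showing the start of the file; the function name and the 20-line default are assumptions, not how ranger does it.

```python
import gzip


def preview_gzip(path, max_lines=20):
    """Return the first max_lines lines of a gzip-compressed text file."""
    lines = []
    # "rt" decompresses and decodes to text while reading, so the whole
    # file never has to be unpacked to disk or loaded into memory.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if len(lines) >= max_lines:
                break
            lines.append(line.rstrip("\n"))
    return lines
```

Because the file is read line by line, the preview stays quick even when the compressed file is large.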
If you have another tool called poppler-utils installed, any PDF that you go over, if it's
text inside of it, not a scanned image, but if it's text inside of it, it'll do pdftotext
and put text on the screen for you.
So you can read the contents of PDF files.
If you have caca-utils installed, it can do one of two things.
It can either do ASCII art of any images that you have, which is pretty cool just to see
it do ASCII art of whatever files you have.
But for certain window environments and terminals, it'll also do the actual picture in your
terminal.
It doesn't work on GNOME, but it works on MATE, it works on LXDE, if you have,
so I think it's a limitation of Mutter.
And even if I do it on MATE, if I use a fancy, like if I use Compiz,
I don't think it works.
But if you use Marco, or you don't use any compositing, and use the built-in MATE terminal,
you can actually see the pictures, which is pretty cool.
And if you don't have that, it'll still show ASCII art.
And then mediainfo, it's another thing: if you have anything
that's a media file, it'll use mediainfo to look at things like the size, the encoding
type for the audio and for the video.
And then for HTML files, it'll either use lynx or w3m or elinks to preview that
file in plain text, because those are plain text web browsers.
So that's out of the box.
But like I said, I have added a couple of things.
And I'll include my scope.sh in the show notes as well.
But the things that I've added, and maybe you just want to use these tools separately,
because they're useful tools.
There's one called catdoc, and catdoc will turn any .doc or .xls file to either
txt or CSV, which is very useful.
There's catppt, which will turn any PowerPoint presentation into text.
There's odt2txt, which will turn ODT files into text.
There's ods2tsv, which will turn ODS files to tab-delimited files.
And then for the newer extensions and file formats for Microsoft
Office products, there's docx2txt and xlsx2csv, which will turn those file types
into text and CSV files. And, you know,
those types of files are pretty much, I don't know, maybe 98%, 99% of
all the files I've ever opened, ever: either plain text or one of those types of files.
And so in the little bit of time where it's not those files, I'll just, you know, have
to open them.
But for the most part, I can't stand opening a GUI-based file manager because
it takes too long to find anything.
And it also has, you know, the ability to bookmark items, and then it has Vim key
bindings so that you can go up, down, left, right, as in the Vim style, but it also has
Vim-style marks so that you can mark a file and then go to a different place and then come
back to that mark.
It also allows you to do tabbed browsing so you can open up a tab, and you can highlight
multiple items by hitting the spacebar and do dd, which is delete, but it's really cut,
and then you go to another place and type pp and it'll paste it.
Or if you go yy and pp, it'll do yank and paste, like Vim bindings do. Oh my goodness.
So you just have to try it.
If you are into doing things on the console, you really just have to try Ranger.
I introduced it to someone at SCaLE two years ago, and the look on the guy's face when
he started playing with it was just amazing, and it was really great to be able to bring
that to someone.
So that's the bulk of my episode, but I wanted to bring a couple of bonus tools that I use
to process text.
So along with tools like awk and sed, which I use all the time, and things like diff and
vimdiff and things like that that I use all the time, there are three other tools that are very
useful for messing with semi-structured data. And I'm not going to go into what
semi-structured data means, but the idea is things that kind of have a structure but are
not a relational database type deal.
So those three items are XMLStarlet, jq and q. I think there has been an episode on
XMLStarlet before, which is a way to parse XML files, and that's very useful.
So you can do things like select specific tags and look for specific values and all
types of fun things with XML.
On the limited occasions where I have to deal with XML, it's been very helpful.
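To make "select specific tags and look for specific values" concrete, here is a small Python sketch using the standard library's ElementTree, which supports a limited subset of XPath. It is not XMLStarlet, and the sample document, element names, and function names are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, invented for this example.
SAMPLE = """<shows>
  <show id="1" explicit="no"><title>First show</title></show>
  <show id="2" explicit="yes"><title>Second show</title></show>
  <show id="3" explicit="no"><title>Third show</title></show>
</shows>"""


def all_titles(root):
    """Select a specific tag: the text of every <title> under a <show>."""
    return [title.text for title in root.findall("./show/title")]


def ids_where(root, attribute, value):
    """Look for a specific value: ids of shows whose attribute equals value."""
    matches = root.findall(f"./show[@{attribute}='{value}']")
    return [show.get("id") for show in matches]


root = ET.fromstring(SAMPLE)
titles = all_titles(root)
clean_ids = ids_where(root, "explicit", "no")
```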
jq stands for JSON query and it is similar, it was influenced by XMLStarlet, but it
works on JSON files, and that is something I do work with pretty often.
So it doesn't use XPath, but it uses a similar type of formatting for querying JSON files.
So you can look for a specific value, you can look for array types, you can do all types
of things, and there's a lot of functionality that I haven't used in it.
So it's a pretty broad tool, but it's very powerful and I've only really scratched the surface.
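The same two kinds of query, pulling out one value and filtering an array, can be sketched in plain Python with the json module. This is not jq; the sample document and the helper names are invented, and the jq filters in the comments are only rough equivalents.

```python
import json

# Hypothetical sample document, invented for this example.
SAMPLE = """
{"store": "corner shop",
 "items": [{"name": "milk", "price": 2.5},
           {"name": "bread", "price": 3.0},
           {"name": "gum", "price": 0.5}]}
"""


def get_path(document, *steps):
    """Follow dictionary keys and list indexes in order."""
    current = document
    for step in steps:
        current = current[step]
    return current


def names_over(document, minimum):
    """Names of the items whose price is above minimum."""
    return [item["name"] for item in document["items"] if item["price"] > minimum]


data = json.loads(SAMPLE)
# Roughly what the jq filter  .items[0].name  would select.
first_name = get_path(data, "items", 0, "name")
# Roughly what  .items[] | select(.price > 1) | .name  would select.
pricey = names_over(data, 1.0)
```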
But another one that I really like is called q. And although I do like to use sed and
awk, when I found q, it just made it so the times I have to use
awk especially are a lot fewer, because q gives you the ability to write SQL against CSV files
or any type of delimited file.
So if you have a file like, I don't know, grocery list.csv or grocery spending.csv, you
can do a q with some options to make sure that you have the headers and the separators
right, and then do, inside of double quotes, select item type, comma,
sum of price from the name of that CSV file, group by the category, and it will parse the
CSV file, and it's very fast.
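The grouped sum described above can be sketched with Python's csv and sqlite3 modules. This is a stand-in for the idea of running SQL over a CSV file, not q's own command line; the column names, the table name, and the sample rows are assumptions made up for the example.

```python
import csv
import io
import sqlite3

# Hypothetical contents of a grocery spending file, invented for this example.
SAMPLE_CSV = """item_type,item,price
dairy,milk,2.50
dairy,cheese,3.00
produce,apples,1.25
produce,pears,1.75
"""


def sum_price_by_type(csv_text):
    """Load CSV rows into an in-memory table and total the price per type."""
    reader = csv.DictReader(io.StringIO(csv_text))  # first row is the header
    rows = [(row["item_type"], float(row["price"])) for row in reader]
    connection = sqlite3.connect(":memory:")
    try:
        connection.execute("CREATE TABLE spending (item_type TEXT, price REAL)")
        connection.executemany("INSERT INTO spending VALUES (?, ?)", rows)
        cursor = connection.execute(
            "SELECT item_type, SUM(price) FROM spending "
            "GROUP BY item_type ORDER BY item_type"
        )
        return cursor.fetchall()
    finally:
        connection.close()


totals = sum_price_by_type(SAMPLE_CSV)
```

The appeal of a tool like q is that the loading step disappears: the file name goes directly in the FROM clause and the query is a one-liner at the shell.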
So that's the other thing I like about working with plain text: until you get
into the really big files, it stays fast. I've actually used q just recently to do some aggregation functions
on a file that was a megabyte and it was like instantaneous. I've used it on files that
were up to 10 megabytes and it still basically has no lag on a regular i5 laptop
processor. I haven't really tested it on anything really big, but for the majority
of my needs, I will just use q, and if the files are any bigger
than that, I've been trying to move them out of CSV and to HDF5 when
possible, because binary formats load a lot faster in a lot of the data science programs
that I write nowadays. But for the small things, q does great, like just for doing data
quality on a file that someone sends me. I'll look for distinct values on
a column that's supposed to only have a couple of values, and, you know, I'll look for missing
values, I'll look for the length of different things, and I'll see if there are bad characters
in there. So if it's supposed to be comma-delimited and
there are commas inside of the values, it'll expose all that kind of stuff. So I can't say
enough about q. But that being said, I hope you found this episode interesting, and like
I said, in the show notes, I'm going to at least put a snippet from my scope.sh. If you
want to see the entire thing, just put it in the comments. And I encourage you to check
out Ranger and, if not that, at least some of these tools that will
turn different file types into text. So you've been listening to Hacker Public Radio,
and as I say, keep hacking.
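The data-quality checks mentioned in this segment (distinct values, missing values, value lengths, and stray commas in a comma-delimited file) can be sketched in Python as follows. This is an illustration under assumed names and sample data, not the host's actual queries.

```python
import csv
import io
from collections import Counter

# Hypothetical file contents; the last row has an unquoted extra comma.
SAMPLE_CSV = "item,status\napple,ok\npear,\nplum,ok\nfig,bad,extra\n"


def profile_column(csv_text, column):
    """Distinct values, missing count, and longest value for one column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    values = [row[column] for row in reader]
    present = [value for value in values if value and value.strip()]
    return {
        "distinct": dict(Counter(present)),
        "missing": len(values) - len(present),
        "max_length": max((len(value) for value in present), default=0),
    }


def lines_with_wrong_field_count(csv_text, delimiter=","):
    """Line numbers whose raw delimiter count differs from the header's.

    This deliberately ignores quoting, so a properly quoted comma is also
    flagged; it is a quick smell test, not a CSV validator.
    """
    lines = csv_text.splitlines()
    expected = lines[0].count(delimiter)
    return [
        number
        for number, line in enumerate(lines[1:], start=2)
        if line.count(delimiter) != expected
    ]


status_profile = profile_column(SAMPLE_CSV, "status")
bad_lines = lines_with_wrong_field_count(SAMPLE_CSV)
```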
You've been listening to Hacker Public Radio at Hacker Public Radio dot org. We are a community
podcast network that releases shows every weekday, Monday through Friday. Today's show, like all
our shows, was contributed by an HPR listener like yourself. If you ever thought of recording a
podcast, then click on our contribute link to find out how easy it really is. Hacker Public Radio
was founded by the Digital Dog Pound and the Infonomicon Computer Club, and is part of the binary
revolution at binrev.com. If you have comments on today's show, please email the host directly,
leave a comment on the website or record a follow-up episode yourself. Unless otherwise stated,
today's show is released under the Creative Commons Attribution-ShareAlike 3.0 license.