Episode: 2637
Title: HPR2637: Convert it to Text
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr2637/hpr2637.mp3
Transcribed: 2025-10-19 06:53:05
---
This is HPR Episode 2637 entitled Convert It To Text.
It is hosted by b-yeezi, is about 16 minutes long, and carries a clean flag.
The summary is, this episode will make you want to de-exe all the things.
This episode of HPR is brought to you by AnHonestHost.com.
Get 15% discount on all shared hosting with the offer code HPR15.
That's HPR15.
Better web hosting that's honest and fair at AnHonestHost.com.
Hello Hacker Public Radio fans, this is b-yeezi once again bringing you another episode.
This time I'm going to talk about a subject that's near and dear to my heart.
But before I get there, I want to just say that I am recording this on a new microphone
that I got on Amazon that was on sale with a kit that had a whole bunch of stuff. It's an
Audio-Technica ATR2100-USB, which has both a USB and an XLR connection.
And it also came with a second XLR connection and an anti-popping screen for like $70.
So I don't know why I was like that but I got it.
And my wife gave me a little side eye when I got it because it's completely unnecessary
but I couldn't let the deal pass me up.
Anyway, let's get to the episode.
So this episode is about converting it to text: whatever it is, convert it to text.
And you might wonder why would you want text?
You have spent the last, I don't know, 20, 30 years of computer science making all these
different formats of being able to see and visualize information in different ways.
Why would you want it in text and plain text?
And I have a couple of reasons.
The main reasons are for portability.
We already know that Microsoft has changed their standard from Doc to DocX and they've
changed DocX several different times.
And ODS is a good standard but it's not portable to everywhere because not everyone supports
it.
But pretty much anything that is at all text-based has some type of ASCII characters associated
with it.
It's also very useful to use text with the basic Unix tool set because Unix philosophy is
using everything as a file and being plain text means that you can chain tools together and
make really interesting and complex systems out of very simple technologies underneath.
And another reason and the main reason that I'm going to focus on right now is because
of what you can do with the Unix tools, there are tools that are built on top of those
tools for visualizing things.
My favorite one of which is called Ranger.
And Ranger is, from their website, which is on savannah.nongnu.org, a free console
file manager, which is what I use it for.
It's a console file manager that gives you greater flexibility and a good overview of
your files without having to leave your *nix console.
It visualizes the directory tree in two dimensions: the directory hierarchy on one, lists of files
on the other, with a preview to the right so that you know where you'll be going.
And so the idea is that the entire file system appears in three panes and those three panes
like it describes, the first pane is the context of where you've been, the middle pane is
where you are and to the right is either where you are going if you go into the next level
of the tree or a preview of the file that you're on.
So if what you're currently selecting is a file, it's a preview of the file and if it
is a directory then it's a list of what is in that directory.
And so usually you would think it would be limited to just being able to look at plain
text files.
And I, maybe I'll include a screenshot of what it looks like.
You just have to see it to believe it.
You can go through your entire file system and see what every file has in it pretty much
without ever having to open up a GUI file manager and then having to double click on the
file and waiting for whatever usual software it takes to load and look at that file.
It's just, it's just really amazing.
And then you can always click over or hit enter one more time and actually edit that file
or open it with its native application, which I think uses xdg-open
to open things.
So let me just talk about how I use it.
So the biggest draw for me, the functionality that is really powerful, is the scope
functionality.
And it's all locked up in a file called scope.sh, which is just bash.
And it's in your home directory at ~/.config/ranger/scope.sh, and it comes out of the box
with a bunch of different things.
If you read the documentation, it has a bunch of different things.
If you install additional documents, additional programs, it works natively.
But since all it really takes is the ability to take
whatever file, convert it to text, and output that text as a dump to
standard output, if you have that ability, no matter what you use besides what they have
given you out of the box, you can build upon those scopes.
And like I said, the scope is basically a big switch statement based on either file extension
or mime type.
And if it's a mime type of text and you have the program called highlight
installed, it'll try to do highlighting of that file based on file extension.
So if it's a .pl file, it'll do Perl highlighting.
If it's a .py file, it'll do Python.
If it's a .sh file, it'll do bash.
So just for that reason, it's amazing.
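
The "big switch statement" idea can be sketched roughly as follows. This is not the real scope.sh (which is bash); it is a minimal Python illustration, and the handler tables and the function name choose_preview are hypothetical names invented for the example. It only picks a preview strategy; it does not run any external tool.

```python
import mimetypes
from pathlib import Path

# Hypothetical table: file extension -> preview strategy label.
EXTENSION_HANDLERS = {
    ".zip": "archive listing",
    ".tar": "archive listing",
    ".gz": "decompressed text",
    ".pdf": "pdf text",
    ".htm": "text browser dump",
    ".html": "text browser dump",
}

# Hypothetical fallback table keyed on the major MIME type.
MIME_HANDLERS = {
    "text": "syntax highlight",
    "image": "ascii art",
    "audio": "media info",
    "video": "media info",
}


def choose_preview(path: str) -> str:
    """Pick a preview strategy: try the extension first, then the MIME type."""
    suffix = Path(path).suffix.lower()
    if suffix in EXTENSION_HANDLERS:
        return EXTENSION_HANDLERS[suffix]
    # guess_type only looks at the file name; it does not read the file.
    mime, _encoding = mimetypes.guess_type(path)
    if mime is not None:
        major = mime.split("/", 1)[0]
        if major in MIME_HANDLERS:
            return MIME_HANDLERS[major]
    return "no preview"
```

Adding support for a new file type then amounts to adding one more entry that maps an extension to something that can dump text.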
But say you don't deal with those types of files all the time; maybe
you have tar files and zip files.
Well, if you install atool, atool standing for archive tool, that will make it so that
it will automatically preview the contents inside of a zip file.
So without having to open the zip file or the tar file, you'll be able to see it.
Or if you're like me, and you do like to have plain text files, but they're
really big and you gzip them, it'll gunzip them and put them on standard out,
right on the screen for you.
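
As a rough illustration of previewing a gzipped text file without unpacking it on disk, here is a small Python sketch. It is an assumption about how such a preview could work, not what scope.sh actually runs; the function name preview_gzip_text and the 40-line default are made up for the example.

```python
import gzip
from itertools import islice


def preview_gzip_text(path: str, max_lines: int = 40) -> str:
    """Return the first max_lines lines of a gzip-compressed text file."""
    # Mode "rt" decompresses and decodes as it reads, so only the lines
    # that are actually previewed get pulled out of the archive.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return "".join(islice(handle, max_lines))
```

Because the file is read lazily, the preview stays quick even when the compressed file is large.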
If you have another tool called poppler-utils installed, any PDF that you go over, if it's
text inside of it, not a scanned image, it'll run pdftotext
and put the text on the screen for you.
So you can read the contents of PDF files.
If you have caca-utils installed, it can do one of two things.
It can either do ASCII art of any images that you have, which is pretty cool just to see
it do ASCII art of whatever files you have.
But for certain window environments and terminals, it'll also do the actual picture in your
terminal.
It doesn't work on GNOME, but it works on MATE and it works on LXDE,
so I think it's a limitation of Mutter.
And even on MATE, if I use something fancy, like if I use Compiz,
I don't think it works.
But if you use Marco, or you don't use any compositing, and use the built-in MATE terminal,
you can actually see the pictures, which is pretty cool.
And if you don't have that, it'll still show ASCII art.
And then there's mediainfo, another thing: if you have anything
that's a media file, it'll use mediainfo to look at things like the size and the encoding
type for the audio and for the video.
And then for HTML files, it'll use either lynx, w3m, or elinks to preview that
file in plain text, because those are plain-text web browsers.
So that's out of the box.
But like I said, I have added a couple of other things.
And I'll include my scope.sh in the show notes as well.
But the things that I've added, and maybe you just want to use these tools separately,
because they're useful tools.
There's one called catdoc, and catdoc will turn any .doc or .xls file into either
txt or CSV, which is very useful.
There's catppt, which will turn any PowerPoint presentation into text.
There's odt2txt, which will turn ODT files into text.
There's ods2tsv, which will turn ODS files into tab-delimited files.
And then for the newer extensions and file formats for Microsoft
Office products, there's docx2txt and xlsx2csv, which will turn those file types
into text and CSV files. And, you know,
those types of files are pretty much, I don't know, maybe 98% or 99% of
all the files I have ever opened: they are either plain text or one of those types of files.
And so in the little bit of time where it's not those files, I'll just, you know, have
to open them.
But for the most part, I can't even, I can't stand opening a GUI-based file manager because
it takes too long to find anything.
And it also has, you know, the ability to bookmark items, and it has Vim key
bindings so that you can go up, down, left, right in the Vim style, but it also has
Vim-style marks so that you can mark a file and then go to a different place and then come
back to that mark.
It also allows you to do tabbed browsing, so you can open up a tab, and you can highlight
multiple items by pressing the spacebar and do dd, which is delete, but it's really cut,
and then you go to another place and type pp and it'll paste it.
Or if you go yy and pp, it'll do yank and paste, like Vim bindings do. Oh my goodness.
So you just have to try it.
If you are into doing things on the console, you really just have to try Ranger.
I introduced it to someone at SCaLE two years ago and the look on the guy's face when
he started playing with it was just amazing and it was really great to be able to bring
that to someone.
So that's the bulk of my episode, but I wanted to bring a couple bonus tools that I use
to process text.
So along with tools like awk and sed, which I use all the time, and things like diff and
vimdiff that I use all the time, there are three other tools that are very
useful for messing with semi-structured data. I'm not going to go into what
semi-structured data means, but the idea is things that kind of have a structure but are
not a relational database type deal.
So those three items are XMLStarlet, jq, and q. I think there has been an episode on
XMLStarlet before, which is a way to parse XML files, and that's very useful.
So you can do things like select specific tags and look for specific values and all
types of fun things with XML.
On the limited occasions where I have to deal with XML, it's been very helpful.
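
To make "select specific tags and look for specific values" concrete, here is a small sketch using Python's standard xml.etree.ElementTree module as a stand-in. It is not XMLStarlet syntax, and the sample document, the element names, and the function name titles_where are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up sample document, for illustration only.
SAMPLE_XML = """
<episodes>
  <episode id="1" explicit="no"><title>First show</title></episode>
  <episode id="2" explicit="yes"><title>Second show</title></episode>
  <episode id="3" explicit="no"><title>Third show</title></episode>
</episodes>
"""


def titles_where(xml_text: str, attribute: str, value: str) -> list[str]:
    """Return the <title> text of every <episode> whose attribute equals value."""
    root = ET.fromstring(xml_text)
    # ElementTree supports a limited XPath subset, including [@attr='value'].
    matches = root.findall(f".//episode[@{attribute}='{value}']")
    return [episode.findtext("title", default="") for episode in matches]
```

Calling titles_where(SAMPLE_XML, "explicit", "no") picks out the first and third titles from the sample.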
jq stands for JSON query, and it is similar; it was influenced by XMLStarlet, but it
works on JSON files, and that is something I do work with pretty often.
So it doesn't use XPath, but it uses a similar type of formatting for querying JSON files.
So you can look for a specific value, you can look for array types, you can do all types
of things, and there's a lot of functionality that I haven't used in it.
So it's a pretty broad tool but it's very powerful and I've only really scratched the surface.
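
The kind of question a JSON query answers can be sketched in Python with the standard json module. This is only a stand-in to show the idea of filtering on a value inside an array; it is not jq's filter language, and the sample data and the function name episodes_tagged are hypothetical.

```python
import json

# Made-up sample data, for illustration only.
SAMPLE_JSON = """
{"show": "example", "episodes": [
  {"id": 1, "tags": ["text", "cli"], "minutes": 16},
  {"id": 2, "tags": ["audio"], "minutes": 42},
  {"id": 3, "tags": ["cli"], "minutes": 9}
]}
"""


def episodes_tagged(json_text: str, tag: str) -> list[int]:
    """Return the ids of all episodes whose "tags" array contains tag."""
    data = json.loads(json_text)
    return [episode["id"] for episode in data["episodes"] if tag in episode["tags"]]
```

A dedicated query tool expresses the same walk over keys and arrays as a one-line filter on the command line instead of a function.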
But another one that I really like is called q. And although I do like to use sed and
awk, when I found q, it made a big difference: it just makes the times I have to use
awk especially a lot fewer, because q gives you the ability to write SQL against CSV files
or any type of delimited file.
So if you have a file like, I don't know, grocery list.csv or grocery spending.csv, you
can run q with some options to make sure that you have the headers and the separators
right, and then, inside of double quotes, select item type, comma,
sum of price from the name of that CSV file, group by the category, and it will parse the
CSV file, and it's very fast.
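
The idea of running SQL against a CSV file can be sketched with Python's csv and sqlite3 modules. This is a stand-in for illustration, not q's own command line or implementation; the column names (item, category, price), the sample rows, and the function name sum_price_by_category are assumptions made up for the example.

```python
import csv
import io
import sqlite3

# Made-up grocery data with a header row, for illustration only.
SAMPLE_CSV = """item,category,price
apples,produce,3.50
bread,bakery,2.25
carrots,produce,1.75
bagels,bakery,4.00
"""


def sum_price_by_category(csv_text: str) -> list[tuple[str, float]]:
    """Load CSV rows into an in-memory table and total the price per category."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE groceries (item TEXT, category TEXT, price REAL)")
        # Named placeholders are filled from each row dict; the REAL column
        # converts the price strings to numbers on insert.
        conn.executemany(
            "INSERT INTO groceries VALUES (:item, :category, :price)", rows
        )
        cursor = conn.execute(
            "SELECT category, SUM(price) FROM groceries "
            "GROUP BY category ORDER BY category"
        )
        return cursor.fetchall()
    finally:
        conn.close()
```

With the sample data this yields one total per category: 6.25 for bakery and 5.25 for produce.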
So that's the other thing I like about working with plain text: until you get
into really big files, it's fast. I've actually used q just recently to do some aggregation functions
on a file that was a megabyte and it was like instantaneous. I've used it on files that
were up to 10 megabytes and it still basically has no lag on a regular i5 laptop
processor. I haven't really tested it on anything really big, but for the majority
of my needs, I will just use q, and if the files are any bigger
than that, I've been trying to move them out of CSV and into HDF5 when
possible, because binary formats load a lot faster in a lot of the data science programs
that I write nowadays. But for the small things, q does great. Like, just for doing data
quality on a file that someone sends me, I'll look for distinct values on
a column that's supposed to only have a couple of values, and, you know, I'll look for missing
values, I'll look for the length of different things, and I'll see if there are bad characters
in there. So if it's supposed to be comma-delimited and
there are commas inside of the values, it'll expose all that kind of stuff. So I can't say
enough about q. But that being said, I hope you found this episode interesting, and like
I said, in the show notes, I'm going to at least put a snippet from my scope.sh. If you
want to see the entire thing, just say so in the comments. And I encourage you to check
out Ranger and, if not that, at least some of these tools that will
turn different file types into text. So you've been listening to Hacker Public Radio
and, as I say, keep hacking.
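
The data-quality checks described in the episode (distinct values in a column, missing values, value lengths, and stray commas in a comma-delimited file) might look roughly like this in Python. The host does these with q; this sketch uses only the standard library, and the sample data and the function names profile_column and rows_with_wrong_field_count are hypothetical.

```python
import csv
import io
from collections import Counter

# Made-up file contents; the last row has an unquoted comma inside a value.
SAMPLE_ROWS = (
    "item,status,price\n"
    "apples,ok,3.50\n"
    "bread,,2.25\n"
    "carrots,OK,1.75\n"
    "milk, 2%,ok,0.99\n"
)


def profile_column(csv_text: str, column: str) -> dict:
    """Summarise one column: distinct values, missing count, longest value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Short rows yield None for missing fields, so normalise to "".
    values = [(row.get(column) or "") for row in reader]
    non_empty = [value for value in values if value.strip()]
    return {
        "distinct": Counter(non_empty),
        "missing": len(values) - len(non_empty),
        "max_length": max((len(value) for value in non_empty), default=0),
    }


def rows_with_wrong_field_count(csv_text: str, delimiter: str = ",") -> list[int]:
    """Return 1-based line numbers whose field count differs from the header.

    This splits naively on the delimiter, so it deliberately ignores quoting.
    """
    lines = [line for line in csv_text.splitlines() if line]
    expected = len(lines[0].split(delimiter))
    return [
        number
        for number, line in enumerate(lines[1:], start=2)
        if len(line.split(delimiter)) != expected
    ]
```

On the sample, the status column shows inconsistent spellings ("ok" versus "OK") and one missing value, and line 5 is flagged for having too many fields.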
You've been listening to Hacker Public Radio at HackerPublicRadio.org. We are a community
podcast network that releases shows every weekday, Monday through Friday. Today's show, like all
our shows, was contributed by an HPR listener like yourself. If you ever thought of recording a
podcast, then click on our contribute link to find out how easy it really is. Hacker Public Radio
was founded by the Digital Dog Pound and the Infonomicon Computer Club and is part of the binary
revolution at binrev.com. If you have comments on today's show, please email the host directly,
leave a comment on the website or record a follow-up episode yourself. Unless otherwise stated,
today's show is released under the Creative Commons Attribution-ShareAlike 3.0 license.