Episode: 2091
Title: HPR2091: Everyday Unix/Linux Tools for data processing
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr2091/hpr2091.mp3
Transcribed: 2025-10-18 14:11:41

---

This episode of HPR is brought to you by AnHonestHost.com. Get a 15% discount on all shared
hosting with the offer code HPR15, that's HPR15. Better web hosting that's honest and fair
at AnHonestHost.com.

Hello Hacker Public Radio, this is b-yeezi once again with another episode. This time we're
going to talk to you about some of the command-line and other tools I use to analyze and
process data. A lot of what I do for my day job is looking at data in various formats, whether
it be text or numbers or other types of objects, sometimes images even. It comes in all types
of formats, so a lot of what I have to do before I even start analyzing it is to get it into a
format that is more agreeable for automated data processing or automated data analysis. So I'm
just going to go through some of the tools I use. I'm not going to go through an exhaustive
list, but here we go.

In terms of cleaning up to get prepared for the right type of environment, I like to use a
program called unix2dos or dos2unix, depending on which way I'm going. It's the one program,
but it has both commands. It does stuff like change the file encoding to UTF-8, and it has
options where you can change other things. It changes the end-of-line characters and the
end-of-file characters that differ between Unix-like systems and Windows-like or DOS-like
systems. A lot of times I'll get a bunch of files that are text based and in the DOS style,
and I'll just run dos2unix on the directory and it does all of them.

Another one that is also useful is a program called detox. I talked about it briefly before, I
think, or I might have mentioned it in the comments of another podcast. detox is a utility
that renames files to make them easier to work with. This is out of the man page: it removes
spaces and other such annoyances. It also translates or cleans up Latin-1 (ISO 8859-1)
characters encoded in 8-bit ASCII, Unicode characters encoded in UTF-8, and CGI-escaped
characters. You might remember that sometimes people put spaces and things in file names, and
if you are not careful in Bash and you don't properly put quotation marks around your
variables, you'll end up parsing a part of a file name instead of the entire file name because
of the space. Then you'll get an error message like "command not found" or "No such file or
directory", because you're not looking at the entire name of the file, you're only looking at
the first set of characters before the first piece of white space. If you don't want to have
to worry about that, you can use detox and it cleans up that kind of stuff.

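The quoting problem described above can be reproduced in a few lines. This is a minimal sketch, assuming a POSIX-style shell such as Bash; the scratch directory and the file name `my notes.txt` are made up for illustration.

```shell
# Hypothetical scratch directory and file name, for illustration only.
workdir=$(mktemp -d)
printf 'hello\n' > "$workdir/my notes.txt"

file="$workdir/my notes.txt"

# Unquoted: the shell splits $file at the space, so cat is handed two
# names, neither of which exists.
if cat $file >/dev/null 2>&1; then unquoted=ok; else unquoted=failed; fi

# Quoted: the whole name is passed as a single argument.
if cat "$file" >/dev/null 2>&1; then quoted=ok; else quoted=failed; fi

echo "unquoted: $unquoted, quoted: $quoted"
rm -rf "$workdir"
```

Renaming the file so it has no spaces, which is what detox is described as doing, sidesteps the problem; quoting every variable expansion handles it without renaming anything.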
It also takes out question marks, and as one of the options you can indicate what character
you want to be the default replacement for these characters. You can take out all the question
marks, replace them with underscores, or replace them with dashes, however you want to clean
it up. There are some other options; you can look in the man page to learn more.

Another thing I use for cleaning up files a lot of times is pdftotext. A lot of times I'll get
PDF documents that were either "save as PDF" or "print to file" PDF files. Not a scanned image
file that would take OCR to decode, I'm not talking about that. I'm talking about when you
save a text file or Word file to PDF, or even from LaTeX. For those types of documents
pdftotext will do exactly what it says: it'll turn a PDF document into a text document and
just strip the text out. One thing that I like to do on top of that is use an option called
-layout, so single dash, layout. That will make pdftotext do its best job of trying to
reproduce the layout of the PDF document. If you don't do that and, say, the title is
centered, it'll just put the title all the way left justified. But if you use -layout it'll
put spaces at the beginning to make it look like it's centered. Or if you have a table, and
this is what it's really great for, it'll try to format the table by lining up the white space
and making it look like a table. Very useful, and especially useful in bulk. If I run
pdftotext over *.pdf in a directory, it'll make a whole directory full of .txt files. Why is
that good? Well, because you can do a lot of other things with text that you can't do with
PDFs, like some of the other commands that I'm going to go over, such as ack and grep and sed.

Now I'm not going to go into sed, because we've had many wonderful episodes already about sed,
and if you haven't heard any you should go back, because I've learned a lot. I've been taking
a lot from them and using it in my work, so it's very useful. ack and grep do similar types of
things. If you use the command ls, it'll list all the names of files. But if you do grep and
then a word or pattern, a regex pattern, and then a file or a group of files, or * on the
directory, or *.txt, whatever the case may be, it'll look into every document and look for
that text inside of it. And then there are different options with grep. For instance, if you
do -A 5 (capital A), whatever the pattern is, and *.txt, it'll look for that pattern and the
five lines after. So if you know that there's a pattern in the documents you're looking at,
and from this pattern down five lines is the text that you want, you can do that by using -A
number or -B number: -A number is after and -B number is before, and it's the capital letters.
The lowercase letters have a different meaning, and you can read the man page to learn about
them.

That's what I use grep for most. If I want to do -A or -B, I'll use grep; otherwise I will use
ack. I also like ack because I use Vim, and there is a plugin called ack.vim that I use inside
of Vim to run ack. So if I cd into a directory and open a file in Vim, I can do :Ack and the
pattern that I want, and it'll give me a quickfix menu with all the files that match and all
the lines, and I can go and look at every file that matches. It does the same thing on the
command line, where you put ack and the pattern, whether it's just a word or any other regex
pattern, and it'll look in there and find it and print it to the screen, or you can print it
to a file if you redirect, like we've learned in other Bash tips. So I really, really enjoy
ack. The man page describes it as a grep alternative for programmers, and even if you're not a
programmer it makes you feel good to be considered like a programmer for using this tool, so,
a little ego boost.

Now, I like to use these tools in combination. Say I need to find out if a certain phrase or
word or group of words is something I've already written down in documentation, and the
documents are PDFs. I will pdftotext all the files and then ack for the name or the subject
that I'm looking for, find out which file it's in, then actually open up that file and read
what I want. I do this with my notes. I take a lot of notes, for work, for things that I might
learn in my side time, my shopping list, and being able to just do ack vegetables on my notes
directory and find where I wrote down vegetables is really great. But also for work it has
lots of great uses there too.
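The bulk PDF-to-text step can be sketched as a small shell function. As far as I know, pdftotext takes a single input file per invocation, so this sketch loops over the PDFs instead of passing the whole glob at once. The function name `pdfs_to_text` and the `~/notes` path are hypothetical.

```shell
# Convert every PDF in a directory to a layout-preserving .txt file.
# Assumption: pdftotext (from poppler or xpdf) is installed.
pdfs_to_text() {
    dir=${1:-.}
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue      # no PDFs: the glob stays literal
        pdftotext -layout "$pdf" "${pdf%.pdf}.txt"
    done
}

# Hypothetical usage: convert the notes, then list files mentioning a word.
# pdfs_to_text ~/notes
# grep -l 'vegetables' ~/notes/*.txt
```

The final `grep -l` line stands in for the ack search described above; `ack vegetables ~/notes` would serve the same purpose where ack is installed.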
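For reference, grep's -A N prints N lines after each match and -B N prints N lines before it. A minimal sketch, assuming GNU or BSD grep; the sample file contents are made up for illustration.

```shell
# Hypothetical sample file, for illustration only.
sample=$(mktemp)
printf '%s\n' 'line 1' 'line 2' 'Ingredients' 'flour' 'sugar' 'line 6' > "$sample"

# -A 2: the matching line plus the two lines after it.
after=$(grep -A 2 'Ingredients' "$sample")

# -B 1: the matching line plus the one line before it.
before=$(grep -B 1 'Ingredients' "$sample")

printf 'after:\n%s\nbefore:\n%s\n' "$after" "$before"
rm -f "$sample"
```

When several files are searched at once, for example with `*.txt`, grep prefixes each output line with the file name, which helps when the context blocks come from different documents.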
So another tool that I use a lot is awk, and I'm pretty sure there have been some episodes on
awk as well. It's a great tool for parsing data that is in a structured format with field
separators and columns. You can do it with tab-delimited files, pipe-delimited files,
comma-delimited files; you can define the field separator. I'm not going to go into it
exhaustively here because it's a really big topic. awk is really big, and if there's a need I
can make a whole series about awk like there's a whole series about sed. I probably just
talked myself into doing it, so you might want to stay tuned for that eventually. But just
like anything else, it's a way to get your data into columns. If I have a CSV file, instead of
trying to open it in LibreOffice I'll just awk it, and if I just want the third column I'll
just take the third column out. I also use it in Bash scripts. One Bash script that I like to
use it with involves another couple of tools called wget and curl, which are command-line
utilities to interface with the internet.

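Pulling the third column out of a delimited file with awk, as described above, might look like this. It is a minimal sketch with made-up data; a plain `-F,` split is an assumption that only holds when no field contains a quoted comma.

```shell
# Hypothetical comma-separated data, for illustration only.
csv=$(mktemp)
printf '%s\n' 'id,name,state' '1,Ann,Ohio' '2,Raj,Texas' > "$csv"

# -F sets the field separator; $3 is the third field of each line.
third=$(awk -F, '{ print $3 }' "$csv")
echo "$third"

# Same idea for a pipe-delimited file: awk -F'|' '{ print $3 }' file.txt
rm -f "$csv"
```

For CSV files with quoted fields that contain commas, a real CSV parser is the safer choice.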
And so, here's a little project for you. I needed to get the definitions for a list of words.
So what do I do? I look at something like the Merriam-Webster dictionary and I see the pattern
of the URL, where there's a part in the URL where the word that you're looking for goes. Then
I'll say, okay, I'll just write a big CSV file or a big text file with the words I'm looking
for, do a for loop on every line, and do a wget of the web page with a substitution where I
add the word from that line. So for every line, take the word and put that word in the URL. A
lot of these sites are tricky, so you might have to put in a sleep 10 or a sleep 60 or
whatever you want; you might have to make it pause for a couple of seconds before sending
other requests. A lot of these sites don't want you to do this, so they'll put in something
like a delay, and if you do it too fast, all the HTML file from one of the wgets will say is
something like "don't use robots to download our data". So it's not that it's illegal, it's
just something that they don't want you to do. And then, you know, obviously reference where I
got this from: Merriam-Webster.

One of the things I like to do, either in sed or inside of Vim, instead of using unix2dos or
dos2unix, is a sed command or a Vim regular expression substitution to take out the special
characters. So it would be like sed -i, then inside of single quotes s/^M//g, then close the
single quote, *.txt. That'll take out all the DOS end-of-line characters. Another way to take
them all out is sed -i, single quote, s/\r//g, where \r is the carriage return, close the
quote, *.txt. That'll do something similar. Another tricky one is to get the last line, and I
do that in Vim more than anything else, because sometimes there'll be an end-of-file character
that you have to parse out. I'm going to get to that in a little bit.

Another thing that I use a lot is pandoc. As I've said in another podcast, I really enjoy
writing in plain text, writing in Markdown. One, because I can do cool things like grep an
entire directory, or ack the entire directory, and find out where I wrote down how to set up a
LAMP server; if I have it written down somewhere, I can find the file it's in. But also
because I just really hate messing around with formatting, and when I do mess with it I get
distracted by making sure it's all perfect, and I'm not focusing on the text I'm actually
writing. So doing something like making an H1 header with a single pound sign makes it so I
don't have to worry about whether, the next time I want a header one, it's lined up correctly
or it's the right color. I've learned from some of the other episodes how to be a good
template creator, and that just made me more into using Markdown, because now that I know that
there's a good way to do it, I would really spend, you know, 15 or 20 minutes just making a
good template before I even start writing. So if I just need to write, I will just write in
Markdown.
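The word-list download loop described earlier can be sketched as follows. The URL pattern, the `example.com` host, the function name, and the `DRY_RUN` switch are all assumptions for illustration; the episode does not give the real URL scheme. The sketch reads the list with `while read` instead of a for loop so each line is taken whole.

```shell
# Hypothetical URL pattern; example.com stands in for the real site.
base_url='https://www.example.com/dictionary'
delay=10                      # seconds to pause between requests

fetch_definitions() {
    wordlist=$1
    while IFS= read -r word; do
        [ -n "$word" ] || continue             # skip blank lines
        url="$base_url/$word"
        if [ -n "${DRY_RUN:-}" ]; then
            echo "wget -q -O $word.html $url"  # show what would be run
        else
            wget -q -O "$word.html" "$url"
            sleep "$delay"                     # pause before the next request
        fi
    done < "$wordlist"
}

# Hypothetical usage: DRY_RUN=1 fetch_definitions words.txt
```

Running it with `DRY_RUN=1` first prints the commands without touching the network, which is a cheap way to check the URL substitution before sending any requests.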
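The sed substitution for stripping carriage returns, mentioned earlier, can be tried on a throwaway file. This sketch assumes GNU sed, since both `\r` in the pattern and `-i` without a backup suffix are GNU behaviors; the file contents are made up.

```shell
# Hypothetical file with DOS (CRLF) line endings, for illustration only.
dosfile=$(mktemp)
printf 'first\r\nsecond\r\n' > "$dosfile"

# Remove every carriage return, editing the file in place.
sed -i 's/\r//g' "$dosfile"

cleaned=$(cat "$dosfile")
echo "$cleaned"
rm -f "$dosfile"
```

On systems without GNU sed, `tr -d '\r' < in.txt > out.txt` is a portable way to get the same result.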
Like I said, curl and wget: I'm not going to get into those, there are lots of things online
that talk about curl and wget. But I usually use wget if I just want to bring down an .html
file, and I'll use curl if I want to work with a REST API. I'll do a curl with -X (capital X),
then put in -H (capital H) for the headers, add headers, and then, you know, -X POST, send it,
and get the data back. curl also works with plain authentication and other authentication
methods, so if you need to log into something, find a file on an SFTP server, and bring it
down, you can do it through curl, and that way you can script it and automate it.

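A REST call with curl along the lines described above might be wrapped like this. The endpoint, header values, payload, function name, and `DRY_RUN` switch are hypothetical; only the `-X`, `-H`, `-d`, and `-s` options are standard curl flags.

```shell
# Hypothetical endpoint, for illustration only.
api_url='https://api.example.com/v1/items'

post_json() {
    payload=$1
    # -X sets the HTTP method, -H adds a request header, -d sends the body,
    # -s silences the progress meter.
    set -- -s -X POST \
        -H 'Content-Type: application/json' \
        -H 'Accept: application/json' \
        -d "$payload" "$api_url"
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "curl $*"     # show the command instead of sending it
    else
        curl "$@"
    fi
}

# Hypothetical usage: post_json '{"name":"test"}'
```

For basic authentication, curl's `-u user:password` option can be added to the argument list in the same way.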
All right, so keeping with the idea of working with text a lot, I'm going to go over just a
couple of Vim tricks that I use. I'm not going to go into all of them; Vim is a lot of things.
But one of the things that I like to do is open up a list of files that I want to work with.
So if I'm in a directory with a bunch of text files, I'll say something like vim *.txt, which
will load all of the text files into the argument list of Vim. Then you can work with them in
a very cool way: you can hit :vim, space, then forward slash, the pattern, slash, then space
and, what is that, the pound pound (##). What that will do is look for the pattern in
everything in your argument list, and it'll put the results into a little quickfix list at the
bottom. Then you can do next. I use a plugin called unimpaired, so I don't remember how to do
it, I just use the shortcut that unimpaired gives you. But you can go through every single
document. So instead of just searching one document, when you hit next it goes to the next
match, and the next, and the next, and when you get to the end of the list in this document
and hit next one more time, it goes to the next document and starts looking in that one, and
then the next one and the next one. That has been such a time saver. I've been able to take
projects that I do, where other people do the same project, and it'll take me like a third of
the time, because I can search text so fast and find the things I'm looking for, make changes
on the fly and keep going, or just see if something exists or not and, if it doesn't exist,
add it. It's been such a huge time saver.

Like I mentioned before, there's the ack plugin. A similar thing to the arglist search is
searching and replacing in the buffers in place, and there's a command called bufdo. If you
know that there's something that's in all these files and you don't want to use sed, because
you're already in Vim and you already have, like, five files open in Vim, you can do :bufdo,
which is b-u-f-d-o, then space, percent sign, s/ the pattern / the replacement /ge, then
space, pipe, space, update. It's going to go in and replace in all those files and re-save
them with the replacement in it. Very cool thing to do.

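The bufdo substitution described above is normally typed inside Vim as `:bufdo %s/pattern/replacement/ge | update`. The same command can be run non-interactively from a shell, which makes it easy to try on scratch files. This is a sketch under assumptions: a full-featured Vim is installed as `vim`, and the file names and the colour/color pattern are made up.

```shell
# Hypothetical files and pattern, for illustration only.
vimdir=$(mktemp -d)
printf 'colour one\n' > "$vimdir/a.txt"
printf 'colour two\n' > "$vimdir/b.txt"

if command -v vim >/dev/null 2>&1; then
    # -es: silent Ex (batch) mode; -N -u NONE: skip user configuration.
    # bufdo runs the substitution in every buffer; the e flag suppresses
    # the error for buffers with no match; update writes changed buffers.
    vim -N -u NONE -es \
        -c 'bufdo %s/colour/color/ge | update' \
        -c 'qa!' "$vimdir"/*.txt </dev/null
fi

cat "$vimdir"/*.txt
```

The `| update` part matters: without it, Vim refuses to leave a modified buffer for the next one unless the `hidden` option is set.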
So that's all the Vim tricks I'm going to go over right now. The two languages I program in
most of the time when I have to do scripting are R and Python. I'm not going to go into how I
use them because it's a really big topic, but I'm going to go into just some of the modules
that I use. For R, the libraries that are useful are RCurl, which is a curl interface when
you're in R; rvest, which is similar to Beautiful Soup in Python, which I'll talk about, and
is a way to scrape data off a web page; XML, another way to scrape data off a web page; and
xlsx, which is a way to get data out of Excel files. The thing about working with some of
these packages, though, is that they have dependencies on your operating system. For xlsx you
need to have rJava installed, which means you have to have Java installed. XML requires
libxml2. You don't notice these things when you're on a Windows system, because I've used them
on Windows before; it's actually in the binary, because there aren't any dynamic things in
there, so it's in the binary when you install the module. But for Unix-like systems, just
install the dependencies first. So yeah, those are really useful. One thing about xlsx: it'll
just put the entire worksheet in a data frame, and you can say I want to look at the third
worksheet in the workbook and I want to just look at the first 25 rows, or you can say, for
every worksheet in the workbook, do this. Automating that kind of stuff is really fun. I've
started doing these things a lot more instead of using Visual Basic, because I hate writing
macros, and going from LibreOffice to MS Office is just not good when it comes to macros. This
is reproducible.

Some of the Python libraries I use: Beautiful Soup. Like I said before, Beautiful Soup is a
great Python library for scraping data from web pages.

NLTK, which is the Natural Language Toolkit. There's another one called tm in R, which is
similar; tm stands for text mining. tm in R and NLTK in Python are both natural language
processors. There's no way I have time to go into natural language processing, but I'll put in
the show notes a couple of good YouTube videos to watch if you want to learn about it, because
it's a big topic, and it's the way search engines and contextual search and a lot of these
things work. That, in combination a lot of times with hidden Markov models, is the way that a
lot of this dynamic, contextual search stuff works in search engines. Then there's the re
Python library, which is just the regular expression library. If you want to work with regular
expressions in Python, you need it.

rdflib: I work with creating taxonomies and ontologies sometimes, so if you're going to deal
with the semantic web and you don't want to deal with something like Semaphore or one of these
huge big-data platforms, you don't want to spend money on that but you have the files already,
rdflib is a great way to interact with those files or with the server. There's another module
called SPARQLWrapper, S-P-A-R-Q-L wrapper, and it's a subset of the rdflib commands that's a
little bit easier to work with, but I like working with rdflib straight away for most things.
And then csv. csv is a built-in Python library that lets you work with CSV files and
tab-delimited files. It's pretty much similar to awk: it works with any type of
character-separated file, and it also works with Excel-style files too.

So those are some of the tools I use. Some other tools that I use are Java web clients, or web
servers. One is OpenRefine, openrefine.org. If you haven't seen it, it was a project by
Google, it's open source, and it's a way to visualize and clean up dirty data. It can
interface with a database or just read in something like an Excel file. It can cluster terms
that are similar, so if there are things like capitalization differences or misspellings, you
can start to cluster those words together and say that old word is now this word. It's just a
visual way of helping you clean up large amounts of data really quickly.

Another tool to use with that is reconcile-csv. It's actually a plugin for OpenRefine, and
it's also a Java web client; you basically run it as a little daemon as you're running
OpenRefine, and you can interact with it. What it does is let you take data that you just
cleaned up and a standard data set, compare the two, and it'll do a job of matching the thing
that you just cleaned up to whatever standard you choose, and it'll give you a probability
score of how good the match is. It's a really niche type of operation, but if you have to do
this type of work it's so useful. Say there were, like, five different ways people spelled
Mississippi, because there are so many letters, and you did a job in OpenRefine clustering all
the similar ones to a single spelling of Mississippi. Then you can go and take, you know, just
a CSV file with all the states and have it compare Mississippi over here to Mississippi over
there. It doesn't have to be a single word, it can be a sentence even. For Mississippi it's
pretty easy: there'd be a one-to-one match, so your score would be a one. By default, anything
above 0.8, I think it says, is a good match. So if you're looking at all of the data that the
units of measure consortium compiled, and you have a whole bunch of units of measure in a
system that you're working with, you can first clean up the units of measure, because people
will spell milliliter or centiliter or whatever differently. Clean it up first, compare it to
the standard, and now you can say for sure that what this system calls milliliter is what this
national standard calls mL. And there you go. That's just a small use case.

The last tool is also a Java applet. Not an applet, sorry, a Java web server. With all these
Java web servers, all you have to do is just go, you know, java and the name of the file, and
they run; there's pretty much not too much to it. This one is called Tabula. If you ever had a
PDF and you wanted to take a table, or a couple of tables, out of that PDF and put them into a
CSV file or an Excel file, Tabula does exactly that. You go in on your web browser and upload
the file, which is not really uploading, you're just saying where it is in your directory,
because it's local. Then you basically just drag to select the area that you want to analyze,
so you highlight the area on the screen that looks like the table and hit start, and it'll
extract the data out of that and put it into a CSV file or an Excel file.

It took me forever to find this, but once I found it I was like, this is a keeper, because so
much of what I do is people giving you PDF files and expecting you to be able to use them:
"Well, this is our documentation, so put this in our database." It's like, well, it's not
structured data, but you can structure it in this type of way. Very useful. I've suggested it
to other people, and I know other people who are using it now. So those are my tricks that are
outside of command-line tools.

That's going to do it for this episode. If you have any questions or any further suggestions,
you can go ahead and look for me on Hacker Public Radio, I'm there. You can also find me on
Twitter at @ryoung29. Stay tuned for more Hacker Public Radio. Goodbye.

You've been listening to Hacker Public Radio at HackerPublicRadio.org. We are a community
podcast network that releases shows every weekday, Monday through Friday. Today's show, like
all our shows, was contributed by an HPR listener like yourself. If you ever thought of
recording a podcast, then click on our contribute link to find out how easy it really is.
Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club,
and it's part of the binary revolution at binrev.com. If you have comments on today's show,
please email the host directly, leave a comment on the website, or record a follow-up episode
yourself. Unless otherwise stated, today's show is released under a Creative Commons
Attribution-ShareAlike 3.0 license.