Episode: 2091 Title: HPR2091: Everyday Unix/Linux Tools for data processing Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr2091/hpr2091.mp3 Transcribed: 2025-10-18 14:11:41 --- This episode of HPR is brought to you by AnHonestHost.com. Get 15% discount on all shared hosting with the offer code HPR15, that's HPR15. Better web hosting that's honest and fair, at AnHonestHost.com.

Hello Hacker Public Radio, this is Be Easy, once again with another episode. This time I'm going to talk about some of the command line and other tools I use to analyze and process data. A lot of what I do for my day job is looking at data in various formats, whether it be text or numbers or other types of objects, sometimes even images, and it comes in all kinds of formats. So a lot of what I have to do, before I even start analyzing it at all, is get it into a format that is more agreeable for automated data processing or automated data analysis. I'm just going to go through some of the tools I use. It won't be an exhaustive list, but here we go.

In terms of cleaning up files to get them ready for the right kind of environment, I like to use a program called unix2dos, or dos2unix, depending on which way I'm going. It's one program, but it provides both commands. It does things like change the file encoding to UTF-8 (it has options where you can change that and other things), and it converts the end-of-line and end-of-file characters that differ between Unix-like systems and Windows or DOS-like systems. A lot of times I'll get a bunch of text-based files that are in the DOS style, and I'll just run dos2unix over the directory and it converts all of them.

Another one that is also useful is a program called detox.
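A minimal sketch of that bulk dos2unix conversion, with a sed fallback assumed for systems where dos2unix isn't installed (the `sed -i` form shown is the GNU one; BSD sed wants `-i ''`):

```shell
# Make a sample file with DOS-style CRLF line endings.
printf 'first line\r\nsecond line\r\n' > sample.txt

# Convert every .txt file in the directory to Unix LF endings.
for f in *.txt; do
    if command -v dos2unix >/dev/null 2>&1; then
        dos2unix "$f"
    else
        # Portable fallback: strip the trailing carriage return.
        sed -i 's/\r$//' "$f"
    fi
done
```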
I talked about it briefly before, I think, or I might have mentioned it in the comments of another podcast, but detox is a utility that renames files to make them easier to work with. This is out of the man page: it removes spaces and other such annoyances, and it also translates or cleans up Latin-1 (ISO 8859-1) characters encoded in 8-bit ASCII, Unicode characters encoded in UTF-8, and CGI-escaped characters.

You may remember that sometimes people put spaces and other things in file names, and if you're not careful in bash, if you don't properly put quotation marks around your variables, you'll end up passing part of a file name instead of the entire file name because of the space. Then you'll get an error message like "command not found" or "no such file or directory", because instead of looking at the entire name of the file you're only looking at the characters before the first piece of whitespace. If you don't want to have to worry about that, you can use detox and it cleans that kind of stuff up. It also takes out question marks, and one of the options lets you indicate which character you want as the default replacement for these characters: you can take out all the question marks and replace them with underscores, or replace them with dashes, however you want to clean it up. There are some other options too; you can look in the man page to learn more.

Another thing I use a lot for cleaning up files is pdftotext. A lot of times I'll get PDF documents that were either "save as PDF" or "print to file" PDFs, not scanned image files that would take OCR to decode. I'm not talking about those; I'm talking about when you save a text file or a Word file to PDF, or even export from LaTeX. For those types of documents, pdftotext will do exactly what it says: it'll turn a PDF document into a text file and just strip the text out. One thing that I like to do on top of that is use an option called -layout.
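Here is a minimal sketch of the kind of filename cleanup detox does. The tr-based fallback is an assumption for systems without detox, and the exact renaming detox performs depends on its configured sequences:

```shell
# Create a file name with spaces and other such annoyances.
touch 'My Report (draft)?.txt'

if command -v detox >/dev/null 2>&1; then
    detox 'My Report (draft)?.txt'
else
    # Rough stand-in: swap spaces, parens, and question marks for underscores.
    mv 'My Report (draft)?.txt' \
       "$(printf '%s' 'My Report (draft)?.txt' | tr ' ()?' '____')"
fi
```

Either way, the spaced name is gone, so unquoted variables in later scripts can't split it apart.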
That option makes pdftotext do its best job of trying to reproduce the layout of the PDF document. If you don't use it and, say, the title is centered, it'll just put the title all the way left-justified; but if you use -layout it'll put spaces at the beginning to make it look centered. And if you have a table, that's what it's really great for: tables. It'll try to format tables by lining the whitespace up so it still looks like a table. Very useful, and especially useful in bulk. So if I run pdftotext over all the PDFs in a directory, it'll give me a whole directory full of .txt files.

Why is that good? Because you can do a lot of things with text that you can't do with PDFs, like some of the other commands I'm going to go over, such as ack and grep and sed. Now, I'm not going to go into sed, because we've had many wonderful episodes about sed already, and if you haven't heard any you should go back, because I've learned a lot from them. I've been taking those lessons and using them in my work, so it's very useful.

ack and grep do similar kinds of things. Whereas the ls command lists the names of files, if you run grep with a word or a regex pattern and then a file or a group of files, or * for the directory, or *.txt, whatever the case may be, it'll look inside every document for that text. And grep has useful options: for instance, you can do grep -A 5 with a pattern against *.txt, and it'll print each match plus the five lines after it. So if you know that in the documents you're looking at, the text you want is in the five lines following some pattern, you can get it that way. You do that with -A and a number or -B and a number: -A gives you the lines after the match and -B gives you the lines before, and it's the capital letters.
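A sketch of that workflow: pdftotext takes one file at a time, so a shell loop handles the bulk conversion (guarded here, since pdftotext may not be installed), and grep's -A/-B flags pull context lines around a match:

```shell
# Bulk-convert PDFs to text, preserving layout, if pdftotext is available.
if command -v pdftotext >/dev/null 2>&1; then
    for f in *.pdf; do
        if [ -e "$f" ]; then pdftotext -layout "$f"; fi
    done
fi

# Demonstrate grep's context flags on a small text file.
cat > report.txt <<'EOF'
SUMMARY
revenue up
costs down
margin flat
DETAILS
see appendix
EOF

grep -A 3 'SUMMARY' report.txt   # the match plus the three lines after it
grep -B 1 'DETAILS' report.txt   # the match plus the one line before it
```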
The lowercase letters have slightly different meanings, and you can read the man page to learn about them. That's what I use grep for most: if I want -A or -B I'll use grep, otherwise I'll use ack. I also like ack because I use vim, and there is a plugin called ack.vim that I use inside vim to run ack. So if I cd into a directory and open a file in vim, I can run ack with the pattern I want and it gives me a quickfix menu with all the files and lines that match, and I can go look at every file that matches. It does the same thing on the command line: you run ack with the pattern, whether it's just a word or any other regex, and it'll search and print the matches to the screen, or you can redirect to a file like we've learned in other bash tips. So I really, really enjoy ack. The man page describes it as a grep alternative for programmers, and even if you're not a programmer it makes you feel good to be considered one for using this tool, so that's a little ego boost.

Now, I like to use these tools in combination. Say I need to find out whether I've already written a certain phrase or word or group of words down in documentation, and the documents are PDFs. I'll run pdftotext on all the files, then ack for the name or subject I'm looking for, find out which file it's actually in, then open that file and read what I want. I do this with my notes: I take a lot of notes, for work, for things I learn in my spare time, my shopping list, and being able to just run ack for "vegetables" on my notes directory and find where I wrote down vegetables is really great. For work it has lots of great uses too.
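A sketch of that notes search. ack isn't always installed, so this also shows the ubiquitous grep -rn equivalent (the note files and their contents are made up for illustration):

```shell
# A tiny notes directory to search.
mkdir -p notes
echo 'shopping: vegetables, carrots, kale' > notes/shopping.md
echo 'how to set up a LAMP server' > notes/sysadmin.md

# ack prints matching files and lines; grep -rn does the same everywhere.
if command -v ack >/dev/null 2>&1; then
    ack vegetables notes/
else
    grep -rn 'vegetables' notes/
fi
```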
Another tool I use a lot is awk, and I'm pretty sure there have been some episodes on awk as well. It's a great tool for parsing data that is in a structured format with field separators and columns, so you can use it on tab-delimited files, pipe-delimited files, comma-delimited files; you can define the field separator. I'm not going to go into it exhaustively here, because awk is a really big topic, and if there's a need I could make a whole series about awk like there's a whole series about sed. I've probably just talked myself into doing it, so you might want to stay tuned for that eventually. But like anything else, it's a way to get at your data in columns. So if I get a CSV file, instead of trying to open it in LibreOffice I'll just awk it, and if I just want the third column, I'll pull the third column out. I use it in bash scripts too.

One bash script I like uses it with another couple of tools, wget and curl, which are command line utilities for interfacing with the internet. So here's a little project for you. I needed to get a list of words and their definitions. So what do I do? I look at something like the Merriam-Webster dictionary and I see the pattern of the URL, where there's a part of the URL where the word you're looking for goes. Then I say, okay, I'll just write a big CSV file or a big text file with the words I'm looking for, do a for loop over every line, and do a wget of the web page with a substitution that puts that line's word into the URL. A lot of these sites are tricky, though, so you might have to put in a sleep 10 or a sleep 60, or whatever you need; you might have to make it pause for a few seconds between requests, because a lot of these sites don't want you to do this, so they'll put in something like a delay, and if you
do it too fast, one of the files wget brings down will be an HTML page that says something like "don't use robots to download our data". So it's not that it's illegal, it's just something they don't want you to do. And then, obviously, reference away: I got this from Merriam-Webster.

One of the things I like to do, either with sed or inside vim, instead of using unix2dos or dos2unix, is run a sed command or a vim regular expression substitution to take out the special characters. So it'd be: sed -i 's/^M//g' *.txt, where ^M is the literal control-M character. That takes out all the DOS end-of-line characters. Another way to take them out is sed -i 's/\r//g' *.txt, with a backslash-r escape; that does something similar. Another tricky one is to get the last line. I do that in vim more than anything else, because sometimes there's an end-of-file character that you have to parse out; I'm not going into that right now.

Another thing that I use a lot is Pandoc. As I've said on another podcast, I really enjoy writing in plain text, writing in Markdown. One reason is that I can do cool things like grep an entire directory, or ack through the entire directory, and find out where I wrote down how to, say, set up a LAMP server, if I've written it down somewhere, and find the file it's in. But also because I just really hate messing around with formatting, and when I do mess with it, I get distracted by making sure it's all perfect instead of focusing on the actual writing. In Markdown, a single # at the start of a line makes an H1 header, so I don't have to worry about whether the next time I want a header one it's lined up correctly or it's the right color. And I've learned from some of the other
episodes about how to be a good template creator, and that made me even more into using Markdown, because now that I know there's a good way to do it, I'll happily spend 15 or 20 minutes making a good template before I even start writing. But if I just need to write, I'll just write in Markdown.

Like I said, curl and wget: I'm not going to get into those in depth; there's lots online about curl and wget. But one thing about how I use them: I usually use wget if I just want to bring down an .html file, and I'll use curl if I want to work with a REST API. I'll do a curl -X for the method, put in -H for the headers, add the headers, then do a -X POST, send it, and get the data back. curl also works with plain authentication and other authentication methods, so if you need to log into something, find a file on an SFTP server, and bring it down, you can do it through curl, and that way you can script it and automate it.

All right, so keeping with the idea of working with text a lot, I'm going to go over just a couple of vim tricks that I use. I'm not going to go into all of them; vim is a lot of things. One of the things I like to do is open up a list of files that I want to work with. So if I'm in a directory with a bunch of text files, I'll say vim *.txt, which loads all of the files into vim's argument list, and then you can work with them in a very cool way. You can hit colon, vim (for vimgrep), space, then a forward slash, the pattern, another slash, then a space and ## (the two pound signs stand for the argument list). What that does is look for the pattern in everything in your argument list and put the matches into a quickfix list at the bottom, and then you can step through them with the quickfix commands like :cnext. I use a plugin called unimpaired, so I don't remember the exact commands; I just use the shortcuts that the unimpaired plugin
gives you. But you can go through every single document: instead of searching one document, where you hit n and it goes to the next match, next, next, when you get to the end of the list in this document and hit next one more time, it goes on to the next document and starts looking in that one, and then the next one and the next one. That has been such a time saver. I've been able to take projects that other people do as well and finish them in about a third of the time, because I can search text so fast, find the things I'm looking for, make changes on the fly and keep going, or just see whether something exists or not, and if it doesn't exist, add it. It's been such a huge time saver.

Like I mentioned before, there's the ack plugin, and a similar thing to the arglist search is searching across buffers. There's a command called :bufdo. If you know that there's something in all these files, and you don't want to use sed because you're already in vim and already have, say, five files open in buffers, you can do colon bufdo (b-u-f-d-o), then a space, then %s/pattern/replacement/ge, then space, pipe, space, update. It'll go into all those buffers, make the replacement, and re-save the files with the replacement in them. Very cool thing to do. So that's all the vim tricks I'm going to go over right now.

The two languages I program in most of the time, when I have to do scripting, are R and Python. I'm not going to go into how I use them, because that's a really big topic, but I will go into some of the modules I use. For R, the libraries I use that are useful are RCurl, which is a curl interface for R; rvest, which is similar to Beautiful Soup in Python (which I'll talk about), a way to scrape data off a web page; XML, another way to scrape data off a
web page; and xlsx, which is a way to get data out of Excel files. One thing about working with some of these packages, though, is that they have dependencies on your operating system. For xlsx you need to have rJava installed, which means you have to have Java installed; XML requires libxml2. You don't notice these things when you're on a Windows system, because I've used them on Windows before and it's all in the binary; the dynamically linked pieces are bundled in when you install the module. But on Unix-like systems, install the dependencies first. So yeah, those three are really useful. One thing about xlsx: it'll bring an entire worksheet into a data frame, and you can say "I want the third worksheet in the workbook and just the first 25 rows", or you can say "for every worksheet in the workbook, do this". Automating that kind of stuff is really fun. I've started doing these things a lot more instead of using Visual Basic, because I hate writing macros, and porting macros between LibreOffice and MS Office is just not good. This way is reproducible.

Some of the Python libraries I use: Beautiful Soup, like I said before, is a great Python library for scraping data from web pages. NLTK is the Natural Language Toolkit; there's another one called tm in R which is similar, where tm stands for text mining. NLTK in Python and tm in R are both natural language processing toolkits. There's no way I have time to go into natural language processing, but I'll put a couple of good YouTube videos in the show notes if you want to learn about it, because it's a big topic, and it's the way search engines, contextual search, and a lot of these things work. In combination, a lot of the time, with hidden Markov models, those are the ways
that a lot of this dynamic, contextual search stuff works in search engines. The re Python library is just the regular expression library; it's good if you want to work with regular expressions in Python. Then rdflib: I work with creating taxonomies and ontologies sometimes, and if you're going to deal with the semantic web and you don't want to deal with one of those big commercial platforms, you don't want to spend money on that, but you already have the files, rdflib is a great way to interact with those files or with a server. There's another module called SPARQLWrapper (s-p-a-r-q-l wrapper), a companion to rdflib that's a little bit easier to work with, but I like working with rdflib directly for most things. And then csv: csv is a built-in Python library that lets you work with CSV files. It's pretty much similar to awk, for working with any type of character-separated file, and it handles Excel-style CSVs too.

So those are some of the tools I use. Some other tools I use that are Java web clients or web servers: OpenRefine, at openrefine.org. If you haven't seen it, it was originally a Google project and it's open source. It's a way to visualize and clean up dirty data. It can interface with a database, or you can just import something like an Excel file and it'll read it in. It can cluster terms that are similar, so if there are things like capitalization differences or misspellings, you can cluster those words together and say "that old value is now this value". It's just a visual way of helping you clean up large amounts of data really quickly.

Another tool to use with it is reconcile-csv. It's actually a plugin for OpenRefine, and it's also a Java web client; you basically run it as a little daemon alongside OpenRefine and interact with it. What it does is let you take data that you just cleaned up and a standard data set, compare
the two data sets, and it'll do a job of matching the thing you just cleaned up against whatever standard you choose, and it'll give you a probability score for how good the match is. It's a really niche type of operation, but if you have to do this type of work, it's so useful. Say there were five different ways people spelled Mississippi, because there are so many letters, and you did a job in OpenRefine clustering all the similar spellings to a single spelling of Mississippi. Then you can take, say, a CSV file with all the states, and have it compare Mississippi over here to Mississippi over there. It doesn't have to be a single word; it can even be a sentence. For Mississippi it's pretty easy; it'd be pretty much a one-to-one match, so your score would be a one, and by default anything above 0.8, I think it says, is a good match. So if you're looking at all the data that the units of measure consortium compiled, and you have a whole bunch of units of measure in a system you're working with, you can first clean up the units of measure, because people will spell milliliter or centiliter or whatever differently; clean it up first, compare it to the standard, and now you can say, okay, for sure, what this system calls milliliter is the national standard's mL, and there you go. So that's just a small use case.

The last tool is also a Java... not an applet, sorry, a Java web server. With all these Java web servers, all you have to do is run java with the name of the file and they pretty much just run; there's not too much to it. This one is called Tabula, and if you've ever had a PDF and wanted to take a table, or a couple of tables, out of that PDF and put them into something like a CSV file or an Excel file, Tabula does exactly that. It'll bring that
file in: you go to your web browser and upload the file (it's not really uploaded; it stays right where it is in your directory, because it's all local), and then it'll analyze it. You basically just drag and drop over the area you want analyzed, so you highlight the area on the screen that looks like the table, hit start, and it'll extract the data out of it and put it into a CSV file or an Excel file. It took me forever to find this, but once I found it I thought, this is a keeper, because so much of what I do is people giving you PDF files and expecting you to be able to use them: "here's our documentation, so put this in our database". Well, it's not structured data, but you can structure it this way. Very useful; I've suggested it to other people who are using it now.

So those are my tricks, inside and outside of command line tools. That's going to do it for this episode. If you have any questions or any further suggestions, you can look for me on Hacker Public Radio; I'm there. You can also find me on Twitter at our young 29. Stay tuned for more Hacker Public Radio. Goodbye.

You've been listening to Hacker Public Radio at hackerpublicradio.org. We are a community podcast network that releases shows every weekday, Monday through Friday. Today's show, like all our shows, was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then click on our contributing link to find out how easy it really is. Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club, and it's part of the binary revolution at binrev.com. If you have comments on today's show, please email the host directly, leave a comment on the website, or record a follow-up episode yourself. Unless otherwise stated, today's show is released under a Creative Commons Attribution-ShareAlike 3.0 license.