Episode: 2091 Title: HPR2091: Everyday Unix/Linux Tools for data processing Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr2091/hpr2091.mp3 Transcribed: 2025-10-18 14:11:41 --- This episode of HPR is brought to you by AnHonestHost.com. Get 15% discount on all shared hosting with the offer code HPR15, that's HPR15. Better web hosting that's honest and fair, at AnHonestHost.com.

Hello Hacker Public Radio, this is Be Easy, once again with another episode. This time I'm going to talk about some of the command line and other tools I use to analyze and process data. A lot of what I do for my day job is looking at data in various formats, whether it be text or numbers or other types of objects, sometimes even images, and it comes in all kinds of formats. So a lot of what I have to do, before I even start analyzing it at all, is get it into a format that is more agreeable for automated data processing or automated data analysis. I'm just going to go through some of the tools I use. It won't be an exhaustive list, but here we go.

In terms of cleaning up files to get them ready for the right kind of environment, I like to use a program called unix2dos, or dos2unix, depending on which way I'm going. It's one program, but it provides both commands. It does things like change the file encoding to UTF-8 (it has options where you can change that and other things), and it converts the end-of-line and end-of-file characters that differ between Unix-like systems and Windows or DOS-like systems. A lot of times I'll get a bunch of text-based files that are in the DOS style, and I'll just run dos2unix over the directory and it converts all of them.

Another one that is also useful is a program called detox.
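A minimal sketch of that bulk dos2unix conversion, with a sed fallback assumed for systems where dos2unix isn't installed (the `sed -i` form shown is the GNU one; BSD sed wants `-i ''`):

```shell
# Make a sample file with DOS-style CRLF line endings.
printf 'first line\r\nsecond line\r\n' > sample.txt

# Convert every .txt file in the directory to Unix LF endings.
for f in *.txt; do
    if command -v dos2unix >/dev/null 2>&1; then
        dos2unix "$f"
    else
        # Portable fallback: strip the trailing carriage return.
        sed -i 's/\r$//' "$f"
    fi
done
```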
I talked about it briefly before, I think, or I might have mentioned it in the comments of another podcast, but detox is a utility that renames files to make them easier to work with. This is out of the man page: it removes spaces and other such annoyances, and it also translates or cleans up Latin-1 (ISO 8859-1) characters encoded in 8-bit ASCII, Unicode characters encoded in UTF-8, and CGI-escaped characters.

You may remember that sometimes people put spaces and other things in file names, and if you're not careful in bash, if you don't properly put quotation marks around your variables, you'll end up passing part of a file name instead of the entire file name because of the space. Then you'll get an error message like "command not found" or "no such file or directory", because instead of looking at the entire name of the file you're only looking at the characters before the first piece of whitespace. If you don't want to have to worry about that, you can use detox and it cleans that kind of stuff up. It also takes out question marks, and one of the options lets you indicate which character you want as the default replacement for these characters: you can take out all the question marks and replace them with underscores, or replace them with dashes, however you want to clean it up. There are some other options too; you can look in the man page to learn more.

Another thing I use a lot for cleaning up files is pdftotext. A lot of times I'll get PDF documents that were either "save as PDF" or "print to file" PDFs, not scanned image files that would take OCR to decode. I'm not talking about those; I'm talking about when you save a text file or a Word file to PDF, or even export from LaTeX. For those types of documents, pdftotext will do exactly what it says: it'll turn a PDF document into a text file and just strip the text out. One thing that I like to do on top of that is use an option called -layout.
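Here is a minimal sketch of the kind of filename cleanup detox does. The tr-based fallback is an assumption for systems without detox, and the exact renaming detox performs depends on its configured sequences:

```shell
# Create a file name with spaces and other such annoyances.
touch 'My Report (draft)?.txt'

if command -v detox >/dev/null 2>&1; then
    detox 'My Report (draft)?.txt'
else
    # Rough stand-in: swap spaces, parens, and question marks for underscores.
    mv 'My Report (draft)?.txt' \
       "$(printf '%s' 'My Report (draft)?.txt' | tr ' ()?' '____')"
fi
```

Either way, the spaced name is gone, so unquoted variables in later scripts can't split it apart.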
That option makes pdftotext do its best job of trying to reproduce the layout of the PDF document. If you don't use it and, say, the title is centered, it'll just put the title all the way left-justified; but if you use -layout it'll put spaces at the beginning to make it look centered. And if you have a table, that's what it's really great for: tables. It'll try to format tables by lining the whitespace up so it still looks like a table. Very useful, and especially useful in bulk. So if I run pdftotext over all the PDFs in a directory, it'll give me a whole directory full of .txt files.

Why is that good? Because you can do a lot of things with text that you can't do with PDFs, like some of the other commands I'm going to go over, such as ack and grep and sed. Now, I'm not going to go into sed, because we've had many wonderful episodes about sed already, and if you haven't heard any you should go back, because I've learned a lot from them. I've been taking those lessons and using them in my work, so it's very useful.

ack and grep do similar kinds of things. Whereas the ls command lists the names of files, if you run grep with a word or a regex pattern and then a file or a group of files, or * for the directory, or *.txt, whatever the case may be, it'll look inside every document for that text. And grep has useful options: for instance, you can do grep -A 5 with a pattern against *.txt, and it'll print each match plus the five lines after it. So if you know that in the documents you're looking at, the text you want is in the five lines following some pattern, you can get it that way. You do that with -A and a number or -B and a number: -A gives you the lines after the match and -B gives you the lines before, and it's the capital letters.
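A sketch of that workflow: pdftotext takes one file at a time, so a shell loop handles the bulk conversion (guarded here, since pdftotext may not be installed), and grep's -A/-B flags pull context lines around a match:

```shell
# Bulk-convert PDFs to text, preserving layout, if pdftotext is available.
if command -v pdftotext >/dev/null 2>&1; then
    for f in *.pdf; do
        if [ -e "$f" ]; then pdftotext -layout "$f"; fi
    done
fi

# Demonstrate grep's context flags on a small text file.
cat > report.txt <<'EOF'
SUMMARY
revenue up
costs down
margin flat
DETAILS
see appendix
EOF

grep -A 3 'SUMMARY' report.txt   # the match plus the three lines after it
grep -B 1 'DETAILS' report.txt   # the match plus the one line before it
```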
The lowercase letters have slightly different meanings, and you can read the man page to learn about them. That's what I use grep for most: if I want -A or -B I'll use grep, otherwise I'll use ack. I also like ack because I use vim, and there is a plugin called ack.vim that I use inside vim to run ack. So if I cd into a directory and open a file in vim, I can run ack with the pattern I want and it gives me a quickfix menu with all the files and lines that match, and I can go look at every file that matches. It does the same thing on the command line: you run ack with the pattern, whether it's just a word or any other regex, and it'll search and print the matches to the screen, or you can redirect to a file like we've learned in other bash tips. So I really, really enjoy ack. The man page describes it as a grep alternative for programmers, and even if you're not a programmer it makes you feel good to be considered one for using this tool, so that's a little ego boost.

Now, I like to use these tools in combination. Say I need to find out whether I've already written a certain phrase or word or group of words down in documentation, and the documents are PDFs. I'll run pdftotext on all the files, then ack for the name or subject I'm looking for, find out which file it's actually in, then open that file and read what I want. I do this with my notes: I take a lot of notes, for work, for things I learn in my spare time, my shopping list, and being able to just run ack for "vegetables" on my notes directory and find where I wrote down vegetables is really great. For work it has lots of great uses too.
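A sketch of that notes search. ack isn't always installed, so this also shows the ubiquitous grep -rn equivalent (the note files and their contents are made up for illustration):

```shell
# A tiny notes directory to search.
mkdir -p notes
echo 'shopping: vegetables, carrots, kale' > notes/shopping.md
echo 'how to set up a LAMP server' > notes/sysadmin.md

# ack prints matching files and lines; grep -rn does the same everywhere.
if command -v ack >/dev/null 2>&1; then
    ack vegetables notes/
else
    grep -rn 'vegetables' notes/
fi
```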
Another tool I use a lot is awk, and I'm pretty sure there have been some episodes on awk as well. It's a great tool for parsing data that is in a structured format with field separators and columns, so you can use it on tab-delimited files, pipe-delimited files, comma-delimited files; you can define the field separator. I'm not going to go into it exhaustively here, because awk is a really big topic, and if there's a need I could make a whole series about awk like there's a whole series about sed. I've probably just talked myself into doing it, so you might want to stay tuned for that eventually. But like anything else, it's a way to get at your data in columns. So if I get a CSV file, instead of trying to open it in LibreOffice I'll just awk it, and if I just want the third column, I'll pull the third column out. I use it in bash scripts too.

One bash script I like uses it with another couple of tools, wget and curl, which are command line utilities for interfacing with the internet. So here's a little project for you. I needed to get a list of words and their definitions. So what do I do? I look at something like the Merriam-Webster dictionary and I see the pattern of the URL, where there's a part of the URL where the word you're looking for goes. Then I say, okay, I'll just write a big CSV file or a big text file with the words I'm looking for, do a for loop over every line, and do a wget of the web page with a substitution that puts that line's word into the URL. A lot of these sites are tricky, though, so you might have to put in a sleep 10 or a sleep 60, or whatever you need; you might have to make it pause for a few seconds between requests, because a lot of these sites don't want you to do this, so they'll put in something like a delay, and if you
do it too fast, one of the files wget brings down will be an HTML page that says something like "don't use robots to download our data". So it's not that it's illegal, it's just something they don't want you to do. And then, obviously, reference away: I got this from Merriam-Webster.

One of the things I like to do, either with sed or inside vim, instead of using unix2dos or dos2unix, is run a sed command or a vim regular expression substitution to take out the special characters. So it'd be: sed -i 's/^M//g' *.txt, where ^M is the literal control-M character. That takes out all the DOS end-of-line characters. Another way to take them out is sed -i 's/\r//g' *.txt, with a backslash-r escape; that does something similar. Another tricky one is to get the last line. I do that in vim more than anything else, because sometimes there's an end-of-file character that you have to parse out; I'm not going into that right now.

Another thing that I use a lot is Pandoc. As I've said on another podcast, I really enjoy writing in plain text, writing in Markdown. One reason is that I can do cool things like grep an entire directory, or ack through the entire directory, and find out where I wrote down how to, say, set up a LAMP server, if I've written it down somewhere, and find the file it's in. But also because I just really hate messing around with formatting, and when I do mess with it, I get distracted by making sure it's all perfect instead of focusing on the actual writing. In Markdown, a single # at the start of a line makes an H1 header, so I don't have to worry about whether the next time I want a header one it's lined up correctly or it's the right color. And I've learned from some of the other
episodes about how to be a good template creator, and that made me even more into using Markdown, because now that I know there's a good way to do it, I'll happily spend 15 or 20 minutes making a good template before I even start writing. But if I just need to write, I'll just write in Markdown.

Like I said, curl and wget: I'm not going to get into those in depth; there's lots online about curl and wget. But one thing about how I use them: I usually use wget if I just want to bring down an .html file, and I'll use curl if I want to work with a REST API. I'll do a curl -X for the method, put in -H for the headers, add the headers, then do a -X POST, send it, and get the data back. curl also works with plain authentication and other authentication methods, so if you need to log into something, find a file on an SFTP server, and bring it down, you can do it through curl, and that way you can script it and automate it.

All right, so keeping with the idea of working with text a lot, I'm going to go over just a couple of vim tricks that I use. I'm not going to go into all of them; vim is a lot of things. One of the things I like to do is open up a list of files that I want to work with. So if I'm in a directory with a bunch of text files, I'll say vim *.txt, which loads all of the files into vim's argument list, and then you can work with them in a very cool way. You can hit colon, vim (for vimgrep), space, then a forward slash, the pattern, another slash, then a space and ## (the two pound signs stand for the argument list). What that does is look for the pattern in everything in your argument list and put the matches into a quickfix list at the bottom, and then you can step through them with the quickfix commands like :cnext. I use a plugin called unimpaired, so I don't remember the exact commands; I just use the shortcuts that the unimpaired plugin
gives you. But you can go through every single document: instead of searching one document, where you hit n and it goes to the next match, next, next, when you get to the end of the list in this document and hit next one more time, it goes on to the next document and starts looking in that one, and then the next one and the next one. That has been such a time saver. I've been able to take projects that other people do as well and finish them in about a third of the time, because I can search text so fast, find the things I'm looking for, make changes on the fly and keep going, or just see whether something exists or not, and if it doesn't exist, add it. It's been such a huge time saver.

Like I mentioned before, there's the ack plugin, and a similar thing to the arglist search is searching across buffers. There's a command called :bufdo. If you know that there's something in all these files, and you don't want to use sed because you're already in vim and already have, say, five files open in buffers, you can do colon bufdo (b-u-f-d-o), then a space, then %s/pattern/replacement/ge, then space, pipe, space, update. It'll go into all those buffers, make the replacement, and re-save the files with the replacement in them. Very cool thing to do. So that's all the vim tricks I'm going to go over right now.

The two languages I program in most of the time, when I have to do scripting, are R and Python. I'm not going to go into how I use them, because that's a really big topic, but I will go into some of the modules I use. For R, the libraries I use that are useful are RCurl, which is a curl interface for R; rvest, which is similar to Beautiful Soup in Python (which I'll talk about), a way to scrape data off a web page; XML, another way to scrape data off a
web page; and xlsx, which is a way to get data out of Excel files. One thing about working with some of these packages, though, is that they have dependencies on your operating system. For xlsx you need to have rJava installed, which means you have to have Java installed; XML requires libxml2. You don't notice these things when you're on a Windows system, because I've used them on Windows before and it's all in the binary; the dynamically linked pieces are bundled in when you install the module. But on Unix-like systems, install the dependencies first. So yeah, those three are really useful. One thing about xlsx: it'll bring an entire worksheet into a data frame, and you can say "I want the third worksheet in the workbook and just the first 25 rows", or you can say "for every worksheet in the workbook, do this". Automating that kind of stuff is really fun. I've started doing these things a lot more instead of using Visual Basic, because I hate writing macros, and porting macros between LibreOffice and MS Office is just not good. This way is reproducible.

Some of the Python libraries I use: Beautiful Soup, like I said before, is a great Python library for scraping data from web pages. NLTK is the Natural Language Toolkit; there's another one called tm in R which is similar, where tm stands for text mining. NLTK in Python and tm in R are both natural language processing toolkits. There's no way I have time to go into natural language processing, but I'll put a couple of good YouTube videos in the show notes if you want to learn about it, because it's a big topic, and it's the way search engines, contextual search, and a lot of these things work. In combination, a lot of the time, with hidden Markov models, those are the ways
that a lot of this dynamic, contextual search stuff works in search engines. The re Python library is just the regular expression library; it's good if you want to work with regular expressions in Python. Then rdflib: I work with creating taxonomies and ontologies sometimes, and if you're going to deal with the semantic web and you don't want to deal with one of those big commercial platforms, you don't want to spend money on that, but you already have the files, rdflib is a great way to interact with those files or with a server. There's another module called SPARQLWrapper (s-p-a-r-q-l wrapper), a companion to rdflib that's a little bit easier to work with, but I like working with rdflib directly for most things. And then csv: csv is a built-in Python library that lets you work with CSV files. It's pretty much similar to awk, for working with any type of character-separated file, and it handles Excel-style CSVs too.

So those are some of the tools I use. Some other tools I use that are Java web clients or web servers: OpenRefine, at openrefine.org. If you haven't seen it, it was originally a Google project and it's open source. It's a way to visualize and clean up dirty data. It can interface with a database, or you can just import something like an Excel file and it'll read it in. It can cluster terms that are similar, so if there are things like capitalization differences or misspellings, you can cluster those words together and say "that old value is now this value". It's just a visual way of helping you clean up large amounts of data really quickly.

Another tool to use with it is reconcile-csv. It's actually a plugin for OpenRefine, and it's also a Java web client; you basically run it as a little daemon alongside OpenRefine and interact with it. What it does is let you take data that you just cleaned up and a standard data set, compare
the two data sets, and it'll do a job of matching the thing you just cleaned up against whatever standard you choose, and it'll give you a probability score for how good the match is. It's a really niche type of operation, but if you have to do this type of work, it's so useful. Say there were five different ways people spelled Mississippi, because there are so many letters, and you did a job in OpenRefine clustering all the similar spellings to a single spelling of Mississippi. Then you can take, say, a CSV file with all the states, and have it compare Mississippi over here to Mississippi over there. It doesn't have to be a single word; it can even be a sentence. For Mississippi it's pretty easy; it'd be pretty much a one-to-one match, so your score would be a one, and by default anything above 0.8, I think it says, is a good match. So if you're looking at all the data that the units of measure consortium compiled, and you have a whole bunch of units of measure in a system you're working with, you can first clean up the units of measure, because people will spell milliliter or centiliter or whatever differently; clean it up first, compare it to the standard, and now you can say, okay, for sure, what this system calls milliliter is the national standard's mL, and there you go. So that's just a small use case.

The last tool is also a Java... not an applet, sorry, a Java web server. With all these Java web servers, all you have to do is run java with the name of the file and they pretty much just run; there's not too much to it. This one is called Tabula, and if you've ever had a PDF and wanted to take a table, or a couple of tables, out of that PDF and put them into something like a CSV file or an Excel file, Tabula does exactly that. It'll bring that
file in: you go to your web browser and upload the file (it's not really uploaded; it stays right where it is in your directory, because it's all local), and then it'll analyze it. You basically just drag and drop over the area you want analyzed, so you highlight the area on the screen that looks like the table, hit start, and it'll extract the data out of it and put it into a CSV file or an Excel file. It took me forever to find this, but once I found it I thought, this is a keeper, because so much of what I do is people giving you PDF files and expecting you to be able to use them: "here's our documentation, so put this in our database". Well, it's not structured data, but you can structure it this way. Very useful; I've suggested it to other people who are using it now.

So those are my tricks, inside and outside of command line tools. That's going to do it for this episode. If you have any questions or any further suggestions, you can look for me on Hacker Public Radio; I'm there. You can also find me on Twitter at our young 29. Stay tuned for more Hacker Public Radio. Goodbye.

You've been listening to Hacker Public Radio at hackerpublicradio.org. We are a community podcast network that releases shows every weekday, Monday through Friday. Today's show, like all our shows, was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then click on our contributing link to find out how easy it really is. Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club, and it's part of the binary revolution at binrev.com. If you have comments on today's show, please email the host directly, leave a comment on the website, or record a follow-up episode yourself. Unless otherwise stated, today's show is released under a Creative Commons Attribution-ShareAlike 3.0 license.