Initial commit: HPR Knowledge Base MCP Server

- MCP server with stdio transport for local use
- Search episodes, transcripts, hosts, and series
- 4,511 episodes with metadata and transcripts
- Data loader with in-memory JSON storage

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Author: Lee Hanken
Date:   2025-10-26 10:54:13 +00:00
Commit: 7c8efd2228
4494 changed files with 1705541 additions and 0 deletions

hpr_transcripts/hpr2091.txt

@@ -0,0 +1,263 @@
Episode: 2091
Title: HPR2091: Everyday Unix/Linux Tools for data processing
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr2091/hpr2091.mp3
Transcribed: 2025-10-18 14:11:41
---
This episode of HPR is brought to you by AnHonestHost.com. Get 15% discount on all shared
hosting with the offer code HPR15. That's HPR15. Better web hosting that's
honest and fair, at AnHonestHost.com.
Hello Hacker Public Radio, this is Be Easy, once again with another episode. This time I'm going to talk to you about some of the command-line and other tools I use to analyze and process data. A lot of what I do for my day job is looking at data in various formats, whether it's text or numbers or other types of objects, sometimes even images, and it comes in all kinds of formats. So a lot of what I have to do, before I even start analyzing it at all, is to get it into a format that is more agreeable for automated data processing or automated data analysis. I'm just going to go through some of the tools I use; I'm not going to go through an exhaustive list, but here we go.

In terms of cleaning up files to get them ready for the right type of environment, I like to use a program called unix2dos, or dos2unix, depending on which way I'm going. It's the same single program, but it provides both commands, and it does things like change the file encoding to UTF-8; it has options where you can change other things as well, including the file encoding. It changes the end-of-line characters, and the end-of-file character, which differ between Unix-like systems and Windows-like (or DOS-like) systems. A lot of times I'll get a bunch of files that are text-based and they'll be in the DOS style, and I'll just run dos2unix on the directory and it does all of them.

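For example, a minimal sketch (the file names are hypothetical; both commands convert in place by default):

    # Convert a whole directory of DOS-style text files to Unix line endings
    dos2unix *.txt

    # Or go the other way for something that has to open cleanly on Windows
    unix2dos report.txt
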
Another one that is also useful is a program called detox. I talked about it briefly before, I think, or I might have mentioned it in the comments of another podcast, but detox is a utility that renames files to make them easier to work with. This is out of the man page: it removes spaces and other such annoyances. It also translates or cleans up Latin-1 (ISO 8859-1) characters encoded in 8-bit ASCII, Unicode characters encoded in UTF-8, and CGI-escaped characters. You might remember that sometimes people put spaces and things in file names, and if you're not careful in bash and don't properly put quotation marks around your variables, you'll end up parsing part of a file name instead of the entire file name because of the space. Then you'll get an error message like "command not found" or "no such file or directory", because you're not looking at the entire name of the file, only the first set of characters before the first piece of whitespace. If you don't want to have to worry about that, you can use detox and it cleans up that kind of stuff. It also takes out question marks, and one of the options lets you indicate what character you want as the default replacement for these characters: you can take out all the question marks and replace them with underscores, or replace them with dashes, however you want to clean it up. There are some other options; you can look in the man page to learn more.

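A minimal sketch of how that might look (the file name is hypothetical, and the exact renamed result depends on your detox sequence):

    # Before: a file called 'My Report (final)?.txt'
    detox -v 'My Report (final)?.txt'
    # After: something like My_Report_final.txt

    # Recurse through a whole directory tree
    detox -r -v ~/incoming/
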
Another thing I use for cleaning up files a lot of the time is pdftotext. A lot of times I'll get PDF documents that were either a "save as PDF" or a "print to file" PDF, not a scanned image file that would take OCR to decode; I'm not talking about that. I'm talking about when you save a text file or a Word file to PDF, or even from LaTeX. For those types of documents, pdftotext will do exactly what it says: it'll turn a PDF document into a text file and just strip the text out. One thing that I like to do on top of that is use an option called -layout (a single dash, then "layout"). That makes pdftotext do its best job of trying to reproduce the layout of the PDF document. If you don't do that and, say, the title is centered, it'll just put the title left-justified; but if you use -layout, it'll put spaces at the beginning to make it look like it's centered. Or if you have a table, and this is what it's really great for, tables, it'll try to format tables by lining up the whitespace and making it look like a table. Very useful, and especially useful in bulk: if I run pdftotext over all the PDFs in a directory, it'll make a whole directory full of .txt files.

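A minimal sketch of the bulk case; pdftotext takes one input file at a time, so a shell loop does the directory:

    # Convert every PDF in the directory, preserving the visual layout
    for f in *.pdf; do
        pdftotext -layout "$f"    # report.pdf -> report.txt alongside it
    done
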
Why is that good? Well, because you can do a lot of other things with text that you can't do with PDFs, like some of the other commands that I'm going to go over, such as ack and grep and sed. Now, I'm not going to go into sed, because we've had many wonderful episodes about sed already, and if you haven't heard any you should go back, because I've learned a lot from them and I've been using it in my work. So it's very useful. Ack and grep do similar types of things. Whereas the command ls lists the names of files, if you do grep and then a word or pattern (a regex pattern) and then a file, or a group of files, or a star for the directory, or *.txt, whatever the case may be, it'll look inside every document for that text. And then there are different options. With grep, for instance, you can do -A 5 with whatever the pattern is on *.txt, and it'll find that pattern plus the five lines after it. So if you know that, in the documents you're looking at, the text you want is the five lines below some pattern, you can do that with -A and a number; and -B and a number gives you the lines before the match (I switched those up at first: -A is "after", the lines below the match, and -B is "before", the lines above). It has to be the capital letter; the lowercase letters have a different meaning, and you can read the man page to learn about them. That's what I use grep for most; if I want to do -A or -B I'll use grep, otherwise I'll use ack.

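A minimal sketch of those context flags (the pattern is hypothetical):

    # Show each match plus the five lines after it
    grep -A 5 'Invoice Total' *.txt

    # Show the five lines before each match instead
    grep -B 5 'Invoice Total' *.txt
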
I also like ack because I use Vim, and there's a plugin called ack.vim that I use inside of Vim to run ack. So if I cd into a directory and open a file in Vim, I can run ack with the pattern I want, and it gives me a quickfix menu with all the files that match and all the lines, and I can go and look at every file that matches. It does the same thing on the command line: you type ack and the pattern, whether it's just a word or any other regex pattern, and it'll look in there, find it, and print it to the screen, or you can send it to a file if you redirect, like we've learned in other bash tips. So I really, really enjoy ack. The man page describes it as a grep alternative for programmers, and even if you're not a programmer, it makes you feel good to be considered like a programmer for using this tool; a little ego boost.

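A minimal sketch (the pattern is hypothetical; ack searches the current directory tree by default, and :Ack comes from the ack.vim plugin):

    # On the command line: search every file under the current directory
    ack 'lamp server'

    # Inside Vim, with the ack.vim plugin installed
    :Ack lamp server
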
Now, one of the things I like to do is use these tools in combination. Say I need to find out whether a certain phrase, or word, or group of words is something I've already written down in documentation, and the documents are PDFs. I'll pdftotext all the files, then ack for the name or the subject I'm looking for, find out which file it's in, and actually open up that file and read what I want. I do this with my notes: I take a lot of notes, for work, for things I learn in my spare time, my shopping list, and being able to just ack for "vegetables" on my notes directory and find where I wrote down vegetables is really great. But it has lots of great uses for work, too.

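As a sketch, the whole notes workflow might look like this:

    # 1. Turn the PDFs into searchable text
    for f in *.pdf; do pdftotext -layout "$f"; done

    # 2. Find out which file mentions the thing you're after
    ack 'vegetables'
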
So another tool that I use a lot is awk, and I'm pretty sure there have been some episodes on awk as well. It's a great tool for parsing data that's in a structured format with field separators and columns, so you can use it on tab-delimited files, pipe-delimited files, comma-delimited files; you can define the field separator. I'm not going to go into it exhaustively here, because it's a really big topic; awk is really big, and if there's a need I can make a whole series about awk, like there's a whole series about sed. I've probably just talked myself into doing it, so you might want to stay tuned for that eventually. But just like anything else, it's a way to get at your data in columns: if I see a CSV file, instead of trying to open it in LibreOffice I'll just awk it, and if I just want the third column, I'll just take the third column out. And I use it in bash scripts.

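A minimal sketch (the file names are hypothetical; this naive approach assumes no quoted commas inside fields):

    # Print the third column of a comma-delimited file
    awk -F',' '{ print $3 }' data.csv

    # Same idea with a pipe as the field separator
    awk -F'|' '{ print $3 }' data.txt
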
One bash script that I like to use awk with involves another couple of tools, wget and curl, which are command-line utilities for interfacing with the internet. So here's a little project for you: I needed to get a list of words and the definitions of them. What do I do? I look at something like the Merriam-Webster dictionary, and I see the pattern of the URL, where there's a part of the URL where the word you're looking for goes. So I say, OK, I'll just write a big CSV file, or a big text file, with the words I'm looking for, and do a for loop on every line: do a wget of the web page, with a substitution that puts the word from that line into the URL. So for every line, look for this word, and put that word in the URL. Now, a lot of these sites are tricky, so you might have to put in a sleep 10, or a sleep 60, or whatever you need; you might have to make it pause for a few seconds before sending more requests, because a lot of these sites don't want you to do this, so they'll put in something like a delay, and if you go too fast, all you'll get in the HTML file from one of the wgets is something like "don't use robots to download our data". It's not that it's illegal; it's just something they don't want you to do. And then, you know, obviously, reference away: "I got this from Merriam-Webster".

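A minimal sketch of that loop (the URL pattern is an assumption; substitute whatever pattern the site actually uses):

    # words.txt holds one word per line
    while read -r word; do
        wget -O "${word}.html" "https://www.merriam-webster.com/dictionary/${word}"
        sleep 10    # be polite: many sites throttle or block rapid-fire requests
    done < words.txt
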
One of the things I like to do, either in sed or inside of Vim, instead of using unix2dos or dos2unix, is a sed command or a Vim regular-expression substitution to take out the special characters. So it'd be: sed, dash i, then inside single quotes, s, forward slash, caret capital M (the ^M carriage-return character, which you type as Ctrl-V Ctrl-M), forward slash, forward slash, g, then close the single quote, star dot txt:

    sed -i 's/^M//g' *.txt

That'll take out all the DOS end-of-line characters. Another way to take them all out is: sed, dash i, single quote, s, forward slash, backslash r (which is a return key), slash, slash, g, close the quote, star dot txt:

    sed -i 's/\r//g' *.txt

That does something similar. Another tricky one is getting rid of the last line, and I do that in Vim more than anything else, because sometimes there's an end-of-file character that you have to parse out; I'm not going to go into that. And then another thing that I use a lot is Pandoc.

As I've said in another podcast, I really enjoy writing in plain text, writing in Markdown. One, because I can do cool things like grep an entire directory, or ack the entire directory, and find out where I wrote down how to set up a LAMP server when I can't remember which file it's in. But also because I just really hate messing around with formatting, and when I do mess with it, I get distracted making sure it's all perfect instead of focusing on the text I'm actually writing. Doing something like making an H1 header with a single hash mark means I don't have to worry about whether, the next time I want a heading one, it's lined up correctly or it's the right color. And I've learned from some of the other episodes how to be a good template creator, and that made me more into using Markdown, because now that I know there's a good way to do it, I'd happily spend 15 or 20 minutes just making a good template before I even start writing. So if I just need to write, I will just write in Markdown.

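A minimal sketch of the Pandoc side (the file names are hypothetical; the output format is inferred from the extension, and PDF output additionally needs a LaTeX engine installed):

    # Markdown notes to a standalone HTML page
    pandoc -s notes.md -o notes.html

    # Markdown to a Word document, for when someone insists
    pandoc notes.md -o notes.docx
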
Like I said, curl and wget: I'm not going to get into those in depth, there's lots of stuff online about curl and wget. But one thing I do: I usually use wget if I just want to bring down a .html file, and I'll use curl if I want to work with something like a REST API. I'll do curl -X, you know, put in -H for the headers, add the headers, then send it with -X POST and get the data back. And curl works with plain authentication and other authentication methods, so if you need to log into something, find a file on an SFTP server, and bring it down, you can do it through curl, and that way you can script it and automate it.

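A minimal sketch (the endpoint, header, and payload are hypothetical):

    # POST some JSON to a REST API and print the response
    curl -X POST \
         -H 'Content-Type: application/json' \
         -d '{"query": "milliliter"}' \
         https://api.example.com/v1/lookup
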
All right. Keeping with the idea of working with text a lot, I'm going to go over just a couple of Vim tricks that I use. I'm not going to go into all of them; Vim is a lot of things. One of the things that I like to do is open up a list of files that I want to work with. So, say I'm in a directory with a bunch of text files: vim *.txt will load all of those files into the argument list of Vim, and then you can work with them in a very cool way. You hit colon, vim, space, then forward slash, the pattern, then slash, then space, and then, what is it, the pound sign twice: :vim /pattern/ ## (the ## stands for every file in the argument list). What that does is look for the pattern in everything in your argument list, and it puts the matches into a little quickfix list at the bottom. Then you can do :cnext and so on; I use a plugin called unimpaired, so I don't remember the exact commands, I just use the shortcuts that plugin gives you. But you can go through every single document. Instead of searching just one document, where you hit n and it goes to the next match, and the next, and the next, here, when you get to the end of the matches in one document and go next one more time, it moves on to the next document and starts looking in that one, and then the next one, and the next one. That has been such a time saver. I've been able to take projects where other people do the same work, and it'll take me like a third of the time, because I can search text so fast, find the things I'm looking for, make changes on the fly and keep going, or just see whether something exists or not, and add it if it doesn't. It's been such a huge time saver.

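A minimal sketch of that session ("lamp server" is a hypothetical pattern):

    vim *.txt

    :vim /lamp server/ ##    " :vim is short for :vimgrep; ## expands to the argument list
    :copen                   " open the quickfix list of matches
    :cnext                   " step through matches, rolling over into the next file
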
Like I mentioned before, there's the ack plugin, and then a similar thing to the arglist search is searching the buffers in place. There's a command called :bufdo (b-u-f-d-o). If you know that there's something that's in all these files, and you don't want to use sed because you're already in them, you already have, say, five files open, you can do colon bufdo, then space, percent sign, s, forward slash, the pattern, slash, the replacement, slash, ge, and then space, pipe, space, update. It'll go into all those files, make the replacement, and re-save them with the replacement in it. A very cool thing to do. So that's all the Vim tricks I'm going to go over right now.

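Spelled out, that dictated command is:

    :bufdo %s/old term/new term/ge | update

The e flag keeps buffers with no match from raising an error, and update only writes a buffer back to disk if it actually changed.
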
The two languages I program in most of the time when I have to do scripting are R and Python. I'm not going to go into how I use them, because it's a really big topic, but I will go into just some of the modules that I use. For R, the libraries I use that are useful are RCurl, which is a curl interface for when you're in R; rvest, which is similar to Beautiful Soup in Python (which I'll talk about), a way to scrape data off a web page; XML, another way to scrape data off a web page; and xlsx, which is a way to get data out of Excel files. One thing about working with some of these packages, though, is that they have dependencies on your operating system: for xlsx you need to have rJava installed, which means you have to have Java installed, and XML requires libxml2. You don't notice these things when you're on a Windows system (I've used them on Windows before) because the dynamic libraries are baked into the binary when you install the module, but on Unix-like systems, install the dependencies first. So yeah, those three are really useful. One thing about xlsx: it'll bring an entire worksheet into a data frame, and you can say "I want to look at the third worksheet in the workbook, and just the first 25 rows", or you can say "for every worksheet in the workbook, do this". Automating that kind of stuff is really fun. I've started doing these things a lot more instead of using Visual Basic, because I hate writing macros, that's it; and the LibreOffice-to-MS-Office round trip is just not good when it comes to macros.

This is reproducible. Some of the Python libraries I use: Beautiful Soup, like I said before, is a great Python library for scraping data from web pages. NLTK is the Natural Language Toolkit; there's another one called tm in R that's similar (tm stands for text mining). tm in R and NLTK in Python are both natural language processors. There's no way I have time to go into natural language processing, but I'll put a couple of good YouTube videos in the show notes if you want to learn about it, because it's a big topic, and it's the way search engines, contextual search, and a lot of these things work; in combination, a lot of the time, with hidden Markov models, that's how a lot of this dynamic, contextual search stuff works in search engines. The re Python library is just the regular expression library; it's good if you need to work with regular expressions in Python. rdflib: I work with creating taxonomies and ontologies sometimes, so if you're going to deal with the semantic web, and you don't want to deal with something like Semaphore or one of these big, huge platforms, you don't want to spend money on that but you already have the files, rdflib is a great way to interact with those files, or with the server. There's another module called SPARQLWrapper (S-P-A-R-Q-L wrapper), and it's a subset of the rdflib commands that's a little bit easier to work with, but I like working with rdflib straight for most things. And then csv: csv is a built-in Python library that lets you work with CSV files. It's pretty much similar to awk; you can work with any type of character-separated file, and it works with Excel files too.

And so those are some of the tools I use. Some other tools that I use, which are Java web clients or web servers, are OpenRefine and friends. OpenRefine (openrefine.org), if you haven't seen it, was a project by Google; it's open source, and it's a way to visualize and clean up dirty data. You can import something big (it can interface with a database, or just an Excel file) and it'll read it in, and it can cluster terms that are similar. So if there are things like capitalization differences or misspellings, you can start to cluster those words together and say "that old word is now this word". It's just a visual way of helping you clean up large amounts of data really quickly. Another tool to use with that is reconcile-csv. It's actually a plugin for OpenRefine, and it's also a Java web client; you basically run it as a little daemon while you're running OpenRefine, and you can interact with it. What it does is let you take the data that you just cleaned up, and a standard data set, and compare the two data sets; it'll do a job of matching the thing you just cleaned up to whatever standard you choose, and it'll give you a probability score of how good that match is. I mean, it's a really niche type of operation, but if you have to do this type of work, it's so useful. Say there were five different ways people spelled Mississippi, because there are so many letters, and you did a job in OpenRefine clustering all the similar ones to a single spelling of Mississippi. Then you can take, say, a CSV file with all the states, compare it, and have it match Mississippi over here to Mississippi over there. And it doesn't have to be a single word; it can even be a sentence. For Mississippi it's pretty easy: there'd be a one-to-one match, so your score would be a one, and by default anything above 0.8, I think it says, is a good match. So if you're looking at all of the data that the units of measure consortium compiles, and you have a whole bunch of units of measure in a system that you're working with, you can first clean up the units of measure (because people will spell milliliter, or centiliter, or whatever, differently), clean it up first, compare it to the standard, and now you can say: OK, for sure, what this system calls milliliter is the national standard "mL". And then, there you go.

That's just a small use case. The last tool is also a Java applet... not applet, sorry, a Java web server. With all these Java web servers, all you have to do is run java with the name of the file, and they pretty much just run; there's not too much to it.

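For instance, a minimal sketch (the jar file names here are hypothetical; use whatever the project ships):

    # Tabula: starts a local web UI that you open in your browser
    java -jar tabula.jar

    # reconcile-csv: serves a reconciliation endpoint for OpenRefine;
    # data.csv is the standard data set, matched on its "name" column, keyed by "id"
    java -jar reconcile-csv-0.1.2.jar data.csv name id
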
This one is called Tabula. If you've ever had a PDF and you wanted to take a table, or a couple of tables, out of that PDF and put it into something like a CSV file or an Excel file, Tabula does exactly that. You go to it in your web browser and upload the file (it's not really uploaded; it just sees where it is in your directory, because it's local), and then it'll analyze it. You basically just drag and drop over the area that you want analyzed: you highlight the area on the screen that looks like the table, hit start, and it'll extract the data out of it and put it into a CSV file or an Excel file. It took me forever to find this, but once I found it I thought, this is a keeper, because so much of what I do is working with people who give you PDF files and expect you to be able to use them: "this is our documentation, so put this in our database". It's like, well, it's not structured data, but you can structure it in this type of way. Very useful. I've suggested it to other people, and now other people are using it too.

So those are my tricks that sit outside of the command-line tools, but yeah, that's going to do it for this episode. If you have any questions, or any further suggestions, you can go ahead and look for me on Hacker Public Radio; I'm there. You can also find me on Twitter, at our young 29, and stay tuned for more Hacker Public Radio. Goodbye.

You've been listening to Hacker Public Radio at hackerpublicradio.org. We are a community podcast network that releases shows every weekday, Monday through Friday. Today's show, like all our shows, was contributed by an HPR listener like yourself. If you've ever thought of recording a podcast, then click on our Contributing page to find out how easy it really is. Hacker Public Radio was founded by the Digital Dog Pound and the Infonomicon Computer Club, and it's part of the binary revolution at binrev.com. If you have comments on today's show, please email the host directly, leave a comment on the website, or record a follow-up episode yourself. Unless otherwise stated, today's show is released under a Creative Commons Attribution-ShareAlike 3.0 license.