Initial commit: HPR Knowledge Base MCP Server
- MCP server with stdio transport for local use
- Search episodes, transcripts, hosts, and series
- 4,511 episodes with metadata and transcripts
- Data loader with in-memory JSON storage

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
hpr_transcripts/hpr4312.txt
Episode: 4312
Title: HPR4312: What Is The Indie Archive?
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr4312/hpr4312.mp3
Transcribed: 2025-10-25 22:49:51

---
This is Hacker Public Radio Episode 4,312 for Tuesday, 11 February 2025. Today's show is entitled "What Is The Indie Archive?" It is part of the series Programming 101. It is hosted by Harry Larry, and is about 16 minutes long. It carries a clean flag. The summary is: the Indie Archive is an archival solution for indie producers.
What is the Indie Archive?

I'm Harry Larry, and you're listening to the Plain Text Programs podcast. The Indie Archive is an archival solution for indie producers. Since most indie producers run on a shoestring budget, it's important that the Indie Archive is inexpensive to install and run. It's especially important that monthly expenses are minimal, because an expense that is reasonable most months will sometimes be more than an indie producer can afford during some months. The first major constraint is cost. So I'll be talking about prices a lot in this podcast, and get more technical in future podcasts about the Indie Archive.
Indie Archive is an archival system, which is different from a backup system. If you don't have a backup system, do that first. My backup system uses the same tools as Indie Archive, rsync and rsnapshot. My brother uses the online backup service Carbonite. There are many other options. A good backup system runs automatically to back up everything frequently, and preserves version history. It's also good to have backups off-site.
An archival system, like Indie Archive, keeps multiple redundant copies across several hard drives on several systems in multiple locations. An archival system also checks file integrity as protection against file corruption or user error. When you have a project you really never want to lose, like a finished novel, a music album, a video, or any other major effort that involves significant work, that's when you need an archival system. So the Indie Archive does not automatically back up your projects every day. That's what your backup system should do.
The Indie Archive is an archival system where the producer of the content decides what needs to be archived and when it needs to be archived, and then manually moves the directory containing the files onto the Indie Archive, carefully preserving the files' metadata during the transfer. Then these files are propagated over at least seven hard drives on four different systems in three locations. File integrity checks are run daily, comparing the files and reporting discrepancies.
Two of the systems are kept in the studio where the content is produced. I call them the primary and secondary systems. They have a boot drive and two data drives each. One of the systems is kept off-site at a nearby location. I call it the remote system. It also has a boot drive and two data drives. If you have a more distant location where you can put a second remote system, you can have remote near and remote far systems. Otherwise, the final system is somewhere in the cloud, provided by a professional data storage provider. It has a single copy of the data and usually some additional data retention. The provider makes the backups of this data. This is the part that might involve a monthly bill. So depending on the size of your file set, it could be free or it could cost so much a month. There are a lot of options for cloud storage providers. But first, I'm going to discuss the three systems, primary, secondary, and remote, and how they function.
As far as the hardware goes, the systems are the same. Now, I'm a Linux guy and I do all my production work on Linux, so I'm using Linux. I want to test the system on several versions of Linux and with BSD. I'm not a Mac guy or a Windows guy, so I won't be going there. The software is open source and the required programs run on all three platforms, so I'll let a Mac or Windows programmer test the Indie Archive for their systems. My guess is that the Mac fork will be easier than the Windows fork because of the file metadata. It might even be possible to add Mac folders to the Indie Archive running Linux, but I'll let someone who actually has a Mac figure that out. I don't think the same is true for Windows. Windows file metadata is different, so if you want to preserve the metadata, you will probably have to install Indie Archive on Windows systems. So I'm developing and deploying on Linux, and I will also test on BSD. So far I have tested Debian, Ubuntu, FreeBSD, MidnightBSD, and Xubuntu, and the Indie Archive works fine on all of these operating systems.
So back to the hardware: pretty much any older system that will support at least three SATA drives will work. I'm using older business desktops, Dell and HP. I pulled mine out of storage, but they are very inexpensive to buy if you're not like me with a shed full of old computer stuff. I just bought a small form factor HP desktop on eBay for $30, including tax and shipping. To clarify, it's best if the primary system supports four SATA drives. The secondary and remote systems do not need an optical drive, so they should support three SATA drives, but they can be run on two SATA drives if you boot from the files drive. I'm currently testing a remote system with two SATA drives running MidnightBSD. The Dell desktops make a big deal about being green. I am open to suggestions on what would be the best energy-efficient systems for the Indie Archive, because of both the cost of electricity and the impact on the environment.
There are three drives on each system: a boot drive and two data drives. The boot drives can be SSDs or spinning hard drives and need to be big enough to hold the OS comfortably. The data drives need to be large enough to hold the files you want to archive, and they should be high-quality spinning drives. I use the multi-terabyte HGST drives, and I am also looking at some Dell drives made by HGST. There will be a data drive and a snapshot drive on each system. If they are not the same size, the snapshot drives should be larger. I am testing with three-terabyte data drives and four-terabyte snapshot drives. Besides the main data set that is being archived, the snapshot drives also hold the version history of the files that have been deleted or changed, so that's why they should be the larger drive.
So my primary system has a primary files directory with a three-terabyte drive mounted to it, and a primary snapshots directory with a four-terabyte drive mounted to it. Same for the secondary and remote systems.
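As an illustration only, a minimal sketch of how the two data drives might be mounted on the primary system; the directory names and UUID placeholders below are assumptions, not the author's actual layout:

    # Create the mount points (names are hypothetical)
    sudo mkdir -p /indiearchive/primaryfiles /indiearchive/primarysnapshots

    # Find the drive UUIDs, then add entries like these to /etc/fstab:
    sudo blkid
    #   UUID=<data-drive-uuid>      /indiearchive/primaryfiles      ext4  defaults  0 2
    #   UUID=<snapshot-drive-uuid>  /indiearchive/primarysnapshots  ext4  defaults  0 2

    # Mount everything listed in fstab and confirm
    sudo mount -a
    df -h /indiearchive/primaryfiles /indiearchive/primarysnapshots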
Now, so far I only had to buy one drive, but generally speaking, the six data drives will be the major expense in assembling the systems. So a good bargain on six four-terabyte drives could be $120 used or $270 new, and this is the most expensive part. I install used HGST drives all the time and rarely have problems with them. I have worked for clients who won't buy used, only new. Since the file integrity checks should give early warning of drive failure, and since there is a seven-drive redundancy on the data files, if I were buying drives for the Indie Archive, I'd go with six used four-terabyte HGST drives for $120. There is no reason not to use drives all the same size, as long as the snapshot drives are large enough.
The size of data drives you need depends on the size of your projects and the time it takes to do a project. Look at the hard drives on your working systems. Think about what directories you would like to see in archival storage. What is the total size of these directories? Check how many gigabytes these projects have consumed in the last year. Think forward a few years. Assume you will use more disk space in the future than you are using now. Do some quick arithmetic and make a decision.
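For the quick arithmetic, something like du on the working system gives the totals; the project paths here are made-up examples:

    # Per-directory sizes for the projects you would want archived
    du -sh ~/projects/novel ~/projects/album ~/videos/documentary

    # Same list with a grand total on the last line
    du -shc ~/projects/novel ~/projects/album ~/videos/documentary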
Like I said, I only had to buy one drive so far, because I'm weird and I had a bunch of three-terabyte drives available. If I had to buy drives, I probably would have tried to start larger. I am sure that at some point in the not-too-distant future, when I am running the Indie Archive and not developing it, I will have to upgrade my drives.
The primary system is the console for the Indie Archive. When you copy a project onto the Indie Archive, the directory goes into the primary files directory. From there, it is propagated out to the primary snapshots directory, the secondary system, the cloud storage if you are using it, and eventually to the remote system.
All of the data propagation is done with rsync, using the archive setting that is designed to preserve file metadata like owner, permissions, and date last modified. So I have been using rsync with the archive setting to move the files from the work system to a USB drive, and from the USB drive to the primary files folder.
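A minimal sketch of that two-hop copy, assuming hypothetical paths (a project called novel and a USB drive mounted at /media/usb):

    # Work system -> USB drive, with -a (archive) preserving permissions and timestamps
    rsync -a --progress ~/projects/novel/ /media/usb/novel/

    # Primary system: USB drive -> primary files directory, run as root so ownership is kept
    sudo rsync -a --progress /media/usb/novel/ /indiearchive/primaryfiles/novel/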
At first I thought I would use an optical disc to move the files, but optical discs do not preserve file metadata. Also, I had some weird results with the USB flash drive because it was formatted FAT32. FAT32 does not support Linux metadata, so if you are going to move projects over on a flash drive or a USB external drive, be sure to format it to ext4.
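Formatting a transfer drive to ext4 might look like this; /dev/sdX1 is a placeholder, so check the device name carefully before running it:

    # Identify the USB drive first -- formatting erases everything on it
    lsblk
    sudo mkfs.ext4 -L indiexfer /dev/sdX1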
Another way to move projects over to the primary files directory is with tar compression. This preserves metadata when the files are extracted, so this might be easier, and it works with optical drives. If your directory will fit on an optical disc, this also gives you another backup on another medium. If you have any suggestions on how to transfer projects while preserving the file metadata, let me know.
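A sketch of the tar route, again with made-up paths; extracting as root restores the stored ownership as well as permissions and timestamps:

    # On the work system: pack the project, keeping paths relative to ~/projects
    tar -czf novel.tar.gz -C ~/projects novel

    # On the primary system: extract with -p to preserve permissions
    sudo tar -xpzf novel.tar.gz -C /indiearchive/primaryfiles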
I know that there are network options available, but I am hesitant to recommend them, because if I can transfer files from a system to the primary system over the LAN, then anyone can do the same. Or delete files, or accidentally delete directories. I kind of want to keep tight control over access to the primary system. It kind of ruins the archival quality of the Indie Archive if anyone on the LAN can accidentally mess with it. So I am open to dialogue on these issues. I am kind of worried; I want it to be easy to add projects to the Indie Archive, but not too easy, if you know what I mean. I feel like having to sit down at the primary system and enter a password should be the minimum amount of security required to access the primary system.
The primary system also runs file integrity checks daily from a cron job. All of the propagation and file integrity scripts have to be run as root to preserve the metadata, since only root can write a file that it doesn't own.
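The cron side might look something like this in root's crontab; the script names, paths, and times are placeholders, not the actual Indie Archive scripts:

    # Edit with: sudo crontab -e
    # Propagate at 02:30, run the integrity check at 04:30, appending output to a log
    30 2 * * *  /usr/local/bin/indie-propagate.sh >> /var/log/indiearchive.log 2>&1
    30 4 * * *  /usr/local/bin/indie-check.sh     >> /var/log/indiearchive.log 2>&1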
The secondary system is the SSH server for the Indie Archive. The primary system logs into the secondary system as root using SSH. Security is managed with public and private keys, so entering a password is not required. After the keys are set up for both the primary and remote systems, password authentication is disabled for the SSH server, so only those two systems can SSH into the secondary system.
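A minimal sketch of that key setup, run as whichever user makes the connection (root in this design); the hostname secondary.local is a placeholder:

    # On the primary system (and again on the remote system): create a key pair
    ssh-keygen -t ed25519 -f ~/.ssh/id_indiearchive

    # Install the public key for root on the secondary system
    ssh-copy-id -i ~/.ssh/id_indiearchive.pub root@secondary.local

    # Then, in /etc/ssh/sshd_config on the secondary system:
    #   PasswordAuthentication no
    #   PermitRootLogin prohibit-password
    # and reload or restart the SSH service afterwards.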
When the propagation script is run on the primary system, rsnapshot is used to create a current version of the primary files directory in the primary snapshots directory. Then the primary system uses rsync over SSH to make a copy of the primary files directory to the secondary files directory. Then the primary system logs onto the secondary system as root, and rsnapshot is used to create a current version of the secondary files directory in the secondary snapshots directory.
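Put together, the propagation step might look roughly like this sketch; the config file names, directory paths, and hostname are assumptions, not the published scripts:

    #!/bin/sh
    # 1. Snapshot primary files into the primary snapshots directory
    rsnapshot -c /etc/rsnapshot-primary.conf alpha

    # 2. Mirror primary files to the secondary system, preserving metadata
    rsync -a --delete /indiearchive/primaryfiles/ \
        root@secondary.local:/indiearchive/secondaryfiles/

    # 3. Snapshot the secondary files directory on the secondary system
    ssh root@secondary.local rsnapshot -c /etc/rsnapshot-secondary.conf alpha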
Finally, if cloud storage is being used, the primary system uses gcloud rsync to make a copy of the primary files directory to a Google Cloud Storage archive bucket. I have this bucket set to 90-day soft delete. If you are using another type of cloud storage on Google, AWS, Mega, or other storage providers, this command will have to be adjusted. The reason I chose the gcloud archive bucket is because of the storage cost per gigabyte. They have the cheapest cost per gigabyte that I found. This will keep the monthly bill low.
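The cloud hop might be a single command along these lines; the bucket name is a placeholder, and older gcloud installs would use gsutil rsync -r instead:

    # Mirror primary files into an Archive-class Cloud Storage bucket
    gcloud storage rsync --recursive /indiearchive/primaryfiles gs://example-indie-archive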
Once a day the primary system runs the file integrity check from a cron job, using rsync to compare the primary files directory to the current version, alpha.0, in the primary snapshots directory, logging any discrepancies. It then does the same, comparing primary files to secondary files and to the current version in the secondary snapshots directory, logging discrepancies and notifying the maintainer of any discrepancies. Notification is done by email using curl and an SMTP provider.
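A stripped-down sketch of one such check; the paths, the snapshot layout, and the mail addresses are all placeholders:

    #!/bin/sh
    LOG=/var/log/indiearchive-check.log

    # Dry-run rsync with checksums: itemizes differences without changing anything
    rsync -rcn --delete --itemize-changes \
        /indiearchive/primaryfiles/ \
        /indiearchive/primarysnapshots/alpha.0/primaryfiles/ > "$LOG"

    # If the log is non-empty, mail it to the maintainer via an SMTP provider
    if [ -s "$LOG" ]; then
        curl --url smtps://smtp.example.com:465 \
             --mail-from archive@example.com --mail-rcpt maintainer@example.com \
             --user 'archive@example.com:app-password' --upload-file "$LOG"
    fi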
The remote system runs on its own schedule, logging into the secondary system daily to copy data from secondary files to remote files, and then using rsnapshot to make a copy of remote files in the remote snapshots directory. Since it's run on a daily schedule, it uses rsnapshot with the standard daily, weekly, monthly, and yearly backups. The remote system also runs a daily file integrity check, comparing remote files to the current version in remote snapshots and comparing remote files to both data directories on the secondary system, again logging the results and notifying the maintainer of any discrepancies.
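The remote schedule could be expressed with rsnapshot's standard intervals, roughly like this hypothetical excerpt (rsnapshot requires tabs between fields in its config file):

    # /etc/rsnapshot-remote.conf (excerpt)
    retain	daily	7
    retain	weekly	4
    retain	monthly	12
    retain	yearly	5

    # root's crontab on the remote system
    0 3 * * *	rsnapshot -c /etc/rsnapshot-remote.conf daily
    0 5 * * 1	rsnapshot -c /etc/rsnapshot-remote.conf weekly
    0 7 1 * *	rsnapshot -c /etc/rsnapshot-remote.conf monthly
    0 9 1 1 *	rsnapshot -c /etc/rsnapshot-remote.conf yearly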
If there is an outward-facing static IP at the location with the primary and secondary systems, then the remote system can use that static IP to SSH into the secondary system. If there is not a static IP, then the remote system uses a Duck DNS subdomain to log onto the secondary system. Any system using the same router as the secondary system can run a cron job to update Duck DNS with the current IP address. Since a static IP is a monthly expense, it's important that there's an alternative that does not require paying another bill.
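The Duck DNS update is just a small cron job hitting their documented update URL; the subdomain and token here are placeholders:

    # On any machine behind the same router as the secondary system, in a crontab:
    */5 * * * *  curl -s "https://www.duckdns.org/update?domains=myarchive&token=YOUR-TOKEN&ip=" >/dev/null 2>&1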
So the secondary system has the SSH server, but it doesn't really do much. Both of the other systems connect to it and use it as a junction for data propagation and file integrity checks.
So as you can tell, there's a lot going on to make the Indie Archive work. Future podcasts will get down into the details and discuss some of the choices I had to make and why I made them. The funny thing about this project is that the actual code was the least amount of work. Figuring out exactly how rsync and rsnapshot work together was quite a bit of work. The configuration for both rsnapshot and SSH took a bit of head scratching. Then there were a few user ID tricks I had to work through to make the Indie Archive usable. But by far, the most work was writing the Indie Archive installation document, detailing each step of installing the software on three systems.

It's been fun so far. If you have input, I always appreciate the help. I get quite a bit of help on Mastodon. If you go to home.gamerplus.org, you'll find the script for this podcast with the Mastodon comment thread embedded in the post. This podcast is being read from a document that is a work in progress. Current versions of the What Is The Indie Archive document will be posted at Codeberg when I'm ready to upload the project. Thanks for listening.
You have been listening to Hacker Public Radio at HackerPublicRadio.org. Today's show was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, then click on our contribute link to find out how easy it really is. Hosting for HPR has been kindly provided by AnHonestHost.com, the Internet Archive, and rsync.net. Unless otherwise stated, today's show is released under a Creative Commons Attribution 4.0 International license.