Episode: 4312
Title: HPR4312: What Is The Indie Archive?
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr4312/hpr4312.mp3
Transcribed: 2025-10-25 22:49:51
---
This is Hacker Public Radio Episode 4312 for Tuesday 11 February 2025.
Today's show is entitled, What is the Indie Archive?
It is part of the series programming 101.
It is hosted by hairylarry, and is about 16 minutes long.
It carries a clean flag.
The summary is: the Indie Archive is an archival solution for Indie producers.
What is the Indie Archive?
I'm hairylarry, and you're listening to the Plain Text Programs podcast.
The Indie Archive is an archival solution for Indie producers.
Since most Indie producers run on a shoestring budget, it's important that the Indie Archive
is inexpensive to install and run.
It's especially important that monthly expenses are minimal, because a reasonable expense
in most months will sometimes be more than an Indie producer can afford in some months.
The first major constraint is cost.
So I'll be talking about prices a lot in this podcast, and get more technical in future
podcasts about the Indie Archive.
Indie Archive is an archival system which is different than a backup system.
If you don't have a backup system, do that first.
My backup system uses the same tools as Indie Archive, rsync and rsnapshot.
My brother uses the online backup service Carbonite.
There are many other options.
A good backup system runs automatically to back up everything frequently, and preserves
version history.
It's also good to have backups off-site.
An archival system, like Indie Archive, keeps multiple redundant copies across several
hard drives on several systems in multiple locations.
An archival system also checks file integrity as protection against file corruption or
user error.
When you have a project you really never want to lose, like a finished novel, a music
album, a video, or any other major effort that involves significant work.
That's when you need an archival system.
So the Indie Archive does not automatically backup your projects every day.
That's what your backup system should do.
The Indie Archive is an archival system where the producer of the content decides what
needs to be archived and when it needs to be archived, and then manually moves the directory
containing the files onto the Indie Archive carefully preserving the files metadata during
the transfer.
Then these files are propagated over at least seven hard drives on four different systems
in three locations.
File integrity checks are run daily, comparing the files and reporting discrepancies.
Two of the systems are kept in the studio where the content is produced.
I call them the primary and secondary systems.
They have a boot drive and two data drives each.
One of the systems is kept off-site at a nearby location.
I call it the remote system.
It also has a boot drive and two data drives.
If you have a more distant location where you can put a second remote system, you can
have remote near and remote far systems.
Otherwise, the final system is somewhere in the cloud, provided by a professional data
storage provider.
It has a single copy of the data and usually some additional data retention.
The provider makes the backups of this data.
This is the part that might involve a monthly bill.
So depending on the size of your file set, it could be free or it could cost so much
a month.
There are a lot of options for cloud storage providers.
But first, I'm going to discuss the three systems, primary, secondary, and remote, and
how they function.
As far as the hardware goes, the systems are the same.
Now I'm a Linux guy and I do all my production work on Linux, so I'm using Linux.
I want to test the system on several versions of Linux and with BSD.
I'm not a Mac guy or a Windows guy, so I won't be going there.
The software is open source and the required programs run on all three platforms, so I'll
let a Mac or Windows programmer test the indie archive for their systems.
My guess is that the Mac fork will be easier than the Windows fork because of the file metadata.
It might even be possible to add Mac folders to the indie archive running Linux, but
I'll let someone who actually has a Mac figure that out.
I don't think the same is true for Windows.
Windows file metadata is different, so if you want to preserve the metadata, you will
probably have to install indie archive on Windows systems.
So I'm developing and deploying on Linux and I will also test on BSD.
So far I have tested Debian, Ubuntu, FreeBSD, MidnightBSD, and Xubuntu, and the indie
archive works fine on all of these operating systems.
So back to the hardware, pretty much any older system that will support at least three
SATA drives will work.
I'm using older business desktops, Dell and HP.
I pulled mine out of storage, but they are very inexpensive to buy if you're not like
me with a shed full of old computer stuff.
I just bought a small form factor HP desktop on eBay for $30, including tax and shipping.
To clarify, it's best if the primary system supports four SATA drives.
The secondary and remote systems do not need an optical drive, so they should support
three SATA drives, but they can be run on two SATA drives if you boot from the primary
file drive.
I'm currently testing a remote system with two SATA drives running midnight BSD.
The Dell desktops make a big deal about being green.
I am open to suggestions on what would be the best energy efficient systems for the indie
archive because of both the cost of electricity and the impact on the environment.
There are three drives on each system, a boot drive and two data drives.
The boot drives can be SSD or spinning hard drives and need to be big enough to hold
the OS comfortably.
The data drives need to be large enough to hold the files you want to archive and they
should be high quality spinning drives.
I use the multi terabyte HGST drives and I am also looking at some Dell drives made
by HGST.
There will be a data drive and a snapshot drive on each system.
If they are not the same size, the snapshot drives should be larger.
I am testing with three terabyte data drives and four terabyte snapshot drives.
Besides the main data set that is being archived, the snapshot drives also hold the version
history of the files that have been deleted or changed, so that's why they should be
the larger drive.
So my primary system has a primary files directory with a three terabyte drive
mounted to it and a primary snapshot directory with a four terabyte drive mounted to it.
Same for the secondary and remote systems.
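As a rough illustration of that layout, not the actual Indie Archive setup, the two data
drives on the primary system could be mounted through /etc/fstab entries like these
(the UUIDs and mount point names are placeholders):

  # /etc/fstab on the primary system (UUIDs and paths are placeholders)
  UUID=<uuid-of-3TB-data-drive>      /primaryfiles      ext4  defaults,noatime  0  2
  UUID=<uuid-of-4TB-snapshot-drive>  /primarysnapshots  ext4  defaults,noatime  0  2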
Now, so far I only had to buy one drive, but generally speaking, the six data drives
will be the major expense in assembling the systems.
So a good bargain on six four terabyte drives could be $120 used or $270 new and this
is the most expensive part.
I install used HGST drives all the time and rarely have problems with them.
I have worked for clients who won't buy used only new.
Since the file integrity checks should give early warning of drive failure and since there
is a seven drive redundancy on the data files, if I were buying drives for the indie archive,
I'd go with six used four terabyte HGST drives for $120.
There is no reason not to use drives all the same size as long as the snapshot drives are
large enough.
The size of data drives you need depends on the size of your projects and the time it takes
to do a project.
Look at your hard drives on your working systems.
Think about what directories you would like to see in archival storage.
What is the total size of these directories?
Check how many gigabytes these projects have consumed in the last year.
Think forward a few years.
Assume you will use more disk space in the future than you are now.
Do some quick arithmetic and make a decision.
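As a quick illustration of that arithmetic (the directory names are just examples),
du and df will give you the numbers:

  # Size of each project directory you want to archive, plus a grand total
  du -csh ~/music ~/video ~/writing
  # How much space is already used and free on a working drive
  df -h /home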
Like I said, I only had to buy one drive so far because I'm weird and I had a bunch of
three terabyte drives available.
If I had to buy drives, I probably would have tried to start larger.
I am sure that at some point in the not too distant future when I am running the indie
archive and not developing it, I will have to upgrade my drives.
The primary system is the console for the indie archive.
When you copy a project onto the indie archive, the directory goes into the primary files
directory.
From there, it is propagated out to the primary snapshot directory, the secondary system,
the cloud storage if you are using it and eventually to the remote system.
All of the data propagation is done with rsync, using the archive setting that is designed
to preserve file metadata like owner, permissions, and date last modified.
So I have been using rsync with the archive setting to move the files from the work system
to a USB drive and from the USB drive to the primary files folder.
At first I thought I would use an optical disk to move the files, but optical disks do
not preserve file metadata.
Also I had some weird results with the USB flash drive because it was formatted FAT32.
FAT32 does not support Linux metadata, so if you are going to move projects over on a
flash drive or a USB external drive, be sure to format it to ext4.
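As a sketch of that transfer, assuming the flash drive shows up as /dev/sdX1 and mounts
at /media/usb, and assuming /primaryfiles as the primary files directory (all of these
names are placeholders):

  # One-time: format the flash drive as ext4 (this erases it)
  sudo mkfs.ext4 /dev/sdX1
  # On the work system, copy the project with the archive setting so owner,
  # permissions, and modification times ride along
  rsync -a ~/projects/my-album/ /media/usb/my-album/
  # On the primary system, copy it into the primary files directory as root
  sudo rsync -a /media/usb/my-album/ /primaryfiles/my-album/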
Another way to move projects over to the primary files directory is with tar compression.
This preserves metadata when the files are extracted, so this might be easier and it works
with optical drives.
If your directory will fit on an optical drive, this also gives you another backup on another
media.
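Here is a sketch of the tar route (the file and directory names are examples); extracting
as root with -p keeps permissions and ownership:

  # Pack the project directory into a compressed archive
  tar -czf my-album.tar.gz my-album/
  # Burn or copy my-album.tar.gz to whatever media you like, then on the
  # primary system extract it as root, preserving the metadata
  sudo tar -xpzf my-album.tar.gz -C /primaryfiles/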
If you have any suggestions on how to transfer projects while preserving the file metadata,
let me know.
I know that there are network options available, but I am hesitant to recommend them because
if I can transfer files from a system to the primary system over the LAN, then anyone
can do the same.
Or delete files, or accidentally delete directories.
I kind of want to keep tight control over access to the primary system.
It kind of ruins the archival quality of the indie archive if anyone on the LAN can
accidentally mess with it.
So I am open to dialogue on these issues.
I am kind of worried about this: I want it to be easy to add projects to the indie archive, but not
too easy, if you know what I mean.
I feel like having to sit down at the primary system and enter a password should be the minimum
amount of security required to access the primary system.
The primary system also runs file integrity checks daily from a cron job.
All of the propagation and file integrity scripts have to be run as root to preserve the
metadata since only root can write a file that it doesn't own.
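For example, a daily entry in root's crontab could look like this (the script name and
time are placeholders, not the actual Indie Archive scripts):

  # In root's crontab (sudo crontab -e): run the integrity check at 3:30 every morning
  30 3 * * * /usr/local/bin/indie-archive-check.sh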
The secondary system is the SSH server for the indie archive.
The primary system logs into the secondary system as root using SSH.
Security is managed with public and private keys, so entering a password is not required.
After the keys are set up for both the primary and remote systems, password authentication
is disabled for the SSH server, so only those two systems can SSH into the secondary system.
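A sketch of that key setup, with "secondary" standing in for whatever hostname or address
the secondary system actually has:

  # On the primary (and later the remote) system, generate a key pair as root
  sudo ssh-keygen -t ed25519
  # Copy the public key over while password logins still work
  sudo ssh-copy-id root@secondary
  # Then, on the secondary system, in /etc/ssh/sshd_config:
  #   PasswordAuthentication no
  #   PermitRootLogin prohibit-password
  # and restart the SSH service (the service name varies by distribution)
  sudo systemctl restart ssh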
When the propagation script is run on the primary system, rsnapshot is used to create
a current version of the primary files directory in the primary snapshots directory.
Then the primary system uses rsync over SSH to make a copy of the primary files directory
to the secondary files directory.
Then the primary system logs onto the secondary system as root, and rsnapshot is used
to create a current version of the secondary files directory in the secondary snapshots directory.
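A stripped-down sketch of that propagation sequence; the config file names, directory
names, hostname, and the alpha retain level are assumptions, not the actual Indie Archive
script:

  #!/bin/sh
  # Run as root on the primary system
  # 1. Snapshot the primary files directory into the primary snapshots directory
  rsnapshot -c /etc/rsnapshot-primary.conf alpha
  # 2. Mirror the primary files directory to the secondary system over SSH
  rsync -a --delete /primaryfiles/ root@secondary:/secondaryfiles/
  # 3. Snapshot the secondary files directory on the secondary system
  ssh root@secondary rsnapshot -c /etc/rsnapshot-secondary.conf alpha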
Finally, if cloud storage is being used, the primary system uses gcloud rsync to make
a copy of the primary files directory to a Google Cloud Storage archive bucket.
I have this bucket set to 90 days soft delete.
If you are using another type of cloud storage on Google, AWS, Mega, or other storage providers,
this command will have to be adjusted.
The reason I chose the gcloud archive bucket is because of the storage cost per gigabyte.
They have the cheapest cost per gigabyte that I found.
This will keep the monthly bill low.
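A sketch of that cloud step using the gcloud CLI; the bucket name is a placeholder, and
the exact commands in the Indie Archive script may differ:

  # One-time: create an Archive-class bucket (soft delete retention is
  # configured on the bucket separately)
  gcloud storage buckets create gs://my-indie-archive --default-storage-class=ARCHIVE
  # Daily: mirror the primary files directory into the bucket
  gcloud storage rsync --recursive /primaryfiles gs://my-indie-archive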
Once a day the primary system runs the file integrity check from a cron job, using rsync
to compare the primary files directory to the current version, alpha.0, in the primary snapshots directory,
logging any discrepancies.
It then does the same comparing primary files to secondary files and to the current version
in the secondary snapshots directory,
logging the results and notifying the maintainer of any discrepancies.
Notification is done by email using curl and an SMTP provider.
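Here is a rough sketch of one of those checks and the notification; the paths, SMTP host,
addresses, and credentials are placeholders, and the real scripts will differ:

  #!/bin/sh
  # Compare primary files to the newest snapshot without changing anything:
  # -a compares metadata, -c forces checksums, -n is a dry run, -i itemizes differences
  rsync -acni /primaryfiles/ /primarysnapshots/alpha.0/primaryfiles/ \
      > /var/log/indie-archive-check.log
  # If anything was reported, mail the log through an SMTP provider with curl
  if [ -s /var/log/indie-archive-check.log ]; then
    curl --ssl-reqd --url smtp://smtp.example.com:587 \
      --user 'archive@example.com:app-password' \
      --mail-from archive@example.com --mail-rcpt maintainer@example.com \
      --upload-file /var/log/indie-archive-check.log
  fi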
The remote system runs on its own schedule,
logging into the secondary system daily to copy data from secondary files to remote files
and then using rsnapshot to make a copy of remote files to the remote snapshots directory.
Since it's run on a daily schedule, it uses rsnapshot with the standard daily, weekly, monthly, and yearly backups.
The remote system also runs a daily file integrity check comparing remote files to the current version
on remote snapshots and comparing remote files to both data directories on the secondary system
again logging the results and notifying the maintainer of any discrepancies.
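On the remote system, that schedule could look something like this in root's crontab; the
times, paths, hostname, and config file name are assumptions, and the rsnapshot config
needs matching daily, weekly, monthly, and yearly retain lines:

  # Pull the latest data from the secondary system every night
  15 2 * * *  rsync -a --delete root@secondary:/secondaryfiles/ /remotefiles/
  # Rotate snapshots of the remote files directory with rsnapshot
  45 2 * * *  rsnapshot -c /etc/rsnapshot-remote.conf daily
  30 3 * * 1  rsnapshot -c /etc/rsnapshot-remote.conf weekly
  30 4 1 * *  rsnapshot -c /etc/rsnapshot-remote.conf monthly
  30 5 1 1 *  rsnapshot -c /etc/rsnapshot-remote.conf yearly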
If there is an outward facing static IP at the location with the primary and secondary
systems, then the remote system can use that static IP to SSH into the secondary system.
If there is not a static IP, then the remote system uses a Duck DNS subdomain to log
onto the secondary system.
Any system using the same router as the secondary system can run a cron job to update Duck DNS
with the current IP address.
Since a static IP is a monthly expense, it's important that there's an alternative that
does not require paying another bill.
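A sketch of that Duck DNS arrangement; the subdomain and token are placeholders:

  # In a crontab on any machine behind the same router as the secondary system,
  # refresh the subdomain's IP address every five minutes
  */5 * * * * curl -s "https://www.duckdns.org/update?domains=mysubdomain&token=MY-TOKEN&ip=" >/dev/null
  # The remote system can then reach the secondary system with something like:
  #   ssh root@mysubdomain.duckdns.org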
So the secondary system has the SSH server, but it doesn't really do much.
Both of the other systems connect to it and use it as a junction for data propagation
and file integrity checks.
So as you can tell, there's a lot going on to make the indie archive work.
Future podcasts will get down into the details and discuss some of the choices I had to
make and why I made them.
The funny thing about this project is that the actual code was the least amount of work.
Figuring out exactly how rsync and rsnapshot work together was quite a bit of work.
The configuration for both rsnapshot and SSH took a bit of head scratching.
Then there were a few user ID tricks I had to work through to make the indie archive usable.
But by far, the most work was writing the indie archive installation document detailing
each step of installing the software on three systems.
It's been fun so far.
If you have input, I always appreciate the help.
I get quite a bit of help on Mastodon.
If you go to home.gamerplus.org, you'll find the script for this podcast with the Mastodon
comment thread embedded in the post.
This podcast is being read from a document that is a work in progress.
Current versions of the What Is The Indie Archive document will be posted at Codeberg when
I'm ready to upload the project.
Thanks for listening.
You have been listening to Hacker Public Radio at HackerPublicRadio.org.
Today's show was contributed by an HPR listener like yourself.
If you ever thought of recording a podcast, then click on our contribute link to find out
how easy it really is.
Hosting for HPR has been kindly provided by AnHonestHost.com, the Internet Archive and
rsync.net.
Unless otherwise stated, today's show is released under a Creative Commons Attribution 4.0 International
license.