Episode: 4341
Title: HPR4341: Transferring Large Data Sets
Source: https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr4341/hpr4341.mp3
Transcribed: 2025-10-25 23:20:05
---
This is Hacker Public Radio Episode 4341 for Monday 24 March 2025.
Today's show is entitled Transferring Large Data Sets.
It is part of the series Programming 101.
It is hosted by hairylarry and is about 11 minutes long.
It carries a clean flag.
The summary is: how to transfer large data sets using tar and Blu-ray disks while preserving
metadata.
Transferring Large Data Sets
Very large data sets present their own problems.
Not everyone has directories with hundreds of gigabytes of project files, but I do, and
I assume I'm not the only one.
For instance, I have a directory with over 700 radio shows.
Many of these directories also have a podcast, and they also have pictures and text files.
Doing a properties check on the directory, I see 450 gigabytes of data.
When I started envisioning Libre Indie Archive, I wanted to move directories into archival
storage using optical drives.
My first attempt at this didn't work because I lost metadata when I wrote the optical
disks, since optical disks are read-only.
After further work and study, I learned that tar files can preserve metadata if they
are created and extracted as root.
In fact, if you're running tar as root, preserving file ownership and permissions is the default.
So this means that optical disks are an option if you write tar archives onto the optical
disks.
I have better success rates with 25 gigabyte Blu-ray disks than with the 50 gigabyte disks.
So if your directory breaks up into projects that fit on 25 gigabyte disks, that's great.
My data did not do this easily, but tar does have an option to write a dataset to multiple
tar files, each with a maximum size, labeling them -0, -1, etc.
When using this multi-volume feature, you cannot use compression, so you will get .tar files,
not .tar.gz files.
It's better to break the file sets up in more reasonable sizes, so I decided to divide
the shows up alphabetically by title, so all the shows starting with the letter A would
be one dataset and then down the alphabet one letter at a time.
Most of the letters would result in a single tar file, labeled -0, that would fit on a
25 gigabyte disk.
Many letters, however, took two or even three tar files that would have to be written on
different disks and then concatenated on the primary system before they are extracted
to the correct location in primary files.
There is a companion program to tar called tarcat that I used to combine two or three
tar files split by length into a single tar file that could be extracted.
I ran Engrampa as root to extract the files.
So I used a tar command on the working system where my Something Blue radio shows are
stored.
Then I used K3b to burn these files onto a 25 gigabyte Blu-ray disk, carefully labeling
the disk and writing a text file that I used to keep up with which files I had already
copied to disk.
Then on the Libre Indie Archive primary system, I copied from the Blu-ray to the boot drive
the file or files for that dataset.
Then I would use tarcat to combine the files if there was more than one file for that
dataset.
And finally, I would extract the files to primary files by running Engrampa as root.
Now I'm going into details on each of these steps.
First make sure that the Libre Indie Archive program prep.sh is in your home directory
on your workstation.
Then from the data directory to be archived, in my case the something_blue directory,
run prep.sh like this: ~/prep.sh
This will create a file named IA_Origin.txt that lists the date, the computer and directory
being archived, and the users and user IDs on that system.
All very helpful information to have if, sometime in the future, you need to do
a restore.
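The real prep.sh is not read out in this episode. Purely as an illustration, a script that records the same kinds of information (date, computer, directory, users and user IDs) could look something like the sketch below. The output file name origin-sketch.txt and the choice to list every account in /etc/passwd are my assumptions, not details of the Libre Indie Archive program.

```shell
# Hypothetical sketch, NOT the real prep.sh: record when and where an
# archive was taken, plus the user names and IDs known to this system.
out="${TMPDIR:-/tmp}/origin-sketch.txt"   # assumed name, chosen for this example
{
  echo "Date: $(date '+%Y-%m-%d %H:%M:%S')"
  echo "Computer: $(uname -n)"
  echo "Directory: $(pwd)"
  echo "Users (name:uid):"
  # Assumption: list every account in /etc/passwd as name:uid
  awk -F: '{ print $1 ":" $3 }' /etc/passwd
} > "$out"
cat "$out"
```

Keeping the user IDs next to the names matters because tar stores ownership, and a restore onto a different machine may map the same name to a different ID.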
Next create a TAR dataset for each letter of the alphabet.
You may want to divide your dataset in a different way.
Open a terminal in the directory that contains the data directory, my something_blue directory,
so that ls displays something_blue, your data directory.
I keep the Something Blue shows and podcasts in subdirectories in the something_blue directory.
Here's the tar command.
sudo tar -cv --tape-length=20000000 --file=somethingblue-a-{0..50}.tar /home/larry/delta/something_blue/a*
This is for the letter a, so the file parameter includes the letter a.
The numbers 0..50 in the curly brackets are the sequence numbers for the files.
I only had one file for the letter a, somethingblue-a-0.tar.
The last parameter is the source for the tar files,
in this case /home/larry/delta/something_blue/a*:
all of the files and directories in the something_blue directory that start with the letter a.
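The same multi-volume technique can be tried at small scale before committing hours to a real dataset. The sketch below assumes GNU tar and uses a scratch directory, a made-up 250 KiB file, and a tiny volume size, so it runs as an ordinary user without sudo. The demo-a-N.tar names are placeholders of mine.

```shell
# Small-scale demo of multi-volume tar in a scratch directory.
demo="$(mktemp -d)"
mkdir "$demo/shows"
# One 250 KiB file stands in for the shows that start with one letter.
head -c 256000 /dev/urandom > "$demo/shows/a-show.bin"

# --tape-length counts units of 1024 bytes, so 100 means 100 KiB per volume.
# Each --file names the next volume in order. In bash,
# --file=demo-a-{0..4}.tar expands to the same five options.
tar -c --tape-length=100 \
    --file="$demo/demo-a-0.tar" --file="$demo/demo-a-1.tar" \
    --file="$demo/demo-a-2.tar" --file="$demo/demo-a-3.tar" \
    --file="$demo/demo-a-4.tar" \
    -C "$demo" shows < /dev/null

# Only the volumes that were actually needed are created.
ls -l "$demo"/demo-a-*.tar
```

Listing more volume names than needed is harmless, which is why a generous range such as 0..50 works: tar only opens the next name when the current volume is full.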
You may want to change the --tape-length parameter.
As listed, it stores up to 19.1 gigabytes per tar file.
The maximum capacity of a 25 gigabyte Blu-ray is 23.3 gigabytes for data storage.
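The two figures line up if they are read as binary gigabytes (GiB), since GNU tar counts --tape-length in units of 1024 bytes by default. Here is the arithmetic as a short sketch. The 25 * 10^9 byte figure for the disk is my assumption for a nominal 25 GB disk, and the variable names are only for illustration.

```shell
# --tape-length is counted in units of 1024 bytes by default.
tape_length=20000000
volume_bytes=$((tape_length * 1024))
echo "$volume_bytes bytes per volume"

# Convert to GiB (1024^3 bytes) with awk, since shell arithmetic is integer only.
volume_gib=$(awk -v b="$volume_bytes" 'BEGIN { printf "%.1f", b / 1024^3 }')
# A nominal 25 GB disk, assumed to be 25 * 10^9 bytes, in the same unit.
disk_gib=$(awk 'BEGIN { printf "%.1f", 25e9 / 1024^3 }')

echo "$volume_gib GiB per volume, $disk_gib GiB per disk"
```

That leaves roughly 4 GiB of headroom per disk, so there is room to raise --tape-length if you want fuller disks.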
Example B. For the letter b, I ended up with three tar files:
somethingblue-b-0.tar, somethingblue-b-1.tar, and
somethingblue-b-2.tar.
I will use these files in the example below using tarcat to combine the files.
I use K3b to burn Blu-ray data disks.
Besides installing K3b, you have to install some other programs.
Then there is a particular setup that needs to be done, including selecting cdrecord and no multisession.
Here's an excellent article that will go step by step through the installation and setup.
How to burn Blu-ray disks on Ubuntu and derivatives using K3b, and the link.
I also always check verify data, and I use the Linux/Unix file system, not Windows,
which will rename your files if the file names are too long.
I installed a Blu-ray reader into the primary system, and I used Thunar to copy the files from the Blu-ray disk to the boot drive.
In the primary files directory, I made a subdirectory, something_blue, to hold the archived shows.
If there is only one file, like in example A above, you can skip the concatenation step.
If there is more than one file, like example B above, you use tarcat to concatenate these files into one tar file.
You have to do this.
If you try to extract from just one of the numbered files, when there is more than one, you will get an error.
So if I try to extract from somethingblue-b-0.tar and I get an error, it doesn't mean that there's anything wrong with that file.
It just has to be concatenated with the other b files before it can be extracted.
There is a companion program to tar, called tarcat, that should be used to concatenate the tar files.
Here's the command I used for example B above.
tarcat somethingblue-b-0.tar somethingblue-b-1.tar somethingblue-b-2.tar > sb-b.tar
This will concatenate the three smaller tar files into one bigger tar file named sb-b.tar.
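To see the concatenation step work without touching real data, here is a small sketch that builds a throwaway multi-volume set and then combines it. It assumes GNU tar. Because tarcat is packaged separately on some systems, the sketch checks for it first. The demo-b-N.tar and combined.tar names are placeholders of mine.

```shell
# Build a tiny multi-volume set in a scratch directory (stand-in data).
cat_demo="$(mktemp -d)"
mkdir "$cat_demo/shows"
head -c 256000 /dev/urandom > "$cat_demo/shows/b-show.bin"
tar -c --tape-length=100 \
    --file="$cat_demo/demo-b-0.tar" --file="$cat_demo/demo-b-1.tar" \
    --file="$cat_demo/demo-b-2.tar" --file="$cat_demo/demo-b-3.tar" \
    -C "$cat_demo" shows < /dev/null

# Combine the volumes, in order, only if tarcat is installed.
if command -v tarcat > /dev/null 2>&1; then
  tarcat "$cat_demo/demo-b-0.tar" "$cat_demo/demo-b-1.tar" \
         "$cat_demo/demo-b-2.tar" > "$cat_demo/combined.tar"
  # Listing the combined archive is a cheap check before extracting.
  tar -tvf "$cat_demo/combined.tar"
else
  echo "tarcat not found; install it before combining volumes"
fi
```

The order of the arguments matters: the volumes have to be given to tarcat in the sequence they were written.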
In order to preserve the metadata, you have to extract the files as root.
To make it easier to select the files to be extracted and where to store them,
I used the GUI archive manager Engrampa.
To run Engrampa as root, open a terminal with Ctrl+Alt+T and use this command.
sudo -H engrampa
Click open and select the TAR file to extract.
Then follow the path until you are in the something_blue directory,
and you are seeing the folders and files you want to extract.
Type Ctrl+A to select them all.
Instead of the something_blue directory, you will go to your data directory.
Then click Extract at the top of the window.
Open the directory where you want the files to go.
In my case, primary files/something_blue.
Then click extract again in the lower right.
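The episode does this step in the Engrampa GUI. As an alternative that is not described in the episode, GNU tar can do something similar from the command line: --strip-components drops the leading directory names stored in the archive, and -C picks the destination. On a real system it would be run with sudo so ownership is restored. The sketch below uses scratch directories and made-up paths so it runs as an ordinary user.

```shell
x_demo="$(mktemp -d)"
# Stand-in for the archived tree. GNU tar stores /home/... without the
# leading slash, so the members look like home/user/delta/something_blue/...
mkdir -p "$x_demo/src/home/user/delta/something_blue/a-show"
echo "notes" > "$x_demo/src/home/user/delta/something_blue/a-show/notes.txt"
tar -cf "$x_demo/demo.tar" -C "$x_demo/src" home

# Stand-in for the destination directory in primary files.
mkdir -p "$x_demo/primary_files/something_blue"
# Drop the four leading components (home/user/delta/something_blue) so the
# shows land directly in the destination; -p keeps the stored permissions.
tar -xpf "$x_demo/demo.tar" --strip-components=4 \
    -C "$x_demo/primary_files/something_blue"

ls -R "$x_demo/primary_files"
```

The number passed to --strip-components has to match how deep the data directory sits in the archive, which is the same path you click through in the GUI.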
After the files are extracted, go to your data directory in primary files,
and check that the directories and files are where you expect them to be.
You can also open a terminal in that directory and type ls -l to review the metadata.
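For a large tree, reading the listing by eye gets tedious. One possible way to automate the check, offered as a sketch and not as part of the episode's workflow, is to record the mode, owner, and group of every entry on both sides and compare the two lists. This assumes GNU find for -printf. The directories here are throwaway stand-ins.

```shell
m_demo="$(mktemp -d)"
mkdir -p "$m_demo/src/show" "$m_demo/dst"
echo "audio" > "$m_demo/src/show/track.txt"
chmod 640 "$m_demo/src/show/track.txt"
chmod 750 "$m_demo/src/show"

# Round trip through tar; -p restores the stored permission bits.
tar -cf "$m_demo/demo.tar" -C "$m_demo/src" .
tar -xpf "$m_demo/demo.tar" -C "$m_demo/dst"

# Record mode, numeric owner, numeric group, and path for every entry.
(cd "$m_demo/src" && find . -printf '%M %U %G %p\n' | sort) > "$m_demo/src.list"
(cd "$m_demo/dst" && find . -printf '%M %U %G %p\n' | sort) > "$m_demo/dst.list"

# diff prints nothing and returns success when the lists are identical.
if diff "$m_demo/src.list" "$m_demo/dst.list"; then
  echo "metadata matches"
fi
```

On the real systems the source list would be made on the workstation and the destination list on the primary system, then the two files compared.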
When dealing with data in chunk sizes of 20 gigabytes or more, each of these steps takes time.
The reason I like using an optical backup to transfer the files from the working system
to the Libre Indie Archive is because it gives me an easy-to-store backup that is not on a
spinning drive and that cannot be overwritten. Still, optical storage is not perfect either.
It's just another belt to go with your suspenders.
Another way to transfer directories into the primary files directory is with SSH over the network.
This is not as safe as using optical disks, and it also does not provide the extra snapshot backup.
It also takes a long time, but it is not as labor intensive.
After I spend some more time thinking about this and testing,
I will do a podcast about transferring large data sets with SSH.
Although I am transferring large data sets to move them into archival storage using Libre Indie
Archive, there are many other situations where you might want to move a large data set
while preserving the metadata. So what I have written about tar files, optical disks,
and running Thunar and then Engrampa as root is generally applicable.
As always, comments are appreciated. You can comment on Hacker Public Radio or on Mastodon.
Visit my blog at home.gamerplus.org where I will post the show notes and embed the Mastodon thread
for comments about this podcast. Thanks.
You have been listening to Hacker Public Radio at hackerpublicradio.org.
Today's show was contributed by an HPR listener like yourself.
If you ever thought of recording a podcast,
you can click on our contribute link to find out how easy it really is.
Hosting for HPR has been kindly provided by
AnHonestHost.com, the Internet Archive, and rsync.net.
Unless otherwise stated, today's show is released under a Creative Commons
Attribution 4.0 International License.