Hacker Public Radio / HPR4312: What Is The Indie Archive?

Description

This show has been flagged as Clean by the host.

What Is The Indie Archive? I'm Hairy Larry and you're listening to the Plain Text Programs podcast.

The Indie Archive is an archival solution for indie producers. Since most indie producers run on a shoestring budget, it's important that the Indie Archive is inexpensive to install and run. It's especially important that monthly expenses are minimal, because an expense that's reasonable in most months can be more than an indie producer can afford in a lean month. The first major constraint is cost. So I'll be talking about prices a lot in this podcast and get more technical in future podcasts about The Indie Archive.

Indie Archive is an archival system, which is different from a backup system. If you don't have a backup system, do that first. My backup system uses the same tools as Indie Archive, rsync and rsnapshot. My brother uses the online backup service Carbonite. There are many other options. A good backup system runs automatically to back up everything frequently and preserve version history. It's also good to have backups offsite.

An archival system, like Indie Archive, keeps multiple redundant copies across several hard drives on several systems in multiple locations. An archival system also checks file integrity as protection against file corruption or user error. When you have a project you really never want to lose, like a finished novel, a music album, a video, or any other major effort that involves significant work, that's when you need an archival system.

So The Indie Archive does not automatically back up your projects every day. That's what your backup system should do. The Indie Archive is an archival system where the producer of the content decides what needs to be archived and when it needs to be archived, and then manually moves a directory containing the files onto the Indie Archive, carefully preserving the files' metadata during the transfer. Then these files are propagated across at least seven hard drives on four different systems in three locations. File integrity checks are run daily, comparing the files and reporting discrepancies.

Two of the systems are kept in the studio where the content is produced. I call them the primary and secondary systems. They have a boot drive and two data drives each. One of the systems is kept offsite at a nearby location. I call it the remote system. It also has a boot drive and two data drives. If you have a more distant location where you can put a second remote system, you can have remotenear and remotefar systems. Otherwise ...

The final system is somewhere in the cloud, provided by a professional data storage provider. It has a single copy of the data and usually some additional data retention. The provider makes the backups of this data. This is the part that might involve a monthly bill. So, depending on the size of your file set, it could be free or it could carry a small monthly cost. There are a lot of options for cloud storage providers.

But first I'm going to discuss the three systems, primary, secondary, and remote, and how they function. As far as the hardware goes, the systems are the same.

Now, I'm a Linux guy and I do all my production work on Linux, so I'm using Linux. I want to test the system on several versions of Linux and with BSD. I'm not a Mac guy or a Windows guy, so I won't be going there. The software is open source and the required programs run on all three platforms, so I'll let a Mac or Windows programmer test The Indie Archive for their systems.
My guess is that the Mac fork will be easier than the Windows fork because of the file metadata. It might even be possible to add Mac folders to The Indie Archive running Linux, but I'll let someone who actually has a Mac figure that out. I don't think the same is true for Windows. Windows file metadata is different, so if you want to preserve the metadata you will probably have to install The Indie Archive on Windows systems.

So, I'm developing and deploying on Linux and I will also test on BSD. So far I have tested Debian, Ubuntu, FreeBSD, MidnightBSD, and Xubuntu, and The Indie Archive works fine on all of these operating systems.

So, back to the hardware. Pretty much any older system that will support at least three SATA drives will work. I'm using older business class desktops, Dell and HP. I pulled mine out of storage, but they are very inexpensive to buy if you're not like me with a shed full of old computer stuff. I just bought a Small Form Factor HP desktop on eBay for $30 including tax and shipping.

To clarify, it's best if the primary system supports four SATA drives. The secondary and remote systems do not need an optical drive, so they should support three SATA drives, but they can be run on two SATA drives if you boot from the files drive. I am currently testing a remote system with two SATA drives running MidnightBSD.

The Dell desktops made a big deal about being green. I am open to suggestions on what would be the best energy efficient systems for The Indie Archive, because of both the cost of electricity and the impact on the environment.

There are three drives on each system, a boot drive and two data drives. The boot drives can be SSD or spinning hard drives and need to be big enough to hold the OS comfortably. The data drives need to be large enough to hold the files you want to archive, and they should be high quality spinning drives. I use the multi-terabyte HGST drives and I am also looking at some Dell drives made by HGST.

There will be a data drive and a snapshot drive on each system. If they are not the same size, the snapshot drives should be larger. I am testing with 3 terabyte data drives and 4 terabyte snapshot drives. Besides the main data set that is being archived, the snapshot drives also hold the version history of files that have been deleted or changed. So, that's why they should be the larger drive.

So my primary system has a primaryfiles directory with a 3 terabyte drive mounted to it and a primarysnapshots directory with a 4 terabyte drive mounted to it. Same for the secondary and remote systems. I'll sketch what those mounts can look like below.

Now, so far I have only had to buy one drive, but generally speaking the six data drives will be the major expense in assembling the systems. So a good bargain on six 4 terabyte drives could be $120 used or $270 new. And this is the most expensive part. I install used HGST drives all the time and rarely have problems with them. I have worked for clients who won't buy used, only new. Since the file integrity checks should give early warning on a drive failure, and since there is a seven drive redundancy on the data files, if I were buying drives for The Indie Archive I'd go with six used 4 terabyte HGST drives for $120. There is no reason not to use drives all the same size, as long as the snapshot drives are large enough.
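To make the drive layout concrete, here's a minimal sketch of the mounts on the primary system, assuming top-level mount points named after the directories I use. The UUIDs are placeholders; yours come from blkid.

    # /etc/fstab on the primary system (sketch; UUIDs are placeholders)
    # 3 terabyte data drive holds the archived files
    UUID=1111aaaa-0000-0000-0000-000000000001  /primaryfiles      ext4  defaults  0  2
    # 4 terabyte snapshot drive holds rsnapshot's version history
    UUID=2222bbbb-0000-0000-0000-000000000002  /primarysnapshots  ext4  defaults  0  2

The secondary and remote systems look the same, with secondaryfiles/secondarysnapshots and remotefiles/remotesnapshots in place of the primary directories.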
The size of data drives you need depends on the size of your projects and the time it takes to do a project. Look at your hard drives on your working systems. Think about what directories you would like to see in archival storage. What is the total size of these directories? Check how many gigabytes these projects have consumed in the last year. Think forward a few years. Assume you will use more disk space in the future than you are using now. Do some quick arithmetic and make a decision.

Like I said, I only had to buy one drive so far, because I'm weird and I had a bunch of 3 terabyte drives available. If I had to buy drives I probably would have tried to start larger. I am sure that at some point in the not too distant future, when I am running The Indie Archive and not developing it, I will have to upgrade my drives.

The primary system is the console for The Indie Archive. When you copy a project onto The Indie Archive, the directory goes into the primaryfiles directory. From there it is propagated out to the primarysnapshots directory, the secondary system, the cloud storage (if you are using it), and eventually to the remote systems.

All of the data propagation is done with rsync using the archive setting, which is designed to preserve file metadata like owner, permissions, and date last modified. So I have been using rsync with the archive setting to move the files from the work system to a USB drive and from the USB drive to the primaryfiles folder. At first I thought I would use an optical disc to move the files, but optical discs do not preserve file metadata. Also, I had some weird results with a USB flash drive because it was formatted FAT32. FAT32 does not support Linux metadata, so if you're going to move projects over on a flash drive or a USB external drive, be sure to format it ext4.

Another way to move projects over to the primaryfiles directory is with tar compression. This preserves metadata when the files are extracted, so this might be easier, and it works with optical drives. If your directory will fit on an optical disc, this also gives you another backup on another medium. Both approaches are sketched below. If you have any suggestions on how to transfer projects while preserving the file metadata, let me know.

I know that there are network options available, but I am hesitant to recommend them, because if I can transfer files from a system to the primary system over the LAN, then anyone can do the same. Or delete files. Or accidentally delete directories. I kind of want to keep tight control over access to the primary system. It kind of ruins the archival quality of The Indie Archive if anyone on the LAN can accidentally mess with it. So, I am open to dialogue on these issues. I'm kind of in a place where I want it to be easy to add projects to The Indie Archive, but not too easy, if you know what I mean. I feel like having to sit down at the primary system and enter a password should be the minimum amount of security required to access the primary system.

The primary system also runs file integrity checks daily from a cron job. All of the propagation and file integrity scripts have to be run as root to preserve the metadata, since only root can create files owned by another user.

The secondary system is the ssh server for The Indie Archive. The primary system logs onto the secondary system as root using ssh. Security is managed with public and private keys, so entering a password is not required. After the keys are set up for both the primary and remote systems, password authentication is disabled for the ssh server, so only those two systems can ssh into the secondary system.
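Here's a sketch of the two metadata-preserving transfers I just described. The project name and paths are placeholders; the point is the archive flag (-a) for rsync and -p for tar extraction.

    # Copy a project to an ext4-formatted USB drive; -a preserves metadata
    rsync -a /home/larry/projects/my-album/ /mnt/usb/my-album/
    # Then from the USB drive into primaryfiles; run as root so ownership survives
    sudo rsync -a /mnt/usb/my-album/ /primaryfiles/my-album/

    # The tar alternative, which also works via optical media:
    tar -czf my-album.tar.gz -C /home/larry/projects my-album
    # Extract as root with -p to restore permissions; root also restores owners
    sudo tar -xzpf my-album.tar.gz -C /primaryfiles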
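And since key-only ssh is the backbone of the whole thing, here's a minimal sketch of that setup, assuming the secondary is reachable by the hostname "secondary" (a placeholder):

    # On the primary (and again on the remote), as root: create a key pair
    ssh-keygen -t ed25519
    # Install the public key on the secondary while passwords still work
    ssh-copy-id root@secondary

    # Afterwards, in /etc/ssh/sshd_config on the secondary:
    #   PermitRootLogin prohibit-password
    #   PasswordAuthentication no
    # then restart the ssh service (named sshd on some systems, ssh on Debian)
    systemctl restart ssh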
When the propagation script is run on the primary system, rsnapshot is used to create a current version of the primaryfiles directory in the primarysnapshots directory. Then the primary system uses rsync over ssh to make a copy of the primaryfiles directory to the secondaryfiles directory. Then the primary system logs onto the secondary system as root, and rsnapshot is used to create a current version of the secondaryfiles directory in the secondarysnapshots directory. Finally, if cloud storage is being used, the primary system uses gcloud rsync to make a copy of the primaryfiles directory to a Google Cloud Storage archive bucket. The whole sequence is sketched below.

I have this bucket set to 90 days soft delete. If you are using another type of cloud storage on Google, AWS, Mega, or other storage providers, this command will have to be adjusted. The reason I chose the gcloud archive bucket is the storage cost per gigabyte. They have the cheapest cost per gigabyte that I found. This will keep the monthly bill low.

Once a day the primary system runs the file integrity check from a cron job, using rsync to compare the primaryfiles directory to the current version, alpha.0, in the primarysnapshots directory, logging any discrepancies. It then does the same comparing primaryfiles to secondaryfiles and to the current version in the secondarysnapshots directory, logging discrepancies and notifying the maintainer of any discrepancies. Notification is done by email using curl and an SMTP provider. There's a sketch of this check below as well.

The remote system runs on its own schedule, logging into the secondary system daily to copy data from secondaryfiles to remotefiles and then using rsnapshot to make a copy of remotefiles to the remotesnapshots directory. Since it's run on a daily schedule, it uses rsnapshot with the standard daily, weekly, monthly, and yearly backups. The remote system also runs a daily file integrity check, comparing remotefiles to the current version on remotesnapshots and comparing remotefiles to both data directories on the secondary system, again logging the results and notifying the maintainer of any discrepancies.

If there is an outward facing static IP at the location with the primary and secondary systems, then the remote system can use that static IP to ssh into the secondary system. If there is not a static IP, then the remote system uses a DuckDNS subdomain to log onto the secondary system. Any system using the same router as the secondary system can run a cron job to update DuckDNS with the current IP address (a one-line sketch of that is below, too). Since a static IP is a monthly expense, it's important that there's an alternative that does not require paying another bill.

So the secondary system has the ssh server, but it doesn't really do much. Both of the other systems connect to it and use it as the junction for data propagation and file integrity checks.
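To make the propagation concrete, here's a rough sketch of the kind of script I'm describing. This is not the actual Indie Archive script: the hostnames, config paths, and bucket name are placeholders, and it assumes each machine's rsnapshot config has an "alpha" retain level with its snapshot_root on the snapshot drive. Depending on your gcloud version, the last step might be gsutil rsync -r instead.

    #!/bin/sh
    # propagate.sh -- run as root on the primary (sketch, not the real script)

    # 1. Snapshot primaryfiles into primarysnapshots (alpha.0 becomes current)
    rsnapshot -c /etc/rsnapshot-primary.conf alpha

    # 2. Mirror primaryfiles to the secondary over ssh
    #    (--delete is an assumption here: deletions propagate to the files
    #    directories, while the snapshots keep the history)
    rsync -a --delete /primaryfiles/ root@secondary:/secondaryfiles/

    # 3. Snapshot secondaryfiles into secondarysnapshots on the secondary
    ssh root@secondary rsnapshot -c /etc/rsnapshot-secondary.conf alpha

    # 4. Optionally mirror primaryfiles to a cloud archive bucket
    gcloud storage rsync --recursive /primaryfiles gs://my-archive-bucket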
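The file integrity check can be as simple as rsync in dry-run mode with checksums: if it itemizes anything, the copies differ. Here's an abbreviated sketch, assuming the rsnapshot layout puts the current snapshot at alpha.0/primaryfiles/ and using placeholder addresses for the SMTP provider; the real check also compares against the snapshot directories on the secondary.

    #!/bin/sh
    # integrity-check.sh -- daily cron job on the primary (sketch)
    LOG=/var/log/indiearchive-check.log

    # -n dry run, -a archive (so metadata differences show up too),
    # -c compare by checksum, -i itemize anything that differs
    rsync -naci --delete /primaryfiles/ \
        /primarysnapshots/alpha.0/primaryfiles/ > /tmp/diffs.txt
    # Same comparison against the secondary, over ssh
    rsync -naci --delete /primaryfiles/ \
        root@secondary:/secondaryfiles/ >> /tmp/diffs.txt

    if [ -s /tmp/diffs.txt ]; then
        # Discrepancies found: log them and mail the maintainer via curl/SMTP
        cat /tmp/diffs.txt >> "$LOG"
        { echo "Subject: Indie Archive integrity check failed"; echo;
          cat /tmp/diffs.txt; } > /tmp/mail.txt
        curl --url 'smtps://smtp.example.com:465' \
             --user 'archive@example.com:app-password' \
             --mail-from archive@example.com \
             --mail-rcpt maintainer@example.com \
             --upload-file /tmp/mail.txt
    fi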
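And the DuckDNS update really is just one line of curl from cron. The subdomain and token here are placeholders; leaving the ip parameter empty tells DuckDNS to use whatever public IP the request came from.

    # Cron entry on any machine behind the same router as the secondary:
    # every five minutes, point the DuckDNS subdomain at the current public IP
    */5 * * * * curl -s "https://www.duckdns.org/update?domains=myarchive&token=PLACEHOLDER-TOKEN&ip=" >/dev/null

The remote system can then reach the secondary at myarchive.duckdns.org instead of a static IP.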
So, as you can tell, there's a lot going on to make The Indie Archive work. Future podcasts will get down into the details and discuss some of the choices I had to make and why I made them. The funny thing about this project is that the actual code was the least amount of work. Figuring out exactly how rsync and rsnapshot work together was quite a bit of work. Configuration for both rsnapshot and ssh took a bit of head scratching. Then there were a few user id tricks I had to work through to make The Indie Archive usable. But, by far, the most work was writing The Indie Archive installation document detailing each step of installing the software on three systems.

It's been fun so far. If you have input, I always appreciate the help. I get quite a bit of help on Mastodon. If you go to home.gamerplus.org you will find the script for this podcast with the Mastodon comment thread embedded in the post. This podcast is being read from a document that is a work in progress. Current versions of the What Is The Indie Archive document will be posted at Codeberg when I'm ready to upload the project.

Thanks for listening.

https://www.theindiearchive.com/

Summary

Hairy Larry of the Plain Text Programs podcast introduces The Indie Archive, an inexpensive archival system (distinct from a backup system) for indie producers, built on rsync and rsnapshot. Projects are manually copied onto a primary system with their metadata preserved, then propagated across at least seven hard drives on four systems, primary, secondary, remote, and cloud storage, in three locations, with daily file integrity checks and email notification of any discrepancies.

Subtitle
Duration
Publishing date
2025-02-11 00:00
Link
https://hackerpublicradio.org/eps/hpr4312/index.html
Contributors
  hairylarry.nospam@nospam.gmail.com (hairylarry)
author  
Enclosures
https://hub.hackerpublicradio.org/ccdn.php?filename=/eps/hpr4312/hpr4312.mp3
audio/mpeg