Are Our Photos and Video Backups Really Archival?

Over the history of mankind, the best way we have found to archive data is to carve it into stone, then bury it in the sand. Photographically, the most stable form of archiving is probably a black-and-white silver-based image on a glass plate. For digital data, there is no perfect permanent storage option: most digital storage media can't confidently be relied on beyond 5-10 years.

So clearly, our standard backup procedure (two backups) on hard disk, solid-state disk, optical disks, or magnetic tape is not the whole job. It’s not enough to make our backups and just store the copies in a box. Many experts recommend checking our archived media once a year since any kind of commercially available (and affordable) storage gradually degrades over time whether or not it is in use. 

Even right after our backup copy is made, can we assume it’s valid? The date, time, and size of the file copy may match the original, but do we know that the data in the files on the backups are any good? The worst time to find out that the backups are no good is when you need to retrieve or restore some files!

If you are thinking that looking at every image or video in every backup you have made once a year seems impractical, you’re right! Using a program to compare two copies of a file byte by byte is much better. Using a smarter program to compare all the files in a backup with the original or second copy is even better, but think back to how long it takes to just copy terabytes of files from one disk to another — probably minutes to hours on a fast system or perhaps over a day on a slow network. And remember that comparing copies of files takes about the same amount of time since the computer must read two large files to check one against the other.
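A byte-by-byte comparison can be sketched with Python's standard library. The file names and contents below are stand-ins created just for the demonstration:

```python
import filecmp
import os
import tempfile

# Create a stand-in "original" and an exact "backup copy" to compare.
workdir = tempfile.mkdtemp()
original = os.path.join(workdir, "photo.jpg")
backup = os.path.join(workdir, "photo_copy.jpg")
with open(original, "wb") as f:
    f.write(os.urandom(4096))
with open(original, "rb") as src, open(backup, "wb") as dst:
    dst.write(src.read())

# shallow=False forces a byte-by-byte comparison: both files must be
# read in full, which is why verifying takes about as long as copying.
print(filecmp.cmp(original, backup, shallow=False))  # True
```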

Cutting the Effort in Half With a Digest

An alternative that cuts the checking process in half is to generate a message digest, also known as a cryptographic hash in computer terminology. A digest algorithm takes an entire file and runs it through a special scrambling procedure to output a short digest, typically 16 to 64 bytes (128 to 512 bits). The scrambling is not random: the algorithm is specified exactly, so the same digest will always be calculated for the same file. For the purposes of validating backups, the shortest standard digest is 128 bits (16 bytes, conventionally displayed as 32 hexadecimal characters) and is designated MD5 (Message Digest algorithm 5). The scrambling algorithm is designed in such a way that changing a single bit of the input drastically changes the digest.

You’ll often see the MD5 (or longer SHA) digests alongside files for download over the internet so that your copy of the downloaded file can be verified to be exactly the same as the copy on the server providing the file. The procedure is for you to download the file and generate the digest on your downloaded copy, which can then be compared to the digest published on the server. If they match, you can rest assured that you’ve successfully downloaded a clean copy without corruption by the communication channel.

It should be obvious that squeezing an arbitrary file down to a 128-bit digest cannot produce a digest that is unique to a single input file. However, the chance of two files (or an original and a corrupted version of a file) having the same digest is small enough to ignore, so we can check the validity of a copy by generating its digest and comparing it with the stored digest of the original. The practical significance is that you don't need the original file in order to verify that the copy is accurate, which also saves the time that would have been spent re-reading the original.

Processing Backups

As noted earlier, longer digests (such as the SHA family) have been standardized, but computing them takes more CPU time, so MD5 is a good fit for verifying large collections of files. The general procedure is:

  • Start at the top-level directory of your backup set (e.g., a directory named 2020 for a year's photos).
  • Generate the MD5 digest for all files in directory 2020 and subdirectories, saving the list in a text file such as md5.txt.
  • Copy the directory (and subdirectories) along with the md5.txt file to the backup drive.

Once this is done, the backup disk copy can be verified by running an MD5 verification by reading the backup files and comparing the MD5 digests to what is contained in the md5.txt file. If they all match, you can be reasonably sure that the copy has been created without corruption. Note that this works for all files, regardless of their contents, so you can include your word-processed notes or spreadsheets that go along with the photos or videos you are archiving.

Software

While many programs exist to generate the MD5 digest of a file, most work only on a single file at a time. It turns out that the best (and fastest) way to batch-process all the files in a backup is to use a set of relatively simple programs that already exist in a Linux/Unix terminal environment. While this may sound primitive, it has the advantage of running on a variety of operating systems: Windows has WSL (Windows Subsystem for Linux), higher-end x86 Chromebooks support installing Linux, and macOS is built on a Unix variant beneath its GUI. Note: I have not personally tested this on an Apple computer.

Once a Linux terminal window is available, generate the MD5 digest on the original files (an example is the 2020 directory):

  • $ cd 2020    # make the top-level directory the current directory
  • $ find . -type f -exec md5sum "{}" + | tee md5.txt

For convenience, put the command line into a shell script to avoid typing it each time (if this is confusing, get your local computer guru to set you up).

Generating the MD5 digest... Each line displayed shows the MD5 digest in hexadecimal followed by the file name.

Note that "$" indicates the terminal prompt. The programs "find," "md5sum," and "tee" are installed by default on Linux systems. "find" lists all files at and below the current directory and passes that list to "md5sum," the heart of the procedure, which generates the MD5 digest for each file. The digests are collected in an output file in the current directory, here called md5.txt, though you can use any name you like. "tee" echoes each line to your screen as well as writing it to md5.txt, giving you feedback that your computer is actually working on the task.

Once the file tree has been copied to a backup destination (say, /Backup), just go into the backup directory and verify the copied files:

  • $ cd /Backup/2020    # make the backup directory the current directory
  • $ md5sum -c md5.txt | tee md5.log

This causes md5sum to read each digest and file name in md5.txt, regenerate the digest for that file, compare the two, and output "OK" or "FAILED." The results appear on the screen as they are processed and are also copied to the text file md5.log, which can be opened in any text editor and searched for the string "FAILED" to see whether any files failed verification. One caveat: md5.txt itself may be reported as FAILED, because it was still being written while its own digest was computed during the generation step.
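The log can also be scanned from the command line instead of a text editor; the log contents below are simulated for the demonstration:

```shell
# Simulate a verification log like the one md5sum -c produces.
printf 'a.jpg: OK\nb.jpg: FAILED\n' > md5.log

# List any failures; fall back to a message when there are none.
grep FAILED md5.log || echo "no failures found"

# Or just count them.
grep -c FAILED md5.log
```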

The MD5 generation process has completed (middle of this screenshot). The verification step is being run as a demonstration of the file-checking process. Each file checked is listed with an "OK" or "FAILED" status. Open md5.log in a text editor to review the results and search for failures.

Put Your Old Computer(s) to Work

Both the initial generation of the MD5 digest list and verification on a copy can take hours if you have a large number of files. But the good news is that this doesn’t have to tie up your main computer. Backup disks can be taken to another computer system and verified without needing access to the master set of files. Put the second computer (and other idle computers) to work while you go on with normal work on your main computer. These other computers also don’t have to be running the same operating system as long as they have a Linux terminal capability, since the md5sum program generates the same MD5 digest on all systems.

If we follow the recommendations of archival experts, we should do this every year on all of our backup copies so we can sleep better at night. The annual pass also reads the contents of every file on the disks, giving the firmware controlling the hard disk or SSD a regular chance to verify that all is well with the underlying hardware and recording medium. If read errors are found, the firmware will attempt to relocate the data out of failing areas of the media or, in the worst case, alert us to the failure of the drive.

28 Comments

sam dasso's picture

Give me a break! Maybe in the last century you couldn't rely on media for more than 5-10 years, but if you store to archival-quality DVD, you get 100 years, and if that is not enough, then use M-Disc for 1,000-year archival storage.

David Kodama's picture

We'll have to meet over coffee in a century to compare notes. :) Verbatim's M-Disc guarantee is: "Verbatim has been a leader in data storage technology since 1969, and guarantees this product with a 10-year limited warranty and technical support" -- see https://www.amazon.com/Verbatim-98912-M-Disc-100GB-Surface/dp/B011PGT2FQ

sam dasso's picture

My Sharp TV, purchased 10 years ago, only had a 90-day warranty. So a 10-year warranty on a product with a projected lifetime of several hundred years (based on ISO/IEC 16963 testing) is not that bad. But if you are looking for a lifetime warranty, buy an archival DVD with a 100-year estimated life:
https://www.bhphotovideo.com/c/product/522752-REG/Verbatim_96320_DVD_R_U...

J.d. Davis's picture

"Photographically, the most stable form of archiving is probably a black-and-white silver-based image on a glass plate."

????

That statement doesn't pass The Drop Test!

Kirk Darling's picture

It's a factually true statement though, inasmuch as some of those glass plates still survive and there is no form that has lasted longer.

J.d. Davis's picture

Hmmmm....being the oldest isn't the same as 'stable'. Glass is fragile in much the same way that magnetic media is, and almost all the forms of floppy disk can't be read anymore without obsolete machines.

I have music CDs from the early '80s with no data rot, and computer data CDs over 25 years old that show no degradation.

No, you have not convinced me that longevity is equal to stability, especially when the substrate is fragile!

Kirk Darling's picture

A glass plate doesn't require an obsoletable machine to display it.

J.d. Davis's picture

Just a clumsy oaf to destroy it!

Kirk Darling's picture

And they've still lasted a hundred years so far.

Justin Sharp's picture

A little late to the conversation, sorry. You are right about the glass plate. It is fragile, but the real topic here is the data on the glass. I can buy a great hard drive and physically destroy it the same as I can drop the glass plate, albeit with a bit more effort. Putting both the hard drive and the glass plate in the same controlled environment, the glass plate image has a better chance of outlasting the digital data. I have some older digital storage devices that are still working, but the chance is always there that they will fail. Maybe not, but there's the possibility that the digital data will degrade. However, the silver salts suspended in either collodion or gelatin, properly fixed and washed, will absolutely be more stable and have a much lower chance of degrading over time. The image will last for decades or maybe centuries... until I drop the glass plate.

Timothy Roper's picture

Unless you've got all your money in your freezer or in your mattress, all of your net worth is stored via computer files, too. So you've got a lot more to worry about than some pictures--if you're worried in the first place.

Hector M's picture

Whew! Clearly this topic draws a lot of emotion. I am an IT person, so I feel like I have had to consider this subject for the last three decades. When it comes to digital media for backups, there are so many aspects to consider, not the least of which is will you even have a device to read your files decades down the road? Also, humans inherently store a whole lot of "stuff". What you may treasure as invaluable, the next person may consider junk. So if you're looking to record all your treasure on gold discs and send them into space on your own Voyager, it's going to be inordinately expensive. Once the aliens have it in their possession, or your relatives, they may not know how to unlock its secrets. Or look at the data. Anyway, excellent topic!

C Fisher's picture

Not related to the post, but oh my god those stupid ads that border the entire screen are here now and I hate it. It keeps popping up every 30 seconds while I'm trying to read, and 3 more times now while typing. Can you guys block these types of ads? My phone is close to going through a window lmao

Kenneth Tanaka's picture

Printing is the only tried-and-true archival technology. Today’s inexpensive pigment printing technology makes that truer than ever.

J.d. Davis's picture

Citation needed.

Kirk Darling's picture

Well, just this weekend I've been going over my mother's photo albums of me as a baby in the early 50s. I'm going to digitize them and send them to all our relatives. Digital "seeding to the wind" is about the most certain way digital images are likely to be archived, because I'm pretty darned sure my children are just going to trash my old hard drives rather than comb through them. If the people who might want my images don't already have them before I die, they'll likely be lost.

J.d. Davis's picture

GUFFAW!

When my mother died at age 96, she left 3 steamer trunks full of paper prints and Kodachrome slides dating from the 1890s and 1940s, respectively. NONE of the paper prints were labeled, so we didn't know the locations or who was in the photos.

My sister and I took a handful of paper prints each and only kept a few; these were of people we were certain we knew. The rest filled the better part of a dumpster.

You are correct about one thing, your hard drives will be trashed before your body gets cold.

Seriously - unless the Getty or Guggenheim ask you to donate your work it will most likely be lost forever.

We have great-grandchildren and believe me, they don't give a toss about our photos...sigh!

sam dasso's picture

A little off topic: get yourself an Epson FastFoto scanner. My wife scanned 18 albums of 160 4x6 photos each in just 3 days, and most of that time she spent taking pictures out of the albums. This is an amazing photo scanner for $600, and it will save you tons of time. Scan quality is excellent.

Lee Christiansen's picture

I've written in my will that it is a condition of getting my money, that the recipient maintains and looks after my hard drives, keeps media accessible and keeps my websites and domains. (Seriously - there's a section about archiving).

Or else I'm coming back...!

J.d. Davis's picture

Unlikely, unless you have it doled out by the year...so let us know how that works - OK?

Lee Christiansen's picture

Re-writing my will with your suggestion right away... :)

Christian Comes's picture

While a hash has its uses, I do not see how the method described cuts time in any way.
In order to generate an updated MD5, you need to read the whole file; some computer has to read the complete file.
That is valid for any form of backup.
So it should be done by the computer that is "attached" to the storage media. If you cannot put them both physically in the same fast network or even the same computer, then yes, let the remote computer make some MD5s and compare with yours locally.
Personally I use a "replace-every-two-years" policy and data gets copied to a new magnetic media every two years, and there are at all times at least 2 copies of each file (and an off-site backup additionally).
Having said that, 20 years of digital pictures have survived many years of carelessness with only one copy of each file and no failures, so maybe hard drives are better than their reputation.

David Kodama's picture

Here's my work flow to make things clearer (I hope):
- I maintain my main computer and (at least) one backup computer and files
- Step 1. Main computer generates md5 hash on files to be archived.
- Step 2. Main computer copies archive files + md5 hash to a backup USB drive
- So far this takes 2 reads and 1 write for the archive set.
- Step 3. Take the backup drive to Backup computer.
- Step 4. Backup computer checks files against md5 hashes (1 read for archive set)
For each verification of the backup set (say annually), step 4. is repeated, saving the need to re-read from the original set of files each time. In addition this is done on another computer (not the master), thus saving main computer time. The backup computer is also used to make additional copies with the confidence that it is using a verified set of files.

I also am using hard disk drives, replacing the oldest ones as time goes by. Basically jumping from one ship to a new one before the old one sinks. I use HDD for speed, low cost, increasing density as each generation comes out, and relative stability (for archived drives).

J.d. Davis's picture

I also programmed for years (waves to fellow nerd); doubtful that an average Joe could follow this with success.

David Kodama's picture

[Wave back to you] :) I understand that, but I hope everyone (especially those in business) has someone to support them. I cringe every time I hear of someone losing their data.

Christian Comes's picture

Yes, I see now what you do. In my case, I occasionally alter old pictures so that would not work- but it is archiving as archiving goes, so yes, good method overall.

Jim Cutler's picture

When you die, people throw out your stuff. And even if one of your kids keeps your hundreds of thousands of images, their kids will throw them out, or lose them in a move or a flood, or not care all that much. The hard drives will fail and the tech will change. If the technology for them to view it is even slightly hard, they won't look at it. And if they do, it'll be once every 10 years. If the library is too large, it will be a chore to look through, so they won't. Enjoy your stuff now, back it up for your business while you live, and leave your children a small, very narrowed-down best-of collection. Other than that, realistically no one cares all that much about your images. It's just a fact. A friend who spent 60 years lovingly collecting the most amazing stamp and coin collections told me, "We see this for its beauty, but someone after me will say, 'Wow, let's sell it now and get that garage we want to add on.'"

J.d. Davis's picture

The big difference is: the stamps are actually worth something!