Use tune2fs to enable periodic mandatory disk-checking on Pi based devices?

jimrh · April 17, 2020, 10:40pm

Greetings!

I have been working out methods for creating system backups and archives of the work I want to do with Charlie, and I have been experimenting with creating and compressing disk (SD card) images.

While creating the images, I have been loop-mounting and fsck’ing¹ them to make sure that the filesystem hasn’t been damaged during the wash-dry-fold cycles of developing my techniques.
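A minimal sketch of that attach-and-check step, not the exact procedure used here. It assumes the image has the usual Raspbian layout (partition 2 is the ext4 root) and that it runs as root. The function name check_image and the image filename are placeholders.

```shell
# Sketch: check the root filesystem inside an SD-card image without modifying it.
# Assumptions: run as root; partition 2 of the image is the ext4 root filesystem.
check_image() {
    image="$1"
    # Attach the image to the first free loop device and scan its partition table.
    # --show prints the device name that was chosen (for example /dev/loop0).
    loopdev=$(losetup --find --partscan --show "$image") || return 1
    # -f forces a full check even if the filesystem is marked clean;
    # -n opens it read-only and answers "no" to every repair question.
    e2fsck -f -n "${loopdev}p2"
    status=$?
    # Always detach the loop device, whatever e2fsck reported.
    losetup --detach "$loopdev"
    return "$status"
}
```

It would be called as something like "check_image charlie-backup.img" (a hypothetical filename). A non-zero exit status means e2fsck reported problems or could not run.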

Taking a look at the filesystem using tune2fs - I noticed some interesting things:

Viz.:

root@Charlie:/home/pi# tune2fs -l /dev/mmcblk0p2
tune2fs 1.44.5 (15-Dec-2018)
Filesystem volume name:   rootfs
Last mounted on:          /
Filesystem UUID:          24eaa08b-10f2-49e0-8283-359f7eb1a0b6
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
Filesystem flags:         unsigned_directory_hash 
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              3853200
Block count:              15563776
Reserved block count:     480503
Free blocks:              14167650
Free inodes:              3709149
First block:              0
Block size:               4096
Fragment size:            4096
Group descriptor size:    64
Reserved GDT blocks:      910
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8112
Inode blocks per group:   507
Flex block group size:    16
Filesystem created:       Tue Nov  5 23:57:39 2019
Last mount time:          Thu Feb 14 13:12:11 2019
Last write time:          Thu Feb 14 13:11:59 2019
Mount count:              1
Maximum mount count:      -1
Last checked:             Thu Feb 14 13:11:59 2019
Check interval:           0 (<none>)
Lifetime writes:          35 GB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     32
Desired extra isize:      32
Journal inode:            8
First orphan inode:       267571
Default directory hash:   half_md4
Directory Hash Seed:      2525e02c-083e-4713-9d57-e4a4d6af7b0e
Journal backup:           inode blocks
Checksum type:            crc32c
Checksum:                 0xac5e9886

The filesystem “Errors behavior” is set to “Continue”.
The “Maximum mount count” (before requiring a disk check) is set to “-1” - disabled.
The “Check interval” (elapsed time before requiring a disk check) is set to “0” - disabled.
(Interesting maybe?) The last time the filesystem was checked for consistency was back on Valentine’s day, 2019! (or so the filesystem thinks. . . .)
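To pull just those settings out of the long listing, a small filter along these lines should do. It is a sketch: the function name fsck_policy_fields is made up, and the field labels are taken from the tune2fs 1.44.5 output above, so other versions may word them differently.

```shell
# Keep only the lines of "tune2fs -l" output that govern periodic checking.
# Reads the listing on standard input.
fsck_policy_fields() {
    grep -E '^(Errors behavior|Mount count|Maximum mount count|Last checked|Check interval):'
}

# Typical use (as root), with the device from the listing above:
#   tune2fs -l /dev/mmcblk0p2 | fsck_policy_fields
```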

The tune2fs man page says this about mount-count and interval dependent disk checking:

You should strongly consider the consequences of disabling mount-count-dependent checking entirely. Bad disk drives, cables, memory, and kernel bugs could all corrupt a filesystem without marking the filesystem dirty or in error. If you are using journaling on your filesystem, your filesystem will never be marked dirty, so it will not normally be checked. A filesystem error detected by the kernel will still force an fsck on the next reboot, but it may already be too late to prevent data loss at that point.

[. . . . .]

It is strongly recommended that either -c (mount-count-dependent) or -i (time-dependent) checking be enabled to force periodic full e2fsck (8) checking of the filesystem. Failure to do so may lead to filesystem corruption (due to bad disks, cables, memory, or kernel bugs) going unnoticed, ultimately resulting in data loss or corruption.

Though most distributions ship with filesystems set up to NEVER check, (and, just maybe, there’s a good reason for this), I would think that a system like the Dexter robots would want to include some kind of periodic, forced, filesystem sanity check.

When I install a Linux desktop/server system, I set my systems up to do the following:

The filesystem error behavior mode is set to “re-mount read-only”
The filesystem force-check interval is set to about a month - actually an odd number of days so the fsck-on-reboot doesn’t always happen on Monday.
The filesystem force-check mount count is set to about twenty-or-so reboots/remounts, so that about once a month a frequently used system will ask for a disk check.
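As a rough illustration of that desktop/server policy, the sketch below only builds the tune2fs command so it can be reviewed before being run as root. The 31-day interval and 20-mount limit are assumed stand-ins for “about a month” and “twenty-or-so”, and the device name and function name are placeholders.

```shell
# Build (but do not run) a tune2fs command for the desktop/server policy above.
# 31 days and 20 mounts are assumed example values, not figures from the post.
fsck_policy_cmd() {
    device="$1"
    # -e remount-ro : remount read-only when the kernel detects a filesystem error
    # -i 31d        : force a check once 31 days have passed since the last one
    # -c 20         : force a check after 20 mounts
    echo "tune2fs -e remount-ro -i 31d -c 20 $device"
}

# Print the command for a placeholder device; substitute the real partition.
fsck_policy_cmd /dev/sdXN
```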

In Charlie’s case, I set a more aggressive schedule:

root@Charlie:/home/pi# tune2fs -e remount-ro -i 17 -c 5 -C 6 /dev/mmcblk0p2
tune2fs 1.44.5 (15-Dec-2018)
Setting maximal mount count to 5  [<== Check Charlie’s file-system every fifth reboot.]
Setting current mount count to 6  [<== start checking and reset at the next reboot.]
Setting error behavior to 2  [<== remount read-only if the filesystem prangs to limit the damage.]
Setting interval between checks to 1468800 seconds  [<== force a re-check every 17 days if it doesn't happen sooner.]
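The interval in that last line is just the 17 days given to -i, expressed in seconds. A quick sketch of the arithmetic (the helper names are made up for illustration):

```shell
# Convert a check interval in days to the seconds value tune2fs reports.
days_to_seconds() {
    echo $(( $1 * 24 * 60 * 60 ))
}

# Convert a reported interval in seconds back to whole days.
seconds_to_days() {
    echo $(( $1 / 86400 ))
}

days_to_seconds 17        # prints 1468800
seconds_to_days 1468800   # prints 17
```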

My thinking is that a system like Charlie - or maybe Carl? - is more likely to have “interesting” things happen to it than the typical sits-and-plays-Solitaire desktop system. As a consequence, having a system that is more vigilant about disk integrity checking is a Good Thing.

Are there any really good reasons NOT to set up, (at the very least), some kind of periodic integrity test on our 'bots?

References:
[1] fsck - to run the e2fsck utility which checks a Linux ext2/3/4 filesystem for errors.

cyclicalobsessive · April 18, 2020, 12:35am

Interesting, but way beyond me.

While Carl may disagree with the importance of his most recent “life” experience, the only thing unrecoverable if his disk were to fail is a week or two of his life.log:

2020-04-16 07:53|[new_juicer.py.dock]---- Docking 1026 completed  at 8.1 v after 6.4 h playtime
2020-04-16 10:36|[new_juicer.py.undock]---- Dismount 1026 at 11.4 v after 2.7 h recharge
2020-04-16 17:00|[new_juicer.py.dock]---- Docking 1027 completed  at 8.1 v after 6.4 h playtime
2020-04-16 19:48|[new_juicer.py.undock]---- Dismount 1027 at 11.4 v after 2.8 h recharge
2020-04-17 02:10|[new_juicer.py.dock]---- Docking 1028 completed  at 8.1 v after 6.4 h playtime
2020-04-17 04:57|[new_juicer.py.undock]---- Dismount 1028 at 11.3 v after 2.8 h recharge
2020-04-17 11:18|[new_juicer.py.dock]---- Docking 1029 completed  at 8.1 v after 6.3 h playtime
2020-04-17 14:01|[new_juicer.py.undock]---- Dismount 1029 at 11.4 v after 2.7 h recharge
2020-04-17 20:19|lifelog.dEmain execution: 85
2020-04-17 20:24|[new_juicer.py.dock]---- Docking 1030 completed  at 8.1 v after 6.4 h playtime

Everything else is kept current on GitHub, including a periodic life.log.backup.

Me thinks you might have gone off the deep end…

jimrh · April 18, 2020, 11:37am

That depends on how well you have everything backed up on GitHub - which I am still working on - and how dynamic the environment is.

In a classroom where the base O/S image is on an SD card that can be trashed and re-duplicated at will, and any student work is saved on a USB drive - maybe not.

In an environment where the robot is the primary development environment, is being disassembled and re-assembled, or is running 24/7, maybe this isn’t a bad idea?

I am trying to create an environment for Charlie that is both resilient and recoverable.

Using autonomous robots like the Temi as an example, removing and totally re-flashing the O/S and current working data isn’t always possible. Ergo, a resilient environment is mandatory.

cyclicalobsessive · April 18, 2020, 12:17pm

Anecdotal impressions:

Before I started running Carl 24/7, he was manually booted/shut down 132 times in 8 months, and experienced three “read-only SD card” system failures, and no known disk corruptions.
In the 12 months since, he was manually booted/shut down 80 times and programmatically shut down (low voltage protection) 14 times - with zero “read-only SD card” failures, and no known disk issues. (Only 40 boots in the last 8 months.)
Running rpi-clone takes roughly 4 minutes.
Running apt-get upgrade is the riskiest operation, and the only thing that has required using an rpi-cloned backup.

Might it be more dangerous to shut down to check the filesystem, than to remain running?
Could SD card technology have improved such as to obviate external checking?

jimrh · April 19, 2020, 5:19pm

Famous quote:
“When a politician says ‘Trust me!’ I always reach my hand back to make sure my wallet is still there.”

I don’t trust any kind of storage any further than I can throw it under water. Look up “ghost shift” - particularly as applied to SD/Flash devices. This has become enough of an issue that I only buy SD cards/Flash drives from suppliers I trust: Samsung, SanDisk, Toshiba, and, (interestingly enough), Micro Center. I specifically avoid Kingston, PNY, and bonzoid brands that I’ve never heard of before. Actually I have some, but I use them primarily as scratch drives that are used, dumped, and wiped.

As for whether shutting down to check the filesystem is more dangerous than staying up: I doubt it - unless your filesystem or drive, (including contacts and such), is so severely pranged as to make its continued use a miracle.

What IS dangerous is the uncommanded sudden file-system shut-down. (i.e. Loss of power, system lockup, etc.) Journaling filesystems help, but it’s like the safety net under the high-wire at the circus; it’s nice to have, but it’s smart not to depend on it.

When I worked as a Systems QA Analyst at H&R Block, I researched the way the H&R Block online tax preparation servers were run. Especially during the busy season, the servers were designed to periodically reboot on a rotating basis. Usually late at night.

Gtech, which runs somewhere between 60 and 75% of the world’s lotteries and makes more money than God, (brand new Beemers are not unheard of as employee door-prize gifts at the annual Christmas party), is contractually and legally required to reboot a jurisdiction’s entire lottery once every day after the lottery closes, balances, and backs up. This includes a remotely commanded reboot of every on-line lottery terminal that hasn’t already been shut down by the lottery agent.

The question of how vigilant you are about the integrity of your system drives - both on your 'bot and in general - is a matter of personal opinion. I’ve seen enough drives die in my time that I don’t trust them.

Back home in Worcester, MA, I used to have a 15-20 TB file server that had years and years of data, programs, and utilities on it, dating back to the pre-PC era. It had several sets of striped RAID-5 drive arrays, and for every active drive array there was a dedicated backup drive, 2-3x the size of the active drive, that I used to create nightly incremental backups and weekly full backups using a “Tower of Hanoi” rotation method. I set a rotating mandatory fsck and made sure the entire system was completely shut down and rebooted at least once a week. Critical financial, business, and irreplaceable data was periodically backed up to separate drives that were labeled and filed away. I also kept spare copies of every critical cable, electrical subsystem, and interface board. I eventually upgraded to a custom Athlon server-class MoBo, redundant power supplies that were woefully over-specified, ECC memory, etc.
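For anyone unfamiliar with it, a “Tower of Hanoi” rotation reuses backup set 1 every second backup, set 2 every fourth, set 3 every eighth, and so on, so older sets reach further back in time. The sketch below shows the general scheme only, not the actual scripts from that server. The function name and the choice of three sets are assumptions.

```shell
# Print which backup set to use for backup number n (counting from 1),
# given max_sets sets in total. Set 1 is the most frequently reused.
hanoi_set() {
    n=$1
    max_sets=$2
    set_no=1
    # Each factor of 2 in n moves one step toward a less frequently used set,
    # stopping at the last set available.
    while [ $((n % 2)) -eq 0 ] && [ "$set_no" -lt "$max_sets" ]; do
        n=$((n / 2))
        set_no=$((set_no + 1))
    done
    echo "$set_no"
}

# Show the rotation for the first eight backups with three sets.
for backup in 1 2 3 4 5 6 7 8; do
    echo "backup $backup -> set $(hanoi_set "$backup" 3)"
done
```

With three sets the pattern repeats as 1, 2, 1, 3, so the last set is rewritten only every fourth backup.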

I did this after I lost a 4 TB striped drive array that wasn’t being backed up and I had to - literally - forensically piece the data back together with microscope, tweezers, hex editor, and insane Linux drive utilities that Angels Themselves feared to use. After over two weeks of effort and multiple hundreds of dollars that I didn’t have to spend, I recovered about 80% of the data. The other 20% was irrevocably lost and was irreplaceable.

After that happened, I swore a Holy Oath to never subject myself to that again. Over the top? Maybe. “Once burnt, twice shy” and after the server crash maybe I reacted like someone with burns over 90% of their body. But I damn sure wasn’t going to lose any more data.