Creating a Script to Find File Corruptions
Last updated: 17/07/2018
So a while back, earlier this year, I did some work experience as your run-of-the-mill 'IT guy' at another school. While I was there, they had just finished updating all their servers to the latest Ubuntu, and a program they previously used for the very same purpose my script serves today had stopped working (it was written in Ruby, so I didn't touch it). So I offered to write them a new one, mostly thanks to my lack of interest in configuring SSH, VNC and FTP for three days straight on their new servers...
The other guy there hated programming, and Linux for that matter, so I guess he was really good with Active Directory? I dunno, so that was that. The language of choice was Python, because Python is a godsend for stuff like this, and so on the shitty Dell laptop they gave me (I wasn't allowed to bring mine in, thanks GDPR), I got to work.
This is the first part of the code; you can see the Python libraries I used and the basic UI. One thing to look at, however, is the 'dire' variable: this is the actual search directory for files. Now let me explain how the script works:
- Takes the name, hash and timestamp of every file in a dir, then exports the results to an output file
- When run again some time later, does the same as before, then compares the hashes and timestamps
- If the timestamp and hash both changed, the file was edited; if the timestamp is the same but the hash is different, maybe it's corrupt
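The rule in that last bullet boils down to a few comparisons. Here's a minimal sketch of it; the function and argument names are mine, not the original script's:

```python
def classify(old_hash, old_mtime, new_hash, new_mtime):
    """Decide what happened to a file between two scans."""
    if old_hash == new_hash:
        return "unchanged"          # same content, nothing to report
    if old_mtime != new_mtime:
        return "edited"             # content and timestamp both changed
    return "possibly corrupt"       # content changed but the timestamp didn't
```

So a file is only flagged when its bytes changed without anything ever legitimately writing to it.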
This part of the script reads (recursively) through each file in a directory and outputs its name, SHA256 hash and timestamp to an output file. I originally used glob.iglob for this, but I couldn't get it to output each file's full path, not just the filename, so I switched to os.walk instead. For the hashes, I used hashlib, which is part of Python's standard library. One concern was not overloading the server's memory (i.e. don't try to dump every file on the disk into RAM at once), but end testing showed it wasn't too bad, even on a server with over 40 TB of student data, although it took about 18 hours or so to finish xD.
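A rough sketch of what that scanning pass looks like. This is my reconstruction from the description above, not the original code, and the record format (tab-separated path, hash, mtime) is an assumption; the chunked read is what keeps memory flat no matter how big the files are:

```python
import hashlib
import os

def scan(dire, out_path, chunk_size=1024 * 1024):
    """Walk `dire` recursively, writing one "path<TAB>sha256<TAB>mtime"
    line per file to `out_path`."""
    with open(out_path, "w") as out:
        for root, _dirs, files in os.walk(dire):
            for name in files:
                path = os.path.join(root, name)   # full path, unlike plain glob
                h = hashlib.sha256()
                # hash in chunks so a huge file never gets loaded into RAM whole
                with open(path, "rb") as f:
                    for chunk in iter(lambda: f.read(chunk_size), b""):
                        h.update(chunk)
                mtime = os.path.getmtime(path)
                out.write(f"{path}\t{h.hexdigest()}\t{mtime}\n")
```

Note the output file should live outside the scanned directory, or the script ends up hashing its own half-written records.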
And now for the most frustrating part of the program:
This is (most of) the algorithm that 'decides' whether a file has just been edited or has truly been corrupted. As you can see, it is just a bunch of if and else statements as it compares the scan (output) file to older records and checks for corruption. One thing worth mentioning is that corruption is more likely to occur when a file has been sitting on a drive untouched for some time, and most modern hard drives fix the tiny mis-flipped 1's and 0's that occur naturally, so they can often be repaired just by the OS touching each file. Still, it's only a precaution.
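In case the screenshot is hard to read, the comparison pass amounts to something like this. Again, this is a hedged sketch assuming the tab-separated record format from earlier; the helper names are mine:

```python
def load_records(scan_path):
    """Read a scan file of "path<TAB>sha256<TAB>mtime" lines into a dict."""
    records = {}
    with open(scan_path) as f:
        for line in f:
            path, digest, mtime = line.rstrip("\n").split("\t")
            records[path] = (digest, mtime)
    return records

def find_corrupt(old_scan, new_scan):
    """Return paths whose hash changed while the timestamp stayed the same."""
    old = load_records(old_scan)
    new = load_records(new_scan)
    suspects = []
    for path, (digest, mtime) in new.items():
        if path not in old:
            continue                  # file is new, nothing to compare against
        old_digest, old_mtime = old[path]
        # hash changed but mtime didn't: nothing legitimately wrote to it,
        # so the bytes changed on their own -- the corruption case
        if digest != old_digest and mtime == old_mtime:
            suspects.append(path)
    return suspects
```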
This last bit of code just does some cleanup of the files it used, along with saving backups of past records in case you lose the most recent one. This script was built for Linux, but can still work on Windows, although watch out, as it can trip up on weird $ Windows file types etc... Download the script here
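The backup step could be as simple as copying the latest records file somewhere with a timestamped name and pruning the oldest copies. A sketch under my own assumptions (timestamped filenames, a `keep` count), not the original cleanup code:

```python
import os
import shutil
import time

def backup_records(record_path, backup_dir, keep=5):
    """Copy the latest scan file into `backup_dir` with a timestamped name,
    then delete all but the newest `keep` backups."""
    os.makedirs(backup_dir, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    shutil.copy2(record_path, os.path.join(backup_dir, f"records-{stamp}.tsv"))
    # timestamped names sort chronologically, so the oldest come first
    backups = sorted(os.listdir(backup_dir))
    for old in backups[:-keep]:
        os.remove(os.path.join(backup_dir, old))
```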