I recently had an issue which required me to find the MD5 hash of several thousand files on an FTP server. My first instinct was to use curl or wget (or the very cool lftp) and md5sum, but I soon realized that the bash script would quickly become onerous, so I turned to Python.
We use Python’s ftplib pretty extensively in our other systems, and it turned out to be perfect for this job. Since all of our files were on the same FTP site, we only needed to log in once. Logging in (anonymously) is easy with something like this:

    import ftplib

    def login(site):
        ftp = ftplib.FTP(site)
        ftp.login()
        return ftp

If your site requires a username and password, pass them to the login method, e.g. ftp.login(user, passwd).
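As a rough sketch, the login helper could accept optional credentials and fall back to an anonymous login when none are given. The parameter names `user` and `password` and the host in the usage note are illustrative assumptions, not part of the original script.

```python
import ftplib


def login(site, user=None, password=None):
    """Connect to `site` and log in, anonymously unless a user is given."""
    ftp = ftplib.FTP(site)  # passing a host makes FTP() connect immediately
    if user is None:
        ftp.login()  # with no arguments, ftplib logs in as 'anonymous'
    else:
        ftp.login(user, password or "")
    return ftp
```

With a hypothetical host, this would be called as `login("ftp.example.org")` or `login("ftp.example.org", "alice", "secret")`.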
To retrieve a file using ftplib, you use the retrbinary method of the FTP class. This method takes two required arguments:
- A command to download the file, generally something like 'RETR README'
- A callback, which is called with each chunk of data as it is downloaded. The argument to that callback is the chunk of data. The canonical example is: open('README', 'wb').write
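For comparison, here is a minimal sketch of the conventional download-to-disk use of retrbinary. The helper name `download` and its parameters are made up for illustration; `ftp` is assumed to be a logged-in ftplib.FTP instance.

```python
def download(ftp, remote_path, local_path, blocksize=8192):
    """Save one remote file to disk, writing each chunk as it arrives."""
    with open(local_path, "wb") as fh:
        # fh.write is the callback: retrbinary calls it once per received chunk.
        ftp.retrbinary("RETR %s" % remote_path, fh.write, blocksize)
    return local_path
```

Using a `with` block, rather than the bare `open(...).write` form, makes sure the local file is closed once the transfer finishes.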
Here comes the really cool part… There’s no need for us to write the file to disk or even hold all of it in memory: we can simply pass the hash object’s update method as the callback.

    import ftplib
    import hashlib

    def get_ftp_md5(ftp, remote_path):
        m = hashlib.md5()
        # Feed each downloaded chunk straight into the hash object.
        ftp.retrbinary('RETR %s' % remote_path, m.update)
        return m.hexdigest()

Here, ftp is the instance of the FTP class returned by the login function defined above, and remote_path is the full path to the file on the FTP server (not the URL of the file, just the path portion, e.g. /pub/data/2013/142/myfile.tgz).
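The trick relies on hashlib objects accepting data incrementally: calling update several times with consecutive chunks gives the same digest as hashing all the bytes at once. A small offline check, using made-up data in place of a real download, might look like this:

```python
import hashlib

data = b"x" * 100_000  # stand-in for a file's contents

# Hash everything in one call.
whole = hashlib.md5(data).hexdigest()

# Hash the same bytes in 8192-byte chunks, the way a download callback
# would receive them one at a time.
m = hashlib.md5()
for start in range(0, len(data), 8192):
    m.update(data[start:start + 8192])

print(m.hexdigest() == whole)  # True: chunked and one-shot digests match
```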
It’s not groundbreaking stuff, but it’s a simple way to get the MD5 hash of a file on an FTP server without putting the entire file on disk or in memory, and it only requires logging into the FTP server once, not once per file.
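To make the "log in once" point concrete, here is one way the per-file function might be applied to a whole list of paths over a single connection. The helper `md5_for_paths` is a hypothetical name, and `ftp` is assumed to be an already logged-in ftplib.FTP instance.

```python
import hashlib


def get_ftp_md5(ftp, remote_path):
    """MD5 of one remote file, hashed chunk by chunk as it streams in."""
    m = hashlib.md5()
    ftp.retrbinary("RETR %s" % remote_path, m.update)
    return m.hexdigest()


def md5_for_paths(ftp, remote_paths):
    """Map each remote path to its MD5 hex digest, reusing one connection."""
    return {path: get_ftp_md5(ftp, path) for path in remote_paths}
```

The result is a plain dict of path to hex digest, which is easy to compare against a list of expected checksums. Calling `ftp.quit()` once the loop is done closes the connection politely.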