Recently I got an enormous amount of text and related image files of scanned texts (around 210,000 files). The whole dataset had a size of around 25 GB. The text data were to be analyzed with some machine learning [ML] algorithms. One of the first things to do in such a situation is to get rid of the jpg-files, as they consume most of the disk space.
In my case I had also received the data from a Mac machine. Hidden "._"-files were created on the Mac when the original data were downloaded from the Internet. These files carry Mac-specific metadata such as extended attributes. I had to eliminate these files, too.
Due to this doubling of files by the additional "._"-files, the total number of files was around 830,000, while the number of files really required was much smaller. Eliminating text files one does not need is an exercise which is often required in a Machine Learning context.
In such a situation the Python function "os.walk()" helps.
os.walk() allows us to walk recursively through a directory tree. For each visited directory we get a tuple back containing
- the path to the present directory,
- a list of all sub-directories,
- a list of all files (by their names) in the directory.
For typical applications this is enough information to perform analysis and file operations within the directories.
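The tuple structure can be illustrated with a tiny throwaway tree; the file and directory names below are invented for the demo:

```python
import os
import tempfile

# Build a small throwaway tree (names are made up for the demo)
root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, "sub"))
open(os.path.join(root, "a.txt"), "w").close()
open(os.path.join(root, "sub", "b.jpg"), "w").close()

for dirname, subdirs, filesdir in os.walk(root):
    # dirname:  path of the directory currently visited
    # subdirs:  names of its immediate sub-directories
    # filesdir: names of the files it contains
    print(dirname, subdirs, filesdir)
```

By default os.walk() visits the tree top-down, so the root directory appears first and its sub-directories afterwards.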
In my case the usage was very simple. In a Jupyter cell the following code helped:
import os
import time

dir_path = "/py/projects/CA22/catch22/"

# use os.walk to recursively run through the directory tree
v_start_time = time.perf_counter()

for (dirname, subdirs, filesdir) in os.walk(dir_path):
    print('[' + dirname + ']')
    for filename in filesdir:
        filepath = os.path.join(dirname, filename)
        #print(filepath)
        if filename.endswith('.jpg') or filename.endswith('.db') or filename.endswith('-GS.txt'):
            os.remove(filepath)

# extra loop as there are hidden "._"-files also for ".jpg"-files
for (dirname, subdirs, filesdir) in os.walk(dir_path):
    print('[' + dirname + ']')
    for filename in filesdir:
        filepath = os.path.join(dirname, filename)
        #print(filepath)
        if filename.startswith('._'):
            os.remove(filepath)

v_end_time = time.perf_counter()
print("Total CPU time ", v_end_time - v_start_time)
If you do not print out the file-paths, this should be a matter of seconds only; on an SSD it takes below a second.
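As a side note, the two deletion passes could be merged into a single walk, since str.endswith() also accepts a tuple of suffixes and the "._"-prefix test is independent of the suffix tests. A minimal sketch with the same suffixes and path, wrapped in a hypothetical helper function:

```python
import os

def remove_unwanted(dir_path):
    """Delete .jpg, .db, '-GS.txt' and hidden '._'-files below dir_path."""
    for dirname, subdirs, filesdir in os.walk(dir_path):
        for filename in filesdir:
            # str.endswith() accepts a tuple, so one test covers all three suffixes
            if (filename.endswith(('.jpg', '.db', '-GS.txt'))
                    or filename.startswith('._')):
                os.remove(os.path.join(dirname, filename))

remove_unwanted("/py/projects/CA22/catch22/")  # path from above; adjust as needed
```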
Counting the remaining files
We can also use os.walk() to count the number of remaining files:
n = 0
v_start_time = time.perf_counter()

for (dirname, subdirs, filesdir) in os.walk(dir_path):
    print('[' + dirname + ']')
    n += len(filesdir)

v_end_time = time.perf_counter()
print('Number of files = ', n)
print("Total CPU time ", v_end_time - v_start_time)
On a Linux system you could also use:
mytux:~ # find /py/projects/CA22/catch22/ -type f | wc -l
208123
Thus I could bring the total size down to 780 MB and the number of txt-files to be processed down to around 208,000.
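To double-check the remaining disk usage from Python, one could sum the file sizes with the same os.walk() pattern; a sketch with a hypothetical helper function (the path is the one from above):

```python
import os

def tree_size(dir_path):
    """Return the summed size in bytes of all files below dir_path."""
    total = 0
    for dirname, subdirs, filesdir in os.walk(dir_path):
        for filename in filesdir:
            total += os.path.getsize(os.path.join(dirname, filename))
    return total

print("Total size in MB:", tree_size("/py/projects/CA22/catch22/") / 1024**2)
```

This corresponds to "du -s --apparent-size" on Linux, as it sums file contents rather than allocated blocks.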