During some work for a ML project on a large text corpus I needed to extend a personally used reference vocabulary by some complex ad unusual German compounds and very branch specific technical terms. I kept my vocabulary data in a Pandas dataframe. Each “word” there had some additional information associated with it in some extra columns of the dataframe – as e.g. the length of a word or a stem or a list of constituting tri-char-grams. I was looking for a fast method to extend the dataframe in a quick procedure with a list of hundreds or thousands of new words.
I tried the df.append() method first and got disappointed with its rather bad performance. I also experimented with the incorporation of some lists or dictionaries. In the end a procedure based on csv-data was the by far most convenient and fastest approach. I list up the basic steps below.
In my case I used the lower case character version of the vocabulary words as an index of the dataframe. This is a very natural step. It requires some small intermediate column copies in the step sequence below, which may not be necessary for other use-cases. For the sake of completeness the following list contains many steps which have to be performed only once and which later on are superfluous for a routine workflow.
- Step1: Collect your extension data, i.e. a huge bunch of words, in a Libreoffice Calc-file in ods-format or (if you absolutely must) in an MS Excel-file. One of the columns of your datasheet should contain data which you later want to use as a (unique) index of your dataframe – in my case a column “lower” (containing the low letter representation of a word).
- Step 2: Avoid any operations for creating additional column information which you later can create by Python functions working on information already contained in some dataframe columns. Fill in dummy values into respective columns. (Or control the filling of a dataframe with special data during the data import below)
- Step 3: Create a CSV-File containing the collected extension data with all required field information in columns which correspond to respective columns of the dataframe to be extended.
- Step 4:Create a backup copy of your original dataframe which you want to extend. Just as a precaution ….
- Step 5: Copy the contents of the index of your existing dataframe to a specific dataframe column consistent with step 1. In my case I copied the words’ lower case version into a new data column “lower”.
- Step 6: Delete the existing index of the original dataframe and create a new basic integer based index.
- Step 7: Import the CSV-file into a new and separate intermediate Pandas dataframe with the help of the method pd.read_csv(). Map the data columns and the data formats properly by supplying respective (list-like) information to the parameter list of read_csv(). Control the filling of possibly empty row-fields. Check for fields containing “null” as string and handle these by the parameter “na_filter” if possible (in my case by “na_filter=False”)
- Step 8: Work on the freshly created dataframe and create required information in special columns by applying row-specific Python operations with a function and the df.apply()-method. For the sake of performance: Watch out for naturally vectorizable operations whilst doing so and separate them from other operations, if possible.
- Step 9: Check for completeness of all information in
your intermediate dataframe. verify that the column structure matches the columns of the original dataframe to be extend. - Step 10: Concatenate the original Pandas dataframe (for your vocabulary) with the new dataframe containing the extension data by using the df.concat() or (simpler) by df.append() methods.
- Step 11: Drop the index in the extended dataframe by the method pd.reset_index(). Afterward recreate a new index by pd.set_index() and using a special column containing the data – in my case the column “lower”
- Step 12: Check the new index for uniqueness – if required.
- Step 13: If uniqueness is not given but required:
Apply df = df[~df.index.duplicated(keep=’first’)] to keep only the first occurrence of rows for identical indices. But be careful and verify that this operation really fits your needs. - Step 14: Resort your index (and extended dataframe) if necessary by applying df.sort_index(inplace=True)
Some steps in the list above are of course specific for a dataframe with a vocabulary. But the general scheme should also be applicable for other cases.
From the description you have certainly realized which steps must only be performed once in the beginning to establish a much shorter standard pipeline for dataframe extensions. Some operations regarding the index-recreation and re-sorting can also be automatized by some simple Python function.
Have fun with Pandas!