Usefulness of .pipe()¶

Niwako Sugimura, Ph.D., Shad Sharma, MASc., Houda Aynaou, MA

My name is Niwako, and I worked with Shad and Houda on this Datathon. Today, we’re going to talk about the .pipe() method, which we used to help keep our code tidy.

Challenges in working on an ML project¶

  • Notebooks get messy with lots of data exploration and preprocessing
  • Hard to keep track of the transformations applied to a dataframe
  • The same set of changes must be applied to the train, test, and unlabeled sets

Definition¶

DataFrame.pipe(func, *args, **kwargs)
  • Takes a function whose first argument is the dataframe
  • The function transforms the dataframe in some way and returns the result
  • .pipe() returns whatever the function returns, so calls can be chained
In [ ]:
def func(df, args):
    # `transform` is a placeholder for any operation that returns a dataframe
    transformed_df = transform(df, args)
    return transformed_df

Usage¶

In [ ]:
transformed_df = (
    df.pipe(func, args)
    .pipe(some_other_func, other_args)
    .pipe(yet_another_func, yet_more_args)
)
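A minimal runnable sketch of this chaining, with toy functions (`add_total`, `scale`, and the column names are illustrative, not from our notebook):

```python
import pandas as pd


def add_total(df):
    # Add a derived column without mutating the input
    df = df.copy()
    df["total"] = df["a"] + df["b"]
    return df


def scale(df, factor):
    # Multiply the numeric columns by a constant factor
    df = df.copy()
    df[["a", "b", "total"]] = df[["a", "b", "total"]] * factor
    return df


df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
result = (
    df.pipe(add_total)
    .pipe(scale, factor=10)
)
```

Each .pipe() call passes the dataframe from the previous step into the next function, so the chain reads top to bottom, and the original df is left untouched.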

Practical Example¶

In [ ]:
FEATURES = ["column 1", "column 2", ...]


def drop_non_features(df):
    return df[FEATURES]

Stateful functions for .pipe()¶

In [ ]:
from sklearn.preprocessing import Normalizer


class NormalizeNumerical:
    def __call__(self, df, is_train):
        if is_train:
            # Fit the scaler only on the training set
            self.scaler = Normalizer()
            self.scaler.fit(df[NUMERICAL_COLUMNS])

        df = df.copy()
        df[NUMERICAL_COLUMNS] = self.scaler.transform(df[NUMERICAL_COLUMNS])
        return df


normalize_numerical = NormalizeNumerical()
In [ ]:
from sklearn.preprocessing import StandardScaler


class StandardizeNumerical:
    def __call__(self, df, is_train):
        if is_train:
            # Fit the scaler only on the training set
            self.scaler = StandardScaler()
            self.scaler.fit(df[NUMERICAL_COLUMNS])

        df = df.copy()
        df[NUMERICAL_COLUMNS] = self.scaler.transform(df[NUMERICAL_COLUMNS])
        return df


standardize_numerical = StandardizeNumerical()
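The stateful pattern can be sketched without sklearn. Here `SimpleStandardize`, `NUMERICAL_COLUMNS`, and the data are illustrative stand-ins: the instance stores its statistics during the training pass and reuses them afterwards:

```python
import pandas as pd

NUMERICAL_COLUMNS = ["x"]  # illustrative


class SimpleStandardize:
    """Hand-rolled stand-in for StandardScaler to show the stateful pattern."""

    def __call__(self, df, is_train):
        if is_train:
            # Remember the training statistics on the instance
            self.mean = df[NUMERICAL_COLUMNS].mean()
            self.std = df[NUMERICAL_COLUMNS].std(ddof=0)
        df = df.copy()
        df[NUMERICAL_COLUMNS] = (df[NUMERICAL_COLUMNS] - self.mean) / self.std
        return df


simple_standardize = SimpleStandardize()

train = pd.DataFrame({"x": [0.0, 10.0]})
test = pd.DataFrame({"x": [5.0]})

train_out = train.pipe(simple_standardize, is_train=True)
test_out = test.pipe(simple_standardize, is_train=False)  # reuses train stats
```

Because the same instance is piped for both sets, the test set is standardized with the training mean and standard deviation, which is exactly what the is_train flag is for.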

Fitting, Testing, and Predicting¶

In [ ]:
def preprocess(df, is_train):
    return (
        df.pipe(drop_non_features)
        # .pipe(normalize_numerical, is_train=is_train)
        .pipe(standardize_numerical, is_train=is_train)
    )

Fit¶

In [ ]:
model = Model()  # using whatever model we want
model.fit(X_train.pipe(preprocess, is_train=True), y_train)

Test¶

In [ ]:
model.score(X_test.pipe(preprocess, is_train=False), y_test)

Predict¶

In [ ]:
y_unlabeled = model.predict(X_unlabeled.pipe(preprocess, is_train=False))
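The fit-and-predict pattern above can be sketched end to end with a toy model. `MeanModel`, `FEATURES`, and the data here are illustrative stand-ins, and this `preprocess` just drops non-feature columns:

```python
import pandas as pd

FEATURES = ["a"]  # illustrative


def preprocess(df, is_train):
    # Stand-in for the real pipeline: just select the feature columns
    return df[FEATURES]


class MeanModel:
    """Toy stand-in for a real estimator: always predicts the training mean."""

    def fit(self, X, y):
        self.mean = y.mean()
        return self

    def predict(self, X):
        return pd.Series([self.mean] * len(X))


X_train = pd.DataFrame({"a": [1, 2], "noise": [9, 9]})
y_train = pd.Series([10.0, 20.0])
X_unlabeled = pd.DataFrame({"a": [3], "noise": [9]})

model = MeanModel()
model.fit(X_train.pipe(preprocess, is_train=True), y_train)
y_unlabeled = model.predict(X_unlabeled.pipe(preprocess, is_train=False))
```

The call shape matches the slides: preprocessing is piped onto each set right at the point where the model consumes it.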

Example from our notebook for the Datathon¶

In [ ]:
def preprocess(df, is_train):
    return (
        df.pipe(drop_non_features)
        # .pipe(normalize_numerical, is_train=is_train)
        .pipe(standardize_numerical, is_train=is_train)
        # .pipe(replace_missing_numerical, fill=-999)
        .pipe(impute_missing_numerical, is_train=is_train)
        .pipe(onehot_encode_categorical, is_train=is_train)
    )

Benefits of using .pipe()¶

  • Keep the original dataframe intact
  • Easily select and switch around preprocessing steps
  • Try different classifiers more easily

How is this different from just using functions?¶

The main difference is that, without .pipe(), you either need a temporary variable:

In [ ]:
def preprocess(df, is_train):
    tdf = df.copy()
    tdf = drop_non_features(tdf)
    # tdf = normalize_numerical(tdf, is_train=is_train)
    tdf = standardize_numerical(tdf, is_train=is_train)
    # tdf = replace_missing_numerical(tdf, fill=-999)
    tdf = impute_missing_numerical(tdf, is_train=is_train)
    tdf = onehot_encode_categorical(tdf, is_train=is_train)
    return tdf

Or you end up with unwieldy nested function calls:

In [ ]:
def preprocess(df, is_train):
    return onehot_encode_categorical(
        impute_missing_numerical(
            standardize_numerical(drop_non_features(df), is_train=is_train),
            is_train=is_train,
        ),
        is_train=is_train,
    )
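All three styles compute the same result; a quick check with toy functions (`add_one` and `double` are illustrative):

```python
import pandas as pd


def add_one(df):
    df = df.copy()
    df["x"] = df["x"] + 1
    return df


def double(df):
    df = df.copy()
    df["x"] = df["x"] * 2
    return df


df = pd.DataFrame({"x": [1, 2]})

# 1. Temporary variable
tdf = df.copy()
tdf = add_one(tdf)
tdf = double(tdf)

# 2. Nested calls, read inside-out
nested = double(add_one(df))

# 3. .pipe() chain, read top-to-bottom
piped = df.pipe(add_one).pipe(double)
```

The .pipe() version avoids both the repeated temporary assignments and the inside-out reading order of the nested calls.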