Count Thresholder

Count Thresholder allows you to map infrequent categorical variables to a new/separate category. Input columns to the CountThresholder must by of type string, int, list, or dict. For each column in the input, the transformed output is a column where the input category is retained as-is if it has occurred at least threshold times in the training data. Categories that do not satisfy the above are set to output_category_name.

The behaviour for different input data column types is as follows: (see transform() for examples).

string : Strings are marked with the output_category_name if the threshold condition described above is not satisfied.
int : Behave the same way as string. If output_category_name is of type string, then the entire column is cast to string.
list : Each of the values in the list are mapped in the same way as a string value.
dict : They key of the dictionary is treated as a namespace and the value is treated as a sub-category in the namespace. The categorical variable passed through the transformer is a combination of the namespace and the sub-category.

You specify the threshold at which to preserve the categories with the parameter "threshold".

Introductory Example

from graphlab.toolkits.feature_engineering import *

# Create data.
sf = gl.SFrame({'a': [1,2,3], 'b' : [2,3,4]})

# Create a transformer.
count_tr = gl.feature_engineering.create(sf, CountThresholder(threshold = 1))

# Transform the data.
transformed_sf = count_tr.transform(sf)

# Save the transformer.
count_tr.save('save-path')

# Return the categories that are not discarded.
count_tr['categories']

Columns:
        feature str
        category  str

Rows: 6

Data:
+---------+----------+
| feature | category |
+---------+----------+
|    a    |    1     |
|    a    |    2     |
|    a    |    3     |
|    b    |    2     |
|    b    |    3     |
|    b    |    4     |
+---------+----------+
[6 rows x 2 columns]

Fitting and transforming

Once a CountThresholder object is constructed, it must first be fitted and then the transform function can be called to generate encoded features.

# String/Integer columns
# ----------------------------------------------------------------------
sf = gl.SFrame({'a' : [1,2,3,2,3], 'b' : [2,3,4,2,3]})

# Set all categories that did not occur at least 2 times to None.
count_tr = gl.feature_engineering.CountThresholder(threshold = 2)

# Fit and transform on the same data.
transformed_sf = count_tr.fit_transform(sf)

Columns:
a   int
b   int

Rows: 3

Data:
+-------+--------+
|   a   |   b    |
+-------+--------+
| None  |    2   |
|   2   |    3   |
|   3   |  None  |
|   2   |    2   |
|   3   |    3   |
+-------+--------+
[5 rows x 2 columns]

# Lists can be used to encode sets of categories for each example.
# ----------------------------------------------------------------------
sf = gl.SFrame({'categories': [['cat', 'mammal'],
                               ['cat', 'mammal'],
                               ['human', 'mammal'],
                               ['seahawk', 'bird'],
                               ['duck', 'bird'],
                               ['seahawk', 'bird']]})

# Construct and fit.
from graphlab.toolkits.feature_engineering import CountThresholder
count_tr = graphlab.feature_engineering.create(sf, CountThresholder(threshold = 2))

# Transform the data
transformed_sf = count_tr.transform(sf)

Columns:
        categories  list

Rows: 6

Data:
+-----------------+
|    categories   |
+-----------------+
|  [cat, mammal]  |
|  [cat, mammal]  |
|  [None, mammal] |
| [seahawk, bird] |
|   [None, bird]  |
| [seahawk, bird] |
+-----------------+
[6 rows x 1 columns]

# Dictionaries can be used for name spaces & sub-categories.
# ----------------------------------------------------------------------
sf = gl.SFrame({'attributes':
                [{'height':'tall', 'age': 'senior', 'weight': 'thin'},
                 {'height':'short', 'age': 'child', 'weight': 'thin'},
                 {'height':'giant', 'age': 'adult', 'weight': 'fat'},
                 {'height':'short', 'age': 'child', 'weight': 'thin'},
                 {'height':'tall', 'age': 'child', 'weight': 'fat'}]})

# Construct and fit.
from graphlab.toolkits.feature_engineering import CountThresholder
count_tr = gl.feature_engineering.create(sf,
                 CountThresholder(threshold = 2))

# Transform the data
transformed_sf = count_tr.transform(sf)

Columns:
    attributes      dict

Rows: 5

Data:
+-------------------------------+
|           attributes          |
+-------------------------------+
| {'age': None, 'weight': 't... |
| {'age': 'child', 'weight':... |
| {'age': None, 'weight': No... |
| {'age': 'child', 'weight':... |
| {'age': 'child', 'weight':... |
+-------------------------------+