Categorical Data
This is an introduction to the pandas categorical data type, including a short comparison with R’s factor.
Categoricals are a pandas data type corresponding to categorical variables in statistics. A categorical variable takes on a limited, and usually fixed, number of possible values (categories; levels in R). Examples are gender, social class, blood type, country affiliation, observation time or rating via Likert scales.
In contrast to statistical categorical variables, categorical data might have an order (e.g. ‘strongly agree’ vs ‘agree’ or ‘first observation’ vs. ‘second observation’), but numerical operations (additions, divisions, …) are not possible.
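The distinction between "has an order" and "supports arithmetic" can be illustrated with a minimal sketch. The survey answers and the names `levels` and `answers` are made up for illustration; the sketch assumes a reasonably recent pandas version.

```python
import pandas as pd

# Hypothetical Likert-style answers; the list order defines the logical order.
levels = ["disagree", "agree", "strongly agree"]
answers = pd.Series(
    pd.Categorical(
        ["agree", "strongly agree", "disagree", "agree"],
        categories=levels,
        ordered=True,
    )
)

# Order-based operations are available on ordered categoricals.
lowest = answers.min()
highest = answers.max()
above_agree = (answers > "agree").tolist()

# Arithmetic is not defined for categorical data and raises TypeError.
try:
    answers + 1
    arithmetic_allowed = True
except TypeError:
    arithmetic_allowed = False
```

Here `min`, `max` and the comparison follow the order given in `levels`, while the addition is rejected.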
All values of categorical data are either in categories or np.nan. Order is defined by the order of categories, not lexical order of the values. Internally, the data structure consists of a categories array and an integer array of codes which point to the real value in the categories array.
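The two-array layout described above can be inspected through the `.cat` accessor. The following is a small sketch, not a description of pandas internals beyond what the accessor exposes; the sample values are arbitrary.

```python
import numpy as np
import pandas as pd

s = pd.Series(["b", "a", np.nan, "b"], dtype="category")

# The distinct values; when inferred from the data they are sorted.
categories = list(s.cat.categories)

# One integer per row, giving the position of its value in `categories`.
# Missing values are represented by the code -1.
codes = s.cat.codes.tolist()

# Rebuilding the values by hand from the two arrays.
rebuilt = [categories[c] if c != -1 else np.nan for c in codes]
```

With this input, `categories` holds the two distinct labels and `codes` holds one small integer per row, which is what makes the representation compact.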
The categorical data type is useful in the following cases:
- A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory.
- The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order.
- As a signal to other Python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).
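The first two cases in the list can be sketched as follows. The data is invented for illustration, and the exact byte counts depend on the pandas version and platform, so the sketch only compares them rather than stating figures.

```python
import pandas as pd

# Case 1: a string column with only a few distinct values.
words = pd.Series(["one", "two", "three"] * 1000)
as_category = words.astype("category")

object_bytes = words.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

# Case 2: logical order differs from lexical order.
raw = ["two", "three", "one"]
ordered = pd.Series(
    pd.Categorical(raw, categories=["one", "two", "three"], ordered=True)
)

lexical_sorted = sorted(raw)                      # plain string sorting
logical_sorted = ordered.sort_values().tolist()   # follows the category order
largest = ordered.max()
```

Comparing `object_bytes` with `category_bytes` shows the saving for repetitive string data, and `logical_sorted` follows the declared category order rather than alphabetical order.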
See also the API docs on categoricals.
Object Creation
Series Creation
Categorical Series or columns in a DataFrame can be created in several ways:
By specifying dtype="category" when constructing a Series:
In [1]: s = pd.Series(["a", "b", "c", "a"], dtype="category")

In [2]: s
Out[2]:
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a, b, c]
By converting an existing Series or column to a category dtype:
In [3]: df = pd.DataFrame({"A": ["a", "b", "c", "a"]})

In [4]: df["B"] = df["A"].astype('category')

In [5]: df
Out[5]:
   A  B
0  a  a
1  b  b
2  c  c
3  a  a
By using special functions, such as cut(), which groups data into discrete bins. See the example on tiling in the docs.
In [6]: df = pd.DataFrame({'value': np.random.randint(0, 100, 20)})

In [7]: labels = ["{0} - {1}".format(i, i + 9) for i in range(0, 100, 10)]

In [8]: df['group'] = pd.cut(df.value, range(0, 105, 10), right=False, labels=labels)

In [9]: df.head(10)
Out[9]:
   value    group
0     65  60 - 69
1     49  40 - 49
2     56  50 - 59
3     43  40 - 49
4     43  40 - 49
5     91  90 - 99
6     32  30 - 39
7     87  80 - 89
8     36  30 - 39
9      8    0 - 9
By passing a pandas.Categorical object to a Series or assigning it to a DataFrame.
In [10]: raw_cat = pd.Categorical(["a", "b", "c", "a"], categories=["b", "c", "d"],
   ....:                          ordered=False)
   ....:

In [11]: s = pd.Series(raw_cat)

In [12]: s
Out[12]:
0    NaN
1      b
2      c
3    NaN
dtype: category
Categories (3, object): [b, c, d]

In [13]: df = pd.DataFrame({"A": ["a", "b", "c", "a"]})

In [14]: df["B"] = raw_cat

In [15]: df
Out[15]:
   A    B
0  a  NaN
1  b    b
2  c    c
3  a  NaN
Categorical data has a specific category dtype:
In [16]: df.dtypes
Out[16]:
A      object
B    category
dtype: object
DataFrame Creation
Similar to the previous section where a single column was converted to categorical, all columns in a DataFrame can be batch converted to categorical either during or after construction.
This can be done during construction by specifying dtype="category" in the DataFrame constructor:
In [17]: df = pd.DataFrame({'A': list('abca'), 'B': list('bccd')}, dtype="category")

In [18]: df.dtypes
Out[18]:
A    category
B    category
dtype: object
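For conversion after construction, one possible approach is `DataFrame.astype("category")`. The sketch below assumes a pandas version in which this call converts every column separately; the names `df_plain` and `df_cat` are illustrative only.

```python
import pandas as pd

# Build the frame with ordinary (non-categorical) columns first.
df_plain = pd.DataFrame({"A": list("abca"), "B": list("bccd")})

# Convert all columns in one call after construction.
df_cat = df_plain.astype("category")

dtypes = [str(t) for t in df_cat.dtypes]

# Each column gets its own set of categories.
cats_a = list(df_cat["A"].cat.categories)
cats_b = list(df_cat["B"].cat.categories)
```

The original frame `df_plain` is left unchanged, since `astype` returns a new object.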
Note that the categories present in each column differ; the conversion is done column by column, so only labels present in a given column are categories:
In [19]: df['A']
Out[19]:
0    a
1    b
2    c
3    a
Name: A, dtype: category
Categories (3, object): [a, b, c]

In [20]: df['B']
© 2008–2012, AQR Capital Management, LLC, Lambda Foundry, Inc. and PyData Development Team
Licensed under the 3-clause BSD License.
https://pandas.pydata.org/pandas-docs/version/0.24.2/user_guide/categorical.html