pandas.factorize
-
pandas.factorize(values, sort=False, order=None, na_sentinel=-1, size_hint=None)
Encode the object as an enumerated type or categorical variable.
This method is useful for obtaining a numeric representation of an array when all that matters is identifying distinct values.
factorize is available as both a top-level function, pandas.factorize(), and as a method, Series.factorize() and Index.factorize().
Parameters:
values : sequence
A 1-D sequence. Sequences that aren’t pandas objects are coerced to ndarrays before factorization.
sort : bool, default False
Sort uniques and shuffle labels to maintain the relationship.
order
Deprecated since version 0.23.0: This parameter has no effect and is deprecated.
na_sentinel : int, default -1
Value to mark “not found”.
size_hint : int, optional
Hint to the hashtable sizer.
Returns:
labels : ndarray
An integer ndarray that’s an indexer into uniques. uniques.take(labels) will have the same values as values.
uniques : ndarray, Index, or Categorical
The unique valid values. When values is Categorical, uniques is a Categorical. When values is some other pandas object, an Index is returned. Otherwise, a 1-D ndarray is returned.
Note
Even if there’s a missing value in values, uniques will not contain an entry for it.
See also
pandas.cut : Discretize continuous-valued array.
pandas.unique : Find the unique values in an array.
Examples
These examples all show factorize as a top-level method like pd.factorize(values). The results are identical for methods like Series.factorize().
>>> labels, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'])
>>> labels
array([0, 0, 1, 2, 0])
>>> uniques
array(['b', 'a', 'c'], dtype=object)
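The Returns section states that uniques.take(labels) has the same values as the input. A minimal sketch of that round trip follows, assuming the input is held in an object ndarray rather than a list and contains no missing values:

```python
import numpy as np
import pandas as pd

# Same data as the example above, held in an object ndarray.
values = np.array(['b', 'b', 'a', 'c', 'b'], dtype=object)
labels, uniques = pd.factorize(values)

# Each label is a position in uniques, so take() rebuilds the input.
reconstructed = uniques.take(labels)
print(reconstructed)
```

This holds only when every label is a valid position; the missing-value sentinel needs separate handling.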
With sort=True, the uniques will be sorted, and labels will be shuffled so that the relationship is maintained.
>>> labels, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'], sort=True)
>>> labels
array([1, 1, 0, 2, 1])
>>> uniques
array(['a', 'b', 'c'], dtype=object)
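As a point of comparison, and not something this page states, the sorted result for input without missing values can be checked against NumPy's np.unique with return_inverse=True. The sketch below assumes an object ndarray of strings:

```python
import numpy as np
import pandas as pd

values = np.array(['b', 'b', 'a', 'c', 'b'], dtype=object)

# Sorted factorization: uniques are ordered and labels are remapped to match.
labels, uniques = pd.factorize(values, sort=True)

# np.unique returns sorted uniques and the inverse index into them.
np_uniques, np_inverse = np.unique(values, return_inverse=True)

print(labels, uniques)
print(np_inverse, np_uniques)

# The relationship between labels and uniques is preserved after sorting.
assert (uniques.take(labels) == values).all()
```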
Missing values are indicated in labels with na_sentinel (-1 by default). Note that missing values are never included in uniques.
>>> labels, uniques = pd.factorize(['b', None, 'a', 'c', 'b'])
>>> labels
array([ 0, -1, 1, 2, 0])
>>> uniques
array(['b', 'a', 'c'], dtype=object)
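Because -1 is also a valid "last element" position for an ndarray, indexing uniques directly with labels that contain the sentinel would silently return the wrong value. One way to rebuild the input is to mask the sentinel first. This is a sketch assuming the default sentinel of -1 and None as the missing marker:

```python
import numpy as np
import pandas as pd

values = np.array(['b', None, 'a', 'c', 'b'], dtype=object)
labels, uniques = pd.factorize(values)

# Positions where the input was missing carry the sentinel.
missing = labels == -1

# Fill the known positions from uniques and leave the rest as None.
restored = np.empty(len(labels), dtype=object)
restored[~missing] = uniques[labels[~missing]]
restored[missing] = None
print(restored)
```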
Thus far, we’ve only factorized lists (which are internally coerced to NumPy arrays). When factorizing pandas objects, the type of uniques will differ. For Categoricals, a Categorical is returned.
>>> cat = pd.Categorical(['a', 'a', 'c'], categories=['a', 'b', 'c'])
>>> labels, uniques = pd.factorize(cat)
>>> labels
array([0, 0, 1])
>>> uniques
[a, c]
Categories (3, object): [a, b, c]
Notice that 'b' is in uniques.categories, despite not being present in cat.values.
For all other pandas objects, an Index of the appropriate type is returned.
>>> cat = pd.Series(['a', 'a', 'c'])
>>> labels, uniques = pd.factorize(cat)
>>> labels
array([0, 0, 1])
>>> uniques
Index(['a', 'c'], dtype='object')
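The return-type rules described above can be checked side by side. The following sketch uses illustrative variable names (arr, cat, ser) and also compares the top-level function with the Series.factorize() method:

```python
import numpy as np
import pandas as pd

arr = np.array(['a', 'a', 'c'], dtype=object)
cat = pd.Categorical(['a', 'a', 'c'], categories=['a', 'b', 'c'])
ser = pd.Series(['a', 'a', 'c'])

_, uniques_arr = pd.factorize(arr)          # ndarray input
_, uniques_cat = pd.factorize(cat)          # Categorical input
labels_fn, uniques_ser = pd.factorize(ser)  # Series input, function form
labels_m, uniques_m = ser.factorize()       # Series input, method form

# Report the container type returned for each kind of input.
for name, u in [('ndarray', uniques_arr),
                ('Categorical', uniques_cat),
                ('Series', uniques_ser)]:
    print(name, '->', type(u).__name__)

# The function and the method give matching labels and uniques.
print((labels_fn == labels_m).all(), uniques_ser.equals(uniques_m))
```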
© 2008–2012, AQR Capital Management, LLC, Lambda Foundry, Inc. and PyData Development Team
Licensed under the 3-clause BSD License.
https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.factorize.html