Sparse data structures
Note
The SparsePanel class was removed in 0.19.0.
We have implemented “sparse” versions of Series and DataFrame. These are not sparse in the typical “mostly 0” sense. Rather, you can view these objects as being “compressed”: any data matching a specific value (NaN / missing value by default, though any value can be chosen) is omitted. A special SparseIndex object tracks where data has been “sparsified”. This will make much more sense with an example. All of the standard pandas data structures have a to_sparse method:
In [1]: ts = pd.Series(np.random.randn(10))

In [2]: ts[2:-2] = np.nan

In [3]: sts = ts.to_sparse()

In [4]: sts
Out[4]: 
0    0.469112
1   -0.282863
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8   -0.861849
9   -2.104569
dtype: Sparse[float64, nan]
BlockIndex
Block locations: array([0, 8], dtype=int32)
Block lengths: array([2, 2], dtype=int32)
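To make the BlockIndex output above more concrete, here is a minimal pure-Python sketch of how block locations and lengths could be derived from a dense sequence. It is not the pandas implementation. The helpers is_fill and sparsify are hypothetical names introduced only for illustration, and the sketch assumes a block is a maximal run of consecutive non-fill values.

```python
import math


def is_fill(value, fill_value):
    """Return True when value matches fill_value, treating NaN as equal to NaN."""
    if isinstance(fill_value, float) and math.isnan(fill_value):
        return isinstance(value, float) and math.isnan(value)
    return value == fill_value


def sparsify(values, fill_value=math.nan):
    """Hypothetical helper: keep only non-fill values plus block metadata."""
    sp_values, locations, lengths = [], [], []
    in_block = False
    for position, value in enumerate(values):
        if is_fill(value, fill_value):
            in_block = False  # a fill value ends the current block
            continue
        if not in_block:
            locations.append(position)  # a new block starts here
            lengths.append(0)
            in_block = True
        lengths[-1] += 1
        sp_values.append(value)
    return sp_values, locations, lengths


# Same shape as the session above: two values, six NaNs, two values.
dense = [0.469112, -0.282863] + [math.nan] * 6 + [-0.861849, -2.104569]
sp_values, locations, lengths = sparsify(dense)
print(sp_values)   # only the four non-NaN values are kept
print(locations)   # [0, 8]
print(lengths)     # [2, 2]
```

The NaN check is handled separately because NaN never compares equal to itself, so a plain equality test would treat every NaN as data to be stored.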
The to_sparse method takes a kind argument (for the sparse index, see below) and a fill_value. So if we had a mostly zero Series, we could convert it to sparse with fill_value=0:
In [5]: ts.fillna(0).to_sparse(fill_value=0)
Out[5]: 
0    0.469112
1   -0.282863
2    0.000000
3    0.000000
4    0.000000
5    0.000000
6    0.000000
7    0.000000
8   -0.861849
9   -2.104569
dtype: Sparse[float64, 0]
BlockIndex
Block locations: array([0, 8], dtype=int32)
Block lengths: array([2, 2], dtype=int32)
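The role of fill_value may be easier to see in the reverse direction. The sketch below, again plain Python rather than pandas internals, rebuilds a dense list from the stored values and block metadata shown above. The function densify is a hypothetical name, and the sketch assumes the fill value is recorded once and written into every gap on reconstruction.

```python
def densify(sp_values, locations, lengths, length, fill_value):
    """Hypothetical helper: rebuild a dense list from stored values and blocks."""
    dense = [fill_value] * length  # every gap starts out as the fill value
    cursor = 0  # index of the next unread entry in sp_values
    for start, size in zip(locations, lengths):
        dense[start:start + size] = sp_values[cursor:cursor + size]
        cursor += size
    return dense


stored = [0.469112, -0.282863, -0.861849, -2.104569]
restored = densify(stored, locations=[0, 8], lengths=[2, 2], length=10, fill_value=0.0)
print(restored)
```

With fill_value=0.0 the six omitted positions come back as zeros. Passing float("nan") instead would restore them as missing values, which mirrors the difference between the two sessions above.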
The sparse objects exist for memory efficiency reasons. Suppose you had a large, mostly NA DataFrame:
In [6]: df = pd.DataFrame(np.random.randn(10000, 4))

In [7]: df.iloc[:9998] = np.nan

In [8]: sdf = df.to_sparse()

In [9]: sdf
Out[9]: 
             0         1         2         3
0          NaN       NaN       NaN       NaN
1          NaN       NaN       NaN       NaN
2          NaN       NaN       NaN       NaN
3          NaN       NaN       NaN       NaN
4          NaN       NaN       NaN       NaN
5          NaN       NaN       NaN       NaN
6          NaN       NaN       NaN       NaN
...        ...       ...       ...       ...
9993       NaN       NaN       NaN       NaN
9994       NaN       NaN       NaN       NaN
9995       NaN       NaN       NaN       NaN
9996       NaN       NaN       NaN       NaN
9997       NaN       NaN       NaN       NaN
9998  0.509184 -0.774928 -1.369894 -0.382141
9999  0.280249 -1.648493  1.490865 -0.890819

[10000 rows x 4 columns]

In [10]: sdf.density
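As a rough check on the memory argument, the sketch below computes a density figure for a stand-in with the same shape as the frame above, assuming density means the ratio of stored (non-fill) points to total points. It uses plain Python lists instead of pandas, and the non-NaN numbers are placeholders rather than the random values from the session.

```python
import math

n_rows, n_cols = 10000, 4

# Stand-in for the frame above: every row is NaN except the last two.
rows = [[math.nan] * n_cols for _ in range(n_rows - 2)]
rows += [[1.0] * n_cols, [2.0] * n_cols]  # placeholder values, not the originals

stored_points = sum(1 for row in rows for value in row if not math.isnan(value))
total_points = n_rows * n_cols
density = stored_points / total_points

print(stored_points, total_points)  # 8 40000
print(density)                      # 0.0002
```

Under that assumption only 8 of the 40000 points need to be stored, which is where the memory saving comes from.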
© 2008–2012, AQR Capital Management, LLC, Lambda Foundry, Inc. and PyData Development Team
Licensed under the 3-clause BSD License.
https://pandas.pydata.org/pandas-docs/version/0.24.2/user_guide/sparse.html