Comparison with R / R libraries
Since pandas aims to provide a lot of the data manipulation and analysis functionality that people use R for, this page was started to provide a more detailed look at the R language and its many third-party libraries as they relate to pandas. In comparisons with R and CRAN libraries, we care about the following things:
- Functionality / flexibility: what can/cannot be done with each tool
- Performance: how fast operations are. Hard numbers/benchmarks are preferable
- Ease-of-use: Is one tool easier/harder to use (you may have to be the judge of this, given side-by-side code comparisons)
This page is also here to offer a bit of a translation guide for users of these R packages.
To transfer DataFrame objects from pandas to R, one option is to use HDF5 files; see External Compatibility for an example.
Quick Reference
We’ll start off with a quick reference guide pairing some common R operations using dplyr with pandas equivalents.
Querying, Filtering, Sampling
R | pandas |
---|---|
dim(df) | df.shape |
head(df) | df.head() |
slice(df, 1:10) | df.iloc[:10] |
filter(df, col1 == 1, col2 == 1) | df.query('col1 == 1 & col2 == 1') |
df[df$col1 == 1 & df$col2 == 1,] | df[(df.col1 == 1) & (df.col2 == 1)] |
select(df, col1, col2) | df[['col1', 'col2']] |
select(df, col1:col3) | df.loc[:, 'col1':'col3'] |
select(df, -(col1:col3)) | df.drop(cols_to_drop, axis=1) but see [1] |
distinct(select(df, col1)) | df[['col1']].drop_duplicates() |
distinct(select(df, col1, col2)) | df[['col1', 'col2']].drop_duplicates() |
sample_n(df, 10) | df.sample(n=10) |
sample_frac(df, 0.01) | df.sample(frac=0.01) |

[1] R’s shorthand for a subrange of columns (select(df, col1:col3)) can be approached cleanly in pandas if you have the list of columns, for example df[cols[1:3]] or df.drop(cols[1:3], axis=1), but doing this by column name is a bit messy.
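
As a rough illustration of the table above, the following sketch runs several of the pandas equivalents on a small made-up frame. The data, the extra `col4` column, and the `random_state` value are assumptions chosen only to make the example self-contained; they are not part of the original comparison.

```python
import pandas as pd

# Hypothetical toy data; the column names mirror the table above.
df = pd.DataFrame({
    "col1": [1, 1, 2, 2, 1],
    "col2": [1, 0, 1, 1, 1],
    "col3": [10, 20, 30, 40, 50],
    "col4": list("abcde"),
})

# filter(df, col1 == 1, col2 == 1): two equivalent spellings
by_query = df.query("col1 == 1 & col2 == 1")
by_mask = df[(df["col1"] == 1) & (df["col2"] == 1)]

# slice(df, 1:3): R's 1:3 is inclusive, so it maps to the first three rows
first_three = df.iloc[:3]

# select(df, col1:col3): label slices with .loc include both endpoints
col_range = df.loc[:, "col1":"col3"]

# select(df, -(col1:col3)): collect the labels, then drop along the columns
cols_to_drop = df.loc[:, "col1":"col3"].columns
remaining = df.drop(cols_to_drop, axis=1)

# distinct(select(df, col1, col2))
unique_pairs = df[["col1", "col2"]].drop_duplicates()

# sample_n(df, 2); random_state only makes the draw repeatable
sampled = df.sample(n=2, random_state=0)
```

Note that positional slices with `iloc` exclude the end position, while label slices with `loc` include it, which is why the two column-range idioms look different.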
Sorting
R | pandas |
---|---|
arrange(df, col1, col2) | df.sort_values(['col1', 'col2']) |
arrange(df, desc(col1)) | df.sort_values('col1', ascending=False) |
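
A minimal sketch of the sorting equivalents, using made-up data. The mixed-direction case (`arrange(df, col1, desc(col2))`) is not in the table; it is added here on the assumption that it is a common next step.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"col1": [2, 1, 2, 1], "col2": [4, 3, 1, 2]})

# arrange(df, col1, col2): ascending on both keys
asc = df.sort_values(["col1", "col2"])

# arrange(df, desc(col1)): descending on one key
desc = df.sort_values("col1", ascending=False)

# arrange(df, col1, desc(col2)): mixed directions take one flag per key
mixed = df.sort_values(["col1", "col2"], ascending=[True, False])
```

`sort_values` returns a new frame and keeps the original index labels, so call `reset_index(drop=True)` afterwards if a fresh 0..n-1 index is wanted.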
Transforming
R | pandas |
---|---|
select(df, col_one = col1) | df.rename(columns={'col1': 'col_one'})['col_one'] |
rename(df, col_one = col1) | df.rename(columns={'col1': 'col_one'}) |
mutate(df, c=a-b) | df.assign(c=df.a-df.b) |
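
The transforming rows can be sketched as follows on a small hypothetical frame. The double-bracket selection and the `lambda` form of `assign` are variations on the table entries, shown here as one possible style rather than the required one.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"col1": [1, 2, 3], "a": [10, 20, 30], "b": [1, 2, 3]})

# rename(df, col_one = col1): returns a new frame, df itself is unchanged
renamed = df.rename(columns={"col1": "col_one"})

# select(df, col_one = col1): double brackets keep a one-column DataFrame;
# single brackets (as in the table) return a Series instead
selected = df.rename(columns={"col1": "col_one"})[["col_one"]]

# mutate(df, c = a - b): a callable avoids repeating the frame's name
mutated = df.assign(c=lambda d: d["a"] - d["b"])
```

Like dplyr verbs, `rename` and `assign` leave their input untouched, which makes them convenient to chain.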
Grouping and Summarizing
R | pandas |
---|---|
summary(df) | df.describe() |
gdf <- group_by(df, col1) | gdf = df.groupby('col1') |
summarise(gdf, avg=mean(col1, na.rm=TRUE)) | df.groupby('col1').agg({'col1': 'mean'}) |
summarise(gdf, total=sum(col1)) | df.groupby('col1').sum() |
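
The table aggregates the grouping column itself. In practice one usually summarises a different column, so this sketch assumes a hypothetical numeric `col2` and shows the same pattern on it.

```python
import pandas as pd

# Hypothetical toy data; col2 is an assumed numeric column to summarise
df = pd.DataFrame({
    "col1": ["x", "x", "y"],
    "col2": [1.0, 3.0, 5.0],
})

# summary(df)
summary = df.describe()

# gdf <- group_by(df, col1)
gdf = df.groupby("col1")

# summarise(gdf, avg = mean(col2, na.rm = TRUE)); pandas skips NaN by default
avg = gdf["col2"].mean()

# summarise(gdf, total = sum(col2))
total = gdf["col2"].sum()
```

Both results are Series indexed by the group keys; `reset_index()` turns them back into a flat frame comparable to what `summarise` returns.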
Base R
Slicing with R’s c
R makes it easy to access data.frame columns by name

    df <- data.frame(a=rnorm(5), b=rnorm(5), c=rnorm(5), d=rnorm(5), e=rnorm(5))
    df[, c("a", "c", "e")]

or by integer location

    df <- data.frame(matrix(rnorm(1000), ncol=100))
    df[, c(1:10, 25:30, 40, 50:100)]

Selecting multiple columns by name in pandas is straightforward

    In [1]: df = pd.DataFrame(np.random.randn(10, 3), columns=list('abc'))

    In [2]: df[['a', 'c']]
    Out[2]:
              a         c
    0  0.469112 -1.509059
    1 -1.135632 -0.173215
    2  0.119209 -0.861849
    3 -2.104569  1.071804
    4  0.721555 -1.039575
    5  0.271860  0.567020
    6  0.276232 -0.673690
    7  0.113648  0.524988
    8  0.404705 -1.715002
    9 -1.039268 -1.157892

    In [3]: df.loc[:, ['a', 'c']]
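
For the integer-location case shown in the R snippet, one possible pandas counterpart combines `iloc` with NumPy's `np.r_` helper. This is a sketch under the assumption of a 100-column frame named `wide`; the ranges are shifted because pandas positions are 0-based and end-exclusive, whereas R's are 1-based and inclusive.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 3), columns=list("abc"))

# By name: plain indexing and .loc give the same result
by_name = df[["a", "c"]]
by_loc = df.loc[:, ["a", "c"]]

# By integer location: R's c(1:10, 25:30) becomes 0:10 and 24:30,
# joined into a single index array with np.r_
wide = pd.DataFrame(np.random.randn(10, 100))  # hypothetical wide frame
by_position = wide.iloc[:, np.r_[0:10, 24:30]]
```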
© 2008–2012, AQR Capital Management, LLC, Lambda Foundry, Inc. and PyData Development Team
Licensed under the 3-clause BSD License.
https://pandas.pydata.org/pandas-docs/version/0.24.2/getting_started/comparison/comparison_with_r.html