Comparison with SQL
Since many potential pandas users have some familiarity with SQL, this page is meant to provide some examples of how various SQL operations would be performed using pandas.
If you’re new to pandas, you might want to first read through 10 Minutes to pandas to familiarize yourself with the library.
As is customary, we import pandas and numpy as follows:
In [1]: import pandas as pd In [2]: import numpy as np
Most of the examples will utilize the tips
dataset found within pandas tests. We’ll read the data into a DataFrame called tips
and assume we have a database table of the same name and structure.
In [3]: url = 'https://raw.github.com/pandas-dev/pandas/master/pandas/tests/data/tips.csv' In [4]: tips = pd.read_csv(url) In [5]: tips.head() Out[5]: total_bill tip sex smoker day time size 0 16.99 1.01 Female No Sun Dinner 2 1 10.34 1.66 Male No Sun Dinner 3 2 21.01 3.50 Male No Sun Dinner 3 3 23.68 3.31 Male No Sun Dinner 2 4 24.59 3.61 Female No Sun Dinner 4
SELECT
In SQL, selection is done using a comma-separated list of columns you’d like to select (or a *
to select all columns):
SELECT total_bill, tip, smoker, time FROM tips LIMIT 5;
With pandas, column selection is done by passing a list of column names to your DataFrame:
In [6]: tips[['total_bill', 'tip', 'smoker', 'time']].head(5) Out[6]: total_bill tip smoker time 0 16.99 1.01 No Dinner 1 10.34 1.66 No Dinner 2 21.01 3.50 No Dinner 3 23.68 3.31 No Dinner 4 24.59 3.61 No Dinner
Calling the DataFrame without the list of column names would display all columns (akin to SQL’s *
).
WHERE
Filtering in SQL is done via a WHERE clause.
SELECT * FROM tips WHERE time = 'Dinner' LIMIT 5;
DataFrames can be filtered in multiple ways; the most intuitive of which is using boolean indexing.
In [7]: tips[tips['time'] == 'Dinner'].head(5) Out[7]: total_bill tip sex smoker day time size 0 16.99 1.01 Female No Sun Dinner 2 1 10.34 1.66 Male No Sun Dinner 3 2 21.01 3.50 Male No Sun Dinner 3 3 23.68 3.31 Male No Sun Dinner 2 4 24.59 3.61 Female No Sun Dinner 4
The above statement is simply passing a Series
of True/False objects to the DataFrame, returning all rows with True.
In [8]: is_dinner = tips['time'] == 'Dinner' In [9]: is_dinner.value_counts() Out[9]: True 176 False 68 Name: time, dtype: int64 In [10]: tips[is_dinner].head(5)
© 2008–2012, AQR Capital Management, LLC, Lambda Foundry, Inc. and PyData Development Team
Licensed under the 3-clause BSD License.
https://pandas.pydata.org/pandas-docs/version/0.22.0/comparison_with_sql.html