IO Tools (Text, CSV, HDF5, …)
The pandas I/O API is a set of top level reader
functions accessed like pandas.read_csv()
that generally return a pandas object. The corresponding writer
functions are object methods that are accessed like DataFrame.to_csv()
. Below is a table containing available readers
and writers
.
Format Type | Data Description | Reader | Writer |
---|---|---|---|
text | CSV | read_csv | to_csv |
text | JSON | read_json | to_json |
text | HTML | read_html | to_html |
text | Local clipboard | read_clipboard | to_clipboard |
binary | MS Excel | read_excel | to_excel |
binary | HDF5 Format | read_hdf | to_hdf |
binary | Feather Format | read_feather | to_feather |
binary | Parquet Format | read_parquet | to_parquet |
binary | Msgpack | read_msgpack | to_msgpack |
binary | Stata | read_stata | to_stata |
binary | SAS | read_sas | |
binary | Python Pickle Format | read_pickle | to_pickle |
SQL | SQL | read_sql | to_sql |
SQL | Google Big Query | read_gbq | to_gbq |
Here is an informal performance comparison for some of these IO methods.
Note
For examples that use the StringIO
class, make sure you import it according to your Python version, i.e. from StringIO import StringIO
for Python 2 and from io import StringIO
for Python 3.
CSV & Text files
The workhorse function for reading text files (a.k.a. flat files) is read_csv()
. See the cookbook for some advanced strategies.
Parsing options
read_csv()
accepts the following common arguments:
Basic
-
filepath_or_buffer : various
- Either a path to a file (a
str
,pathlib.Path
, orpy._path.local.LocalPath
), URL (including http, ftp, and S3 locations), or any object with aread()
method (such as an open file orStringIO
). -
sep : str, defaults to ',' for read_csv(), \t for read_table()
- Delimiter to use. If sep is
None
, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used and automatically detect the separator by Python’s builtin sniffer tool,csv.Sniffer
. In addition, separators longer than 1 character and different from'\s+'
will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example:'\\r\\t'
. -
delimiter : str, default None
- Alternative argument name for sep.
-
delim_whitespace : boolean, default False
-
Specifies whether or not whitespace (e.g.
' '
or'\t'
) will be used as the delimiter. Equivalent to settingsep='\s+'
. If this option is set toTrue
, nothing should be passed in for thedelimiter
parameter.New in version 0.18.1: support for the Python parser.
Column and Index Locations and Names
-
header : int or list of ints, default 'infer'
-
Row number(s) to use as the column names, and the start of the data. Default behavior is to infer the column names: if no names are passed the behavior is identical to
header=0
and column names are inferred from the first line of the file, if column names are passed explicitly then the behavior is identical toheader=None
. Explicitly passheader=0
to be able to replace existing names.The header can be a list of ints that specify row locations for a MultiIndex on the columns e.g.
[0,1,3]
. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this parameter ignores commented lines and empty lines ifskip_blank_lines=True
, so header=0 denotes the first line of data rather than the first line of the file. -
names : array-like, default None
- List of column names to use. If file contains no header row, then you should explicitly pass
header=None
. Duplicates in this list will cause aUserWarning
to be issued. -
index_col : int or sequence or False, default None
- Column to use as the row labels of the
DataFrame
. If a sequence is given, a MultiIndex is used. If you have a malformed file with delimiters at the end of each line, you might considerindex_col=False
to force pandas to not use the first column as the index (row names). -
usecols : list-like or callable, default None
-
Return a subset of the columns. If list-like, all elements must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in
names
or inferred from the document header row(s). For example, a valid list-likeusecols
parameter would be[0, 1, 2]
or['foo', 'bar', 'baz']
.Element order is ignored, so
usecols=[0, 1]
is the same as[1, 0]
. To instantiate a DataFrame fromdata
with element order preserved usepd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']]
for columns in['foo', 'bar']
order orpd.read_csv(data, usecols=['foo', 'bar'])[['bar', 'foo']]
for['bar', 'foo']
order.If callable, the callable function will be evaluated against the column names, returning names where the callable function evaluates to True:
In [1]: from pandas.compat import StringIO, BytesIO In [2]: data = ('col1,col2,col3\n' ...: 'a,b,1\n' ...: 'a,b,2\n' ...: 'c,d,3') ...: In [3]: pd.read_csv(StringIO(data)) Out[3]: col1 col2 col3 0 a b 1 1 a b 2 2 c d 3 In [4]: pd.read_csv(StringIO(data), usecols=lambda x: x.upper() in ['COL1', 'COL3'])
© 2008–2012, AQR Capital Management, LLC, Lambda Foundry, Inc. and PyData Development Team
Licensed under the 3-clause BSD License.
https://pandas.pydata.org/pandas-docs/version/0.24.2/user_guide/io.html