How to Read a .data File in Pandas
Pandas is a powerful and flexible Python package that allows you to work with labeled and time series data. It also provides statistics methods, enables plotting, and more. One crucial feature of Pandas is its ability to write and read Excel, CSV, and many other types of files. Functions like the Pandas read_csv() method enable you to work with files effectively. You can use them to save the data and labels from Pandas objects to a file and load them later as Pandas Series or DataFrame instances.
In this tutorial, you'll learn:
- What the Pandas IO tools API is
- How to read and write data to and from files
- How to work with various file formats
- How to work with big data efficiently
Let's start reading and writing files!
Installing Pandas
The code in this tutorial is executed with CPython 3.7.4 and Pandas 0.25.1. It would be beneficial to make sure you have the latest versions of Python and Pandas on your machine. You might want to create a new virtual environment and install the dependencies for this tutorial.
First, you'll need the Pandas library. You may already have it installed. If you don't, then you can install it with pip:
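$ pip install pandas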
Once the installation process completes, you should have Pandas installed and set up.
Anaconda is an excellent Python distribution that comes with Python, many useful packages like Pandas, and a package and environment manager called Conda. To learn more about Anaconda, check out Setting Up Python for Machine Learning on Windows.
If you don't have Pandas in your virtual environment, then you can install it with Conda:
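$ conda install pandas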
Conda is powerful as it manages the dependencies and their versions. To learn more about working with Conda, you can check out the official documentation.
Preparing Data
In this tutorial, you'll use the data related to 20 countries. Here's an overview of the data and sources you'll be working with:
- Country is denoted by the country name. Each country is in the top 10 list for either population, area, or gross domestic product (GDP). The row labels for the dataset are the three-letter country codes defined in ISO 3166-1. The column label for the dataset is COUNTRY.
- Population is expressed in millions. The data comes from a list of countries and dependencies by population on Wikipedia. The column label for the dataset is POP.
- Area is expressed in thousands of kilometers squared. The data comes from a list of countries and dependencies by area on Wikipedia. The column label for the dataset is AREA.
- Gross domestic product is expressed in millions of U.S. dollars, according to the United Nations data for 2017. You can find this data in the list of countries by nominal GDP on Wikipedia. The column label for the dataset is GDP.
- Continent is either Africa, Asia, Oceania, Europe, North America, or South America. You can find this information on Wikipedia as well. The column label for the dataset is CONT.
- Independence day is a date that commemorates a nation's independence. The data comes from the list of national independence days on Wikipedia. The dates are shown in ISO 8601 format. The first four digits represent the year, the next two numbers are the month, and the last two are for the day of the month. The column label for the dataset is IND_DAY.
This is how the data looks as a table:
| | COUNTRY | POP | AREA | GDP | CONT | IND_DAY |
|---|---|---|---|---|---|---|
| CHN | China | 1398.72 | 9596.96 | 12234.78 | Asia | |
| IND | India | 1351.16 | 3287.26 | 2575.67 | Asia | 1947-08-15 |
| USA | US | 329.74 | 9833.52 | 19485.39 | N.America | 1776-07-04 |
| IDN | Indonesia | 268.07 | 1910.93 | 1015.54 | Asia | 1945-08-17 |
| BRA | Brazil | 210.32 | 8515.77 | 2055.51 | S.America | 1822-09-07 |
| PAK | Pakistan | 205.71 | 881.91 | 302.14 | Asia | 1947-08-14 |
| NGA | Nigeria | 200.96 | 923.77 | 375.77 | Africa | 1960-10-01 |
| BGD | Bangladesh | 167.09 | 147.57 | 245.63 | Asia | 1971-03-26 |
| RUS | Russia | 146.79 | 17098.25 | 1530.75 | | 1992-06-12 |
| MEX | Mexico | 126.58 | 1964.38 | 1158.23 | N.America | 1810-09-16 |
| JPN | Japan | 126.22 | 377.97 | 4872.42 | Asia | |
| DEU | Germany | 83.02 | 357.11 | 3693.20 | Europe | |
| FRA | France | 67.02 | 640.68 | 2582.49 | Europe | 1789-07-14 |
| GBR | UK | 66.44 | 242.50 | 2631.23 | Europe | |
| ITA | Italy | 60.36 | 301.34 | 1943.84 | Europe | |
| ARG | Argentina | 44.94 | 2780.40 | 637.49 | S.America | 1816-07-09 |
| DZA | Algeria | 43.38 | 2381.74 | 167.56 | Africa | 1962-07-05 |
| CAN | Canada | 37.59 | 9984.67 | 1647.12 | N.America | 1867-07-01 |
| AUS | Australia | 25.47 | 7692.02 | 1408.68 | Oceania | |
| KAZ | Kazakhstan | 18.53 | 2724.90 | 159.41 | Asia | 1991-12-16 |
You may notice that some of the data is missing. For example, the continent for Russia is not specified because it spreads across both Europe and Asia. There are also several missing independence days because the data source omits them.
You can organize this data in Python using a nested dictionary:
data = {
    'CHN': {'COUNTRY': 'China', 'POP': 1_398.72, 'AREA': 9_596.96,
            'GDP': 12_234.78, 'CONT': 'Asia'},
    'IND': {'COUNTRY': 'India', 'POP': 1_351.16, 'AREA': 3_287.26,
            'GDP': 2_575.67, 'CONT': 'Asia', 'IND_DAY': '1947-08-15'},
    'USA': {'COUNTRY': 'US', 'POP': 329.74, 'AREA': 9_833.52,
            'GDP': 19_485.39, 'CONT': 'N.America', 'IND_DAY': '1776-07-04'},
    'IDN': {'COUNTRY': 'Indonesia', 'POP': 268.07, 'AREA': 1_910.93,
            'GDP': 1_015.54, 'CONT': 'Asia', 'IND_DAY': '1945-08-17'},
    'BRA': {'COUNTRY': 'Brazil', 'POP': 210.32, 'AREA': 8_515.77,
            'GDP': 2_055.51, 'CONT': 'S.America', 'IND_DAY': '1822-09-07'},
    'PAK': {'COUNTRY': 'Pakistan', 'POP': 205.71, 'AREA': 881.91,
            'GDP': 302.14, 'CONT': 'Asia', 'IND_DAY': '1947-08-14'},
    'NGA': {'COUNTRY': 'Nigeria', 'POP': 200.96, 'AREA': 923.77,
            'GDP': 375.77, 'CONT': 'Africa', 'IND_DAY': '1960-10-01'},
    'BGD': {'COUNTRY': 'Bangladesh', 'POP': 167.09, 'AREA': 147.57,
            'GDP': 245.63, 'CONT': 'Asia', 'IND_DAY': '1971-03-26'},
    'RUS': {'COUNTRY': 'Russia', 'POP': 146.79, 'AREA': 17_098.25,
            'GDP': 1_530.75, 'IND_DAY': '1992-06-12'},
    'MEX': {'COUNTRY': 'Mexico', 'POP': 126.58, 'AREA': 1_964.38,
            'GDP': 1_158.23, 'CONT': 'N.America', 'IND_DAY': '1810-09-16'},
    'JPN': {'COUNTRY': 'Japan', 'POP': 126.22, 'AREA': 377.97,
            'GDP': 4_872.42, 'CONT': 'Asia'},
    'DEU': {'COUNTRY': 'Germany', 'POP': 83.02, 'AREA': 357.11,
            'GDP': 3_693.20, 'CONT': 'Europe'},
    'FRA': {'COUNTRY': 'France', 'POP': 67.02, 'AREA': 640.68,
            'GDP': 2_582.49, 'CONT': 'Europe', 'IND_DAY': '1789-07-14'},
    'GBR': {'COUNTRY': 'UK', 'POP': 66.44, 'AREA': 242.50,
            'GDP': 2_631.23, 'CONT': 'Europe'},
    'ITA': {'COUNTRY': 'Italy', 'POP': 60.36, 'AREA': 301.34,
            'GDP': 1_943.84, 'CONT': 'Europe'},
    'ARG': {'COUNTRY': 'Argentina', 'POP': 44.94, 'AREA': 2_780.40,
            'GDP': 637.49, 'CONT': 'S.America', 'IND_DAY': '1816-07-09'},
    'DZA': {'COUNTRY': 'Algeria', 'POP': 43.38, 'AREA': 2_381.74,
            'GDP': 167.56, 'CONT': 'Africa', 'IND_DAY': '1962-07-05'},
    'CAN': {'COUNTRY': 'Canada', 'POP': 37.59, 'AREA': 9_984.67,
            'GDP': 1_647.12, 'CONT': 'N.America', 'IND_DAY': '1867-07-01'},
    'AUS': {'COUNTRY': 'Australia', 'POP': 25.47, 'AREA': 7_692.02,
            'GDP': 1_408.68, 'CONT': 'Oceania'},
    'KAZ': {'COUNTRY': 'Kazakhstan', 'POP': 18.53, 'AREA': 2_724.90,
            'GDP': 159.41, 'CONT': 'Asia', 'IND_DAY': '1991-12-16'},
}

columns = ('COUNTRY', 'POP', 'AREA', 'GDP', 'CONT', 'IND_DAY')
Each row of the table is written as an inner dictionary whose keys are the column names and values are the corresponding data. These dictionaries are then collected as the values in the outer data dictionary. The corresponding keys for data are the three-letter country codes.
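For example, you can look up a single value by country code and then by column name:

>>> data['CHN']['COUNTRY']
'China'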
You can use this data to create an instance of a Pandas DataFrame. First, you need to import Pandas:
>>>
>>> import pandas as pd
Now that you have Pandas imported, you can use the DataFrame constructor and data to create a DataFrame object.

data is organized in such a way that the country codes correspond to columns. You can reverse the rows and columns of a DataFrame with the property .T:
>>>
>>> df = pd.DataFrame(data=data).T
>>> df
        COUNTRY      POP     AREA      GDP       CONT     IND_DAY
CHN       China  1398.72  9596.96  12234.8       Asia         NaN
IND       India  1351.16  3287.26  2575.67       Asia  1947-08-15
USA          US   329.74  9833.52  19485.4  N.America  1776-07-04
IDN   Indonesia   268.07  1910.93  1015.54       Asia  1945-08-17
BRA      Brazil   210.32  8515.77  2055.51  S.America  1822-09-07
PAK    Pakistan   205.71   881.91   302.14       Asia  1947-08-14
NGA     Nigeria   200.96   923.77   375.77     Africa  1960-10-01
BGD  Bangladesh   167.09   147.57   245.63       Asia  1971-03-26
RUS      Russia   146.79  17098.2  1530.75        NaN  1992-06-12
MEX      Mexico   126.58  1964.38  1158.23  N.America  1810-09-16
JPN       Japan   126.22   377.97  4872.42       Asia         NaN
DEU     Germany    83.02   357.11   3693.2     Europe         NaN
FRA      France    67.02   640.68  2582.49     Europe  1789-07-14
GBR          UK    66.44    242.5  2631.23     Europe         NaN
ITA       Italy    60.36   301.34  1943.84     Europe         NaN
ARG   Argentina    44.94   2780.4   637.49  S.America  1816-07-09
DZA     Algeria    43.38  2381.74   167.56     Africa  1962-07-05
CAN      Canada    37.59  9984.67  1647.12  N.America  1867-07-01
AUS   Australia    25.47  7692.02  1408.68    Oceania         NaN
KAZ  Kazakhstan    18.53   2724.9   159.41       Asia  1991-12-16
Now you have your DataFrame object populated with the data about each country.

Versions of Python older than 3.6 did not guarantee the order of keys in dictionaries. To ensure the order of columns is maintained for older versions of Python and Pandas, you can specify index=columns:
>>>
>>> df = pd.DataFrame(data=data, index=columns).T
Now that you've prepared your data, you're ready to start working with files!
Using the Pandas read_csv() and .to_csv() Functions
A comma-separated values (CSV) file is a plaintext file with a .csv extension that holds tabular data. This is one of the most popular file formats for storing large amounts of data. Each row of the CSV file represents a single table row. The values in the same row are by default separated with commas, but you could change the separator to a semicolon, tab, space, or another character.
Write a CSV File
You can save your Pandas DataFrame as a CSV file with .to_csv():
>>>
>>> df.to_csv('data.csv')
That's it! You've created the file data.csv in your current working directory. You can expand the code block below to see how your CSV file should look:
,COUNTRY,POP,AREA,GDP,CONT,IND_DAY
CHN,China,1398.72,9596.96,12234.78,Asia,
IND,India,1351.16,3287.26,2575.67,Asia,1947-08-15
USA,US,329.74,9833.52,19485.39,N.America,1776-07-04
IDN,Indonesia,268.07,1910.93,1015.54,Asia,1945-08-17
BRA,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07
PAK,Pakistan,205.71,881.91,302.14,Asia,1947-08-14
NGA,Nigeria,200.96,923.77,375.77,Africa,1960-10-01
BGD,Bangladesh,167.09,147.57,245.63,Asia,1971-03-26
RUS,Russia,146.79,17098.25,1530.75,,1992-06-12
MEX,Mexico,126.58,1964.38,1158.23,N.America,1810-09-16
JPN,Japan,126.22,377.97,4872.42,Asia,
DEU,Germany,83.02,357.11,3693.2,Europe,
FRA,France,67.02,640.68,2582.49,Europe,1789-07-14
GBR,UK,66.44,242.5,2631.23,Europe,
ITA,Italy,60.36,301.34,1943.84,Europe,
ARG,Argentina,44.94,2780.4,637.49,S.America,1816-07-09
DZA,Algeria,43.38,2381.74,167.56,Africa,1962-07-05
CAN,Canada,37.59,9984.67,1647.12,N.America,1867-07-01
AUS,Australia,25.47,7692.02,1408.68,Oceania,
KAZ,Kazakhstan,18.53,2724.9,159.41,Asia,1991-12-16
This text file contains the data separated with commas. The first column contains the row labels. In some cases, you'll find them irrelevant. If you don't want to keep them, then you can pass the argument index=False to .to_csv().
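For instance, here's a quick sketch (the file name data-no-index.csv is just for illustration):

>>> df.to_csv('data-no-index.csv', index=False)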
Read a CSV File
Once your data is saved in a CSV file, you'll likely want to load and use it from time to time. You can do that with the Pandas read_csv() function:
>>>
>>> df = pd.read_csv('data.csv', index_col=0)
>>> df
        COUNTRY      POP      AREA       GDP       CONT     IND_DAY
CHN       China  1398.72   9596.96  12234.78       Asia         NaN
IND       India  1351.16   3287.26   2575.67       Asia  1947-08-15
USA          US   329.74   9833.52  19485.39  N.America  1776-07-04
IDN   Indonesia   268.07   1910.93   1015.54       Asia  1945-08-17
BRA      Brazil   210.32   8515.77   2055.51  S.America  1822-09-07
PAK    Pakistan   205.71    881.91    302.14       Asia  1947-08-14
NGA     Nigeria   200.96    923.77    375.77     Africa  1960-10-01
BGD  Bangladesh   167.09    147.57    245.63       Asia  1971-03-26
RUS      Russia   146.79  17098.25   1530.75        NaN  1992-06-12
MEX      Mexico   126.58   1964.38   1158.23  N.America  1810-09-16
JPN       Japan   126.22    377.97   4872.42       Asia         NaN
DEU     Germany    83.02    357.11   3693.20     Europe         NaN
FRA      France    67.02    640.68   2582.49     Europe  1789-07-14
GBR          UK    66.44    242.50   2631.23     Europe         NaN
ITA       Italy    60.36    301.34   1943.84     Europe         NaN
ARG   Argentina    44.94   2780.40    637.49  S.America  1816-07-09
DZA     Algeria    43.38   2381.74    167.56     Africa  1962-07-05
CAN      Canada    37.59   9984.67   1647.12  N.America  1867-07-01
AUS   Australia    25.47   7692.02   1408.68    Oceania         NaN
KAZ  Kazakhstan    18.53   2724.90    159.41       Asia  1991-12-16
In this case, the Pandas read_csv() function returns a new DataFrame with the data and labels from the file data.csv, which you specified with the first argument. This string can be any valid path, including URLs.
The parameter index_col specifies the column from the CSV file that contains the row labels. You assign a zero-based column index to this parameter. You should determine the value of index_col when the CSV file contains the row labels to avoid loading them as data.
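As a quick sketch of why this matters: if you omit index_col, then the row labels come back as an ordinary, unnamed data column:

>>> pd.read_csv('data.csv').columns
Index(['Unnamed: 0', 'COUNTRY', 'POP', 'AREA', 'GDP', 'CONT', 'IND_DAY'], dtype='object')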
You'll learn more about using Pandas with CSV files later in this tutorial. You can also check out Reading and Writing CSV Files in Python to see how to handle CSV files with the built-in Python library csv as well.
Using Pandas to Write and Read Excel Files
Microsoft Excel is probably the most widely used spreadsheet software. While older versions used binary .xls files, Excel 2007 introduced the new XML-based .xlsx file. You can read and write Excel files in Pandas, similar to CSV files. However, you'll need to install the following Python packages first:
- xlwt to write to .xls files
- openpyxl or XlsxWriter to write to .xlsx files
- xlrd to read Excel files
You can install them using pip with a single command:
$ pip install xlwt openpyxl xlsxwriter xlrd
You can also use Conda:
$ conda install xlwt openpyxl xlsxwriter xlrd
Please note that you don't have to install all these packages. For example, you don't need both openpyxl and XlsxWriter. If you're going to work only with .xls files, then you don't need any of them! However, if you intend to work only with .xlsx files, then you're going to need at least one of them, but not xlwt. Take some time to decide which packages are right for your project.
Write an Excel File
Once you have those packages installed, you can save your DataFrame in an Excel file with .to_excel():
>>>
>>> df.to_excel('data.xlsx')
The argument 'data.xlsx' represents the target file and, optionally, its path. The above statement should create the file data.xlsx in your current working directory. That file should look like this:

The first column of the file contains the labels of the rows, while the other columns store data.
Read an Excel File
You can load data from Excel files with read_excel():
>>>
>>> df = pd.read_excel('data.xlsx', index_col=0)
>>> df
        COUNTRY      POP      AREA       GDP       CONT     IND_DAY
CHN       China  1398.72   9596.96  12234.78       Asia         NaN
IND       India  1351.16   3287.26   2575.67       Asia  1947-08-15
USA          US   329.74   9833.52  19485.39  N.America  1776-07-04
IDN   Indonesia   268.07   1910.93   1015.54       Asia  1945-08-17
BRA      Brazil   210.32   8515.77   2055.51  S.America  1822-09-07
PAK    Pakistan   205.71    881.91    302.14       Asia  1947-08-14
NGA     Nigeria   200.96    923.77    375.77     Africa  1960-10-01
BGD  Bangladesh   167.09    147.57    245.63       Asia  1971-03-26
RUS      Russia   146.79  17098.25   1530.75        NaN  1992-06-12
MEX      Mexico   126.58   1964.38   1158.23  N.America  1810-09-16
JPN       Japan   126.22    377.97   4872.42       Asia         NaN
DEU     Germany    83.02    357.11   3693.20     Europe         NaN
FRA      France    67.02    640.68   2582.49     Europe  1789-07-14
GBR          UK    66.44    242.50   2631.23     Europe         NaN
ITA       Italy    60.36    301.34   1943.84     Europe         NaN
ARG   Argentina    44.94   2780.40    637.49  S.America  1816-07-09
DZA     Algeria    43.38   2381.74    167.56     Africa  1962-07-05
CAN      Canada    37.59   9984.67   1647.12  N.America  1867-07-01
AUS   Australia    25.47   7692.02   1408.68    Oceania         NaN
KAZ  Kazakhstan    18.53   2724.90    159.41       Asia  1991-12-16
read_excel() returns a new DataFrame that contains the values from data.xlsx. You can also use read_excel() with OpenDocument spreadsheets, or .ods files.
You'll learn more about working with Excel files later on in this tutorial. You can also check out Using Pandas to Read Large Excel Files in Python.
Understanding the Pandas IO API
Pandas IO Tools is the API that allows you to save the contents of Series and DataFrame objects to the clipboard, objects, or files of various types. It also enables loading data from the clipboard, objects, or files.
Write Files
Series and DataFrame objects have methods that enable writing data and labels to the clipboard or files. They're named with the pattern .to_<file-type>(), where <file-type> is the type of the target file.
You've learned about .to_csv() and .to_excel(), but there are others, including:
- .to_json()
- .to_html()
- .to_sql()
- .to_pickle()
There are still more file types that you can write to, so this list is not exhaustive.
These methods have parameters specifying the target file path where you saved the data and labels. This is mandatory in some cases and optional in others. If this option is available and you choose to omit it, then the methods return the objects (like strings or iterables) with the contents of DataFrame instances.
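For instance, a minimal sketch with .to_json(): calling it without a path returns the JSON string instead of writing a file:

>>> json_string = df.to_json()  # no path given, so a JSON string comes back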
The optional parameter compression decides how to compress the file with the data and labels. You'll learn more about it later on. There are a few other parameters, but they're mostly specific to one or several methods. You won't go into them in detail here.
Read Files
Pandas functions for reading the contents of files are named using the pattern read_<file-type>(), where <file-type> indicates the type of the file to read. You've already seen the Pandas read_csv() and read_excel() functions. Here are a few others:
- read_json()
- read_html()
- read_sql()
- read_pickle()
These functions take a parameter that specifies the target file path. It can be any valid string that represents the path, either on a local machine or in a URL. Other objects are also acceptable depending on the file type.
The optional parameter compression determines the type of decompression to use for the compressed files. You'll learn about it later on in this tutorial. There are other parameters, but they're specific to one or several functions. You won't go into them in detail here.
Working With Different File Types
The Pandas library offers a wide range of possibilities for saving your data to files and loading data from files. In this section, you'll learn more about working with CSV and Excel files. You'll also see how to use other types of files, like JSON, web pages, databases, and Python pickle files.
CSV Files
You've already learned how to read and write CSV files. Now let's dig a little deeper into the details. When you use .to_csv() to save your DataFrame, you can provide an argument for the parameter path_or_buf to specify the path, name, and extension of the target file.
path_or_buf is the first argument .to_csv() will get. It can be any string that represents a valid file path that includes the file name and its extension. You've seen this in a previous example. However, if you omit path_or_buf, then .to_csv() won't create any files. Instead, it'll return the corresponding string:
>>>
>>> df = pd.DataFrame(data=data).T
>>> s = df.to_csv()
>>> print(s)
,COUNTRY,POP,AREA,GDP,CONT,IND_DAY
CHN,China,1398.72,9596.96,12234.78,Asia,
IND,India,1351.16,3287.26,2575.67,Asia,1947-08-15
USA,US,329.74,9833.52,19485.39,N.America,1776-07-04
IDN,Indonesia,268.07,1910.93,1015.54,Asia,1945-08-17
BRA,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07
PAK,Pakistan,205.71,881.91,302.14,Asia,1947-08-14
NGA,Nigeria,200.96,923.77,375.77,Africa,1960-10-01
BGD,Bangladesh,167.09,147.57,245.63,Asia,1971-03-26
RUS,Russia,146.79,17098.25,1530.75,,1992-06-12
MEX,Mexico,126.58,1964.38,1158.23,N.America,1810-09-16
JPN,Japan,126.22,377.97,4872.42,Asia,
DEU,Germany,83.02,357.11,3693.2,Europe,
FRA,France,67.02,640.68,2582.49,Europe,1789-07-14
GBR,UK,66.44,242.5,2631.23,Europe,
ITA,Italy,60.36,301.34,1943.84,Europe,
ARG,Argentina,44.94,2780.4,637.49,S.America,1816-07-09
DZA,Algeria,43.38,2381.74,167.56,Africa,1962-07-05
CAN,Canada,37.59,9984.67,1647.12,N.America,1867-07-01
AUS,Australia,25.47,7692.02,1408.68,Oceania,
KAZ,Kazakhstan,18.53,2724.9,159.41,Asia,1991-12-16
Now you have the string s instead of a CSV file. You also have some missing values in your DataFrame object. For example, the continent for Russia and the independence days for several countries (China, Japan, and so on) are not available. In data science and machine learning, you must handle missing values carefully. Pandas excels here! By default, Pandas uses the NaN value to replace the missing values.
The continent that corresponds to Russia in df is nan:
>>>
>>> df.loc['RUS', 'CONT']
nan
This example uses .loc[] to get data with the specified row and column names.
When you save your DataFrame to a CSV file, empty strings ('') will represent the missing data. You can see this both in your file data.csv and in the string s. If you want to change this behavior, then use the optional parameter na_rep:
>>>
>>> df.to_csv('new-data.csv', na_rep='(missing)')
This code produces the file new-data.csv where the missing values are no longer empty strings. You can expand the code block below to see how this file should look:
,COUNTRY,POP,AREA,GDP,CONT,IND_DAY
CHN,China,1398.72,9596.96,12234.78,Asia,(missing)
IND,India,1351.16,3287.26,2575.67,Asia,1947-08-15
USA,US,329.74,9833.52,19485.39,N.America,1776-07-04
IDN,Indonesia,268.07,1910.93,1015.54,Asia,1945-08-17
BRA,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07
PAK,Pakistan,205.71,881.91,302.14,Asia,1947-08-14
NGA,Nigeria,200.96,923.77,375.77,Africa,1960-10-01
BGD,Bangladesh,167.09,147.57,245.63,Asia,1971-03-26
RUS,Russia,146.79,17098.25,1530.75,(missing),1992-06-12
MEX,Mexico,126.58,1964.38,1158.23,N.America,1810-09-16
JPN,Japan,126.22,377.97,4872.42,Asia,(missing)
DEU,Germany,83.02,357.11,3693.2,Europe,(missing)
FRA,France,67.02,640.68,2582.49,Europe,1789-07-14
GBR,UK,66.44,242.5,2631.23,Europe,(missing)
ITA,Italy,60.36,301.34,1943.84,Europe,(missing)
ARG,Argentina,44.94,2780.4,637.49,S.America,1816-07-09
DZA,Algeria,43.38,2381.74,167.56,Africa,1962-07-05
CAN,Canada,37.59,9984.67,1647.12,N.America,1867-07-01
AUS,Australia,25.47,7692.02,1408.68,Oceania,(missing)
KAZ,Kazakhstan,18.53,2724.9,159.41,Asia,1991-12-16
Now, the string '(missing)' in the file corresponds to the nan values from df.
When Pandas reads files, it considers the empty string ('') and a few others as missing values by default:
- 'nan'
- '-nan'
- 'NA'
- 'N/A'
- 'NaN'
- 'null'
If you don't want this behavior, then you can pass keep_default_na=False to the Pandas read_csv() function. To specify other labels for missing values, use the parameter na_values:
>>>
>>> pd.read_csv('new-data.csv', index_col=0, na_values='(missing)')
        COUNTRY      POP      AREA       GDP       CONT     IND_DAY
CHN       China  1398.72   9596.96  12234.78       Asia         NaN
IND       India  1351.16   3287.26   2575.67       Asia  1947-08-15
USA          US   329.74   9833.52  19485.39  N.America  1776-07-04
IDN   Indonesia   268.07   1910.93   1015.54       Asia  1945-08-17
BRA      Brazil   210.32   8515.77   2055.51  S.America  1822-09-07
PAK    Pakistan   205.71    881.91    302.14       Asia  1947-08-14
NGA     Nigeria   200.96    923.77    375.77     Africa  1960-10-01
BGD  Bangladesh   167.09    147.57    245.63       Asia  1971-03-26
RUS      Russia   146.79  17098.25   1530.75        NaN  1992-06-12
MEX      Mexico   126.58   1964.38   1158.23  N.America  1810-09-16
JPN       Japan   126.22    377.97   4872.42       Asia         NaN
DEU     Germany    83.02    357.11   3693.20     Europe         NaN
FRA      France    67.02    640.68   2582.49     Europe  1789-07-14
GBR          UK    66.44    242.50   2631.23     Europe         NaN
ITA       Italy    60.36    301.34   1943.84     Europe         NaN
ARG   Argentina    44.94   2780.40    637.49  S.America  1816-07-09
DZA     Algeria    43.38   2381.74    167.56     Africa  1962-07-05
CAN      Canada    37.59   9984.67   1647.12  N.America  1867-07-01
AUS   Australia    25.47   7692.02   1408.68    Oceania         NaN
KAZ  Kazakhstan    18.53   2724.90    159.41       Asia  1991-12-16
Here, you've marked the string '(missing)' as a new missing data label, and Pandas replaced it with nan when it read the file.
When you load data from a file, Pandas assigns the data types to the values of each column by default. You can check these types with .dtypes:
>>>
>>> df = pd.read_csv('data.csv', index_col=0)
>>> df.dtypes
COUNTRY     object
POP        float64
AREA       float64
GDP        float64
CONT        object
IND_DAY     object
dtype: object
The columns with strings and dates ('COUNTRY', 'CONT', and 'IND_DAY') have the data type object. Meanwhile, the numeric columns contain 64-bit floating-point numbers (float64).
You can use the parameter dtype to specify the desired data types and parse_dates to force the use of datetimes:
>>>
>>> dtypes = {'POP': 'float32', 'AREA': 'float32', 'GDP': 'float32'}
>>> df = pd.read_csv('data.csv', index_col=0, dtype=dtypes,
...                  parse_dates=['IND_DAY'])
>>> df.dtypes
COUNTRY             object
POP                float32
AREA               float32
GDP                float32
CONT                object
IND_DAY     datetime64[ns]
dtype: object
>>> df['IND_DAY']
CHN          NaT
IND   1947-08-15
USA   1776-07-04
IDN   1945-08-17
BRA   1822-09-07
PAK   1947-08-14
NGA   1960-10-01
BGD   1971-03-26
RUS   1992-06-12
MEX   1810-09-16
JPN          NaT
DEU          NaT
FRA   1789-07-14
GBR          NaT
ITA          NaT
ARG   1816-07-09
DZA   1962-07-05
CAN   1867-07-01
AUS          NaT
KAZ   1991-12-16
Name: IND_DAY, dtype: datetime64[ns]
Now, you have 32-bit floating-point numbers (float32) as specified with dtype. These differ slightly from the original 64-bit numbers because of smaller precision. The values in the last column are considered as dates and have the data type datetime64. That's why the NaN values in this column are replaced with NaT.
Now that you have real dates, you can save them in the format you like:
>>>
>>> df = pd.read_csv('data.csv', index_col=0, parse_dates=['IND_DAY'])
>>> df.to_csv('formatted-data.csv', date_format='%B %d, %Y')
Here, you've specified the parameter date_format to be '%B %d, %Y'. You can expand the code block below to see the resulting file:
,COUNTRY,POP,AREA,GDP,CONT,IND_DAY
CHN,China,1398.72,9596.96,12234.78,Asia,
IND,India,1351.16,3287.26,2575.67,Asia,"August 15, 1947"
USA,US,329.74,9833.52,19485.39,N.America,"July 04, 1776"
IDN,Indonesia,268.07,1910.93,1015.54,Asia,"August 17, 1945"
BRA,Brazil,210.32,8515.77,2055.51,S.America,"September 07, 1822"
PAK,Pakistan,205.71,881.91,302.14,Asia,"August 14, 1947"
NGA,Nigeria,200.96,923.77,375.77,Africa,"October 01, 1960"
BGD,Bangladesh,167.09,147.57,245.63,Asia,"March 26, 1971"
RUS,Russia,146.79,17098.25,1530.75,,"June 12, 1992"
MEX,Mexico,126.58,1964.38,1158.23,N.America,"September 16, 1810"
JPN,Japan,126.22,377.97,4872.42,Asia,
DEU,Germany,83.02,357.11,3693.2,Europe,
FRA,France,67.02,640.68,2582.49,Europe,"July 14, 1789"
GBR,UK,66.44,242.5,2631.23,Europe,
ITA,Italy,60.36,301.34,1943.84,Europe,
ARG,Argentina,44.94,2780.4,637.49,S.America,"July 09, 1816"
DZA,Algeria,43.38,2381.74,167.56,Africa,"July 05, 1962"
CAN,Canada,37.59,9984.67,1647.12,N.America,"July 01, 1867"
AUS,Australia,25.47,7692.02,1408.68,Oceania,
KAZ,Kazakhstan,18.53,2724.9,159.41,Asia,"December 16, 1991"
The format of the dates is different now. The format '%B %d, %Y' means the date will first display the full name of the month, then the day followed by a comma, and finally the full year.
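You can double-check this format string with Python's standard datetime module:

>>> from datetime import date
>>> date(1947, 8, 15).strftime('%B %d, %Y')
'August 15, 1947'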
There are several other optional parameters that you can use with .to_csv():
- sep denotes a values separator.
- decimal indicates a decimal separator.
- encoding sets the file encoding.
- header specifies whether you want to write column labels in the file.
Here's how you would pass arguments for sep and header:
>>>
>>> s = df.to_csv(sep=';', header=False)
>>> print(s)
CHN;China;1398.72;9596.96;12234.78;Asia;
IND;India;1351.16;3287.26;2575.67;Asia;1947-08-15
USA;US;329.74;9833.52;19485.39;N.America;1776-07-04
IDN;Indonesia;268.07;1910.93;1015.54;Asia;1945-08-17
BRA;Brazil;210.32;8515.77;2055.51;S.America;1822-09-07
PAK;Pakistan;205.71;881.91;302.14;Asia;1947-08-14
NGA;Nigeria;200.96;923.77;375.77;Africa;1960-10-01
BGD;Bangladesh;167.09;147.57;245.63;Asia;1971-03-26
RUS;Russia;146.79;17098.25;1530.75;;1992-06-12
MEX;Mexico;126.58;1964.38;1158.23;N.America;1810-09-16
JPN;Japan;126.22;377.97;4872.42;Asia;
DEU;Germany;83.02;357.11;3693.2;Europe;
FRA;France;67.02;640.68;2582.49;Europe;1789-07-14
GBR;UK;66.44;242.5;2631.23;Europe;
ITA;Italy;60.36;301.34;1943.84;Europe;
ARG;Argentina;44.94;2780.4;637.49;S.America;1816-07-09
DZA;Algeria;43.38;2381.74;167.56;Africa;1962-07-05
CAN;Canada;37.59;9984.67;1647.12;N.America;1867-07-01
AUS;Australia;25.47;7692.02;1408.68;Oceania;
KAZ;Kazakhstan;18.53;2724.9;159.41;Asia;1991-12-16
The data is separated with a semicolon (';') because you've specified sep=';'. Also, since you passed header=False, you see your data without the header row of column names.
The Pandas read_csv() function has many additional options for managing missing data, working with dates and times, quoting, encoding, handling errors, and more. For example, if you have a file with one data column and want to get a Series object instead of a DataFrame, then you can pass squeeze=True to read_csv(). You'll learn later on about data compression and decompression, as well as how to skip rows and columns.
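Here's a minimal sketch of squeeze=True, assuming a hypothetical one-column file single-column.csv:

>>> s = pd.read_csv('single-column.csv', squeeze=True)
>>> type(s)
<class 'pandas.core.series.Series'>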
JSON Files
JSON stands for JavaScript Object Notation. JSON files are plaintext files used for data interchange, and humans can read them easily. They follow the ISO/IEC 21778:2017 and ECMA-404 standards and use the .json extension. Python and Pandas work well with JSON files, as Python's json library offers built-in support for them.
You can save the data from your DataFrame to a JSON file with .to_json(). Start by creating a DataFrame object again. Use the dictionary data that holds the data about countries and then apply .to_json():
>>>
>>> df = pd.DataFrame(data=data).T
>>> df.to_json('data-columns.json')
This code produces the file data-columns.json. You can expand the code block below to see how this file should look:
{"COUNTRY":{"CHN":"China","IND":"India","USA":"US","IDN":"Indonesia","BRA":"Brazil","PAK":"Pakistan","NGA":"Nigeria","BGD":"Bangladesh","RUS":"Russia","MEX":"Mexico","JPN":"Japan","DEU":"Germany","FRA":"France","GBR":"UK","ITA":"Italy","ARG":"Argentina","DZA":"Algeria","CAN":"Canada","AUS":"Australia","KAZ":"Kazakhstan"},
"POP":{"CHN":1398.72,"IND":1351.16,"USA":329.74,"IDN":268.07,"BRA":210.32,"PAK":205.71,"NGA":200.96,"BGD":167.09,"RUS":146.79,"MEX":126.58,"JPN":126.22,"DEU":83.02,"FRA":67.02,"GBR":66.44,"ITA":60.36,"ARG":44.94,"DZA":43.38,"CAN":37.59,"AUS":25.47,"KAZ":18.53},
"AREA":{"CHN":9596.96,"IND":3287.26,"USA":9833.52,"IDN":1910.93,"BRA":8515.77,"PAK":881.91,"NGA":923.77,"BGD":147.57,"RUS":17098.25,"MEX":1964.38,"JPN":377.97,"DEU":357.11,"FRA":640.68,"GBR":242.5,"ITA":301.34,"ARG":2780.4,"DZA":2381.74,"CAN":9984.67,"AUS":7692.02,"KAZ":2724.9},
"GDP":{"CHN":12234.78,"IND":2575.67,"USA":19485.39,"IDN":1015.54,"BRA":2055.51,"PAK":302.14,"NGA":375.77,"BGD":245.63,"RUS":1530.75,"MEX":1158.23,"JPN":4872.42,"DEU":3693.2,"FRA":2582.49,"GBR":2631.23,"ITA":1943.84,"ARG":637.49,"DZA":167.56,"CAN":1647.12,"AUS":1408.68,"KAZ":159.41},
"CONT":{"CHN":"Asia","IND":"Asia","USA":"N.America","IDN":"Asia","BRA":"S.America","PAK":"Asia","NGA":"Africa","BGD":"Asia","RUS":null,"MEX":"N.America","JPN":"Asia","DEU":"Europe","FRA":"Europe","GBR":"Europe","ITA":"Europe","ARG":"S.America","DZA":"Africa","CAN":"N.America","AUS":"Oceania","KAZ":"Asia"},
"IND_DAY":{"CHN":null,"IND":"1947-08-15","USA":"1776-07-04","IDN":"1945-08-17","BRA":"1822-09-07","PAK":"1947-08-14","NGA":"1960-10-01","BGD":"1971-03-26","RUS":"1992-06-12","MEX":"1810-09-16","JPN":null,"DEU":null,"FRA":"1789-07-14","GBR":null,"ITA":null,"ARG":"1816-07-09","DZA":"1962-07-05","CAN":"1867-07-01","AUS":null,"KAZ":"1991-12-16"}}
data-columns.json has one large dictionary with the column labels as keys and the corresponding inner dictionaries as values.
You can get a different file structure if you pass an argument for the optional parameter orient:
>>>
>>> df.to_json('data-index.json', orient='index')
The orient parameter defaults to 'columns'. Here, you've set it to 'index'.
You should get a new file data-index.json. You can expand the code block below to see the changes:
{"CHN":{"COUNTRY":"China","POP":1398.72,"AREA":9596.96,"GDP":12234.78,"CONT":"Asia","IND_DAY":null},
"IND":{"COUNTRY":"India","POP":1351.16,"AREA":3287.26,"GDP":2575.67,"CONT":"Asia","IND_DAY":"1947-08-15"},
"USA":{"COUNTRY":"US","POP":329.74,"AREA":9833.52,"GDP":19485.39,"CONT":"N.America","IND_DAY":"1776-07-04"},
"IDN":{"COUNTRY":"Indonesia","POP":268.07,"AREA":1910.93,"GDP":1015.54,"CONT":"Asia","IND_DAY":"1945-08-17"},
"BRA":{"COUNTRY":"Brazil","POP":210.32,"AREA":8515.77,"GDP":2055.51,"CONT":"S.America","IND_DAY":"1822-09-07"},
"PAK":{"COUNTRY":"Pakistan","POP":205.71,"AREA":881.91,"GDP":302.14,"CONT":"Asia","IND_DAY":"1947-08-14"},
"NGA":{"COUNTRY":"Nigeria","POP":200.96,"AREA":923.77,"GDP":375.77,"CONT":"Africa","IND_DAY":"1960-10-01"},
"BGD":{"COUNTRY":"Bangladesh","POP":167.09,"AREA":147.57,"GDP":245.63,"CONT":"Asia","IND_DAY":"1971-03-26"},
"RUS":{"COUNTRY":"Russia","POP":146.79,"AREA":17098.25,"GDP":1530.75,"CONT":null,"IND_DAY":"1992-06-12"},
"MEX":{"COUNTRY":"Mexico","POP":126.58,"AREA":1964.38,"GDP":1158.23,"CONT":"N.America","IND_DAY":"1810-09-16"},
"JPN":{"COUNTRY":"Japan","POP":126.22,"AREA":377.97,"GDP":4872.42,"CONT":"Asia","IND_DAY":null},
"DEU":{"COUNTRY":"Germany","POP":83.02,"AREA":357.11,"GDP":3693.2,"CONT":"Europe","IND_DAY":null},
"FRA":{"COUNTRY":"France","POP":67.02,"AREA":640.68,"GDP":2582.49,"CONT":"Europe","IND_DAY":"1789-07-14"},
"GBR":{"COUNTRY":"UK","POP":66.44,"AREA":242.5,"GDP":2631.23,"CONT":"Europe","IND_DAY":null},
"ITA":{"COUNTRY":"Italy","POP":60.36,"AREA":301.34,"GDP":1943.84,"CONT":"Europe","IND_DAY":null},
"ARG":{"COUNTRY":"Argentina","POP":44.94,"AREA":2780.4,"GDP":637.49,"CONT":"S.America","IND_DAY":"1816-07-09"},
"DZA":{"COUNTRY":"Algeria","POP":43.38,"AREA":2381.74,"GDP":167.56,"CONT":"Africa","IND_DAY":"1962-07-05"},
"CAN":{"COUNTRY":"Canada","POP":37.59,"AREA":9984.67,"GDP":1647.12,"CONT":"N.America","IND_DAY":"1867-07-01"},
"AUS":{"COUNTRY":"Australia","POP":25.47,"AREA":7692.02,"GDP":1408.68,"CONT":"Oceania","IND_DAY":null},
"KAZ":{"COUNTRY":"Kazakhstan","POP":18.53,"AREA":2724.9,"GDP":159.41,"CONT":"Asia","IND_DAY":"1991-12-16"}}
data-index.json also has one large dictionary, but this time the row labels are the keys, and the inner dictionaries are the values.
There are a few more options for orient. One of them is 'records':
>>>
>>> df.to_json('data-records.json', orient='records')
This code should yield the file data-records.json. You can expand the code block below to see the content:
[{"COUNTRY":"China","POP":1398.72,"AREA":9596.96,"GDP":12234.78,"CONT":"Asia","IND_DAY":null},
{"COUNTRY":"India","POP":1351.16,"AREA":3287.26,"GDP":2575.67,"CONT":"Asia","IND_DAY":"1947-08-15"},
{"COUNTRY":"US","POP":329.74,"AREA":9833.52,"GDP":19485.39,"CONT":"N.America","IND_DAY":"1776-07-04"},
{"COUNTRY":"Indonesia","POP":268.07,"AREA":1910.93,"GDP":1015.54,"CONT":"Asia","IND_DAY":"1945-08-17"},
{"COUNTRY":"Brazil","POP":210.32,"AREA":8515.77,"GDP":2055.51,"CONT":"S.America","IND_DAY":"1822-09-07"},
{"COUNTRY":"Pakistan","POP":205.71,"AREA":881.91,"GDP":302.14,"CONT":"Asia","IND_DAY":"1947-08-14"},
{"COUNTRY":"Nigeria","POP":200.96,"AREA":923.77,"GDP":375.77,"CONT":"Africa","IND_DAY":"1960-10-01"},
{"COUNTRY":"Bangladesh","POP":167.09,"AREA":147.57,"GDP":245.63,"CONT":"Asia","IND_DAY":"1971-03-26"},
{"COUNTRY":"Russia","POP":146.79,"AREA":17098.25,"GDP":1530.75,"CONT":null,"IND_DAY":"1992-06-12"},
{"COUNTRY":"Mexico","POP":126.58,"AREA":1964.38,"GDP":1158.23,"CONT":"N.America","IND_DAY":"1810-09-16"},
{"COUNTRY":"Japan","POP":126.22,"AREA":377.97,"GDP":4872.42,"CONT":"Asia","IND_DAY":null},
{"COUNTRY":"Germany","POP":83.02,"AREA":357.11,"GDP":3693.2,"CONT":"Europe","IND_DAY":null},
{"COUNTRY":"France","POP":67.02,"AREA":640.68,"GDP":2582.49,"CONT":"Europe","IND_DAY":"1789-07-14"},
{"COUNTRY":"UK","POP":66.44,"AREA":242.5,"GDP":2631.23,"CONT":"Europe","IND_DAY":null},
{"COUNTRY":"Italy","POP":60.36,"AREA":301.34,"GDP":1943.84,"CONT":"Europe","IND_DAY":null},
{"COUNTRY":"Argentina","POP":44.94,"AREA":2780.4,"GDP":637.49,"CONT":"S.America","IND_DAY":"1816-07-09"},
{"COUNTRY":"Algeria","POP":43.38,"AREA":2381.74,"GDP":167.56,"CONT":"Africa","IND_DAY":"1962-07-05"},
{"COUNTRY":"Canada","POP":37.59,"AREA":9984.67,"GDP":1647.12,"CONT":"N.America","IND_DAY":"1867-07-01"},
{"COUNTRY":"Australia","POP":25.47,"AREA":7692.02,"GDP":1408.68,"CONT":"Oceania","IND_DAY":null},
{"COUNTRY":"Kazakhstan","POP":18.53,"AREA":2724.9,"GDP":159.41,"CONT":"Asia","IND_DAY":"1991-12-16"}]
data-records.json holds a list with one dictionary for each row. The row labels are not written.
You can get another interesting file structure with orient='split':
>>>
>>> df.to_json('data-split.json', orient='split')
The resulting file is data-split.json. You can expand the code block below to see how this file should look:
{"columns":["COUNTRY","POP","AREA","GDP","CONT","IND_DAY"],
"index":["CHN","IND","USA","IDN","BRA","PAK","NGA","BGD","RUS","MEX","JPN","DEU","FRA","GBR","ITA","ARG","DZA","CAN","AUS","KAZ"],
"data":[["China",1398.72,9596.96,12234.78,"Asia",null],
["India",1351.16,3287.26,2575.67,"Asia","1947-08-15"],
["US",329.74,9833.52,19485.39,"N.America","1776-07-04"],
["Indonesia",268.07,1910.93,1015.54,"Asia","1945-08-17"],
["Brazil",210.32,8515.77,2055.51,"S.America","1822-09-07"],
["Pakistan",205.71,881.91,302.14,"Asia","1947-08-14"],
["Nigeria",200.96,923.77,375.77,"Africa","1960-10-01"],
["Bangladesh",167.09,147.57,245.63,"Asia","1971-03-26"],
["Russia",146.79,17098.25,1530.75,null,"1992-06-12"],
["Mexico",126.58,1964.38,1158.23,"N.America","1810-09-16"],
["Japan",126.22,377.97,4872.42,"Asia",null],
["Germany",83.02,357.11,3693.2,"Europe",null],
["France",67.02,640.68,2582.49,"Europe","1789-07-14"],
["UK",66.44,242.5,2631.23,"Europe",null],
["Italy",60.36,301.34,1943.84,"Europe",null],
["Argentina",44.94,2780.4,637.49,"S.America","1816-07-09"],
["Algeria",43.38,2381.74,167.56,"Africa","1962-07-05"],
["Canada",37.59,9984.67,1647.12,"N.America","1867-07-01"],
["Australia",25.47,7692.02,1408.68,"Oceania",null],
["Kazakhstan",18.53,2724.9,159.41,"Asia","1991-12-16"]]}
data-split.json contains one dictionary that holds the following lists:

- The names of the columns
- The labels of the rows
- The inner lists (a two-dimensional sequence) that hold data values
If you don't provide the value for the optional parameter path_or_buf that defines the file path, then .to_json() will return a JSON string instead of writing the results to a file. This behavior is consistent with .to_csv().
There are other optional parameters you can use. For instance, you can set index=False to forgo saving row labels. You can manipulate precision with double_precision, and dates with date_format and date_unit. These last two parameters are particularly important when you have time series among your data:
>>>
>>> df = pd.DataFrame(data=data).T
>>> df['IND_DAY'] = pd.to_datetime(df['IND_DAY'])
>>> df.dtypes
COUNTRY             object
POP                 object
AREA                object
GDP                 object
CONT                object
IND_DAY     datetime64[ns]
dtype: object
>>> df.to_json('data-time.json')
In this example, you've created the DataFrame from the dictionary data and used to_datetime() to convert the values in the last column to datetime64. You can expand the code block below to see the resulting file:
{"COUNTRY":{"CHN":"China","IND":"India","USA":"US","IDN":"Indonesia","BRA":"Brazil","PAK":"Pakistan","NGA":"Nigeria","BGD":"Bangladesh","RUS":"Russia","MEX":"Mexico","JPN":"Japan","DEU":"Germany","FRA":"France","GBR":"UK","ITA":"Italy","ARG":"Argentina","DZA":"Algeria","CAN":"Canada","AUS":"Australia","KAZ":"Kazakhstan"},
"POP":{"CHN":1398.72,"IND":1351.16,"USA":329.74,"IDN":268.07,"BRA":210.32,"PAK":205.71,"NGA":200.96,"BGD":167.09,"RUS":146.79,"MEX":126.58,"JPN":126.22,"DEU":83.02,"FRA":67.02,"GBR":66.44,"ITA":60.36,"ARG":44.94,"DZA":43.38,"CAN":37.59,"AUS":25.47,"KAZ":18.53},
"AREA":{"CHN":9596.96,"IND":3287.26,"USA":9833.52,"IDN":1910.93,"BRA":8515.77,"PAK":881.91,"NGA":923.77,"BGD":147.57,"RUS":17098.25,"MEX":1964.38,"JPN":377.97,"DEU":357.11,"FRA":640.68,"GBR":242.5,"ITA":301.34,"ARG":2780.4,"DZA":2381.74,"CAN":9984.67,"AUS":7692.02,"KAZ":2724.9},
"GDP":{"CHN":12234.78,"IND":2575.67,"USA":19485.39,"IDN":1015.54,"BRA":2055.51,"PAK":302.14,"NGA":375.77,"BGD":245.63,"RUS":1530.75,"MEX":1158.23,"JPN":4872.42,"DEU":3693.2,"FRA":2582.49,"GBR":2631.23,"ITA":1943.84,"ARG":637.49,"DZA":167.56,"CAN":1647.12,"AUS":1408.68,"KAZ":159.41},
"CONT":{"CHN":"Asia","IND":"Asia","USA":"N.America","IDN":"Asia","BRA":"S.America","PAK":"Asia","NGA":"Africa","BGD":"Asia","RUS":null,"MEX":"N.America","JPN":"Asia","DEU":"Europe","FRA":"Europe","GBR":"Europe","ITA":"Europe","ARG":"S.America","DZA":"Africa","CAN":"N.America","AUS":"Oceania","KAZ":"Asia"},
"IND_DAY":{"CHN":null,"IND":-706320000000,"USA":-6106060800000,"IDN":-769219200000,"BRA":-4648924800000,"PAK":-706406400000,"NGA":-291945600000,"BGD":38793600000,"RUS":708307200000,"MEX":-5026838400000,"JPN":null,"DEU":null,"FRA":-5694969600000,"GBR":null,"ITA":null,"ARG":-4843411200000,"DZA":-236476800000,"CAN":-3234729600000,"AUS":null,"KAZ":692841600000}}
In this file, you have large integers instead of dates for the independence days. That's because the default value of the optional parameter date_format is 'epoch' whenever orient isn't 'table'. This default behavior expresses dates as an epoch in milliseconds relative to midnight on January 1, 1970.
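As a quick sanity check, you can convert one of these epoch values back with pd.to_datetime(), passing the matching unit:

>>> pd.to_datetime(-706_320_000_000, unit='ms')
Timestamp('1947-08-15 00:00:00')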
However, if you pass date_format='iso', then you'll get the dates in the ISO 8601 format. In addition, date_unit decides the units of time:
>>>
>>> df = pd.DataFrame(data=data).T
>>> df['IND_DAY'] = pd.to_datetime(df['IND_DAY'])
>>> df.to_json('new-data-time.json', date_format='iso', date_unit='s')
This code produces the following JSON file:
{"COUNTRY":{"CHN":"China","IND":"India","USA":"US","IDN":"Indonesia","BRA":"Brazil","PAK":"Pakistan","NGA":"Nigeria","BGD":"Bangladesh","RUS":"Russia","MEX":"Mexico","JPN":"Japan","DEU":"Germany","FRA":"France","GBR":"UK","ITA":"Italy","ARG":"Argentina","DZA":"Algeria","CAN":"Canada","AUS":"Australia","KAZ":"Kazakhstan"},
"POP":{"CHN":1398.72,"IND":1351.16,"USA":329.74,"IDN":268.07,"BRA":210.32,"PAK":205.71,"NGA":200.96,"BGD":167.09,"RUS":146.79,"MEX":126.58,"JPN":126.22,"DEU":83.02,"FRA":67.02,"GBR":66.44,"ITA":60.36,"ARG":44.94,"DZA":43.38,"CAN":37.59,"AUS":25.47,"KAZ":18.53},
"AREA":{"CHN":9596.96,"IND":3287.26,"USA":9833.52,"IDN":1910.93,"BRA":8515.77,"PAK":881.91,"NGA":923.77,"BGD":147.57,"RUS":17098.25,"MEX":1964.38,"JPN":377.97,"DEU":357.11,"FRA":640.68,"GBR":242.5,"ITA":301.34,"ARG":2780.4,"DZA":2381.74,"CAN":9984.67,"AUS":7692.02,"KAZ":2724.9},
"GDP":{"CHN":12234.78,"IND":2575.67,"USA":19485.39,"IDN":1015.54,"BRA":2055.51,"PAK":302.14,"NGA":375.77,"BGD":245.63,"RUS":1530.75,"MEX":1158.23,"JPN":4872.42,"DEU":3693.2,"FRA":2582.49,"GBR":2631.23,"ITA":1943.84,"ARG":637.49,"DZA":167.56,"CAN":1647.12,"AUS":1408.68,"KAZ":159.41},
"CONT":{"CHN":"Asia","IND":"Asia","USA":"N.America","IDN":"Asia","BRA":"S.America","PAK":"Asia","NGA":"Africa","BGD":"Asia","RUS":null,"MEX":"N.America","JPN":"Asia","DEU":"Europe","FRA":"Europe","GBR":"Europe","ITA":"Europe","ARG":"S.America","DZA":"Africa","CAN":"N.America","AUS":"Oceania","KAZ":"Asia"},
"IND_DAY":{"CHN":null,"IND":"1947-08-15T00:00:00Z","USA":"1776-07-04T00:00:00Z","IDN":"1945-08-17T00:00:00Z","BRA":"1822-09-07T00:00:00Z","PAK":"1947-08-14T00:00:00Z","NGA":"1960-10-01T00:00:00Z","BGD":"1971-03-26T00:00:00Z","RUS":"1992-06-12T00:00:00Z","MEX":"1810-09-16T00:00:00Z","JPN":null,"DEU":null,"FRA":"1789-07-14T00:00:00Z","GBR":null,"ITA":null,"ARG":"1816-07-09T00:00:00Z","DZA":"1962-07-05T00:00:00Z","CAN":"1867-07-01T00:00:00Z","AUS":null,"KAZ":"1991-12-16T00:00:00Z"}}
The dates in the resulting file are in the ISO 8601 format.
You can load the data from a JSON file with read_json():
>>>
>>> df = pd.read_json('data-index.json', orient='index',
...                   convert_dates=['IND_DAY'])
The parameter convert_dates has a similar purpose as parse_dates when you use it to read CSV files. The optional parameter orient is very important because it specifies how Pandas understands the structure of the file.
There are other optional parameters you can use as well:
- Set the encoding with encoding.
- Manipulate dates with convert_dates and keep_default_dates.
- Impact precision with dtype and precise_float.
- Decode numeric data directly to NumPy arrays with numpy=True.
Note that you might lose the order of rows and columns when using the JSON format to store your data.
HTML Files
An HTML file is a plaintext file that uses hypertext markup language to help browsers render web pages. The extensions for HTML files are .html and .htm. You'll need to install an HTML parser library like lxml or html5lib to be able to work with HTML files:
$ pip install lxml html5lib
You can also use Conda to install the same packages:
$ conda install lxml html5lib
Once you have these libraries, you can save the contents of your DataFrame as an HTML file with .to_html():
>>>
>>> df = pd.DataFrame(data=data).T
>>> df.to_html('data.html')
This code generates a file data.html. You can expand the code block below to see how this file should look:
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>COUNTRY</th>
      <th>POP</th>
      <th>AREA</th>
      <th>GDP</th>
      <th>CONT</th>
      <th>IND_DAY</th>
    </tr>
  </thead>
  <tbody>
    <tr><th>CHN</th><td>China</td><td>1398.72</td><td>9596.96</td><td>12234.8</td><td>Asia</td><td>NaN</td></tr>
    <tr><th>IND</th><td>India</td><td>1351.16</td><td>3287.26</td><td>2575.67</td><td>Asia</td><td>1947-08-15</td></tr>
    <tr><th>USA</th><td>US</td><td>329.74</td><td>9833.52</td><td>19485.4</td><td>N.America</td><td>1776-07-04</td></tr>
    <tr><th>IDN</th><td>Indonesia</td><td>268.07</td><td>1910.93</td><td>1015.54</td><td>Asia</td><td>1945-08-17</td></tr>
    <tr><th>BRA</th><td>Brazil</td><td>210.32</td><td>8515.77</td><td>2055.51</td><td>S.America</td><td>1822-09-07</td></tr>
    <tr><th>PAK</th><td>Pakistan</td><td>205.71</td><td>881.91</td><td>302.14</td><td>Asia</td><td>1947-08-14</td></tr>
    <tr><th>NGA</th><td>Nigeria</td><td>200.96</td><td>923.77</td><td>375.77</td><td>Africa</td><td>1960-10-01</td></tr>
    <tr><th>BGD</th><td>Bangladesh</td><td>167.09</td><td>147.57</td><td>245.63</td><td>Asia</td><td>1971-03-26</td></tr>
    <tr><th>RUS</th><td>Russia</td><td>146.79</td><td>17098.2</td><td>1530.75</td><td>NaN</td><td>1992-06-12</td></tr>
    <tr><th>MEX</th><td>Mexico</td><td>126.58</td><td>1964.38</td><td>1158.23</td><td>N.America</td><td>1810-09-16</td></tr>
    <tr><th>JPN</th><td>Japan</td><td>126.22</td><td>377.97</td><td>4872.42</td><td>Asia</td><td>NaN</td></tr>
    <tr><th>DEU</th><td>Germany</td><td>83.02</td><td>357.11</td><td>3693.2</td><td>Europe</td><td>NaN</td></tr>
    <tr><th>FRA</th><td>France</td><td>67.02</td><td>640.68</td><td>2582.49</td><td>Europe</td><td>1789-07-14</td></tr>
    <tr><th>GBR</th><td>UK</td><td>66.44</td><td>242.5</td><td>2631.23</td><td>Europe</td><td>NaN</td></tr>
    <tr><th>ITA</th><td>Italy</td><td>60.36</td><td>301.34</td><td>1943.84</td><td>Europe</td><td>NaN</td></tr>
    <tr><th>ARG</th><td>Argentina</td><td>44.94</td><td>2780.4</td><td>637.49</td><td>S.America</td><td>1816-07-09</td></tr>
    <tr><th>DZA</th><td>Algeria</td><td>43.38</td><td>2381.74</td><td>167.56</td><td>Africa</td><td>1962-07-05</td></tr>
    <tr><th>CAN</th><td>Canada</td><td>37.59</td><td>9984.67</td><td>1647.12</td><td>N.America</td><td>1867-07-01</td></tr>
    <tr><th>AUS</th><td>Australia</td><td>25.47</td><td>7692.02</td><td>1408.68</td><td>Oceania</td><td>NaN</td></tr>
    <tr><th>KAZ</th><td>Kazakhstan</td><td>18.53</td><td>2724.9</td><td>159.41</td><td>Asia</td><td>1991-12-16</td></tr>
  </tbody>
</table>
This file shows the DataFrame contents nicely. However, notice that you haven't obtained an entire web page. You've just output the data that corresponds to df in the HTML format.
.to_html() won't create a file if you don't provide the optional parameter buf, which denotes the buffer to write to. If you leave this parameter out, then your code will return a string as it did with .to_csv() and .to_json().
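For example, this minimal sketch returns the markup as a string instead of creating a file:

>>> html_string = df.to_html()  # no buf given, so a string comes back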
Here are some other optional parameters:
- header determines whether to save the column names.
- index determines whether to save the row labels.
- classes assigns cascading style sheet (CSS) classes.
- render_links specifies whether to convert URLs to HTML links.
- table_id assigns the CSS id to the table tag.
- escape decides whether to convert the characters <, >, and & to HTML-safe strings.
You use parameters like these to specify different aspects of the resulting files or strings.
You can create a DataFrame object from a suitable HTML file using read_html(), which will return a DataFrame instance or a list of them:
>>>
>>> df = pd.read_html('data.html', index_col=0, parse_dates=['IND_DAY'])
This is very similar to what you did when reading CSV files. You also have parameters that help you work with dates, missing values, precision, encoding, HTML parsers, and more.
Excel Files
You've already learned how to read and write Excel files with Pandas. However, there are a few more options worth considering. For one, when you use .to_excel(), you can specify the name of the target worksheet with the optional parameter sheet_name:
>>>
>>> df = pd.DataFrame(data=data).T
>>> df.to_excel('data.xlsx', sheet_name='COUNTRIES')
Here, you create a file data.xlsx with a worksheet called COUNTRIES that stores the data. The string 'data.xlsx' is the argument for the parameter excel_writer that defines the name of the Excel file or its path.
The optional parameters startrow and startcol both default to 0 and indicate the upper left-most cell where the data should start being written:
>>>
>>> df.to_excel('data-shifted.xlsx', sheet_name='COUNTRIES',
...             startrow=2, startcol=4)
Here, you specify that the table should start in the third row and the fifth column. You also used zero-based indexing, so the third row is denoted by 2 and the fifth column by 4.
Now the resulting worksheet looks like this:

As you can see, the table starts in the third row, labeled 3 in Excel, and the fifth column, labeled E.
.read_excel() also has the optional parameter sheet_name that specifies which worksheets to read when loading data. It can take on one of the following values:
- The zero-based index of the worksheet
- The name of the worksheet
- The list of indices or names to read multiple sheets
- The value None to read all sheets
Here's how you would use this parameter in your code:
>>>
>>> df = pd.read_excel('data.xlsx', sheet_name=0, index_col=0,
...                    parse_dates=['IND_DAY'])
>>> df = pd.read_excel('data.xlsx', sheet_name='COUNTRIES', index_col=0,
...                    parse_dates=['IND_DAY'])
Both statements above create the same DataFrame because the sheet_name parameters have the same values. In both cases, sheet_name=0 and sheet_name='COUNTRIES' refer to the same worksheet. The argument parse_dates=['IND_DAY'] tells Pandas to try to consider the values in this column as dates or times.
There are other optional parameters you can use with .read_excel() and .to_excel() to determine the Excel engine, the encoding, the way to handle missing values and infinities, the method for writing column names and row labels, and so on.
SQL Files
Pandas IO tools can also read and write databases. In this next example, you'll write your data to a database called data.db. To get started, you'll need the SQLAlchemy package. To learn more about it, you can read the official ORM tutorial. You'll also need the database driver. Python has a built-in driver for SQLite.
You can install SQLAlchemy with pip:

$ pip install sqlalchemy

You can also install it with Conda:
$ conda install sqlalchemy
Once you have SQLAlchemy installed, import create_engine() and create a database engine:
>>>
>>> from sqlalchemy import create_engine
>>> engine = create_engine('sqlite:///data.db', echo=False)
Now that you have everything set up, the next step is to create a DataFrame object. It's convenient to specify the data types and apply .to_sql().
>>>
>>> dtypes = {'POP': 'float64', 'AREA': 'float64', 'GDP': 'float64',
...           'IND_DAY': 'datetime64'}
>>> df = pd.DataFrame(data=data).T.astype(dtype=dtypes)
>>> df.dtypes
COUNTRY            object
POP               float64
AREA              float64
GDP               float64
CONT               object
IND_DAY    datetime64[ns]
dtype: object
.astype() is a very convenient method you can apply to set multiple data types at once.
Once you've created your DataFrame, you can save it to the database with .to_sql():
>>>
>>> df.to_sql('data.db', con=engine, index_label='ID')
The parameter con is used to specify the database connection or engine that you want to use. The optional parameter index_label specifies how to call the database column with the row labels. You'll often see it take on the value ID, Id, or id.
You should get the database data.db with a single table that looks like this:

The first column contains the row labels. To omit writing them into the database, pass index=False to .to_sql(). The other columns correspond to the columns of the DataFrame.
There are a few more optional parameters. For example, you can use schema to specify the database schema and dtype to determine the types of the database columns. You can also use if_exists, which says what to do if a database with the same name and path already exists:
- if_exists='fail' raises a ValueError and is the default.
- if_exists='replace' drops the table and inserts new values.
- if_exists='append' inserts new values into the table.
You can load the data from the database with read_sql():
>>>
>>> df = pd.read_sql('data.db', con=engine, index_col='ID')
>>> df
        COUNTRY      POP      AREA       GDP       CONT     IND_DAY
ID
CHN       China  1398.72   9596.96  12234.78       Asia         NaT
IND       India  1351.16   3287.26   2575.67       Asia  1947-08-15
USA          US   329.74   9833.52  19485.39  N.America  1776-07-04
IDN   Indonesia   268.07   1910.93   1015.54       Asia  1945-08-17
BRA      Brazil   210.32   8515.77   2055.51  S.America  1822-09-07
PAK    Pakistan   205.71    881.91    302.14       Asia  1947-08-14
NGA     Nigeria   200.96    923.77    375.77     Africa  1960-10-01
BGD  Bangladesh   167.09    147.57    245.63       Asia  1971-03-26
RUS      Russia   146.79  17098.25   1530.75       None  1992-06-12
MEX      Mexico   126.58   1964.38   1158.23  N.America  1810-09-16
JPN       Japan   126.22    377.97   4872.42       Asia         NaT
DEU     Germany    83.02    357.11   3693.20     Europe         NaT
FRA      France    67.02    640.68   2582.49     Europe  1789-07-14
GBR          UK    66.44    242.50   2631.23     Europe         NaT
ITA       Italy    60.36    301.34   1943.84     Europe         NaT
ARG   Argentina    44.94   2780.40    637.49  S.America  1816-07-09
DZA     Algeria    43.38   2381.74    167.56     Africa  1962-07-05
CAN      Canada    37.59   9984.67   1647.12  N.America  1867-07-01
AUS   Australia    25.47   7692.02   1408.68    Oceania         NaT
KAZ  Kazakhstan    18.53   2724.90    159.41       Asia  1991-12-16
The parameter index_col specifies the name of the column with the row labels. Note that this inserts an extra row after the header that starts with ID. You can fix this behavior with the following line of code:
>>>
>>> df.index.name = None
>>> df
        COUNTRY      POP      AREA       GDP       CONT     IND_DAY
CHN       China  1398.72   9596.96  12234.78       Asia         NaT
IND       India  1351.16   3287.26   2575.67       Asia  1947-08-15
USA          US   329.74   9833.52  19485.39  N.America  1776-07-04
IDN   Indonesia   268.07   1910.93   1015.54       Asia  1945-08-17
BRA      Brazil   210.32   8515.77   2055.51  S.America  1822-09-07
PAK    Pakistan   205.71    881.91    302.14       Asia  1947-08-14
NGA     Nigeria   200.96    923.77    375.77     Africa  1960-10-01
BGD  Bangladesh   167.09    147.57    245.63       Asia  1971-03-26
RUS      Russia   146.79  17098.25   1530.75       None  1992-06-12
MEX      Mexico   126.58   1964.38   1158.23  N.America  1810-09-16
JPN       Japan   126.22    377.97   4872.42       Asia         NaT
DEU     Germany    83.02    357.11   3693.20     Europe         NaT
FRA      France    67.02    640.68   2582.49     Europe  1789-07-14
GBR          UK    66.44    242.50   2631.23     Europe         NaT
ITA       Italy    60.36    301.34   1943.84     Europe         NaT
ARG   Argentina    44.94   2780.40    637.49  S.America  1816-07-09
DZA     Algeria    43.38   2381.74    167.56     Africa  1962-07-05
CAN      Canada    37.59   9984.67   1647.12  N.America  1867-07-01
AUS   Australia    25.47   7692.02   1408.68    Oceania         NaT
KAZ  Kazakhstan    18.53   2724.90    159.41       Asia  1991-12-16
Now you have the same DataFrame object as before.
Note that the continent for Russia is now None instead of nan. If you want to fill the missing values with nan, then you can use .fillna():
>>>
>>> df.fillna(value=float('nan'), inplace=True)
.fillna() replaces all missing values with whatever you pass to value. Here, you passed float('nan'), which says to fill all missing values with nan.
Also note that you didn't have to pass parse_dates=['IND_DAY'] to read_sql(). That's because your database was able to detect that the last column contains dates. However, you can pass parse_dates if you'd like. You'll get the same results.
There are other functions that you can use to read databases, like read_sql_table() and read_sql_query(). Feel free to try them out!
Pickle Files
Pickling is the act of converting Python objects into byte streams. Unpickling is the inverse process. Python pickle files are the binary files that keep the data and hierarchy of Python objects. They usually have the extension .pickle or .pkl.
You can save your DataFrame in a pickle file with .to_pickle():
>>>
>>> dtypes = {'POP': 'float64', 'AREA': 'float64', 'GDP': 'float64',
...           'IND_DAY': 'datetime64'}
>>> df = pd.DataFrame(data=data).T.astype(dtype=dtypes)
>>> df.to_pickle('data.pickle')
Like you did with databases, it can be convenient first to specify the data types. Then, you create a file data.pickle to contain your data. You could also pass an integer value to the optional parameter protocol, which specifies the protocol of the pickler.
You can get the data from a pickle file with read_pickle():
>>>
>>> df = pd.read_pickle('data.pickle')
>>> df
        COUNTRY      POP      AREA       GDP       CONT     IND_DAY
CHN       China  1398.72   9596.96  12234.78       Asia         NaT
IND       India  1351.16   3287.26   2575.67       Asia  1947-08-15
USA          US   329.74   9833.52  19485.39  N.America  1776-07-04
IDN   Indonesia   268.07   1910.93   1015.54       Asia  1945-08-17
BRA      Brazil   210.32   8515.77   2055.51  S.America  1822-09-07
PAK    Pakistan   205.71    881.91    302.14       Asia  1947-08-14
NGA     Nigeria   200.96    923.77    375.77     Africa  1960-10-01
BGD  Bangladesh   167.09    147.57    245.63       Asia  1971-03-26
RUS      Russia   146.79  17098.25   1530.75        NaN  1992-06-12
MEX      Mexico   126.58   1964.38   1158.23  N.America  1810-09-16
JPN       Japan   126.22    377.97   4872.42       Asia         NaT
DEU     Germany    83.02    357.11   3693.20     Europe         NaT
FRA      France    67.02    640.68   2582.49     Europe  1789-07-14
GBR          UK    66.44    242.50   2631.23     Europe         NaT
ITA       Italy    60.36    301.34   1943.84     Europe         NaT
ARG   Argentina    44.94   2780.40    637.49  S.America  1816-07-09
DZA     Algeria    43.38   2381.74    167.56     Africa  1962-07-05
CAN      Canada    37.59   9984.67   1647.12  N.America  1867-07-01
AUS   Australia    25.47   7692.02   1408.68    Oceania         NaT
KAZ  Kazakhstan    18.53   2724.90    159.41       Asia  1991-12-16
read_pickle() returns the DataFrame with the stored data. You can also check the data types:
>>>
>>> df.dtypes
COUNTRY            object
POP               float64
AREA              float64
GDP               float64
CONT               object
IND_DAY    datetime64[ns]
dtype: object
These are the same ones that you specified before using .to_pickle().
As a word of caution, you should always beware of loading pickles from untrusted sources. This can be dangerous! When you unpickle an untrustworthy file, it could execute arbitrary code on your machine, gain remote access to your computer, or otherwise exploit your device in other ways.
Working With Big Data
If your files are too large for saving or processing, then there are several approaches you can take to reduce the required disk space:
- Compress your files
- Choose only the columns you want
- Omit the rows you don't need
- Force the use of less precise data types
- Split the data into chunks
You'll take a look at each of these techniques in turn.
Compress and Decompress Files
You can create an archive file like you would a regular one, with the addition of a suffix that corresponds to the desired compression type:
- '.gz'
- '.bz2'
- '.zip'
- '.xz'
Pandas can deduce the compression type by itself:
>>>
>>> df = pd.DataFrame(data=data).T
>>> df.to_csv('data.csv.zip')
Here, you create a compressed .csv file as an archive. The size of the regular .csv file is 1048 bytes, while the compressed file only has 766 bytes.
You can open this compressed file as usual with the Pandas read_csv() function:
>>>
>>> df = pd.read_csv('data.csv.zip', index_col=0,
...                  parse_dates=['IND_DAY'])
>>> df
        COUNTRY      POP      AREA       GDP       CONT     IND_DAY
CHN       China  1398.72   9596.96  12234.78       Asia         NaT
IND       India  1351.16   3287.26   2575.67       Asia  1947-08-15
USA          US   329.74   9833.52  19485.39  N.America  1776-07-04
IDN   Indonesia   268.07   1910.93   1015.54       Asia  1945-08-17
BRA      Brazil   210.32   8515.77   2055.51  S.America  1822-09-07
PAK    Pakistan   205.71    881.91    302.14       Asia  1947-08-14
NGA     Nigeria   200.96    923.77    375.77     Africa  1960-10-01
BGD  Bangladesh   167.09    147.57    245.63       Asia  1971-03-26
RUS      Russia   146.79  17098.25   1530.75        NaN  1992-06-12
MEX      Mexico   126.58   1964.38   1158.23  N.America  1810-09-16
JPN       Japan   126.22    377.97   4872.42       Asia         NaT
DEU     Germany    83.02    357.11   3693.20     Europe         NaT
FRA      France    67.02    640.68   2582.49     Europe  1789-07-14
GBR          UK    66.44    242.50   2631.23     Europe         NaT
ITA       Italy    60.36    301.34   1943.84     Europe         NaT
ARG   Argentina    44.94   2780.40    637.49  S.America  1816-07-09
DZA     Algeria    43.38   2381.74    167.56     Africa  1962-07-05
CAN      Canada    37.59   9984.67   1647.12  N.America  1867-07-01
AUS   Australia    25.47   7692.02   1408.68    Oceania         NaT
KAZ  Kazakhstan    18.53   2724.90    159.41       Asia  1991-12-16
read_csv() decompresses the file before reading it into a DataFrame.
You can specify the type of compression with the optional parameter compression, which can take on any of the following values:
- 'infer'
- 'gzip'
- 'bz2'
- 'zip'
- 'xz'
- None
The default value compression='infer' indicates that Pandas should deduce the compression type from the file extension.
Hither'south how you would shrink a pickle file:
>>>
>>> df = pd.DataFrame(data=data).T
>>> df.to_pickle('data.pickle.compress', compression='gzip')
You should get the file data.pickle.compress that you can later decompress and read:
>>>
>>> df = pd.read_pickle('data.pickle.compress', compression='gzip')
df again corresponds to the DataFrame with the same data as before.
You can give the other compression methods a try, as well. If you're using pickle files, then keep in mind that the .zip format supports reading only.
Choose Columns
The Pandas read_csv() and read_excel() functions have the optional parameter usecols that you can use to specify the columns you want to load from the file. You can pass the list of column names as the corresponding argument:
>>>
>>> df = pd.read_csv('data.csv', usecols=['COUNTRY', 'AREA'])
>>> df
       COUNTRY      AREA
0        China   9596.96
1        India   3287.26
2           US   9833.52
3    Indonesia   1910.93
4       Brazil   8515.77
5     Pakistan    881.91
6      Nigeria    923.77
7   Bangladesh    147.57
8       Russia  17098.25
9       Mexico   1964.38
10       Japan    377.97
11     Germany    357.11
12      France    640.68
13          UK    242.50
14       Italy    301.34
15   Argentina   2780.40
16     Algeria   2381.74
17      Canada   9984.67
18   Australia   7692.02
19  Kazakhstan   2724.90
Now you have a DataFrame that contains less data than before. Here, there are only the names of the countries and their areas.
Instead of the column names, you can also pass their indices:
>>>
>>> df = pd.read_csv('data.csv', index_col=0, usecols=[0, 1, 3])
>>> df
        COUNTRY      AREA
CHN       China   9596.96
IND       India   3287.26
USA          US   9833.52
IDN   Indonesia   1910.93
BRA      Brazil   8515.77
PAK    Pakistan    881.91
NGA     Nigeria    923.77
BGD  Bangladesh    147.57
RUS      Russia  17098.25
MEX      Mexico   1964.38
JPN       Japan    377.97
DEU     Germany    357.11
FRA      France    640.68
GBR          UK    242.50
ITA       Italy    301.34
ARG   Argentina   2780.40
DZA     Algeria   2381.74
CAN      Canada   9984.67
AUS   Australia   7692.02
KAZ  Kazakhstan   2724.90
Expand the code block below to compare these results with the file 'data.csv':
,COUNTRY,POP,AREA,GDP,CONT,IND_DAY
CHN,China,1398.72,9596.96,12234.78,Asia,
IND,India,1351.16,3287.26,2575.67,Asia,1947-08-15
USA,US,329.74,9833.52,19485.39,N.America,1776-07-04
IDN,Indonesia,268.07,1910.93,1015.54,Asia,1945-08-17
BRA,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07
PAK,Pakistan,205.71,881.91,302.14,Asia,1947-08-14
NGA,Nigeria,200.96,923.77,375.77,Africa,1960-10-01
BGD,Bangladesh,167.09,147.57,245.63,Asia,1971-03-26
RUS,Russia,146.79,17098.25,1530.75,,1992-06-12
MEX,Mexico,126.58,1964.38,1158.23,N.America,1810-09-16
JPN,Japan,126.22,377.97,4872.42,Asia,
DEU,Germany,83.02,357.11,3693.2,Europe,
FRA,France,67.02,640.68,2582.49,Europe,1789-07-14
GBR,UK,66.44,242.5,2631.23,Europe,
ITA,Italy,60.36,301.34,1943.84,Europe,
ARG,Argentina,44.94,2780.4,637.49,S.America,1816-07-09
DZA,Algeria,43.38,2381.74,167.56,Africa,1962-07-05
CAN,Canada,37.59,9984.67,1647.12,N.America,1867-07-01
AUS,Australia,25.47,7692.02,1408.68,Oceania,
KAZ,Kazakhstan,18.53,2724.9,159.41,Asia,1991-12-16
You can see the following columns:
- The column at index 0 contains the row labels.
- The column at index 1 contains the country names.
- The column at index 3 contains the areas.
Similarly, read_sql() has the optional parameter columns that takes a list of column names to read:
>>>
>>> df = pd.read_sql('data.db', con=engine, index_col='ID',
...                  columns=['COUNTRY', 'AREA'])
>>> df.index.name = None
>>> df
        COUNTRY      AREA
CHN       China   9596.96
IND       India   3287.26
USA          US   9833.52
IDN   Indonesia   1910.93
BRA      Brazil   8515.77
PAK    Pakistan    881.91
NGA     Nigeria    923.77
BGD  Bangladesh    147.57
RUS      Russia  17098.25
MEX      Mexico   1964.38
JPN       Japan    377.97
DEU     Germany    357.11
FRA      France    640.68
GBR          UK    242.50
ITA       Italy    301.34
ARG   Argentina   2780.40
DZA     Algeria   2381.74
CAN      Canada   9984.67
AUS   Australia   7692.02
KAZ  Kazakhstan   2724.90
Once again, the DataFrame only contains the columns with the names of the countries and areas. If columns is None or omitted, then all of the columns will be read, as you saw before. The default behavior is columns=None.
Omit Rows
When you test an algorithm for data processing or machine learning, you often don't need the entire dataset. It's convenient to load only a subset of the data to speed up the process. The Pandas read_csv() and read_excel() functions have some optional parameters that allow you to select which rows you want to load:
- skiprows: either the number of rows to skip at the beginning of the file if it's an integer, or the zero-based indices of the rows to skip if it's a list-like object
- skipfooter: the number of rows to skip at the end of the file
- nrows: the number of rows to read
Here's how you would skip rows with odd zero-based indices, keeping the even ones:
>>>
>>> df = pd.read_csv('data.csv', index_col=0, skiprows=range(1, 20, 2))
>>> df
        COUNTRY      POP     AREA      GDP       CONT     IND_DAY
IND       India  1351.16  3287.26  2575.67       Asia  1947-08-15
IDN   Indonesia   268.07  1910.93  1015.54       Asia  1945-08-17
PAK    Pakistan   205.71   881.91   302.14       Asia  1947-08-14
BGD  Bangladesh   167.09   147.57   245.63       Asia  1971-03-26
MEX      Mexico   126.58  1964.38  1158.23  N.America  1810-09-16
DEU     Germany    83.02   357.11  3693.20     Europe         NaN
GBR          UK    66.44   242.50  2631.23     Europe         NaN
ARG   Argentina    44.94  2780.40   637.49  S.America  1816-07-09
CAN      Canada    37.59  9984.67  1647.12  N.America  1867-07-01
KAZ  Kazakhstan    18.53  2724.90   159.41       Asia  1991-12-16
In this example, skiprows is range(1, 20, 2) and corresponds to the values 1, 3, …, 19. The instances of the Python built-in class range behave like sequences. The first row of the file data.csv is the header row. It has the index 0, so Pandas loads it in. The second row with index 1 corresponds to the label CHN, and Pandas skips it. The third row with the index 2 and label IND is loaded, and so on.
If you want to choose rows randomly, then skiprows can be a list or NumPy array with pseudo-random numbers, obtained either with pure Python or with NumPy.
Force Less Precise Data Types
If you're okay with less precise data types, then you can potentially save a significant amount of memory! First, get the data types with .dtypes again:
>>>
>>> df = pd.read_csv('data.csv', index_col=0, parse_dates=['IND_DAY'])
>>> df.dtypes
COUNTRY            object
POP               float64
AREA              float64
GDP               float64
CONT               object
IND_DAY    datetime64[ns]
dtype: object
The columns with the floating-point numbers are 64-bit floats. Each number of this type float64 consumes 64 bits or 8 bytes. Each column has 20 numbers and requires 160 bytes. You can verify this with .memory_usage():
>>>
>>> df.memory_usage()
Index      160
COUNTRY    160
POP        160
AREA       160
GDP        160
CONT       160
IND_DAY    160
dtype: int64
.memory_usage() returns an instance of Series with the memory usage of each column in bytes. You can conveniently combine it with .loc[] and .sum() to get the memory for a group of columns:
>>>
>>> df.loc[:, ['POP', 'AREA', 'GDP']].memory_usage(index=False).sum()
480
This example shows how you can combine the numeric columns 'POP', 'AREA', and 'GDP' to get their total memory requirement. The argument index=False excludes data for row labels from the resulting Series object. For these three columns, you'll need 480 bytes.
You can also extract the data values in the form of a NumPy array with .to_numpy() or .values. Then, use the .nbytes attribute to get the total bytes consumed by the items of the array:
>>>
>>> df.loc[:, ['POP', 'AREA', 'GDP']].to_numpy().nbytes
480
The result is the same 480 bytes. So, how do you save memory?
In this case, you can specify that your numeric columns 'POP', 'AREA', and 'GDP' should have the type float32. Use the optional parameter dtype to do this:
>>>
>>> dtypes = {'POP': 'float32', 'AREA': 'float32', 'GDP': 'float32'}
>>> df = pd.read_csv('data.csv', index_col=0, dtype=dtypes,
...                  parse_dates=['IND_DAY'])
The dictionary dtypes specifies the desired data types for each column. It's passed to the Pandas read_csv() function as the argument that corresponds to the parameter dtype.
Now you can verify that each numeric column needs 80 bytes, or 4 bytes per item:
>>>
>>> df.dtypes
COUNTRY            object
POP               float32
AREA              float32
GDP               float32
CONT               object
IND_DAY    datetime64[ns]
dtype: object
>>> df.memory_usage()
Index      160
COUNTRY    160
POP         80
AREA        80
GDP         80
CONT       160
IND_DAY    160
dtype: int64
>>> df.loc[:, ['POP', 'AREA', 'GDP']].memory_usage(index=False).sum()
240
>>> df.loc[:, ['POP', 'AREA', 'GDP']].to_numpy().nbytes
240
Each value is a floating-point number of 32 bits or 4 bytes. The three numeric columns contain 20 items each. In total, you'll need 240 bytes of memory when you work with the type float32. This is half the size of the 480 bytes you'd need to work with float64.
In addition to saving memory, you can significantly reduce the time required to process data by using float32 instead of float64 in some cases.
Use Chunks to Iterate Through Files
Another way to deal with very large datasets is to split the data into smaller chunks and process one chunk at a time. If you use read_csv(), read_json() or read_sql(), then you can specify the optional parameter chunksize:
>>>
>>> data_chunk = pd.read_csv('data.csv', index_col=0, chunksize=8)
>>> type(data_chunk)
<class 'pandas.io.parsers.TextFileReader'>
>>> hasattr(data_chunk, '__iter__')
True
>>> hasattr(data_chunk, '__next__')
True
chunksize defaults to None and can take on an integer value that indicates the number of items in a single chunk. When chunksize is an integer, read_csv() returns an iterable that you can use in a for loop to get and process only a fragment of the dataset in each iteration:
>>>
>>> for df_chunk in pd.read_csv('data.csv', index_col=0, chunksize=8):
...     print(df_chunk, end='\n\n')
...     print('memory:', df_chunk.memory_usage().sum(), 'bytes',
...           end='\n\n\n')
...
        COUNTRY      POP     AREA       GDP       CONT     IND_DAY
CHN       China  1398.72  9596.96  12234.78       Asia         NaN
IND       India  1351.16  3287.26   2575.67       Asia  1947-08-15
USA          US   329.74  9833.52  19485.39  N.America  1776-07-04
IDN   Indonesia   268.07  1910.93   1015.54       Asia  1945-08-17
BRA      Brazil   210.32  8515.77   2055.51  S.America  1822-09-07
PAK    Pakistan   205.71   881.91    302.14       Asia  1947-08-14
NGA     Nigeria   200.96   923.77    375.77     Africa  1960-10-01
BGD  Bangladesh   167.09   147.57    245.63       Asia  1971-03-26

memory: 448 bytes


       COUNTRY     POP      AREA      GDP       CONT     IND_DAY
RUS     Russia  146.79  17098.25  1530.75        NaN  1992-06-12
MEX     Mexico  126.58   1964.38  1158.23  N.America  1810-09-16
JPN      Japan  126.22    377.97  4872.42       Asia         NaN
DEU    Germany   83.02    357.11  3693.20     Europe         NaN
FRA     France   67.02    640.68  2582.49     Europe  1789-07-14
GBR         UK   66.44    242.50  2631.23     Europe         NaN
ITA      Italy   60.36    301.34  1943.84     Europe         NaN
ARG  Argentina   44.94   2780.40   637.49  S.America  1816-07-09

memory: 448 bytes


        COUNTRY    POP     AREA      GDP       CONT     IND_DAY
DZA     Algeria  43.38  2381.74   167.56     Africa  1962-07-05
CAN      Canada  37.59  9984.67  1647.12  N.America  1867-07-01
AUS   Australia  25.47  7692.02  1408.68    Oceania         NaN
KAZ  Kazakhstan  18.53  2724.90   159.41       Asia  1991-12-16

memory: 224 bytes
In this example, the chunksize is 8. The first iteration of the for loop returns a DataFrame with the first eight rows of the dataset only. The second iteration returns another DataFrame with the next eight rows. The third and last iteration returns the remaining four rows.
In each iteration, you get and process the DataFrame with the number of rows equal to chunksize. It's possible to have fewer rows than the value of chunksize in the last iteration. You can use this functionality to control the amount of memory required to process data and keep that amount reasonably small.
Conclusion
You now know how to save the data and labels from Pandas DataFrame objects to different kinds of files. You also know how to load your data from files and create DataFrame objects.
You've used the Pandas read_csv() and .to_csv() methods to read and write CSV files. You also used similar methods to read and write Excel, JSON, HTML, SQL, and pickle files. These functions are very convenient and widely used. They allow you to save or load your data in a single function or method call.
You've also learned how to save time, memory, and disk space when working with large data files:
- Compress or decompress files
- Choose the rows and columns you want to load
- Use less precise data types
- Split data into chunks and process them one by one
You've mastered a significant step in the machine learning and data science process! If you have any questions or comments, then please put them in the comments section below.
Source: https://realpython.com/pandas-read-write-files/