Pandas
- pandas is a Python library for data processing and analysis. Pandas provides a great range of methods to modify and operate on data; in particular, it allows queries and joins of tables.
- In pandas, you can use various types (for example, integers, dates, floating-point numbers, and strings) in a single table.
Another valuable tool provided by pandas is its ability to ingest from a great variety of file formats and databases, like SQL, Excel files, JSON, and comma-separated values (CSV) files.
It is built around a data structure called the Series and DataFrame.
loading the library
import pandas as pd
Basic
Making a Series
basic
import pandas as pd sales = ["David", "Bob", "Josh"] series = pd.Series(data=sales) print(series)
You can set the column name if you like.
import pandas as pd sales = ["David", "Bob", "Josh"] column = "Members" series = pd.Series(data=sales, name=column) print(series)
Making a DataFrame
- DataFrame is more useful than Series, so it is more often used.
- DataFrame assembles Series, so you can think like the Series is a vector and DataFrame is a matrix.
import pandas as pd sales = [["P001", "Windows", 85000], ["P002", "Mac", 120000], ["P003", "Windows", 200000], ["P004", "Mac", 130000]] columns = ["Product ID", "OS", "Price"] df = pd.DataFrame(data=sales, columns=columns) print(df)
Statictics
- Using simple functions
df["Price"].min() df["Price"].max() df["Price"].mean()
Sort
- ascending: from low value to high value
- descending: from high value to low value
df = df.sort_values(by="Price") # ascending df = df.sort_values(by="Price", ascending=False) # descending
Print several rows
print(df.head())
Filtering: taking subgroup
df_win = df[df["OS"]=="Windows"]
df_mac = df[df["OS"]=="Mac"]
df_new = df[(df["Price"] > 100000) & (df["OS"]=="Mac")] # multiple condition
Importing from CSV (comma separated file) file
import pandas as pd
df = pd.read_csv("sample.csv")
print(df.head())
Exercise
- Suppose you have sales data in a CSV file (sales_data.csv) containing a column named "Sales".
- Load this data into a Pandas DataFrame and see the DataFrame with
df.head()
function. - Calculate the total, average, max, min sales with
.sum()
,mean()
,max()
, andmin()
functions. answer
Some important commands
Let's analye the following DataFrame again.
import pandas as pd sales = [["P001", "Windows", 85000], ["P002", "Mac", 120000], ["P003", "Windows", 200000], ["P004", "Mac", 130000]] columns = ["Product ID", "OS", "Price"] df = pd.DataFrame(data=sales, columns=columns) print(df)
Here are examples showcasing basic usages of Pandas:
Access specific rows using iloc (by index)
specific_element = df.iloc[0, 0]
Summary statistics of numerical columns
summary_stats = df.describe()
Add a new column
df["2xPrice"] = df["Price"]*2
Drop a column
df_new = df.drop(columns=["Price"], inplace=False)
inplace=True
will replace the df itself.
Taking mean value for grouped data
grouped_data = df.groupby("OS").mean()
# When numeric data and categorical data are mixed
grouped_data = df.groupby("OS").mean(numeric_only=True)
Plot
import matplotlib.pyplot as plt
df.plot(x="Product ID", y="Price", kind="bar")
plt.show()
- Following plots can be made by setting
kind
argument.- "line" : line plot (default)
- "scatter": scatter plot
- "bar" : vertical bar plot
- "barh" : horizontal bar plot
- "hist" : histogram
- "box" : boxplot
- "kde" : Kernel Density Estimation plot
- "density" : same as ‘kde’
- "area" : area plot
- "pie" : pie plot
Exporting to CSV file
- You can export th DataFrame to a CSV file
df.to_csv("computers.csv", index=False)
Reading Microsoft Excel data
- After installing xlrd by
pip install xlrd
, you can dodf = pd.read_excel("water-spectra.xls")
Exercise 2
- Let's analyze the emplyee information stored in "employee.csv", as follows.
- Import a csv file to a Pandas DataFrame.
- Extract the high-saraly group, whose "Salary" is higher than "55000".
- Calculate and print the mean age of high-salary group and all the employee.
- Calculate the mean values of each department, by using
groupby()
function. - Make a scatter plot of "Age" vs. "Salary". answer