Pandas

pandas is a Python library for data processing and analysis. Pandas provides a great range of methods to modify and operate on data; in particular, it allows queries and joins of tables.
In pandas, you can use various types (for example, integers, dates, floating-point numbers, and strings) in a single table.
Another valuable tool provided by pandas is its ability to ingest from a great variety of file formats and databases, like SQL, Excel files, JSON, and comma-separated values (CSV) files.
It is built around a data structure called the Series and DataFrame.

loading the library

  import pandas as pd

Basic

Making a Series

basic

import pandas as pd
sales = ["David", "Bob", "Josh"]
series = pd.Series(data=sales)
print(series)

You can set the column name if you like.

import pandas as pd
sales  = ["David", "Bob", "Josh"]
column = "Members"
series = pd.Series(data=sales, name=column)
print(series)

Making a DataFrame

DataFrame is more useful than Series, so it is more often used.

DataFrame assembles Series, so you can think like the Series is a vector and DataFrame is a matrix.

import pandas as pd
sales = [["P001", "Windows", 85000],
        ["P002", "Mac", 120000],
        ["P003", "Windows", 200000],
        ["P004", "Mac", 130000]]
columns = ["Product ID", "OS", "Price"]
df = pd.DataFrame(data=sales, columns=columns)
print(df)

Statictics

Using simple functions

df["Price"].min()
df["Price"].max()
df["Price"].mean()

Sort

ascending: from low value to high value

descending: from high value to low value

df = df.sort_values(by="Price")  # ascending
df = df.sort_values(by="Price", ascending=False)  # descending

Print several rows

print(df.head())

Filtering: taking subgroup

df_win = df[df["OS"]=="Windows"]
df_mac = df[df["OS"]=="Mac"]

df_new = df[(df["Price"] > 100000) & (df["OS"]=="Mac")]  # multiple condition

Importing from CSV (comma separated file) file

import pandas as pd
df = pd.read_csv("sample.csv")
print(df.head())

Exercise

Suppose you have sales data in a CSV file (sales_data.csv) containing a column named "Sales".
Load this data into a Pandas DataFrame and see the DataFrame with df.head() function.
Calculate the total, average, max, min sales with .sum(), mean(), max(), and min() functions. answer

Some important commands

Let's analye the following DataFrame again.

import pandas as pd
sales = [["P001", "Windows", 85000],
        ["P002", "Mac", 120000],
        ["P003", "Windows", 200000],
        ["P004", "Mac", 130000]]
columns = ["Product ID", "OS", "Price"]
df = pd.DataFrame(data=sales, columns=columns)
print(df)

Here are examples showcasing basic usages of Pandas:

Access specific rows using iloc (by index)

specific_element = df.iloc[0, 0]

Summary statistics of numerical columns

summary_stats = df.describe()

Add a new column

df["2xPrice"] = df["Price"]*2

Drop a column

df_new = df.drop(columns=["Price"], inplace=False)

inplace=True will replace the df itself.

Taking mean value for grouped data

grouped_data = df.groupby("OS").mean()

# When numeric data and categorical data are mixed
grouped_data = df.groupby("OS").mean(numeric_only=True)

Plot

import matplotlib.pyplot as plt
df.plot(x="Product ID", y="Price", kind="bar")
plt.show()

Following plots can be made by setting kind argument.
- "line" : line plot (default)
- "scatter": scatter plot
- "bar" : vertical bar plot
- "barh" : horizontal bar plot
- "hist" : histogram
- "box" : boxplot
- "kde" : Kernel Density Estimation plot
- "density" : same as ‘kde’
- "area" : area plot
- "pie" : pie plot

Exporting to CSV file

You can export th DataFrame to a CSV file

df.to_csv("computers.csv", index=False)

Reading Microsoft Excel data

After installing xlrd by pip install xlrd, you can do
```
df = pd.read_excel("water-spectra.xls")
```

Exercise 2

Let's analyze the emplyee information stored in "employee.csv", as follows.
Import a csv file to a Pandas DataFrame.
Extract the high-saraly group, whose "Salary" is higher than "55000".
Calculate and print the mean age of high-salary group and all the employee.
Calculate the mean values of each department, by using groupby() function.
Make a scatter plot of "Age" vs. "Salary". answer

Pandas

Pandas

loading the library

Basic

Making a Series

Making a DataFrame

Statictics

Sort

Print several rows

Filtering: taking subgroup

Importing from CSV (comma separated file) file

Exercise

Some important commands

Access specific rows using iloc (by index)

Summary statistics of numerical columns

Add a new column

Drop a column

Taking mean value for grouped data

Plot

Exporting to CSV file

Reading Microsoft Excel data

Exercise 2

results matching ""

No results matching ""