Module 19 - Python Data Analysis Tools

Introduction

Overview

In this module, we will introduce the basics of data science using Python. We will cover common Python modules and tools used for data analysis, as well as several Python libraries for data visualization. By the end of this module, you will have a foundation in using Python for data manipulation, analysis, and visualization.

What is Data Science?

Data Science is an interdisciplinary field that uses scientific methods, algorithms, and systems to extract knowledge and insights from structured and unstructured data. It involves various stages, including data collection, data cleaning, data analysis, data visualization, and the generation of actionable insights.

Data Science Workflow

The typical data science workflow involves the following steps:

  • Data Collection: Gathering data from various sources.
  • Data Cleaning: Removing inconsistencies and errors from the data.
  • Data Exploration: Exploring the data to understand its structure and patterns.
  • Data Analysis: Applying statistical and computational methods to derive insights.
  • Data Visualization: Presenting data and analysis in visual formats.
  • Model Building: Constructing predictive or descriptive models (optional).
  • Communication: Sharing the results with stakeholders.

While we won't have the time in this module to cover data science in real depth, the information and activities below will give you an idea of what data science is all about, and why Python is often the language of choice for doing data analysis.



Python Libraries for Data Science

NumPy: Numerical Computing in Python

NumPy is a fundamental library for numerical computing in Python. It provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays.

Example:

import numpy as np

# Create a NumPy array
data = np.array([1, 2, 3, 4, 5])

# Perform arithmetic operations on the array
squared_data = data ** 2
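
As a slightly fuller sketch of what "mathematical functions to operate on these arrays" can look like, the example below applies element-wise arithmetic and a few aggregate functions to the same array, then builds a small two-dimensional array. The variable names are chosen for illustration only.

```python
import numpy as np

# The same small array used above
data = np.array([1, 2, 3, 4, 5])

# Element-wise arithmetic: the operation is applied to every element at once
doubled = data * 2          # array([ 2,  4,  6,  8, 10])
squared = data ** 2         # array([ 1,  4,  9, 16, 25])

# Aggregate functions reduce the whole array to a single number
total = data.sum()          # 15
average = data.mean()       # 3.0
largest = data.max()        # 5

# A two-dimensional array (a matrix) with 2 rows and 3 columns
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
print(matrix.shape)         # (2, 3)

# axis=0 adds down each column, giving one total per column
column_totals = matrix.sum(axis=0)  # array([5, 7, 9])
```

Notice that no loops are needed: NumPy applies each operation to the whole array in one step.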

Pandas: Data Analysis with Python

Pandas is a powerful library for data manipulation and analysis. It provides data structures like Series and DataFrame to handle and analyze tabular data efficiently.

Example:

import pandas as pd

# Create a DataFrame from a dictionary
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35]}
df = pd.DataFrame(data)

# Select a single column (the result is a Series)
age_data = df['Age']
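
The example above selects one column. As a rough sketch of a few other common ways to pick out parts of a DataFrame, the code below reuses the same small table; the variable names are illustrative only.

```python
import pandas as pd

# Same small table as above
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35]})

# One column name gives back a Series
ages = df['Age']

# A list of column names gives back a smaller DataFrame
names_and_ages = df[['Name', 'Age']]

# Select rows by position with iloc (here, the first two rows)
first_two = df.iloc[0:2]

# Select rows by condition with a boolean mask (here, Age greater than 28)
over_28 = df[df['Age'] > 28]
print(over_28)
```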

Matplotlib: Data Visualization in Python

Matplotlib is a popular library for creating static, interactive, and animated visualizations in Python. It supports various plot types, including line plots, bar plots, scatter plots, histograms, and more.

Example:

import matplotlib.pyplot as plt

# Create a simple line plot
x = [1, 2, 3, 4, 5]
y = [10, 25, 15, 30, 20]
plt.plot(x, y)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Simple Line Plot')
plt.show()


Data Analysis with Pandas

Loading and Exploring Data

Pandas can read data from various file formats, such as CSV, Excel, and JSON. It also allows us to view the structure and summary statistics of the data.

Example:

import pandas as pd

# Load data from a CSV file
data = pd.read_csv('data.csv')

# View the first few rows of the data
print(data.head())

# Get summary statistics of the data
print(data.describe())
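
If you don't have a data.csv file handy, the sketch below shows the same steps on a tiny CSV held in memory. The CSV contents are invented for illustration; read_csv accepts a file-like object such as io.StringIO as well as a file name.

```python
import io
import pandas as pd

# Hypothetical CSV content, invented for illustration only
csv_text = """Category,Sales
A,120
B,80
A,150
B,95
"""

# Load the CSV text exactly as if it were a file on disk
data = pd.read_csv(io.StringIO(csv_text))

# Number of rows and columns
print(data.shape)

# Data type that Pandas chose for each column
print(data.dtypes)

# First two rows only
print(data.head(2))

# Count, mean, standard deviation, min, quartiles, and max for numeric columns
print(data.describe())

# A single statistic for a single column
mean_sales = data['Sales'].mean()
```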

Data Cleaning and Preprocessing

Data cleaning involves handling missing values, removing duplicates, and converting data to the correct format.

Example:

# Handling missing values: drop every row that contains at least one
data.dropna(inplace=True)

# Removing duplicates
data.drop_duplicates(inplace=True)

# Converting data types
data['Date'] = pd.to_datetime(data['Date'])
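
Dropping rows is only one way to handle missing values, and it can throw away a lot of data. The sketch below shows one possible alternative on a small invented table: count the missing values first, then fill them in. Filling with 0 is an assumption made for this example; whether it is sensible depends on your data.

```python
import pandas as pd

# Hypothetical data with one duplicate row, one missing value, and one bad date
data = pd.DataFrame({
    'Date': ['2024-01-05', '2024-01-06', '2024-01-06', 'not a date'],
    'Sales': [100.0, 200.0, 200.0, None],
})

# Count missing values in each column before deciding what to do
missing_counts = data.isna().sum()
print(missing_counts)

# Remove exact duplicate rows and renumber the remaining rows from 0
data = data.drop_duplicates().reset_index(drop=True)

# Fill missing Sales values with 0 instead of dropping the row
data['Sales'] = data['Sales'].fillna(0)

# Convert text to dates; errors='coerce' turns unreadable values into NaT
data['Date'] = pd.to_datetime(data['Date'], errors='coerce')
print(data)
```

NaT ("not a time") is how Pandas marks a missing date, so a bad value shows up as missing rather than stopping the program with an error.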

Basic Data Manipulation

Pandas provides powerful methods to filter, group, and transform data.

Example:

# Filtering data
filtered_data = data[data['Sales'] > 100]

# Grouping data
grouped_data = data.groupby('Category')['Sales'].sum()

# Adding a new column
data['Profit'] = data['Revenue'] - data['Cost']
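
The lines above assume that data already has Sales, Category, Revenue, and Cost columns. To try them out, here is a self-contained sketch using a small invented table with those columns, plus one sorting step.

```python
import pandas as pd

# Hypothetical sales records, invented for illustration only
data = pd.DataFrame({
    'Category': ['Books', 'Games', 'Books', 'Games'],
    'Sales': [120, 80, 150, 95],
    'Revenue': [1200, 1600, 1500, 1900],
    'Cost': [700, 1000, 900, 1500],
})

# Filtering: keep only the rows where Sales is greater than 100
filtered_data = data[data['Sales'] > 100]

# Grouping: add up Sales separately for each Category
grouped_data = data.groupby('Category')['Sales'].sum()
print(grouped_data)

# Adding a new column calculated from two existing columns
data['Profit'] = data['Revenue'] - data['Cost']

# Sorting: largest Profit first
sorted_data = data.sort_values('Profit', ascending=False)
print(sorted_data)
```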


Data Visualization

Line Plots, Bar Plots, and Scatter Plots

Matplotlib allows us to create various types of plots for visualizing data.

Example:

import matplotlib.pyplot as plt

# Line plot
x = [1, 2, 3, 4, 5]
y = [10, 25, 15, 30, 20]
plt.plot(x, y, label='Data Line', color='blue', linestyle='dashed')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Customized Line Plot')
plt.legend()

plt.show()
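
The example above shows a line plot. As a minimal sketch of the other two plot types named in this section, the code below draws a bar plot and a scatter plot side by side. The values and labels are made up for illustration.

```python
import matplotlib.pyplot as plt

# Made-up values for illustration
categories = ['A', 'B', 'C']
totals = [12, 7, 15]
x = [1, 2, 3, 4, 5]
y = [10, 25, 15, 30, 20]

# One figure with two side-by-side plotting areas (called axes)
fig, (bar_ax, scatter_ax) = plt.subplots(1, 2, figsize=(10, 4))

# Bar plot: one bar per category
bar_ax.bar(categories, totals, color='green')
bar_ax.set_xlabel('Category')
bar_ax.set_ylabel('Total')
bar_ax.set_title('Bar Plot')

# Scatter plot: one dot per (x, y) pair, with no connecting line
scatter_ax.scatter(x, y, color='red')
scatter_ax.set_xlabel('X-axis')
scatter_ax.set_ylabel('Y-axis')
scatter_ax.set_title('Scatter Plot')

# Keep the labels of the two plots from overlapping
fig.tight_layout()
plt.show()
```

A scatter plot is usually the better choice when you want to see how two measurements relate to each other, while a bar plot compares totals across categories.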

Histograms and Box Plots

Histograms and box plots are useful for visualizing the distribution and spread of data.

Example:

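import matplotlib.pyplot as plt

data = [15, 20, 25, 30, 35, 40, 45, 50]

# Histogram: count how many values fall into each of 5 bins
plt.figure()
plt.hist(data, bins=5)
plt.title('Histogram')

# Box plot: start a new figure so it is not drawn on top of the histogram
plt.figure()
plt.boxplot(data)
plt.title('Box Plot')

# Display both figures
plt.show()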

Customizing Plots and Adding Labels

Matplotlib provides extensive options for customizing plots, such as adding labels, titles, legends, and adjusting plot appearance.

Example:

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [10, 25, 15, 30, 20]
plt.plot(x, y, label='Data Line', color='blue', linestyle='dashed')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Customized Line Plot')
plt.legend()

plt.show()
 

Videos for Module 19 - Python Data Analysis Tools

19-1: Introduction to Data Science (12:06)

19-2: Jupyter Notebooks for Data Science (6:22)

19-3: Introducing Pandas for Data Science (8:39)

19-4: Data Science - Reading and Writing Files with Pandas (2:31)

19-5: Data Science - Subsets of Data Using Pandas (1:54)

19-6: Data Science: Descriptive Statistics Using Pandas (3:36)

19-7: Data Science - Sorting Data Using Pandas (3:01)

19-8: Data Science - Grouping Data Using Pandas (3:41)

19-9: Data Science - Machine Learning Basics (7:34)

19-10: Data Science - Machine Learning Code Example (12:17)

19-11: Data Science - Introducing Data Visualizations (3:31)

19-12: Data Science - Defining a Problem (2:27)

19-13: Data Science - Plotly Pie Charts (7:22)

19-14: Data Science - Plotly Bar Chart (6:42)

19-15: Data Science - Plotly Line Charts (3:06)

19-16: Data Science - Plotly Scatter Plot (3:43)

19-17: Data Science - Plotly Multidimensional Plots (4:57)

19-18: Reviewing Plotly Options (2:05)

19-19: A19 Explanation (3:43)

Activities for this Module

A19 - Data Science Basics

The Challenge

My kids are competitive swimmers. At swim meets, an automated timing system captures their times in each race. As a backup, parent volunteers time each swimmer with stopwatches and write the times down on paper. There are usually two backup timers per lane. The accuracy of the backup times is important, because sometimes a backup time becomes the swimmer's official time.

The data file below contains three sets of data: the times from the automated timing system, and the times recorded by two backup timers for the same races. Obviously it is very difficult to get a perfect time every race using a stopwatch, but the officials are concerned because the backup times seem to be all over the place. Use Pandas to generate some descriptive statistics and create a scatter plot to help everyone understand what's going on. You can submit a Word document with screenshots, or any similar document. Try to explain in your own words what's happening in this data.

Download the Data File

Constraints / Success Criteria

  • Must use the provided data.
  • Must include a scatter plot.
  • Must include a description or analysis of the data.
  • All code should be commented.
  • Use only what we have covered to this point in class.