In this module, we introduce the basics of data science using Python. We cover common Python modules and tools used for data analysis, along with Python libraries for data visualization. By the end of this module, you will have a solid foundation in using Python for data manipulation, analysis, and visualization.
Data Science is an interdisciplinary field that uses scientific methods, algorithms, and systems to extract knowledge and insights from structured and unstructured data. It involves various stages, including data collection, data cleaning, data analysis, data visualization, and the generation of actionable insights.
The typical data science workflow moves through these stages in order, from collecting the data to communicating the insights drawn from it.
While we won't have the time in this module to cover data science in real depth, the information and activities below will give you an idea of what data science is all about, and why Python is often the language of choice for doing data analysis.
NumPy is a fundamental library for numerical computing in Python. It provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays.
import numpy as np
# Create a NumPy array
data = np.array([1, 2, 3, 4, 5])
# Perform arithmetic operations on the array
squared_data = data ** 2
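To illustrate the multi-dimensional arrays and mathematical functions mentioned above, here is a minimal sketch that builds a small 2-D array and applies a few aggregate functions. The values and the variable names (`matrix`, `column_means`, `row_maxima`) are invented for this example.

```python
import numpy as np

# A 2-D array with 2 rows and 3 columns; the values are arbitrary sample data
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

# Element-wise arithmetic applies to every entry at once
doubled = matrix * 2

# Aggregate functions work on the whole array or along one axis
total = matrix.sum()                 # sum of all entries
column_means = matrix.mean(axis=0)   # mean of each column
row_maxima = matrix.max(axis=1)      # largest value in each row

print(matrix.shape)
print(doubled)
print(total, column_means, row_maxima)
```

Here `axis=0` collapses the rows (one result per column) and `axis=1` collapses the columns (one result per row).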
Pandas is a powerful library for data manipulation and analysis. It provides data structures like Series and DataFrame to handle and analyze tabular data efficiently.
import pandas as pd
# Create a DataFrame from a dictionary
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35]}
df = pd.DataFrame(data)
# Select a single column (the result is a Series)
age_data = df['Age']
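The snippet above selects only one column. As a rough sketch of the other common selections, the example below rebuilds the same DataFrame and picks out several columns, a row by position, and rows by condition. The variable names (`names_and_ages`, `first_row`, `over_28`) and the age threshold are chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35]})

# A single column is a Series
ages = df['Age']

# A list of column names returns a smaller DataFrame
names_and_ages = df[['Name', 'Age']]

# Rows can be selected by integer position with iloc
first_row = df.iloc[0]

# loc selects rows by a condition and, optionally, columns by name
over_28 = df.loc[df['Age'] > 28, 'Name']

print(first_row)
print(over_28.tolist())
```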
Matplotlib is a popular library for creating static, interactive, and animated visualizations in Python. It supports various plot types, including line plots, bar plots, scatter plots, histograms, and more.
import matplotlib.pyplot as plt
# Create a simple line plot
x = [1, 2, 3, 4, 5]
y = [10, 25, 15, 30, 20]
plt.plot(x, y)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Simple Line Plot')
plt.show()
Pandas can read data from various file formats, such as CSV, Excel, and JSON. It also allows us to view the structure and summary statistics of the data.
import pandas as pd
# Load data from a CSV file
data = pd.read_csv('data.csv')
# View the first few rows of the data
print(data.head())
# Get summary statistics of the data
print(data.describe())
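The example above needs a `data.csv` file on disk. To try the same calls without a file, one option is to read CSV text from memory, as sketched below. The column names and values are invented for this example, and `sample` is a hypothetical variable name.

```python
import io
import pandas as pd

# Stand-in for a file on disk; the columns and values are made up
csv_text = """Category,Sales
A,120
B,80
A,150
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head())      # first rows of the table
print(sample.shape)       # (number of rows, number of columns)
print(sample.dtypes)      # data type of each column
sample.info()             # column names, non-null counts, memory usage
print(sample.describe())  # summary statistics for the numeric columns
```

Note that `describe()` summarizes only numeric columns by default, so the text column `Category` does not appear in its output.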
Data cleaning involves handling missing values, removing duplicates, and converting data to the correct format.
# Handling missing values
data.dropna(inplace=True)
# Removing duplicates
data.drop_duplicates(inplace=True)
# Converting data types
data['Date'] = pd.to_datetime(data['Date'])
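Dropping rows is not the only way to handle missing values. The sketch below, using a small invented DataFrame, first counts the missing values and then compares dropping rows with filling the gap. Filling with the column median is just one possible choice, assumed here for illustration.

```python
import numpy as np
import pandas as pd

# Invented sample data with one missing value and one duplicated row
raw = pd.DataFrame({
    'Date': ['2024-01-01', '2024-01-02', '2024-01-02', '2024-01-03'],
    'Sales': [100.0, 150.0, 150.0, np.nan],
})

# Count missing values per column before deciding how to handle them
missing_counts = raw.isna().sum()

# Option 1: drop rows with missing values (returns a new DataFrame)
dropped = raw.dropna()

# Option 2: keep the rows and fill the gap, here with the column median
filled = raw.fillna({'Sales': raw['Sales'].median()})

# Remove exact duplicate rows, then convert the text dates to datetimes
cleaned = filled.drop_duplicates().copy()
cleaned['Date'] = pd.to_datetime(cleaned['Date'])

print(missing_counts)
print(cleaned)
```

Unlike the `inplace=True` calls above, these methods return new DataFrames, so `raw` stays unchanged and each step can be inspected separately.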
Pandas provides powerful methods to filter, group, and transform data.
# Filtering data
filtered_data = data[data['Sales'] > 100]
# Grouping data
grouped_data = data.groupby('Category')['Sales'].sum()
# Adding a new column
data['Profit'] = data['Revenue'] - data['Cost']
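The three operations above assume a DataFrame that already has `Sales`, `Category`, `Revenue`, and `Cost` columns. A self-contained sketch with invented values shows what each step returns. The name `sales` and the final `summary` table are additions for this example.

```python
import pandas as pd

# Invented sample data using the column names from the snippets above
sales = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B'],
    'Sales':    [120, 80, 150, 200],
    'Revenue':  [500, 300, 650, 900],
    'Cost':     [350, 250, 400, 600],
})

# Filtering: keep only the rows where the condition is True
high_sales = sales[sales['Sales'] > 100]

# Grouping: one total per category
totals = sales.groupby('Category')['Sales'].sum()

# Transforming: a new column computed from existing columns
sales['Profit'] = sales['Revenue'] - sales['Cost']

# Several aggregates at once for each group
summary = sales.groupby('Category')['Profit'].agg(['sum', 'mean'])

print(high_sales)
print(totals)
print(summary)
```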
Matplotlib allows us to create various types of plots for visualizing data.
import matplotlib.pyplot as plt
# Line plot
x = [1, 2, 3, 4, 5]
y = [10, 25, 15, 30, 20]
plt.plot(x, y, label='Data Line', color='blue', linestyle='dashed')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Customized Line Plot')
plt.legend()
plt.show()
Histograms and box plots are useful for visualizing the distribution and spread of data.
import matplotlib.pyplot as plt
# Sample values used for both plots
data = [15, 20, 25, 30, 35, 40, 45, 50]
# Histogram
plt.hist(data, bins=5)
plt.show()
# Box plot (drawn in a new figure once the histogram has been shown)
plt.boxplot(data)
plt.show()
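It can help to see the numbers behind these two plots. The sketch below computes the bin counts a five-bin histogram would display and the quartiles a box plot is built around, using NumPy on the same sample values. It relies on NumPy's default settings, which Matplotlib also uses by default for these plots.

```python
import numpy as np

data = [15, 20, 25, 30, 35, 40, 45, 50]

# Histogram: split the range into 5 equal-width bins and count the values in each
counts, bin_edges = np.histogram(data, bins=5)

# Box plot: the box spans the first to third quartile, with a line at the median
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1  # interquartile range, the height of the box

print(counts)       # values per bin
print(bin_edges)    # boundaries of the bins
print(q1, median, q3, iqr)
```

A wide interquartile range or very uneven bin counts are the kinds of features these plots make visible at a glance.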
Matplotlib provides extensive options for customizing plots, such as adding labels, titles, legends, and adjusting plot appearance.
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [10, 25, 15, 30, 20]
plt.plot(x, y, label='Data Line', color='blue', linestyle='dashed')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Customized Line Plot')
plt.legend()
plt.show()
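The same customizations can also be written with Matplotlib's object-oriented interface, where the figure and axes are explicit objects. The sketch below is one way to do that. The marker, grid, figure size, and legend position are extra options added here for illustration and are not part of the example above.

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [10, 25, 15, 30, 20]

# Create a figure and a single set of axes to draw on
fig, ax = plt.subplots(figsize=(6, 4))

# Line style, color, and markers are set when the line is drawn
ax.plot(x, y, label='Data Line', color='blue', linestyle='dashed', marker='o')

# Labels, title, grid, and legend are set on the axes object
ax.set_xlabel('X-axis')
ax.set_ylabel('Y-axis')
ax.set_title('Customized Line Plot')
ax.grid(True, alpha=0.3)
ax.legend(loc='upper left')

# Adjust spacing so labels are not cut off, then display the figure
fig.tight_layout()
plt.show()
```

Working with `fig` and `ax` directly becomes more convenient once a figure contains several plots, because each set of axes can be customized separately.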
My kids are competitive swimmers. At swim meets, automated timing systems capture their times in each race. As a backup, parent volunteers time each swimmer with a stopwatch and write the result down on paper. There are usually two backup timers per lane. The accuracy of the backup times is important, because sometimes a backup time becomes the swimmer's official time.
The data file below contains three sets of data: the times from the automated timing system, and the times recorded by two backup timers for the same races. Obviously it is very difficult to get a perfect time every race using a stopwatch, but the officials are concerned because the backup times seem to be all over the place. Use Pandas to generate some descriptive statistics and create a scatter plot to help everyone understand what's going on. You can submit a Word document with screenshots, or any similar document. Try to explain in your own words what's happening in this data.