Home All Chapters Previous Next

Chapter 7. Working with Python in Cloud-Based Environments

"The best tool is the one you'll actually use." — Anonymous Data Scientist

In today's business analytics landscape, the ability to work with data programmatically has become essential. While spreadsheets remain valuable for many tasks, modern analytics increasingly requires the flexibility, scalability, and reproducibility that programming languages provide. Python has emerged as the lingua franca of business analytics, offering a gentle learning curve for beginners while providing the power and sophistication needed for advanced work.

This chapter introduces Python for business analytics through the lens of cloud-based environments, particularly Google Colab. Rather than wrestling with software installation and environment configuration, cloud-based platforms allow you to start analyzing data immediately—all you need is a web browser. Whether you're a business analyst looking to expand your technical skills, a manager seeking to understand your team's tools, or a student preparing for a data-driven career, this chapter provides the foundation you need to work confidently with Python in modern analytics environments.

We'll explore why Python has become the tool of choice for business analytics, how to leverage cloud-based platforms for immediate productivity, and the essential Python skills needed for data manipulation, analysis, and visualization. By the end of this chapter, you'll be able to set up your own analytics environment, import and analyze data, create compelling visualizations, and share reproducible analyses with colleagues.

7.1 Why Python for Business Analytics

The Rise of Python in Business

Python has experienced explosive growth in business analytics over the past decade. According to the 2024 Stack Overflow Developer Survey, Python is the most wanted programming language, and the 2024 Kaggle State of Data Science survey shows that over 80% of data professionals use Python regularly.

Why has Python become so dominant?

1. Readability and Accessibility

Python's syntax is designed to be readable and intuitive, resembling natural language more than traditional programming languages.

Example Comparison:

# Python: Calculate average sales

average_sales = sum(sales_data) / len(sales_data)

# Compare to equivalent in other languages (more verbose)

# Java would require type declarations, class structure, etc.

# SQL would require aggregation syntax

# Excel formula: =AVERAGE(A1:A100)

Python code reads almost like English: "average sales equals the sum of sales data divided by the length of sales data."

2. Comprehensive Ecosystem for Analytics

Python offers a rich ecosystem of libraries specifically designed for data work:

Library

Purpose

Business Use Cases

pandas

Data manipulation and analysis

Cleaning data, aggregating sales reports, merging datasets

numpy

Numerical computing

Statistical calculations, array operations, financial modeling

matplotlib

Static visualizations

Creating charts for reports and presentations

seaborn

Statistical visualizations

Exploring distributions, correlations, patterns

scikit-learn

Machine learning

Predictive modeling, customer segmentation, forecasting

statsmodels

Statistical modeling

Regression analysis, hypothesis testing, time series

plotly

Interactive visualizations

Dashboards, interactive reports

requests

API interactions

Pulling data from web services, APIs

openpyxl/xlsxwriter

Excel integration

Reading/writing Excel files, automating reports

3. Versatility Across the Analytics Workflow

Python supports every stage of the analytics process:

Data Collection → Data Cleaning → Analysis → Visualization → Modeling → Deployment

Example Workflow:

import pandas as pd

import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression

# 1. Data Collection

data = pd.read_csv('sales_data.csv')

# 2. Data Cleaning

data = data.dropna()

data['date'] = pd.to_datetime(data['date'])

# 3. Analysis

monthly_sales = data.groupby(data['date'].dt.to_period('M'))['sales'].sum()

# 4. Visualization

monthly_sales.plot(kind='line', title='Monthly Sales Trend')

plt.show()

# 5. Modeling

X = data[['marketing_spend', 'price']]

y = data['sales']

model = LinearRegression().fit(X, y)

# 6. Prediction

predicted_sales = model.predict([[50000, 29.99]])

All in one language, one environment, one workflow.

4. Free and Open Source

Unlike proprietary tools (SAS, STATA, Tableau), Python is:

Business Impact:  A company can equip an entire analytics team with Python at zero software cost, compared to tens of thousands of dollars for proprietary alternatives.

5. Industry Adoption and Career Value

Python skills are in high demand:

6. Integration and Automation

Python excels at connecting different systems and automating workflows:

Example Use Cases:

Example: Automated weekly sales report

import pandas as pd

import smtplib

from email.mime.text import MIMEText

from datetime import datetime

# Pull data

data = pd.read_sql("SELECT * FROM sales WHERE date >= CURRENT_DATE - 7", connection)

# Analyze

weekly_total = data['sales'].sum()

top_products = data.groupby('product')['sales'].sum().nlargest(5)

# Create report

report = f"""

Weekly Sales Report - {datetime.now().strftime('%Y-%m-%d')}

Total Sales: ${weekly_total:,.2f}

Top 5 Products:

{top_products.to_string()}

"""

# Email report (simplified)

# send_email(report, recipients=['team@company.com'])

7. Reproducibility and Documentation

Python code serves as documentation of your analysis:

Contrast with Excel:

When to Use Python vs. Other Tools

Python isn't always the best choice. Here's a practical decision framework:

Scenario

Best Tool

Why

Quick calculation, small dataset (<1000 rows)

Excel

Faster for simple tasks, visual interface

Ad-hoc data exploration, one-time analysis

Excel or Python

Either works; choose based on complexity

Recurring analysis, automation needed

Python

Reproducibility, automation capabilities

Large datasets (>100K rows)

Python or SQL

Performance, memory efficiency

Complex transformations, multiple data sources

Python

Flexibility, integration capabilities

Statistical modeling, machine learning

Python or R

Specialized libraries, industry standard

Interactive dashboards for business users

Tableau/Power BI

User-friendly interface, no coding required

Production deployment, real-time systems

Python

Integration, scalability, automation

The Hybrid Approach:

Many analysts use a combination:

Python vs. R: A Brief Comparison

Both Python and R are excellent for analytics. Here's a practical comparison:

Aspect

Python

R

Primary Design

General-purpose programming

Statistical computing

Learning Curve

Gentler for programming beginners

Steeper for non-statisticians

Business Adoption

Broader (analytics + engineering)

Primarily analytics/research

Visualization

Good (matplotlib, seaborn, plotly)

Excellent (ggplot2)

Machine Learning

Excellent (scikit-learn, TensorFlow)

Good (caret, tidymodels)

Statistical Analysis

Good (statsmodels, scipy)

Excellent (built-in, specialized packages)

Deployment

Easier (general-purpose language)

More challenging

Community

Larger, more diverse

Smaller, more specialized

Job Market

Broader demand

More specialized roles

Recommendation for Business Analysts:  Start with Python. It's more versatile, has broader industry adoption, and skills transfer to other domains (web development, automation, data engineering).

The Business Case for Python

For Individual Analysts:

For Organizations:

ROI Example:

A business analyst spending 10 hours/week on manual data processing can often reduce this to 2 hours/week with Python automation:

7.2 Introduction to Cloud-Based Analytics Environments

The Evolution of Analytics Environments

Traditional Approach:

  1. Install Python on your computer
  2. Configure environment and install libraries
  3. Manage dependencies and versions
  4. Deal with "works on my machine" problems
  5. Struggle with collaboration and sharing

Cloud-Based Approach:

  1. Open a web browser
  2. Start analyzing data
  3. Collaborate in real-time
  4. Access from anywhere
  5. No installation or configuration needed

What Are Cloud-Based Analytics Environments?

Cloud-based analytics environments are web-based platforms that provide:

Benefits of Cloud-Based Environments

1. Zero Setup Time

Traditional Installation:

# Install Python

# Install pip

# Install libraries

pip install pandas numpy matplotlib seaborn scikit-learn

# Troubleshoot version conflicts

# Configure IDE

# Set up virtual environments

# Debug path issues

# ...hours later...

Cloud-Based:

1. Go to colab.research.google.com

2. Click "New Notebook"

3. Start codin g

2. Consistent Environment

Everyone uses the same Python version and libraries—no more "works on my machine" problems.

3. Access to Powerful Hardware

Cloud platforms provide:

4. Collaboration and Sharing

Share notebooks like Google Docs:

5. Accessibility

Work from anywhere:

6. Integration with Cloud Services

Easy connection to:

7.2.1 Google Colab: Features and Limitations

Google Colaboratory  (Colab) is a free, cloud-based Jupyter notebook environment provided by Google.

Key Features

1. Free Access to Computing Resources

2. Pre-installed Libraries

Most common libraries are already installed:

import pandas as pd          # ✓ Pre-installed

import numpy as np           # ✓ Pre-installed

import matplotlib.pyplot as plt  # ✓ Pre-installed

import seaborn as sns        # ✓ Pre-installed

import sklearn               # ✓ Pre-installed

For additional libraries:

!pip install library_name

3. Google Drive Integration

Mount your Google Drive to access files:

from google.colab import drive

drive.mount('/content/drive')

# Now access files like:

data = pd.read_csv('/content/drive/MyDrive/data/sales.csv')

4. Interactive Widgets and Forms

Create user-friendly interfaces:

#@title Enter Parameters { run: "auto" }

start_date = "2024-01-01" #@param {type:"date"}

region = "North" #@param ["North", "South", "East", "West"]

threshold = 1000 #@param {type:"slider", min:0, max:10000, step:100}

print(f"Analyzing {region} region from {start_date} with threshold {threshold}")

5. Markdown and Rich Text

Mix code with formatted documentation:

# Analysis Title

This notebook analyzes **Q3 sales data** to identify trends.

## Key Findings

- Sales increased 15% YoY

- Northeast region outperformed

6. Built-in Data Visualization

Visualizations render inline:

import matplotlib.pyplot as plt

plt.plot([1, 2, 3, 4], [10, 20, 25, 30])

plt.title('Sales Trend')

plt.show()  # Displays directly in notebook

7. Code Snippets Library

Access pre-written code for common tasks:

8. Keyboard Shortcuts

Efficient navigation and editing:

Limitations

1. Session Timeouts

# Save intermediate results

data.to_csv('/content/drive/MyDrive/intermediate_results.csv', index=False)

2. Temporary Storage

# BAD: Saved to temporary storage

data.to_csv('results.csv')  # Lost when session ends!

# GOOD: Saved to Google Drive

data.to_csv('/content/drive/MyDrive/results.csv')  # Persists

3. Resource Limits

Free tier limitations:

Workaround for large datasets:

# Read data in chunks

chunk_size = 10000

chunks = []

for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):

    # Process each chunk

    processed = chunk[chunk['sales'] > 1000]

    chunks.append(processed)

data = pd.concat(chunks, ignore_index=True)

4. No Terminal Access

# Shell commands work with ! prefix

!ls /content/drive/MyDrive/

!wget https://example.com/data.csv

5. Internet Dependency

6. Privacy Considerations

7. Limited Customization

Colab Pro and Pro+

For power users, Google offers paid tiers:

Feature

Free

Pro (~$10/month)

Pro+ (~$50/month)

RAM

12-13 GB

Up to 32 GB

Up to 52 GB

Session Length

12 hours

24 hours

24 hours

Idle Timeout

90 min

Longer

Longer

GPU Access

Limited

Priority

Priority

Background Execution

No

Yes

Yes

Compute Units

Limited

More

Most

When to upgrade:

7.2.2 Alternatives: Jupyter Notebook, VS Code, and Hosted Platforms (Conceptual)

While this chapter focuses on Google Colab, it's valuable to understand the broader ecosystem.

Jupyter Notebook (Local)

What it is:  The original notebook interface, running on your own computer.

Pros:

Cons:

When to use:

Getting Started (conceptual):

# Install Jupyter

pip install jupyter

# Launch

jupyter notebook

# Opens in browser at localhost:8888

JupyterLab (Local)

What it is:  Next-generation interface for Jupyter, more IDE-like.

Additional features over Jupyter Notebook:

When to use:  Same as Jupyter Notebook, but for users who want a more integrated development environment.

VS Code with Jupyter Extension

What it is:  Microsoft's Visual Studio Code editor with Jupyter notebook support.

Pros:

Cons:

When to use:

Kaggle Notebooks

What it is:  Free cloud notebooks from Kaggle (owned by Google).

Pros:

Cons:

When to use:

Databricks Community Edition

What it is:  Free tier of Databricks, a cloud analytics platform.

Pros:

Cons:

When to use:

Amazon SageMaker Studio Lab

What it is:  Free Jupyter environment from AWS.

Pros:

Cons:

Microsoft Azure Notebooks (Deprecated) / Azure ML Studio

What it is:  Microsoft's cloud notebook offerings.

Note:  Azure Notebooks was deprecated; Azure ML Studio is the current offering but requires Azure subscription.

Deepnote

What it is:  Collaborative data science notebook platform.

Pros:

Cons:

Comparison Matrix

Platform

Cost

Setup

Collaboration

Resources

Best For

Google Colab

Free

None

Good

Moderate

General business analytics, learning

Jupyter (Local)

Free

Medium

Poor

Your hardware

Sensitive data, offline work

VS Code

Free

Medium

Medium

Your hardware

Software development, complex projects

Kaggle

Free

None

Good

Moderate

Learning, competitions

Databricks

Free tier

Low

Excellent

Good

Big data, Spark

Colab Pro

$10/mo

None

Good

High

Power users, ML

Recommendation for Business Analysts

Start with Google Colab:

Transition to local Jupyter when:

7.3 Getting Started with Google Colab

7.3.1 Creating and Managing Notebooks

Creating Your First Notebook

Step 1: Access Google Colab

  1. Open your web browser
  2. Go to   https://colab.research.google.com
  3. Sign in with your Google account

Step 2: Create a New Notebook

Option A: From the welcome screen

Option B: From Google Drive

Option C: From File menu

Your first notebook opens with an empty code cell.

Understanding the Interface

┌─────────────────────────────────────────────────────┐

│ File Edit View Insert Runtime Tools Help            │ ← Menu bar

├─────────────────────────────────────────────────────┤

│ + Code  + Text  📁 Files  🔍 Search                 │ ← Toolbar

├─────────────────────────────────────────────────────┤

│ Untitled0.ipynb                                     │ ← Notebook name

├─────────────────────────────────────────────────────┤

│ [ ] # Your code here                                │ ← Code cell

│                                                      │

├─────────────────────────────────────────────────────┤

│ + Code  + Text                                      │ ← Add cells

└─────────────────────────────────────────────────────┘

Key Interface Elements:

  1. Menu Bar : File operations, editing, runtime control
  2. Toolbar : Quick access to common actions
  3. Notebook Name : Click to rename
  4. Code Cells : Where you write Python code
  5. Text Cells : Markdown for documentation
  6. Sidebar : Files, table of contents, code snippets

Your First Code

Click in the code cell and type:

print("Hello, Business Analytics!")

Run the cell:

Output appears below the cell:

Hello, Business Analytics!

Adding Cells

Add a code cell:

Add a text cell:

Working with Text Cells (Markdown)

Text cells use Markdown formatting:

# Main Heading

## Subheading

### Sub-subheading

**Bold text** and *italic text*

- Bullet point 1

- Bullet point 2

1. Numbered item

2. Another item

[Link text](https://example.com)

`inline code`

code block

> Blockquote

Example text cell:

# Q3 Sales Analysis

This notebook analyzes Q3 2024 sales data to identify:

- Top performing products

- Regional trends

- Seasonal patterns

**Data Source:** Salesforce export, October 15, 2024

Organizing Your Notebook

Best practices:

  1. Start with a title and overview

# Customer Churn Analysis - October 2024

**Objective:** Identify factors contributing to customer churn

**Data:** CRM export (10,000 customers, 2023-2024)

**Methodology:**

1. Data cleaning and exploration

2. Descriptive statistics

3. Visualization of key patterns

4. Predictive modeling

  1. Use section headings

## 1. Data Import and Cleaning

## 2. Exploratory Data Analysis

## 3. Key Findings

## 4. Recommendations

  1. Document your code

# Import required libraries

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

# Load data from Google Drive

data = pd.read_csv('/content/drive/MyDrive/data/sales.csv')

# Display first few rows to verify

print(f"Dataset shape: {data.shape}")

data.head()

  1. Use table of contents

Renaming Your Notebook

Method 1:  Click the notebook name at the top

Method 2:  File → Rename

Naming conventions:

Best practices:

Saving Your Notebook

Automatic saving:

Manual saving:

Save location:

Downloading Your Notebook

Download as .ipynb (Jupyter format):

Download as Python script:

Download with outputs:

Copying and Sharing Notebooks

Make a copy:

Share with others:

Sharing permissions:

Best practice for collaboration:

Opening Existing Notebooks

From Colab:

From Google Drive:

From GitHub:

From URL:

Managing Multiple Notebooks

Organization strategies:

  1. Folder structure in Google Drive:

My Drive/

├── Analytics Projects/

│   ├── Customer Churn/

│   │   ├── data/

│   │   ├── 01_Data_Exploration.ipynb

│   │   ├── 02_Modeling.ipynb

│   │   └── 03_Final_Report.ipynb

│   └── Sales Analysis/

│       ├── data/

│       └── Q3_Analysis.ipynb

  1. Naming conventions:
  1. Version control:

7.3.2 Connecting to Google Drive and Data Sources

Mounting Google Drive

Why mount Google Drive?

How to mount:

from google.colab import drive

drive.mount('/content/drive')

What happens:

  1. A link appears: "Go to this URL in a browser"
  2. Click the link (or it opens automatically)
  3. Choose your Google account
  4. Click "Allow"
  5. Copy the authorization code (if prompted)
  6. Paste into the notebook (if prompted)

Modern experience:  Usually just one click to authorize.

Output:

Mounted at /content/drive

Verify the mount:

# List files in your Google Drive

!ls "/content/drive/MyDrive"

File path structure:

/content/drive/MyDrive/  ← Your Google Drive root

/content/drive/MyDrive/Colab Notebooks/  ← Default notebook location

/content/drive/MyDrive/data/  ← Your data folder (example)

Reading Files from Google Drive

CSV file:

import pandas as pd

# Read CSV from Google Drive

data = pd.read_csv('/content/drive/MyDrive/data/sales_data.csv')

# Display first few rows

print(f"Loaded {len(data)} rows")

data.head()

Excel file:

# Read Excel file

data = pd.read_excel('/content/drive/MyDrive/data/sales_data.xlsx', sheet_name='Q3 Sales')

data.head()

Multiple sheets:

# Read all sheets

excel_file = pd.ExcelFile('/content/drive/MyDrive/data/sales_data.xlsx')

# See sheet names

print(excel_file.sheet_names)

# Read specific sheets

q3_data = pd.read_excel(excel_file, sheet_name='Q3')

q4_data = pd.read_excel(excel_file, sheet_name='Q4' )

Text file:

# Read text file

with open('/content/drive/MyDrive/data/notes.txt', 'r') as f:

    content = f.read()

    print(content)

Saving Files to Google Drive

Save DataFrame to CSV:

# Process data

results = data.groupby('region')['sales'].sum()

# Save to Google Drive

results.to_csv('/content/drive/MyDrive/results/regional_sales.csv')

print("Results saved to Google Drive")

Save DataFrame to Excel:

# Save to Excel

results.to_excel('/content/drive/MyDrive/results/regional_sales.xlsx',

                 sheet_name='Summary',

                 index=False)

Save multiple DataFrames to Excel:

# Create Excel writer

with pd.ExcelWriter('/content/drive/MyDrive/results/sales_report.xlsx') as writer:

    summary.to_excel(writer, sheet_name='Summary', index=False)

    details.to_excel(writer, sheet_name='Details', index=False)

    trends.to_excel(writer, sheet_name='Trends', index=False)

print("Multi-sheet report saved")

Save plot:

import matplotlib.pyplot as plt

# Create plot

plt.figure(figsize=(10, 6))

plt.plot(data['date'], data['sales'])

plt.title('Sales Trend')

plt.xlabel('Date')

plt.ylabel('Sales ($)')

# Save to Google Drive

plt.savefig('/content/drive/MyDrive/results/sales_trend.png', dpi=300, bbox_inches='tight')

plt.show()

print("Plot saved to Google Drive")

Uploading Files Directly to Colab

For small, one-time uploads:

from google.colab import files

# Upload file(s)

uploaded = files.upload()

# Access uploaded file

import pandas as pd

for filename in uploaded.keys():

    print(f"Uploaded: {filename}")

    data = pd.read_csv(filename)

What happens:

  1. "Choose Files" button appears
  2. Select file(s) from your computer
  3. Files upload to /content/  (temporary storage)
  4. ⚠️ Warning:  Files deleted when session ends!

Best practice:  Upload to Google Drive for permanent storage, or save results to Drive before session ends.

Reading Data from URLs

CSV from URL:

import pandas as pd

# Read directly from URL

url = 'https://raw.githubusercontent.com/datasets/gdp/master/data/gdp.csv'

data = pd.read_csv(url)

data.head()

Download file from URL:

# Download file using wget

!wget https://example.com/data/sales_data.csv -O /content/drive/MyDrive/data/sales_data.csv

# Or using Python

import requests

url = 'https://example.com/data/sales_data.csv'

response = requests.get(url)

with open('/content/drive/MyDrive/data/sales_data.csv', 'wb') as f:

    f.write(response.content)

print("File downloaded to Google Drive")

Connecting to Google Sheets

Read from Google Sheets:

# Method 1: Export as CSV (simple)

sheet_url = 'https://docs.google.com/spreadsheets/d/YOUR_SHEET_ID/export?format=csv&gid=0'

data = pd.read_csv(sheet_url)

# Method 2: Using gspread library (more features)

!pip install gspread

from google.colab import auth

import gspread

from google.auth import default

# Authenticate

auth.authenticate_user()

creds, _ = default()

gc = gspread.authorize(creds)

# Open sheet

sheet = gc.open('Sales Data 2024').sheet1

# Get all values

values = sheet.get_all_values()

data = pd.DataFrame(values[1:], columns=values[0])

data.head()

Write to Google Sheets:

# Update existing sheet

sheet.update([data.columns.values.tolist()] + data.values.tolist())

print("Data written to Google Sheets ")

Connecting to Databases

SQLite (local):

import sqlite3

import pandas as pd

# Connect to database

conn = sqlite3.connect('/content/drive/MyDrive/data/sales.db')

# Query data

query = "SELECT * FROM sales WHERE date >= '2024-01-01'"

data = pd.read_sql(query, conn)

# Close connection

conn.close()

data.head()

PostgreSQL/MySQL (remote):

# Install library

!pip install psycopg2-binary  # PostgreSQL

# or

!pip install pymysql  # MySQL

import pandas as pd

from sqlalchemy import create_engine

# Create connection (PostgreSQL example)

engine = create_engine('postgresql://username:password@host:port/database')

# Query data

query = "SELECT * FROM sales WHERE date >= '2024-01-01'"

data = pd.read_sql(query, engine)

data.head()

⚠️ Security Warning:  Never hardcode credentials in notebooks!

Better approach:

from google.colab import userdata

# Store secrets in Colab's secret manager

# (Left sidebar → Key icon → Add secret)

db_password = userdata.get('DB_PASSWORD')

# Use in connection string

engine = create_engine(f'postgresql://username:{db_password}@host:port/database')

Connecting to APIs

Simple API request:

import requests

import pandas as pd

# API request

url = 'https://api.example.com/sales'

headers = {'Authorization': 'Bearer YOUR_API_KEY'}

response = requests.get(url, headers=headers)

# Parse JSON response

data = response.json()

# Convert to DataFrame

df = pd.DataFrame(data['results'])

df.head()

Pagination example:

import requests

import pandas as pd

def fetch_all_pages(base_url, headers):

    all_data = []

    page = 1

   

    while True:

        response = requests.get(f"{base_url}?page={page}", headers=headers)

        data = response.json()

       

        if not data['results']:

            break

           

        all_data.extend(data['results'])

        page += 1

       

        print(f"Fetched page {page-1}")

   

    return pd.DataFrame(all_data)

# Use function

df = fetch_all_pages('https://api.example.com/sales', headers={'Authorization': 'Bearer KEY'})

print(f"Total records: {len(df)}")

Best Practices for Data Access

1. Organize your data:

Google Drive/

├── Analytics Projects/

│   ├── data/

│   │   ├── raw/           ← Original, unmodified data

│   │   ├── processed/     ← Cleaned, transformed data

│   │   └── external/      ← Data from APIs, downloads

│   ├── results/           ← Analysis outputs

│   └── notebooks/         ← Your .ipynb files

2. Document data sources:

"""

DATA SOURCES:

- sales_data.csv: Salesforce export, 2024-10-15, John Smith

- customer_data.xlsx: CRM database, 2024-10-14, automated export

- market_data.csv: Public API (https://api.example.com), 2024-10-15

"""

3. Validate data after loading:

# Load data

data = pd.read_csv('/content/drive/MyDrive/data/sales.csv')

# Validate

print(f"Shape: {data.shape}")

print(f"Columns: {data.columns.tolist()}")

print(f"Date range: {data['date'].min()} to {data['date'].max()}")

print(f"Missing values: {data.isnull().sum().sum()}")

# Check for expected columns

required_columns = ['date', 'product', 'sales', 'region']

missing_columns = set(required_columns) - set(data.columns)

if missing_columns:

    print(f"⚠️ WARNING: Missing columns: {missing_columns}")

else:

    print("✓ All required columns present")

4. Handle errors gracefully:

import pandas as pd

import os

file_path = '/content/drive/MyDrive/data/sales.csv'

try:

    data = pd.read_csv(file_path)

    print(f"✓ Successfully loaded {len(data)} rows")

except FileNotFoundError:

    print(f"❌ Error: File not found at {file_path}")

    print("Available files:")

    print(os.listdir('/content/drive/MyDrive/data/'))

except pd.errors.EmptyDataError:

    print("❌ Error: File is empty")

except Exception as e:

    print(f"❌ Error loading data: {e}")

7.4 Python Essentials for Analytics

7.4.1 Basic Syntax, Data Types, and Control Structures

Python Basics

Comments:

# This is a single-line comment

"""

This is a multi-line comment

or docstring

"""

# Use comments to explain WHY, not WHAT

# BAD: x = x + 1  # increment x

# GOOD: x = x + 1  # adjust for zero-based indexing

Variables:

# Variables don't need type declarations

company_name = "Acme Corp"

revenue = 1250000

growth_rate = 0.15

is_profitable = True

# Variable naming conventions

# ✓ snake_case for variables and functions

# ✓ PascalCase for classes

# ✓ UPPER_CASE for constants

# Good names

customer_count = 150

average_order_value = 45.50

# Bad names

x = 150  # What is x?

aov = 45.50  # Unclear abbreviation

Printing output:

# Basic print

print("Hello, World!")

# Print variables

revenue = 1250000

print(revenue)

# Formatted strings (f-strings) - BEST PRACTICE

print(f"Revenue: ${revenue:,.2f}")

# Output: Revenue: $1,250,000.00

# Multiple variables

company = "Acme"

year = 2024

print(f"{company} revenue in {year}: ${revenue:,.2f}")

Data Types

Numbers:

# Integers

customer_count = 150

year = 2024

# Floats

price = 29.99

growth_rate = 0.15

# Arithmetic operations

total_revenue = customer_count * price

print(f"Total revenue: ${total_revenue:,.2f}")

# Common operations

a = 10

b = 3

print(a + b)   # 13 (addition)

print(a - b)   # 7 (subtraction)

print(a * b)   # 30 (multiplication)

print(a / b)   # 3.333... (division)

print(a // b)  # 3 (integer division)

print(a % b)   # 1 (modulo/remainder)

print(a ** b)  # 1000 (exponentiation)

# Useful functions

import math

print(round(3.14159, 2))  # 3.14

print(abs(-10))           # 10

print(math.sqrt(16))      # 4.0

print(math.ceil(3.2))     # 4

print(math.floor(3.8))    # 3

Strings:

# String creation

company = "Acme Corp"

product = 'Widget Pro'  # Single or double quotes

# String concatenation

full_name = "Acme" + " " + "Corp"

# Better: use f-strings

full_name = f"Acme Corp"

# String methods

text = "  Business Analytics  "

print(text.strip())        # "Business Analytics" (remove whitespace)

print(text.lower())        # "  business analytics  "

print(text.upper())        # "  BUSINESS ANALYTICS  "

print(text.replace("Business", "Data"))  # "  Data Analytics  "

print(text.split())        # ['Business', 'Analytics']

# String formatting

revenue = 1250000

print(f"Revenue: ${revenue:,.2f}")  # Revenue: $1,250,000.00

print(f"Revenue: ${revenue/1000:.1f}K")  # Revenue: $1250.0K

# Multi-line strings

report = """

Q3 2024 Sales Report

====================

Total Revenue: $1.25M

Growth: 15%

"""

print(report)

Booleans:

# Boolean values

is_profitable = True

has_debt = False

# Comparison operators

revenue = 1000000

target = 900000

print(revenue > target)   # True

print(revenue < target)   # False

print(revenue == target)  # False

print(revenue != target)  # True

print(revenue >= target)  # True

print(revenue <= target)  # False

# Logical operators

high_revenue = revenue > 1000000

low_cost = True

print(high_revenue and low_cost)  # False (both must be True)

print(high_revenue or low_cost)   # True (at least one is True)

print(not high_revenue)           # True (negation)

Lists:

# Create lists

products = ["Widget", "Gadget", "Doohickey"]

sales = [1000, 1500, 1200]

mixed = ["Acme", 2024, True, 3.14]  # Can mix types

# Access elements (zero-indexed)

print(products[0])   # "Widget" (first element)

print(products[1])   # "Gadget"

print(products[-1])  # "Doohickey" (last element)

print(products[-2])  # "Gadget" (second to last)

# Slicing

print(products[0:2])   # ['Widget', 'Gadget'] (elements 0 and 1)

print(products[:2])    # ['Widget', 'Gadget'] (first 2)

print(products[1:])    # ['Gadget', 'Doohickey'] (from index 1 to end)

# List methods

products.append("Thingamajig")  # Add to end

products.insert(0, "Whatsit")   # Insert at position

products.remove("Widget")       # Remove by value

last_product = products.pop()   # Remove and return last

# List operations

print(len(products))           # Length

print(sum(sales))              # Sum (for numeric lists)

print(max(sales))              # Maximum

print(min(sales))              # Minimum

print(sorted(sales))           # Sorted copy

print(sales.index(1500))       # Find index of value

# List comprehensions (powerful!)

# Create new list by transforming another

doubled_sales = [x * 2 for x in sales]

high_sales = [x for x in sales if x > 1100]

print(doubled_sales)  # [2000, 3000, 2400]

print(high_sales)     # [1500, 1200]

Dictionaries:

# Create dictionaries (key-value pairs)

customer = {

    "name": "John Smith",

    "email": "john@example.com",

    "age": 35,

    "is_active": True

}

# Access values

print(customer["name"])        # "John Smith"

print(customer.get("email"))   # "john@example.com"

print(customer.get("phone", "N/A"))  # "N/A" (default if key doesn't exist)

# Modify dictionaries

customer["age"] = 36           # Update value

customer["phone"] = "555-1234" # Add new key-value pair

del customer["is_active"]      # Delete key

# Dictionary methods

print(customer.keys())         # dict_keys(['name', 'email', 'age', 'phone'])

print(customer.values())       # dict_values(['John Smith', 'john@example.com', 36, '555-1234'])

print(customer.items())        # Key-value pairs

# Check if key exists

if "email" in customer:

    print(f"Email: {customer['email']}")

# Business example: sales by region

regional_sales = {

    "North": 250000,

    "South": 180000,

    "East": 310000,

    "West": 220000

}

total_sales = sum(regional_sales.values())

print(f"Total sales: ${total_sales:,.0f}")

# Find top region

top_region = max(regional_sales, key=regional_sales.get)

print(f"Top region: {top_region} (${regional_sales[top_region]:,.0f})")

Control Structures

If statements:

# Basic if statement

revenue = 1250000

if revenue > 1000000:

    print("Revenue exceeds $1M!")

# If-else

if revenue > 1000000:

    print("High revenue")

else:

    print("Revenue below $1M")

# If-elif-else

if revenue > 2000000:

    status = "Excellent"

elif revenue > 1000000:

    status = "Good"

elif revenue > 500000:

    status = "Fair"

else:

    status = "Needs Improvement"

print(f"Status: {status}")

# Multiple conditions

revenue = 1250000

profit_margin = 0.18

if revenue > 1000000 and profit_margin > 0.15:

    print("Strong performance!")

# Business example: customer segmentation

customer_value = 5000

years_active = 3

if customer_value > 10000:

    segment = "VIP"

elif customer_value > 5000 and years_active > 2:

    segment = "Premium"

elif customer_value > 1000:

    segment = "Standard"

else:

    segment = "Basic"

print(f"Customer segment: {segment}")

For loops:

# Loop through list

products = ["Widget", "Gadget", "Doohickey"]

for product in products:

    print(f"Product: {product}")

# Loop through range

for i in range(5):  # 0, 1, 2, 3, 4

    print(i)

for i in range(1, 6):  # 1, 2, 3, 4, 5

    print(i)

for i in range(0, 10, 2):  # 0, 2, 4, 6, 8 (step by 2)

    print(i)

# Loop with index

for i, product in enumerate(products):

    print(f"{i+1}. {product}")

# Loop through dictionary

regional_sales = {

    "North": 250000,

    "South": 180000,

    "East": 310000

}

for region, sales in regional_sales.items():

    print(f"{region}: ${sales:,.0f}")

# Business example: calculate total sales

monthly_sales = [45000, 52000, 48000, 61000, 55000, 58000]

total = 0

for sales in monthly_sales:

    total += sales

average = total / len(monthly_sales)

print(f"Total: ${total:,.0f}")

print(f"Average: ${average:,.0f}")

# Better: use built-in functions

total = sum(monthly_sales)

average = sum(monthly_sales) / len(monthly_sales)

While loops:

# Basic while loop

count = 0

while count < 5:

    print(count)

    count += 1

# Business example: compound growth

initial_investment = 10000

target = 20000

annual_return = 0.08

years = 0

value = initial_investment

while value < target:

    value = value * (1 + annual_return)

    years += 1

    print(f"Year {years}: ${value:,.2f}")

print(f"Reached target in {years} years")

# Break and continue

for i in range(10):

    if i == 3:

        continue  # Skip this iteration

    if i == 7:

        break  # Exit loop

    print(i)

Functions

# Define a function

def calculate_revenue(units_sold, price_per_unit):

    """Calculate total revenue from units and price."""

    revenue = units_sold * price_per_unit

    return revenue

# Call the function

total_revenue = calculate_revenue(1000, 29.99)

print(f"Revenue: ${total_revenue:,.2f}")

# Function with default parameters

def calculate_discount(price, discount_rate=0.10):

    """Calculate discounted price. Default discount is 10%."""

    discounted_price = price * (1 - discount_rate)

    return discounted_price

print(calculate_discount(100))      # $90.00 (10% discount)

print(calculate_discount(100, 0.20))  # $80.00 (20% discount)

# Function with multiple return values

def calculate_metrics(revenue, costs):

    """Calculate profit and margin."""

    profit = revenue - costs

    margin = profit / revenue if revenue > 0 else 0

    return profit, margin

profit, margin = calculate_metrics(1000000, 750000)

print(f"Profit: ${profit:,.0f}")

print(f"Margin: {margin:.1%}")

# Business example: customer lifetime value

def calculate_ltv(avg_purchase, purchase_frequency, customer_lifespan):

    """

    Calculate Customer Lifetime Value.

   

    Parameters:

    - avg_purchase: Average purchase amount

    - purchase_frequency: Purchases per year

    - customer_lifespan: Years as customer

   

    Returns:

    - Customer Lifetime Value

    """

    ltv = avg_purchase * purchase_frequency * customer_lifespan

    return ltv

customer_ltv = calculate_ltv(avg_purchase=50,

                              purchase_frequency=12,

                              customer_lifespan=5)

print(f"Customer LTV: ${customer_ltv:,.0f}")

7.4.2 Working with Libraries: pandas, numpy, matplotlib, seaborn

Introduction to Libraries

Libraries extend Python's capabilities. Think of them as toolboxes:

Importing libraries:

# Standard imports

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

# These aliases (pd, np, plt, sns) are conventions

# Everyone uses them, so you should too

pandas: Data Manipulation

Creating DataFrames:

import pandas as pd

# From dictionary

data = {

    'product': ['Widget', 'Gadget', 'Doohickey', 'Thingamajig'],

    'sales': [1000, 1500, 1200, 1800],

    'region': ['North', 'South', 'North', 'East']

}

df = pd.DataFrame(data)

print(df)

Output:

      product  sales region

0       Widget   1000  North

1       Gadget   1500  South

2   Doohickey   1200  North

3  Thingamajig   1800   East

Reading data:

# From CSV

df = pd.read_csv('sales_data.csv')

# From Excel

df = pd.read_excel('sales_data.xlsx', sheet_name='Q3')

# From URL

url = 'https://example.com/data.csv'

df = pd.read_csv(url)

Exploring data:

# First/last rows

print(df.head())      # First 5 rows

print(df.head(10))    # First 10 rows

print(df.tail())      # Last 5 rows

# Shape

print(df.shape)       # (rows, columns)

# Column names

print(df.columns)

# Data types

print(df.dtypes)

# Summary statistics

print(df.describe())

# Info

print(df.info())

Selecting data:

# Select column

sales = df['sales']

print(sales)

# Select multiple columns

subset = df[['product', 'sales']]

print(subset)

# Select rows by condition

high_sales = df[df['sales'] > 1200]

print(high_sales)

# Multiple conditions

north_high_sales = df[(df['region'] == 'North') & (df['sales'] > 1000)]

# Select by position

first_row = df.iloc[0]      # First row

first_cell = df.iloc[0, 0]  # First row, first column

# Select by label

df_indexed = df.set_index('product')

widget_sales = df_indexed.loc['Widget', 'sales']

Adding/modifying columns:

# Add new column

df['revenue'] = df['sales'] * 29.99

# Modify existing column

df['sales'] = df['sales'] * 1.1  # 10% increase

# Conditional column

df['performance'] = df['sales'].apply(

    lambda x: 'High' if x > 1200 else 'Low'

)

# Or using np.where

df['performance'] = np.where(df['sales'] > 1200, 'High', 'Low')

Grouping and aggregating:

# Group by region, sum sales

regional_sales = df.groupby('region')['sales'].sum()

print(regional_sales)

# Multiple aggregations

summary = df.groupby('region').agg({

    'sales': ['sum', 'mean', 'count'],

    'revenue': 'sum'

})

print(summary)

# Reset index to make it a regular DataFrame

summary_df = summary.reset_index()

Sorting:

# Sort by sales (ascending)

df_sorted = df.sort_values('sales')

# Sort descending

df_sorted = df.sort_values('sales', ascending=False)

# Sort by multiple columns

df_sorted = df.sort_values(['region', 'sales'], ascending=[True, False])

Handling missing data:

# Check for missing values

print(df.isnull().sum())

# Drop rows with missing values

df_clean = df.dropna()

# Fill missing values

df_filled = df.fillna(0)  # Fill with 0

df_filled = df.fillna(df.mean())  # Fill with mean

# Fill specific column

df['sales'] = df['sales'].fillna(df['sales'].median())

Merging DataFrames:

# Two DataFrames

sales_df = pd.DataFrame({

    'product': ['Widget', 'Gadget'],

    'sales': [1000, 1500]

})

price_df = pd.DataFrame({

    'product': ['Widget', 'Gadget'],

    'price': [29.99, 39.99]

})

# Merge (like SQL JOIN)

merged = pd.merge(sales_df, price_df, on='product')

print(merged)

Business example:

import pandas as pd

# Load sales data

sales = pd.read_csv('sales_data.csv')

# Clean data

sales = sales.dropna()

sales['date'] = pd.to_datetime(sales['date'])

# Add calculated columns

sales['revenue'] = sales['units'] * sales['price']

sales['month'] = sales['date'].dt.to_period('M')

# Aggregate by month

monthly_summary = sales.groupby('month').agg({

    'revenue': 'sum',

    'units': 'sum',

    'customer_id': 'nunique'  # Count unique customers

}).reset_index()

monthly_summary.columns = ['month', 'total_revenue', 'total_units', 'unique_customers']

# Calculate average order value

monthly_summary['avg_order_value'] = (

    monthly_summary['total_revenue'] / monthly_summary['unique_customers']

)

# Display top 5 months

top_months = monthly_summary.sort_values('total_revenue', ascending=False).head()

print(top_months)

numpy: Numerical Computing

Creating arrays:

import numpy as np

# From list

arr = np.array([1, 2, 3, 4, 5])

print(arr)

# 2D array

matrix = np.array([[1, 2, 3], [4, 5, 6]])

print(matrix)

# Special arrays

zeros = np.zeros(5)           # [0. 0. 0. 0. 0.]

ones = np.ones(5)             # [1. 1. 1. 1. 1.]

range_arr = np.arange(0, 10, 2)  # [0 2 4 6 8]

linspace = np.linspace(0, 1, 5)  # [0.   0.25 0.5  0.75 1.  ]

Array operations:

# Element-wise operations

sales = np.array([1000, 1500, 1200, 1800])

# Add 10% to all sales

increased_sales = sales * 1.1

print(increased_sales)

# Operations between arrays

costs = np.array([600, 900, 700, 1000])

profit = sales - costs

print(profit)

# Statistical functions

print(np.mean(sales))      # Average

print(np.median(sales))    # Median

print(np.std(sales))       # Standard deviation

print(np.sum(sales))       # Sum

print(np.min(sales))       # Minimum

Exercises

These hands-on exercises will help you apply the concepts covered in this chapter. Work through them in Google Colab to build practical skills in Python-based analytics.

Exercise 1: Set Up and Explore Google Colab

Objective : Familiarize yourself with the Google Colab interface and basic operations.

Tasks :

  1. Go to   colab.research.google.com  and create a new notebook.
  2. Rename your notebook to "Chapter7_Exercise1_YourName".
  3. Create a markdown cell with a title and brief description of the notebook.
  4. Create a code cell and run a simple Python command: print("Hello, Analytics!") .
  5. Add another markdown cell explaining what you just did.

Insert a code cell that displays the current Python version:
import sys

print(f"Python version: {sys.version}")

Check which libraries are pre-installed by running:
 !pip list | grep -E "pandas|numpy|matplotlib|seaborn"

  1. Save your notebook to Google Drive.
  2. Share the notebook with "view" permissions and copy the shareable link.

Deliverable : A well-organized Colab notebook with markdown documentation and the shareable link.


Exercise 2: Import a Dataset from Multiple Sources

Objective : Practice importing data from different sources into Google Colab.

Tasks :

Part A: Import from URL

#Import the Iris dataset directly from a URL:
import pandas as pd

url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"

iris_df = pd.read_csv(url)

print(iris_df.head())

print(f"\nDataset shape: {iris_df.shape}")

Part B: Import from Google Drive

  1. Mount your Google Drive.
  2. Create a simple CSV file locally or download one, then upload it to your Google Drive.
  3. Read the file from Google Drive into a pandas DataFrame.
  4. Display the first few rows and basic information about the dataset.

Part C: Import from Local Upload

Use the file upload widget to upload a CSV file:
from google.colab import files
uploaded = files.upload()

# Read the uploaded file

import io

for filename in uploaded.keys():

    df = pd.read_csv(io.BytesIO(uploaded[filename]))

    print(f"Loaded {filename}")

    print(df.head())

Part D: Documentation

  1. Add markdown cells before each part explaining:

Deliverable : A Colab notebook demonstrating all three import methods with proper documentation.


Exercise 3: Calculate Descriptive Statistics with pandas

Objective : Use pandas to perform exploratory data analysis and calculate key statistics.

Tasks :

Load the Dataset : Use the Iris dataset from Exercise 2 or load a sales dataset:

 import pandas as pd

import numpy as np

# Sample sales data

np.random.seed(42)

sales_data = pd.DataFrame({

    'date': pd.date_range('2023-01-01', periods=100, freq='D'),

    'product': np.random.choice(['Product A', 'Product B', 'Product C'], 100),

    'region': np.random.choice(['North', 'South', 'East', 'West'], 100),

    'sales': np.random.randint(100, 1000, 100),

    'units': np.random.randint(1, 50, 100),

    'cost': np.random.randint(50, 500, 100)

})

# Calculate profit

sales_data['profit'] = sales_data['sales'] - sales_data['cost']

Basic Descriptive Statistics :

 # Overall statistics

print("=== Overall Statistics ===")

print(sales_data[['sales', 'units', 'profit']].describe())

# Mean, median, mode

print("\n=== Central Tendency ===")

print(f"Mean sales: ${sales_data['sales'].mean():.2f}")

print(f"Median sales: ${sales_data['sales'].median():.2f}")

print(f"Mode product: {sales_data['product'].mode()[0]}")

# Variability

print("\n=== Variability ===")

print(f"Sales std dev: ${sales_data['sales'].std():.2f}")

print(f"Sales variance: {sales_data['sales'].var():.2f}")

print(f"Sales range: ${sales_data['sales'].max() - sales_data['sales'].min():.2f}")

Grouped Statistics :

 # By product

print("\n=== Statistics by Product ===")

product_stats = sales_data.groupby('product').agg({

    'sales': ['mean', 'sum', 'count'],

    'profit': ['mean', 'sum'],

    'units': 'sum'

}).round(2)

print(product_stats)

# By region

print("\n=== Statistics by Region ===")

region_stats = sales_data.groupby('region')['sales'].agg([

    ('total_sales', 'sum'),

    ('avg_sales', 'mean'),

    ('min_sales', 'min'),

    ('max_sales', 'max')

]).round(2)

print(region_stats)

Time-Based Analysis :

 # Add month column

sales_data['month'] = sales_data['date'].dt.to_period('M')

# Monthly statistics

print("\n=== Monthly Statistics ===")

monthly_stats = sales_data.groupby('month').agg({

    'sales': 'sum',

    'profit': 'sum',

    'units': 'sum'

}).round(2)

print(monthly_stats)

Correlation Analysis :

 print("\n=== Correlation Matrix ===")

correlation = sales_data[['sales', 'units', 'cost', 'profit']].corr()

print(correlation.round(3))

Custom Insights :

Deliverable : A Colab notebook with:

Exercise 4: Create Basic Visualizations

Objective : Build fundamental visualizations using matplotlib and seaborn.

Tasks :

  1. Setup : Use the sales dataset from Exercise 3 or create your own.

Line Chart - Sales Over Time :

 import matplotlib.pyplot as plt

import seaborn as sns

# Set style

sns.set_style("whitegrid")

plt.figure(figsize=(12, 5))

# Daily sales

plt.plot(sales_data['date'], sales_data['sales'], marker='o',

         linestyle='-', linewidth=1, markersize=3, alpha=0.7)

plt.title('Daily Sales Over Time', fontsize=14, fontweight='bold')

plt.xlabel('Date')

plt.ylabel('Sales ($)')

plt.xticks(rotation=45)

plt.tight_layout()

plt.show()

Bar Chart - Sales by Product :

 plt.figure(figsize=(10, 6))

# Aggregate by product

product_sales = sales_data.groupby('product')['sales'].sum().sort_values(ascending=False)

# Create bar chart

plt.bar(product_sales.index, product_sales.values, color=['#1f77b4', '#ff7f0e', '#2ca02c'])

plt.title('Total Sales by Product', fontsize=14, fontweight='bold')

plt.xlabel('Product')

plt.ylabel('Total Sales ($)')

plt.xticks(rotation=0)

# Add value labels on bars

for i, v in enumerate(product_sales.values):

    plt.text(i, v + 500, f'${v:,.0f}', ha='center', fontweight='bold')

plt.tight_layout()

plt.show()

Histogram - Distribution of Sales :

 plt.figure(figsize=(10, 6))

# Using matplotlib

plt.subplot(1, 2, 1)

plt.hist(sales_data['sales'], bins=20, color='skyblue', edgecolor='black', alpha=0.7)

plt.title('Sales Distribution (Matplotlib)', fontweight='bold')

plt.xlabel('Sales ($)')

plt.ylabel('Frequency')

# Using seaborn

plt.subplot(1, 2, 2)

sns.histplot(sales_data['sales'], bins=20, kde=True, color='coral')

plt.title('Sales Distribution (Seaborn)', fontweight='bold')

plt.xlabel('Sales ($)')

plt.tight_layout()

plt.show()

Box Plot - Sales by Region :

plt.figure(figsize=(10, 6))

sns.boxplot(data=sales_data, x='region', y='sales', palette='Set2')

plt.title('Sales Distribution by Region', fontsize=14, fontweight='bold')

plt.xlabel('Region')

plt.ylabel('Sales ($)')

plt.tight_layout()

plt.show()

Scatter Plot - Sales vs. Units :

plt.figure(figsize=(10, 6))

sns.scatterplot(data=sales_data, x='units', y='sales', hue='product',

                style='product', s=100, alpha=0.7)

plt.title('Sales vs. Units Sold by Product', fontsize=14, fontweight='bold')

plt.xlabel('Units Sold')

plt.ylabel('Sales ($)')

plt.legend(title='Product', bbox_to_anchor=(1.05, 1), loc='upper left')

plt.tight_layout()

plt.show()

Multi-Panel Dashboard :

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

fig.suptitle('Sales Analytics Dashboard', fontsize=16, fontweight='bold')

# Panel 1: Sales over time

axes[0, 0].plot(sales_data['date'], sales_data['sales'], color='blue', alpha=0.6)

axes[0, 0].set_title('Sales Trend')

axes[0, 0].set_xlabel('Date')

axes[0, 0].set_ylabel('Sales ($)')

axes[0, 0].tick_params(axis='x', rotation=45)

# Panel 2: Sales by product

product_sales = sales_data.groupby('product')['sales'].sum()

axes[0, 1].bar(product_sales.index, product_sales.values, color=['#1f77b4', '#ff7f0e', '#2ca02c'])

axes[0, 1].set_title('Sales by Product')

axes[0, 1].set_xlabel('Product')

axes[0, 1].set_ylabel('Total Sales ($)')

# Panel 3: Sales distribution

axes[1, 0].hist(sales_data['sales'], bins=20, color='skyblue', edgecolor='black')

axes[1, 0].set_title('Sales Distribution')

axes[1, 0].set_xlabel('Sales ($)')

axes[1, 0].set_ylabel('Frequency')

# Panel 4: Profit by region

region_profit = sales_data.groupby('region')['profit'].mean()

axes[1, 1].barh(region_profit.index, region_profit.values, color='coral')

axes[1, 1].set_title('Average Profit by Region')

axes[1, 1].set_xlabel('Average Profit ($)')

axes[1, 1].set_ylabel('Region')

plt.tight_layout()

plt.show()

Deliverable : A Colab notebook with:

Exercise 5: Document a Notebook for a Non-Technical Manager

Objective : Practice clear communication and documentation for business stakeholders.

Scenario : You've been asked to analyze quarterly sales performance and present findings to your manager who has limited technical background.

Tasks :

Create a Professional Notebook Structure :

# Quarterly Sales Performance Analysis

**Prepared by**: [Your Name]  

**Date**: [Current Date]  

**Period**: Q1 2023

## Executive Summary

[Write 2-3 sentences summarizing key findings]

## Table of Contents

1. Data Overview

2. Sales Performance Analysis

3. Product Performance

4. Regional Analysis

5. Key Insights and Recommendations

  1. Data Overview Section :
  1. Analysis Sections with Business Language :

Example Structure for One Analysis :

## 2. Sales Performance Analysis

### Question: How did our sales trend throughout the quarter?

We want to understand if sales were growing, declining, or stable during Q1.

 # Calculate daily and weekly sales trends

daily_sales = sales_data.groupby('date')['sales'].sum()

# Create visualization

plt.figure(figsize=(12, 5))

plt.plot(daily_sales.index, daily_sales.values, marker='o', linewidth=2)

plt.title('Daily Sales Trend - Q1 2023', fontsize=14, fontweight='bold')

plt.xlabel('Date')

plt.ylabel('Sales ($)')

plt.grid(True, alpha=0.3)

plt.tight_layout()

plt.show()

# Calculate growth

first_week_avg = daily_sales[:7].mean()

last_week_avg = daily_sales[-7:].mean()

growth_rate = ((last_week_avg - first_week_avg) / first_week_avg) * 100

print(f"First week average: ${first_week_avg:,.2f}")

print(f"Last week average: ${last_week_avg:,.2f}")

print(f"Growth rate: {growth_rate:.1f}%")

 ### Key Finding

Sales showed a **positive trend** throughout Q1, with the last week averaging

${last_week_avg:,.2f} compared to ${first_week_avg:,.2f} in the first week—

a growth of {growth_rate:.1f}%.

### What This Means

- Our Q1 initiatives appear to be working

- We're building momentum heading into Q2

- Current trajectory suggests we'll exceed Q2 targets if maintained

### Recommended Action

Continue current marketing and sales strategies. Consider increasing inventory

for top-performing products to meet growing demand.

Key Insights and Recommendations Section :

## 5. Key Insights and Recommendations

### Top 3 Insights

1. **[Insight 1]**: [Explanation with supporting data]

2. **[Insight 2]**: [Explanation with supporting data]

3. **[Insight 3]**: [Explanation with supporting data]

### Recommendations

| Priority | Recommendation | Expected Impact | Timeline |

|----------|---------------|-----------------|----------|

| High | [Action 1] | [Impact] | [When] |

| Medium | [Action 2] | [Impact] | [When] |

| Low | [Action 3] | [Impact] | [When] |

### Next Steps

- [ ] [Action item 1]

- [ ] [Action item 2]

- [ ] [Action item 3]

  1. Best Practices to Follow :

Deliverable : A polished Colab notebook that:

Additional Resources

Python Learning

Google Colab Resources

pandas Documentation

Visualization Resources

Communities and Support