Chapter 7. Working with Python in Cloud-Based Environments

"The best tool is the one you'll actually use." — Anonymous Data Scientist

In today's business analytics landscape, the ability to work with data programmatically has become essential. While spreadsheets remain valuable for many tasks, modern analytics increasingly requires the flexibility, scalability, and reproducibility that programming languages provide. Python has emerged as the lingua franca of business analytics, offering a gentle learning curve for beginners while providing the power and sophistication needed for advanced work.

This chapter introduces Python for business analytics through the lens of cloud-based environments, particularly Google Colab. Rather than wrestling with software installation and environment configuration, cloud-based platforms allow you to start analyzing data immediately—all you need is a web browser. Whether you're a business analyst looking to expand your technical skills, a manager seeking to understand your team's tools, or a student preparing for a data-driven career, this chapter provides the foundation you need to work confidently with Python in modern analytics environments.

We'll explore why Python has become the tool of choice for business analytics, how to leverage cloud-based platforms for immediate productivity, and the essential Python skills needed for data manipulation, analysis, and visualization. By the end of this chapter, you'll be able to set up your own analytics environment, import and analyze data, create compelling visualizations, and share reproducible analyses with colleagues.

7.1 Why Python for Business Analytics

The Rise of Python in Business

Python has experienced explosive growth in business analytics over the past decade. According to the 2024 Stack Overflow Developer Survey, Python is the most wanted programming language, and the 2024 Kaggle State of Data Science survey shows that over 80% of data professionals use Python regularly.

Why has Python become so dominant?

1. Readability and Accessibility

Python's syntax is designed to be readable and intuitive, resembling natural language more than traditional programming languages.

Example Comparison:

# Python: Calculate average sales

average_sales = sum(sales_data) / len(sales_data)

# Compare to equivalent in other languages (more verbose)

# Java would require type declarations, class structure, etc.

# SQL would require aggregation syntax

# Excel formula: =AVERAGE(A1:A100)

Python code reads almost like English: "average sales equals the sum of sales data divided by the length of sales data."

2. Comprehensive Ecosystem for Analytics

Python offers a rich ecosystem of libraries specifically designed for data work:

Library	Purpose	Business Use Cases
pandas	Data manipulation and analysis	Cleaning data, aggregating sales reports, merging datasets
numpy	Numerical computing	Statistical calculations, array operations, financial modeling
matplotlib	Static visualizations	Creating charts for reports and presentations
seaborn	Statistical visualizations	Exploring distributions, correlations, patterns
scikit-learn	Machine learning	Predictive modeling, customer segmentation, forecasting
statsmodels	Statistical modeling	Regression analysis, hypothesis testing, time series
plotly	Interactive visualizations	Dashboards, interactive reports
requests	API interactions	Pulling data from web services, APIs
openpyxl/xlsxwriter	Excel integration	Reading/writing Excel files, automating reports

3. Versatility Across the Analytics Workflow

Python supports every stage of the analytics process:

Data Collection → Data Cleaning → Analysis → Visualization → Modeling → Deployment

Example Workflow:

import pandas as pd

import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression

# 1. Data Collection

data = pd.read_csv('sales_data.csv')

# 2. Data Cleaning

data = data.dropna()

data['date'] = pd.to_datetime(data['date'])

# 3. Analysis

monthly_sales = data.groupby(data['date'].dt.to_period('M'))['sales'].sum()

# 4. Visualization

monthly_sales.plot(kind='line', title='Monthly Sales Trend')

plt.show()

# 5. Modeling

X = data[['marketing_spend', 'price']]

y = data['sales']

model = LinearRegression().fit(X, y)

# 6. Prediction

predicted_sales = model.predict([[50000, 29.99]])

All in one language, one environment, one workflow.

4. Free and Open Source

Unlike proprietary tools (SAS, STATA, Tableau), Python is:

Free to use : No licensing costs
Open source : Transparent, community-driven development
Platform-independent : Works on Windows, Mac, Linux
Extensible : Anyone can create and share libraries

Business Impact: A company can equip an entire analytics team with Python at zero software cost, compared to tens of thousands of dollars for proprietary alternatives.

5. Industry Adoption and Career Value

Python skills are in high demand:

Job postings : 70%+ of data analyst positions mention Python
Salary premium : Python skills correlate with 10-20% higher salaries
Industry standard : Used by Google, Netflix, Amazon, JPMorgan, and virtually every major tech company

6. Integration and Automation

Python excels at connecting different systems and automating workflows:

Example Use Cases:

Pull data from Salesforce API, analyze it, and email results
Automate monthly report generation from database queries
Connect to Google Sheets, process data, update dashboards
Schedule recurring analyses to run automatically

Example: Automated weekly sales report

import pandas as pd

import smtplib

from email.mime.text import MIMEText

from datetime import datetime

# Pull data

data = pd.read_sql("SELECT * FROM sales WHERE date >= CURRENT_DATE - 7", connection)

# Analyze

weekly_total = data['sales'].sum()

top_products = data.groupby('product')['sales'].sum().nlargest(5)

# Create report

report = f"""

Weekly Sales Report - {datetime.now().strftime('%Y-%m-%d')}

Total Sales: ${weekly_total:,.2f}

Top 5 Products:

{top_products.to_string()}

"""

# Email report (simplified)

# send_email(report, recipients=['team@company.com'])

7. Reproducibility and Documentation

Python code serves as documentation of your analysis:

Transparent : Every step is visible and reviewable
Reproducible : Others can run the same analysis and get the same results
Auditable : Critical for compliance and quality control
Collaborative : Easy to share and iterate with version control (Git)

Contrast with Excel:

Excel: Formulas hidden in cells, manual steps undocumented, "works on my machine" problems
Python: Every transformation explicit, version-controlled, reproducible

When to Use Python vs. Other Tools

Python isn't always the best choice. Here's a practical decision framework:

Scenario	Best Tool	Why
Quick calculation, small dataset (<1000 rows)	Excel	Faster for simple tasks, visual interface
Ad-hoc data exploration, one-time analysis	Excel or Python	Either works; choose based on complexity
Recurring analysis, automation needed	Python	Reproducibility, automation capabilities
Large datasets (>100K rows)	Python or SQL	Performance, memory efficiency
Complex transformations, multiple data sources	Python	Flexibility, integration capabilities
Statistical modeling, machine learning	Python or R	Specialized libraries, industry standard
Interactive dashboards for business users	Tableau/Power BI	User-friendly interface, no coding required
Production deployment, real-time systems	Python	Integration, scalability, automation

The Hybrid Approach:

Many analysts use a combination:

Excel : Initial exploration, simple calculations, sharing with non-technical stakeholders
Python : Data cleaning, complex analysis, automation, modeling
SQL : Data extraction from databases
Tableau/Power BI : Final dashboards for business users

Python vs. R: A Brief Comparison

Both Python and R are excellent for analytics. Here's a practical comparison:

Aspect	Python	R
Primary Design	General-purpose programming	Statistical computing
Learning Curve	Gentler for programming beginners	Steeper for non-statisticians
Business Adoption	Broader (analytics + engineering)	Primarily analytics/research
Visualization	Good (matplotlib, seaborn, plotly)	Excellent (ggplot2)
Machine Learning	Excellent (scikit-learn, TensorFlow)	Good (caret, tidymodels)
Statistical Analysis	Good (statsmodels, scipy)	Excellent (built-in, specialized packages)
Deployment	Easier (general-purpose language)	More challenging
Community	Larger, more diverse	Smaller, more specialized
Job Market	Broader demand	More specialized roles

Recommendation for Business Analysts: Start with Python. It's more versatile, has broader industry adoption, and skills transfer to other domains (web development, automation, data engineering).

The Business Case for Python

For Individual Analysts:

✅ Increase productivity through automation
✅ Handle larger, more complex datasets
✅ Create reproducible, auditable analyses
✅ Enhance career prospects and earning potential
✅ Collaborate more effectively with technical teams

For Organizations:

✅ Reduce software licensing costs
✅ Improve analysis quality and reproducibility
✅ Enable advanced analytics (ML, AI)
✅ Attract and retain technical talent
✅ Scale analytics capabilities efficiently

ROI Example:

A business analyst spending 10 hours/week on manual data processing can often reduce this to 2 hours/week with Python automation:

Time saved : 8 hours/week × 50 weeks = 400 hours/year
Value : At $50/hour fully loaded cost = $20,000/year
Learning investment : ~40 hours to reach productivity
Payback period : ~5 weeks

7.2 Introduction to Cloud-Based Analytics Environments

The Evolution of Analytics Environments

Traditional Approach:

Install Python on your computer
Configure environment and install libraries
Manage dependencies and versions
Deal with "works on my machine" problems
Struggle with collaboration and sharing

Cloud-Based Approach:

Open a web browser
Start analyzing data
Collaborate in real-time
Access from anywhere
No installation or configuration needed

What Are Cloud-Based Analytics Environments?

Cloud-based analytics environments are web-based platforms that provide:

Computational resources : CPU, memory, storage in the cloud
Pre-configured environments : Python and libraries already installed
Interactive notebooks : Mix code, visualizations, and documentation
Collaboration features : Share, comment, co-edit
Integration : Connect to cloud storage, databases, APIs

Benefits of Cloud-Based Environments

1. Zero Setup Time

Traditional Installation:

# Install Python

# Install pip

# Install libraries

pip install pandas numpy matplotlib seaborn scikit-learn

# Troubleshoot version conflicts

# Configure IDE

# Set up virtual environments

# Debug path issues

# ...hours later...

Cloud-Based:

1. Go to colab.research.google.com

2. Click "New Notebook"

3. Start codin g

2. Consistent Environment

Everyone uses the same Python version and libraries—no more "works on my machine" problems.

3. Access to Powerful Hardware

Cloud platforms provide:

Free tier : Sufficient for most business analytics
Upgraded options : GPUs for machine learning, more RAM for large datasets
Scalability : Handle datasets that wouldn't fit on your laptop

4. Collaboration and Sharing

Share notebooks like Google Docs:

Real-time collaboration : Multiple people editing simultaneously
Comments : Discuss specific code cells
Version history : Track changes over time
Easy sharing : Send a link, no file attachments

5. Accessibility

Work from anywhere:

Any device : Laptop, tablet, even phone (for viewing)
Any operating system : Windows, Mac, Linux, ChromeOS
No local storage needed : Everything saved in the cloud

6. Integration with Cloud Services

Easy connection to:

Google Drive, Dropbox, OneDrive
Cloud databases (BigQuery, AWS RDS)
APIs and web services
Cloud storage (S3, Google Cloud Storage)

7.2.1 Google Colab: Features and Limitations

Google Colaboratory (Colab) is a free, cloud-based Jupyter notebook environment provided by Google.

Key Features

1. Free Access to Computing Resources

CPU : Standard computing for most analytics tasks
GPU : Graphics processing units for machine learning (free tier available)
TPU : Tensor processing units for advanced deep learning (free tier available)
RAM : 12-13 GB in free tier
Disk : ~100 GB temporary storage

2. Pre-installed Libraries

Most common libraries are already installed:

import pandas as pd # ✓ Pre-installed

import numpy as np # ✓ Pre-installed

import matplotlib.pyplot as plt # ✓ Pre-installed

import seaborn as sns # ✓ Pre-installed

import sklearn # ✓ Pre-installed

For additional libraries:

!pip install library_name

3. Google Drive Integration

Mount your Google Drive to access files:

from google.colab import drive

drive.mount('/content/drive')

# Now access files like:

data = pd.read_csv('/content/drive/MyDrive/data/sales.csv')

4. Interactive Widgets and Forms

Create user-friendly interfaces:

#@title Enter Parameters { run: "auto" }

start_date = "2024-01-01" #@param {type:"date"}

region = "North" #@param ["North", "South", "East", "West"]

threshold = 1000 #@param {type:"slider", min:0, max:10000, step:100}

print(f"Analyzing {region} region from {start_date} with threshold {threshold}")

5. Markdown and Rich Text

Mix code with formatted documentation:

# Analysis Title

This notebook analyzes **Q3 sales data** to identify trends.

## Key Findings

- Sales increased 15% YoY

- Northeast region outperformed

6. Built-in Data Visualization

Visualizations render inline:

import matplotlib.pyplot as plt

plt.plot([1, 2, 3, 4], [10, 20, 25, 30])

plt.title('Sales Trend')

plt.show() # Displays directly in notebook

7. Code Snippets Library

Access pre-written code for common tasks:

Camera capture
File upload
Data visualization templates
Machine learning examples

8. Keyboard Shortcuts

Efficient navigation and editing:

Ctrl/Cmd + Enter : Run current cell
Shift + Enter : Run cell and move to next
Ctrl/Cmd + M B : Insert cell below
Ctrl/Cmd + M D : Delete cell
Ctrl/Cmd + / : Comment/uncomment

Limitations

1. Session Timeouts

Idle timeout : ~90 minutes of inactivity
Maximum session : ~12 hours
Impact : Long-running analyses may be interrupted
Workaround : Save intermediate results frequently

# Save intermediate results

data.to_csv('/content/drive/MyDrive/intermediate_results.csv', index=False)

2. Temporary Storage

Files in /content/ are deleted when session ends
Solution : Save important files to Google Drive

# BAD: Saved to temporary storage

data.to_csv('results.csv') # Lost when session ends!

# GOOD: Saved to Google Drive

data.to_csv('/content/drive/MyDrive/results.csv') # Persists

3. Resource Limits

Free tier limitations:

RAM : 12-13 GB (can run out with large datasets)
GPU/TPU : Limited availability during peak times
Compute units : Fair usage limits (not publicly specified)

Workaround for large datasets:

# Read data in chunks

chunk_size = 10000

chunks = []

for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):

# Process each chunk

processed = chunk[chunk['sales'] > 1000]

chunks.append(processed)

data = pd.concat(chunks, ignore_index=True)

4. No Terminal Access

Limited command-line functionality
Can run shell commands with ! prefix, but not a full terminal

# Shell commands work with ! prefix

!ls /content/drive/MyDrive/

!wget https://example.com/data.csv

5. Internet Dependency

Requires stable internet connection
No offline work (unlike local Jupyter)

6. Privacy Considerations

Data is processed on Google's servers
Important : Don't upload sensitive/confidential data without proper authorization
Review your organization's data governance policies

7. Limited Customization

Can't customize the environment as extensively as local installations
Some advanced configurations not possible

Colab Pro and Pro+

For power users, Google offers paid tiers:

Feature	Free	Pro (~$10/month)	Pro+ (~$50/month)
RAM	12-13 GB	Up to 32 GB	Up to 52 GB
Session Length	12 hours	24 hours	24 hours
Idle Timeout	90 min	Longer	Longer
GPU Access	Limited	Priority	Priority
Background Execution	No	Yes	Yes
Compute Units	Limited	More	Most

When to upgrade:

Working with datasets >10 GB
Training complex machine learning models
Need longer uninterrupted sessions
Require background execution

7.2.2 Alternatives: Jupyter Notebook, VS Code, and Hosted Platforms (Conceptual)

While this chapter focuses on Google Colab, it's valuable to understand the broader ecosystem.

Jupyter Notebook (Local)

What it is: The original notebook interface, running on your own computer.

Pros:

✅ Full control over environment
✅ Works offline
✅ No resource limits (beyond your hardware)
✅ Complete privacy (data stays local)
✅ Extensive customization

Cons:

❌ Requires installation and configuration
❌ Limited by your computer's resources
❌ Harder to share and collaborate
❌ Environment inconsistencies across team members

When to use:

Working with sensitive data that can't leave your network
Need offline access
Require specific environment configurations
Have powerful local hardware

Getting Started (conceptual):

# Install Jupyter

pip install jupyter

# Launch

jupyter notebook

# Opens in browser at localhost:8888

JupyterLab (Local)

What it is: Next-generation interface for Jupyter, more IDE-like.

Additional features over Jupyter Notebook:

Multiple notebooks in tabs
File browser
Terminal access
Text editor
Extension system

When to use: Same as Jupyter Notebook, but for users who want a more integrated development environment.

VS Code with Jupyter Extension

What it is: Microsoft's Visual Studio Code editor with Jupyter notebook support.

Pros:

✅ Powerful code editor features (IntelliSense, debugging)
✅ Integrated terminal
✅ Git integration
✅ Extensions for Python, data science
✅ Can connect to remote Jupyter servers

Cons:

❌ Steeper learning curve
❌ More complex setup
❌ Heavier application

When to use:

You're already familiar with VS Code
Need advanced debugging capabilities
Working on larger projects with multiple files
Want tight Git integration

Kaggle Notebooks

What it is: Free cloud notebooks from Kaggle (owned by Google).

Pros:

✅ Free GPU access (30 hours/week)
✅ Large dataset library
✅ Community and competitions
✅ Similar to Colab

Cons:

❌ Less Google Drive integration
❌ Smaller storage limits
❌ Focused on data science competitions

When to use:

Learning data science
Participating in Kaggle competitions
Need free GPU for machine learning

Databricks Community Edition

What it is: Free tier of Databricks, a cloud analytics platform.

Pros:

✅ Built for big data (Apache Spark)
✅ Collaborative features
✅ Enterprise-grade platform

Cons:

❌ Steeper learning curve
❌ Overkill for small datasets
❌ Limited free tier

When to use:

Working with very large datasets (>100 GB)
Learning Spark
Your organization uses Databricks

Amazon SageMaker Studio Lab

What it is: Free Jupyter environment from AWS.

Pros:

✅ Free compute (no credit card required)
✅ 15 GB storage
✅ GPU access

Cons:

❌ Requires application/waitlist
❌ Less intuitive than Colab
❌ Smaller community

Microsoft Azure Notebooks (Deprecated) / Azure ML Studio

What it is: Microsoft's cloud notebook offerings.

Note: Azure Notebooks was deprecated; Azure ML Studio is the current offering but requires Azure subscription.

Deepnote

What it is: Collaborative data science notebook platform.

Pros:

✅ Real-time collaboration
✅ Beautiful interface
✅ Good integration features
✅ Free tier available

Cons:

❌ Smaller community
❌ Limited free tier
❌ Less mature than Colab

Comparison Matrix

Platform	Cost	Setup	Collaboration	Resources	Best For
Google Colab	Free	None	Good	Moderate	General business analytics, learning
Jupyter (Local)	Free	Medium	Poor	Your hardware	Sensitive data, offline work
VS Code	Free	Medium	Medium	Your hardware	Software development, complex projects
Kaggle	Free	None	Good	Moderate	Learning, competitions
Databricks	Free tier	Low	Excellent	Good	Big data, Spark
Colab Pro	$10/mo	None	Good	High	Power users, ML

Recommendation for Business Analysts

Start with Google Colab:

Zero setup barrier
Free and powerful enough for most business analytics
Easy collaboration and sharing
Smooth transition to other platforms later if needed

Transition to local Jupyter when:

Working with confidential data
Need offline access
Require specific environment configurations
Have outgrown Colab's resource limits

7.3 Getting Started with Google Colab

7.3.1 Creating and Managing Notebooks

Creating Your First Notebook

Step 1: Access Google Colab

Open your web browser
Go to https://colab.research.google.com
Sign in with your Google account

Step 2: Create a New Notebook

Option A: From the welcome screen

Click "New Notebook"

Option B: From Google Drive

Go to Google Drive
Click "New" → "More" → "Google Colaboratory"

Option C: From File menu

File → New Notebook

Your first notebook opens with an empty code cell.

Understanding the Interface

┌─────────────────────────────────────────────────────┐

│ File Edit View Insert Runtime Tools Help │ ← Menu bar

├─────────────────────────────────────────────────────┤

│ + Code + Text 📁 Files 🔍 Search │ ← Toolbar

├─────────────────────────────────────────────────────┤

│ Untitled0.ipynb │ ← Notebook name

├─────────────────────────────────────────────────────┤

│ [ ] # Your code here │ ← Code cell

│ │

├─────────────────────────────────────────────────────┤

│ + Code + Text │ ← Add cells

└─────────────────────────────────────────────────────┘

Key Interface Elements:

Menu Bar : File operations, editing, runtime control
Toolbar : Quick access to common actions
Notebook Name : Click to rename
Code Cells : Where you write Python code
Text Cells : Markdown for documentation
Sidebar : Files, table of contents, code snippets

Your First Code

Click in the code cell and type:

print("Hello, Business Analytics!")

Run the cell:

Click the play button (▶) on the left of the cell, OR
Press Shift + Enter , OR
Press Ctrl/Cmd + Enter

Output appears below the cell:

Hello, Business Analytics!

Adding Cells

Add a code cell:

Click "+ Code" button, OR
Hover between cells and click "+ Code", OR
Press Ctrl/Cmd + M B (below) or Ctrl/Cmd + M A (above)

Add a text cell:

Click "+ Text" button, OR
Hover between cells and click "+ Text", OR
Press Ctrl/Cmd + M M (convert code cell to text)

Working with Text Cells (Markdown)

Text cells use Markdown formatting:

# Main Heading

## Subheading

### Sub-subheading

**Bold text** and *italic text*

- Bullet point 1

- Bullet point 2

1. Numbered item

2. Another item

[Link text](https://example.com)

`inline code`

code block

> Blockquote

Example text cell:

# Q3 Sales Analysis

This notebook analyzes Q3 2024 sales data to identify:

- Top performing products

- Regional trends

- Seasonal patterns

**Data Source:** Salesforce export, October 15, 2024

Organizing Your Notebook

Best practices:

Start with a title and overview

# Customer Churn Analysis - October 2024

**Objective:** Identify factors contributing to customer churn

**Data:** CRM export (10,000 customers, 2023-2024)

**Methodology:**

1. Data cleaning and exploration

2. Descriptive statistics

3. Visualization of key patterns

4. Predictive modeling

Use section headings

## 1. Data Import and Cleaning

## 2. Exploratory Data Analysis

## 3. Key Findings

## 4. Recommendations

Document your code

# Import required libraries

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

# Load data from Google Drive

data = pd.read_csv('/content/drive/MyDrive/data/sales.csv')

# Display first few rows to verify

print(f"Dataset shape: {data.shape}")

data.head()

Use table of contents

Click the table of contents icon in the sidebar
Automatically generated from your headings
Click to navigate to sections

Renaming Your Notebook

Method 1: Click the notebook name at the top

Type new name
Press Enter

Method 2: File → Rename

Naming conventions:

✅ 2024-10_Customer_Churn_Analysis.ipynb
✅ Q3_Sales_Report_v2.ipynb
❌ Untitled0.ipynb
❌ notebook.ipynb

Best practices:

Include date or period
Descriptive name
Version number if iterating
Use underscores or hyphens, not spaces

Saving Your Notebook

Automatic saving:

Colab auto-saves to Google Drive every few minutes
Look for "All changes saved" in the top bar

Manual saving:

File → Save, OR
Ctrl/Cmd + S

Save location:

Default: "Colab Notebooks" folder in your Google Drive
Can be moved to any folder in your Drive

Downloading Your Notebook

Download as .ipynb (Jupyter format):

File → Download → Download .ipynb
Can be opened in Jupyter, JupyterLab, VS Code

Download as Python script:

File → Download → Download .py
Pure Python code (markdown cells become comments)

Download with outputs:

File → Print
Save as PDF from print dialog

Copying and Sharing Notebooks

Make a copy:

File → Save a copy in Drive
Creates a duplicate you can edit independently

Share with others:

Click "Share" button (top right)
Add people by email, OR
Get shareable link

Sharing permissions:

Viewer : Can view and run, cannot edit
Commenter : Can view, run, and comment
Editor : Can view, run, and edit

Best practice for collaboration:

Share as "Viewer" for read-only access
Share as "Editor" for collaboration
Use "Copy to Drive" to let others create their own editable version

Opening Existing Notebooks

From Colab:

File → Open notebook
Tabs: Recent, Google Drive, GitHub, Upload

From Google Drive:

Navigate to the .ipynb file
Double-click to open in Colab

From GitHub:

File → Open notebook → GitHub tab
Enter repository URL or search

From URL:

Replace github.com with colab.research.google.com/github in any GitHub notebook URL
Example: https://colab.research.google.com/github/username/repo/blob/main/notebook.ipynb

Managing Multiple Notebooks

Organization strategies:

Folder structure in Google Drive:

My Drive/

├── Analytics Projects/

│ ├── Customer Churn/

│ │ ├── data/

│ │ ├── 01_Data_Exploration.ipynb

│ │ ├── 02_Modeling.ipynb

│ │ └── 03_Final_Report.ipynb

│ └── Sales Analysis/

│ ├── data/

│ └── Q3_Analysis.ipynb

Naming conventions:

Prefix with numbers for sequence: 01_ , 02_ , 03_
Include dates: 2024-10-15_
Use descriptive names

Version control:

Save major versions: Analysis_v1.ipynb , Analysis_v2.ipynb
Use "Save a copy" before major changes
Consider GitHub for serious version control

7.3.2 Connecting to Google Drive and Data Sources

Mounting Google Drive

Why mount Google Drive?

Access your data files
Save results permanently
Share data across notebooks

How to mount:

from google.colab import drive

drive.mount('/content/drive')

What happens:

A link appears: "Go to this URL in a browser"
Click the link (or it opens automatically)
Choose your Google account
Click "Allow"
Copy the authorization code (if prompted)
Paste into the notebook (if prompted)

Modern experience: Usually just one click to authorize.

Output:

Mounted at /content/drive

Verify the mount:

# List files in your Google Drive

!ls "/content/drive/MyDrive"

File path structure:

/content/drive/MyDrive/ ← Your Google Drive root

/content/drive/MyDrive/Colab Notebooks/ ← Default notebook location

/content/drive/MyDrive/data/ ← Your data folder (example)

Reading Files from Google Drive

CSV file:

import pandas as pd

# Read CSV from Google Drive

data = pd.read_csv('/content/drive/MyDrive/data/sales_data.csv')

# Display first few rows

print(f"Loaded {len(data)} rows")

data.head()

Excel file:

# Read Excel file

data = pd.read_excel('/content/drive/MyDrive/data/sales_data.xlsx', sheet_name='Q3 Sales')

data.head()

Multiple sheets:

# Read all sheets

excel_file = pd.ExcelFile('/content/drive/MyDrive/data/sales_data.xlsx')

# See sheet names

print(excel_file.sheet_names)

# Read specific sheets

q3_data = pd.read_excel(excel_file, sheet_name='Q3')

q4_data = pd.read_excel(excel_file, sheet_name='Q4' )

Text file:

# Read text file

with open('/content/drive/MyDrive/data/notes.txt', 'r') as f:

content = f.read()

print(content)

Saving Files to Google Drive

Save DataFrame to CSV:

# Process data

results = data.groupby('region')['sales'].sum()

# Save to Google Drive

results.to_csv('/content/drive/MyDrive/results/regional_sales.csv')

print("Results saved to Google Drive")

Save DataFrame to Excel:

# Save to Excel

results.to_excel('/content/drive/MyDrive/results/regional_sales.xlsx',

sheet_name='Summary',

index=False)

Save multiple DataFrames to Excel:

# Create Excel writer

with pd.ExcelWriter('/content/drive/MyDrive/results/sales_report.xlsx') as writer:

summary.to_excel(writer, sheet_name='Summary', index=False)

details.to_excel(writer, sheet_name='Details', index=False)

trends.to_excel(writer, sheet_name='Trends', index=False)

print("Multi-sheet report saved")

Save plot:

import matplotlib.pyplot as plt

# Create plot

plt.figure(figsize=(10, 6))

plt.plot(data['date'], data['sales'])

plt.title('Sales Trend')

plt.xlabel('Date')

plt.ylabel('Sales ($)')

# Save to Google Drive

plt.savefig('/content/drive/MyDrive/results/sales_trend.png', dpi=300, bbox_inches='tight')

plt.show()

print("Plot saved to Google Drive")

Uploading Files Directly to Colab

For small, one-time uploads:

from google.colab import files

# Upload file(s)

uploaded = files.upload()

# Access uploaded file

import pandas as pd

for filename in uploaded.keys():

print(f"Uploaded: {filename}")

data = pd.read_csv(filename)

What happens:

"Choose Files" button appears
Select file(s) from your computer
Files upload to /content/ (temporary storage)
⚠️ Warning: Files deleted when session ends!

Best practice: Upload to Google Drive for permanent storage, or save results to Drive before session ends.

Reading Data from URLs

CSV from URL:

import pandas as pd

# Read directly from URL

url = 'https://raw.githubusercontent.com/datasets/gdp/master/data/gdp.csv'

data = pd.read_csv(url)

data.head()

Download file from URL:

# Download file using wget

!wget https://example.com/data/sales_data.csv -O /content/drive/MyDrive/data/sales_data.csv

# Or using Python

import requests

url = 'https://example.com/data/sales_data.csv'

response = requests.get(url)

with open('/content/drive/MyDrive/data/sales_data.csv', 'wb') as f:

f.write(response.content)

print("File downloaded to Google Drive")

Connecting to Google Sheets

Read from Google Sheets:

# Method 1: Export as CSV (simple)

sheet_url = 'https://docs.google.com/spreadsheets/d/YOUR_SHEET_ID/export?format=csv&gid=0'

data = pd.read_csv(sheet_url)

# Method 2: Using gspread library (more features)

!pip install gspread

from google.colab import auth

import gspread

from google.auth import default

# Authenticate

auth.authenticate_user()

creds, _ = default()

gc = gspread.authorize(creds)

# Open sheet

sheet = gc.open('Sales Data 2024').sheet1

# Get all values

values = sheet.get_all_values()

data = pd.DataFrame(values[1:], columns=values[0])

data.head()

Write to Google Sheets:

# Update existing sheet

sheet.update([data.columns.values.tolist()] + data.values.tolist())

print("Data written to Google Sheets ")

Connecting to Databases

SQLite (local):

import sqlite3

import pandas as pd

# Connect to database

conn = sqlite3.connect('/content/drive/MyDrive/data/sales.db')

# Query data

query = "SELECT * FROM sales WHERE date >= '2024-01-01'"

data = pd.read_sql(query, conn)

# Close connection

conn.close()

data.head()

PostgreSQL/MySQL (remote):

# Install library

!pip install psycopg2-binary # PostgreSQL

# or

!pip install pymysql # MySQL

import pandas as pd

from sqlalchemy import create_engine

# Create connection (PostgreSQL example)

engine = create_engine('postgresql://username:password@host:port/database')

# Query data

query = "SELECT * FROM sales WHERE date >= '2024-01-01'"

data = pd.read_sql(query, engine)

data.head()

⚠️ Security Warning: Never hardcode credentials in notebooks!

Better approach:

from google.colab import userdata

# Store secrets in Colab's secret manager

# (Left sidebar → Key icon → Add secret)

db_password = userdata.get('DB_PASSWORD')

# Use in connection string

engine = create_engine(f'postgresql://username:{db_password}@host:port/database')

Connecting to APIs

Simple API request:

import requests

import pandas as pd

# API request

url = 'https://api.example.com/sales'

headers = {'Authorization': 'Bearer YOUR_API_KEY'}

response = requests.get(url, headers=headers)

# Parse JSON response

data = response.json()

# Convert to DataFrame

df = pd.DataFrame(data['results'])

df.head()

Pagination example:

import requests

import pandas as pd

def fetch_all_pages(base_url, headers):

all_data = []

page = 1

while True:

response = requests.get(f"{base_url}?page={page}", headers=headers)

data = response.json()

if not data['results']:

break

all_data.extend(data['results'])

page += 1

print(f"Fetched page {page-1}")

return pd.DataFrame(all_data)

# Use function

df = fetch_all_pages('https://api.example.com/sales', headers={'Authorization': 'Bearer KEY'})

print(f"Total records: {len(df)}")

Best Practices for Data Access

1. Organize your data:

Google Drive/

├── Analytics Projects/

│ ├── data/

│ │ ├── raw/ ← Original, unmodified data

│ │ ├── processed/ ← Cleaned, transformed data

│ │ └── external/ ← Data from APIs, downloads

│ ├── results/ ← Analysis outputs

│ └── notebooks/ ← Your .ipynb files

2. Document data sources:

"""

DATA SOURCES:

- sales_data.csv: Salesforce export, 2024-10-15, John Smith

- customer_data.xlsx: CRM database, 2024-10-14, automated export

- market_data.csv: Public API (https://api.example.com), 2024-10-15

"""

3. Validate data after loading:

# Load data

data = pd.read_csv('/content/drive/MyDrive/data/sales.csv')

# Validate

print(f"Shape: {data.shape}")

print(f"Columns: {data.columns.tolist()}")

print(f"Date range: {data['date'].min()} to {data['date'].max()}")

print(f"Missing values: {data.isnull().sum().sum()}")

# Check for expected columns

required_columns = ['date', 'product', 'sales', 'region']

missing_columns = set(required_columns) - set(data.columns)

if missing_columns:

print(f"⚠️ WARNING: Missing columns: {missing_columns}")

else:

print("✓ All required columns present")

4. Handle errors gracefully:

import pandas as pd

import os

file_path = '/content/drive/MyDrive/data/sales.csv'

try:

data = pd.read_csv(file_path)

print(f"✓ Successfully loaded {len(data)} rows")

except FileNotFoundError:

print(f"❌ Error: File not found at {file_path}")

print("Available files:")

print(os.listdir('/content/drive/MyDrive/data/'))

except pd.errors.EmptyDataError:

print("❌ Error: File is empty")

except Exception as e:

print(f"❌ Error loading data: {e}")

7.4 Python Essentials for Analytics

7.4.1 Basic Syntax, Data Types, and Control Structures

Python Basics

Comments:

# This is a single-line comment

"""

This is a multi-line comment

or docstring

"""

# Use comments to explain WHY, not WHAT

# BAD: x = x + 1 # increment x

# GOOD: x = x + 1 # adjust for zero-based indexing

Variables:

# Variables don't need type declarations

company_name = "Acme Corp"

revenue = 1250000

growth_rate = 0.15

is_profitable = True

# Variable naming conventions

# ✓ snake_case for variables and functions

# ✓ PascalCase for classes

# ✓ UPPER_CASE for constants

# Good names

customer_count = 150

average_order_value = 45.50

# Bad names

x = 150 # What is x?

aov = 45.50 # Unclear abbreviation

Printing output:

# Basic print

print("Hello, World!")

# Print variables

revenue = 1250000

print(revenue)

# Formatted strings (f-strings) - BEST PRACTICE

print(f"Revenue: ${revenue:,.2f}")

# Output: Revenue: $1,250,000.00

# Multiple variables

company = "Acme"

year = 2024

print(f"{company} revenue in {year}: ${revenue:,.2f}")

Data Types

Numbers:

# Integers

customer_count = 150

year = 2024

# Floats

price = 29.99

growth_rate = 0.15

# Arithmetic operations

total_revenue = customer_count * price

print(f"Total revenue: ${total_revenue:,.2f}")

# Common operations

a = 10

b = 3

print(a + b) # 13 (addition)

print(a - b) # 7 (subtraction)

print(a * b) # 30 (multiplication)

print(a / b) # 3.333... (division)

print(a // b) # 3 (integer division)

print(a % b) # 1 (modulo/remainder)

print(a ** b) # 1000 (exponentiation)

# Useful functions

import math

print(round(3.14159, 2)) # 3.14

print(abs(-10)) # 10

print(math.sqrt(16)) # 4.0

print(math.ceil(3.2)) # 4

print(math.floor(3.8)) # 3

Strings:

# String creation

company = "Acme Corp"

product = 'Widget Pro' # Single or double quotes

# String concatenation

full_name = "Acme" + " " + "Corp"

# Better: use f-strings

full_name = f"Acme Corp"

# String methods

text = " Business Analytics "

print(text.strip()) # "Business Analytics" (remove whitespace)

print(text.lower()) # " business analytics "

print(text.upper()) # " BUSINESS ANALYTICS "

print(text.replace("Business", "Data")) # " Data Analytics "

print(text.split()) # ['Business', 'Analytics']

# String formatting

revenue = 1250000

print(f"Revenue: ${revenue:,.2f}") # Revenue: $1,250,000.00

print(f"Revenue: ${revenue/1000:.1f}K") # Revenue: $1250.0K

# Multi-line strings

report = """

Q3 2024 Sales Report

====================

Total Revenue: $1.25M

Growth: 15%

"""

print(report)

Booleans:

# Boolean values

is_profitable = True

has_debt = False

# Comparison operators

revenue = 1000000

target = 900000

print(revenue > target) # True

print(revenue < target) # False

print(revenue == target) # False

print(revenue != target) # True

print(revenue >= target) # True

print(revenue <= target) # False

# Logical operators

high_revenue = revenue > 1000000

low_cost = True

print(high_revenue and low_cost) # False (both must be True)

print(high_revenue or low_cost) # True (at least one is True)

print(not high_revenue) # True (negation)

Lists:

# Create lists

products = ["Widget", "Gadget", "Doohickey"]

sales = [1000, 1500, 1200]

mixed = ["Acme", 2024, True, 3.14] # Can mix types

# Access elements (zero-indexed)

print(products[0]) # "Widget" (first element)

print(products[1]) # "Gadget"

print(products[-1]) # "Doohickey" (last element)

print(products[-2]) # "Gadget" (second to last)

# Slicing

print(products[0:2]) # ['Widget', 'Gadget'] (elements 0 and 1)

print(products[:2]) # ['Widget', 'Gadget'] (first 2)

print(products[1:]) # ['Gadget', 'Doohickey'] (from index 1 to end)

# List methods

products.append("Thingamajig") # Add to end

products.insert(0, "Whatsit") # Insert at position

products.remove("Widget") # Remove by value

last_product = products.pop() # Remove and return last

# List operations

print(len(products)) # Length

print(sum(sales)) # Sum (for numeric lists)

print(max(sales)) # Maximum

print(min(sales)) # Minimum

print(sorted(sales)) # Sorted copy

print(sales.index(1500)) # Find index of value

# List comprehensions (powerful!)

# Create new list by transforming another

doubled_sales = [x * 2 for x in sales]

high_sales = [x for x in sales if x > 1100]

print(doubled_sales) # [2000, 3000, 2400]

print(high_sales) # [1500, 1200]

Dictionaries:

# Create dictionaries (key-value pairs)

customer = {

"name": "John Smith",

"email": "john@example.com",

"age": 35,

"is_active": True

}

# Access values

print(customer["name"]) # "John Smith"

print(customer.get("email")) # "john@example.com"

print(customer.get("phone", "N/A")) # "N/A" (default if key doesn't exist)

# Modify dictionaries

customer["age"] = 36 # Update value

customer["phone"] = "555-1234" # Add new key-value pair

del customer["is_active"] # Delete key

# Dictionary methods

print(customer.keys()) # dict_keys(['name', 'email', 'age', 'phone'])

print(customer.values()) # dict_values(['John Smith', 'john@example.com', 36, '555-1234'])

print(customer.items()) # Key-value pairs

# Check if key exists

if "email" in customer:

print(f"Email: {customer['email']}")

# Business example: sales by region

regional_sales = {

"North": 250000,

"South": 180000,

"East": 310000,

"West": 220000

}

total_sales = sum(regional_sales.values())

print(f"Total sales: ${total_sales:,.0f}")

# Find top region

top_region = max(regional_sales, key=regional_sales.get)

print(f"Top region: {top_region} (${regional_sales[top_region]:,.0f})")

Control Structures

If statements:

# Basic if statement

revenue = 1250000

if revenue > 1000000:

print("Revenue exceeds $1M!")

# If-else

if revenue > 1000000:

print("High revenue")

else:

print("Revenue below $1M")

# If-elif-else

if revenue > 2000000:

status = "Excellent"

elif revenue > 1000000:

status = "Good"

elif revenue > 500000:

status = "Fair"

else:

status = "Needs Improvement"

print(f"Status: {status}")

# Multiple conditions

revenue = 1250000

profit_margin = 0.18

if revenue > 1000000 and profit_margin > 0.15:

print("Strong performance!")

# Business example: customer segmentation

customer_value = 5000

years_active = 3

if customer_value > 10000:

segment = "VIP"

elif customer_value > 5000 and years_active > 2:

segment = "Premium"

elif customer_value > 1000:

segment = "Standard"

else:

segment = "Basic"

print(f"Customer segment: {segment}")

For loops:

# Loop through list

products = ["Widget", "Gadget", "Doohickey"]

for product in products:

print(f"Product: {product}")

# Loop through range

for i in range(5): # 0, 1, 2, 3, 4

print(i)

for i in range(1, 6): # 1, 2, 3, 4, 5

print(i)

for i in range(0, 10, 2): # 0, 2, 4, 6, 8 (step by 2)

print(i)

# Loop with index

for i, product in enumerate(products):

print(f"{i+1}. {product}")

# Loop through dictionary

regional_sales = {

"North": 250000,

"South": 180000,

"East": 310000

}

for region, sales in regional_sales.items():

print(f"{region}: ${sales:,.0f}")

# Business example: calculate total sales

monthly_sales = [45000, 52000, 48000, 61000, 55000, 58000]

total = 0

for sales in monthly_sales:

total += sales

average = total / len(monthly_sales)

print(f"Total: ${total:,.0f}")

print(f"Average: ${average:,.0f}")

# Better: use built-in functions

total = sum(monthly_sales)

average = sum(monthly_sales) / len(monthly_sales)

While loops:

# Basic while loop

count = 0

while count < 5:

print(count)

count += 1

# Business example: compound growth

initial_investment = 10000

target = 20000

annual_return = 0.08

years = 0

value = initial_investment

while value < target:

value = value * (1 + annual_return)

years += 1

print(f"Year {years}: ${value:,.2f}")

print(f"Reached target in {years} years")

# Break and continue

for i in range(10):

if i == 3:

continue # Skip this iteration

if i == 7:

break # Exit loop

print(i)

Functions

# Define a function

def calculate_revenue(units_sold, price_per_unit):

"""Calculate total revenue from units and price."""

revenue = units_sold * price_per_unit

return revenue

# Call the function

total_revenue = calculate_revenue(1000, 29.99)

print(f"Revenue: ${total_revenue:,.2f}")

# Function with default parameters

def calculate_discount(price, discount_rate=0.10):

"""Calculate discounted price. Default discount is 10%."""

discounted_price = price * (1 - discount_rate)

return discounted_price

print(calculate_discount(100)) # $90.00 (10% discount)

print(calculate_discount(100, 0.20)) # $80.00 (20% discount)

# Function with multiple return values

def calculate_metrics(revenue, costs):

"""Calculate profit and margin."""

profit = revenue - costs

margin = profit / revenue if revenue > 0 else 0

return profit, margin

profit, margin = calculate_metrics(1000000, 750000)

print(f"Profit: ${profit:,.0f}")

print(f"Margin: {margin:.1%}")

# Business example: customer lifetime value

def calculate_ltv(avg_purchase, purchase_frequency, customer_lifespan):

"""

Calculate Customer Lifetime Value.

Parameters:

- avg_purchase: Average purchase amount

- purchase_frequency: Purchases per year

- customer_lifespan: Years as customer

Returns:

- Customer Lifetime Value

"""

ltv = avg_purchase * purchase_frequency * customer_lifespan

return ltv

customer_ltv = calculate_ltv(avg_purchase=50,

purchase_frequency=12,

customer_lifespan=5)

print(f"Customer LTV: ${customer_ltv:,.0f}")

7.4.2 Working with Libraries: pandas, numpy, matplotlib, seaborn

Introduction to Libraries

Libraries extend Python's capabilities. Think of them as toolboxes:

pandas : Data manipulation (like Excel on steroids)
numpy : Numerical computing (fast array operations)
matplotlib : Plotting (create charts)
seaborn : Statistical visualization (beautiful charts)

Importing libraries:

# Standard imports

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

# These aliases (pd, np, plt, sns) are conventions

# Everyone uses them, so you should too

pandas: Data Manipulation

Creating DataFrames:

import pandas as pd

# From dictionary

data = {

'product': ['Widget', 'Gadget', 'Doohickey', 'Thingamajig'],

'sales': [1000, 1500, 1200, 1800],

'region': ['North', 'South', 'North', 'East']

}

df = pd.DataFrame(data)

print(df)

Output:

product sales region

0 Widget 1000 North

1 Gadget 1500 South

2 Doohickey 1200 North

3 Thingamajig 1800 East

Reading data:

# From CSV

df = pd.read_csv('sales_data.csv')

# From Excel

df = pd.read_excel('sales_data.xlsx', sheet_name='Q3')

# From URL

url = 'https://example.com/data.csv'

df = pd.read_csv(url)

Exploring data:

# First/last rows

print(df.head()) # First 5 rows

print(df.head(10)) # First 10 rows

print(df.tail()) # Last 5 rows

# Shape

print(df.shape) # (rows, columns)

# Column names

print(df.columns)

# Data types

print(df.dtypes)

# Summary statistics

print(df.describe())

# Info

print(df.info())

Selecting data:

# Select column

sales = df['sales']

print(sales)

# Select multiple columns

subset = df[['product', 'sales']]

print(subset)

# Select rows by condition

high_sales = df[df['sales'] > 1200]

print(high_sales)

# Multiple conditions

north_high_sales = df[(df['region'] == 'North') & (df['sales'] > 1000)]

# Select by position

first_row = df.iloc[0] # First row

first_cell = df.iloc[0, 0] # First row, first column

# Select by label

df_indexed = df.set_index('product')

widget_sales = df_indexed.loc['Widget', 'sales']

Adding/modifying columns:

# Add new column

df['revenue'] = df['sales'] * 29.99

# Modify existing column

df['sales'] = df['sales'] * 1.1 # 10% increase

# Conditional column

df['performance'] = df['sales'].apply(

lambda x: 'High' if x > 1200 else 'Low'

)

# Or using np.where

df['performance'] = np.where(df['sales'] > 1200, 'High', 'Low')

Grouping and aggregating:

# Group by region, sum sales

regional_sales = df.groupby('region')['sales'].sum()

print(regional_sales)

# Multiple aggregations

summary = df.groupby('region').agg({

'sales': ['sum', 'mean', 'count'],

'revenue': 'sum'

})

print(summary)

# Reset index to make it a regular DataFrame

summary_df = summary.reset_index()

Sorting:

# Sort by sales (ascending)

df_sorted = df.sort_values('sales')

# Sort descending

df_sorted = df.sort_values('sales', ascending=False)

# Sort by multiple columns

df_sorted = df.sort_values(['region', 'sales'], ascending=[True, False])

Handling missing data:

# Check for missing values

print(df.isnull().sum())

# Drop rows with missing values

df_clean = df.dropna()

# Fill missing values

df_filled = df.fillna(0) # Fill with 0

df_filled = df.fillna(df.mean()) # Fill with mean

# Fill specific column

df['sales'] = df['sales'].fillna(df['sales'].median())

Merging DataFrames:

# Two DataFrames

sales_df = pd.DataFrame({

'product': ['Widget', 'Gadget'],

'sales': [1000, 1500]

})

price_df = pd.DataFrame({

'product': ['Widget', 'Gadget'],

'price': [29.99, 39.99]

})

# Merge (like SQL JOIN)

merged = pd.merge(sales_df, price_df, on='product')

print(merged)

Business example:

import pandas as pd

# Load sales data

sales = pd.read_csv('sales_data.csv')

# Clean data

sales = sales.dropna()

sales['date'] = pd.to_datetime(sales['date'])

# Add calculated columns

sales['revenue'] = sales['units'] * sales['price']

sales['month'] = sales['date'].dt.to_period('M')

# Aggregate by month

monthly_summary = sales.groupby('month').agg({

'revenue': 'sum',

'units': 'sum',

'customer_id': 'nunique' # Count unique customers

}).reset_index()

monthly_summary.columns = ['month', 'total_revenue', 'total_units', 'unique_customers']

# Calculate average order value

monthly_summary['avg_order_value'] = (

monthly_summary['total_revenue'] / monthly_summary['unique_customers']

)

# Display top 5 months

top_months = monthly_summary.sort_values('total_revenue', ascending=False).head()

print(top_months)

numpy: Numerical Computing

Creating arrays:

import numpy as np

# From list

arr = np.array([1, 2, 3, 4, 5])

print(arr)

# 2D array

matrix = np.array([[1, 2, 3], [4, 5, 6]])

print(matrix)

# Special arrays

zeros = np.zeros(5) # [0. 0. 0. 0. 0.]

ones = np.ones(5) # [1. 1. 1. 1. 1.]

range_arr = np.arange(0, 10, 2) # [0 2 4 6 8]

linspace = np.linspace(0, 1, 5) # [0. 0.25 0.5 0.75 1. ]

Array operations:

# Element-wise operations

sales = np.array([1000, 1500, 1200, 1800])

# Add 10% to all sales

increased_sales = sales * 1.1

print(increased_sales)

# Operations between arrays

costs = np.array([600, 900, 700, 1000])

profit = sales - costs

print(profit)

# Statistical functions

print(np.mean(sales)) # Average

print(np.median(sales)) # Median

print(np.std(sales)) # Standard deviation

print(np.sum(sales)) # Sum

print(np.min(sales)) # Minimum

Exercises

These hands-on exercises will help you apply the concepts covered in this chapter. Work through them in Google Colab to build practical skills in Python-based analytics.

Exercise 1: Set Up and Explore Google Colab

Objective : Familiarize yourself with the Google Colab interface and basic operations.

Tasks :

Go to colab.research.google.com and create a new notebook.
Rename your notebook to "Chapter7_Exercise1_YourName".
Create a markdown cell with a title and brief description of the notebook.
Create a code cell and run a simple Python command: print("Hello, Analytics!") .
Add another markdown cell explaining what you just did.

Insert a code cell that displays the current Python version:
import sys

print(f"Python version: {sys.version}")

Check which libraries are pre-installed by running:
!pip list | grep -E "pandas|numpy|matplotlib|seaborn"

Save your notebook to Google Drive.
Share the notebook with "view" permissions and copy the shareable link.

Deliverable : A well-organized Colab notebook with markdown documentation and the shareable link.

Exercise 2: Import a Dataset from Multiple Sources

Objective : Practice importing data from different sources into Google Colab.

Tasks :

Part A: Import from URL

#Import the Iris dataset directly from a URL:
import pandas as pd

url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"

iris_df = pd.read_csv(url)

print(iris_df.head())

print(f"\nDataset shape: {iris_df.shape}")

Part B: Import from Google Drive

Mount your Google Drive.
Create a simple CSV file locally or download one, then upload it to your Google Drive.
Read the file from Google Drive into a pandas DataFrame.
Display the first few rows and basic information about the dataset.

Part C: Import from Local Upload

Use the file upload widget to upload a CSV file:
from google.colab import files
uploaded = files.upload()

# Read the uploaded file

import io

for filename in uploaded.keys():

df = pd.read_csv(io.BytesIO(uploaded[filename]))

print(f"Loaded {filename}")

print(df.head())

Part D: Documentation

Add markdown cells before each part explaining:

The data source
Why you might use this method
Any limitations or considerations

Deliverable : A Colab notebook demonstrating all three import methods with proper documentation.

Exercise 3: Calculate Descriptive Statistics with pandas

Objective : Use pandas to perform exploratory data analysis and calculate key statistics.

Tasks :

Load the Dataset : Use the Iris dataset from Exercise 2 or load a sales dataset:

import pandas as pd

import numpy as np

# Sample sales data

np.random.seed(42)

sales_data = pd.DataFrame({

'date': pd.date_range('2023-01-01', periods=100, freq='D'),

'product': np.random.choice(['Product A', 'Product B', 'Product C'], 100),

'region': np.random.choice(['North', 'South', 'East', 'West'], 100),

'sales': np.random.randint(100, 1000, 100),

'units': np.random.randint(1, 50, 100),

'cost': np.random.randint(50, 500, 100)

})

# Calculate profit

sales_data['profit'] = sales_data['sales'] - sales_data['cost']

Basic Descriptive Statistics :

# Overall statistics

print("=== Overall Statistics ===")

print(sales_data[['sales', 'units', 'profit']].describe())

# Mean, median, mode

print("\n=== Central Tendency ===")

print(f"Mean sales: ${sales_data['sales'].mean():.2f}")

print(f"Median sales: ${sales_data['sales'].median():.2f}")

print(f"Mode product: {sales_data['product'].mode()[0]}")

# Variability

print("\n=== Variability ===")

print(f"Sales std dev: ${sales_data['sales'].std():.2f}")

print(f"Sales variance: {sales_data['sales'].var():.2f}")

print(f"Sales range: ${sales_data['sales'].max() - sales_data['sales'].min():.2f}")

Grouped Statistics :

# By product

print("\n=== Statistics by Product ===")

product_stats = sales_data.groupby('product').agg({

'sales': ['mean', 'sum', 'count'],

'profit': ['mean', 'sum'],

'units': 'sum'

}).round(2)

print(product_stats)

# By region

print("\n=== Statistics by Region ===")

region_stats = sales_data.groupby('region')['sales'].agg([

('total_sales', 'sum'),

('avg_sales', 'mean'),

('min_sales', 'min'),

('max_sales', 'max')

]).round(2)

print(region_stats)

Time-Based Analysis :

# Add month column

sales_data['month'] = sales_data['date'].dt.to_period('M')

# Monthly statistics

print("\n=== Monthly Statistics ===")

monthly_stats = sales_data.groupby('month').agg({

'sales': 'sum',

'profit': 'sum',

'units': 'sum'

}).round(2)

print(monthly_stats)

Correlation Analysis :

print("\n=== Correlation Matrix ===")

correlation = sales_data[['sales', 'units', 'cost', 'profit']].corr()

print(correlation.round(3))

Custom Insights :

Calculate profit margin percentage
Identify top 5 days by sales
Find the product-region combination with highest average profit
Calculate what percentage of days had profit > $200

Deliverable : A Colab notebook with:

Well-organized code cells for each analysis
Markdown cells explaining each statistic and what it tells you
Business interpretations of the findings

Exercise 4: Create Basic Visualizations

Objective : Build fundamental visualizations using matplotlib and seaborn.

Tasks :

Setup : Use the sales dataset from Exercise 3 or create your own.

Line Chart - Sales Over Time :

import matplotlib.pyplot as plt

import seaborn as sns

# Set style

sns.set_style("whitegrid")

plt.figure(figsize=(12, 5))

# Daily sales

plt.plot(sales_data['date'], sales_data['sales'], marker='o',

linestyle='-', linewidth=1, markersize=3, alpha=0.7)

plt.title('Daily Sales Over Time', fontsize=14, fontweight='bold')

plt.xlabel('Date')

plt.ylabel('Sales ($)')

plt.xticks(rotation=45)

plt.tight_layout()

plt.show()

Bar Chart - Sales by Product :

plt.figure(figsize=(10, 6))

# Aggregate by product

product_sales = sales_data.groupby('product')['sales'].sum().sort_values(ascending=False)

# Create bar chart

plt.bar(product_sales.index, product_sales.values, color=['#1f77b4', '#ff7f0e', '#2ca02c'])

plt.title('Total Sales by Product', fontsize=14, fontweight='bold')

plt.xlabel('Product')

plt.ylabel('Total Sales ($)')

plt.xticks(rotation=0)

# Add value labels on bars

for i, v in enumerate(product_sales.values):

plt.text(i, v + 500, f'${v:,.0f}', ha='center', fontweight='bold')

plt.tight_layout()

plt.show()

Histogram - Distribution of Sales :

plt.figure(figsize=(10, 6))

# Using matplotlib

plt.subplot(1, 2, 1)

plt.hist(sales_data['sales'], bins=20, color='skyblue', edgecolor='black', alpha=0.7)

plt.title('Sales Distribution (Matplotlib)', fontweight='bold')

plt.xlabel('Sales ($)')

plt.ylabel('Frequency')

# Using seaborn

plt.subplot(1, 2, 2)

sns.histplot(sales_data['sales'], bins=20, kde=True, color='coral')

plt.title('Sales Distribution (Seaborn)', fontweight='bold')

plt.xlabel('Sales ($)')

plt.tight_layout()

plt.show()

Box Plot - Sales by Region :

plt.figure(figsize=(10, 6))

sns.boxplot(data=sales_data, x='region', y='sales', palette='Set2')

plt.title('Sales Distribution by Region', fontsize=14, fontweight='bold')

plt.xlabel('Region')

plt.ylabel('Sales ($)')

plt.tight_layout()

plt.show()

Scatter Plot - Sales vs. Units :

plt.figure(figsize=(10, 6))

sns.scatterplot(data=sales_data, x='units', y='sales', hue='product',

style='product', s=100, alpha=0.7)

plt.title('Sales vs. Units Sold by Product', fontsize=14, fontweight='bold')

plt.xlabel('Units Sold')

plt.ylabel('Sales ($)')

plt.legend(title='Product', bbox_to_anchor=(1.05, 1), loc='upper left')

plt.tight_layout()

plt.show()

Multi-Panel Dashboard :

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

fig.suptitle('Sales Analytics Dashboard', fontsize=16, fontweight='bold')

# Panel 1: Sales over time

axes[0, 0].plot(sales_data['date'], sales_data['sales'], color='blue', alpha=0.6)

axes[0, 0].set_title('Sales Trend')

axes[0, 0].set_xlabel('Date')

axes[0, 0].set_ylabel('Sales ($)')

axes[0, 0].tick_params(axis='x', rotation=45)

# Panel 2: Sales by product

product_sales = sales_data.groupby('product')['sales'].sum()

axes[0, 1].bar(product_sales.index, product_sales.values, color=['#1f77b4', '#ff7f0e', '#2ca02c'])

axes[0, 1].set_title('Sales by Product')

axes[0, 1].set_xlabel('Product')

axes[0, 1].set_ylabel('Total Sales ($)')

# Panel 3: Sales distribution

axes[1, 0].hist(sales_data['sales'], bins=20, color='skyblue', edgecolor='black')

axes[1, 0].set_title('Sales Distribution')

axes[1, 0].set_xlabel('Sales ($)')

axes[1, 0].set_ylabel('Frequency')

# Panel 4: Profit by region

region_profit = sales_data.groupby('region')['profit'].mean()

axes[1, 1].barh(region_profit.index, region_profit.values, color='coral')

axes[1, 1].set_title('Average Profit by Region')

axes[1, 1].set_xlabel('Average Profit ($)')

axes[1, 1].set_ylabel('Region')

plt.tight_layout()

plt.show()

Deliverable : A Colab notebook with:

All visualizations properly labeled and formatted
Markdown cells explaining what each chart shows
At least one insight or observation from each visualization

Exercise 5: Document a Notebook for a Non-Technical Manager

Objective : Practice clear communication and documentation for business stakeholders.

Scenario : You've been asked to analyze quarterly sales performance and present findings to your manager who has limited technical background.

Tasks :

Create a Professional Notebook Structure :

# Quarterly Sales Performance Analysis

**Prepared by**: [Your Name]

**Date**: [Current Date]

**Period**: Q1 2023

## Executive Summary

[Write 2-3 sentences summarizing key findings]

## Table of Contents

1. Data Overview

2. Sales Performance Analysis

3. Product Performance

4. Regional Analysis

5. Key Insights and Recommendations

Data Overview Section :

Load and prepare the data
Add a markdown cell explaining:

What data you're analyzing
The time period covered
Key metrics included

Display the first few rows with explanation
Show basic dataset information (number of records, date range, etc.)

Analysis Sections with Business Language :

For each analysis, structure as:

Business Question (markdown): "How did sales perform this quarter?"
Analysis (code): Perform the calculation
Visualization (code): Create a clear chart
Interpretation (markdown): Explain what the results mean in business terms
So What? (markdown): Why this matters and what action to take

Example Structure for One Analysis :

## 2. Sales Performance Analysis

### Question: How did our sales trend throughout the quarter?

We want to understand if sales were growing, declining, or stable during Q1.

# Calculate daily and weekly sales trends

daily_sales = sales_data.groupby('date')['sales'].sum()

# Create visualization

plt.figure(figsize=(12, 5))

plt.plot(daily_sales.index, daily_sales.values, marker='o', linewidth=2)

plt.title('Daily Sales Trend - Q1 2023', fontsize=14, fontweight='bold')

plt.xlabel('Date')

plt.ylabel('Sales ($)')

plt.grid(True, alpha=0.3)

plt.tight_layout()

plt.show()

# Calculate growth

first_week_avg = daily_sales[:7].mean()

last_week_avg = daily_sales[-7:].mean()

growth_rate = ((last_week_avg - first_week_avg) / first_week_avg) * 100

print(f"First week average: ${first_week_avg:,.2f}")

print(f"Last week average: ${last_week_avg:,.2f}")

print(f"Growth rate: {growth_rate:.1f}%")

### Key Finding

Sales showed a **positive trend** throughout Q1, with the last week averaging

${last_week_avg:,.2f} compared to ${first_week_avg:,.2f} in the first week—

a growth of {growth_rate:.1f}%.

### What This Means

- Our Q1 initiatives appear to be working

- We're building momentum heading into Q2

- Current trajectory suggests we'll exceed Q2 targets if maintained

### Recommended Action

Continue current marketing and sales strategies. Consider increasing inventory

for top-performing products to meet growing demand.

Key Insights and Recommendations Section :

## 5. Key Insights and Recommendations

### Top 3 Insights

1. **[Insight 1]**: [Explanation with supporting data]

2. **[Insight 2]**: [Explanation with supporting data]

3. **[Insight 3]**: [Explanation with supporting data]

### Recommendations

|----------|---------------|-----------------|----------|

### Next Steps

- [ ] [Action item 1]

- [ ] [Action item 2]

- [ ] [Action item 3]

Best Practices to Follow :

Use clear, jargon-free language
Lead with insights, not methodology
Use formatting (bold, italics, bullet points) for readability
Include visual breaks between sections
Add context before showing code or results
Highlight key numbers and percentages
End each section with "so what?" implications
Include a clear call to action

Deliverable : A polished Colab notebook that:

Tells a clear story from start to finish
Uses professional formatting and structure
Explains technical concepts in business terms
Includes actionable recommendations
Could be shared directly with a non-technical stakeholder

Chapter 7. Working with Python in Cloud-Based Environments

7.1 Why Python for Business Analytics

The Rise of Python in Business

1. Readability and Accessibility

2. Comprehensive Ecosystem for Analytics

3. Versatility Across the Analytics Workflow

4. Free and Open Source

5. Industry Adoption and Career Value

6. Integration and Automation

7. Reproducibility and Documentation

When to Use Python vs. Other Tools

Python vs. R: A Brief Comparison

The Business Case for Python

7.2 Introduction to Cloud-Based Analytics Environments

The Evolution of Analytics Environments

What Are Cloud-Based Analytics Environments?

Benefits of Cloud-Based Environments

1. Zero Setup Time

2. Consistent Environment

3. Access to Powerful Hardware

4. Collaboration and Sharing

5. Accessibility

6. Integration with Cloud Services

7.2.1 Google Colab: Features and Limitations

Key Features

Limitations

Colab Pro and Pro+

7.2.2 Alternatives: Jupyter Notebook, VS Code, and Hosted Platforms (Conceptual)

Jupyter Notebook (Local)

JupyterLab (Local)

VS Code with Jupyter Extension

Kaggle Notebooks

Databricks Community Edition

Amazon SageMaker Studio Lab

Microsoft Azure Notebooks (Deprecated) / Azure ML Studio

Deepnote

Comparison Matrix

Recommendation for Business Analysts

7.3 Getting Started with Google Colab

7.3.1 Creating and Managing Notebooks

Creating Your First Notebook

Understanding the Interface

Your First Code

Adding Cells

Working with Text Cells (Markdown)

Organizing Your Notebook

Renaming Your Notebook

Saving Your Notebook

Downloading Your Notebook

Copying and Sharing Notebooks

Opening Existing Notebooks

Managing Multiple Notebooks

7.3.2 Connecting to Google Drive and Data Sources

Mounting Google Drive

Reading Files from Google Drive

Saving Files to Google Drive

Uploading Files Directly to Colab

Reading Data from URLs

Connecting to Google Sheets

Connecting to Databases

Connecting to APIs

Best Practices for Data Access

7.4 Python Essentials for Analytics

7.4.1 Basic Syntax, Data Types, and Control Structures

Python Basics

Data Types

Control Structures

7.4.2 Working with Libraries: pandas, numpy, matplotlib, seaborn

Introduction to Libraries

pandas: Data Manipulation

numpy: Numerical Computing

Exercises

Exercise 1: Set Up and Explore Google Colab

Exercise 2: Import a Dataset from Multiple Sources

Exercise 3: Calculate Descriptive Statistics with pandas

Exercise 4: Create Basic Visualizations

Exercise 5: Document a Notebook for a Non-Technical Manager

Additional Resources

Python Learning

Google Colab Resources