Chapter 7. Working with Python in Cloud-Based Environments
"The best tool is the one you'll actually use." — Anonymous Data Scientist
In today's business analytics landscape, the ability to work with data programmatically has become essential. While spreadsheets remain valuable for many tasks, modern analytics increasingly requires the flexibility, scalability, and reproducibility that programming languages provide. Python has emerged as the lingua franca of business analytics, offering a gentle learning curve for beginners while providing the power and sophistication needed for advanced work.
This chapter introduces Python for business analytics through the lens of cloud-based environments, particularly Google Colab. Rather than wrestling with software installation and environment configuration, cloud-based platforms allow you to start analyzing data immediately—all you need is a web browser. Whether you're a business analyst looking to expand your technical skills, a manager seeking to understand your team's tools, or a student preparing for a data-driven career, this chapter provides the foundation you need to work confidently with Python in modern analytics environments.
We'll explore why Python has become the tool of choice for business analytics, how to leverage cloud-based platforms for immediate productivity, and the essential Python skills needed for data manipulation, analysis, and visualization. By the end of this chapter, you'll be able to set up your own analytics environment, import and analyze data, create compelling visualizations, and share reproducible analyses with colleagues.
7.1 Why Python for Business Analytics
The Rise of Python in Business
Python has experienced explosive growth in business analytics over the past decade. According to the 2024 Stack Overflow Developer Survey, Python is the most wanted programming language, and the 2024 Kaggle State of Data Science survey shows that over 80% of data professionals use Python regularly.
Why has Python become so dominant?
1. Readability and Accessibility
Python's syntax is designed to be readable and intuitive, resembling natural language more than traditional programming languages.
Example Comparison:
# Python: Calculate average sales
average_sales = sum(sales_data) / len(sales_data)
# Compare to equivalent in other languages (more verbose)
# Java would require type declarations, class structure, etc.
# SQL would require aggregation syntax
# Excel formula: =AVERAGE(A1:A100)
Python code reads almost like English: "average sales equals the sum of sales data divided by the length of sales data."
2. Comprehensive Ecosystem for Analytics
Python offers a rich ecosystem of libraries specifically designed for data work:
|
Library |
Purpose |
Business Use Cases |
|
pandas |
Data manipulation and analysis |
Cleaning data, aggregating sales reports, merging datasets |
|
numpy |
Numerical computing |
Statistical calculations, array operations, financial modeling |
|
matplotlib |
Static visualizations |
Creating charts for reports and presentations |
|
seaborn |
Statistical visualizations |
Exploring distributions, correlations, patterns |
|
scikit-learn |
Machine learning |
Predictive modeling, customer segmentation, forecasting |
|
statsmodels |
Statistical modeling |
Regression analysis, hypothesis testing, time series |
|
plotly |
Interactive visualizations |
Dashboards, interactive reports |
|
requests |
API interactions |
Pulling data from web services, APIs |
|
openpyxl/xlsxwriter |
Excel integration |
Reading/writing Excel files, automating reports |
3. Versatility Across the Analytics Workflow
Python supports every stage of the analytics process:
Data Collection → Data Cleaning → Analysis → Visualization → Modeling → Deployment
Example Workflow:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# 1. Data Collection
data = pd.read_csv('sales_data.csv')
# 2. Data Cleaning
data = data.dropna()
data['date'] = pd.to_datetime(data['date'])
# 3. Analysis
monthly_sales = data.groupby(data['date'].dt.to_period('M'))['sales'].sum()
# 4. Visualization
monthly_sales.plot(kind='line', title='Monthly Sales Trend')
plt.show()
# 5. Modeling
X = data[['marketing_spend', 'price']]
y = data['sales']
model = LinearRegression().fit(X, y)
# 6. Prediction
predicted_sales = model.predict([[50000, 29.99]])
All in one language, one environment, one workflow.
4. Free and Open Source
Unlike proprietary tools (SAS, STATA, Tableau), Python is:
- Free to use : No licensing costs
- Open source : Transparent, community-driven development
- Platform-independent : Works on Windows, Mac, Linux
- Extensible : Anyone can create and share libraries
Business Impact: A company can equip an entire analytics team with Python at zero software cost, compared to tens of thousands of dollars for proprietary alternatives.
5. Industry Adoption and Career Value
Python skills are in high demand:
- Job postings : 70%+ of data analyst positions mention Python
- Salary premium : Python skills correlate with 10-20% higher salaries
- Industry standard : Used by Google, Netflix, Amazon, JPMorgan, and virtually every major tech company
6. Integration and Automation
Python excels at connecting different systems and automating workflows:
Example Use Cases:
- Pull data from Salesforce API, analyze it, and email results
- Automate monthly report generation from database queries
- Connect to Google Sheets, process data, update dashboards
- Schedule recurring analyses to run automatically
Example: Automated weekly sales report
import pandas as pd
import smtplib
from email.mime.text import MIMEText
from datetime import datetime
# Pull data
data = pd.read_sql("SELECT * FROM sales WHERE date >= CURRENT_DATE - 7", connection)
# Analyze
weekly_total = data['sales'].sum()
top_products = data.groupby('product')['sales'].sum().nlargest(5)
# Create report
report = f"""
Weekly Sales Report - {datetime.now().strftime('%Y-%m-%d')}
Total Sales: ${weekly_total:,.2f}
Top 5 Products:
{top_products.to_string()}
"""
# Email report (simplified)
# send_email(report, recipients=['team@company.com'])
7. Reproducibility and Documentation
Python code serves as documentation of your analysis:
- Transparent : Every step is visible and reviewable
- Reproducible : Others can run the same analysis and get the same results
- Auditable : Critical for compliance and quality control
- Collaborative : Easy to share and iterate with version control (Git)
Contrast with Excel:
- Excel: Formulas hidden in cells, manual steps undocumented, "works on my machine" problems
- Python: Every transformation explicit, version-controlled, reproducible
When to Use Python vs. Other Tools
Python isn't always the best choice. Here's a practical decision framework:
|
Scenario |
Best Tool |
Why |
|
Quick calculation, small dataset (<1000 rows) |
Excel |
Faster for simple tasks, visual interface |
|
Ad-hoc data exploration, one-time analysis |
Excel or Python |
Either works; choose based on complexity |
|
Recurring analysis, automation needed |
Python |
Reproducibility, automation capabilities |
|
Large datasets (>100K rows) |
Python or SQL |
Performance, memory efficiency |
|
Complex transformations, multiple data sources |
Python |
Flexibility, integration capabilities |
|
Statistical modeling, machine learning |
Python or R |
Specialized libraries, industry standard |
|
Interactive dashboards for business users |
Tableau/Power BI |
User-friendly interface, no coding required |
|
Production deployment, real-time systems |
Python |
Integration, scalability, automation |
The Hybrid Approach:
Many analysts use a combination:
- Excel : Initial exploration, simple calculations, sharing with non-technical stakeholders
- Python : Data cleaning, complex analysis, automation, modeling
- SQL : Data extraction from databases
- Tableau/Power BI : Final dashboards for business users
Python vs. R: A Brief Comparison
Both Python and R are excellent for analytics. Here's a practical comparison:
|
Aspect |
Python |
R |
|
Primary Design |
General-purpose programming |
Statistical computing |
|
Learning Curve |
Gentler for programming beginners |
Steeper for non-statisticians |
|
Business Adoption |
Broader (analytics + engineering) |
Primarily analytics/research |
|
Visualization |
Good (matplotlib, seaborn, plotly) |
Excellent (ggplot2) |
|
Machine Learning |
Excellent (scikit-learn, TensorFlow) |
Good (caret, tidymodels) |
|
Statistical Analysis |
Good (statsmodels, scipy) |
Excellent (built-in, specialized packages) |
|
Deployment |
Easier (general-purpose language) |
More challenging |
|
Community |
Larger, more diverse |
Smaller, more specialized |
|
Job Market |
Broader demand |
More specialized roles |
Recommendation for Business Analysts: Start with Python. It's more versatile, has broader industry adoption, and skills transfer to other domains (web development, automation, data engineering).
The Business Case for Python
For Individual Analysts:
- ✅ Increase productivity through automation
- ✅ Handle larger, more complex datasets
- ✅ Create reproducible, auditable analyses
- ✅ Enhance career prospects and earning potential
- ✅ Collaborate more effectively with technical teams
For Organizations:
- ✅ Reduce software licensing costs
- ✅ Improve analysis quality and reproducibility
- ✅ Enable advanced analytics (ML, AI)
- ✅ Attract and retain technical talent
- ✅ Scale analytics capabilities efficiently
ROI Example:
A business analyst spending 10 hours/week on manual data processing can often reduce this to 2 hours/week with Python automation:
- Time saved : 8 hours/week × 50 weeks = 400 hours/year
- Value : At $50/hour fully loaded cost = $20,000/year
- Learning investment : ~40 hours to reach productivity
- Payback period : ~5 weeks
7.2 Introduction to Cloud-Based Analytics Environments
The Evolution of Analytics Environments
Traditional Approach:
- Install Python on your computer
- Configure environment and install libraries
- Manage dependencies and versions
- Deal with "works on my machine" problems
- Struggle with collaboration and sharing
Cloud-Based Approach:
- Open a web browser
- Start analyzing data
- Collaborate in real-time
- Access from anywhere
- No installation or configuration needed
What Are Cloud-Based Analytics Environments?
Cloud-based analytics environments are web-based platforms that provide:
- Computational resources : CPU, memory, storage in the cloud
- Pre-configured environments : Python and libraries already installed
- Interactive notebooks : Mix code, visualizations, and documentation
- Collaboration features : Share, comment, co-edit
- Integration : Connect to cloud storage, databases, APIs
Benefits of Cloud-Based Environments
1. Zero Setup Time
Traditional Installation:
# Install Python
# Install pip
# Install libraries
pip install pandas numpy matplotlib seaborn scikit-learn
# Troubleshoot version conflicts
# Configure IDE
# Set up virtual environments
# Debug path issues
# ...hours later...
Cloud-Based:
1. Go to colab.research.google.com
2. Click "New Notebook"
3. Start codin g
2. Consistent Environment
Everyone uses the same Python version and libraries—no more "works on my machine" problems.
3. Access to Powerful Hardware
Cloud platforms provide:
- Free tier : Sufficient for most business analytics
- Upgraded options : GPUs for machine learning, more RAM for large datasets
- Scalability : Handle datasets that wouldn't fit on your laptop
4. Collaboration and Sharing
Share notebooks like Google Docs:
- Real-time collaboration : Multiple people editing simultaneously
- Comments : Discuss specific code cells
- Version history : Track changes over time
- Easy sharing : Send a link, no file attachments
5. Accessibility
Work from anywhere:
- Any device : Laptop, tablet, even phone (for viewing)
- Any operating system : Windows, Mac, Linux, ChromeOS
- No local storage needed : Everything saved in the cloud
6. Integration with Cloud Services
Easy connection to:
- Google Drive, Dropbox, OneDrive
- Cloud databases (BigQuery, AWS RDS)
- APIs and web services
- Cloud storage (S3, Google Cloud Storage)
7.2.1 Google Colab: Features and Limitations
Google Colaboratory (Colab) is a free, cloud-based Jupyter notebook environment provided by Google.
Key Features
1. Free Access to Computing Resources
- CPU : Standard computing for most analytics tasks
- GPU : Graphics processing units for machine learning (free tier available)
- TPU : Tensor processing units for advanced deep learning (free tier available)
- RAM : 12-13 GB in free tier
- Disk : ~100 GB temporary storage
2. Pre-installed Libraries
Most common libraries are already installed:
import pandas as pd # ✓ Pre-installed
import numpy as np # ✓ Pre-installed
import matplotlib.pyplot as plt # ✓ Pre-installed
import seaborn as sns # ✓ Pre-installed
import sklearn # ✓ Pre-installed
For additional libraries:
!pip install library_name
3. Google Drive Integration
Mount your Google Drive to access files:
from google.colab import drive
drive.mount('/content/drive')
# Now access files like:
data = pd.read_csv('/content/drive/MyDrive/data/sales.csv')
4. Interactive Widgets and Forms
Create user-friendly interfaces:
#@title Enter Parameters { run: "auto" }
start_date = "2024-01-01" #@param {type:"date"}
region = "North" #@param ["North", "South", "East", "West"]
threshold = 1000 #@param {type:"slider", min:0, max:10000, step:100}
print(f"Analyzing {region} region from {start_date} with threshold {threshold}")
5. Markdown and Rich Text
Mix code with formatted documentation:
# Analysis Title
This notebook analyzes **Q3 sales data** to identify trends.
## Key Findings
- Sales increased 15% YoY
- Northeast region outperformed
6. Built-in Data Visualization
Visualizations render inline:
import matplotlib.pyplot as plt
plt.plot([1, 2, 3, 4], [10, 20, 25, 30])
plt.title('Sales Trend')
plt.show() # Displays directly in notebook
7. Code Snippets Library
Access pre-written code for common tasks:
- Camera capture
- File upload
- Data visualization templates
- Machine learning examples
8. Keyboard Shortcuts
Efficient navigation and editing:
- Ctrl/Cmd + Enter : Run current cell
- Shift + Enter : Run cell and move to next
- Ctrl/Cmd + M B : Insert cell below
- Ctrl/Cmd + M D : Delete cell
- Ctrl/Cmd + / : Comment/uncomment
Limitations
1. Session Timeouts
- Idle timeout : ~90 minutes of inactivity
- Maximum session : ~12 hours
- Impact : Long-running analyses may be interrupted
- Workaround : Save intermediate results frequently
# Save intermediate results
data.to_csv('/content/drive/MyDrive/intermediate_results.csv', index=False)
2. Temporary Storage
- Files in /content/ are deleted when session ends
- Solution : Save important files to Google Drive
# BAD: Saved to temporary storage
data.to_csv('results.csv') # Lost when session ends!
# GOOD: Saved to Google Drive
data.to_csv('/content/drive/MyDrive/results.csv') # Persists
3. Resource Limits
Free tier limitations:
- RAM : 12-13 GB (can run out with large datasets)
- GPU/TPU : Limited availability during peak times
- Compute units : Fair usage limits (not publicly specified)
Workaround for large datasets:
# Read data in chunks
chunk_size = 10000
chunks = []
for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
# Process each chunk
processed = chunk[chunk['sales'] > 1000]
chunks.append(processed)
data = pd.concat(chunks, ignore_index=True)
4. No Terminal Access
- Limited command-line functionality
- Can run shell commands with ! prefix, but not a full terminal
# Shell commands work with ! prefix
!ls /content/drive/MyDrive/
!wget https://example.com/data.csv
5. Internet Dependency
- Requires stable internet connection
- No offline work (unlike local Jupyter)
6. Privacy Considerations
- Data is processed on Google's servers
- Important : Don't upload sensitive/confidential data without proper authorization
- Review your organization's data governance policies
7. Limited Customization
- Can't customize the environment as extensively as local installations
- Some advanced configurations not possible
Colab Pro and Pro+
For power users, Google offers paid tiers:
|
Feature |
Free |
Pro (~$10/month) |
Pro+ (~$50/month) |
|
RAM |
12-13 GB |
Up to 32 GB |
Up to 52 GB |
|
Session Length |
12 hours |
24 hours |
24 hours |
|
Idle Timeout |
90 min |
Longer |
Longer |
|
GPU Access |
Limited |
Priority |
Priority |
|
Background Execution |
No |
Yes |
Yes |
|
Compute Units |
Limited |
More |
Most |
When to upgrade:
- Working with datasets >10 GB
- Training complex machine learning models
- Need longer uninterrupted sessions
- Require background execution
7.2.2 Alternatives: Jupyter Notebook, VS Code, and Hosted Platforms (Conceptual)
While this chapter focuses on Google Colab, it's valuable to understand the broader ecosystem.
Jupyter Notebook (Local)
What it is: The original notebook interface, running on your own computer.
Pros:
- ✅ Full control over environment
- ✅ Works offline
- ✅ No resource limits (beyond your hardware)
- ✅ Complete privacy (data stays local)
- ✅ Extensive customization
Cons:
- ❌ Requires installation and configuration
- ❌ Limited by your computer's resources
- ❌ Harder to share and collaborate
- ❌ Environment inconsistencies across team members
When to use:
- Working with sensitive data that can't leave your network
- Need offline access
- Require specific environment configurations
- Have powerful local hardware
Getting Started (conceptual):
# Install Jupyter
pip install jupyter
# Launch
jupyter notebook
# Opens in browser at localhost:8888
JupyterLab (Local)
What it is: Next-generation interface for Jupyter, more IDE-like.
Additional features over Jupyter Notebook:
- Multiple notebooks in tabs
- File browser
- Terminal access
- Text editor
- Extension system
When to use: Same as Jupyter Notebook, but for users who want a more integrated development environment.
VS Code with Jupyter Extension
What it is: Microsoft's Visual Studio Code editor with Jupyter notebook support.
Pros:
- ✅ Powerful code editor features (IntelliSense, debugging)
- ✅ Integrated terminal
- ✅ Git integration
- ✅ Extensions for Python, data science
- ✅ Can connect to remote Jupyter servers
Cons:
- ❌ Steeper learning curve
- ❌ More complex setup
- ❌ Heavier application
When to use:
- You're already familiar with VS Code
- Need advanced debugging capabilities
- Working on larger projects with multiple files
- Want tight Git integration
Kaggle Notebooks
What it is: Free cloud notebooks from Kaggle (owned by Google).
Pros:
- ✅ Free GPU access (30 hours/week)
- ✅ Large dataset library
- ✅ Community and competitions
- ✅ Similar to Colab
Cons:
- ❌ Less Google Drive integration
- ❌ Smaller storage limits
- ❌ Focused on data science competitions
When to use:
- Learning data science
- Participating in Kaggle competitions
- Need free GPU for machine learning
Databricks Community Edition
What it is: Free tier of Databricks, a cloud analytics platform.
Pros:
- ✅ Built for big data (Apache Spark)
- ✅ Collaborative features
- ✅ Enterprise-grade platform
Cons:
- ❌ Steeper learning curve
- ❌ Overkill for small datasets
- ❌ Limited free tier
When to use:
- Working with very large datasets (>100 GB)
- Learning Spark
- Your organization uses Databricks
Amazon SageMaker Studio Lab
What it is: Free Jupyter environment from AWS.
Pros:
- ✅ Free compute (no credit card required)
- ✅ 15 GB storage
- ✅ GPU access
Cons:
- ❌ Requires application/waitlist
- ❌ Less intuitive than Colab
- ❌ Smaller community
Microsoft Azure Notebooks (Deprecated) / Azure ML Studio
What it is: Microsoft's cloud notebook offerings.
Note: Azure Notebooks was deprecated; Azure ML Studio is the current offering but requires Azure subscription.
Deepnote
What it is: Collaborative data science notebook platform.
Pros:
- ✅ Real-time collaboration
- ✅ Beautiful interface
- ✅ Good integration features
- ✅ Free tier available
Cons:
- ❌ Smaller community
- ❌ Limited free tier
- ❌ Less mature than Colab
Comparison Matrix
|
Platform |
Cost |
Setup |
Collaboration |
Resources |
Best For |
|
Google Colab |
Free |
None |
Good |
Moderate |
General business analytics, learning |
|
Jupyter (Local) |
Free |
Medium |
Poor |
Your hardware |
Sensitive data, offline work |
|
VS Code |
Free |
Medium |
Medium |
Your hardware |
Software development, complex projects |
|
Kaggle |
Free |
None |
Good |
Moderate |
Learning, competitions |
|
Databricks |
Free tier |
Low |
Excellent |
Good |
Big data, Spark |
|
Colab Pro |
$10/mo |
None |
Good |
High |
Power users, ML |
Recommendation for Business Analysts
Start with Google Colab:
- Zero setup barrier
- Free and powerful enough for most business analytics
- Easy collaboration and sharing
- Smooth transition to other platforms later if needed
Transition to local Jupyter when:
- Working with confidential data
- Need offline access
- Require specific environment configurations
- Have outgrown Colab's resource limits
7.3 Getting Started with Google Colab
7.3.1 Creating and Managing Notebooks
Creating Your First Notebook
Step 1: Access Google Colab
- Open your web browser
- Go to https://colab.research.google.com
- Sign in with your Google account
Step 2: Create a New Notebook
Option A: From the welcome screen
- Click "New Notebook"
Option B: From Google Drive
- Go to Google Drive
- Click "New" → "More" → "Google Colaboratory"
Option C: From File menu
- File → New Notebook
Your first notebook opens with an empty code cell.
Understanding the Interface
┌─────────────────────────────────────────────────────┐
│ File Edit View Insert Runtime Tools Help │ ← Menu bar
├─────────────────────────────────────────────────────┤
│ + Code + Text 📁 Files 🔍 Search │ ← Toolbar
├─────────────────────────────────────────────────────┤
│ Untitled0.ipynb │ ← Notebook name
├─────────────────────────────────────────────────────┤
│ [ ] # Your code here │ ← Code cell
│ │
├─────────────────────────────────────────────────────┤
│ + Code + Text │ ← Add cells
└─────────────────────────────────────────────────────┘
Key Interface Elements:
- Menu Bar : File operations, editing, runtime control
- Toolbar : Quick access to common actions
- Notebook Name : Click to rename
- Code Cells : Where you write Python code
- Text Cells : Markdown for documentation
- Sidebar : Files, table of contents, code snippets
Your First Code
Click in the code cell and type:
print("Hello, Business Analytics!")
Run the cell:
- Click the play button (▶) on the left of the cell, OR
- Press Shift + Enter , OR
- Press Ctrl/Cmd + Enter
Output appears below the cell:
Hello, Business Analytics!
Adding Cells
Add a code cell:
- Click "+ Code" button, OR
- Hover between cells and click "+ Code", OR
- Press Ctrl/Cmd + M B (below) or Ctrl/Cmd + M A (above)
Add a text cell:
- Click "+ Text" button, OR
- Hover between cells and click "+ Text", OR
- Press Ctrl/Cmd + M M (convert code cell to text)
Working with Text Cells (Markdown)
Text cells use Markdown formatting:
# Main Heading
## Subheading
### Sub-subheading
**Bold text** and *italic text*
- Bullet point 1
- Bullet point 2
1. Numbered item
2. Another item
[Link text](https://example.com)
`inline code`
code block
> Blockquote
Example text cell:
# Q3 Sales Analysis
This notebook analyzes Q3 2024 sales data to identify:
- Top performing products
- Regional trends
- Seasonal patterns
**Data Source:** Salesforce export, October 15, 2024
Organizing Your Notebook
Best practices:
- Start with a title and overview
# Customer Churn Analysis - October 2024
**Objective:** Identify factors contributing to customer churn
**Data:** CRM export (10,000 customers, 2023-2024)
**Methodology:**
1. Data cleaning and exploration
2. Descriptive statistics
3. Visualization of key patterns
4. Predictive modeling
- Use section headings
## 1. Data Import and Cleaning
## 2. Exploratory Data Analysis
## 3. Key Findings
## 4. Recommendations
- Document your code
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Load data from Google Drive
data = pd.read_csv('/content/drive/MyDrive/data/sales.csv')
# Display first few rows to verify
print(f"Dataset shape: {data.shape}")
data.head()
- Use table of contents
- Click the table of contents icon in the sidebar
- Automatically generated from your headings
- Click to navigate to sections
Renaming Your Notebook
Method 1: Click the notebook name at the top
- Type new name
- Press Enter
Method 2: File → Rename
Naming conventions:
- ✅ 2024-10_Customer_Churn_Analysis.ipynb
- ✅ Q3_Sales_Report_v2.ipynb
- ❌ Untitled0.ipynb
- ❌ notebook.ipynb
Best practices:
- Include date or period
- Descriptive name
- Version number if iterating
- Use underscores or hyphens, not spaces
Saving Your Notebook
Automatic saving:
- Colab auto-saves to Google Drive every few minutes
- Look for "All changes saved" in the top bar
Manual saving:
- File → Save, OR
- Ctrl/Cmd + S
Save location:
- Default: "Colab Notebooks" folder in your Google Drive
- Can be moved to any folder in your Drive
Downloading Your Notebook
Download as .ipynb (Jupyter format):
- File → Download → Download .ipynb
- Can be opened in Jupyter, JupyterLab, VS Code
Download as Python script:
- File → Download → Download .py
- Pure Python code (markdown cells become comments)
Download with outputs:
- File → Print
- Save as PDF from print dialog
Copying and Sharing Notebooks
Make a copy:
- File → Save a copy in Drive
- Creates a duplicate you can edit independently
Share with others:
- Click "Share" button (top right)
- Add people by email, OR
- Get shareable link
Sharing permissions:
- Viewer : Can view and run, cannot edit
- Commenter : Can view, run, and comment
- Editor : Can view, run, and edit
Best practice for collaboration:
- Share as "Viewer" for read-only access
- Share as "Editor" for collaboration
- Use "Copy to Drive" to let others create their own editable version
Opening Existing Notebooks
From Colab:
- File → Open notebook
- Tabs: Recent, Google Drive, GitHub, Upload
From Google Drive:
- Navigate to the .ipynb file
- Double-click to open in Colab
From GitHub:
- File → Open notebook → GitHub tab
- Enter repository URL or search
From URL:
- Replace github.com with colab.research.google.com/github in any GitHub notebook URL
- Example: https://colab.research.google.com/github/username/repo/blob/main/notebook.ipynb
Managing Multiple Notebooks
Organization strategies:
- Folder structure in Google Drive:
My Drive/
├── Analytics Projects/
│ ├── Customer Churn/
│ │ ├── data/
│ │ ├── 01_Data_Exploration.ipynb
│ │ ├── 02_Modeling.ipynb
│ │ └── 03_Final_Report.ipynb
│ └── Sales Analysis/
│ ├── data/
│ └── Q3_Analysis.ipynb
- Naming conventions:
- Prefix with numbers for sequence: 01_ , 02_ , 03_
- Include dates: 2024-10-15_
- Use descriptive names
- Version control:
- Save major versions: Analysis_v1.ipynb , Analysis_v2.ipynb
- Use "Save a copy" before major changes
- Consider GitHub for serious version control
7.3.2 Connecting to Google Drive and Data Sources
Mounting Google Drive
Why mount Google Drive?
- Access your data files
- Save results permanently
- Share data across notebooks
How to mount:
from google.colab import drive
drive.mount('/content/drive')
What happens:
- A link appears: "Go to this URL in a browser"
- Click the link (or it opens automatically)
- Choose your Google account
- Click "Allow"
- Copy the authorization code (if prompted)
- Paste into the notebook (if prompted)
Modern experience: Usually just one click to authorize.
Output:
Mounted at /content/drive
Verify the mount:
# List files in your Google Drive
!ls "/content/drive/MyDrive"
File path structure:
/content/drive/MyDrive/ ← Your Google Drive root
/content/drive/MyDrive/Colab Notebooks/ ← Default notebook location
/content/drive/MyDrive/data/ ← Your data folder (example)
Reading Files from Google Drive
CSV file:
import pandas as pd
# Read CSV from Google Drive
data = pd.read_csv('/content/drive/MyDrive/data/sales_data.csv')
# Display first few rows
print(f"Loaded {len(data)} rows")
data.head()
Excel file:
# Read Excel file
data = pd.read_excel('/content/drive/MyDrive/data/sales_data.xlsx', sheet_name='Q3 Sales')
data.head()
Multiple sheets:
# Read all sheets
excel_file = pd.ExcelFile('/content/drive/MyDrive/data/sales_data.xlsx')
# See sheet names
print(excel_file.sheet_names)
# Read specific sheets
q3_data = pd.read_excel(excel_file, sheet_name='Q3')
q4_data = pd.read_excel(excel_file, sheet_name='Q4' )
Text file:
# Read text file
with open('/content/drive/MyDrive/data/notes.txt', 'r') as f:
content = f.read()
print(content)
Saving Files to Google Drive
Save DataFrame to CSV:
# Process data
results = data.groupby('region')['sales'].sum()
# Save to Google Drive
results.to_csv('/content/drive/MyDrive/results/regional_sales.csv')
print("Results saved to Google Drive")
Save DataFrame to Excel:
# Save to Excel
results.to_excel('/content/drive/MyDrive/results/regional_sales.xlsx',
sheet_name='Summary',
index=False)
Save multiple DataFrames to Excel:
# Create Excel writer
with pd.ExcelWriter('/content/drive/MyDrive/results/sales_report.xlsx') as writer:
summary.to_excel(writer, sheet_name='Summary', index=False)
details.to_excel(writer, sheet_name='Details', index=False)
trends.to_excel(writer, sheet_name='Trends', index=False)
print("Multi-sheet report saved")
Save plot:
import matplotlib.pyplot as plt
# Create plot
plt.figure(figsize=(10, 6))
plt.plot(data['date'], data['sales'])
plt.title('Sales Trend')
plt.xlabel('Date')
plt.ylabel('Sales ($)')
# Save to Google Drive
plt.savefig('/content/drive/MyDrive/results/sales_trend.png', dpi=300, bbox_inches='tight')
plt.show()
print("Plot saved to Google Drive")
Uploading Files Directly to Colab
For small, one-time uploads:
from google.colab import files
# Upload file(s)
uploaded = files.upload()
# Access uploaded file
import pandas as pd
for filename in uploaded.keys():
print(f"Uploaded: {filename}")
data = pd.read_csv(filename)
What happens:
- "Choose Files" button appears
- Select file(s) from your computer
- Files upload to /content/ (temporary storage)
- ⚠️ Warning: Files deleted when session ends!
Best practice: Upload to Google Drive for permanent storage, or save results to Drive before session ends.
Reading Data from URLs
CSV from URL:
import pandas as pd
# Read directly from URL
url = 'https://raw.githubusercontent.com/datasets/gdp/master/data/gdp.csv'
data = pd.read_csv(url)
data.head()
Download file from URL:
# Download file using wget
!wget https://example.com/data/sales_data.csv -O /content/drive/MyDrive/data/sales_data.csv
# Or using Python
import requests
url = 'https://example.com/data/sales_data.csv'
response = requests.get(url)
with open('/content/drive/MyDrive/data/sales_data.csv', 'wb') as f:
f.write(response.content)
print("File downloaded to Google Drive")
Connecting to Google Sheets
Read from Google Sheets:
# Method 1: Export as CSV (simple)
sheet_url = 'https://docs.google.com/spreadsheets/d/YOUR_SHEET_ID/export?format=csv&gid=0'
data = pd.read_csv(sheet_url)
# Method 2: Using gspread library (more features)
!pip install gspread
from google.colab import auth
import gspread
from google.auth import default
# Authenticate
auth.authenticate_user()
creds, _ = default()
gc = gspread.authorize(creds)
# Open sheet
sheet = gc.open('Sales Data 2024').sheet1
# Get all values
values = sheet.get_all_values()
data = pd.DataFrame(values[1:], columns=values[0])
data.head()
Write to Google Sheets:
# Update existing sheet
sheet.update([data.columns.values.tolist()] + data.values.tolist())
print("Data written to Google Sheets ")
Connecting to Databases
SQLite (local):
import sqlite3
import pandas as pd
# Connect to database
conn = sqlite3.connect('/content/drive/MyDrive/data/sales.db')
# Query data
query = "SELECT * FROM sales WHERE date >= '2024-01-01'"
data = pd.read_sql(query, conn)
# Close connection
conn.close()
data.head()
PostgreSQL/MySQL (remote):
# Install library
!pip install psycopg2-binary # PostgreSQL
# or
!pip install pymysql # MySQL
import pandas as pd
from sqlalchemy import create_engine
# Create connection (PostgreSQL example)
engine = create_engine('postgresql://username:password@host:port/database')
# Query data
query = "SELECT * FROM sales WHERE date >= '2024-01-01'"
data = pd.read_sql(query, engine)
data.head()
⚠️ Security Warning: Never hardcode credentials in notebooks!
Better approach:
from google.colab import userdata
# Store secrets in Colab's secret manager
# (Left sidebar → Key icon → Add secret)
db_password = userdata.get('DB_PASSWORD')
# Use in connection string
engine = create_engine(f'postgresql://username:{db_password}@host:port/database')
Connecting to APIs
Simple API request:
import requests
import pandas as pd
# API request
url = 'https://api.example.com/sales'
headers = {'Authorization': 'Bearer YOUR_API_KEY'}
response = requests.get(url, headers=headers)
# Parse JSON response
data = response.json()
# Convert to DataFrame
df = pd.DataFrame(data['results'])
df.head()
Pagination example:
import requests
import pandas as pd
def fetch_all_pages(base_url, headers):
all_data = []
page = 1
while True:
response = requests.get(f"{base_url}?page={page}", headers=headers)
data = response.json()
if not data['results']:
break
all_data.extend(data['results'])
page += 1
print(f"Fetched page {page-1}")
return pd.DataFrame(all_data)
# Use function
df = fetch_all_pages('https://api.example.com/sales', headers={'Authorization': 'Bearer KEY'})
print(f"Total records: {len(df)}")
Best Practices for Data Access
1. Organize your data:
Google Drive/
├── Analytics Projects/
│ ├── data/
│ │ ├── raw/ ← Original, unmodified data
│ │ ├── processed/ ← Cleaned, transformed data
│ │ └── external/ ← Data from APIs, downloads
│ ├── results/ ← Analysis outputs
│ └── notebooks/ ← Your .ipynb files
2. Document data sources:
"""
DATA SOURCES:
- sales_data.csv: Salesforce export, 2024-10-15, John Smith
- customer_data.xlsx: CRM database, 2024-10-14, automated export
- market_data.csv: Public API (https://api.example.com), 2024-10-15
"""
3. Validate data after loading:
# Load data
data = pd.read_csv('/content/drive/MyDrive/data/sales.csv')
# Validate
print(f"Shape: {data.shape}")
print(f"Columns: {data.columns.tolist()}")
print(f"Date range: {data['date'].min()} to {data['date'].max()}")
print(f"Missing values: {data.isnull().sum().sum()}")
# Check for expected columns
required_columns = ['date', 'product', 'sales', 'region']
missing_columns = set(required_columns) - set(data.columns)
if missing_columns:
print(f"⚠️ WARNING: Missing columns: {missing_columns}")
else:
print("✓ All required columns present")
4. Handle errors gracefully:
import pandas as pd
import os
file_path = '/content/drive/MyDrive/data/sales.csv'
try:
data = pd.read_csv(file_path)
print(f"✓ Successfully loaded {len(data)} rows")
except FileNotFoundError:
print(f"❌ Error: File not found at {file_path}")
print("Available files:")
print(os.listdir('/content/drive/MyDrive/data/'))
except pd.errors.EmptyDataError:
print("❌ Error: File is empty")
except Exception as e:
print(f"❌ Error loading data: {e}")
7.4 Python Essentials for Analytics
7.4.1 Basic Syntax, Data Types, and Control Structures
Python Basics
Comments:
# This is a single-line comment
"""
This is a multi-line comment
or docstring
"""
# Use comments to explain WHY, not WHAT
# BAD: x = x + 1 # increment x
# GOOD: x = x + 1 # adjust for zero-based indexing
Variables:
# Variables don't need type declarations
company_name = "Acme Corp"
revenue = 1250000
growth_rate = 0.15
is_profitable = True
# Variable naming conventions
# ✓ snake_case for variables and functions
# ✓ PascalCase for classes
# ✓ UPPER_CASE for constants
# Good names
customer_count = 150
average_order_value = 45.50
# Bad names
x = 150 # What is x?
aov = 45.50 # Unclear abbreviation
Printing output:
# Basic print
print("Hello, World!")
# Print variables
revenue = 1250000
print(revenue)
# Formatted strings (f-strings) - BEST PRACTICE
print(f"Revenue: ${revenue:,.2f}")
# Output: Revenue: $1,250,000.00
# Multiple variables
company = "Acme"
year = 2024
print(f"{company} revenue in {year}: ${revenue:,.2f}")
Data Types
Numbers:
# Integers
customer_count = 150
year = 2024
# Floats
price = 29.99
growth_rate = 0.15
# Arithmetic operations
total_revenue = customer_count * price
print(f"Total revenue: ${total_revenue:,.2f}")
# Common operations
a = 10
b = 3
print(a + b) # 13 (addition)
print(a - b) # 7 (subtraction)
print(a * b) # 30 (multiplication)
print(a / b) # 3.333... (division)
print(a // b) # 3 (integer division)
print(a % b) # 1 (modulo/remainder)
print(a ** b) # 1000 (exponentiation)
# Useful functions
import math
print(round(3.14159, 2)) # 3.14
print(abs(-10)) # 10
print(math.sqrt(16)) # 4.0
print(math.ceil(3.2)) # 4
print(math.floor(3.8)) # 3
Strings:
# String creation
company = "Acme Corp"
product = 'Widget Pro' # Single or double quotes
# String concatenation
full_name = "Acme" + " " + "Corp"
# Better: use f-strings
full_name = f"Acme Corp"
# String methods
text = " Business Analytics "
print(text.strip()) # "Business Analytics" (remove whitespace)
print(text.lower()) # " business analytics "
print(text.upper()) # " BUSINESS ANALYTICS "
print(text.replace("Business", "Data")) # " Data Analytics "
print(text.split()) # ['Business', 'Analytics']
# String formatting
revenue = 1250000
print(f"Revenue: ${revenue:,.2f}") # Revenue: $1,250,000.00
print(f"Revenue: ${revenue/1000:.1f}K") # Revenue: $1250.0K
# Multi-line strings
report = """
Q3 2024 Sales Report
====================
Total Revenue: $1.25M
Growth: 15%
"""
print(report)
Booleans:
# Boolean values
is_profitable = True
has_debt = False
# Comparison operators
revenue = 1000000
target = 900000
print(revenue > target) # True
print(revenue < target) # False
print(revenue == target) # False
print(revenue != target) # True
print(revenue >= target) # True
print(revenue <= target) # False
# Logical operators
high_revenue = revenue > 1000000
low_cost = True
print(high_revenue and low_cost) # False (both must be True)
print(high_revenue or low_cost) # True (at least one is True)
print(not high_revenue) # True (negation)
Lists:
# Create lists
products = ["Widget", "Gadget", "Doohickey"]
sales = [1000, 1500, 1200]
mixed = ["Acme", 2024, True, 3.14] # Can mix types
# Access elements (zero-indexed)
print(products[0]) # "Widget" (first element)
print(products[1]) # "Gadget"
print(products[-1]) # "Doohickey" (last element)
print(products[-2]) # "Gadget" (second to last)
# Slicing
print(products[0:2]) # ['Widget', 'Gadget'] (elements 0 and 1)
print(products[:2]) # ['Widget', 'Gadget'] (first 2)
print(products[1:]) # ['Gadget', 'Doohickey'] (from index 1 to end)
# List methods
products.append("Thingamajig") # Add to end
products.insert(0, "Whatsit") # Insert at position
products.remove("Widget") # Remove by value
last_product = products.pop() # Remove and return last
# List operations
print(len(products)) # Length
print(sum(sales)) # Sum (for numeric lists)
print(max(sales)) # Maximum
print(min(sales)) # Minimum
print(sorted(sales)) # Sorted copy
print(sales.index(1500)) # Find index of value
# List comprehensions (powerful!)
# Create new list by transforming another
doubled_sales = [x * 2 for x in sales]
high_sales = [x for x in sales if x > 1100]
print(doubled_sales) # [2000, 3000, 2400]
print(high_sales) # [1500, 1200]
Dictionaries:
# Create dictionaries (key-value pairs)
customer = {
"name": "John Smith",
"email": "john@example.com",
"age": 35,
"is_active": True
}
# Access values
print(customer["name"]) # "John Smith"
print(customer.get("email")) # "john@example.com"
print(customer.get("phone", "N/A")) # "N/A" (default if key doesn't exist)
# Modify dictionaries
customer["age"] = 36 # Update value
customer["phone"] = "555-1234" # Add new key-value pair
del customer["is_active"] # Delete key
# Dictionary methods
print(customer.keys()) # dict_keys(['name', 'email', 'age', 'phone'])
print(customer.values()) # dict_values(['John Smith', 'john@example.com', 36, '555-1234'])
print(customer.items()) # Key-value pairs
# Check if key exists
if "email" in customer:
print(f"Email: {customer['email']}")
# Business example: sales by region
regional_sales = {
"North": 250000,
"South": 180000,
"East": 310000,
"West": 220000
}
total_sales = sum(regional_sales.values())
print(f"Total sales: ${total_sales:,.0f}")
# Find top region
top_region = max(regional_sales, key=regional_sales.get)
print(f"Top region: {top_region} (${regional_sales[top_region]:,.0f})")
Control Structures
If statements:
# Basic if statement
revenue = 1250000
if revenue > 1000000:
print("Revenue exceeds $1M!")
# If-else
if revenue > 1000000:
print("High revenue")
else:
print("Revenue below $1M")
# If-elif-else
if revenue > 2000000:
status = "Excellent"
elif revenue > 1000000:
status = "Good"
elif revenue > 500000:
status = "Fair"
else:
status = "Needs Improvement"
print(f"Status: {status}")
# Multiple conditions
revenue = 1250000
profit_margin = 0.18
if revenue > 1000000 and profit_margin > 0.15:
print("Strong performance!")
# Business example: customer segmentation
customer_value = 5000
years_active = 3
if customer_value > 10000:
segment = "VIP"
elif customer_value > 5000 and years_active > 2:
segment = "Premium"
elif customer_value > 1000:
segment = "Standard"
else:
segment = "Basic"
print(f"Customer segment: {segment}")
For loops:
# Loop through list
products = ["Widget", "Gadget", "Doohickey"]
for product in products:
print(f"Product: {product}")
# Loop through range
for i in range(5): # 0, 1, 2, 3, 4
print(i)
for i in range(1, 6): # 1, 2, 3, 4, 5
print(i)
for i in range(0, 10, 2): # 0, 2, 4, 6, 8 (step by 2)
print(i)
# Loop with index
for i, product in enumerate(products):
print(f"{i+1}. {product}")
# Loop through dictionary
regional_sales = {
"North": 250000,
"South": 180000,
"East": 310000
}
for region, sales in regional_sales.items():
print(f"{region}: ${sales:,.0f}")
# Business example: calculate total sales
monthly_sales = [45000, 52000, 48000, 61000, 55000, 58000]
total = 0
for sales in monthly_sales:
total += sales
average = total / len(monthly_sales)
print(f"Total: ${total:,.0f}")
print(f"Average: ${average:,.0f}")
# Better: use built-in functions
total = sum(monthly_sales)
average = sum(monthly_sales) / len(monthly_sales)
While loops:
# Basic while loop
count = 0
while count < 5:
print(count)
count += 1
# Business example: compound growth
initial_investment = 10000
target = 20000
annual_return = 0.08
years = 0
value = initial_investment
while value < target:
value = value * (1 + annual_return)
years += 1
print(f"Year {years}: ${value:,.2f}")
print(f"Reached target in {years} years")
# Break and continue
for i in range(10):
if i == 3:
continue # Skip this iteration
if i == 7:
break # Exit loop
print(i)
Functions
# Define a function
def calculate_revenue(units_sold, price_per_unit):
"""Calculate total revenue from units and price."""
revenue = units_sold * price_per_unit
return revenue
# Call the function
total_revenue = calculate_revenue(1000, 29.99)
print(f"Revenue: ${total_revenue:,.2f}")
# Function with default parameters
def calculate_discount(price, discount_rate=0.10):
"""Calculate discounted price. Default discount is 10%."""
discounted_price = price * (1 - discount_rate)
return discounted_price
print(calculate_discount(100)) # $90.00 (10% discount)
print(calculate_discount(100, 0.20)) # $80.00 (20% discount)
# Function with multiple return values
def calculate_metrics(revenue, costs):
"""Calculate profit and margin."""
profit = revenue - costs
margin = profit / revenue if revenue > 0 else 0
return profit, margin
profit, margin = calculate_metrics(1000000, 750000)
print(f"Profit: ${profit:,.0f}")
print(f"Margin: {margin:.1%}")
# Business example: customer lifetime value
def calculate_ltv(avg_purchase, purchase_frequency, customer_lifespan):
"""
Calculate Customer Lifetime Value.
Parameters:
- avg_purchase: Average purchase amount
- purchase_frequency: Purchases per year
- customer_lifespan: Years as customer
Returns:
- Customer Lifetime Value
"""
ltv = avg_purchase * purchase_frequency * customer_lifespan
return ltv
customer_ltv = calculate_ltv(avg_purchase=50,
purchase_frequency=12,
customer_lifespan=5)
print(f"Customer LTV: ${customer_ltv:,.0f}")
7.4.2 Working with Libraries: pandas, numpy, matplotlib, seaborn
Introduction to Libraries
Libraries extend Python's capabilities. Think of them as toolboxes:
- pandas : Data manipulation (like Excel on steroids)
- numpy : Numerical computing (fast array operations)
- matplotlib : Plotting (create charts)
- seaborn : Statistical visualization (beautiful charts)
Importing libraries:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# These aliases (pd, np, plt, sns) are conventions
# Everyone uses them, so you should too
pandas: Data Manipulation
Creating DataFrames:
import pandas as pd
# From dictionary
data = {
'product': ['Widget', 'Gadget', 'Doohickey', 'Thingamajig'],
'sales': [1000, 1500, 1200, 1800],
'region': ['North', 'South', 'North', 'East']
}
df = pd.DataFrame(data)
print(df)
Output:
product sales region
0 Widget 1000 North
1 Gadget 1500 South
2 Doohickey 1200 North
3 Thingamajig 1800 East
Reading data:
# From CSV
df = pd.read_csv('sales_data.csv')
# From Excel
df = pd.read_excel('sales_data.xlsx', sheet_name='Q3')
# From URL
url = 'https://example.com/data.csv'
df = pd.read_csv(url)
Exploring data:
# First/last rows
print(df.head()) # First 5 rows
print(df.head(10)) # First 10 rows
print(df.tail()) # Last 5 rows
# Shape
print(df.shape) # (rows, columns)
# Column names
print(df.columns)
# Data types
print(df.dtypes)
# Summary statistics
print(df.describe())
# Info
print(df.info())
Selecting data:
# Select column
sales = df['sales']
print(sales)
# Select multiple columns
subset = df[['product', 'sales']]
print(subset)
# Select rows by condition
high_sales = df[df['sales'] > 1200]
print(high_sales)
# Multiple conditions
north_high_sales = df[(df['region'] == 'North') & (df['sales'] > 1000)]
# Select by position
first_row = df.iloc[0] # First row
first_cell = df.iloc[0, 0] # First row, first column
# Select by label
df_indexed = df.set_index('product')
widget_sales = df_indexed.loc['Widget', 'sales']
Adding/modifying columns:
# Add new column
df['revenue'] = df['sales'] * 29.99
# Modify existing column
df['sales'] = df['sales'] * 1.1 # 10% increase
# Conditional column
df['performance'] = df['sales'].apply(
lambda x: 'High' if x > 1200 else 'Low'
)
# Or using np.where
df['performance'] = np.where(df['sales'] > 1200, 'High', 'Low')
Grouping and aggregating:
# Group by region, sum sales
regional_sales = df.groupby('region')['sales'].sum()
print(regional_sales)
# Multiple aggregations
summary = df.groupby('region').agg({
'sales': ['sum', 'mean', 'count'],
'revenue': 'sum'
})
print(summary)
# Reset index to make it a regular DataFrame
summary_df = summary.reset_index()
Sorting:
# Sort by sales (ascending)
df_sorted = df.sort_values('sales')
# Sort descending
df_sorted = df.sort_values('sales', ascending=False)
# Sort by multiple columns
df_sorted = df.sort_values(['region', 'sales'], ascending=[True, False])
Handling missing data:
# Check for missing values
print(df.isnull().sum())
# Drop rows with missing values
df_clean = df.dropna()
# Fill missing values
df_filled = df.fillna(0) # Fill with 0
df_filled = df.fillna(df.mean()) # Fill with mean
# Fill specific column
df['sales'] = df['sales'].fillna(df['sales'].median())
Merging DataFrames:
# Two DataFrames
sales_df = pd.DataFrame({
'product': ['Widget', 'Gadget'],
'sales': [1000, 1500]
})
price_df = pd.DataFrame({
'product': ['Widget', 'Gadget'],
'price': [29.99, 39.99]
})
# Merge (like SQL JOIN)
merged = pd.merge(sales_df, price_df, on='product')
print(merged)
Business example:
import pandas as pd
# Load sales data
sales = pd.read_csv('sales_data.csv')
# Clean data
sales = sales.dropna()
sales['date'] = pd.to_datetime(sales['date'])
# Add calculated columns
sales['revenue'] = sales['units'] * sales['price']
sales['month'] = sales['date'].dt.to_period('M')
# Aggregate by month
monthly_summary = sales.groupby('month').agg({
'revenue': 'sum',
'units': 'sum',
'customer_id': 'nunique' # Count unique customers
}).reset_index()
monthly_summary.columns = ['month', 'total_revenue', 'total_units', 'unique_customers']
# Calculate average order value
monthly_summary['avg_order_value'] = (
monthly_summary['total_revenue'] / monthly_summary['unique_customers']
)
# Display top 5 months
top_months = monthly_summary.sort_values('total_revenue', ascending=False).head()
print(top_months)
numpy: Numerical Computing
Creating arrays:
import numpy as np
# From list
arr = np.array([1, 2, 3, 4, 5])
print(arr)
# 2D array
matrix = np.array([[1, 2, 3], [4, 5, 6]])
print(matrix)
# Special arrays
zeros = np.zeros(5) # [0. 0. 0. 0. 0.]
ones = np.ones(5) # [1. 1. 1. 1. 1.]
range_arr = np.arange(0, 10, 2) # [0 2 4 6 8]
linspace = np.linspace(0, 1, 5) # [0. 0.25 0.5 0.75 1. ]
Array operations:
# Element-wise operations
sales = np.array([1000, 1500, 1200, 1800])
# Add 10% to all sales
increased_sales = sales * 1.1
print(increased_sales)
# Operations between arrays
costs = np.array([600, 900, 700, 1000])
profit = sales - costs
print(profit)
# Statistical functions
print(np.mean(sales)) # Average
print(np.median(sales)) # Median
print(np.std(sales)) # Standard deviation
print(np.sum(sales)) # Sum
print(np.min(sales)) # Minimum
Exercises
These hands-on exercises will help you apply the concepts covered in this chapter. Work through them in Google Colab to build practical skills in Python-based analytics.
Exercise 1: Set Up and Explore Google Colab
Objective : Familiarize yourself with the Google Colab interface and basic operations.
Tasks :
- Go to colab.research.google.com and create a new notebook.
- Rename your notebook to "Chapter7_Exercise1_YourName".
- Create a markdown cell with a title and brief description of the notebook.
- Create a code cell and run a simple Python command: print("Hello, Analytics!") .
- Add another markdown cell explaining what you just did.
Insert a code cell that displays the current Python version:
import sys
print(f"Python version: {sys.version}")
Check which libraries are pre-installed by running:
!pip list | grep -E "pandas|numpy|matplotlib|seaborn"
- Save your notebook to Google Drive.
- Share the notebook with "view" permissions and copy the shareable link.
Deliverable : A well-organized Colab notebook with markdown documentation and the shareable link.
Exercise 2: Import a Dataset from Multiple Sources
Objective : Practice importing data from different sources into Google Colab.
Tasks :
Part A: Import from URL
#Import the Iris dataset directly from a URL:
import pandas as pd
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"
iris_df = pd.read_csv(url)
print(iris_df.head())
print(f"\nDataset shape: {iris_df.shape}")
Part B: Import from Google Drive
- Mount your Google Drive.
- Create a simple CSV file locally or download one, then upload it to your Google Drive.
- Read the file from Google Drive into a pandas DataFrame.
- Display the first few rows and basic information about the dataset.
Part C: Import from Local Upload
Use the file upload widget to upload a CSV file:
from google.colab import files
uploaded = files.upload()
# Read the uploaded file
import io
for filename in uploaded.keys():
df = pd.read_csv(io.BytesIO(uploaded[filename]))
print(f"Loaded {filename}")
print(df.head())
Part D: Documentation
- Add markdown cells before each part explaining:
- The data source
- Why you might use this method
- Any limitations or considerations
Deliverable : A Colab notebook demonstrating all three import methods with proper documentation.
Exercise 3: Calculate Descriptive Statistics with pandas
Objective : Use pandas to perform exploratory data analysis and calculate key statistics.
Tasks :
Load the Dataset
: Use the Iris dataset from Exercise 2 or load a sales dataset:
import pandas as pd
import numpy as np
# Sample sales data
np.random.seed(42)
sales_data = pd.DataFrame({
'date': pd.date_range('2023-01-01', periods=100, freq='D'),
'product': np.random.choice(['Product A', 'Product B', 'Product C'], 100),
'region': np.random.choice(['North', 'South', 'East', 'West'], 100),
'sales': np.random.randint(100, 1000, 100),
'units': np.random.randint(1, 50, 100),
'cost': np.random.randint(50, 500, 100)
})
# Calculate profit
sales_data['profit'] = sales_data['sales'] - sales_data['cost']
Basic Descriptive Statistics
:
# Overall statistics
print("=== Overall Statistics ===")
print(sales_data[['sales', 'units', 'profit']].describe())
# Mean, median, mode
print("\n=== Central Tendency ===")
print(f"Mean sales: ${sales_data['sales'].mean():.2f}")
print(f"Median sales: ${sales_data['sales'].median():.2f}")
print(f"Mode product: {sales_data['product'].mode()[0]}")
# Variability
print("\n=== Variability ===")
print(f"Sales std dev: ${sales_data['sales'].std():.2f}")
print(f"Sales variance: {sales_data['sales'].var():.2f}")
print(f"Sales range: ${sales_data['sales'].max() - sales_data['sales'].min():.2f}")
Grouped Statistics
:
# By product
print("\n=== Statistics by Product ===")
product_stats = sales_data.groupby('product').agg({
'sales': ['mean', 'sum', 'count'],
'profit': ['mean', 'sum'],
'units': 'sum'
}).round(2)
print(product_stats)
# By region
print("\n=== Statistics by Region ===")
region_stats = sales_data.groupby('region')['sales'].agg([
('total_sales', 'sum'),
('avg_sales', 'mean'),
('min_sales', 'min'),
('max_sales', 'max')
]).round(2)
print(region_stats)
Time-Based Analysis
:
# Add month column
sales_data['month'] = sales_data['date'].dt.to_period('M')
# Monthly statistics
print("\n=== Monthly Statistics ===")
monthly_stats = sales_data.groupby('month').agg({
'sales': 'sum',
'profit': 'sum',
'units': 'sum'
}).round(2)
print(monthly_stats)
Correlation Analysis
:
print("\n=== Correlation Matrix ===")
correlation = sales_data[['sales', 'units', 'cost', 'profit']].corr()
print(correlation.round(3))
Custom Insights :
- Calculate profit margin percentage
- Identify top 5 days by sales
- Find the product-region combination with highest average profit
- Calculate what percentage of days had profit > $200
Deliverable : A Colab notebook with:
- Well-organized code cells for each analysis
- Markdown cells explaining each statistic and what it tells you
- Business interpretations of the findings
Exercise 4: Create Basic Visualizations
Objective : Build fundamental visualizations using matplotlib and seaborn.
Tasks :
-
Setup
: Use the sales dataset from Exercise 3 or create your own.
Line Chart - Sales Over Time
:
import matplotlib.pyplot as plt
import seaborn as sns
# Set style
sns.set_style("whitegrid")
plt.figure(figsize=(12, 5))
# Daily sales
plt.plot(sales_data['date'], sales_data['sales'], marker='o',
linestyle='-', linewidth=1, markersize=3, alpha=0.7)
plt.title('Daily Sales Over Time', fontsize=14, fontweight='bold')
plt.xlabel('Date')
plt.ylabel('Sales ($)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Bar Chart - Sales by Product
:
plt.figure(figsize=(10, 6))
# Aggregate by product
product_sales = sales_data.groupby('product')['sales'].sum().sort_values(ascending=False)
# Create bar chart
plt.bar(product_sales.index, product_sales.values, color=['#1f77b4', '#ff7f0e', '#2ca02c'])
plt.title('Total Sales by Product', fontsize=14, fontweight='bold')
plt.xlabel('Product')
plt.ylabel('Total Sales ($)')
plt.xticks(rotation=0)
# Add value labels on bars
for i, v in enumerate(product_sales.values):
plt.text(i, v + 500, f'${v:,.0f}', ha='center', fontweight='bold')
plt.tight_layout()
plt.show()
Histogram - Distribution of Sales
:
plt.figure(figsize=(10, 6))
# Using matplotlib
plt.subplot(1, 2, 1)
plt.hist(sales_data['sales'], bins=20, color='skyblue', edgecolor='black', alpha=0.7)
plt.title('Sales Distribution (Matplotlib)', fontweight='bold')
plt.xlabel('Sales ($)')
plt.ylabel('Frequency')
# Using seaborn
plt.subplot(1, 2, 2)
sns.histplot(sales_data['sales'], bins=20, kde=True, color='coral')
plt.title('Sales Distribution (Seaborn)', fontweight='bold')
plt.xlabel('Sales ($)')
plt.tight_layout()
plt.show()
Box Plot - Sales by Region
:
plt.figure(figsize=(10, 6))
sns.boxplot(data=sales_data, x='region', y='sales', palette='Set2')
plt.title('Sales Distribution by Region', fontsize=14, fontweight='bold')
plt.xlabel('Region')
plt.ylabel('Sales ($)')
plt.tight_layout()
plt.show()
Scatter Plot - Sales vs. Units
:
plt.figure(figsize=(10, 6))
sns.scatterplot(data=sales_data, x='units', y='sales', hue='product',
style='product', s=100, alpha=0.7)
plt.title('Sales vs. Units Sold by Product', fontsize=14, fontweight='bold')
plt.xlabel('Units Sold')
plt.ylabel('Sales ($)')
plt.legend(title='Product', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
Multi-Panel Dashboard
:
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle('Sales Analytics Dashboard', fontsize=16, fontweight='bold')
# Panel 1: Sales over time
axes[0, 0].plot(sales_data['date'], sales_data['sales'], color='blue', alpha=0.6)
axes[0, 0].set_title('Sales Trend')
axes[0, 0].set_xlabel('Date')
axes[0, 0].set_ylabel('Sales ($)')
axes[0, 0].tick_params(axis='x', rotation=45)
# Panel 2: Sales by product
product_sales = sales_data.groupby('product')['sales'].sum()
axes[0, 1].bar(product_sales.index, product_sales.values, color=['#1f77b4', '#ff7f0e', '#2ca02c'])
axes[0, 1].set_title('Sales by Product')
axes[0, 1].set_xlabel('Product')
axes[0, 1].set_ylabel('Total Sales ($)')
# Panel 3: Sales distribution
axes[1, 0].hist(sales_data['sales'], bins=20, color='skyblue', edgecolor='black')
axes[1, 0].set_title('Sales Distribution')
axes[1, 0].set_xlabel('Sales ($)')
axes[1, 0].set_ylabel('Frequency')
# Panel 4: Profit by region
region_profit = sales_data.groupby('region')['profit'].mean()
axes[1, 1].barh(region_profit.index, region_profit.values, color='coral')
axes[1, 1].set_title('Average Profit by Region')
axes[1, 1].set_xlabel('Average Profit ($)')
axes[1, 1].set_ylabel('Region')
plt.tight_layout()
plt.show()
Deliverable : A Colab notebook with:
- All visualizations properly labeled and formatted
- Markdown cells explaining what each chart shows
- At least one insight or observation from each visualization
Exercise 5: Document a Notebook for a Non-Technical Manager
Objective : Practice clear communication and documentation for business stakeholders.
Scenario : You've been asked to analyze quarterly sales performance and present findings to your manager who has limited technical background.
Tasks :
Create a Professional Notebook Structure
:
# Quarterly Sales Performance Analysis
**Prepared by**: [Your Name]
**Date**: [Current Date]
**Period**: Q1 2023
## Executive Summary
[Write 2-3 sentences summarizing key findings]
## Table of Contents
1. Data Overview
2. Sales Performance Analysis
3. Product Performance
4. Regional Analysis
5. Key Insights and Recommendations
-
Data Overview Section
:
- Load and prepare the data
- Add a markdown cell explaining:
- What data you're analyzing
- The time period covered
- Key metrics included
- Display the first few rows with explanation
- Show basic dataset information (number of records, date range, etc.)
-
Analysis Sections with Business Language
:
- For each analysis, structure as:
- Business Question (markdown): "How did sales perform this quarter?"
- Analysis (code): Perform the calculation
- Visualization (code): Create a clear chart
- Interpretation (markdown): Explain what the results mean in business terms
- So What? (markdown): Why this matters and what action to take
Example Structure for One Analysis
:
## 2. Sales Performance Analysis
### Question: How did our sales trend throughout the quarter?
We want to understand if sales were growing, declining, or stable during Q1.
# Calculate daily and weekly sales trends
daily_sales = sales_data.groupby('date')['sales'].sum()
# Create visualization
plt.figure(figsize=(12, 5))
plt.plot(daily_sales.index, daily_sales.values, marker='o', linewidth=2)
plt.title('Daily Sales Trend - Q1 2023', fontsize=14, fontweight='bold')
plt.xlabel('Date')
plt.ylabel('Sales ($)')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Calculate growth
first_week_avg = daily_sales[:7].mean()
last_week_avg = daily_sales[-7:].mean()
growth_rate = ((last_week_avg - first_week_avg) / first_week_avg) * 100
print(f"First week average: ${first_week_avg:,.2f}")
print(f"Last week average: ${last_week_avg:,.2f}")
print(f"Growth rate: {growth_rate:.1f}%")
### Key Finding
Sales showed a **positive trend** throughout Q1, with the last week averaging
${last_week_avg:,.2f} compared to ${first_week_avg:,.2f} in the first week—
a growth of {growth_rate:.1f}%.
### What This Means
- Our Q1 initiatives appear to be working
- We're building momentum heading into Q2
- Current trajectory suggests we'll exceed Q2 targets if maintained
### Recommended Action
Continue current marketing and sales strategies. Consider increasing inventory
for top-performing products to meet growing demand.
Key Insights and Recommendations Section
:
## 5. Key Insights and Recommendations
### Top 3 Insights
1. **[Insight 1]**: [Explanation with supporting data]
2. **[Insight 2]**: [Explanation with supporting data]
3. **[Insight 3]**: [Explanation with supporting data]
### Recommendations
| Priority | Recommendation | Expected Impact | Timeline |
|----------|---------------|-----------------|----------|
| High | [Action 1] | [Impact] | [When] |
| Medium | [Action 2] | [Impact] | [When] |
| Low | [Action 3] | [Impact] | [When] |
### Next Steps
- [ ] [Action item 1]
- [ ] [Action item 2]
- [ ] [Action item 3]
-
Best Practices to Follow
:
- Use clear, jargon-free language
- Lead with insights, not methodology
- Use formatting (bold, italics, bullet points) for readability
- Include visual breaks between sections
- Add context before showing code or results
- Highlight key numbers and percentages
- End each section with "so what?" implications
- Include a clear call to action
Deliverable : A polished Colab notebook that:
- Tells a clear story from start to finish
- Uses professional formatting and structure
- Explains technical concepts in business terms
- Includes actionable recommendations
- Could be shared directly with a non-technical stakeholder
Additional Resources
Python Learning
- Python for Data Analysis (3rd Edition) by Wes McKinney (pandas creator)
- Python Data Science Handbook by Jake VanderPlas (free online)
- Real Python - Tutorials and articles
- DataCamp - Interactive Python courses
Google Colab Resources
- Official Colab Documentation
- Colab Tips and Tricks
- Seedbank - Collection of interactive machine learning examples
pandas Documentation
Visualization Resources
Communities and Support
- Stack Overflow - Python tag
- r/learnpython - Reddit community
- Python Discord - Active community chat
- Kaggle - Datasets, notebooks, and competitions