Lecture VII - Pandas and AI

Programming with Python

Author

Affiliation

Dr. Tobias Vlćek

Kühne Logistics University Hamburg - Fall 2024

Quick Recap of the last Lecture

What is NumPy?

NumPy is a package for scientific computing in Python
Provides multi-dimensional arrays and matrices
Much faster than Python lists for numerical operations
Operations are implemented in C and C++

. . .

Tip

NumPy arrays are stored in contiguous memory blocks, making operations very efficient.

Creating Arrays

Core data structure is the ndarray
Can create arrays from lists, tuples, or other data structures
Special functions like:
- np.zeros() for arrays of zeros
- np.random.rand() for random values
- np.arange() for evenly spaced values
- np.linspace() for linearly spaced values

Working with Arrays

Support for multi-dimensional operations
Common operations:
- Element-wise arithmetic (+, -, *, /)
- Array indexing and slicing
- Shape manipulation (reshape, flatten)
- Sorting and transposing

. . .

Tip

NumPy operations are vectorized, meaning they operate on entire arrays at once rather than element by element.

NumPy in Action I

Task: Complete the following task:

# TODO: Create an array with 10 evenly spaced numbers over the interval from 0 to 73.

import numpy as np
# YOUR CODE HERE

. . .

Note

Note, that you can always use the help() function to get more information about a function. But be sure to import the package first, otherwise you will get an error. To quit the help page, press q.

NumPy in Action II

Task: Complete the following task:

# TODO: Take the following 3x3 array and reduce it to a 1D array.

import numpy as np
array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# YOUR CODE HERE

Pandas Basics

What is Pandas?

Pandas is a data manipulation and analysis library
It provides data structures like DataFrames and Series
Tools for data cleaning, analysis, and visualization
It can also be used to work with Excel files!

How to install Pandas

In the last lecture, we have installed it with pip install pandas or with Thonny
Now, import the package import pandas as pd

. . .

Note

You can also use a different abbreviation, but pd is the most common one.

Creating DataFrames

DataFrames behave quite similar to Numpy arrays
But they have row and column labels

. . .

import pandas as pd
df = pd.DataFrame({ # DataFrame is created from a dictionary
    "Name": ["Tobias", "Robin", "Nils", "Nikolai"],
    "Kids": [2, 1, 0, 0],
    "City": ["Oststeinbek", "Oststeinbek", "Hamburg", "Lübeck"],
    "Salary": [3000, 3200, 4000, 2500]}); print(df)

      Name  Kids         City  Salary
0   Tobias     2  Oststeinbek    3000
1    Robin     1  Oststeinbek    3200
2     Nils     0      Hamburg    4000
3  Nikolai     0       Lübeck    2500

Reading from CSV Files

df = pd.read_csv("data/employees.csv") # Reads the CSV file
print(df)

       Name  Age Department        Position  Salary
0     Alice   30         HR         Manager   50000
1       Bob   25         IT       Developer   60000
2   Charlie   28    Finance         Analyst   55000
3     David   35  Marketing       Executive   52000
4       Eve   32      Sales  Representative   48000
5     Frank   29         IT       Developer   61000
6     Grace   31         HR       Assistant   45000
7      Hank   27    Finance         Analyst   53000
8       Ivy   33  Marketing         Manager   58000
9      Jack   26      Sales  Representative   47000
10     Kara   34         IT       Developer   62000
11      Leo   30         HR         Manager   51000
12     Mona   28    Finance         Analyst   54000
13     Nina   35  Marketing       Executive   53000
14    Oscar   32      Sales  Representative   49000
15     Paul   29         IT       Developer   63000
16    Quinn   31         HR       Assistant   46000
17     Rita   27    Finance         Analyst   52000
18      Sam   33  Marketing         Manager   59000
19     Tina   26      Sales  Representative   48000
20      Uma   34         IT       Developer   64000
21    Vince   30         HR         Manager   52000
22     Walt   28    Finance         Analyst   55000
23     Xena   35  Marketing       Executive   54000
24     Yara   32      Sales  Representative   50000
25     Zane   29         IT       Developer   65000
26     Anna   31         HR       Assistant   47000
27      Ben   27    Finance         Analyst   53000
28    Cathy   33  Marketing         Manager   60000
29    Dylan   26      Sales  Representative   49000
30     Ella   34         IT       Developer   66000
31     Finn   30         HR         Manager   53000
32     Gina   28    Finance         Analyst   56000
33     Hugo   35  Marketing       Executive   55000
34     Iris   32      Sales  Representative   51000
35     Jake   29         IT       Developer   67000
36     Kyla   31         HR       Assistant   48000
37     Liam   27    Finance         Analyst   54000
38      Mia   33  Marketing         Manager   61000
39     Noah   26      Sales  Representative   50000
40    Olive   34         IT       Developer   68000
41     Pete   30         HR         Manager   54000
42   Quincy   28    Finance         Analyst   57000
43     Rose   35  Marketing       Executive   56000
44    Steve   32      Sales  Representative   52000
45     Tara   29         IT       Developer   69000
46     Umar   31         HR       Assistant   49000
47     Vera   27    Finance         Analyst   55000
48     Will   33  Marketing         Manager   62000
49     Zara   26      Sales  Representative   51000

Basic Operations

Use the df.head() method to display the first 5 rows
Use the df.tail() method to display the last 5 rows

. . .

df = pd.read_csv("data/employees.csv")
print(df.tail())

    Name  Age Department        Position  Salary
45  Tara   29         IT       Developer   69000
46  Umar   31         HR       Assistant   49000
47  Vera   27    Finance         Analyst   55000
48  Will   33  Marketing         Manager   62000
49  Zara   26      Sales  Representative   51000

Information about the DataFrame

Use df.info() to display information about a DataFrame

. . .

df = pd.read_csv("data/employees.csv")
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Name        50 non-null     object
 1   Age         50 non-null     int64 
 2   Department  50 non-null     object
 3   Position    50 non-null     object
 4   Salary      50 non-null     int64 
dtypes: int64(2), object(3)
memory usage: 2.1+ KB
None

Statistics about a DataFrame

Use df.describe() to display summary statistics
Use the df.index attribute to access the index

. . .

df = pd.read_csv("data/employees.csv")
print(df.describe())

             Age        Salary
count  50.000000     50.000000
mean   30.320000  54980.000000
std     2.958488   6175.957333
min    25.000000  45000.000000
25%    28.000000  50250.000000
50%    30.000000  54000.000000
75%    33.000000  59750.000000
max    35.000000  69000.000000

Filtering DataFrames

Use df['column_name'] to access a column
Use the df[df['column'] > value] method to filter

. . .

df = pd.read_csv("data/employees.csv")
df_high_salary = df[df['Salary'] >= 67000]
print(df_high_salary)
print(df_high_salary.iloc[2]["Name"]) #Access the third row and the "Name" column
print(df_high_salary.loc[40]["Name"]) #Access the label 40 and the "Name" column

     Name  Age Department   Position  Salary
35   Jake   29         IT  Developer   67000
40  Olive   34         IT  Developer   68000
45   Tara   29         IT  Developer   69000
Tara
Olive

Filtering in Action

Task: Complete the following task:

# TODO: Load the employees.csv located in the git repository into a DataFrame
# First, filter the DataFrame for employees with a manager position
# Then, print the average salary of the remaining employees
# Finally, print the name of the employee with the lowest salary

. . .

Note

Note, that we can use the mean() method on the Salary column, as it is a numeric column. In addition, we can use the min() method on the Salary column to find the lowest salary.

Grouping DataFrames

Grouping

Grouping is a powerful feature of Pandas
Groups data by one or more columns
And then perform operations
Syntax is df.groupby('column').method()

. . .

df = pd.read_csv("data/employees.csv")
df.groupby(['Position']).sum() # Sum per position

	Name	Age	Department	Salary
Position
Analyst	CharlieHankMonaRitaWaltBenGinaLiamQuincyVera	275	FinanceFinanceFinanceFinanceFinanceFinanceFina...	544000
Assistant	GraceQuinnAnnaKylaUmar	155	HRHRHRHRHR	235000
Developer	BobFrankKaraPaulUmaZaneEllaJakeOliveTara	306	ITITITITITITITITITIT	645000
Executive	DavidNinaXenaHugoRose	175	MarketingMarketingMarketingMarketingMarketing	270000
Manager	AliceIvyLeoSamVinceCathyFinnMiaPeteWill	315	HRMarketingHRMarketingHRMarketingHRMarketingHR...	560000
Representative	EveJackOscarTinaYaraDylanIrisNoahSteveZara	290	SalesSalesSalesSalesSalesSalesSalesSalesSalesS...	495000

Grouping Numeric Columns

To prevent errors, we can select numeric columns first
Afterwards, perform the operation on the selected columns
Helps to avoid errors when grouping by non-numeric columns
Or drop columns by df.drop(columns=["column"])

. . .

df = pd.read_csv("data/employees.csv")
numeric_cols = df.select_dtypes(include=['number']).columns
print(df.groupby("Position")[numeric_cols].sum())

                Age  Salary
Position                   
Analyst         275  544000
Assistant       155  235000
Developer       306  645000
Executive       175  270000
Manager         315  560000
Representative  290  495000

Grouping by Multiple Columns

Group by multiple columns ['column1', 'column2']
You can use lists or tuples to specify multiple columns

. . .

df = pd.read_csv("data/employees.csv")
df = df.drop(columns=["Name"])
# Max per position and department
df.groupby(['Position', "Department"]).max()

		Age	Salary
Position	Department
Analyst	Finance	28	57000
Assistant	HR	31	49000
Developer	IT	34	69000
Executive	Marketing	35	56000
Manager	HR	30	54000
Manager	Marketing	33	62000
Representative	Sales	32	52000

Grouping with Aggregations

We can use different aggregation functions:
- sum(): sum of the values
- mean(): mean of the values
- max(): maximum of the values
- min(): minimum of the values
- count(): count of the values

Pandas in Action

Task: Complete the following task:

# TODO: Load the employees.csv again into a DataFrame
# First, group by the "Position" column and count the employees per position
# Then, group by the "Department" column and calculate the mean of all other columns per department
df = pd.read_csv("data/employees.csv")
# Your code here

Combining DataFrames

Concatenating DataFrames

pd.concat() to concatenate along shared columns

df1 = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df2 = pd.DataFrame({"A": [7, 8, 9], "B": [10, 11, 12]})
df = pd.concat([df1, df2])
print(df)

Joining DataFrames

Use pd.join() to join DataFrames along columns
Joining is done on the index by default!

df1 = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=['x', 'y', 'z'])
df2 = pd.DataFrame({"C": [7, 8, 9], "D": [10, 11, 12]}, index=['z', 'y', 'w'])
df = df1.join(df2)
print(df)

   A  B    C     D
x  1  4  NaN   NaN
y  2  5  8.0  11.0
z  3  6  7.0  10.0

Merging DataFrames on Columns

pd.merge(df_name, on='column', how='type')
merge DataFrames along shared columns
how specifies the type of merge
- inner: rows with matching keys in both DataFrames
- outer: rows from both are kept, missing values are filled
- left: rows from the left are kept, missing values are filled
- right: rows from right are kept, missing values are filled

Outer Merge

df3 = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df4 = pd.DataFrame({"A": [2, 3, 4], "C": [7, 8, 9]})
df_merged = df3.merge(df4, on="A", how="outer")
print(df_merged)

   A    B    C
0  1  4.0  NaN
1  2  5.0  7.0
2  3  6.0  8.0
3  4  NaN  9.0

Merging in Action

Task: Complete the following task:

df1 = pd.DataFrame({
    "Name": ["John", "Alice", "Bob", "Carol"],
    "Department": ["Sales", "IT", "HR", "Sales"],
    "Salary": [50000, 60000, 55000, 52000]})
df2 = pd.DataFrame({
    "Name": ["Alice", "Bob", "Dave", "Eve"],
    "Position": ["Developer", "Manager", "Analyst", "Developer"],
    "Years": [5, 8, 3, 4]})

# TODO: Merge the two DataFrames on the "Name" column
# Try different types of merges (inner, outer, left, right)
# Observe and describe the differences in the results

Working with Excel Files

Reading Excel Files

Read using the pd.read_excel(file_path) function
Write using the df.to_excel(file_path) method

. . .

import pandas as pd
df = pd.read_csv("data/employees.csv")
df.to_excel("data/employees.xlsx", index=False)

. . .

Note

Note, that you likely need to install the openpyxl package to be able to write Excel files, as it handles the file format.

Advanced Excel file handling

We can also specify the sheet name when reading and writing

# Writes to the Employees sheet and does not include row indices
df.to_excel("data/employees.xlsx", sheet_name="Employees", index=False)

. . .

# Reads from the Employees sheet
df = pd.read_excel("data/employees.xlsx", sheet_name="Employees")
print(df.head())

      Name  Age Department        Position  Salary
0    Alice   30         HR         Manager   50000
1      Bob   25         IT       Developer   60000
2  Charlie   28    Finance         Analyst   55000
3    David   35  Marketing       Executive   52000
4      Eve   32      Sales  Representative   48000

Excel in Action

Task: Complete the following task:

# TODO: Load the temperatures.xlsx file into a DataFrame
# Look at the first few rows of the DataFrame
# Then, print the average temperature per city

Melting DataFrames

Melting

Sometimes, you want to transform a DataFrame
Instead of wide format, you want long format
This is useful for certain types of visualizations
And when working with time series data

. . .

Question: Anybody ever heard of the terms?

Wide Format

For example, the following DataFrame is in wide format:

         Date  Hamburg  Los_Angeles  Tokyo
0  2024-03-01     12.0         18.2   14.8
1  2024-03-02      9.8         23.0   17.6
2  2024-03-03      7.6         20.3   16.0
3  2024-03-04     10.1         21.1   13.4
4  2024-03-05     11.2         18.5   15.1
..        ...      ...          ...    ...
87 2024-05-27     12.4         24.5   24.9
88 2024-05-28     17.8         20.6   22.3
89 2024-05-29     16.2         20.4   20.2
90 2024-05-30     15.5         20.7   21.7
91 2024-05-31     12.6         22.0   22.9

[92 rows x 4 columns]

Long Format

The melting process transforms it into the following long format:

          Date     City  Temperature
0   2024-03-01  Hamburg         12.0
1   2024-03-02  Hamburg          9.8
2   2024-03-03  Hamburg          7.6
3   2024-03-04  Hamburg         10.1
4   2024-03-05  Hamburg         11.2
..         ...      ...          ...
271 2024-05-27    Tokyo         24.9
272 2024-05-28    Tokyo         22.3
273 2024-05-29    Tokyo         20.2
274 2024-05-30    Tokyo         21.7
275 2024-05-31    Tokyo         22.9

[276 rows x 3 columns]

How to melt DataFrames

Use pd.melt() to transform from wide to long
Parameters:
- id_vars: columns to keep
- var_name: name of the new column that will contain the names of the original columns
- value_name: name of the new column that will contain the values of the original columns

. . .

df = pd.read_csv("data/employees.csv")
df = pd.melt(df, id_vars=['Position'], var_name='Variables', value_name='Values')
print(df)

           Position Variables   Values
0           Manager      Name    Alice
1         Developer      Name      Bob
2           Analyst      Name  Charlie
3         Executive      Name    David
4    Representative      Name      Eve
..              ...       ...      ...
195       Developer    Salary    69000
196       Assistant    Salary    49000
197         Analyst    Salary    55000
198         Manager    Salary    62000
199  Representative    Salary    51000

[200 rows x 3 columns]

Melting in Action

Task: Complete the following task:

# TODO: Load and transform the temperatures.xlsx file by melting it
# Expected output format:
#         Date        City  Temperature
# 0  2024-03-01    Hamburg         7.2
# 1  2024-03-01 Los_Angeles       18.5
# 2  2024-03-01      Tokyo        12.3
# Then, print the maximum temperature per city by grouping by the "City" column

Programming with AI

Using AI to generate code

Coding by hand is not the only way to generate code
Most likely, a lot of you have already used ChatGPT

. . .

How do

Large Language

Models work?

Large Language Models (LLMs)

Think of them like advanced pattern recognition systems
They have “read” massive amounts of text
Books, websites, articles, code, and more
Text is broken into tokens, parts of words or punctuation
Based on patterns, they can generate new text

Training LLMs

Imagine learning a language by reading millions of books
Learns patterns in how words and ideas connect via tokens
Interconnected nodes with weights representing patterns
During training, these weights are adjusted
Once trained, applying them takes much less ressources

Pattern Recognition

Not like a search engine!
When asked, it looks for relevant patterns it learned
Like having a huge library in its “memory” to draw from
It can find patterns between concepts and your question
Knows only limited text at once (context window)

Probability based responses

After each written token, it predicts “what should come next?”
Like a advanced version of the word prediction on your phone
Chooses the most likely next token based on training
But can’t actually “think” or “understand” like humans

Limitations

No true understanding of cause and effect
Sometimes makes mistakes or “hallucinates”
Mostly only knows what it was trained on
Can reflect biases present in training data
No emotional understanding (but can simulate responses!)

Impact on Jobs

Question: What do you think about their impact on jobs?
Question: What are the implications for us?
Question: Can we use them to our advantage?

(Current) Choices for Programmers

Github Copilot: Integrated into VS Code by Microsoft
Cursor: Fork of VS Code with AI assistance built in
Aider: Chat interface for AI to write code in the terminal

. . .

Tip

Currently, Cursor is my favorite one. But this might change in the future, as there is a lot of competition in this space.

Installing Cursor

Go to Cursor
Download and install Cursor
You will need to create an account
Some free usage per month, after that you need to pay
For us, the free plan should be more than enough

Using Cursor

Open the folder with your tutorial files
Create a new .py file
Press Ctrl + L to open the chat

Asking for help

Task: Paste the following prompt in to the chat:

Can you please write me a small number guessing game in python? It should work for one player in the terminal. The player should guess a number between 1-10 and get hints about whether his guess was too large or too small. After 3 tries, end the game if he didn’t succeed with a nice message.

. . .

Copy the generated code and paste it into your file.

Literature

Interesting Books

Downey, A. B. (2024). Think Python: How to think like a computer scientist (Third edition). O’Reilly. Link to free online version
Elter, S. (2021). Schrödinger programmiert Python: Das etwas andere Fachbuch (1. Auflage). Rheinwerk Verlag.

. . .

For more interesting literature to learn more about Python, take a look at the literature list of this course.

Quick Recap of the last Lecture

What is NumPy?

Creating Arrays

Working with Arrays

NumPy in Action I

NumPy in Action II

Pandas Basics

What is Pandas?

How to install Pandas

Creating DataFrames

Reading from CSV Files

Basic Operations

Information about the DataFrame

Statistics about a DataFrame

Filtering DataFrames

Filtering in Action

Grouping DataFrames

Grouping

Grouping Numeric Columns

Grouping by Multiple Columns

Grouping with Aggregations

Pandas in Action

Combining DataFrames

Concatenating DataFrames

Joining DataFrames

Merging DataFrames on Columns

Outer Merge

Merging in Action

Working with Excel Files

Reading Excel Files

Advanced Excel file handling

Excel in Action

Melting DataFrames

Melting

Wide Format

Long Format

How to melt DataFrames

Melting in Action

Programming with AI

Using AI to generate code

Large Language Models (LLMs)

Training LLMs

Pattern Recognition

Probability based responses

Limitations

Impact on Jobs

(Current) Choices for Programmers

Installing Cursor

Using Cursor

Asking for help

More on Cursor

Literature

Interesting Books