Programming with Python
.py files contain Python code. . .
We can import entire modules or individual functions, classes or variables.
random provides functions for random numbers
os allows interaction with the operating system
csv is used for reading and writing CSV files
re is used for working with regular expressions
pip install <package_name>
. . .
Virtual environments are not that important for you right now, as they are mostly used when you work on several projects with different dependencies at once.
. . .
The name of the package comes from Numerical Python.
Large parts of NumPy are implemented in C and C++
. . .
Question: Have you heard of C and C++?
pip install numpy
Tools -> Manage Packages...
import numpy as np
. . .
import numpy as np
x = np.array([1, 2, 3, 4, 5]); type(x)
numpy.ndarray
. . .
You don’t have to use as np, but it is a common practice to do so.
ndarray
import numpy as np
array_from_list = np.array([1, 1, 1, 1])
print(array_from_list)
[1 1 1 1]
import numpy as np
array_from_tuple = np.array((2, 2, 2, 2))
print(array_from_tuple)
[2 2 2 2]
ndarray
import numpy as np
array_different_types = np.array(["s", 2, 2.0, "i"])
print(array_different_types)
['s' '2' '2.0' 'i']
. . .
But it is generally not recommended, as it can lead to performance issues. If possible, try to keep the types homogeneous.
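A short illustration of what happens with a non-homogeneous list of numbers: NumPy upcasts all elements to a common type (the values here are chosen arbitrarily).

```python
import numpy as np

# Mixing integers and a float: NumPy upcasts everything to float64
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)  # float64

# A homogeneous list keeps a compact integer dtype
homogeneous = np.array([1, 2, 3])
print(homogeneous.dtype)
```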
Improve performance by allocating memory upfront
np.zeros(shape): creates an array of zeros
np.random.rand(d0, d1, ...): array of random values
np.arange(start, stop, step): evenly spaced values within an interval
np.linspace(start, stop, num): evenly spaced numbers over an interval
. . .
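A quick sketch of these helpers in action (shapes and values chosen arbitrarily):

```python
import numpy as np

zeros = np.zeros((2, 3))        # 2x3 array filled with zeros
randoms = np.random.rand(2, 3)  # 2x3 array of random values in [0, 1)
steps = np.arange(0, 10, 2)     # values 0, 2, 4, 6, 8 (stop is exclusive)
points = np.linspace(0, 1, 5)   # 5 evenly spaced values from 0 to 1 inclusive

print(zeros.shape, randoms.shape)
print(steps)
print(points)
```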
The shape refers to the size of the array. It can have one or multiple dimensions.
(2,) or 2 creates a 1-dimensional array (vector)
(2,2) creates a 2-dimensional array (matrix)
(2,2,2) creates a 3-dimensional array (3rd-order tensor)
(2,2,2,2) creates a 4-dimensional array (4th-order tensor)
. . .
import numpy as np
x = np.array([1, 2, 3, 4, 5])
x + 1
array([2, 3, 4, 5, 6])
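Arithmetic is element-wise for whole arrays as well, not only for scalars; a small sketch:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([10, 20, 30, 40, 50])

# Operations apply element-wise without an explicit loop
print(x + y)  # [11 22 33 44 55]
print(x * 2)  # [ 2  4  6  8 10]
```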
Task: Practice working with Numpy:
# TODO: Create a 3-dimensional tensor filled with zeros
# Choose the shape of the tensor, but it should have 200 elements
# Add the number 5 to all values of the tensor
# Your code here
assert tensor.sum() == 1000
# TODO: Print the shape of the tensor using the attribute shape
# TODO: Print the dtype of the tensor using the attribute dtype
# TODO: Print the size of the tensor using the attribute size
Indexing and slicing an ndarray works as before. . .
Question: What do you expect will be printed?
import numpy as np
x = np.random.randint(0, 10, size=(3, 3))
print(x); print("---")
print(x[0:2,0:2])
[[7 7 2]
[3 3 9]
[3 9 4]]
---
[[7 7]
[3 3]]
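Beyond rectangular slices, you can also select single rows, columns, and elements; a minimal sketch using the matrix from above (hard-coded here so the output is reproducible):

```python
import numpy as np

x = np.array([[7, 7, 2],
              [3, 3, 9],
              [3, 9, 4]])

print(x[0])       # first row: [7 7 2]
print(x[:, 0])    # first column: [7 3 3]
print(x[-1, -1])  # last element of the last row: 4
```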
i: integer
b: boolean
f: float
S: string
U: unicode
. . .
string_array = np.array(["Hello", "World"]); string_array.dtype
dtype('<U5')
. . .
x = np.array([1, 2, 3, 4, 5], dtype='f'); print(x.dtype)
float32
. . .
x = np.array([1, 2, 3, 4, 5], dtype='f'); print(x.astype('i').dtype)
int32
. . .
Note how the types are specified as int32 and float32.
Question: Do you have an idea what 32 stands for?
. . .
int16 is a 16-bit integer
float32 is a 32-bit floating point number
int64 is a 64-bit integer
float128 is a 128-bit floating point number
. . .
int8 has to be in the range of -128 to 127
int16 has to be in the range of -32768 to 32767
. . .
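You don't have to memorize these ranges: np.iinfo reports the limits of an integer dtype, and np.finfo does the same for floats.

```python
import numpy as np

# np.iinfo reports the representable range of an integer dtype
print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)    # -128 127
print(np.iinfo(np.int16).min, np.iinfo(np.int16).max)  # -32768 32767

# np.finfo does the same for floating point dtypes
print(np.finfo(np.float32).max)
```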
Question: What is the size difference between int16 and int64?
concatenate to join two arrays
with axis you can specify the dimension
hstack() and vstack() are easier. . .
Question: What do you expect will be printed?
import numpy as np
ones = np.array((1, 1, 1, 1))
twos = np.array((1, 1, 1, 1)) * 2
print(np.vstack((ones, twos))); print(np.hstack((ones, twos)))
[[1 1 1 1]
[2 2 2 2]]
[1 1 1 1 2 2 2 2]
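With 2-dimensional arrays, the axis argument of concatenate chooses between the two behaviors shown above; a small sketch:

```python
import numpy as np

a = np.array([[1, 1], [1, 1]])
b = np.array([[2, 2], [2, 2]])

# axis=0 stacks the arrays on top of each other (like vstack)
stacked = np.concatenate((a, b), axis=0)
print(stacked)  # shape (4, 2)

# axis=1 places them side by side (like hstack)
side_by_side = np.concatenate((a, b), axis=1)
print(side_by_side)  # shape (2, 4)
```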
sort(): sort the array from low to high
reshape(): reshape the array into a new shape
flatten(): flatten the array into a 1D array
squeeze(): squeeze the array to remove 1D entries
transpose(): transpose the array
. . .
Try experimenting with these methods; they can make your work much easier.
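A quick sketch of these methods on a small array (values chosen arbitrarily):

```python
import numpy as np

x = np.array([3, 1, 2, 0, 5, 4])
x.sort()                    # sorts in place, from low to high
print(x)                    # [0 1 2 3 4 5]

m = x.reshape((2, 3))       # 2x3 matrix
print(m.transpose().shape)  # (3, 2)
print(m.flatten())          # back to a 1D array

y = np.zeros((1, 3, 1))
print(y.squeeze().shape)    # (3,) -- the 1-sized dimensions are removed
```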
Task: Complete the following task to practice with Numpy:
# TODO: Create a 2-dimensional matrix filled with ones of size 1000 x 1000.
# Afterward, flatten the matrix to a vector and loop over the vector.
# In each loop iteration, add a random number between 1 and 10000.
# TODO: Now, do the same with a list of the same size and fill it with random numbers.
# Then, sort the list as you have done with the Numpy vector before.
# You can use the `time` module to compare the runtime of both approaches.
import time
start = time.time()
# Your code here
end = time.time()
print(end - start) # time in seconds
9.059906005859375e-06
pip install pandas or with Thonny
import pandas as pd
. . .
You can also use a different abbreviation, but pd is the most common one.
. . .
import pandas as pd
df = pd.DataFrame({ # DataFrame is created from a dictionary
"Name": ["Tobias", "Robin", "Nils", "Nikolai"],
"Kids": [2, 1, 0, 0],
"City": ["Oststeinbek", "Oststeinbek", "Hamburg", "Lübeck"],
"Salary": [3000, 3200, 4000, 2500]}); print(df)
Name Kids City Salary
0 Tobias 2 Oststeinbek 3000
1 Robin 1 Oststeinbek 3200
2 Nils 0 Hamburg 4000
3 Nikolai 0 Lübeck 2500
df = pd.read_csv("employees.csv") # Reads the CSV file
print(df)
Name Age Department Position Salary
0 Alice 30 HR Manager 50000
1 Bob 25 IT Developer 60000
2 Charlie 28 Finance Analyst 55000
3 David 35 Marketing Executive 52000
4 Eve 32 Sales Representative 48000
5 Frank 29 IT Developer 61000
6 Grace 31 HR Assistant 45000
7 Hank 27 Finance Analyst 53000
8 Ivy 33 Marketing Manager 58000
9 Jack 26 Sales Representative 47000
10 Kara 34 IT Developer 62000
11 Leo 30 HR Manager 51000
12 Mona 28 Finance Analyst 54000
13 Nina 35 Marketing Executive 53000
14 Oscar 32 Sales Representative 49000
15 Paul 29 IT Developer 63000
16 Quinn 31 HR Assistant 46000
17 Rita 27 Finance Analyst 52000
18 Sam 33 Marketing Manager 59000
19 Tina 26 Sales Representative 48000
20 Uma 34 IT Developer 64000
21 Vince 30 HR Manager 52000
22 Walt 28 Finance Analyst 55000
23 Xena 35 Marketing Executive 54000
24 Yara 32 Sales Representative 50000
25 Zane 29 IT Developer 65000
26 Anna 31 HR Assistant 47000
27 Ben 27 Finance Analyst 53000
28 Cathy 33 Marketing Manager 60000
29 Dylan 26 Sales Representative 49000
30 Ella 34 IT Developer 66000
31 Finn 30 HR Manager 53000
32 Gina 28 Finance Analyst 56000
33 Hugo 35 Marketing Executive 55000
34 Iris 32 Sales Representative 51000
35 Jake 29 IT Developer 67000
36 Kyla 31 HR Assistant 48000
37 Liam 27 Finance Analyst 54000
38 Mia 33 Marketing Manager 61000
39 Noah 26 Sales Representative 50000
40 Olive 34 IT Developer 68000
41 Pete 30 HR Manager 54000
42 Quincy 28 Finance Analyst 57000
43 Rose 35 Marketing Executive 56000
44 Steve 32 Sales Representative 52000
45 Tara 29 IT Developer 69000
46 Umar 31 HR Assistant 49000
47 Vera 27 Finance Analyst 55000
48 Will 33 Marketing Manager 62000
49 Zara 26 Sales Representative 51000
df.head() method to display the first 5 rows
df.tail() method to display the last 5 rows
. . .
df = pd.read_csv("employees.csv")
print(df.tail())
Name Age Department Position Salary
45 Tara 29 IT Developer 69000
46 Umar 31 HR Assistant 49000
47 Vera 27 Finance Analyst 55000
48 Will 33 Marketing Manager 62000
49 Zara 26 Sales Representative 51000
df.info() to display information about a DataFrame
. . .
df = pd.read_csv("employees.csv")
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 50 non-null object
1 Age 50 non-null int64
2 Department 50 non-null object
3 Position 50 non-null object
4 Salary 50 non-null int64
dtypes: int64(2), object(3)
memory usage: 2.1+ KB
None
df.describe() to display summary statistics
df.index attribute to access the index
. . .
df = pd.read_csv("employees.csv")
print(df.describe())
Age Salary
count 50.000000 50.000000
mean 30.320000 54980.000000
std 2.958488 6175.957333
min 25.000000 45000.000000
25% 28.000000 50250.000000
50% 30.000000 54000.000000
75% 33.000000 59750.000000
max 35.000000 69000.000000
df['column_name'] to access a column
df[df['column'] > value] to filter
. . .
df = pd.read_csv("employees.csv")
df_high_salary = df[df['Salary'] >= 67000]
print(df_high_salary)
print(df_high_salary.iloc[2]["Name"]) # Access the third row and the "Name" column
print(df_high_salary.loc[40]["Name"]) # Access the row with label 40 and the "Name" column
Name Age Department Position Salary
35 Jake 29 IT Developer 67000
40 Olive 34 IT Developer 68000
45 Tara 29 IT Developer 69000
Tara
Olive
Task: Complete the following task:
# TODO: Load the employees.csv located in the git repository into a DataFrame
# First, filter the DataFrame for employees with a manager position
# Then, print the average salary of the remaining employees
# Finally, print the name of the employee with the lowest salary
. . .
Note that we can use the mean() method on the Salary column, as it is a numeric column. In addition, we can use the min() method on the Salary column to find the lowest salary.
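As a minimal sketch on a made-up DataFrame (names and salaries invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Ann", "Ben", "Cleo"],
                   "Salary": [3000, 2500, 4000]})

print(df["Salary"].mean())                    # average salary
print(df.loc[df["Salary"].idxmin(), "Name"])  # name of the lowest earner
```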
df.groupby('column').method()
. . .
df = pd.read_csv("employees.csv")
df = df.drop(columns=["Name", "Department"])
df.groupby(['Position']).mean() # Mean per position
| Position | Age | Salary |
|---|---|---|
| Analyst | 27.5 | 54400.0 |
| Assistant | 31.0 | 47000.0 |
| Developer | 30.6 | 64500.0 |
| Executive | 35.0 | 54000.0 |
| Manager | 31.5 | 56000.0 |
| Representative | 29.0 | 49500.0 |
Pass a list like ['column1', 'column2'] to group by multiple columns
. . .
df = pd.read_csv("employees.csv")
df = df.drop(columns=["Name"])
# Max per position and department
df.groupby(['Position', "Department"]).max()
| Position | Department | Age | Salary |
|---|---|---|---|
| Analyst | Finance | 28 | 57000 |
| Assistant | HR | 31 | 49000 |
| Developer | IT | 34 | 69000 |
| Executive | Marketing | 35 | 56000 |
| Manager | HR | 30 | 54000 |
| Manager | Marketing | 33 | 62000 |
| Representative | Sales | 32 | 52000 |
sum(): sum of the values
mean(): mean of the values
max(): maximum of the values
min(): minimum of the values
count(): count of the values
pd.melt() to transform from wide to long. . .
df = pd.read_csv("employees.csv").drop(columns=["Name"])
df = pd.melt(df, id_vars=['Position'])
print(df.head()); print(df.tail())
Position variable value
0 Manager Age 30
1 Developer Age 25
2 Analyst Age 28
3 Executive Age 35
4 Representative Age 32
Position variable value
145 Developer Salary 69000
146 Assistant Salary 49000
147 Analyst Salary 55000
148 Manager Salary 62000
149 Representative Salary 51000
Task: Complete the following task:
# TODO: Load the employees.csv again into a DataFrame
# First, group by the "Position" column and count the employees per position
# Then, group by the "Department" column and calculate the sum of all other columns per department
df = pd.read_csv("employees.csv")
# Your code here
. . .
Do you notice any irregularities while calculating the sum per department?
pd.concat() to concatenate along shared columns
df1 = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df2 = pd.DataFrame({"A": [7, 8, 9], "B": [10, 11, 12]})
df = pd.concat([df1, df2])
print(df)
A B
0 1 4
1 2 5
2 3 6
0 7 10
1 8 11
2 9 12
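Note that the original row indices 0-2 are kept and therefore repeat. If you prefer a fresh index, pass ignore_index=True (a small sketch with the same data):

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df2 = pd.DataFrame({"A": [7, 8, 9], "B": [10, 11, 12]})

# ignore_index=True rebuilds the index as 0..n-1 instead of repeating 0, 1, 2
df = pd.concat([df1, df2], ignore_index=True)
print(df)
```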
df.join() to join DataFrames on their index
df1 = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=['x', 'y', 'z'])
df2 = pd.DataFrame({"C": [7, 8, 9], "D": [10, 11, 12]}, index=['z', 'y', 'w'])
df = df1.join(df2)
print(df)
A B C D
x 1 4 NaN NaN
y 2 5 8.0 11.0
z 3 6 7.0 10.0
df.merge(other, on='column', how='type')
how specifies the type of merge:
inner: rows with matching keys in both DataFrames
outer: rows from both are kept, missing values are filled
left: rows from the left are kept, missing values are filled
right: rows from the right are kept, missing values are filled
df3 = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df4 = pd.DataFrame({"A": [2, 3, 4], "C": [7, 8, 9]})
df_merged = df3.merge(df4, on="A", how="outer")
print(df_merged)
A B C
0 1 4.0 NaN
1 2 5.0 7.0
2 3 6.0 8.0
3 4 NaN 9.0
pd.read_excel(file_path) function
df.to_excel(file_path) method
. . .
import pandas as pd
df = pd.read_csv("employees.csv")
df.to_excel("employees.xlsx", index=False)
. . .
Note that you likely need to install the openpyxl package to be able to write Excel files, as it handles the file format.
df = pd.read_excel("employees.xlsx")

# Writes to the Employees sheet and does not include row indices
df.to_excel("employees.xlsx", sheet_name="Employees", index=False)

# Reads from the Employees sheet
df = pd.read_excel("employees.xlsx", sheet_name="Employees")
. . .
And that’s it for today’s lecture!
You now have the basic knowledge to start working with scientific computing. Don’t worry that we haven’t worked with Excel files yet; we will do so in the upcoming tutorial.
. . .
To learn more about Python, take a look at the literature list of this course.