Python Seaborn #1 – A quick reference

  1. seaborn is a statistical plotting library designed to work well with pandas DataFrames
  2. It is built on top of Matplotlib but uses a simpler, often one-line, syntax
  3. Scatter plots show the relationship between two continuous features (e.g. age, salary, height, temperature)
  4. pip install seaborn # For installation run this command
  5. Using hue in scatterplot – adds a third dimension of information (color) to a 2D plot
  6. df = pd.read_csv("dm_office_sales.csv")
    plt.figure(figsize=(12,4), dpi=100)
    sns.scatterplot(x='salary', y='sales', data=df, hue='level of education', palette='Dark2', style='level of education', alpha=0.7)
  7. Credits: Jose Portilla (Head of Data Science at Pierian Training) @Udemy
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("dm_office_sales.csv")
df.head()


sns.scatterplot(x='salary', y='sales', data=df)

# Let us make the figure size better
plt.figure(figsize=(12,4), dpi=100)
sns.scatterplot(x='salary', y='sales', data=df)


# hue - colors the points based on a) a categorical column or b) a continuous column
sns.scatterplot(x='salary', y='sales', data=df, hue='level of education')


# hue - for a continuous column, seaborn applies a color gradient automatically (one color fading from light to dark)
# Using palette - choose one of the matplotlib colormaps. The full list is at:
# https://matplotlib.org/stable/gallery/color/colormap_reference.html
sns.scatterplot(x='salary', y='sales', data=df, hue='level of education', palette ='Dark2')


# Size of the points:
# the higher the sales, the bigger the point
sns.scatterplot(x='salary', y='sales', data=df, size='sales' )

# Transparency of the points using alpha: 0 = fully transparent, 1 = fully opaque (the default)
sns.scatterplot(x='salary', y='sales', data=df, alpha=0.3)


# Markers - points style - depending upon the categorical column values passed
sns.scatterplot(x='salary', y='sales', data=df, hue='level of education', style='level of education')



df = pd.read_csv("dm_office_sales.csv")
plt.figure(figsize=(12,4), dpi=100)
sns.scatterplot(x='salary', y='sales', data=df, hue='level of education', palette='Dark2', style='level of education', alpha=0.7)
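
The snippets above depend on dm_office_sales.csv. If that file is not at hand, the same hue/style/palette/alpha combination can be tried on made-up data. The sketch below is only an illustration: demo_df and its values are invented stand-ins that reuse the column names from the notes, and the Agg backend is an assumption so that it also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for dm_office_sales.csv: same column names, made-up values
rng = np.random.default_rng(42)
n = 60
demo_df = pd.DataFrame({
    "salary": rng.integers(40_000, 120_000, size=n),
    "level of education": rng.choice(["high school", "college", "masters"], size=n),
})
# Sales loosely follow salary, plus some noise
demo_df["sales"] = demo_df["salary"] * 4 + rng.normal(0, 20_000, size=n)

plt.figure(figsize=(12, 4), dpi=100)
ax = sns.scatterplot(
    x="salary", y="sales", data=demo_df,
    hue="level of education",    # color per category
    style="level of education",  # marker shape per category
    palette="Dark2", alpha=0.7,
)

# The legend lists one entry per education level
legend_labels = [t.get_text() for t in ax.get_legend().get_texts()]
print(legend_labels)
```

Passing the same column to hue and style makes each category differ in both color and marker, which helps when the plot is printed in grayscale.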

Python MatPlotLib – A quick reference

  1. matplotlib is a Python library for plotting/visualizing data
  2. Two approaches for creating plots – i) function-based methods ii) OOP-based methods using the Figure object
  3. The Figure object lets you position axes anywhere on the same canvas, which allows multiple plots in a single canvas
  4. Main documentation – https://matplotlib.org/stable/users/index
  5. Credits: Jose Portilla (Head of Data Science at Pierian Training) @Udemy
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0,10)
y = x * 2   #np.log(x)

plt.plot(x,y)
plt.xlabel('My X Axes')
plt.ylabel("My Y Axes")
plt.title("Sample Plot")
plt.xlim(0,9)
plt.ylim(0,15)  # lim restricts the visible range to the given values - data outside it (here y > 15) is cut off
plt.show() # Needed when you run this as a Python script rather than in a Jupyter notebook
# Saving the figure with the functional savefig() method (call it before plt.show(), otherwise the saved image may be blank)
#plt.savefig('D:\\DataScienceLearning\\PythonPrograms\\RK_PGMS\\myimage.png', dpi=100.0)
#help(plt.savefig)

# USING FIGURE OBJECT
figObj = plt.figure()
print(type(figObj))
# RES -> <class 'matplotlib.figure.Figure'>
# add_axes takes [LEFT, BOTTOM, WIDTH, HEIGHT], i.e. the position (x, y) and the size (w, h) of the axes
figObj.add_axes([0,0,1,1]) # these numbers are fractions of the figure size (0 to 1)
figObj2 = plt.figure()
myAxes_A = figObj2.add_axes([0,0,1,1]) # This occupies the whole canvas - it has nothing to do with the actual data points
myAxes_B = figObj2.add_axes([0.25,0.25,0.125,0.125]) # starts at 1/4th of the canvas, with 1/8th of its width and height
# Let us add multiple plots within the same figure

myAxes_A.plot(np.linspace(0,5,6), np.linspace(0,5,6)**4)
myAxes_B.plot(x,y)  #Subplot with smaller window

# PUTTING LEGENDS, linewidth, linestyle, marker

fig = plt.figure()
ax= fig.add_axes([0,0,1,1],title='MyGraph', xlabel='X Axes', ylabel='Y Axes')
myXSets = np.linspace(0,10,10)
myYSets_1 = myXSets
myYSets_2 = myXSets**2
ax.plot(myXSets, myYSets_1, label="Set1",  color='purple', marker='o', markersize=10)
ax.plot(myXSets, myYSets_2, label="Set2", marker = '+', linewidth = 10, linestyle='--', markeredgecolor='red') #Line width/style
# Set your legends to a location with loc value or number
ax.legend(loc='center left')   # best-0,upper right-1, upper left-2, lower left-3, lower right-4, right = 5...
# You can also place the legend at an (x, y) position relative to the axes; values above 1 fall outside the plot area
ax.legend(loc=(1.1,0.75))
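
The Figure-object steps above can be condensed into one short script that also saves the result. This is a sketch under a few assumptions: the axes positions are arbitrary, the output goes to a temporary directory rather than the path used in the notes, and the Agg backend is chosen so that it runs without a display.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0, 10)
y = x * 2

fig = plt.figure(figsize=(6, 4))
main_ax = fig.add_axes([0, 0, 1, 1])            # left, bottom, width, height as fractions of the figure
inset_ax = fig.add_axes([0.2, 0.55, 0.3, 0.3])  # smaller axes drawn on top of the main one

main_ax.plot(x, x ** 2, label="x squared")
main_ax.set_xlabel("x")
main_ax.set_ylabel("y")
main_ax.legend(loc="lower right")

inset_ax.plot(x, y, color="purple")
inset_ax.set_title("y = 2x")

# Hypothetical output location: a temporary directory, not a path from the notes
out_path = os.path.join(tempfile.mkdtemp(), "myimage.png")
# bbox_inches="tight" keeps tick labels that sit outside a [0, 0, 1, 1] axes
fig.savefig(out_path, dpi=100, bbox_inches="tight")
print(out_path)
```

Because the main axes fill the whole figure, its tick labels and axis labels lie outside the canvas; saving with bbox_inches="tight" is one way to keep them in the image.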

Python Pandas – 02 – DataFrame – A quick reference

A pandas DataFrame is a table with rows and columns.
It is a group of pandas Series objects that share a common index.

Operations:

  1. Create DF -> pd.DataFrame(data=ndArray/tuple/Dict => Iterable, index=array-like, columns=array-like)
  2. Grab one/many columns – myDF['colName'] or myDF[['col1', 'col2']]
  3. Grab one/many rows – myDF.iloc[0] or myDF.loc['IndexName']
  4. Insert a new column – myDF['newColumnName'] = myDF['someOldColumn'] / 100.0
  5. Insert a new row – myDF = pd.concat([myDF, newRowDF]) (DataFrame.append was deprecated in pandas 1.4 and removed in 2.0)
  6. Credits: Jose Portilla (Head of Data Science at Pierian Training) @Udemy
import numpy as np
import pandas as pd

# ########## CREATING A DATAFRAME
np.random.seed(101) # Ensures that you get the same set of random numbers every time you run it
myData = np.random.randint(0,101,(4,3))
print(myData)
""" RES ->
[[95 11 81]
 [70 63 87]
 [75  9 77]
 [40  4 63]]
"""
myDf = pd.DataFrame(data=myData)
print(myDf) # By default the row index and column index will be 0, 1, 2 ...
""" RES -> 
    0   1   2
0  95  11  81
1  70  63  87
2  75   9  77
3  40   4  63
"""
myDf = pd.DataFrame(data=myData, index=["Apple", "Berry", "Cherry", "Dates"], columns=["Jan", "Feb", "Mar"])
print(myDf)
""" RES -> 
        Jan  Feb  Mar
Apple    95   11   81
Berry    70   63   87
Cherry   75    9   77
Dates    40    4   63
"""

myDf = pd.read_csv(filepath_or_buffer='D:\\DataScienceLearning\\PythonPrograms\\RK_PGMS\\myData.csv')
print(myDf.columns)
# RES -> Index(['name', 'region', 'numberrange', 'currency', 'country'], dtype='object')
print(myDf.index)
# RES -> RangeIndex(start=0, stop=40, step=1)
print(myDf.info())
""" RES -> Talks about each column (i.e. 5) and total rows (i.e. 40)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40 entries, 0 to 39
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   name         40 non-null     object
 1   region       40 non-null     object
 2   numberrange  40 non-null     int64 
 3   currency     40 non-null     object
 4   country      40 non-null     object
dtypes: int64(1), object(4)
memory usage: 1.7+ KB
None
"""
print()
# In Jupyter-notebook try printing without "print"
# You should be getting a better interface - ( a well designed table)
print(myDf.head(2)) # By default first 5 entries
""" RES -> 
           name          region  numberrange currency         country
0   Mufutau Moon      Quảng Bình            6  $99.10   United Kingdom
1  Harrison Bass  Stockholms län            9  $34.75           Mexico
"""

print(myDf.tail(3)) # By default last 5 entries
""" RES -> 
                name          region  numberrange currency      country
37          Otto Ray           Bihar            3   $9.07      Colombia
38  Blake Fitzgerald    South Island            4  $15.79        Mexico
39    Clarke Harrell  Jönköpings län            5  $45.20   New Zealand
"""

# describe() considers only the numeric columns and reports their summary statistics
print(myDf.describe())

""" RES -> 
       numberrange
count    40.000000
mean      4.950000
std       2.630687
min       0.000000
25%       3.000000
50%       5.000000
75%       7.000000
max      10.000000
"""
# Use transpose to see the Data Frame in a better look and feel
print(myDf.describe().transpose())
""" RES -> 
            count  mean       std  min  25%  50%  75%   max
numberrange   40.0  4.95  2.630687  0.0  3.0  5.0  7.0  10.0
"""

# Accessing selected columns from DF

# Each column of the DF is basically a Pandas series
print(type(myDf['name']), "---", type(myDf['numberrange']))
# RES -> <class 'pandas.core.series.Series'> --- <class 'pandas.core.series.Series'>

# Note the column names passed as a list - myDf[[...]] - two pairs of square brackets when selecting more than one column
print(myDf[ ['name', 'numberrange'] ].head(3))
""" RES -> 
            name  numberrange
0   Mufutau Moon            6
1  Harrison Bass            9
2       Leo Cruz           10
"""

# Creating additional column to the DF
myDf['XX'] = myDf['numberrange']*2
print(myDf.head(2))
""" RES -> 
            name          region  numberrange currency         country  XX
0   Mufutau Moon      Quảng Bình            6  $99.10   United Kingdom  12
1  Harrison Bass  Stockholms län            9  $34.75           Mexico  18
"""

# Removing a column
myDf2 = myDf.drop('XX', axis=1, inplace=False) # inplace=True would modify myDf itself instead of returning a new DF
print(myDf2.head(2))
""" RES -> 
            name          region  numberrange currency         country
0   Mufutau Moon      Quảng Bình            6  $99.10   United Kingdom
1  Harrison Bass  Stockholms län            9  $34.75           Mexico
"""

# Removing a row - you need to pass the row index
myDf3 = myDf.drop([36,37], axis=0, inplace=False)
print(myDf3.tail(4))
""" RES -> 
                name          region  numberrange currency         country  XX
34       Illana Peck        Sardegna            5  $28.04           Norway  10
35       Ima Hawkins       Querétaro            9  $45.63   United Kingdom  18
38  Blake Fitzgerald    South Island            4  $15.79           Mexico   8
39    Clarke Harrell  Jönköpings län            5  $45.20      New Zealand  10   
"""

# Setting up a new index to your DF rather than default 0,1,2
myDf3.set_index('XX', inplace=True)
print(myDf3.head(2))
""" RES ->  Look XX became the index now
             name          region  numberrange currency         country
XX                                                                     
12   Mufutau Moon      Quảng Bình            6  $99.10   United Kingdom
18  Harrison Bass  Stockholms län            9  $34.75           Mexico
"""
myDf3.reset_index()  # Returns a new DF with the default 0,1,2 index (XX becomes a column again); myDf3 is unchanged unless you assign the result
print()
# Accessing particular rows - use iloc (by position) or loc (by label)
# Note: a single position such as iloc[5] returns a Series; a slice such as iloc[5:8] returns a DF
print(myDf.iloc[5:8])
""" RES -> 
             name        region  numberrange currency  country  XX
5       Zelda Gay      Connacht            4   $7.75     Italy   8
6  Nichole Oliver  Penza Oblast            5  $62.07   Nigeria  10
7    Anika Haynes      Los Ríos            6  $44.45    France  12
"""
# Let us set a new index
myDfX = myDf.set_index('name', inplace=False)
#print(myDfX.head(10))
# Conditional select (particular row by new index value)
print(myDfX.loc[['Mufutau Moon']])
""" RES -> 
                  region  numberrange currency         country  XX
name                                                              
Mufutau Moon  Quảng Bình            6  $99.10   United Kingdom  12
"""

# Selected rows with selected column
# Check carefully the closing brackets
print(myDfX.loc[ ['Mufutau Moon', 'Cora Newton'], ['country', 'currency'] ])
""" RES -> 
                     country currency
name                                 
Mufutau Moon  United Kingdom  $99.10 
Cora Newton       Costa Rica  $66.40 
"""
# Append a new row - DataFrame.append is deprecated (removed in pandas 2.0), so use pd.concat
oneRowDf = myDf.iloc[[3]]   # double brackets keep the row as a one-row DF rather than a Series
myDf.count()
#type(oneRowDf)
myNewDf = pd.concat([myDf, oneRowDf])
print(myNewDf.count())
""" -> RES 
name           41
region         41
numberrange    41
currency       41
country        41
XX             41
dtype: int64
"""
boolSeries = (myNewDf['numberrange'] %6 == 0)

print(myNewDf[boolSeries])
""" RES ->
            name         region  numberrange currency         country  XX
0   Mufutau Moon     Quảng Bình            6  $99.10   United Kingdom  12
7   Anika Haynes       Los Ríos            6  $44.45           France  12
14  Ethan Powers   South Island            6  $30.46           Brazil  12
22    Abbot Bird   South Island            6  $71.36          Ukraine  12
27    Ivana Bell  Valle d'Aosta            0  $76.95            Chile   0
29   Phoebe Goff       Arkansas            6  $40.60       Costa Rica  12
"""
#myOptions = ['Mufutau Moon', 'Ivana Bell']
#myNewDf['name'].isin(myOptions)
#myNewDf[myNewDf['name'].isin(myOptions)]
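
Appending a row and filtering by a list of values can be tried on a tiny frame without myData.csv. The sketch below uses made-up rows (people is a hypothetical stand-in with two of the original columns) and shows pd.concat for the append and Series.isin for the filter.

```python
import pandas as pd

# Made-up rows that only mimic the shape of myData.csv
people = pd.DataFrame({
    "name": ["Mufutau Moon", "Harrison Bass", "Leo Cruz", "Ivana Bell"],
    "numberrange": [6, 9, 10, 0],
})

# Append: select the row as a one-row DataFrame (double brackets) and concatenate
one_row = people.iloc[[1]]
longer = pd.concat([people, one_row], ignore_index=True)  # ignore_index renumbers 0..n-1

# Filter with isin(): a boolean Series built from one column, used as a row mask
options = ["Mufutau Moon", "Ivana Bell"]
selected = longer[longer["name"].isin(options)]

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
both = longer[(longer["numberrange"] % 6 == 0) & (longer["name"].isin(options))]

print(len(longer))
print(selected)
print(both)
```

Without ignore_index=True the appended row keeps its original label, so the result would contain a duplicate index value.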

Python Pandas – 01 – Series – A quick reference

  1. pandas is an open source, BSD-licensed library providing high-performance,
    easy-to-use data structures and data analysis tools for the Python programming language
  2. Support for an extremely powerful table system, the DataFrame, built on top of NumPy
  3. Tools for reading/writing between many formats (it can interact with HTML files and SQL databases too!)
  4. Intelligent grabbing of data based on indexing/logic/subsets etc.
  5. Handle missing data
  6. Adjust and restructure data
  7. Main Documentation Link : https://pandas.pydata.org/docs/
  8. Credits: Jose Portilla (Head of Data Science at Pierian Training) @Udemy
  9. SERIES -> a one-dimensional ndarray with axis labels
  10. A Series is a data structure in the pandas library that holds an array of information along with a named index
  11. How to install pandas -> pip install pandas
  12. In case of the error ModuleNotFoundError: No module named 'pandas', open Jupyter, go to Terminal -> Run Terminal and type pip install pandas. You should see a message such as "Successfully installed pandas-1.5.3 pytz-2022.7.1". Restart your Jupyter kernel after the installation
  13. myScoreSeries = pd.Series(data=[55, 35.0, 'SeventyFive'], index=['Sachin', 'Dhoni', 'Kohli'])
import numpy as np
import pandas as pd

# ############# PANDA SERIES using Series() constructor
#help(pd.Series)  #Upper case S
myIndex = ['Sachin', 'Dhoni', 'Kohli']
myData = [55,35,75]


mySeries = pd.Series(data=myData)
print(type(mySeries))   
# RES -> <class 'pandas.core.series.Series'>
print(mySeries)  # By default the index is 0, 1, 2 ...
""" RES -> 
0    55
1    35
2    75
dtype: int64
"""

mySeries = pd.Series(data=myData, index=myIndex)
print(mySeries)
""" RES -> 
Sachin    55
Dhoni     35
Kohli     75
dtype: int64
"""

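print(mySeries.iloc[0])  # Positional access; mySeries[0] on a labelled index is deprecated in recent pandas
# RES -> 55
print(mySeries['Sachin'] , mySeries.shape)
# RES -> 55 (3,)   ---> 3 elements in a single dimension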

# Series using Python Dictionary
myDict = {"India" : "Best", "Australia" : "Better"}
mySer = pd.Series(myDict)
print(mySer)
""" RES -> 
India          Best
Australia    Better
dtype: object
"""

print(mySer.keys())  
# RES -> Index(['India', 'Australia'], dtype='object')
print(mySer.values) # values is an attribute, not a method
# RES -> ['Best' 'Better']

ser1 = {"India" : 44, "Japan" : 40, "USA" : 65 } 
ser2 = {"India" : 40, "Pak" : 24, "Nepal" : 20}

sales_q1 = pd.Series(ser1)
sales_q2 = pd.Series(ser2)

# Look what happens with a normal list
print([1, 2] * 3)
# RES -> [1, 2, 1, 2, 1, 2]

# Broadcasting -> the above operation is different in series
print(sales_q1 * 2)
""" RES -> 
India     88
Japan     80
USA      130
dtype: int64
"""
print(sales_q1 + sales_q2)  # Keys that are not present in both series end up as NaN
""" RES -> 
India    84.0
Japan     NaN
Nepal     NaN
Pak       NaN
USA       NaN
dtype: float64
"""
# For a meaningful operation on series use the methods add, sub, mul, div - with fill_value=0.0 a missing entry is treated as 0.0
print( sales_q1.add(sales_q2, fill_value = 0.0) )
""" RES -> 
India    84.0
Japan    40.0
Nepal    20.0
Pak      24.0
USA      65.0
dtype: float64
"""

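# Traversing the series - we will have another post, as it is not that straightforward
# Note: iterating a Series directly yields its values, not its labels; items() gives (label, value) pairs
#for key, value in sales_q1.items():
#    print(key, value)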
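
As a small preview of traversing a Series, the sketch below rebuilds the two sales series from above and compares plain iteration with items(). The variable names values_only, pairs and combined are introduced here for illustration only.

```python
import pandas as pd

sales_q1 = pd.Series({"India": 44, "Japan": 40, "USA": 65})
sales_q2 = pd.Series({"India": 40, "Pak": 24, "Nepal": 20})

# Iterating a Series directly yields its values, not its labels
values_only = [int(v) for v in sales_q1]

# items() yields (label, value) pairs, similar to dict.items()
pairs = [(label, int(value)) for label, value in sales_q1.items()]

# Labels are aligned before adding; fill_value=0 stands in for a missing entry
total = sales_q1.add(sales_q2, fill_value=0)
combined = {label: float(value) for label, value in total.items()}

print(values_only)
print(pairs)
print(combined)
```

Looping is fine for inspection, but for arithmetic the vectorized methods (add, sub, mul, div) are usually the better fit, since they handle label alignment for you.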