Machine Learning Applications with Python in the Energy Sector

Introduction

Purpose

This document is a guide to data analysis and machine learning applications with Python in the energy sector. The next section therefore covers installing Python.

Anaconda Installation

Anaconda is an integrated Python distribution for data science and similar scientific applications. In addition to the libraries commonly used in data science and artificial intelligence, it includes tools such as Jupyter Notebook and Spyder, so installing Anaconda also installs Python, Jupyter Notebook and Spyder on your system. To download Anaconda, go to https://www.anaconda.com/distribution and download the installation file (Windows, macOS or Linux) suitable for your computer. Then open the downloaded file and complete the installation by following the prompts (Next -> Next -> ... Setup).

Anaconda Navigator

Anaconda Navigator is a desktop graphical user interface included in Anaconda that lets you launch applications and manage Anaconda packages, environments and channels without using the command line. Many programs can be managed from Anaconda Navigator, but Spyder is the one we will use most, since it is where we will implement machine learning algorithms. To start Spyder, first open Anaconda Navigator:

Mac: You'll find Anaconda Navigator in Launchpad (and also in the Applications folder). Drag it to the Dock if you want to have it readily available.
Windows: You'll find Anaconda Navigator in the Start menu.
Linux: Open a terminal window and run the command anaconda-navigator.

Then click the Launch button below the Spyder icon on the Navigator Home tab.

Spyder

Spyder is an open source IDE, written in Python, for Python development. It is an interactive development environment with advanced editing, interactive testing, debugging and introspection features. Spyder also serves as a numerical computing environment thanks to its support for popular Python libraries such as IPython, NumPy, SciPy and Matplotlib.

Machine Learning

Machine Learning Definition

Machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention.

Machine Learning Techniques

Machine learning uses two types of techniques: supervised learning, which trains a model on known input and output data so that it can predict future outputs, and unsupervised learning, which finds hidden patterns or intrinsic structures in input data.

Supervised Learning

Supervised machine learning builds a model that makes predictions based on evidence in the presence of uncertainty. A supervised learning algorithm takes a known set of input data and known responses to the data (output) and trains a model to generate reasonable predictions for the response to new data. Use supervised learning if you have known data for the output you are trying to predict. Supervised learning uses classification and regression techniques to develop predictive models.

Classification techniques predict discrete responses, for example whether an email is genuine or spam, or whether a tumor is cancerous or benign. Classification models classify input data into categories. Typical applications include medical imaging, speech recognition, and credit scoring. Use classification if your data can be tagged, categorized, or separated into specific groups or classes. For example, applications for handwriting recognition use classification to recognize letters and numbers. In image processing and computer vision, unsupervised pattern recognition techniques are used for object detection and image segmentation. Common algorithms for performing classification include support vector machine (SVM), boosted and bagged decision trees, k-nearest neighbor, Naïve Bayes, discriminant analysis, logistic regression, and neural networks.

Regression techniques predict continuous responses, for example changes in temperature or fluctuations in power demand. Typical applications include electricity load forecasting and algorithmic trading. Use regression techniques if you are working with a data range or if the nature of your response is a real number, such as temperature or the time until failure for a piece of equipment. Common regression algorithms include linear model, nonlinear model, regularization, stepwise regression, boosted and bagged decision trees, neural networks, and adaptive neuro-fuzzy learning.
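As a minimal sketch of the regression idea described above, the following fits a straight line to a handful of temperature and load readings using NumPy's least-squares polynomial fit. The readings, the variable names and the choice of a straight-line model are illustrative assumptions, not measurements from a real system.

```python
import numpy as np

# Made-up training data (assumption): outdoor temperature in °C
# and the electricity load in MW observed at that temperature.
temperature = np.array([10.0, 15.0, 20.0, 25.0, 30.0])
load = np.array([520.0, 540.0, 560.0, 580.0, 600.0])

# Fit load = slope * temperature + intercept by ordinary least squares.
slope, intercept = np.polyfit(temperature, load, deg=1)

# Predict the load for a temperature that was not in the training data.
predicted_load = slope * 22.0 + intercept
print(round(slope, 2), round(intercept, 2), round(predicted_load, 2))
```

The known inputs (temperature) and known responses (load) play the role of the training data, and the fitted line is the model used to predict the response for new inputs.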

Unsupervised Learning

Unsupervised learning finds hidden patterns or intrinsic structures in data. It is used to draw inferences from datasets consisting of input data without labeled responses.

Clustering is the most common unsupervised learning technique. It is used for exploratory data analysis to find hidden patterns or groupings in data. Applications for cluster analysis include gene sequence analysis, market research, and object recognition. For example, if a cell phone company wants to optimize the locations where it builds cell phone towers, it can use machine learning to estimate the number of clusters of people relying on its towers. A phone can only talk to one tower at a time, so the team uses clustering algorithms to design the best placement of cell towers to optimize signal reception for groups, or clusters, of their customers.

Common algorithms for performing clustering include k-means and k-medoids, hierarchical clustering, Gaussian mixture models, hidden Markov models, self-organizing maps, fuzzy c-means clustering, and subtractive clustering.
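To make the clustering idea concrete, here is a small sketch of basic k-means written directly in NumPy. The point coordinates, the function name kmeans and the simple "first k points" initialization are assumptions made for illustration; production code would normally rely on a library implementation with a more robust initialization.

```python
import numpy as np

# Made-up 2-D customer locations (assumption) forming two loose groups.
points = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
                   [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])

def kmeans(data, k, n_iter=10):
    """Basic k-means: alternate between assigning points and moving centers."""
    # Start from the first k points as centers (simple deterministic choice).
    centers = data[:k].copy()
    for _ in range(n_iter):
        # Distance from every point to every center, shape (n_points, k).
        distances = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        # Assign each point to its nearest center.
        labels = distances.argmin(axis=1)
        # Move each center to the mean of the points assigned to it.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = data[labels == j].mean(axis=0)
    return labels, centers

labels, centers = kmeans(points, k=2)
print(labels)
print(centers)
```

No labels are supplied to the algorithm; the two groups emerge only from the distances between the points, which is what distinguishes this from the supervised case.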

A Regression Model Application with Python

Pandas Library

The Pandas library is one of the most preferred tools for data scientists to do data manipulation and analysis, next to matplotlib for data visualization and NumPy, the fundamental library for scientific computing in Python on which Pandas was built.

Pandas Basics

Use the following import convention:

In [ ]:
import pandas as pd

Pandas Data Structures

Series

A one-dimensional labeled array capable of holding any data type.

In [ ]:
s = pd.Series(["Solar", "Wind", "Thermal", "Hydroelectric"], index=['1', '2', '3', '4'])

DataFrame

A two-dimensional labeled data structure with columns of potentially different types.

In [ ]:
data = {'Powerplants': ['Bandırma1', 'Bandırma2', 'Kentsa'],
        'Capacity': ['936', '607', '40'],
        'Year': ["2010", "2016", "1997"]}

df = pd.DataFrame(data, columns=['Powerplants', 'Capacity', 'Year'])

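When the DataFrame is displayed, the first column (0, 1, 2) is the index, and Powerplants, Capacity and Year are the columns. Note that the Capacity and Year values are entered in quotes, so they are stored as strings rather than numbers.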
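Because the Capacity and Year values in the example are quoted, pandas stores them as text. The sketch below shows one way to obtain numeric columns with pd.to_numeric; the name df_numeric is an assumption introduced here, and the original df is left untouched so the rest of the examples behave as shown.

```python
import pandas as pd

data = {'Powerplants': ['Bandırma1', 'Bandırma2', 'Kentsa'],
        'Capacity': ['936', '607', '40'],
        'Year': ['2010', '2016', '1997']}
df = pd.DataFrame(data, columns=['Powerplants', 'Capacity', 'Year'])

# The quoted values are stored as text, not as numbers.
print(df.dtypes)

# Convert on a copy so the original df keeps its string columns.
df_numeric = df.copy()
df_numeric['Capacity'] = pd.to_numeric(df_numeric['Capacity'])
df_numeric['Year'] = pd.to_numeric(df_numeric['Year'])
print(df_numeric.dtypes)
```

With numeric columns, arithmetic and comparisons on Capacity and Year work on the values themselves rather than on their text representation.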

Help

In [ ]:
help(pd.Series.loc)

Read and Write to CSV

In [ ]:
pd.read_csv('file.csv', header=None, nrows=5)
df.to_csv('myDataFrame.csv')
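The two calls above can be combined into a round trip. The sketch below writes to an in-memory text buffer instead of a file on disk, which is an assumption made only so the example runs without creating files; with a real file you would pass a path as in the cell above.

```python
import io
import pandas as pd

df = pd.DataFrame({'Powerplants': ['Bandırma1', 'Bandırma2', 'Kentsa'],
                   'Capacity': ['936', '607', '40']})

# Write the DataFrame as CSV text into an in-memory buffer.
buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False leaves out the 0, 1, 2 row labels

# Rewind the buffer and read the CSV text back into a new DataFrame.
buffer.seek(0)
df_back = pd.read_csv(buffer)
print(df_back)
print(df_back.dtypes)
```

Note that read_csv infers column types from the text, so the Capacity column comes back as integers even though it was written from strings.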

Read multiple sheets from the same file

In [ ]:
xlsx = pd.ExcelFile('file.xls')
df = pd.read_excel(xlsx,  'Sheet1')

Read and Write to Excel

In [ ]:
pd.read_excel('file.xlsx')
df.to_excel('dir/myDataFrame.xlsx',  sheet_name='Sheet1')

Read SQL Query or Database Table into a DataFrame

In [ ]:
from sqlalchemy import create_engine
engine = create_engine('sqlite:///:memory:')
pd.read_sql('SELECT * FROM my_table;', engine)  # "table" itself is a reserved SQL keyword
pd.read_sql_table('my_table', engine)
pd.read_sql_query('SELECT * FROM my_table;', engine)

Write Records Stored in a DataFrame to a SQL Database.

In [ ]:
df.to_sql('myDf', engine)
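A complete write-then-read cycle can be sketched as follows. To keep the example free of extra dependencies it passes a standard-library sqlite3 connection instead of an SQLAlchemy engine, which pandas accepts for to_sql and read_sql_query; the table name powerplants and the numeric Capacity values are assumptions made for illustration.

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({'Powerplants': ['Bandırma1', 'Bandırma2', 'Kentsa'],
                   'Capacity': [936, 607, 40]})

# An in-memory SQLite database that disappears when the connection closes.
conn = sqlite3.connect(':memory:')

# Write the DataFrame to a table named "powerplants" (hypothetical name).
df.to_sql('powerplants', conn, index=False)

# Read back only the plants above 500, largest first.
query = 'SELECT * FROM powerplants WHERE Capacity > 500 ORDER BY Capacity DESC;'
large = pd.read_sql_query(query, conn)
conn.close()
print(large)
```

Filtering inside the SQL query means only the matching rows are loaded into the DataFrame.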

Selection

Getting

Get one element

Our series:

In [ ]:
s['2']
#Out[]: 'Wind'

Get subset of a DataFrame

Our dataframe:

In [ ]:
df[1:]
#Out[]:
'''  Powerplants Capacity  Year
1   Bandırma2      607  2016
2      Kentsa       40  1997'''

Selecting, Boolean Indexing and Setting

By Position

Select single value by row and column

Our dataframe:

In [ ]:
df.iloc[1, 1]
#Out[]: '607'
In [ ]:
df.iat[0, 0]
#Out[]: 'Bandırma1'
In [ ]:
df.iat[1, 1]
#Out[]: '607'

By Label

Select single value by row and column labels

Our dataframe:

In [ ]:
df.loc[0, 'Year']
#Out[]: '2010'
In [ ]:
df.at[0, 'Year']
#Out[]: '2010'
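The same accessors also take slices and lists, which is where position-based and label-based selection start to differ. The sketch below is an illustration under the assumption that df is the example DataFrame with its default 0, 1, 2 index; the names by_label and by_position are introduced here.

```python
import pandas as pd

df = pd.DataFrame({'Powerplants': ['Bandırma1', 'Bandırma2', 'Kentsa'],
                   'Capacity': ['936', '607', '40'],
                   'Year': ['2010', '2016', '1997']})

# Label-based slice: rows labeled 0 through 1, with the end label included.
by_label = df.loc[0:1, ['Powerplants', 'Year']]

# Position-based slice: rows at positions 0 up to, but not including, 1.
by_position = df.iloc[0:1, [0, 2]]

print(by_label)
print(by_position)
```

Here loc returns two rows while iloc returns one, because label slices include their end point and position slices do not.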

Boolean Indexing

Series s where value is not >2

In [ ]:
s = pd.Series([1, -2, -6, 4],  index=['a',  'b',  'c',  'd'])
s[~(s > 2)]
'''Out[]: 
a    1
b   -2
c   -6'''

s where value is <-2 or >2

In [ ]:
s[(s < -2) | (s > 2)]
'''Out: 
c   -6
d    4'''
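Boolean indexing works the same way on a DataFrame: a boolean mask with one entry per row selects the rows to keep. The sketch below assumes the example DataFrame and converts the string columns to numbers first so the comparisons are numeric; the thresholds 500 and 2015 are arbitrary values chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Powerplants': ['Bandırma1', 'Bandırma2', 'Kentsa'],
                   'Capacity': ['936', '607', '40'],
                   'Year': ['2010', '2016', '1997']})

# Convert the text columns to numbers so that > and >= compare values.
capacity = pd.to_numeric(df['Capacity'])
year = pd.to_numeric(df['Year'])

# Combine conditions with & (and) or | (or); each side needs parentheses.
mask = (capacity > 500) & (year >= 2015)
selected = df[mask]
print(selected)
```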

Setting

Set index c of Series s to -6

In [ ]:
s['c'] = -6
s['c']
#Out[]: -6

Dropping

Drop values from rows (axis=0)

In [ ]:
s.drop(['a',  'c'])
'''Out[]:
b   -2
d    4'''

Drop values from columns (axis=1)

In [ ]:
df.drop('Year', axis=1) 
'''Out[]:
  Powerplants Capacity
0   Bandırma1      936
1   Bandırma2      607
2      Kentsa       40'''

Sort and Rank

Sort by labels along an axis

In [ ]:
df.sort_index()
'''Out[]:
  Powerplants Capacity  Year
0   Bandırma1      936  2010
1   Bandırma2      607  2016
2      Kentsa       40  1997'''

Sort by the values along an axis

In [ ]:
df.sort_values(by='Powerplants')
'''Out[]:
  Powerplants Capacity  Year
0   Bandırma1      936  2010
1   Bandırma2      607  2016
2      Kentsa       40  1997'''
In [ ]:
df.sort_values(by='Capacity') 
'''Out[]:
  Powerplants Capacity  Year
2      Kentsa       40  1997
1   Bandırma2      607  2016
0   Bandırma1      936  2010'''
In [ ]:
df.rank()
'''Out[]:
   Powerplants  Capacity  Year
0          1.0       3.0   2.0
1          2.0       2.0   3.0
2          3.0       1.0   1.0'''
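The Capacity column in the example holds strings, so sorting and ranking compare text character by character. With the three values used here that happens to agree with numeric order, but it does not in general. The sketch below adds a made-up fourth value, '1000', purely as an assumption to show the difference.

```python
import pandas as pd

# '1000' is a made-up extra value added only to expose the difference.
capacity = pd.Series(['936', '607', '40', '1000'])

# Text comparison: '1000' sorts first because '1' comes before '4'.
as_text = capacity.sort_values().tolist()

# Numeric comparison after conversion gives the expected order.
as_numbers = pd.to_numeric(capacity).sort_values().tolist()

print(as_text)     # ['1000', '40', '607', '936']
print(as_numbers)  # [40, 607, 936, 1000]
```

Converting to a numeric type before sorting or ranking avoids this kind of surprise.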

Retrieving Series/DataFrame Information

Basic Information (rows, columns)

In [ ]:
df.shape
#Out[]: (3, 3)

Describe DataFrame columns

In [ ]:
df.columns
#Out[]:Index(['Powerplants', 'Capacity', 'Year'], dtype='object')

Info on DataFrame

In [ ]:
df.info()
'''Out[]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
Powerplants    3 non-null object
Capacity       3 non-null object
Year           3 non-null object
dtypes: object(3)
memory usage: 152.0+ bytes'''

Number of non-NA values

In [ ]:
df.count()
'''Out[]:
Powerplants    3
Capacity       3
Year           3'''

Summary

Sum of values

Because every column of df holds strings, sum() concatenates the values in each column instead of adding them.

In [ ]:
df.sum()
'''Out[]:
Powerplants    Bandırma1Bandırma2Kentsa
Capacity                       93660740
Year                       201020161997'''

Cumulative sum of values

In [ ]:
df.cumsum()

Minimum/maximum values

In [ ]:
df.min()
'''Out[]:
Powerplants    Bandırma1
Capacity              40
Year                1997'''
In [ ]:
df.max()
'''Out[]:
Powerplants    Kentsa
Capacity          936
Year             2016'''
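The summary outputs above operate on string columns, which is why the sum shows concatenated text. The sketch below repeats the same summaries on a numeric version of the Capacity values; indexing the Series by plant name and the variable names used are assumptions made for illustration.

```python
import pandas as pd

# Capacity values from the example, converted to numbers and
# labeled by plant name so results are easy to read.
capacity = pd.to_numeric(pd.Series(['936', '607', '40'],
                                   index=['Bandırma1', 'Bandırma2', 'Kentsa']))

total = capacity.sum()        # arithmetic sum: 936 + 607 + 40
running = capacity.cumsum()   # running total down the rows
smallest = capacity.min()
largest_plant = capacity.idxmax()  # label of the row with the largest value

print(total)
print(running.tolist())
print(smallest, largest_plant)
```

On numeric data, sum() returns 1583 and cumsum() returns the running totals 936, 1543 and 1583, rather than joined strings.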