A Guide to the Python Stack for Data Science Engineers

Data science is a field that studies and analyzes large amounts of data to find useful patterns, make predictions, and support fact-based decisions. It draws on the methods and tools of mathematics, statistics, and programming, which make it possible to extract valuable information from data and apply it in fields ranging from business and medicine to research-intensive science.

Why use Python?

1. Simplicity and convenience: Python has a clear, concise syntax, which makes it easy to learn and quick to write and read code in.

2. Large number of libraries: Python has many specialized libraries for working with data, such as NumPy, Pandas, SciPy, Matplotlib, Seaborn, scikit-learn, etc. These libraries provide tools for data analysis, visualization and machine learning.

3. Wide range of capabilities: With Python you can solve a wide range of problems, from data processing and cleaning, through statistical analysis and visualization, to building complex machine learning and deep learning models.

4. Community and Ecosystem: Python has a huge community of developers who are actively creating new tools, libraries and packages for data analysis. This ensures the continued development and support of relevant resources for developers and data scientists.

In this article, we will look at the Python stack for working in Data Science.

NumPy

NumPy is one of the most popular libraries for the Python programming language, providing powerful tools for working with multidimensional arrays and performing various mathematical operations. It is the core library for scientific computing in Python and is widely used in fields such as data analytics, machine learning, engineering, and scientific computing.

Arrays in NumPy can store elements of only one type.

This allows memory to be used efficiently and operations to run at high speed, since a fixed amount of memory is allocated for each element of the array.
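
For example, you can inspect an array's element type and memory footprint directly (a minimal sketch; the exact integer width is platform-dependent):

import numpy as np

ints = np.array([1, 2, 3])
floats = np.array([1, 2, 3.5])  # one float forces the whole array to float64

print(ints.dtype, ints.itemsize)    # e.g. int64, 8 bytes per element
print(floats.dtype, floats.nbytes)  # float64, 24 bytes total (3 elements × 8)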

Getting array characteristics:

import numpy as np

row = np.array([1, 2, 3])
matrix = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])

print(f'Row(1d) dimension: {row.ndim}')
print(f'Matrix(2d) dimension: {matrix.ndim}')
print(f'Row(1d) shape: {row.shape}')
print(f'Matrix(2d) shape: {matrix.shape}')

In NumPy you can perform element-wise arithmetic operations on arrays of the same shape:

def multi_numpy(array_1: np.ndarray, array_2: np.ndarray):
    # Element-wise multiplication of two arrays of the same shape
    print(array_1 * array_2)

multi_numpy(np.array([1, 2, 3]), np.array([2, 2, 2]))

NumPy allows you to generate arrays:

array_with_7 = np.full((3, 3), 7)
print(f'Generating an array filled with the value 7 (3×3):\n {array_with_7}')

array_like_matrix = np.full_like(matrix, 7)
print(f'Generating an array filled with the value 7, shaped like another array (2×5):\n {array_like_matrix}')

rand_array = np.random.randint(1, 25, (3, 5))
print(f'Generating an array of random values in the range [1, 25), 15 elements (3 rows, 5 columns):\n {rand_array}')

Plain assignment and slicing both give access to the same underlying data rather than an independent copy: a slice is a view of the source array, and assignment simply adds another reference. Changing a value through either one changes the source array as well, so when you need an independent copy, use np.copy:

original = np.array([0, 1, 2])
copy = original  # not a copy: just another reference to the same data
copy[1] = -7
print(f'Original array after modification: {original} = copied array = {copy}')

copy_right = np.copy(original)  # an independent copy
copy_right[1] = 100  # a new value, so the difference from the original is visible
print(f'Original array after modification: {original} != copied array = {copy_right}')

SciPy

SciPy is a library for the Python programming language that provides convenient tools for scientific and engineering computations. It contains many functions for linear algebra, optimization, signal processing, image processing, statistics, and much more.

Two popular submodules are special and stats.

The special submodule provides special functions such as Bessel functions, Airy functions, Hermite polynomials, and many others. These functions are used across science and engineering, including physics, astronomy, statistics, and probability theory:

from scipy import special

# Factorial of 7
print(special.factorial(7))

# Number of combinations C(10, 2)
print(special.comb(10, 2))

# Number of permutations P(10, 2)
print(special.perm(10, 2))
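
As a small illustration of the special functions mentioned above, here is a minimal sketch evaluating a Bessel function of the first kind (the input value 2.5 is chosen arbitrarily):

from scipy import special

# Bessel function of the first kind, order 0, evaluated at x = 2.5
print(special.jv(0, 2.5))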

The stats submodule provides a wide range of statistical functions, probability distributions, random number generators, and statistical tests.

You can use the binomial distribution: a fixed number of trials, each of which either succeeds or fails.

Example: a binomial distribution with 100 trials and a 33% probability of success:

from scipy import stats

binom = stats.binom(100, 0.33)

# Probability of at most 7 successes, P(X <= 7)
print(binom.cdf(7))

# Draw 17 random samples
print(binom.rvs(17))

You can use the Poisson distribution, which models the probability of a given number of independent events occurring over a fixed time period.

Example: a Poisson distribution with mean 4, sampled 10,000 times:

from scipy import stats

poisson = stats.poisson(mu=4)
samples = poisson.rvs(size=10000)  # 10,000 random draws
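
Beyond drawing samples, the frozen distribution object can answer probability questions directly; for example, the probability of observing exactly two events when the mean is 4:

from scipy import stats

poisson = stats.poisson(mu=4)
print(poisson.pmf(2))  # P(X = 2)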

The most commonly used part of scipy.stats is its continuous distributions. Their shared parameters are location (loc) and scale (scale); by default, scale is 1.0 and loc is 0.
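
For example (a minimal sketch): loc shifts a distribution and scale stretches it, which for the normal distribution correspond to the mean and the standard deviation:

from scipy import stats

# Normal distribution with mean 10 (loc) and standard deviation 2 (scale)
normal = stats.norm(loc=10, scale=2)
print(normal.mean(), normal.std())  # 10.0 2.0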

Continuous distributions include the normal distribution, the exponential distribution, the uniform distribution, and others.

Normal distribution:

from scipy import stats

normal = stats.norm()
samples = normal.rvs(size=20000)

Exponential distribution:

from scipy import stats

expon = stats.expon()
samples = expon.rvs(size=20000)

Uniform distribution:

from scipy import stats

uniform = stats.uniform()
samples = uniform.rvs(size=20000)

Pandas

Pandas is used for data processing and analysis. It provides convenient data structures and tools for working with tables, time series, and other kinds of data. Pandas is built on top of NumPy and integrates with Matplotlib for plotting.

Data storage structures in pandas come in two types: Series (a one-dimensional array) and DataFrame (a two-dimensional table):

import pandas as pd

sql = pd.Series(['MS SQL', 'PostgreSQL', 'Oracle', 'MySQL'], name='sql vendor')
product = pd.DataFrame({'PL': ['C#', 'Java', 'Go'], 'Popular Product': ['Stack Overflow', 'Jira', 'Docker']})

Pandas allows you to import data from various sources (JSON, CSV, XML, SQL, Excel):

csv = pd.read_csv('GOOG.csv', parse_dates=['Date'], index_col='Date', delimiter=',')
print(csv)

If you need to read only specific columns, list them in double square brackets:

print(csv[['High', 'Low']])

Sometimes data is only partially filled in. For this case, pandas has the fillna method, which replaces missing values with a specific value; it can be applied to several fields at once:

test = pd.read_csv('test.csv', delimiter=',')
test = test.fillna({'quantity': 0})
print(test)

Pandas allows you to aggregate data. A partial list of aggregate functions (a short example follows the list):

1. mean – average value.

2. sum – sum of values.

3. count – number of non-missing values.

4. median – median of values.

5. min/max – extremes (minimum/maximum values).

6. std – standard deviation.

7. var – variance.
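
A minimal sketch of aggregation (the table and column names are invented for illustration):

import pandas as pd

sales = pd.DataFrame({'region': ['N', 'N', 'S'], 'amount': [10, 20, 30]})

# A single aggregate
print(sales['amount'].mean())  # 20.0

# Several aggregations at once
print(sales['amount'].agg(['sum', 'count', 'median', 'min', 'max', 'std', 'var']))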

Joining tables in pandas follows the SQL model. The join types are inner, left, right, and outer (full join); the type is selected with the how parameter:

table1 = pd.DataFrame({'PL': ['C#', 'Java', 'Python'], 'Year': [2000, 1995, 1991]})
table2 = pd.DataFrame({'PL': ['C#', 'Java', 'Go'], 'Company': ['Microsoft', 'Oracle', 'Google']})

inner = pd.merge(table1, table2, on='PL')  # how='inner' is the default
print(inner, '\n')

left = pd.merge(table1, table2, on='PL', how='left')
print(left, '\n')

outer = pd.merge(table1, table2, on='PL', how='outer')
print(outer, '\n')

There are two methods for sorting data in pandas: sort_values (by column values) and sort_index (by the index).

When sorting by value, the main parameter is by, which names the fields to sort on. Additionally, the ascending parameter controls the sort direction.

The groupby method is used to group values. Its main parameter is also by: as with sorting, it names the fields to group on.
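
A short sketch of sorting and grouping together (the data is invented for illustration):

import pandas as pd

df = pd.DataFrame({'PL': ['C#', 'Java', 'C#'], 'Year': [2000, 1995, 2005]})

# Sort by the Year column, newest first
print(df.sort_values(by='Year', ascending=False))

# Group by PL and take the earliest year in each group
print(df.groupby(by='PL')['Year'].min())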

Additionally, the following methods can be used for grouping with aggregation (see the sketch after the list):

  • resample – a method for grouping time series. To use it, the index must be a datetime; the resampling interval is passed as an argument.

  • pivot_table – a method for building pivot tables, with the option of applying several aggregations at once.
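
A minimal sketch of both methods, assuming a small time-indexed table (the names and values are invented):

import pandas as pd

idx = pd.date_range('2024-01-01', periods=6, freq='D')
df = pd.DataFrame({'city': ['A', 'B', 'A', 'B', 'A', 'B'], 'sales': [1, 2, 3, 4, 5, 6]}, index=idx)

# resample: aggregate the daily series into 2-day buckets
print(df['sales'].resample('2D').sum())

# pivot_table: one row per city, two aggregations of sales
print(pd.pivot_table(df, index='city', values='sales', aggfunc=['sum', 'mean']))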

Often, analysis requires filtering the data by some criterion to get the result:

table[table['column1'] > 10]

For complex filtering, & is used for AND and | for OR. Each logical block must be wrapped in parentheses, because these operators bind more tightly than the comparisons:

table[(table['column1'] > 10) & (table['column2'] == 'Text')]

The analogue of SQL's NULL check is the isna() method, called on a field.

If you need to display only certain columns, list them in square brackets after the filtering expression.

There are two methods for outputting a specific number of rows in pandas: head(n) returns the first n rows, and tail(n) returns the last n. The sketch below illustrates these methods together with isna() and column selection.
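
A minimal sketch (the table is invented for illustration):

import pandas as pd

table = pd.DataFrame({'column1': [5, 15, None, 25], 'column2': ['a', 'b', 'c', 'd']})

# Rows where column1 is missing: the analogue of SQL's IS NULL
print(table[table['column1'].isna()])

# Filter first, then keep only the listed columns
print(table[table['column1'] > 10][['column2']])

# First two rows and last two rows
print(table.head(2))
print(table.tail(2))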

Matplotlib

A library for creating graphs and visualizing data in the Python programming language. It can be used for data visualization in scientific and engineering applications, data analysis, statistics, machine learning, and presenting results.

Popular methods:

  • plot – plotting a linear graph

  • pie – creating a pie chart

  • scatter – creating a scatter diagram

  • bar – building a bar chart

  • title – adding a title to the chart

  • xlabel/ylabel – adding labels to the x/y axis

  • axis – set axis limits and get current limits

  • show – displaying the graph

Building a simple graph:

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)
y = np.sin(x)

plt.plot(x, y)
plt.xlabel('Time')
plt.ylabel('Value')
plt.title('Sine graph example')
plt.show()

Building a pie chart:

import matplotlib.pyplot as plt

labels = ['Apples', 'Oranges', 'Peaches', 'Bananas']
sizes = [25, 30, 20, 25]

plt.pie(sizes, labels=labels, autopct='%1.1f%%')
plt.axis('equal')
plt.title('Percentage of fruits')
plt.show()

Building a 3D figure:

import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.mplot3d import Axes3D  # needed only on older Matplotlib versions to register the 3d projection

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

x = np.random.standard_normal(100)
y = np.random.standard_normal(100)
z = np.random.standard_normal(100)

ax.scatter(x, y, z)
plt.show()

Conclusion

To summarize this article: we have identified and discussed indispensable tools for a data science engineer, tools that help extract information from data in order to make informed decisions, predict trends, optimize processes, and achieve business goals. These tools are:

  • The NumPy and SciPy libraries, which allow you to solve complex mathematical problems.

  • Matplotlib, a library for data visualization.

  • Pandas, a multifunctional tool for data processing and analysis.

To study this topic further, we recommend Kennedy Behrman's book Foundational Python for Data Science.
