The book “Python: Artificial Intelligence, Big Data, and Cloud Computing”

Hello, Khabrozhiteli! Paul and Harvey Deitel offer a fresh look at Python and use a unique approach to quickly solve the problems facing today's IT professionals.

At your disposal are more than five hundred real tasks, from short snippets to 40 large case studies and examples with full implementations. IPython with Jupyter Notebooks lets you quickly learn modern Python programming idioms. Chapters 1–5 and parts of chapters 6–7 prepare you for the artificial intelligence examples in chapters 11–16. You will learn about natural language processing, sentiment analysis on Twitter, IBM Watson cognitive computing, supervised machine learning for classification and regression problems, unsupervised machine learning for clustering, image recognition with deep learning and convolutional neural networks, recurrent neural networks, big data with Hadoop, Spark and NoSQL, IoT, and more. You will work (directly or indirectly) with cloud services, including Twitter, Google Translate, IBM Watson, Microsoft Azure, OpenMapQuest, PubNub, and others.

9.12.2. Reading CSV files in the pandas library DataFrame collection

The “Introduction to data science” sections of the previous two chapters introduced the basics of working with pandas. Now we will demonstrate pandas' tools for loading CSV files into DataFrames, and then perform some basic data analysis operations.

Datasets

The practical data science examples use various free and open datasets to demonstrate the concepts of machine learning and natural language processing. A huge variety of free datasets is available on the Internet. The popular Rdatasets repository contains links to over 1,100 free CSV datasets. These datasets were originally bundled with the R programming language to simplify the study and development of statistical software, but they are not tied to R itself. They are now available on GitHub at:

https://vincentarelbundock.github.io/Rdatasets/datasets.html

This repository is so popular that there is a pydataset module designed specifically for accessing Rdatasets. For instructions on installing pydataset and accessing its datasets, see:

https://github.com/iamaziz/PyDataset

Another great source for datasets:

https://github.com/awesomedata/awesome-public-datasets

One machine learning dataset commonly used by beginners is the Titanic disaster dataset, which lists all passengers and whether they survived when the Titanic collided with an iceberg and sank on April 14–15, 1912. We will use this dataset to show how to load a dataset, view its data, and produce descriptive statistics. Other popular datasets will be explored in the data science example chapters later in this book.

Work with local CSV files

You can use the pandas library's read_csv function to load a CSV dataset into a DataFrame. The following snippet loads and displays the accounts.csv file that was created earlier in this chapter:

In [1]: import pandas as pd

In [2]: df = pd.read_csv('accounts.csv',
   ...:                  names=['account', 'name', 'balance'])
   ...:

In [3]: df
Out[3]:
   account   name  balance
0      100  Jones    24.98
1      200    Doe   345.67
2      300  White     0.00
3      400  Stone   -42.16
4      500   Rich   224.62

The names argument specifies the DataFrame's column names. Without this argument, read_csv assumes that the first row of the CSV file contains a comma-separated list of column names.
To save the DataFrame data in a CSV file, call the to_csv method of the DataFrame collection:

In [4]: df.to_csv('accounts_from_dataframe.csv', index=False)

The keyword argument index=False means that the row index names (0–4 on the left side of the DataFrame output in snippet [3]) should not be written to the file. The first line of the resulting file contains the column names:

account,name,balance
100,Jones,24.98
200,Doe,345.67
300,White,0.0
400,Stone,-42.16
500,Rich,224.62
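The save-and-load round trip above can be sketched end to end. This is a self-contained version of the chapter's example; the file name matches the snippet above, and the data is rebuilt in code rather than read from the earlier accounts.csv:

```python
import pandas as pd

# Rebuild the accounts DataFrame used earlier in the chapter
df = pd.DataFrame({'account': [100, 200, 300, 400, 500],
                   'name': ['Jones', 'Doe', 'White', 'Stone', 'Rich'],
                   'balance': [24.98, 345.67, 0.00, -42.16, 224.62]})

# Save without the 0-4 row index, then load the file back;
# read_csv treats the file's first line as column names by default
df.to_csv('accounts_from_dataframe.csv', index=False)
df2 = pd.read_csv('accounts_from_dataframe.csv')

print(df2.equals(df))  # True: the round trip preserves the data
```

Because index=False kept the row index out of the file, the reloaded DataFrame is identical to the original; without it, read_csv would see the index as an extra unnamed column.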

9.12.3. Reading the Titanic Disaster Dataset

The Titanic disaster dataset is one of the most popular machine learning datasets and is available in many formats, including CSV.

Loading the Titanic disaster dataset from a URL

If you have a URL for a dataset in CSV format, you can load it into a DataFrame with the read_csv function, for example from GitHub:

In [1]: import pandas as pd
In [2]: titanic = pd.read_csv('https://vincentarelbundock.github.io/' +
    ...:       'Rdatasets/csv/carData/TitanicSurvival.csv')
    ...:
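read_csv accepts file-like objects as well as URLs, so its behavior on this file can be sketched offline with an in-memory buffer standing in for the GitHub download. The three sample rows below are illustrative, but the column layout follows the TitanicSurvival.csv header:

```python
from io import StringIO

import pandas as pd

# A tiny stand-in for the TitanicSurvival.csv file hosted on GitHub;
# read_csv('https://...') parses the real URL the same way
csv_text = '''"","survived","sex","age","passengerClass"
"Allen, Miss. Elisabeth Walton","yes","female",29,"1st"
"Allison, Master. Hudson Trevor","yes","male",0.92,"1st"
"Allison, Miss. Helen Loraine","no","female",2,"1st"
'''
titanic = pd.read_csv(StringIO(csv_text))

print(titanic.shape)  # (3, 5)
```

Note that the empty header field in the first column becomes 'Unnamed: 0', which is exactly the oddity addressed in the “Customizing Column Names” subsection below.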

View some rows of the Titanic disaster dataset

The dataset contains over 1300 rows, each representing one passenger. According to Wikipedia, there were approximately 1317 passengers aboard, and 815 of them died. For large datasets, displaying a DataFrame shows only the first 30 rows, then the ellipsis “…”, then the last 30 rows. To save space, we'll look at just the first and last five rows using the DataFrame's head and tail methods. Both methods return five rows by default, but you can pass the number of rows to display as an argument:

In [3]: pd.set_option('precision', 2)  # format for floating-point values

[The original shows a screenshot of the titanic.head() and titanic.tail() output here.]

Please note: pandas adjusts each column's width to the widest value in the column or the column name, whichever is wider. Also note that the age value in row 1305 is NaN, which marks a missing value in the dataset.
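The behavior of head and tail can be sketched on a small DataFrame (the ages below are hypothetical, including one NaN like row 1305 of the real dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including one missing value
df = pd.DataFrame({'age': [29.0, 0.92, 2.0, 30.0, np.nan, 25.0, 39.0]})

first = df.head()    # first five rows (five is the default)
last = df.tail(3)    # last three rows; the argument overrides the default

print(len(first), len(last))  # 5 3
```

Both methods keep the original row index, so tail's output here is labeled 4, 5, 6 rather than being renumbered from zero.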

Customizing Column Names

The name of the first column in the dataset looks rather odd ('Unnamed: 0'). We can fix this by customizing the column names: let's replace 'Unnamed: 0' with 'name' and shorten 'passengerClass' to 'class':

[The original shows screenshots of the column-renaming code and the resulting head() output here.]
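One way to do the renaming, shown here as a sketch on an empty frame whose column names are assumed to match the dataset's CSV header:

```python
import pandas as pd

# Hypothetical empty frame with the dataset's original column names
df = pd.DataFrame(columns=['Unnamed: 0', 'survived', 'sex',
                           'age', 'passengerClass'])

# Assigning to columns replaces every name at once (order matters)
df.columns = ['name', 'survived', 'sex', 'age', 'class']

print(df.columns.tolist())
```

Alternatively, df.rename(columns={'Unnamed: 0': 'name', 'passengerClass': 'class'}) changes only the listed names and leaves the rest untouched, which is safer when you don't want to restate every column.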

9.12.4. Simple data analysis using the Titanic disaster dataset as an example

Now we will use pandas to perform some simple data analysis with a few descriptive statistics. When you call describe on a DataFrame that contains both numeric and non-numeric columns, statistics are computed for the numeric columns only; in this case, only for the age column:

[The original shows a screenshot of the titanic.describe() output here.]

Note the difference between count (1046) and the number of rows in the dataset (1309; when we called tail, the last row's index was 1308). Only 1046 rows (the count value) contained an age; the rest were missing and marked with NaN, as in row 1305. By default, pandas ignores missing data (NaN) when performing calculations. For the 1046 passengers with a valid age, the mean age was 29.88 years. The youngest passenger (min) was just over two months old (0.17 × 12 gives 2.04), and the oldest (max) was 80. The median age was 28 (given by the 50% quartile). The 25% quartile is the median age of the first half of the passengers (sorted by age), and the 75% quartile is the median age of the second half.
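The way describe skips missing values can be checked on a small sample (the numbers below are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value
ages = pd.Series([29.0, 0.17, 80.0, np.nan, 28.0])
stats = ages.describe()

# count covers only the four non-NaN entries;
# mean, min and max are computed over those same four values
print(int(stats['count']), stats['min'], stats['max'])  # 4 0.17 80.0
```

The NaN entry is excluded from every statistic rather than being treated as zero, which is why count can be smaller than the Series length.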

Suppose you want to compute some statistics about the passengers who survived. We can compare the survived column with the value 'yes' to obtain a new Series of True/False values, and then call describe on the result:

In [9]: (titanic.survived == 'yes').describe()
Out[9]:
count      1309
unique        2
top       False
freq        809
Name: survived, dtype: object

For non-numeric data, describe displays various characteristics of descriptive statistics:

  • count – the total number of elements in the result;
  • unique – the number of unique values (2): True (the passenger survived) or False (the passenger died);
  • top – the most frequently occurring value in the result;
  • freq – the number of occurrences of the top value.
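The same comparison trick on a small Series shows where these four values come from (the data is hypothetical: two survivors, three who died):

```python
import pandas as pd

# Hypothetical survived column: two 'yes', three 'no'
survived = pd.Series(['yes', 'no', 'no', 'yes', 'no'], name='survived')
mask = (survived == 'yes')
stats = mask.describe()

# top is the most common value (False here), freq its number of occurrences
print(stats['top'], stats['freq'])  # False 3

# Summing the mask counts the True values, i.e. the survivors
print(mask.sum())  # 2
```

Summing the boolean mask is a common follow-up: True counts as 1, so mask.sum() gives the number of survivors directly, without reading it off the describe output.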

9.12.5. Histogram of passenger ages

Visualization is a good way to get to know your data better. pandas contains many built-in visualization tools based on Matplotlib. To use them, first enable Matplotlib support in IPython:

In [10]: %matplotlib

A histogram clearly shows the distribution of numeric data over a range of values. The DataFrame's hist method automatically analyzes the data of each numeric column and builds a corresponding histogram. To view histograms for every numeric column, call hist on your DataFrame:

In [11]: histogram = titanic.hist()

The Titanic disaster dataset contains only one numeric column, so the chart shows a single histogram of the age distribution. For datasets with multiple numeric columns, hist creates a separate histogram for each numeric column.
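A minimal, self-contained sketch with synthetic ages: the Agg backend replaces the interactive %matplotlib setup so the same code also runs as a plain script, outside IPython:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; %matplotlib is IPython-only

import numpy as np
import pandas as pd

# Synthetic ages standing in for the dataset's single numeric column
rng = np.random.default_rng(0)
df = pd.DataFrame({'age': rng.normal(30, 12, size=500).clip(0, 80)})

axes = df.hist()   # returns a 2-D array of Axes, one per numeric column

print(axes.shape)  # (1, 1)
```

With one numeric column the returned array holds a single Axes object; a DataFrame with several numeric columns would yield one subplot per column, laid out on a grid.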

[The original shows the age histogram here.]

More details about the book can be found on the publisher's website.

Table of contents

Excerpt

For Khabrozhiteli: a 25% discount with the coupon code Python.

When you pay for the paper version of the book (release date: June 5), an e-book is sent to your e-mail.
