Integrating pandas with spreadsheets: three ways

Opening and Saving Excel Files in Pandas

Pandas is ideal for working with tabular data – it can open Excel files and perform transformations inside. Moreover, the dataframe can also be saved to an Excel file. This makes it convenient to integrate pandas with the office suite: the results can be immediately beautifully presented for a presentation at a conference or handed over to colleagues who only do “wet” biology and do not work with data science tools.

Let's see how this is done. If we imported pandas under the name pd:

import pandas as pd

then you can read a data frame from an Excel file as follows:

df = pd.read_excel(”your_excel_file.xlsx”)

This line will create a data frame df and transfer the data from the file into it.

Since Python does not have an explicit variable declaration procedure and typing is dynamic, remember that the pd.read_excel method always returns a data frame. And any variable that receives the result of its work becomes a data frame itself.

By default, the pd.read_excel method reads data starting from cell A1, from left to right, top to bottom. If they are arranged differently, you need to “point” it to the table with the data:

skip a specified number of rows before the table with the skiprows parameter;
capture the desired columns with the usecols parameter.

Let's assume that the imported table takes up cells in the Excel file C2:F6. Then pandas has to skip one row and use columns to read it. C:F. Let's also assume that the file has several sheets, and this table is located on a worksheet called “Sales” — it can be specified in the sheet_name parameter. The resulting command will be:

df = pd.read_excel(”your_excel_file.xlsx”, sheet_name=”Sales”, skiprows=1, usecols=”C:F”)

Let's say the calculations are complete, and it would be a good idea to save the resulting table to an Excel file. If your data frame is called df, then the df.to_excel method will allow you to do this quickly and easily:

df.to_excel(”your_resulting_file.xlsx”)

Pandas has a wide range of functionality for working with Excel files – it can read from several worksheets at once, write to several sheets, perform a wide range of calculations and build (using the Matplotlib library) graphs that Excel cannot build. We will talk about some of these functions in the following articles. And you can study the integration of pandas with Excel in more detail in the book:

Zumshtein F. Python for Excel: Translated from English. — SPb.: BHV-Petersburg, 2023 – 336 p., ill.

Integrating Google Sheets and pandas into Google Colab

There is good news for those working with Python in Google Colab interactive notebooks. On July 22 this year, Google announced the ability to work with Google Sheets directly in the Google Colab interactive notebook – and then import them as pandas data frames. To do this, you need to import the sheets module from the google.colab library:

from google.colab import sheets

Next, you need to create an InteractiveSheet() object from this module. It is created by simply mentioning (remember that there is no explicit declaration of variables?):

sh = sheets.InteractiveSheet()

When running for the first time, you may see a prompt asking for access: Colaboratory Runtimes needs access to your Google Drive. Authorize it and continue running.

If Colaboratory Runtimes already has all the permissions, then the next cell in the notebook will contain a blank interactive Google Sheets worksheet. This is InteractiveSheet(). When you log in to your Google Sheets account, you will see a copy of this sheet there. This is what the permission was for: Colab needs to connect to the Google Sheets account and perform actions there on behalf of the user.

The interactive table can be filled with data manually or by copying and pasting from another table, using standard key combinations.

To turn this table into a df data frame, write the following command in the following code cell:

df = sh.as_df()

Remember that the InteractiveSheet object's as_df function also always returns a data frame.

You can check that the import was done correctly by simply typing the name of the data frame in the next code cell:

df

and launching it for execution. If the program did not give an error, and your table appeared in front of you, but beautifully formatted in HTML format – then everything went well!

Hardcore Only: Using CSV Files

But what if mastering the functionality of working with spreadsheets is difficult and scary? For starters, you can work with CSV files – they are most often used in data analysis and bioinformatics. The pd.read_csv and df.to_csv methods are similar to similar methods for Excel, but have fewer parameters (because the format itself is simpler).

Opening a CSV file:

df = pd.read_csv(”my_csv_file.csv”)

Saving a data frame to a CSV file:

df.to_csv(”my_resulting_file.csv”)

Frankly, in bioinformatics you can work for years in pandas with only the CSV format. Methods for working with Excel files are still used relatively rarely – but with the spread of Python and data science methods, they will become more and more relevant.

CSV can also be a lifesaver when using “alternative” spreadsheet software that does not fully support XSLX files, such as OpenOffice/LibreOffice (these are the default on Linux, the operating system of choice for bioinformaticians).

The ODS file format does not open in pandas without additional libraries (we'll talk about them someday!), and saving to it is also impossible. The CSV format can serve as a convenient bridge – any spreadsheet application can import it and save data to it. If you save data to a CSV file in pandas, you can then import it into an OpenOffice Calc table, save it in ODS format and continue working in office programs.

In one of the next posts, we will also cover how to integrate pandas and SQL databases, another go-to tool for data analysts. Don't switch over!