My thoughts on Python in Excel

Python in Excel

On August 22, 2023, Microsoft released a preview version of “Python in Excel”. I'm a developer xlwings and author of the O'Reilly book Python for Excel, so of course I was interested in testing it. You shouldn't judge a book by its cover, so I decided to dive deeper and record a LinkedIn training course Python in Excel: Working with pandas DataFrames. At the same time, I discovered many bugs and other problems, after which sent reports about them to GitHub. At the moment, they have already either been resolved, or at least confirmed/commented on, and this is amazing. I can only wish that it would be just as easy to report bugs in the world of Office.js and VBA.

So what do I think about Python in Excel? In short:

  • We wanted an alternative to VBA, but we ended up with an alternative to the Excel formula language

  • Integrating Jupyter Notebook Cells Inside an Excel Grid Was a Flawed Decision

  • Python in Excel is neither suitable for beginners in Python nor for interactive data analysis

  • It has too many limitations at the moment (can't use native packages and can't connect to web API)

  • Here are examples of possible use cases I see for the current version of Python in Excel:

    • Computationally expensive tasks like Monte Carlo simulations

    • Working with AI using available packages (scikit-learn, nltk, statsmodels, imbalanced-learn, gensim)

    • Complex visualizations using Matplotlib/Seaborn

    • Time sequence analysis (one of Excel's blind spots)

    • Not sure about data cleaning/analysis: since you will almost certainly need Power Query, it may be easier and faster to use Power Query alone (instead of using Power Query and Python together in Excel)

Before we get into details, I want to emphasize that this is just my personal opinion and not a complaint or criticism. During my research, I spoke with the Excel development team several times, they are very friendly and I appreciate their interest in my opinion. Unlike me, they have to work with almost forty years of Excel history and a ton of corporate redundant garbage. Given these limitations, they do a great job!

Create what people need

The history of Python in Excel began in 2015, when user Daniel posted the following idea on the Microsoft UserVoice page (which later migrated to feedback forum companies):

UserVoice

The idea quickly became the most popular among those who needed:

Unfortunately, Python in Excel does not support either function. Instead of becoming an alternative to VBA, it has become an alternative to the Excel formula language. While I don't think this is a bad thing, Python in Excel has actually become just a 2D Jupyter notebook inside an Excel spreadsheet. By the way, it uses a real Jupyter notebook kernel running in the Azure cloud. This means you can use the magic of laptops like %%timeitand that's great!

The project's chief architect, John Lamb, said: Real Python Podcast, that the implementation was considered a “crazy idea”, but it ended up being liked by the decision makers at Microsoft. In my opinion, it would have been better to remain a crazy idea; I will explain the reasons in the next section.

Bug: Placing Jupyter notebooks inside a grid

Python in Excel could be easily implemented by integrating the Jupyter notebook as an Excel task pane. A Python developer would feel right at home with the ability to directly read and write to Excel cells. On the other hand, Excel users could get started with Python in a “standard” environment without having to learn a new tool when moving from Excel. What's the problem with placing laptop cells inside a grid?

Looks like a duck, but it's not a duck

Everyone knows how a spreadsheet works: you enter numbers and formulas into cells, and everything just works because Excel understands the dependencies of each cell and recalculates those that need to be changed when the input data changes. And it doesn’t matter at all where exactly these cells are located in the table.

Well, Python in Excel breaks this basic rule of spreadsheets: even though PY cells are very similar to user-defined functions (UDFs), they are calculated from left to right and top to bottom. The same applies to sheets, that is, the first sheet is calculated before the second and so on.

Repeat yourself

When Office Scripts appeared, it became so convenient that the code (unlike the situation with VBA) was stored outside of the Excel workbook, on SharePoint/OneDrive. Suddenly it was possible to use the same code in multiple books without the need for copy-pasting. In addition, when a bug was discovered, it was enough to correct the code in only one place. Life is good. In fact, it would be even better if the Office Scripts file was a regular TypeScript file that could be tracked on GitHub, but that's another story.

Python in Excel stores the code in the Excel file itself. From the end user's point of view, the code exists inside the cells, but technically it is stored in /xl/pythonScripts.xml inside an Excel workbook (that is, in a ZIP file). We're back to copy-pasting again, yay!

Magic is everywhere!

If you need to output a pandas DataFrame (or any other Python object) to Excel cells, then you need to switch from Object mode to Values ​​mode (the button on the left of the formula bar). There are two problems here:

  • When a DataFrame uses a standard integer index, it is hidden when converted to values. If this index is text or non-standard, then it is displayed. Magic!

    Imagine you have a hundred PY cells, and in cell 3 you specify a column id as a DataFrame index. This will mean that in cell 100, where you convert the Python object to values, you suddenly have one more column.

  • Switching the mode from Object to Values ​​changes the formula =PY("...", 1) on =PY("...", 0). However, this is hidden from the user because the user only sees Python code in the cell (part "..." formulas). Magic!

    I see two serious problems with this:

    • This makes it difficult to directly reference a cell with a Python object because there is a risk that someday someone will switch that cell to Values ​​mode and that will break any formulas that reference that cell in Object mode.

    • PY cells in Object mode behave differently than any other object cell (“Excel (rich) data types”). Let's take the Stocks data type as an example: a cell always remains an object, and you can display its value by typing in another cell =A1.Price. PY cells can behave in a similar way: if you need values, you write =A1.Values. Consistency is always good.

      Stocks rich data types

Non-interactive data analysis

I mainly use Jupyter notebooks for interactive data analysis or incremental experiments. You enter a few lines of code into the cell and press Shift+EnterTo calculate the value of a cell, you look at the intermediate results and continue with the next cell.

When running Python in Excel, I don't see the result in the cell unless:

  • I'll change the cell type from Object to Values. This is often not possible because it requires sufficient space around the cell.

  • I'll click on the preview icon and a pop-up window will open. This is a bunch of unnecessary actions (even if you know the keyboard shortcut Ctrl+Shift+F5), and the preview window is often not very useful: it's too small, and the values ​​are shown as an Excel cell rather than how Jupyter Notebook displays them. This should be fixed at some point, but for now, almost a year after the release of the first version, the preview still shows #N/Aso I don't know what it is: np.nan, pd.NA or pd.NaT.

There are too many hurdles, so unfortunately I won't be using Python in Excel for interactive data analysis!

Diagnostic panel

I would say the Diagnostic Panel is the biggest problem with Python in Excel. You can even start with an unnecessarily technical name (why not just call it Outputlike in VS Code?).

Diagnostic panel displays what is output print() in the cells, as well as any error messages. That is, everything that the Jupyter notebook displays directly below the cell. But since Python in Excel pulls the output away from the cell, you need to manually bind the output to the appropriate cell. Two examples:

  • When learning a DataFrame, the first thing we usually do is df.info(). Since this prints the output, each time Excel recalculates Python cells, the diagnostic panel opens. This is not only annoying, but also completely prevents you from quickly correlating the diagnostic panel output with the corresponding cell. In addition, the output from the previous run is saved there, which makes it even more difficult to find the necessary lines (by the way, a monospace font would help here). What conclusion applies to df1and which one df2? I won't understand this until I look at the code in the cells L2 And K2:

    Diagnostics Pane Error

    Diagnostic Panel Error

  • The same thing happens with exceptions. Try to find the error in the screenshot below (the error occurs in K2which has already disappeared at the bottom of the diagnostic panel)

    Diagnostics Pane Error

As a new Python coder, it is essential to get easy error feedback. But that's not the case here, and that's why I don't think Python in Excel is a good solution for Python beginners.

I'll propose an idea: a preview popup is a much better place to display output without losing context. Also, coloring the PY cell or icon red when an error occurs would help.

Python in the cloud

As you've probably heard, Python runs in the cloud in instances Azure Container, not Excel. There has been a lot of complaining about this on social media (“what about our privacy?”), but what all these commentators are missing is the fact that most corporations already trust Microsoft with all their data using OneDrive/SharePoint.

In fact, I agree that the best way to implement Python in Excel is in the cloud. Maintaining and installing a local Python environment is an impossible task for inexperienced users or when the team has a large number of people. The only option left is Web-Assembly (WASM), but it looks like the Excel team left its implementation as an exercise for me (see end of this post).

However, it's worth remembering that Python in Excel will cost you on top of your Microsoft 365 plan (we don't know how much yet) and there appears to be a limit on the number of requests: at least when I recorded my LinkedIn training course, I saw the following message:

Python in Excel Rate Limit

This is annoying even if you are just recording a video for a course, much less when using Excel spreadsheets for hedge fund trading (yes, this still happens) and is completely unacceptable. And even if there are no restrictions, the Azure cloud sometimes goes offline, which means that cells with Python will not work. This is worth remembering.

In other words, I wouldn't use Python in Excel for anything critical unless I could switch to self-hosting or a local server when the cloud went down.

Problems with scripting language

One of the main reasons to love Python is that it is a “scripting language”. With Python you can connect whatever you need to connect because it has packages for everything. That is, the work process usually looks like this: we download several files from the web API, get data from the database, run everything through DataFrame pandas or polars, and at the end we send an email with the results.

Well, such a basic workflow is not possible using Python in Excel. You can't install additional packages, and your code runs in a steel box, meaning there's no way to connect to the outside world via a web API or database connection. The official solution is to use Power Query to retrieve external data because Python in Excel can reference Power Query. But in fact, one of the reasons I needed Python and pandas in Excel was precisely because I was giving up on Power Query (I don't want to learn the M language), so I'm not a big fan of this solution.

Here's what I don't understand:

  • Python in Excel runs on Azure Container instances. And Azure Container instances can use your own container with all the dependencies you need, so I don't understand why the Excel team said in one of the posts in Redditthat the dependencies will have to be installed from scratch every time someone opens the workbook. This is complete nonsense, because the container image is built once and then simply stored with all its dependencies until the next time someone opens the workbook.

  • If I can access web APIs through Office Scripts, then why not through Python in Excel? As far as I know, the Office administrator can restrict access to Office Scripts, so I believe something similar will one day appear for Python in Excel.

To summarize: due to missing packages and blocked Internet access, it is impossible to use Python in Excel in many situations for which people needed Python in Excel.

Matplotlib is great, but it could use a little interactivity

Python in Excel has Matplotlib and Seaborn (which is based on Matplotlib) to help you create plots when Excel charts are too simple. This is a wonderful and necessary opportunity, but the result is a boring, static image. This happened in xlwings ten years ago. I miss access to modern interactive charts like Plotly, Bokeh, Altair, HoloViews and so on.

Lack of access to the Excel object model

People constantly use Excel to create reports. Very often they have to combine one or more workbooks with data on multiple sheets. Let's say there is a sheet for each month, or for each store address, or for each department. When working with Python in Excel, you will have to manually select the data in each worksheet to link all the data:

jan = xl("Jan[#All]", headers=True)
feb = xl("Feb[#All]", headers=True)
mar = xl("Mar[#All]", headers=True)
...
df = pd.concat([jan, feb, mar, ...])

If you need to merge weekly data or data from a hundred departments, good luck. If Python in Excel provided access to the Excel object model (like VBA or Office Scripts), it would look something like this (this example shows xlwings syntax):

data = []
for sheet in workbook.sheets:
    data.append(sheet.tables[0].range.options(pd.DataFrame).value)
df = pd.concat(data)

Speaking of reporting: everyone uses their own formatting for colors and cell borders, but without access to the Excel object model, this is also impossible to do.

DataFrames are great, but NumPy arrays and lists are great too

If you select more than one cell, the function xl() will pass DataFrame to pandas. I don't mind having this happen by default, but quite often pandas require a NumPy list or array (or DataFrame polars!) instead of a DataFrame. And it should be very easy to obtain. In fact, it's now possible to write your own converter to solve this problem, but I don't want to copy-paste it into every book…! Here's how to fix it:

def myconverter(x, headers=False, convert=None, **kwargs):
    if convert is None or convert == pd.DataFrame:
        return excel.convert_to_dataframe(x, headers=headers, **kwargs)
    elif convert == np.array:
        return np.array(x)
    elif convert == list:
        return x
    else:
        raise ValueError(f"{convert} is not supported.")
excel.set_xl_array_conversion(myconverter)

Running this code in the initialization panel (when it is editable, but for now place the code in cell A1 of the first sheet) will allow you to use the function normally xl() to get DataFrame pandas:

df = xl("A1:B2", headers=True)

But this will also allow you to get a NumPy array:

arr = xl("A1:B2", convert=np.array)

or a list of lists:

mylist = xl("A1:B2", convert=list)

It's just a few lines of code, but it's not yet available in Python in Excel, which means you have to duplicate and copy-paste code between books.

A better alternative than “an alternative to the Excel formula language”

I wondered, “Are there no better alternatives than adding Python to enhance the analytical capabilities of the Excel formula language?” I think there is: it is possible to improve the native Excel formula language itself. And in recent years, Microsoft has been doing just that. It all started with dynamic arrays introduced in 2018, which later created the opportunity for powerful formulas like =UNIQUE() And =SORT(). Microsoft has recently added the following new features:

Wait a minute, that's half the reason I'd create pandas DataFrames in Excel! That is, instead of writing df.pivot_table() And df.groupby()I can now stay within the confines of native Excel and write real Excel formulas without all the inconvenience of Python in Excel.

What I found interesting was the multi-line editing process in PY cells. Why not upgrade Excel's native formula language so I don't have to write expressions like this? LET (which I still find difficult to read):

=LET(x, 1, y, 2, x + y)

Maybe let me write like this instead?

let x = 1
let y = 2
x + y

Alternatively, why not turn your Excel tables into a native Excel DataFrame? Add attributes to them instead of taking a functional approach, and you end up with something like this:

=MyTable[#All].GROUPBY(...)

There are many opportunities to integrate pandas functionality in a more Excel-native way!

What do I plan to do next with xlwings?

One of the most popular uses of xlwings has always been to run Jupyter Notebook and Excel in parallel. The tricky part here is that Python must be installed locally. This is a problem because Python is difficult to deploy and maintain with the packages you need. There are also problems with IT security departments: they do not like the full installation of Python on the user's machine.

One solution to the problem is xlwings Server, which is self-hosted and runs anywhere Python runs. However, xlwings Server assumes this use case: a developer writing macros and UDFs in Python in the backend. The end user accesses it as a modern Office.js web add-on without the need for Python knowledge. That is, it was not intended for interactive use by the end user.

That's why I started working on “xlwings Script” which will be based on Web-Assembly (WASM). Essentially, Web-Assembly allows Python to run natively in the browser, while Office.js addons run in the browser inside Excel. That is, I “just” connect Python running on WASM with the Excel object model. This will help me achieve PyScriptbuilt on top of Pyodide and Emscripten.

It will most likely take me a few more months to implement this; I will determine this by an early proof-of-concept, which will have the following characteristics:

  • Support for true native functions (aka user-defined function, UDF)

  • Provide access to the Excel object model

  • Support for Matplolib/Seaborn charts

  • Web API call support

  • Ability to install your own packages

  • Emphasis on privacy: everything will run on the user's machine

  • High speed

  • Win, macOS and Excel on the Web support from the start

  • Ability to load external files without Power Query

  • Working with fixed versions of Office, such as Office 2021, not just Microsoft 365

  • To get it, you just need to download the add-on from the Excel add-on store

Essentially, I want to create Office Scripts, but with Python instead of JavaScript and running on the user's machine rather than in the Microsoft cloud.

Conclusion

Use cases for Python in Excel certainly exist, but in my opinion, integrating a classic Jupyter notebook instead of placing cells on an Excel sheet would be much better.

Don't forget that Python in Excel is still a preview version at the moment, so we'll probably get more features and fewer limitations in the future.

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *