Parsing and processing data from weather sites Yandex.Weather and Meteoinfo (Hydrometeorological Center) using pandas (Part 1)

Hello everyone! I want to share how I used the pandas library to parse and process meteorological data from the Yandex.Weather site. This is my first article for Habr, so please don't judge it too harshly.

A brief background. I ended up maintaining a Telegram channel about the weather, and almost immediately the question arose of how to cut the time spent searching and analyzing data from the main meteorological sites, so that everything arrives on my computer just in time. In other words, the goal was a little automation.

Yandex.Weather

There are a number of standard ways to get data from a site (for example, the requests library), but I managed to use the read_html() function built into pandas and “read” the HTML tables directly into DataFrames in Jupyter Notebook.

Basic method syntax:

df = pd.read_html('URL')  # returns a list of DataFrames, one per HTML table found

Result of pandas read_html() method

Note that I did not wrap the read_html() call in a try-except block to catch errors because, firstly, the Yandex.Weather site itself works quite stably and, secondly, the goal was only to obtain the data.
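
For readers who do want a safety net, here is a minimal sketch of what such a wrapper could look like. The helper name read_tables and the sample HTML are made up for illustration, and the choice of exceptions is an assumption: pandas raises ValueError when it finds no tables, and network or file problems usually surface as OSError subclasses. Parsing requires lxml (or bs4 with html5lib) to be installed.

```python
from io import StringIO

import pandas as pd


def read_tables(source):
    """Return every HTML table in `source` as a list of DataFrames, or [] on failure."""
    try:
        return pd.read_html(source)
    except (ValueError, OSError) as err:
        # ValueError: no tables found; OSError: network or file problems (assumed set)
        print(f"Could not read tables: {err}")
        return []


# A made-up page fragment, used here instead of a real URL
sample_html = """<table>
<tr><th>day</th><th>temp</th></tr>
<tr><td>Mon</td><td>+3</td></tr>
<tr><td>Tue</td><td>+5</td></tr>
</table>"""

tables = read_tables(StringIO(sample_html))
print(len(tables), tables[0].shape)
```

The point to keep in mind is that read_html() always returns a list, even when the page contains a single table.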

After analyzing the result, it became clear that the data we need is spread over the first ten tables of the returned list (ya_df[0] through ya_df[9]). We collect them into a single DataFrame to work with using concat():

ya_new = pd.concat(ya_df[0:10], ignore_index=True)
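
To see what ignore_index=True changes, here is a small sketch on two made-up tables that stand in for elements of the list returned by read_html(). The column names and values are invented for the example.

```python
import pandas as pd

# Two made-up tables standing in for elements of the read_html() result
day_1 = pd.DataFrame({"temp": ["+3…+5", "+1…+2"], "wet": ["80%", "85%"]})
day_2 = pd.DataFrame({"temp": ["+4…+6", "+2…+3"], "wet": ["75%", "90%"]})

stacked = pd.concat([day_1, day_2], ignore_index=True)  # fresh index 0..3
kept = pd.concat([day_1, day_2])                        # original labels 0, 1, 0, 1

print(list(stacked.index))
print(list(kept.index))
```

A continuous index matters later, because the loops address rows by position.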

We get a table with “raw” data:

table with “raw” Yandex.Weather data

Things were looking up! The data turned out to be fairly well structured, with no gaps in the main columns, so a plan for further work took shape:

  1. Process the temp column to get the lower and upper temperature limits using regular expressions (and a little magic)

  2. Process the wind column with regular expressions to get the wind speed (without direction)

  3. Devise and implement processing of the column with the “feels like” temperature, t_eff (as meteorologists call it)

  4. Collect everything in the final dataframe.

Processing the temp column came down to looping over the column row by row, finding the substring that matches the pattern ‘[+ -]\d{1,2}\…[+ -]\d{1,2}’, and appending it to a list. This video helped me understand regular expressions.

import re

temp_clean = []
for t_str in ya_new['temp']:
    temp_str = re.search(r'[+ -]\d{1,2}\…[+ -]\d{1,2}', t_str)  # pattern search with search() from the re library
    temp_clean.append(temp_str.group(0))
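
The article does not show how the matched string becomes the lower and upper limits, so here is one possible sketch of that step. The function name parse_temp_range and the sample cells are hypothetical. Replacing the Unicode minus “−” with an ASCII “-” is an assumption, based on the fact that the t_eff code in this article compares against “−”.

```python
import re

# Same pattern as in the article: sign or space, 1-2 digits, an ellipsis, then again
TEMP_PATTERN = re.compile(r'[+ -]\d{1,2}…[+ -]\d{1,2}')


def parse_temp_range(cell):
    """Return (lower, upper) as ints, or None if the cell holds no temperature range."""
    normalized = cell.replace('\u2212', '-')  # assumption: the site may use the Unicode minus
    match = TEMP_PATTERN.search(normalized)
    if match is None:                          # no range found, avoid calling .group() on None
        return None
    low, high = re.findall(r'[+-]?\d{1,2}', match.group(0))
    return int(low), int(high)


print(parse_temp_range('+3…+5'))
print(parse_temp_range('no data'))
```

Returning None instead of raising keeps the loop alive when a cell does not match the pattern.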

Similarly, a loop was implemented to process the column wind:

wind_clean = []
for w_str in ya_new['wind']:
    wind_str = re.search(r'\d{1,2},\d', w_str)  # pattern search with search() from the re library
    wind_clean.append(wind_str.group(0))
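
The pattern above captures values with a decimal comma, such as “3,4”, while a later conversion to float needs a decimal point. The article does not show that step, so the sketch below is only one way it could be done. The function name parse_wind_speed and the sample strings are invented.

```python
import re


def parse_wind_speed(cell):
    """Return the wind speed as a float, or None if no 'digits,digit' value is found."""
    match = re.search(r'\d{1,2},\d', cell)
    if match is None:
        return None
    # Swap the decimal comma for a point so float() can parse the value
    return float(match.group(0).replace(',', '.'))


print(parse_wind_speed('3,4 m/s, NW'))
print(parse_wind_speed('calm'))
```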

The t_eff column deserves a closer look. Each cell holds a string in which the same value appears twice. The string can start with “+”, “−”, or nothing (if the temperature is 0). The split was implemented like this:

t_eff = []
for string in ya_new['t_eff']:                # walk through the elements of the column
    s_1 = re.sub('[^0-9]', '', string)        # remove everything that is not a digit
    s_fin = s_1[:len(s_1)//2]                 # the value is repeated twice, so keep half of the digits
    if string[0] == '−':                      # if the string starts with the minus sign "−" (U+2212)
        s_fin = '-' + s_fin                   # restore the sign as ASCII "-" so astype('int') can parse it
    t_eff.append(s_fin)                       # for "+" or no sign, the digits alone are enough
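
The same idea can be packed into a small function and applied with Series.map(), which makes it easy to check on a few sample cells. This is only a sketch: the name parse_feels_like is hypothetical, and the sample strings, including “00” for zero, are assumptions about what the cells look like.

```python
import re

import pandas as pd


def parse_feels_like(cell):
    """Take a string with the same value written twice and return it once as an int."""
    digits = re.sub(r'[^0-9]', '', cell)       # keep digits only
    value = int(digits[:len(digits) // 2])     # first half of the doubled value
    # The Unicode minus (U+2212) at the start marks a negative temperature
    return -value if cell.startswith('\u2212') else value


# Made-up cells: positive, negative (Unicode minus), and zero
sample = pd.Series(['+12+12', '\u22123\u22123', '00'])
parsed = sample.map(parse_feels_like)
print(parsed.tolist())
```

Returning an int directly also removes the need for a separate type conversion of this column.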

Separately, the % sign was removed from the humidity column, the data types were converted, and a column with the average temperature was added:

# convert the data types
ya_full[['t_down','t_up','wet']]=ya_full[['t_down','t_up','wet']].astype('int')
ya_full['wind']=ya_full['wind'].astype('float')
ya_full['t_eff']=ya_full['t_eff'].astype('int')

# calculate the average temperature
ya_full['t_mean']=(ya_full['t_up']+ya_full['t_down'])/2
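
The removal of the % sign is mentioned but not shown, so here is a small sketch of how that step and the conversions could fit together. The data is made up and the use of str.replace() is an assumption about the approach; only the column names follow the article.

```python
import pandas as pd

# Made-up rows with the same column names as in the article
frame = pd.DataFrame({
    "t_down": ["3", "-1"],
    "t_up": ["5", "2"],
    "wet": ["80%", "92%"],
})

frame["wet"] = frame["wet"].str.replace("%", "", regex=False)  # drop the percent sign
frame[["t_down", "t_up", "wet"]] = frame[["t_down", "t_up", "wet"]].astype("int")
frame["t_mean"] = (frame["t_up"] + frame["t_down"]) / 2        # average of the two limits

print(frame)
```

Note that astype('int') only works once every string in the column is a plain number, which is why the % sign has to go first.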

The final dataframe itself, to which these conversions are applied, was assembled using concat():

ya_full=pd.concat([list_days,temp_fin,ya_new[['press','wet']],wind_fin,temp_eff,ya_new['weather']],axis=1)

The final result of data processing on the Yandex.Weather website
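
One detail of concat() with axis=1 is worth a quick illustration: the pieces are aligned by index, not by position. The sketch below uses invented data and names, since the article does not show how list_days, temp_fin and the other pieces were built.

```python
import pandas as pd

# Made-up pieces that share the default index 0, 1
days = pd.Series(["Mon", "Tue"], name="day")
temps = pd.DataFrame({"t_down": [3, -1], "t_up": [5, 2]})
wind = pd.Series([3.4, 5.0], name="wind")

full = pd.concat([days, temps, wind], axis=1)  # columns placed side by side

# The same wind values with a shifted index no longer line up with the days
wind_shifted = pd.Series([3.4, 5.0], index=[1, 2], name="wind")
misaligned = pd.concat([days, wind_shifted], axis=1)

print(full)
print(misaligned)
```

If the pieces come from lists built in loops, as here, they all get the same default index and line up without extra work.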

Such a table can be used for anything: calculating averages across different sites, building graphs, analysis, and so on. In the next article I will cover processing data from the Hydrometeorological Center site (everything turned out to be more complicated there). The full code for this case can be viewed on my GitHub.
