Kaggle. Enefit – forecasting consumer energy behavior

Overview

The goal of the competition is to create a model for forecasting energy consumption to reduce the costs of energy imbalances.

This competition aims to address the problem of energy imbalance, a situation where the energy expected to be used does not match the energy actually used or produced. Prosumers, who both consume and generate energy, contribute significantly to energy imbalances. Even though they make up only a small portion of all consumers, their unpredictable energy use causes logistical and financial problems for energy companies.

About Us

Enefit is one of the largest energy companies in the Baltic region. As energy experts, we help clients plan their green journey individually and flexibly, and realize it using clean energy solutions. Enefit currently tries to address the imbalance problem by developing internal forecasting models and relying on third-party forecasts. However, these methods have proven insufficient because they predict prosumer energy behavior with low accuracy. Their main shortcoming is that they cannot accurately account for the wide range of variables that influence prosumer behavior, which results in high imbalance costs.

Assessment

Solutions are evaluated by the mean absolute error (MAE) between the predicted value and the observed target. Formula:

$$MAE = \frac{1}{n}\sum_{i=1}^{n}\left|{y}_{i} - {x}_{i}\right|$$

Where:

  • $$n$$ is the total number of data points.

  • $${y}_{i}$$ is the predicted value for data point $$i$$.

  • $${x}_{i}$$ is the observed value for data point $$i$$.
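As a minimal sketch of the metric, assuming predictions and observations arrive as equal-length Python sequences, the MAE can be computed as follows. The function name and the sample values are illustrative and are not part of the competition API.

```python
def mean_absolute_error(predicted, observed):
    """Mean absolute error between predicted values y_i and observed values x_i."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if len(predicted) == 0:
        raise ValueError("at least one data point is required")
    # Sum |y_i - x_i| over all data points, then divide by n.
    total = sum(abs(y - x) for y, x in zip(predicted, observed))
    return total / len(predicted)


# Illustrative values only, not competition data.
y = [10.0, 0.0, 3.5]   # predicted
x = [12.0, 1.0, 3.5]   # observed
mae = mean_absolute_error(y, x)  # absolute errors are 2, 1 and 0
```

Because the error is absolute rather than squared, a few large misses on high-volume segments weigh linearly, not quadratically, in the final score.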

Participation

You must enter this competition using the provided Python time series API, which ensures that models do not look ahead in time. To use the API, follow the template in this notebook.

Model training period

  • November 1, 2023 – Start date.

  • January 24, 2024 – Entry deadline. You must accept the competition rules by this date in order to participate in the competition.

  • January 24, 2024 – Team merger deadline. This is the last day for participants to join or merge teams.

  • January 31, 2024 – Final submission deadline.

All deadlines are at 11:59 PM UTC on the relevant day unless otherwise stated. The competition organizers reserve the right to update the competition schedule if they deem it necessary.

Model testing period

After the final submission deadline, the leaderboard will be updated periodically as new data is collected.

Expect 1-3 updates before the final evaluation.

April 30, 2024 – Competition end date

Prizes

1st place – $15,000

2nd place – $10,000

3rd place – $8,000

4th place – $7,000

5th place – $5,000

6th place – $5,000

Code requirements

Entries for this competition must be submitted as notebooks. In order for the “Submit” button to be active after committing, the following conditions must be met:

  • Notebook CPU runtime <= 9 hours

  • Notebook GPU runtime <= 9 hours

  • Internet access disabled

  • Freely and publicly available external data is allowed, including pre-trained models

  • The submission file must be named submission.csv and be generated by the API.

Please see the Code Competition FAQ for more information on how to submit a solution, and review the code debugging document if you encounter submission errors.

Description of the training sample

Your task in this competition is to predict the amount of electricity produced and consumed by Estonian energy consumers who have installed solar panels. You will have access to weather data, relevant energy prices and records of installed PV capacity.

This is a forecasting competition using the Time Series API. The final leaderboard will be determined using real data collected after the submission period closes.

All datasets follow the same time convention. Times are given in EET/EEST. Most variables are a sum or an average over a 1-hour period. The datetime column (regardless of its name) always indicates the start of the 1-hour period. However, in the weather datasets, some variables, such as temperature or cloud cover, are given for a specific time, which is always the end of the 1-hour period.
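The time convention can be sketched as follows: a row's timestamp marks the start of its hour, so an instantaneous weather variable on that row is valid one hour later. This is an illustration only. The helper name is hypothetical, and the sketch uses naive local timestamps, so it ignores EET/EEST daylight-saving transitions.

```python
from datetime import datetime, timedelta

HOUR = timedelta(hours=1)


def instant_value_time(period_start):
    """Time at which an end-of-period variable (e.g. temperature) is valid.

    The datetime column holds the START of the 1-hour period, while
    instantaneous weather variables refer to the END of that period.
    """
    return period_start + HOUR


# Illustrative row: the period runs from 10:00 to 11:00.
row_start = datetime(2023, 10, 31, 10, 0)
temperature_valid_at = instant_value_time(row_start)  # 11:00 on the same day
```

Summed or averaged variables on the same row, such as the target, describe the whole interval from `row_start` to `temperature_valid_at`.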

Files

train.csv

  • county – County identification code.

  • is_business – Boolean value to determine whether the prosumer is a business or not.

  • product_type – Identification code with the following mapping of codes to contract types: {0: "Combined", 1: "Fixed", 2: "General service", 3: "Spot"}.

  • target – Volume of consumption or production for the corresponding segment per hour. Segments are defined by county, is_business, and product_type.

  • is_consumption – A Boolean value indicating whether the target of this row is consumption or production.

  • datetime – Estonian time in EET (UTC+2) / EEST (UTC+3). It describes the start of the 1-hour period for which the target is given.

  • data_block_id – All rows sharing the same data_block_id become available at the same forecast time. This depends on what information is available when the forecasts are actually made, at 11 a.m. each morning. For example, if the forecast weather data_block_id for predictions made on October 31 is 100, then the historical weather data_block_id for October 31 will be 101, since the historical weather data only becomes available the next day.

  • row_id – Unique identifier for the row.

  • prediction_unit_id – Unique identifier for the combination of county, is_business and product_type. New prediction units may appear or disappear in the test set.
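The sketch below shows how these columns might be used together: one hourly series per segment and direction, and a lookup of the rows that share a data_block_id. It assumes rows have been loaded as plain dictionaries keyed by the train.csv column names. The helper names and sample rows are hypothetical.

```python
from collections import defaultdict

# Mapping of product_type codes to contract types, as listed above.
PRODUCT_TYPES = {0: "Combined", 1: "Fixed", 2: "General service", 3: "Spot"}


def segment_key(row):
    """A segment is defined by county, is_business and product_type."""
    return (row["county"], row["is_business"], row["product_type"])


def rows_for_block(rows, data_block_id):
    """Rows that become available at the same forecast time."""
    return [r for r in rows if r["data_block_id"] == data_block_id]


# Hypothetical rows; the values are made up for illustration.
train_rows = [
    {"county": 0, "is_business": 0, "product_type": 1, "is_consumption": 0,
     "datetime": "2021-09-01 00:00:00", "data_block_id": 0, "target": 0.5},
    {"county": 0, "is_business": 0, "product_type": 1, "is_consumption": 1,
     "datetime": "2021-09-01 00:00:00", "data_block_id": 0, "target": 90.0},
    {"county": 0, "is_business": 0, "product_type": 1, "is_consumption": 1,
     "datetime": "2021-09-02 00:00:00", "data_block_id": 1, "target": 85.0},
]

# One hourly series per (segment, is_consumption) pair.
series = defaultdict(list)
for row in train_rows:
    key = (segment_key(row), row["is_consumption"])
    series[key].append((row["datetime"], row["target"]))

block_0 = rows_for_block(train_rows, 0)
contract = PRODUCT_TYPES[segment_key(train_rows[0])[2]]
```

Keeping production and consumption as separate series per segment mirrors how the target is defined, since each row carries only one of the two directions.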

gas_prices.csv

  • origin_date – The date when day-ahead prices became available.

  • forecast_date – The date for which the forecast prices apply.

  • [lowest/highest]_price_per_mwh – The lowest/highest price of natural gas prevailing on the market for the day ahead on that trading day, in euros per megawatt-hour equivalent.

  • data_block_id

client.csv

  • product_type

  • county – County identification code. See county_id_to_name_map.json for mapping ID codes to county names.

  • eic_count – Aggregated number of points of consumption (EICs – European Identification Code).

  • installed_capacity – Installed photovoltaic solar panel capacity in kilowatts.

  • is_business – Boolean value to determine whether the prosumer is a business or not.

  • date

  • data_block_id

electricity_prices.csv

  • origin_date

  • forecast_date – Represents the start of the 1-hour period for which the price is valid.

  • euros_per_mwh – The day-ahead electricity price in euros per megawatt-hour.

  • data_block_id

forecast_weather.csv

Weather forecasts that would have been available at the time of forecasting. Sourced from the European Centre for Medium-Range Weather Forecasts.

  • [latitude/longitude] – Coordinates for weather forecast.

  • origin_datetime – A timestamp indicating when the forecast was generated.

  • hours_ahead – Number of hours between the forecast generation time and the forecast weather. Each forecast covers a total of 48 hours.

  • temperature – Air temperature at a height of 2 meters above the ground in degrees Celsius. Calculated at the end of the 1-hour period.

  • dewpoint – Dew point temperature at 2 meters above the ground in degrees Celsius. Calculated at the end of the 1-hour period.

  • cloudcover_[low/mid/high/total] – Percentage of the sky covered by clouds in the following altitude ranges: 0-2 km, 2-6 km, 6+ km, and total. Calculated at the end of the 1-hour period.

  • 10_metre_[u/v]_wind_component – The [eastward/northward] component of wind speed, measured at a height of 10 meters above the surface, in meters per second. Calculated at the end of the 1-hour period.

  • data_block_id

  • forecast_datetime – Timestamp of forecast weather. Generated from origin_datetime plus hours_ahead. This represents the start of the 1-hour period for which weather data is forecast.

  • direct_solar_radiation – Direct solar radiation reaching the surface on a plane perpendicular to the direction of the Sun, accumulated over the hour, in watt-hours per square meter.

  • surface_solar_radiation_downwards – Solar radiation, both direct and diffuse, reaching a horizontal plane at the Earth's surface, accumulated over the hour, in watt-hours per square meter.

  • snowfall – Snowfall over the hour, in meters of water equivalent.

  • total_precipitation – The accumulated liquid, comprising rain and snow, that falls on the Earth's surface over the described hour, in meters.
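Two of the relationships described above lend themselves to a short sketch: forecast_datetime is origin_datetime plus hours_ahead, and the eastward (u) and northward (v) wind components combine into a scalar wind speed via the Euclidean norm. The helper names and sample values are hypothetical, and naive timestamps are assumed.

```python
import math
from datetime import datetime, timedelta


def forecast_datetime(origin_datetime, hours_ahead):
    """Start of the forecast hour: origin_datetime plus hours_ahead."""
    return origin_datetime + timedelta(hours=hours_ahead)


def wind_speed(u, v):
    """Scalar wind speed (m/s) from eastward (u) and northward (v) components."""
    return math.hypot(u, v)


# Illustrative values only, not taken from forecast_weather.csv.
origin = datetime(2023, 10, 31, 2, 0)
target_time = forecast_datetime(origin, 10)  # 12:00 on the same day
speed = wind_speed(3.0, 4.0)                 # sqrt(3^2 + 4^2)
```

A derived scalar speed is one way to make the forecast file comparable with windspeed_10m in the historical weather file, which stores speed and direction rather than components.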

historical_weather.csv

Historical weather data.

  • datetime – This marks the start of the 1-hour period over which weather data is measured.

  • temperature – Measured at the end of a 1-hour period.

  • dewpoint – Measured at the end of a 1-hour period.

  • rain – Differs from the forecast conventions. Rain from large-scale weather systems over the hour, in millimeters.

  • snowfall – Differs from the forecast conventions. Snowfall over the hour, in centimeters.

  • surface_pressure – Air pressure at the surface in hectopascals.

  • cloudcover_[low/mid/high/total] – Differs from the forecast conventions. Cloud cover at 0-3 km, 3-8 km, 8+ km, and total.

  • windspeed_10m – Differs from the forecast conventions. Wind speed at a height of 10 meters above the ground in meters per second.

  • winddirection_10m – Differs from the forecast conventions. Wind direction at 10 meters above the ground in degrees.

  • shortwave_radiation – Differs from the forecast conventions. Global horizontal irradiation in watt-hours per square meter.

  • direct_solar_radiation

  • diffuse_radiation – Differs from the forecast conventions. Diffuse solar radiation in watt-hours per square meter.

  • [latitude/longitude] – Coordinates of the weather station.

  • data_block_id

  • public_timeseries_testing_util.py

    An optional file designed to make it easier to run custom standalone API tests. For details, see the documentation for the script. You will need to edit this file before using it.

  • example_test_files

    Data intended to illustrate how the API functions. Includes the same files and columns provided by the API. The first three data_block_ids repeat the last three data_block_ids in the training set.

  • example_test_files/sample_submission.csv

    A valid submission sample provided by the API. See this notebook for a very simple example of how to use the submission template.

  • example_test_files/revealed_targets.csv

    The actual target values from the day before the forecast time. This amounts to a two-day delay relative to the prediction times in test.csv.

  • enefit

    Files that enable the API. Expect the API to deliver all rows in under 15 minutes and to reserve less than 0.5 GB of memory. The copy of the API that you can download serves the data from example_test_files/. You must make predictions for those dates in order to advance the API, but those predictions are not scored. Expect approximately three months of data initially and up to ten months of data by the end of the forecasting period.

Citation

Kristjan Eljand, Martin Laid, Jean-Baptiste Scellier, Sohier Dane, Maggie Demkin, Addison Howard. (2023). Enefit – Predict Energy Behavior of Prosumers. Kaggle.

https://kaggle.com/competitions/predict-energy-behavior-of-prosumers
