Classification of Exoplanets (Part I: Data Processing)

There is something bewitching and beautiful about space. At the same time, people are wired so that whatever they don't understand, they fear (thanks to our mothers and fathers of the nth generation for such a wide range of perception and reaction to information). Still, there have always been mad researchers, dreamers, and people simply tired of doing what was invented without them and already works well, who therefore set out to come up with something new. Some take courses on endless self-development, discover new kinds of breathing, fill up their chakras and feel a surge of strength; others genuinely try to discover things an ordinary person most likely won't need for the next 50 years (or maybe longer), since we are unlikely to leave our Solar System before then. And yet there is something attractive and unusual in looking at the night sky and trying to trace in your head the lines called the Big Dipper (or that same ladle), or, if you are lucky, seeing the Milky Way in all its glory: something that makes you feel, as some say, like a tiny dot, while remembering that there is also a microcosm for which a human being is, roughly speaking, an entire universe. As Lisa Randall wrote in Knocking on Heaven's Door, a person sits somewhere in the middle of this whole scale. But we were talking about the sky and the stars, so let's wrap up the detours into physics and get down to business.

Milky Way picture from the Internet

In this article I would like to talk about a task directly related to what was mentioned above: classifying exoplanets based on data from the Kepler space telescope.
(The dataset can be found at this Kaggle link: https://www.kaggle.com/datasets/nasa/kepler-exoplanet-search-results)

An exoplanet is a planet that orbits a star (not a moon, mind you) outside the Solar System.

Based on a set of features from the dataset, the task is to determine whether an object is an exoplanet candidate or not. That is, the smart folks at NASA have already labeled all the candidates and non-candidates; there are also confirmed planets in the data, but we won't touch them, because:

a) We are doing machine learning, not writing a dissertation on astronomy, so we will leave the final verdict to the scientists; our task is to simplify their lives and let them spend time on more pleasant and useful things than row-by-row data processing.

b) Even if we decide that we understand the decision-making principle behind exoplanet classification so well that we can build a model that predicts not just candidates but also potentially confirmed planets, we will run into class imbalance, which is not great (a quick check below illustrates this).
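
A quick way to see the imbalance mentioned in b) is simply to count the labels. This is a minimal sketch (it re-reads the same file we load properly below):

import pandas as pd

data = pd.read_csv('cumulative.csv')
# koi_disposition also contains CONFIRMED; koi_pdisposition has only the two classes we will use
print(data['koi_disposition'].value_counts())
print(data['koi_pdisposition'].value_counts())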

Starting parameters of the dataset: 50 columns and 9564 rows.
(You can read more about these parameters here: https://exoplanetarchive.ipac.caltech.edu/docs/API_kepcandidate_columns.html)

We import the libraries; if warnings annoy you, suppress them as well:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy as sc
from matplotlib.colors import ListedColormap
from matplotlib.colors import LinearSegmentedColormap
import warnings
warnings.filterwarnings('ignore')

Reading the dataset:

data = pd.read_csv('cumulative.csv')
data.head()

Next, let's look at the target variable and the balance between the classes:

CANDIDATE – exoplanet candidate

FALSE POSITIVE – definitely not a candidate

data.groupby('koi_pdisposition').size().plot(kind='barh', color=sns.palettes.mpl_palette('Dark2'))
plt.gca().spines[['top', 'right',]].set_visible(False)
Target variable

As the chart shows, the classes are well balanced, so we shouldn't have any problems building the model, but before that we need to work with the data.
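
To back up the chart with numbers, a one-liner on the already loaded data frame (just a small sketch) gives the exact class shares:

print(data['koi_pdisposition'].value_counts(normalize=True).round(3))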

Why? Because there are gaps (missing values), as we'll see in a moment.

First, let's build X by dropping the columns we definitely won't need and keeping only what we do.

X = data.drop(['koi_teq_err1', 'koi_teq_err2', 'koi_disposition', 'koi_pdisposition', 'kepler_name', 'kepoi_name', 'rowid', 'koi_score', 'kepid'], axis=1)
X.head()

Why did we drop them? In short, these are columns that won't help us train the model, such as the identifier of a potential planet and other IDs. If you're interested, I recommend opening the link above and reading the parameter descriptions.

Next, we run our 'y' through pandas get_dummies

Y = data['koi_pdisposition']
y = pd.get_dummies(Y, drop_first=True, dtype=float)
y = y.rename(columns={'FALSE POSITIVE': 'koi_pdisposition'})
y.head()

Now y is in float format, whereas before it was of object dtype.
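
A quick sanity check of the encoding (a sketch): with drop_first=True the surviving dummy column is 'FALSE POSITIVE', so after the rename a value of 1.0 in y means FALSE POSITIVE and 0.0 means CANDIDATE.

# show the unique (original label, encoded value) pairs
check = pd.DataFrame({'label': Y, 'encoded': y['koi_pdisposition']})
print(check.drop_duplicates())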

Next, let's see what percentage of data is missing

(X.isna().sum() / len(X)).round(4) * 100

We will get a list of all the features with the percentage of missing data next to each.
If you're working in .py format, you may need to write print((X.isna().sum() / len(X)).round(4) * 100); I use .ipynb because it's intuitive for me and doesn't force you to re-run outputs over and over just to check yourself or build charts.

To get a visual picture, it's better to build a heat map of the missing values:

plt.figure(figsize=(10,12))
sns.heatmap(X.isna().transpose())
White lines are gaps

Let's fill in the gaps by interpolation with the nearest-neighbor method.
Why this one? It doesn't skew the distributions the way filling with averages or fitting polynomials can, and the histograms we'll build shortly will make this choice clearer.

X = X.interpolate(method='nearest')
X.info()

The info method confirms that there are no gaps left now; by the way, we could also have called it earlier to verify there were gaps, without bothering with heat maps and percentages.
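
One caveat worth checking (a small sketch): interpolate(method='nearest') fills values between known points, but it can leave NaNs at the very start or end of a column, so it doesn't hurt to confirm nothing remains:

remaining = X.isna().sum()
print(remaining[remaining > 0])  # empty output means no gaps are left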

Next, let's look at the distributions in the data

X.hist(bins=25, figsize=(20,20));

Judging by the histograms, nearest-neighbor interpolation was a good choice.
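
To check this more directly, we can compare the summary statistics of a column before and after filling; the original data frame still holds the raw values with NaNs. A sketch using koi_depth as an example column:

comparison = pd.DataFrame({
    'before': data['koi_depth'].describe(),
    'after': X['koi_depth'].describe(),
})
print(comparison)  # mean/std should barely move if the fill didn't distort the distribution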

Next we will do feature selection in two ways:
1) Build a random forest model and look at its feature importances
2) Run an F-test (automatic feature selection via SelectKBest)

Let me remind you that we have 42 features left, and it would be nice to discard the unnecessary ones; that is exactly what feature selection is for. There are, of course, far more than two feature-selection methods, for example a correlation matrix or a variance threshold (see the sketch below), but I like these two more and they have most often given me the best metrics.
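
For reference, here is roughly what the variance-threshold approach mentioned above looks like; this is just a sketch and not part of our pipeline, and the cutoff value is something you would have to pick yourself:

from sklearn.feature_selection import VarianceThreshold

vt = VarianceThreshold(threshold=0.0)  # drops only constant features; raise the cutoff to be stricter
vt.fit(X)
kept = X.columns[vt.get_support()]
print(len(kept), 'features kept out of', X.shape[1])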

Speaking of metrics: let's write a function that will compute them for us; before that, don't forget to import the necessary libraries:

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, classification_report, accuracy_score, f1_score
from sklearn.ensemble import RandomForestClassifier
# a helper function to compute the metrics:
def metrics(y_true,y_pred):
    acc=accuracy_score(y_true, y_pred)
    f1=f1_score(y_true, y_pred)
    roc_auc=roc_auc_score(y_true, y_pred)
    print(f'accuracy: {acc}\nf1: {f1}\nroc auc: {roc_auc} ')

Now we only need to pass in y_true and y_pred to get the metrics printed, which speeds things up.

Let's split the data into training and test sets:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Let's train a random forest model

rfc = RandomForestClassifier(max_depth=2, random_state=42)
rfc.fit(X_train, y_train)
rfc_pred_train = rfc.predict(X_train)
rfc_pred_test = rfc.predict(X_test)
metrics(y_test, rfc_pred_test)

Metric output:
accuracy: 0.9158389963408259
f1: 0.9182326053834434
roc auc: 0.9178450601875332
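
Since classification_report is already imported above, it's also worth a look: it gives per-class precision and recall in one call, which nicely complements our metrics function (a small sketch):

# class 1.0 is FALSE POSITIVE and 0.0 is CANDIDATE, given our get_dummies encoding
print(classification_report(y_test, rfc_pred_test))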

Now let's do what we started all this for:
FEATURE IMPORTANCE

feature_importances = rfc.feature_importances_
sorted_indices = np.argsort(feature_importances)
plt.figure(figsize=(12, 10))
plt.barh(np.arange(len(feature_importances)), feature_importances[sorted_indices], color="darkblue")
plt.yticks(np.arange(len(X.columns)), X.columns[sorted_indices])
plt.show()
selected_features = X.columns[feature_importances > 0.01]
print(selected_features)
Importance of Features

In the code we keep only the features with importance greater than 0.01 and get the following array:
Index(['koi_fpflag_nt', 'koi_fpflag_ss', 'koi_fpflag_co', 'koi_fpflag_ec', 'koi_depth', 'koi_prad', 'koi_prad_err1', 'koi_prad_err2', 'koi_teq', 'koi_insol_err1', 'koi_insol_err2', 'koi_model_snr', 'koi_steff_err1', 'koi_steff_err2'], dtype='object')

Great, now let's try the F-test, specifically its classification variant (f_classif):

from sklearn.feature_selection import f_classif, SelectKBest
f_statistic, p_values = f_classif(X, y)
selector = SelectKBest(f_classif, k=8)
X_f = selector.fit_transform(X, y)
mask = selector.get_support(indices=True)
best_features = [X.columns[i] for i in mask]
print(best_features)

We get the following array:
['koi_fpflag_nt', 'koi_fpflag_ss', 'koi_fpflag_co', 'koi_fpflag_ec', 'koi_depth', 'koi_teq', 'koi_steff_err1', 'koi_steff_err2']

The advantage of the F-test approach is that we don't need to build any models or do much ourselves: everything is already implemented, and we simply get the k best features.
In our case k=8; naturally, this parameter can be tuned at your discretion (see the sketch below).
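
If you want to pick k more deliberately, one option (a sketch) is to rank all the features by the F-statistic we already computed and look where the scores drop off:

f_scores = pd.Series(f_statistic, index=X.columns).sort_values(ascending=False)
print(f_scores.head(15))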

Conclusion

We talked about life, processed the data, filled in the gaps with nearest-neighbor interpolation, built charts to understand what we had done, and selected features using two methods. This is where we can wrap up the first part of the article. The second part will be published soon after this one; there we will build models on the selected features and even put together a neural network, since the data allows it.
