To take full advantage of all the information available, it is best to combine as many databases as possible. For example, adding u-band or X-ray information while classifying quasars based on their variability is highly likely to improve the overall performance. Because these catalogs are taken with different instruments, bandwidths, locations, times, etc., the set of objects common to all of them is smaller than any single catalog; thus, a combined multi-catalog that keeps every object contains missing values. Traditional classification methods cannot deal with the resulting missing data problem, because training a classification model requires all features for every training object.
PyMissingData allows you to predict missing values given the observed data and the dependency relationships between variables. It also has tools to compare different predictions, which you can use to test their behaviour on a database with all the data.
The library can be installed through pip as:
pip install PyMissingData
You can also download it from https://github.com/arsandov/PyMissingData. The main idea is that any user can run it on their own database and can also add new features through GitHub. For a quick guide on how to use GitHub, visit https://guides.github.com/activities/hello-world/.
Requirements
This library has been created and used with Python 2.7. It is not yet compatible with Python 3.
It depends on the following packages:
The library has two modules with different goals. The bayesian_learner module lets you fill the missing values based on a Bayesian network that recovers the dependencies between variables, and the compare_tools module lets you compare the quality of predictions from different algorithms.
The bayesian_learner module allows the user to read a data file in which missing values are marked and to create a new file with those values filled in.
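import PyMissingData
# Configure the Bayesian learner with its iteration limits and p-value parameter
bf = PyMissingData.bayesian_learner.bayesian_fill(min_iterations=8, max_iterations=80, pvalparam=0.05)
# Read "data_examples.txt" and write the filled data to "data_examples_filled.txt"
bf.fill_missing_data("data_examples.txt", "data_examples_filled.txt")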
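To make the idea of filling a value from its dependencies concrete, here is a minimal toy sketch. It is not PyMissingData's algorithm or file format: it assumes a single known dependency (the second column depends on the first), uses `None` as a hypothetical missing-value marker, and simply picks the most frequent observed value for the same parent value. The data and the names `MISSING`, `rows` and `fill` are invented for illustration.

```python
from collections import Counter, defaultdict

MISSING = None  # hypothetical marker, not the one PyMissingData uses

# Toy data: (parent, child) pairs, where the child is assumed to depend on the parent
rows = [
    ("quasar", "variable"),
    ("quasar", "variable"),
    ("star", "stable"),
    ("star", "stable"),
    ("star", "variable"),
    ("quasar", MISSING),
    ("star", MISSING),
]

# Count how often each child value is observed for each parent value
counts = defaultdict(Counter)
for parent, child in rows:
    if child is not MISSING:
        counts[parent][child] += 1


def fill(row):
    """Replace a missing child with the most frequent child seen for its parent."""
    parent, child = row
    if child is MISSING and counts[parent]:
        child = counts[parent].most_common(1)[0][0]
    return (parent, child)


filled = [fill(row) for row in rows]
print(filled[5])  # the missing child of a "quasar" row, now filled
print(filled[6])  # the missing child of a "star" row, now filled
```

A Bayesian network generalizes this by learning which variables depend on which and by combining several parents at once, rather than relying on one hand-picked dependency.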
The compare_tools module allows the user to compare the quality of two predictions based on the Mean Absolute Error (MAE) and the Normalized Root Mean Squared Error (NRMSE).
The input file follows the same requirements as the file above and can be found here
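import PyMissingData as pmd
filename_input = "Training_Features.txt"  # complete data set
filename_removed = "tf_removed.txt"       # copy with values randomly removed
filename_output = "tf_filled.txt"         # copy with the removed values predicted
# Randomly remove values from the complete file
pmd.compare_tools.random_delete(filename_input, filename_removed)
bf = pmd.bayesian_learner.bayesian_fill(min_iterations=3, max_iterations=80, pvalparam=0.3)
bf.fill_missing_data(filename_removed, filename_output)
# Print the learned network as JSON
bf.json_network()
# Compare the predictions against the original values
pmd.compare_tools.compare_predictions(filename_removed, filename_input, filename_output)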
Output:
Iteration: 0
Iteration: 1
...
Iteration: 4 Better bayesian network found
...
Iteration: 37 Better bayesian network found
Iteration: 38
...
Iteration: 52
There is a cycle, the new pval is 0.335
There is a cycle, the new pval is 0.37
...
Prints the network in Json
...
Comparison of results {'real-prediction': {'mae': 0.05022275484304244, 'nrmse': 0.09962577574706931}, 'prediction-random': {'mae': 0.4017728823839466, 'nrmse': 0.47370763198282584}, 'real-random': {'mae': 0.39857917144201765, 'nrmse': 0.5716654435811355}}
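The `mae` and `nrmse` entries in the comparison can be read with the help of the following sketch, which computes both metrics for two short lists. It is an independent illustration, not the library's code: in particular, the normalization used by PyMissingData is not stated here, so dividing the RMSE by the range of the real values is an assumption (other conventions divide by the mean or the standard deviation). The function names and sample numbers are made up.

```python
import math


def mae(real, predicted):
    """Mean Absolute Error: average of |real - predicted|."""
    total = sum(abs(r - p) for r, p in zip(real, predicted))
    return total / float(len(real))


def nrmse(real, predicted):
    """Root Mean Squared Error divided by the range of the real values.

    Assumption: range normalization; the library may normalize differently.
    """
    squared = sum((r - p) ** 2 for r, p in zip(real, predicted))
    rmse = math.sqrt(squared / float(len(real)))
    value_range = max(real) - min(real)
    if value_range == 0:
        raise ValueError("real values have zero range; NRMSE is undefined")
    return rmse / value_range


# Made-up values standing in for the removed entries and their predictions
real = [0.0, 1.0, 2.0, 4.0]
predicted = [0.0, 1.0, 3.0, 3.0]

mae_value = mae(real, predicted)
nrmse_value = nrmse(real, predicted)
print(mae_value)
print(nrmse_value)
```

For both metrics lower is better, which is why the 'real-prediction' errors in the output above being well below the 'real-random' errors indicates that the filled values are much closer to the originals than random ones.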
After performing some tests we found some bugs in the library, and we are working to solve them as soon as possible. After that, we will update the webpage with the new functions!
Please remember that the library is still in alpha state. If you want to contact us, feel free to send us an email at augustocsandoval[(at)]gmail.com.