
Python Missing Data

Authors: Augusto Sandoval, Yuan-Sen Ting, Pavlos Protopapas, Karim Pichara
Contributors: Isadora Nun

Introduction

To take full advantage of all information available, it is best to use as many available databases as possible. For example, adding u-band or X-ray information while classifying quasars based on their variability is highly likely to improve the overall performance. Because these catalogs are taken with different instruments, bandwidths, locations, times, etc., the intersection of these catalogs is smaller than any single catalog; thus, the resulting multi-catalog contains missing values. Traditional classification methods cannot deal with the resulting missing data problem because to train a classification model it is necessary to have all features for all training members.

PyMissingData predicts missing values from the observed data and the dependency relationships between variables. It also provides tools to compare different predictions, which you can use to test their behaviour on a database where all the data is present.

Installation

The library can be installed through pip as:
pip install PyMissingData
You can also download it from https://github.com/arsandov/PyMissingData. The main idea is that any user can run it on their own database and can also contribute new features through GitHub. For a quick guide on how to use GitHub, visit https://guides.github.com/activities/hello-world/.

Requirements

This library has been created and used with Python 2.7. It is not yet compatible with Python 3.

It depends on the following packages:



We recommend Anaconda, a Python distribution frequently used for science, math, engineering and data analysis, which includes all these packages and more.

Library structure

The library has two modules with different goals. The bayesian_learner module lets you fill in missing values using a Bayesian network that recovers the dependencies between variables, and the compare_tools module lets you compare the quality of predictions from different algorithms.

Bayesian Learner

This module allows the user to read a file with the data including a mark for missing values in order to create a new file with the missing values filled.

Example
In this example the values are read from the file data_examples.txt, which should meet the following requirements:
  • All values should be normalized between 0 and 1
  • Missing values should be marked with a negative number (for example -1)
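
The two requirements above could be met with a preprocessing step along these lines. This is a minimal sketch in plain Python that does not use PyMissingData itself; the helper names `normalize_columns` and `format_rows` are hypothetical, missing entries are assumed to arrive as `None`, and every column is assumed to have at least one observed value.

```python
def normalize_columns(rows, missing_marker=-1.0):
    """Min-max scale each column to [0, 1]; None entries become missing_marker."""
    n_cols = len(rows[0])
    scaled = [list(row) for row in rows]
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        lo, hi = min(observed), max(observed)
        span = (hi - lo) or 1.0  # constant column: avoid dividing by zero
        for i, row in enumerate(rows):
            if row[j] is None:
                scaled[i][j] = missing_marker
            else:
                scaled[i][j] = (row[j] - lo) / float(span)
    return scaled


def format_rows(rows, float_format="%.5f"):
    """Render rows as whitespace-separated text, one row per line."""
    return "\n".join(" ".join(float_format % v for v in row) for row in rows)


# Hypothetical raw table: two features, one missing entry.
raw = [
    [10.0, 200.0],
    [20.0, None],
    [30.0, 400.0],
]
prepared = normalize_columns(raw)
print(format_rows(prepared))
```

The printed text uses whitespace as the separator and no header row, which matches the defaults described for fill_missing_data.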

Code:
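import PyMissingData
# Configure the learner (see the init parameters below)
bf = PyMissingData.bayesian_learner.bayesian_fill(min_iterations=8, max_iterations=80, pvalparam=0.05)
# Read the marked input file and write a copy with the missing values filled
bf.fill_missing_data("data_examples.txt", "data_examples_filled.txt")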


Parameters (init):
  • min_iterations: Because greedy search can be irregular in its first steps, this sets the minimum number of iterations to run before the best solution starts being recorded.
  • max_iterations: The maximum number of iterations before stopping. Keep in mind that more iterations usually give a better result.
  • bins: Continuous data is discretized into bins. The default value is 15.
  • pvalparam: The p-value below which something is considered significantly unlikely. This value is used while creating the network structure, and its range is 0-1.
  • random_seed: An integer seed. Set it if you always need the same sequence of random numbers.
  • indegree: The upper bound on the size of a witness set (see Koller et al. 85). If this is larger than 1, a huge number of trials is required to avoid a divide-by-zero error.
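
To illustrate what the bins parameter controls, here is a minimal sketch of discretizing normalized values into 15 bins. It assumes equal-width bins over [0, 1]; the library's actual binning scheme is not stated here, and the function name `discretize` is hypothetical.

```python
def discretize(value, bins=15):
    """Map a value in [0, 1] to an equal-width bin index in 0..bins-1."""
    if value < 0:
        return None  # negative numbers mark missing values
    # value == 1.0 would land in bin `bins`, so clamp it into the last bin
    return min(int(value * bins), bins - 1)


samples = [0.0, 0.07, 0.5, 0.999, 1.0, -1.0]
bin_indices = [discretize(v) for v in samples]
print(bin_indices)
```

With more bins the discretized data follows the original values more closely, but each bin holds fewer samples.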

Parameters (fill_missing_data):
  • filename_in: Filename used as input
  • filename_out: Filename to store the output file with filled data
  • header: Indicates whether the input file has a header. The default value is None (None is different from False; read the pandas documentation for details)
  • float_format: Format for the output values. The default value is '%.5f' (read the pandas documentation for details)
  • delim_whitespace: Indicates whether values are separated by whitespace. The default value is True (read the pandas documentation for details)

Compare Tools

This module allows the user to compare the quality of two predictions based on the Mean Absolute Error (MAE) and the Normalized Root Mean Squared Error (NRMSE).
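
For reference, the two error measures can be sketched as follows. This is not the library's code: the function names are hypothetical, and the normalization is an assumption, since NRMSE has several common definitions. Here the RMSE is divided by the range of the real values.

```python
import math


def mae(real, predicted):
    """Mean Absolute Error between two equal-length sequences."""
    return sum(abs(r - p) for r, p in zip(real, predicted)) / float(len(real))


def nrmse(real, predicted):
    """Root Mean Squared Error divided by the range of the real values
    (one common normalization, assumed here)."""
    mse = sum((r - p) ** 2 for r, p in zip(real, predicted)) / float(len(real))
    value_range = max(real) - min(real)
    return math.sqrt(mse) / value_range


# Hypothetical real values and predictions for four missing entries.
real = [0.0, 0.5, 1.0, 0.25]
predicted = [0.1, 0.5, 0.8, 0.25]
print(mae(real, predicted))
print(nrmse(real, predicted))
```

Both measures are 0 for a perfect prediction and grow as the predictions move away from the real values.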

Example

The read file follows the same requirements as the file above and can be found here
import PyMissingData as pmd
filename_input = "Training_Features.txt"
filename_removed = "tf_removed.txt"
filename_output = "tf_filled.txt"
# Create a copy of the complete file with some values marked as missing
pmd.compare_tools.random_delete(filename_input, filename_removed)
bf = pmd.bayesian_learner.bayesian_fill(min_iterations=3, max_iterations=80, pvalparam=0.3)
# Fill the marked values and write the result
bf.fill_missing_data(filename_removed, filename_output)
# Print the learned network as JSON
bf.json_network()
# Compare the filled values against the real ones
pmd.compare_tools.compare_predictions(filename_removed, filename_input, filename_output)

Output:
Iteration: 0
Iteration: 1
...
Iteration: 4 Better bayesian network found
...
Iteration: 37 Better bayesian network found
Iteration: 38
...
Iteration: 52
There is a cycle, the new pval is 0.335
There is a cycle, the new pval is 0.37
...
Prints the network in Json
...
Comparison of results {'real-prediction': {'mae': 0.05022275484304244, 'nrmse': 0.09962577574706931}, 'prediction-random': {'mae': 0.4017728823839466, 'nrmse': 0.47370763198282584}, 'real-random': {'mae': 0.39857917144201765, 'nrmse': 0.5716654435811355}}

Process finished with exit code 0
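
One way to read the comparison dictionary in the output above is sketched here. The meaning of the keys is inferred from their names and is an assumption: 'real-prediction' is taken as the error of the filled values against the real data, and 'real-random' as the error of a uniform random guess against the real data.

```python
# Values copied from the example output above.
results = {
    'real-prediction': {'mae': 0.05022275484304244, 'nrmse': 0.09962577574706931},
    'prediction-random': {'mae': 0.4017728823839466, 'nrmse': 0.47370763198282584},
    'real-random': {'mae': 0.39857917144201765, 'nrmse': 0.5716654435811355},
}

# How many times smaller the prediction error is than the random baseline error.
mae_ratio = results['real-random']['mae'] / results['real-prediction']['mae']
nrmse_ratio = results['real-random']['nrmse'] / results['real-prediction']['nrmse']
print("MAE ratio (random / prediction): %.2f" % mae_ratio)
print("NRMSE ratio (random / prediction): %.2f" % nrmse_ratio)
```

Under that reading, a ratio well above 1 means the filled values are much closer to the real data than random guesses are.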
Parameters (compare_predictions()):
  • file_marked: The file with missing values marked as negative numbers
  • file_real: The file with all the values (the idea is to compare the quality of our prediction against a real dataset with all the data)
  • file_prediction: The file with the prediction we made
  • header: Indicates whether the input file has a header. The default value is None (None is different from False; read the pandas documentation for details)
  • delim_whitespace: Indicates whether values are separated by whitespace. The default value is True (read the pandas documentation for details)
  • times: The comparison is run 'times' times against a random prediction (i.e. values drawn uniformly between 0 and 1). The default value is 100
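
The role of the times parameter can be illustrated with a small sketch of a random baseline. This is not the library's code: the function name `random_baseline_mae` and the `seed` argument are hypothetical, and the sketch assumes the baseline error is averaged over the repeated random draws.

```python
import random


def random_baseline_mae(real, times=100, seed=0):
    """Average MAE of `times` uniform random predictions against `real`."""
    rng = random.Random(seed)  # local generator, so results are reproducible
    total = 0.0
    for _run in range(times):
        guess = [rng.random() for _value in real]  # uniform in [0, 1)
        total += sum(abs(r - g) for r, g in zip(real, guess)) / float(len(real))
    return total / times


# Hypothetical real values, already normalized between 0 and 1.
real_values = [0.5] * 50
baseline = random_baseline_mae(real_values)
print(baseline)
```

Averaging over many draws makes the baseline less sensitive to a single lucky or unlucky random prediction.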

References

  1. Automatic Classification of Variable Stars in Catalogs with Missing Data - Karim Pichara and Pavlos Protopapas 2013 ApJ 777 83

Library Maintenance

Dear Visitor

After performing some tests we found some bugs in the library, and we are working to solve them as soon as possible. After that, we will update the webpage with the new functions!

Please remember that the library is still in an alpha state. If you want to contact us, feel free to send us an email at augustocsandoval[(at)]gmail.com