Data mining and identification of conflicting areas
Internship in an IT research laboratory
·
From June 26, 2020 to August 26, 2020
During my internship at the LIRIS research laboratory, I was tasked with finding computational methods to analyze a dataset prior to training a machine learning model. I ended up with Python modules capable of analyzing a dataset and highlighting possibly incorrect data or values that would be problematic for training.

At the very beginning of my internship, I studied the theory behind machine learning methods and technologies, applied in Python. I became familiar with machine learning concepts such as supervised/unsupervised learning and underfitting/overfitting.

Following a thesis, the idea of the project was to build scripts capable of finding incorrect data in a large dataset. The overall idea was the following: targeting a classification algorithm, the dataset has multiple input attributes that are used to predict a target attribute. The goal was then to evaluate how well the target attribute can be predicted, based on the observation that if there exist multiple instances in the dataset with the same input attributes but a different target, then no algorithm (regardless of its complexity) can be 100% successful.

Let's illustrate that with a common dataset in machine learning: the Iris dataset. It is made of data about iris flowers and contains 5 attributes: the width and length of the sepals, the width and length of the petals, and the species. A classification algorithm has to predict the species based on the width and length of the sepals and petals. Applied to this dataset, the idea of the project was to check whether two flowers have the same sepal and petal measurements while belonging to two different species.

When the comparison function between the input attributes is perfect equality, the search has a complexity of O(n·log(n)) (using a map as the data structure). However, when the comparison function allows a tolerance, the naive algorithm has a O(n²) complexity, which is not acceptable for large datasets. Thus, my goal was to find ways to perform this search within an acceptable computation time.
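For the exact-equality case, the check boils down to grouping instances by their input attributes and flagging groups that contain more than one target value. Here is a minimal sketch of that idea on the Iris dataset (illustrative code, not the modules developed during the internship):

```python
from collections import defaultdict
from sklearn.datasets import load_iris

# 4 input attributes (sepal/petal length and width) and one target (the species).
iris = load_iris()

# Group row indices by their exact input attributes.
groups = defaultdict(list)
for index, row in enumerate(iris.data):
    groups[tuple(row)].append(index)

# A group is contradictory if the same inputs map to more than one species.
for inputs, indices in groups.items():
    species = {int(iris.target[i]) for i in indices}
    if len(species) > 1:
        print(f"Contradictory inputs {inputs}: species {species}")
```

(The real Iris dataset happens to contain no such exact contradiction, so this script may print nothing; it only illustrates the principle.)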
I did a literature search to solve the problem of comparing each element of the dataset to all the other elements in a reasonable computation time. Following this, I implemented a method called Blocking. The idea behind it is the following:

Mapping - First, the values are mapped into blocks of similar values. The goal is to avoid comparisons between values that are obviously different.

Reducing - The second step is to actually compare the values, using the naive algorithm, block by block. This step can be parallelized, as each block is independent.
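Here is a minimal sketch of these two steps, assuming a single numerical input attribute and a fixed tolerance (the function and variable names are illustrative, not the actual implementation):

```python
import math
from collections import defaultdict
from itertools import combinations

def find_contradictions(values, targets, tolerance):
    """Return index pairs whose inputs differ by at most `tolerance`
    but whose targets differ (single-attribute sketch)."""
    # Mapping: assign each value to a block of width `tolerance`,
    # so that only nearby values end up in the same block.
    blocks = defaultdict(list)
    for index, value in enumerate(values):
        blocks[math.floor(value / tolerance)].append(index)

    # Reducing: run the naive pairwise comparison inside each block.
    # Blocks are independent, so this loop can be distributed over
    # worker processes (e.g. with multiprocessing.Pool).
    pairs = []
    for indices in blocks.values():
        for i, j in combinations(indices, 2):
            if abs(values[i] - values[j]) <= tolerance and targets[i] != targets[j]:
                pairs.append((i, j))
    return pairs
```

Note that this version alone is not exhaustive: two values can be within the tolerance yet fall on either side of a block boundary, which is exactly what the extra step described next addresses.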
I added a step to the process to ensure exhaustiveness, taking advantage of the fact that the tolerance of the comparison function is known. This step consists in duplicating the values at the edges of each block into the neighboring blocks, so that every value is compared to all of its neighbors, regardless of its absolute position in the dataset.

The resulting algorithm returns pairs of contradictory values, i.e. values with similar input attributes but different target attributes. These pairs can be fed into visualizations of problematic/conflicting areas, i.e. areas of the dataset that contain many contradictory values. This kind of visualization can help experts of the data source explain why contradictory values appeared there, or question the way the data was collected.

Finally, the contradictory pairs can be post-processed to search for the minimum set of values to remove from the dataset so that no contradictory pair remains. This search amounts to finding a minimum vertex cover of the graph whose vertices are the values and whose edges connect contradictory pairs. Indeed, if a single value contradicts 10 others, it is easy to conclude that this value should be removed, but with more complex networks the decision is harder to make. Finding a minimum vertex cover is proven to be NP-hard (a small sketch of this formulation is shown at the end of this section).

In the end, the developed Python scripts were able to identify in about 30 minutes the pairs of contradictory values on a dataset for which the naive algorithm would have taken about one year of computation. This project allowed me to learn the concepts and problems related to machine learning, as well as to deepen my skills in algorithms and parallel computing.
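As a toy illustration of the vertex cover formulation mentioned above, here is a simple greedy heuristic over the contradictory pairs (illustrative only; it does not guarantee a minimum cover, since the exact problem is NP-hard, and it is not necessarily the method used during the internship):

```python
def greedy_removal_set(contradictory_pairs):
    """Greedily pick the value involved in the most remaining contradictions
    until no contradictory pair is left. Heuristic, not an exact minimum."""
    remaining = set(contradictory_pairs)
    to_remove = set()
    while remaining:
        # Count how many remaining contradictions each value is involved in.
        degree = {}
        for a, b in remaining:
            degree[a] = degree.get(a, 0) + 1
            degree[b] = degree.get(b, 0) + 1
        # Remove the most contradictory value and every pair it covers.
        worst = max(degree, key=degree.get)
        to_remove.add(worst)
        remaining = {pair for pair in remaining if worst not in pair}
    return to_remove

# Example: value 0 contradicts three other values, so removing it suffices.
print(greedy_removal_set({(0, 1), (0, 2), (0, 3)}))  # -> {0}
```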
python
algorithm
machine learning
artificial intelligence
parallel computing
research
sphinx