Workshop 3 - Wednesday, June 29 from 9:00 to 12:30
Tools and algorithms for Big Data applications
Tackling challenges in Big Data analytics: a distributed random forest algorithm
Abstract: Machine learning is now a well-known and heavily used technique for turning data into value and automating decision making. Among machine learning algorithms, random forests are attracting growing interest because they are efficient and easy to interpret. However, training random forest models on large datasets remains a technical challenge. We present a random forest implementation that addresses this challenge and makes it possible to run machine learning analyses on Big Data problems.
Random forests are built as a combination of many decision trees, typically several hundred. A frequently used approach for speeding up the training of random forests is to train many of the underlying decision trees in parallel. Although efficient on traditional datasets, this method cannot be used in a Big Data context because it requires loading and replicating the dataset of interest multiple times. A better approach is to develop a decision tree algorithm that can itself handle large datasets. Any random forest model built upon such decision trees then inherits the ability to handle Big Data problems.
To take advantage of HPC clusters and to operate on distributed datasets, the decision tree algorithm we propose is based on SPMD (Single Program, Multiple Data) parallelism and on MATLAB's MPI (Message Passing Interface) API. We will present the results obtained with this approach in terms of performance and supported data size.
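As an illustration of the general idea only, and not the implementation presented in the talk, the sketch below shows one common way a decision tree split can be chosen over partitioned data in an SPMD style: every worker runs the same code on its own partition to count classes on each side of a candidate threshold, the counts are summed across workers, and the split with the lowest weighted Gini impurity is selected. It is written in Python rather than MATLAB, the workers are simulated with a plain loop, and `allreduce_sum` is a hypothetical stand-in for an MPI all-reduce. All function names, the single numeric feature, and the Gini criterion are assumptions made for this example.

```python
from collections import Counter


def local_counts(rows, thresholds, classes):
    """Per-worker step: class counts left (x <= t) and right (x > t) of each threshold."""
    counts = []
    for t in thresholds:
        left = Counter(y for x, y in rows if x <= t)
        right = Counter(y for x, y in rows if x > t)
        counts.append(([left[c] for c in classes], [right[c] for c in classes]))
    return counts


def allreduce_sum(per_worker):
    """Hypothetical stand-in for an MPI all-reduce: element-wise sum over workers."""
    total = []
    for parts in zip(*per_worker):  # one tuple of (left, right) pairs per threshold
        left = [sum(v) for v in zip(*(p[0] for p in parts))]
        right = [sum(v) for v in zip(*(p[1] for p in parts))]
        total.append((left, right))
    return total


def gini(counts):
    """Gini impurity of a node described by its per-class counts."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)


def best_split(partitions, thresholds, classes):
    """Pick the threshold with the lowest weighted Gini impurity.

    Assumes at least one row overall. Only small count tables are exchanged
    between workers; the rows themselves never leave their partition.
    """
    per_worker = [local_counts(rows, thresholds, classes) for rows in partitions]
    total = allreduce_sum(per_worker)
    best_t, best_score = None, float("inf")
    for t, (left, right) in zip(thresholds, total):
        n_left, n_right = sum(left), sum(right)
        score = (n_left * gini(left) + n_right * gini(right)) / (n_left + n_right)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score


# Two simulated workers, each holding part of a (feature, label) dataset.
partitions = [[(1.0, "a"), (2.0, "a")], [(3.0, "b"), (4.0, "b")]]
threshold, score = best_split(partitions, [1.5, 2.5, 3.5], ["a", "b"])
```

Because the aggregated counts are identical to those computed on the full dataset, the selected split does not depend on how the rows are partitioned, which is what lets this kind of tree scale with the number of workers rather than with the memory of a single machine.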
Biography: Marc Wolff is an Application Engineer at MathWorks specializing in parallel computing and Big Data. After graduating in Scientific Computing from the University of Strasbourg, Marc completed a Ph.D. in Applied Mathematics at the CEA. During his Ph.D., he contributed to the development of simulation codes that ran on very large computing infrastructures.
- TERATEC Forum is strictly reserved for professionals.
- Participation in the exhibition, conferences and workshops is free (subject to seat availability).
- Online registration is required to attend the exhibition, conferences or workshops.
- Because the Vigipirate security plan has been raised to its highest level, it is mandatory to register online in advance and to bring an identity card in order to participate in the TERATEC Forum.
- The badge is free of charge and gives you access to all TERATEC Forum events.
For any other information regarding the workshops, please contact:
Jean-Pascal JEGU
Tel: +33 (0)9 70 65 02 10
jean-pascal.jegu@teratec.fr
Campus TERATEC
2, rue de la Piquetterie
91680 BRUYERES-LE-CHATEL
France