class: center, middle, inverse, title-slide # R Avancé ## ⚔
de la donnée rangée à sa visualisation ### Rémi Dumas ### Insee - Établissement de Bordeaux ### 2019/12/12 (updated: 2020-03-06) --- class: justify, inverse # Les données ordonnées .left-column[] **Hadley Wickham** est *Chief Scientist* à **RStudio** et professeur adjoint de statistiques à l’Université d’*Auckland*. Tout au long de sa carrière, il s'est heurté à des datasets désordonnés. Ainsi, en 2014, il publie « Tidy Data », article de référence qui établit clairement ce qu’est un jeu de données « propre ». un jeu de données est propre quand : .right-column[] * **chaque variable se trouve dans une colonne** * **chaque observation compose une ligne** * **les éléments sont contenus dans le même dataset** --- class: inverse, center, middle # Par exemple --- class: left # Un tableau sur Insee.fr <table class="table table-striped table-hover" style="font-size: 16px; margin-left: auto; margin-right: auto;"> <caption style="font-size: initial !important;">Taux de chomage localisé des départements de Nouvelle-Aquitaine (en %)</caption> <thead> <tr> <th style="text-align:left;"> Code </th> <th style="text-align:left;"> Libellé </th> <th style="text-align:right;"> T3_2018 </th> <th style="text-align:right;"> T2_2019 </th> <th style="text-align:right;"> T3_2019 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 16 </td> <td style="text-align:left;"> CHARENTE </td> <td style="text-align:right;"> 8.7 </td> <td style="text-align:right;"> 8.1 </td> <td style="text-align:right;"> 8.2 </td> </tr> <tr> <td style="text-align:left;"> 17 </td> <td style="text-align:left;"> CHARENTE-MARITIME </td> <td style="text-align:right;"> 9.3 </td> <td style="text-align:right;"> 8.6 </td> <td style="text-align:right;"> 8.8 </td> </tr> <tr> <td style="text-align:left;"> 19 </td> <td style="text-align:left;"> CORREZE </td> <td style="text-align:right;"> 7.2 </td> <td style="text-align:right;"> 6.7 </td> <td style="text-align:right;"> 6.9 </td> </tr> <tr> <td style="text-align:left;"> 23 </td> <td style="text-align:left;"> CREUSE </td> <td style="text-align:right;"> 8.7 </td> <td style="text-align:right;"> 8.0 </td> <td style="text-align:right;"> 8.0 </td> </tr> <tr> <td style="text-align:left;"> 24 </td> <td style="text-align:left;"> DORDOGNE </td> <td style="text-align:right;"> 9.3 </td> <td style="text-align:right;"> 8.7 </td> <td style="text-align:right;"> 8.8 </td> </tr> <tr> <td style="text-align:left;"> 33 </td> <td style="text-align:left;"> GIRONDE </td> <td style="text-align:right;"> 8.8 </td> <td style="text-align:right;"> 8.1 </td> <td style="text-align:right;"> 8.2 </td> </tr> <tr> <td style="text-align:left;"> 40 </td> <td style="text-align:left;"> LANDES </td> <td style="text-align:right;"> 8.7 </td> <td style="text-align:right;"> 8.2 </td> <td style="text-align:right;"> 8.4 </td> </tr> <tr> <td style="text-align:left;"> 47 </td> <td style="text-align:left;"> LOT-ET-GARONNE </td> <td style="text-align:right;"> 9.0 </td> <td style="text-align:right;"> 8.7 </td> <td style="text-align:right;"> 8.8 </td> </tr> <tr> <td style="text-align:left;"> 64 </td> <td style="text-align:left;"> PYRENEES-ATLANTIQUES </td> <td style="text-align:right;"> 7.6 </td> <td style="text-align:right;"> 7.0 </td> <td style="text-align:right;"> 7.2 </td> </tr> <tr> <td style="text-align:left;"> 79 </td> <td style="text-align:left;"> DEUX-SEVRES </td> <td style="text-align:right;"> 6.7 </td> <td style="text-align:right;"> 6.2 </td> <td style="text-align:right;"> 6.3 </td> </tr> <tr> <td style="text-align:left;"> 86 </td> <td style="text-align:left;"> VIENNE </td> <td style="text-align:right;"> 7.4 </td> <td style="text-align:right;"> 6.7 </td> <td style="text-align:right;"> 6.9 </td> </tr> <tr> <td style="text-align:left;"> 87 </td> <td style="text-align:left;"> HAUTE-VIENNE </td> <td style="text-align:right;"> 8.3 </td> <td style="text-align:right;"> 7.8 </td> <td style="text-align:right;"> 7.9 </td> </tr> </tbody> </table> .footnote[ [1] Source: Insee, chômage au sens du BIT [2] Les données du dernier trimestre sont provisoires ] --- class: left # Traiter les données des tableaux Comment calculer les évolutions trimestrielles et/ou annuelles pour un tableau, un graphique, une carte ... ```r tx_chom_75 %>% mutate( 'Evolution trimestrielle' = round(100 * (T3_2019/T2_2019 - 1) , 1) ) ``` <table class="table table-striped table-hover" style="font-size: 12px; margin-left: auto; margin-right: auto;"> <caption style="font-size: initial !important;">Taux de chomage localisé des départements de Nouvelle-Aquitaine (en %)</caption> <thead> <tr> <th style="text-align:left;"> Code </th> <th style="text-align:left;"> Libellé </th> <th style="text-align:right;"> T3_2018 </th> <th style="text-align:right;"> T2_2019 </th> <th style="text-align:right;"> T3_2019 </th> <th style="text-align:right;"> Evolution trimestrielle </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 16 </td> <td style="text-align:left;"> CHARENTE </td> <td style="text-align:right;"> 8.7 </td> <td style="text-align:right;"> 8.1 </td> <td style="text-align:right;"> 8.2 </td> <td style="text-align:right;"> 1.2 </td> </tr> <tr> <td style="text-align:left;"> 17 </td> <td style="text-align:left;"> CHARENTE-MARITIME </td> <td style="text-align:right;"> 9.3 </td> <td style="text-align:right;"> 8.6 </td> <td style="text-align:right;"> 8.8 </td> <td style="text-align:right;"> 2.3 </td> </tr> <tr> <td style="text-align:left;"> 19 </td> <td style="text-align:left;"> CORREZE </td> <td style="text-align:right;"> 7.2 </td> <td style="text-align:right;"> 6.7 </td> <td style="text-align:right;"> 6.9 </td> <td style="text-align:right;"> 3.0 </td> </tr> <tr> <td style="text-align:left;"> 23 </td> <td style="text-align:left;"> CREUSE </td> <td style="text-align:right;"> 8.7 </td> <td style="text-align:right;"> 8.0 </td> <td style="text-align:right;"> 8.0 </td> <td style="text-align:right;"> 0.0 </td> </tr> <tr> <td style="text-align:left;"> 24 </td> <td style="text-align:left;"> DORDOGNE </td> <td style="text-align:right;"> 9.3 </td> <td style="text-align:right;"> 8.7 </td> <td style="text-align:right;"> 8.8 </td> <td style="text-align:right;"> 1.1 </td> </tr> <tr> <td style="text-align:left;"> 33 </td> <td style="text-align:left;"> GIRONDE </td> <td style="text-align:right;"> 8.8 </td> <td style="text-align:right;"> 8.1 </td> <td style="text-align:right;"> 8.2 </td> <td style="text-align:right;"> 1.2 </td> </tr> <tr> <td style="text-align:left;"> 40 </td> <td style="text-align:left;"> LANDES </td> <td style="text-align:right;"> 8.7 </td> <td style="text-align:right;"> 8.2 </td> <td style="text-align:right;"> 8.4 </td> <td style="text-align:right;"> 2.4 </td> </tr> <tr> <td style="text-align:left;"> 47 </td> <td style="text-align:left;"> LOT-ET-GARONNE </td> <td style="text-align:right;"> 9.0 </td> <td style="text-align:right;"> 8.7 </td> <td style="text-align:right;"> 8.8 </td> <td style="text-align:right;"> 1.1 </td> </tr> <tr> <td style="text-align:left;"> 64 </td> <td style="text-align:left;"> PYRENEES-ATLANTIQUES </td> <td style="text-align:right;"> 7.6 </td> <td style="text-align:right;"> 7.0 </td> <td style="text-align:right;"> 7.2 </td> <td style="text-align:right;"> 2.9 </td> </tr> <tr> <td style="text-align:left;"> 79 </td> <td style="text-align:left;"> DEUX-SEVRES </td> <td style="text-align:right;"> 6.7 </td> <td style="text-align:right;"> 6.2 </td> <td style="text-align:right;"> 6.3 </td> <td style="text-align:right;"> 1.6 </td> </tr> <tr> <td style="text-align:left;"> 86 </td> <td style="text-align:left;"> VIENNE </td> <td style="text-align:right;"> 7.4 </td> <td style="text-align:right;"> 6.7 </td> <td style="text-align:right;"> 6.9 </td> <td style="text-align:right;"> 3.0 </td> </tr> <tr> <td style="text-align:left;"> 87 </td> <td style="text-align:left;"> HAUTE-VIENNE </td> <td style="text-align:right;"> 8.3 </td> <td style="text-align:right;"> 7.8 </td> <td style="text-align:right;"> 7.9 </td> <td style="text-align:right;"> 1.3 </td> </tr> </tbody> </table> Problème: Lorsque les colonnes changeront, il faudra réecrire la formule de calcul. --- class: inverse, center, middle # Pré-requis ###Importer et traiter les données <center>  </center> --- class: left # Charger des données ### au format R natif - format RDS (un seul objet) ```r readRDS("my_data.rds") ``` -- - format RData (plusieurs objets) ```r load("my_data.RData") ``` -- ### import de fichiers SAS - le package **haven** ```r data <- read_sas("my_data.sas7bdat") ``` -- ### import de fichiers csv - le package **readr** ```r data <- read_csv2("my_data.csv") ``` --- class: left # Charger des données(2) ### import de fichiers ods - le package **readODS** ```r data <- read_ods("my_data.ods") ``` -- ### import de fichiers xls - le package **readxl** ```r data <- read_xls("my_data.xls") ``` -- - le package **xlsx** ```r data <- read.xlsx("my_data.xlsx") ``` --- class: left # Exporter des données vers un fichier ### au format R natif - format RDS (un seul objet) ```r saveRDS(my_data, file = "my_data.rds") ``` -- - format RData (plusieurs objets) ```r save(data1, data2, "my_data.RData") save.image("my_data.RData") ## sauve tout l'environnement de travail ``` -- ### export de fichiers SAS - le package **haven** ```r write_sas(data, path = "my_data.sas7bdat") ``` --- class: left # Exporter des données vers un fichier(2) -- ### export de fichiers csv - le package **readr** ```r write_csv2(data, path = "my_data.csv") ``` -- ### export de fichiers ods - le package **readODS** ```r write_ods(data, "my_data.ods") ``` -- ### export de fichiers xls - le package **xlsx** ```r write.xlsx2(data, file = "my_data.xlsx") ``` --- class: left # Les variables ### Les types - **Numérique**: *0, -1, 3.4, 6.789456* - **Booléen**: *TRUE, FALSE* ou *T,F* - **Caractère**: *"anticonstitutionnellment"* - **Date**: *2008-12-25* ```r class(pi) ``` ``` ## [1] "numeric" ``` -- ### L'opérateur d'assignation <- ( raccourci: *Alt + 6* ) ```r a <- pi b <- "vaut environ 3,14" c <- Sys.Date() a ``` ``` ## [1] 3.141593 ``` ```r b ``` ``` ## [1] "vaut environ 3,14" ``` ```r c ``` ``` ## [1] "2020-03-06" ``` --- class: left # Les variables numériques ### Les opérations arithmétiques ** + , - , \* , / , ^ , %% , %/% ** ### Les fonctions mathématiques ** sin , cos , tan , log , sqrt** etc ... ### Les fonctions statistiques ** mean , median , quantile , sum , min , max ** etc ... ### Les opérateurs logiques - ** < , <= , > , => , == , != ** - ** | ** (OU) - ** & ** (ET) - ** isTRUE() , is.numeric() , is.logical() , is.character() , is.Date() , is.null() , is.na() ** --- class: left # Les chaines de caractères ### La concaténation ** paste() ** et ** paste0() ** ```r a <- "Insee" b <- "Mesurer pour comprendre " paste(a, "-", b) ``` ``` ## [1] "Insee - Mesurer pour comprendre " ``` ```r paste0(a, "2025") ``` ``` ## [1] "Insee2025" ``` ### Le package *stringr* .right-column[] ```r str_trim(b) ``` ``` ## [1] "Mesurer pour comprendre" ``` ```r str_sub(a,1,3) ``` ``` ## [1] "Ins" ``` ```r str_replace_all(b , "e", "_") ``` ``` ## [1] "M_sur_r pour compr_ndr_ " ``` ```r str_split(b , " ") ``` ``` ## [[1]] ## [1] "Mesurer" "pour" "comprendre" "" ``` .footnote[(http://perso.ens-lyon.fr/lise.vaudor/Descriptoire/_book/manipuler-des-strings-package-stringr.html)] --- class: left # Les vecteurs ### La fonction collecteur: *c()* ```r vec_num <- c(122,154,132,141,147,112) vec_num*1.5 ``` ``` ## [1] 183.0 231.0 198.0 211.5 220.5 168.0 ``` ```r vec_ch <- c("Pierre", "Paul", "Jean", "Jacques", "Magali") vec_ch2 <- c(vec_ch, "Nathalie") mean(vec_num) # taille moyenne des élements du vec_num ``` ``` ## [1] 134.6667 ``` ```r vec_num2 <- c(vec_num, "John") class(vec_num) ``` ``` ## [1] "numeric" ``` ```r paste(vec_ch2,"mesure",vec_num,"cm")[1:3] ``` ``` ## [1] "Pierre mesure 122 cm" "Paul mesure 154 cm" "Jean mesure 132 cm" ``` --- class: left # Les dataframes ### Créer un dataframe: la fonction *data.frame()* ```r df <- data.frame( nom = c("Lambert", "Doussain", "Niang", "Gérard", "Laporte", "Herrand"), ville = c("Bordeaux", "Marseille", "Paris", "Bordeaux", "Paris", "Paris"), conjoint1 = c(1200, 1180, 1750, 2100, 1350, 1100), conjoint2 = c(1450, 1870, 1690, NA, 2350, NA), nb_personnes = c(4, 2, 3, 2, 5, 1), stringsAsFactors = F # pour ne pas avoir de facteurs ) df ``` ``` ## nom ville conjoint1 conjoint2 nb_personnes ## 1 Lambert Bordeaux 1200 1450 4 ## 2 Doussain Marseille 1180 1870 2 ## 3 Niang Paris 1750 1690 3 ## 4 Gérard Bordeaux 2100 NA 2 ## 5 Laporte Paris 1350 2350 5 ## 6 Herrand Paris 1100 NA 1 ``` --- class: left # Les dataframes(2) ### Extraire des lignes, extraire des colonnes ```r df[1,2] # ligne 1, colonne 2 ``` ``` ## [1] "Bordeaux" ``` ```r df[3,] # ligne 3 , toutes les colonnes ``` ``` ## nom ville conjoint1 conjoint2 nb_personnes ## 3 Niang Paris 1750 1690 3 ``` ```r df[,1] # toutes les lignes, 1ère colonne ``` ``` ## [1] "Lambert" "Doussain" "Niang" "Gérard" "Laporte" "Herrand" ``` ```r df$nom #identique au précédent ``` ``` ## [1] "Lambert" "Doussain" "Niang" "Gérard" "Laporte" "Herrand" ``` ### Le pipe **%>%** (raccourci: Ctrl + Alt + M) Enchaine les opérations de la gauche vers la droite ```r df$conjoint1 %>% mean ``` ``` ## [1] 1446.667 ``` ### Les modalités d'une variable: la fonction *unique* ```r df$ville %>% unique ``` ``` ## [1] "Bordeaux" "Marseille" "Paris" ``` --- class: left # Traiter les données avec *dplyr*  .footnote[(https://audhalbritter.com/my-top-8-dplyr-functions/)] Les verbes de dplyr: **summarize**, **slice**, **rename**, **select**, **arrange**, **mutate**, **group_by** --- class: left # *dplyr* : statistiques avec **summarise** ```r df %>% summarise( 'Nombre de ménages' = n(), 'Revenu moyen du 1er conjoint' = mean(conjoint1) ) ``` ``` ## Nombre de ménages Revenu moyen du 1er conjoint ## 1 6 1446.667 ``` --- class: left # *dplyr* : découper les données avec **slice** et **sample_n** ```r df %>% slice(1:3) # découpe les enregistrements de 1 à 3 ``` ``` ## nom ville conjoint1 conjoint2 nb_personnes ## 1 Lambert Bordeaux 1200 1450 4 ## 2 Doussain Marseille 1180 1870 2 ## 3 Niang Paris 1750 1690 3 ``` ```r df %>% sample_n(2) # découpe 2 enregistrements au hasard ``` ``` ## nom ville conjoint1 conjoint2 nb_personnes ## 1 Herrand Paris 1100 NA 1 ## 2 Lambert Bordeaux 1200 1450 4 ``` ```r df %>% sample_frac(0.25) # découpe 25% des données ``` ``` ## nom ville conjoint1 conjoint2 nb_personnes ## 1 Niang Paris 1750 1690 3 ## 2 Doussain Marseille 1180 1870 2 ``` --- class: left # *dplyr* : renommer des variables **rename** - **rename** renomme des variables ```r df %>% rename(origine = ville) # renomme la variable ville en origine ``` ``` ## nom origine conjoint1 conjoint2 nb_personnes ## 1 Lambert Bordeaux 1200 1450 4 ## 2 Doussain Marseille 1180 1870 2 ## 3 Niang Paris 1750 1690 3 ## 4 Gérard Bordeaux 2100 NA 2 ## 5 Laporte Paris 1350 2350 5 ## 6 Herrand Paris 1100 NA 1 ``` --- class: left # *dplyr* : selectionner des variables **select** - **select** selectionne et/ou réorganise et/ou renomme les variables ```r df %>% select(c(3,5)) # les 3ème et 5ème variables ``` ``` ## conjoint1 nb_personnes ## 1 1200 4 ## 2 1180 2 ## 3 1750 3 ## 4 2100 2 ## 5 1350 5 ## 6 1100 1 ``` ```r df %>% select(origine = ville) # la variable ville sera renommée ``` ``` ## origine ## 1 Bordeaux ## 2 Marseille ## 3 Paris ## 4 Bordeaux ## 5 Paris ## 6 Paris ``` --- class: left # *dplyr* : selectionner des variables **select** (2) - **select** selectionne et/ou réorganise et/ou renomme les variables ```r df %>% select(starts_with("conjoint")) # commencent par "conjoint" ``` ``` ## conjoint1 conjoint2 ## 1 1200 1450 ## 2 1180 1870 ## 3 1750 1690 ## 4 2100 NA ## 5 1350 2350 ## 6 1100 NA ``` ```r df %>% select(origine = ville, nom, nb_personnes, everything()) ``` ``` ## origine nom nb_personnes conjoint1 conjoint2 ## 1 Bordeaux Lambert 4 1200 1450 ## 2 Marseille Doussain 2 1180 1870 ## 3 Paris Niang 3 1750 1690 ## 4 Bordeaux Gérard 2 2100 NA ## 5 Paris Laporte 5 1350 2350 ## 6 Paris Herrand 1 1100 NA ``` --- class: left # *dplyr* : trier les données avec **arrange** ```r df %>% arrange(conjoint1) # tri selon conjoint1 ``` ``` ## nom ville conjoint1 conjoint2 nb_personnes ## 1 Herrand Paris 1100 NA 1 ## 2 Doussain Marseille 1180 1870 2 ## 3 Lambert Bordeaux 1200 1450 4 ## 4 Laporte Paris 1350 2350 5 ## 5 Niang Paris 1750 1690 3 ## 6 Gérard Bordeaux 2100 NA 2 ``` ```r df %>% arrange(desc(nb_personnes)) # tri décroissant ``` ``` ## nom ville conjoint1 conjoint2 nb_personnes ## 1 Laporte Paris 1350 2350 5 ## 2 Lambert Bordeaux 1200 1450 4 ## 3 Niang Paris 1750 1690 3 ## 4 Doussain Marseille 1180 1870 2 ## 5 Gérard Bordeaux 2100 NA 2 ## 6 Herrand Paris 1100 NA 1 ``` --- class: left # *dplyr* : créer des colonnes avec **mutate** ```r df %>% mutate(revenu = conjoint1 + conjoint2) # opération ``` ``` ## nom ville conjoint1 conjoint2 nb_personnes revenu ## 1 Lambert Bordeaux 1200 1450 4 2650 ## 2 Doussain Marseille 1180 1870 2 3050 ## 3 Niang Paris 1750 1690 3 3440 ## 4 Gérard Bordeaux 2100 NA 2 NA ## 5 Laporte Paris 1350 2350 5 3700 ## 6 Herrand Paris 1100 NA 1 NA ``` ```r df %>% mutate( conjoint2 = if_else(is.na(conjoint2), 0, conjoint2), revenu_par_personne = (conjoint1 + conjoint2) / nb_personnes ) ``` ``` ## nom ville conjoint1 conjoint2 nb_personnes revenu_par_personne ## 1 Lambert Bordeaux 1200 1450 4 662.500 ## 2 Doussain Marseille 1180 1870 2 1525.000 ## 3 Niang Paris 1750 1690 3 1146.667 ## 4 Gérard Bordeaux 2100 0 2 1050.000 ## 5 Laporte Paris 1350 2350 5 740.000 ## 6 Herrand Paris 1100 0 1 1100.000 ``` --- class: left # *dplyr* : calculs en ligne avec *rowwise()* ```r df %>% mutate( revenu_par_personne = sum(conjoint1, conjoint2, na.rm = T) / nb_personnes ) # calcule la somme de la colonne ``` ``` ## nom ville conjoint1 conjoint2 nb_personnes revenu_par_personne ## 1 Lambert Bordeaux 1200 1450 4 4010.000 ## 2 Doussain Marseille 1180 1870 2 8020.000 ## 3 Niang Paris 1750 1690 3 5346.667 ## 4 Gérard Bordeaux 2100 NA 2 8020.000 ## 5 Laporte Paris 1350 2350 5 3208.000 ## 6 Herrand Paris 1100 NA 1 16040.000 ``` ```r df %>% rowwise() %>% mutate( revenu_par_personne = sum(conjoint1, conjoint2, na.rm = T) / nb_personnes ) # la somme ne se fait que pour la ligne courante ``` ``` ## Source: local data frame [6 x 6] ## Groups: <by row> ## ## # A tibble: 6 x 6 ## nom ville conjoint1 conjoint2 nb_personnes revenu_par_personne ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Lambert Bordeaux 1200 1450 4 662. ## 2 Doussain Marseille 1180 1870 2 1525 ## 3 Niang Paris 1750 1690 3 1147. ## 4 Gérard Bordeaux 2100 NA 2 1050 ## 5 Laporte Paris 1350 2350 5 740 ## 6 Herrand Paris 1100 NA 1 1100 ``` --- class: left # *dplyr* : statistiques par modalités avec *group_by* ```r df %>% rowwise() %>% mutate(total = sum(conjoint1, conjoint2, na.rm = T)) %>% ungroup %>% # pour "oublier" le comptage par ligne group_by(ville) %>% summarise(`Nombre de ménages` = n(), `Revenu moyen` = mean(total), `Nombre de personnes` = sum(nb_personnes), `Taille moyenne` = `Nombre de personnes`/`Nombre de ménages` ) ``` ``` ## # A tibble: 3 x 5 ## ville `Nombre de ménages` `Revenu moyen` `Nombre de personnes` `Taille moyenne` ## <chr> <int> <dbl> <dbl> <dbl> ## 1 Bordeaux 2 2375 6 3 ## 2 Marseille 1 3050 2 2 ## 3 Paris 3 2747. 9 3 ``` --- class: left # *dplyr* : Selection de lignes .left-column[] ```r a <- df %>% select(nom,ville) b <- data.frame( nom = c("Gérard", "N'Guyen", "Lambert", "Dupont", "Niang"), ville = c("Bordeaux", "Marseille", "Paris", "Bordeaux", "Paris"), stringsAsFactors = F ) ``` .right-column[ ### union() ```r union(a,b) ``` ``` ## nom ville ## 1 Lambert Bordeaux ## 2 Doussain Marseille ## 3 Niang Paris ## 4 Gérard Bordeaux ## 5 Laporte Paris ## 6 Herrand Paris ## 7 N'Guyen Marseille ## 8 Lambert Paris ## 9 Dupont Bordeaux ``` ] ### intersect() ```r intersect(a,b) ``` ``` ## nom ville ## 1 Niang Paris ## 2 Gérard Bordeaux ``` ### setdiff() ```r setdiff(a,b) ``` ``` ## nom ville ## 1 Lambert Bordeaux ## 2 Doussain Marseille ## 3 Laporte Paris ## 4 Herrand Paris ``` --- class: left # *dplyr* : Concaténation .left-column[] ```r c <- data.frame( age = c(25, 45, 51, 18, 33, 40) ) ``` </br></br> .right-column[ ### bind_cols() ```r bind_cols(a,c) ``` ``` ## nom ville age ## 1 Lambert Bordeaux 25 ## 2 Doussain Marseille 45 ## 3 Niang Paris 51 ## 4 Gérard Bordeaux 18 ## 5 Laporte Paris 33 ## 6 Herrand Paris 40 ``` ] ### bind_rows() ```r bind_rows(a,b) ``` ``` ## nom ville ## 1 Lambert Bordeaux ## 2 Doussain Marseille ## 3 Niang Paris ## 4 Gérard Bordeaux ## 5 Laporte Paris ## 6 Herrand Paris ## 7 Gérard Bordeaux ## 8 N'Guyen Marseille ## 9 Lambert Paris ## 10 Dupont Bordeaux ## 11 Niang Paris ``` --- class: left # *dplyr* : Jointures .right-column[] ### left_join() Enrichit la table de gauche avec de nouvelles colonnes ### right_join() Enrichit la table de droite avec de nouvelles colonnes ### inner_join() Joint toutes les colonnes des lignes communes aux deux tables ### full_join() Joint toutes les colonnes ### semi_join() Les lignes de la table de gauche présentes dans celle de droite ### anti_join() Les lignes de la table de gauche qui ne sont pas présentes dans celle de droite --- class: inverse, center, middle # Ranger un jeu de données ### [taɪdi]   --- class: left # Rendre les données exploitables .left-column[] Les extensions du **tidyverse** comme *dplyr* ou *ggplot2* utilisent des données “rangées”, appelées *tidy data*. Prenons un exemple avec les données suivantes, qui indique la population de trois pays pour quatre années différentes : ### Les données non brutes <table class="table" style="font-size: 12px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> country </th> <th style="text-align:right;"> 1992 </th> <th style="text-align:right;"> 1997 </th> <th style="text-align:right;"> 2002 </th> <th style="text-align:right;"> 2007 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Belgium </td> <td style="text-align:right;"> 10045622 </td> <td style="text-align:right;"> 10199787 </td> <td style="text-align:right;"> 10311970 </td> <td style="text-align:right;"> 10392226 </td> </tr> <tr> <td style="text-align:left;"> France </td> <td style="text-align:right;"> 57374179 </td> <td style="text-align:right;"> 58623428 </td> <td style="text-align:right;"> 59925035 </td> <td style="text-align:right;"> 61083916 </td> </tr> <tr> <td style="text-align:left;"> Germany </td> <td style="text-align:right;"> 80597764 </td> <td style="text-align:right;"> 82011073 </td> <td style="text-align:right;"> 82350671 </td> <td style="text-align:right;"> 82400996 </td> </tr> </tbody> </table> Les années sont en colonne, ce qui signifie que chaque année est un "indicateur" (une variable). Or, pour représenter une variable, il faut qu'elle soit relative à des "modalités", chaque pays représentant une "série" (une courbe différente). --- class: left # Utilisation des deux types de tableaux .right-column[] Les deux exemples de données brutes ci-contre démontrent l'utilité d'avoir un jeu "rangé". Ce qui ne convient pas: - les colonnes du premier jeu comprennent deux informations: l'épice utilisé et l'ordre d'utilisation - les deux tableaux correspondent aux mêmes données et sont transposables - il sera plus difficile de faire des comptages, des modélisations ou des représentations graphiques avec un tableau de contingence qu'avec un tableau de données ordonnées et rangées. --- class: left # Rassembler les colonnes: *gather* ### Syntaxe .right-column[]Le verbe **gather** va transformer les colonnes qui contiennent ces valeurs en lignes selon le schéma vu à la diapo précédente. La syntaxe est la suivante: </br> ```r gather(data, key = variable, value = valeur, -id) ``` ou avec le pipe %>%: ```r data %>% gather(key = variable, value = valeur, -id) ``` - *data* est le jeu de données à ordonner - *variables* sera le nom de la colonne qui rassemblera les variables - *valeurs* sera le nom de la colonne qui rassemblera les valeurs > Attention: Le verbe *gather* est remplacé par *pivot_longer*. La syntaxe est `pivot_longer(data, liste_de_colonnes, names_to , values_to)` --- class: left # Exemple d'utilisation de *gather* Le tableau suivant contient les valeurs des populations et des espérances de vie pour 2002 et 2007. En voici les 10 premières lignes: <table class="table" style="font-size: 12px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> country </th> <th style="text-align:left;"> continent </th> <th style="text-align:right;"> pop_2002 </th> <th style="text-align:right;"> pop_2007 </th> <th style="text-align:right;"> lifeExp_2002 </th> <th style="text-align:right;"> lifeExp_2007 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:left;"> Asia </td> <td style="text-align:right;"> 25268405 </td> <td style="text-align:right;"> 31889923 </td> <td style="text-align:right;"> 42.129 </td> <td style="text-align:right;"> 43.828 </td> </tr> <tr> <td style="text-align:left;"> Albania </td> <td style="text-align:left;"> Europe </td> <td style="text-align:right;"> 3508512 </td> <td style="text-align:right;"> 3600523 </td> <td style="text-align:right;"> 75.651 </td> <td style="text-align:right;"> 76.423 </td> </tr> <tr> <td style="text-align:left;"> Algeria </td> <td style="text-align:left;"> Africa </td> <td style="text-align:right;"> 31287142 </td> <td style="text-align:right;"> 33333216 </td> <td style="text-align:right;"> 70.994 </td> <td style="text-align:right;"> 72.301 </td> </tr> <tr> <td style="text-align:left;"> Angola </td> <td style="text-align:left;"> Africa </td> <td style="text-align:right;"> 10866106 </td> <td style="text-align:right;"> 12420476 </td> <td style="text-align:right;"> 41.003 </td> <td style="text-align:right;"> 42.731 </td> </tr> <tr> <td style="text-align:left;"> Argentina </td> <td style="text-align:left;"> Americas </td> <td style="text-align:right;"> 38331121 </td> <td style="text-align:right;"> 40301927 </td> <td style="text-align:right;"> 74.340 </td> <td style="text-align:right;"> 75.320 </td> </tr> <tr> <td style="text-align:left;"> Australia </td> <td style="text-align:left;"> Oceania </td> <td style="text-align:right;"> 19546792 </td> <td style="text-align:right;"> 20434176 </td> <td style="text-align:right;"> 80.370 </td> <td style="text-align:right;"> 81.235 </td> </tr> <tr> <td style="text-align:left;"> Austria </td> <td style="text-align:left;"> Europe </td> <td style="text-align:right;"> 8148312 </td> <td style="text-align:right;"> 8199783 </td> <td style="text-align:right;"> 78.980 </td> <td style="text-align:right;"> 79.829 </td> </tr> <tr> <td style="text-align:left;"> Bahrain </td> <td style="text-align:left;"> Asia </td> <td style="text-align:right;"> 656397 </td> <td style="text-align:right;"> 708573 </td> <td style="text-align:right;"> 74.795 </td> <td style="text-align:right;"> 75.635 </td> </tr> <tr> <td style="text-align:left;"> Bangladesh </td> <td style="text-align:left;"> Asia </td> <td style="text-align:right;"> 135656790 </td> <td style="text-align:right;"> 150448339 </td> <td style="text-align:right;"> 62.013 </td> <td style="text-align:right;"> 64.062 </td> </tr> <tr> <td style="text-align:left;"> Belgium </td> <td style="text-align:left;"> Europe </td> <td style="text-align:right;"> 10311970 </td> <td style="text-align:right;"> 10392226 </td> <td style="text-align:right;"> 78.320 </td> <td style="text-align:right;"> 79.441 </td> </tr> </tbody> </table> Nous allons rassembler les colonnes pop_2002, pop_2007, lifeExp_2002, lifeExp_2007: ```r gather(gapmind_messy, key = variables, value = valeur, -c(country, continent)) ``` ``` ## # A tibble: 6 x 4 ## country continent variables valeur ## <chr> <chr> <chr> <dbl> ## 1 Namibia Africa lifeExp_2002 51.5 ## 2 United States Americas lifeExp_2007 78.2 ## 3 Ghana Africa pop_2007 22873338 ## 4 Reunion Africa lifeExp_2007 76.4 ## 5 Mozambique Africa lifeExp_2007 42.1 ## 6 Central African Republic Africa lifeExp_2002 43.3 ``` --- class: left # Conséquence d'un *gather* sur les dimensions du jeu de données ### Le jeu de données avant *gather* ``` ## Classes 'tbl_df', 'tbl' and 'data.frame': 142 obs. of 6 variables: ## $ country : chr "Afghanistan" "Albania" "Algeria" "Angola" ... ## $ continent : chr "Asia" "Europe" "Africa" "Africa" ... ## $ pop_2002 : int 25268405 3508512 31287142 10866106 38331121 19546792 8148312 656397 135656790 10311970 ... ## $ pop_2007 : int 31889923 3600523 33333216 12420476 40301927 20434176 8199783 708573 150448339 10392226 ... ## $ lifeExp_2002: num 42.1 75.7 71 41 74.3 ... ## $ lifeExp_2007: num 43.8 76.4 72.3 42.7 75.3 ... ``` ### Le jeu de données après *gather* ``` ## Classes 'tbl_df', 'tbl' and 'data.frame': 568 obs. of 4 variables: ## $ country : chr "Afghanistan" "Albania" "Algeria" "Angola" ... ## $ continent: chr "Asia" "Europe" "Africa" "Africa" ... ## $ variables: chr "pop_2002" "pop_2002" "pop_2002" "pop_2002" ... ## $ valeur : num 25268405 3508512 31287142 10866106 38331121 ... ``` On remarque que le nombre de lignes est passé de 142 à 568. En effet, 4 colonnes ont été rassemblées, soit une multiplication par quatre du nombre de lignes. --- class: left # Séparer une colonne: *separate* ### Syntaxe .right-column[]Le verbe **separate** va retrouver les informations multiples contenues dans une colonne et les répartir dans des colonnes différentes. La syntaxe est la suivante: </br></br> ```r separate(data, col = colonne, into = c("col1", "col2"), se = "chaine") ``` - *data* est le jeu de données - *colonne* est le nom de la colonne qui contient les informations - *into* est le nom des colonnes qui contiendront les informations separées - *separate* est la chaine de caractère qui sépare les informations dans *colonne* --- class: left # Exemple d'utilisation de *separate* ### Exemple Dans le tableau précédent, la colonne *variables* contient deux informations: la variable et l'année. On va séparer cette colonne en deux, une colonne *variable* et une colonne *annee*. ```r separate(gapmind, col = variables, into = c("variable", "annee"), sep = "_") ``` <table class="table table-striped table-hover table-condensed" style="font-size: 10px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> country </th> <th style="text-align:left;"> continent </th> <th style="text-align:left;"> variable </th> <th style="text-align:left;"> annee </th> <th style="text-align:right;"> valeur </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Hungary </td> <td style="text-align:left;"> Europe </td> <td style="text-align:left;"> pop </td> <td style="text-align:left;"> 2007 </td> <td style="text-align:right;"> 9956108.000 </td> </tr> <tr> <td style="text-align:left;"> Liberia </td> <td style="text-align:left;"> Africa </td> <td style="text-align:left;"> lifeExp </td> <td style="text-align:left;"> 2002 </td> <td style="text-align:right;"> 43.753 </td> </tr> <tr> <td style="text-align:left;"> Cambodia </td> <td style="text-align:left;"> Asia </td> <td style="text-align:left;"> pop </td> <td style="text-align:left;"> 2002 </td> <td style="text-align:right;"> 12926707.000 </td> </tr> <tr> <td style="text-align:left;"> Portugal </td> <td style="text-align:left;"> Europe </td> <td style="text-align:left;"> lifeExp </td> <td style="text-align:left;"> 2007 </td> <td style="text-align:right;"> 78.098 </td> </tr> <tr> <td style="text-align:left;"> Afghanistan </td> <td style="text-align:left;"> Asia </td> <td style="text-align:left;"> lifeExp </td> <td style="text-align:left;"> 2002 </td> <td style="text-align:right;"> 42.129 </td> </tr> <tr> <td style="text-align:left;"> Botswana </td> <td style="text-align:left;"> Africa </td> <td style="text-align:left;"> lifeExp </td> <td style="text-align:left;"> 2002 </td> <td style="text-align:right;"> 46.634 </td> </tr> <tr> <td style="text-align:left;"> Romania </td> <td style="text-align:left;"> Europe </td> <td style="text-align:left;"> lifeExp </td> <td style="text-align:left;"> 2007 </td> <td style="text-align:right;"> 72.476 </td> </tr> <tr> <td style="text-align:left;"> Tanzania </td> <td style="text-align:left;"> Africa </td> <td style="text-align:left;"> lifeExp </td> <td style="text-align:left;"> 2007 </td> <td style="text-align:right;"> 52.517 </td> </tr> </tbody> </table> --- class: left # Le verbe **extract** **extract** permet de créer de nouvelles colonnes à partir de sous-chaînes d’une colonne de texte existante, identifiées par des groupes dans une expression régulière. ```r df <- data.frame( eleve = c("Félicien Machin", "Raymonde Bidule", "Martial Truc"), note = c("5/20", "6/10", "87/100"), stringsAsFactors = FALSE) df %>% extract(eleve, c("initiale_prenom", "initiale_nom"), "^(.).* (.).*$", remove = FALSE) ``` ``` ## eleve initiale_prenom initiale_nom note ## 1 Félicien Machin F M 5/20 ## 2 Raymonde Bidule R B 6/10 ## 3 Martial Truc M T 87/100 ``` --- class: left # Regrouper plusieurs colonnes: *unite* ### Syntaxe ```r unite(data, col_a_creer, col_a_regrouper, separateur) ``` ### Exemple Soit le tableau suivant: ``` ## dep codcom libcom ## 1 01 001 L'Abergement-Clémenciat ## 2 01 002 L'Abergement-de-Varey ## 3 01 004 Ambérieu-en-Bugey ## 4 01 005 Ambérieux-en-Dombes ## 5 01 006 Ambléon ## 6 01 007 Ambronay ``` Regroupons les colonnes *dep* et *codcom* en une seule colonne, *codgeo*: ```r listecom %>% unite(codgeo, dep, codcom, sep="") %>% head(6) ``` ``` ## codgeo libcom ## 1 01001 L'Abergement-Clémenciat ## 2 01002 L'Abergement-de-Varey ## 3 01004 Ambérieu-en-Bugey ## 4 01005 Ambérieux-en-Dombes ## 5 01006 Ambléon ## 6 01007 Ambronay ``` --- class: left # Completer un jeu de données: *complete* ### Syntaxe ```r complete(data, col_a_combiner, fill = liste_valeurs) ``` ### Exemple Quatre personnes ont visité des villes d'Italie et leur ont attribué une note. <table class="table table-striped table-hover table-condensed" style="font-size: 8px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Nom </th> <th style="text-align:left;"> Ville </th> <th style="text-align:left;"> Pays </th> <th style="text-align:right;"> Note </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Pierre </td> <td style="text-align:left;"> Venise </td> <td style="text-align:left;"> Italie </td> <td style="text-align:right;"> 4 </td> </tr> <tr> <td style="text-align:left;"> Vanessa </td> <td style="text-align:left;"> Venise </td> <td style="text-align:left;"> Italie </td> <td style="text-align:right;"> 3 </td> </tr> <tr> <td style="text-align:left;"> Louis </td> <td style="text-align:left;"> Venise </td> <td style="text-align:left;"> Italie </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:left;"> Anaïs </td> <td style="text-align:left;"> Venise </td> <td style="text-align:left;"> Italie </td> <td style="text-align:right;"> 5 </td> </tr> <tr> <td style="text-align:left;"> Pierre </td> <td style="text-align:left;"> Rome </td> <td style="text-align:left;"> Italie </td> <td style="text-align:right;"> 4 </td> </tr> <tr> <td style="text-align:left;"> Louis </td> <td style="text-align:left;"> Rome </td> <td style="text-align:left;"> Italie </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:left;"> Anaïs </td> <td style="text-align:left;"> Rome </td> <td style="text-align:left;"> Italie </td> <td style="text-align:right;"> 2 </td> </tr> <tr> <td style="text-align:left;"> Pierre </td> <td style="text-align:left;"> Naples </td> <td style="text-align:left;"> Italie </td> <td style="text-align:right;"> 3 </td> </tr> <tr> <td style="text-align:left;"> Vanessa </td> <td style="text-align:left;"> Naples </td> <td style="text-align:left;"> Italie </td> <td style="text-align:right;"> 2 </td> </tr> <tr> <td style="text-align:left;"> Louis </td> <td style="text-align:left;"> Naples </td> <td style="text-align:left;"> Italie </td> <td style="text-align:right;"> 5 </td> </tr> </tbody> </table> Il manque deux combinaisons Nom, Ville. Le verbe **complete** va créer les enregistrements avec les valeurs *Italie* et *NA* pour les champs Pays et Note. ```r complete(data, Nom, Ville, fill = list(Pays = "Italie", Note = NA)) ``` <table class="table table-striped table-hover table-condensed" style="font-size: 8px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Nom </th> <th style="text-align:left;"> Ville </th> <th style="text-align:left;"> Pays </th> <th style="text-align:right;"> Note </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Anaïs </td> <td style="text-align:left;"> Naples </td> <td style="text-align:left;"> Italie </td> <td style="text-align:right;"> NA </td> </tr> <tr> <td style="text-align:left;"> Anaïs </td> <td style="text-align:left;"> Rome </td> <td style="text-align:left;"> Italie </td> <td style="text-align:right;"> 2 </td> </tr> <tr> <td style="text-align:left;"> Anaïs </td> <td style="text-align:left;"> Venise </td> <td style="text-align:left;"> Italie </td> <td style="text-align:right;"> 5 </td> </tr> <tr> <td style="text-align:left;"> Louis </td> <td style="text-align:left;"> Naples </td> <td style="text-align:left;"> Italie </td> <td style="text-align:right;"> 5 </td> </tr> <tr> <td style="text-align:left;"> Louis </td> <td style="text-align:left;"> Rome </td> <td style="text-align:left;"> Italie </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:left;"> Louis </td> <td style="text-align:left;"> Venise </td> <td style="text-align:left;"> Italie </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:left;"> Pierre </td> <td style="text-align:left;"> Naples </td> <td style="text-align:left;"> Italie </td> <td style="text-align:right;"> 3 </td> </tr> <tr> <td style="text-align:left;"> Pierre </td> <td style="text-align:left;"> Rome </td> <td style="text-align:left;"> Italie </td> <td style="text-align:right;"> 4 </td> </tr> <tr> <td style="text-align:left;"> Pierre </td> <td style="text-align:left;"> Venise </td> <td style="text-align:left;"> Italie </td> <td style="text-align:right;"> 4 </td> </tr> <tr> <td style="text-align:left;"> Vanessa </td> <td style="text-align:left;"> Naples </td> <td style="text-align:left;"> Italie </td> <td style="text-align:right;"> 2 </td> </tr> <tr> <td style="text-align:left;"> Vanessa </td> <td style="text-align:left;"> Rome </td> <td style="text-align:left;"> Italie </td> <td style="text-align:right;"> NA </td> </tr> <tr> <td style="text-align:left;"> Vanessa </td> <td style="text-align:left;"> Venise </td> <td style="text-align:left;"> Italie </td> <td style="text-align:right;"> 3 </td> </tr> </tbody> </table> --- class: left # Réaliser un tableau: *spread* .right-column[] Le verbe **spread** fait le contraire de **gather** à savoir répartir les modalités d'une variable et leurs valeurs comme nouvelles colonnes du tableau. ### Syntaxe ```r spread(data, key = variables , value = valeurs, fill = liste_valeurs) ``` ### Exemple Reprenons notre exemple avant l'application du verbe **unite**. ```r data %>% spread(key = Ville, value = Note, fill = "nc") ``` ``` ## Nom Pays Naples Rome Venise ## 1 Anaïs Italie nc 2 5 ## 2 Louis Italie 5 1 1 ## 3 Pierre Italie 3 4 4 ## 4 Vanessa Italie 2 nc 3 ``` --- class: left # Exemple récapitulatif 1/ .right-column[]On télecharge le fichier de la série des taux de chômage par département sur Insee.fr<sup>1</sup>. On souhaite réaliser le tableau ci-contre, présent sur le tableau de bord de la conjoncture de Nouvelle-Aquitaine<sup>2</sup>. .footnote[ [1] https://www.insee.fr/fr/statistiques/series/102760732 [2] https://www.insee.fr/fr/statistiques/2121832#alpc_0104 ] Les données, sont dans un tableau dont les deux premières colonnes sont le code et le libellé du département et les suivantes correspondent chacune à un trimestre. ```r tx_chom <- readxl::read_xls("data/tx_chom.xls") tx_chom[1:4,1:10] ``` ``` ## # A tibble: 4 x 10 ## Code Libellé T1_1982 T2_1982 T3_1982 T4_1982 T1_1983 T2_1983 T3_1983 T4_1983 ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 01 AIN 3.6 3.7 3.8 4 4 4 4.2 4.4 ## 2 02 AISNE 8.1 8.3 8.4 8.3 8.4 8.4 8.7 9.1 ## 3 03 ALLIER 7 7.2 7.3 7.4 7.4 7.4 7.5 8 ## 4 04 ALPES-DE-HAUTE-PROVENCE 5.2 5.4 5.5 5.8 5.9 5.9 6 6.4 ``` --- class:left # Exemple récapitulatif 2/ On récupère le trimestre le plus récent, et on construit la liste des trimestres qui devront apparaitre dans notre tableau final. ```r trim <- as.numeric(str_sub(last(names(tx_chom)),2,2)) ann <- as.numeric(str_sub(last(names(tx_chom)),4,7)) trimestres <- c( paste0("T", trim, "_", ann - 1), paste0("T", 1 + ( trim + 2) %% 4, "_", ann), paste0("T", trim, "_", ann) ) trimestres ``` ``` ## [1] "T3_2018" "T2_2019" "T3_2019" ``` On charge un table de zonages pour enrichir le tableau avec la région. ```r load("data/zonages.Rda") tx_chom <- tx_chom %>% inner_join(dc_zonages %>% distinct(dep, reg) , by = c("Code" = "dep")) %>% inner_join(lreg, by = c("reg")) %>% select(dep = Code, ldep = Libellé, Région = lreg, everything(), -reg) ``` ``` ## # A tibble: 3 x 6 ## dep ldep Région T1_1982 T2_1982 T3_1982 ## <chr> <chr> <chr> <dbl> <dbl> <dbl> ## 1 01 AIN Auvergne-Rhône-Alpes 3.6 3.7 3.8 ## 2 02 AISNE Hauts-de-France 8.1 8.3 8.4 ## 3 03 ALLIER Auvergne-Rhône-Alpes 7 7.2 7.3 ``` --- class: inverse, center, middle # Les tableaux ###Réalisation et mise en forme  --- class: inverse, center, middle # Les graphiques ###Création, mise en forme et export  --- # Prout Install the **xaringan** package from [Github](https://github.com/yihui/xaringan): ```r devtools::install_github("yihui/xaringan") ``` -- You are recommended to use the [RStudio IDE](https://www.rstudio.com/products/rstudio/), but you do not have to. - Create a new R Markdown document from the menu `File -> New File -> R Markdown -> From Template -> Ninja Presentation`;<sup>1</sup> -- - Click the `Knit` button to compile it; -- - or use the [RStudio Addin](https://rstudio.github.io/rstudioaddins/)<sup>2</sup> "Infinite Moon Reader" to live preview the slides (every time you update and save the Rmd document, the slides will be automatically reloaded in RStudio Viewer. .footnote[ [1] 中文用户请看[这份教程](http://slides.yihui.name/xaringan/zh-CN.html) [2] See [#2](https://github.com/yihui/xaringan/issues/2) if you do not see the template or addin in RStudio. ] --- background-image: url(https://github.com/yihui/xaringan/releases/download/v0.0.2/karl-moustache.jpg) background-position: 50% 50% class: center, bottom, inverse # You only live once! --- # Hello Ninja As a presentation à la con ninja, you certainly should not be satisfied by the "Hello World" example. You need to understand more about two things: 1. The [remark.js](https://remarkjs.com) library; 1. The **xaringan** package; Basically **xaringan** *injected* the chakra of R Markdown (minus Pandoc) into **remark.js**. The slides are rendered by remark.js in the web browser, and the Markdown source needed by remark.js is generated from R Markdown (**knitr**). --- # remark.js You can see an introduction of remark.js from [its homepage](https://remarkjs.com). You should read the [remark.js Wiki](https://github.com/gnab/remark/wiki) at least once to know how to - create a new slide (Markdown syntax<sup>*</sup> and slide properties); - format a slide (e.g. text alignment); - configure the slideshow; - and use the presentation (keyboard shortcuts). It is important to be familiar with remark.js before you can understand the options in **xaringan**. .footnote[[*] It is different with Pandoc's Markdown! It is limited but should be enough for presentation purposes. Come on... You do not need a slide for the Table of Contents! Well, the Markdown support in remark.js [may be improved](https://github.com/gnab/remark/issues/142) in the future.] --- background-image: url(https://github.com/yihui/xaringan/releases/download/v0.0.2/karl-moustache.jpg) background-size: cover class: center, bottom, inverse # I was so happy to have discovered remark.js! --- class: inverse, middle, center # Using xaringan --- # xaringan Provides an R Markdown output format `xaringan::moon_reader` as a wrapper for remark.js, and you can use it in the YAML metadata, e.g. ```yaml --- title: "A Cool Presentation" output: xaringan::moon_reader: yolo: true nature: autoplay: 30000 --- ``` See the help page `?xaringan::moon_reader` for all possible options that you can use. --- # remark.js vs xaringan Some differences between using remark.js (left) and using **xaringan** (right): .pull-left[ 1. Start with a boilerplate HTML file; 1. Plain Markdown; 1. Write JavaScript to autoplay slides; 1. Manually configure MathJax; 1. Highlight code with `*`; 1. Edit Markdown source and refresh browser to see updated slides; ] .pull-right[ 1. Start with an R Markdown document; 1. R Markdown (can embed R/other code chunks); 1. Provide an option `autoplay`; 1. MathJax just works;<sup>*</sup> 1. Highlight code with `{{john}}`; 1. The RStudio addin "Infinite Moon Reader" automatically refreshes slides on changes; ] .footnote[[*] Not really. See next page.] --- # Math Expressions You can write LaTeX math expressions inside a pair of dollar signs, e.g. $\alpha+\beta$ renders `\(\alpha+\beta\)`. You can use the display style with double dollar signs: ``` $$\bar{X}=\frac{1}{n}\sum_{i=1}^nX_i$$ ``` `$$\bar{X}=\frac{1}{n}\sum_{i=1}^nX_i$$` Limitations: 1. The source code of a LaTeX math expression must be in one line, unless it is inside a pair of double dollar signs, in which case the starting `$$` must appear in the very beginning of a line, followed immediately by a non-space character, and the ending `$$` must be at the end of a line, led by a non-space character; 1. There should not be spaces after the opening `$` or before the closing `$`. 1. Math does not work on the title slide (see [#61](https://github.com/yihui/xaringan/issues/61) for a workaround). --- # R Code ```r # a boring regression fit = lm(dist ~ 1 + speed, data = cars) coef(summary(fit)) ``` ``` # Estimate Std. Error t value Pr(>|t|) # (Intercept) -17.579095 6.7584402 -2.601058 1.231882e-02 # speed 3.932409 0.4155128 9.463990 1.489836e-12 ``` ```r dojutsu = c('地爆天星', '天照', '加具土命', '神威', '須佐能乎', '無限月読') grep('天', dojutsu, value = TRUE) ``` ``` # character(0) ``` --- # R Plots ```r par(mar = c(4, 4, 1, .1)) plot(cars, pch = 19, col = 'darkgray', las = 1) abline(fit, lwd = 2) ``` <!-- --> --- # Tables If you want to generate a table, make sure it is in the HTML format (instead of Markdown or other formats), e.g., ```r knitr::kable(head(iris), format = 'html') ``` <table> <thead> <tr> <th style="text-align:right;"> Sepal.Length </th> <th style="text-align:right;"> Sepal.Width </th> <th style="text-align:right;"> Petal.Length </th> <th style="text-align:right;"> Petal.Width </th> <th style="text-align:left;"> Species </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 5.1 </td> <td style="text-align:right;"> 3.5 </td> <td style="text-align:right;"> 1.4 </td> <td style="text-align:right;"> 0.2 </td> <td style="text-align:left;"> setosa </td> </tr> <tr> <td style="text-align:right;"> 4.9 </td> <td style="text-align:right;"> 3.0 </td> <td style="text-align:right;"> 1.4 </td> <td style="text-align:right;"> 0.2 </td> <td style="text-align:left;"> setosa </td> </tr> <tr> <td style="text-align:right;"> 4.7 </td> <td style="text-align:right;"> 3.2 </td> <td style="text-align:right;"> 1.3 </td> <td style="text-align:right;"> 0.2 </td> <td style="text-align:left;"> setosa </td> </tr> <tr> <td style="text-align:right;"> 4.6 </td> <td style="text-align:right;"> 3.1 </td> <td style="text-align:right;"> 1.5 </td> <td style="text-align:right;"> 0.2 </td> <td style="text-align:left;"> setosa </td> </tr> <tr> <td style="text-align:right;"> 5.0 </td> <td style="text-align:right;"> 3.6 </td> <td style="text-align:right;"> 1.4 </td> <td style="text-align:right;"> 0.2 </td> <td style="text-align:left;"> setosa </td> </tr> <tr> <td style="text-align:right;"> 5.4 </td> <td style="text-align:right;"> 3.9 </td> <td style="text-align:right;"> 1.7 </td> <td style="text-align:right;"> 0.4 </td> <td style="text-align:left;"> setosa </td> </tr> </tbody> </table> --- # HTML Widgets I have not thoroughly tested HTML widgets against **xaringan**. Some may work well, and some may not. It is a little tricky. Similarly, the Shiny mode (`runtime: shiny`) does not work. I might get these issues fixed in the future, but these are not of high priority to me. I never turn my presentation into a Shiny app. When I need to demonstrate more complicated examples, I just launch them separately. It is convenient to share slides with other people when they are plain HTML/JS applications. See the next page for two HTML widgets. --- ```r library(leaflet) leaflet() %>% addTiles() %>% setView(-93.65, 42.0285, zoom = 17) ```
--- ```r DT::datatable( head(iris, 10), fillContainer = FALSE, options = list(pageLength = 8) ) ```
--- # Some Tips - When you use the "Infinite Moon Reader" addin in RStudio, your R session will be blocked by default. You can click the red button on the right of the console to stop serving the slides, or use the _daemonized_ mode so that it does not block your R session. To do the latter, you can set the option ```r options(servr.daemon = TRUE) ``` in your current R session, or in `~/.Rprofile` so that it is applied to all future R sessions. I do the latter by myself. To know more about the web server, see the [**servr**](https://github.com/yihui/servr) package. -- - Do not forget to try the `yolo` option of `xaringan::moon_reader`. ```yaml output: xaringan::moon_reader: yolo: true ``` --- # Some Tips - Slides can be automatically played if you set the `autoplay` option under `nature`, e.g. go to the next slide every 30 seconds in a lightning talk: ```yaml output: xaringan::moon_reader: nature: autoplay: 30000 ``` -- - A countdown timer can be added to every page of the slides using the `countdown` option under `nature`, e.g. if you want to spend one minute on every page when you give the talk, you can set: ```yaml output: xaringan::moon_reader: nature: countdown: 60000 ``` Then you will see a timer counting down from `01:00`, to `00:59`, `00:58`, ... When the time is out, the timer will continue but the time turns red. --- # Some Tips - The title slide is created automatically by **xaringan**, but it is just another remark.js slide added before your other slides. The title slide is set to `class: center, middle, inverse, title-slide` by default. You can change the classes applied to the title slide with the `titleSlideClass` option of `nature` (`title-slide` is always applied). ```yaml output: xaringan::moon_reader: nature: titleSlideClass: [top, left, inverse] ``` -- - If you'd like to create your own title slide, disable **xaringan**'s title slide with the `seal = FALSE` option of `moon_reader`. ```yaml output: xaringan::moon_reader: seal: false ``` --- # Some Tips - There are several ways to build incremental slides. See [this presentation](https://slides.yihui.name/xaringan/incremental.html) for examples. - The option `highlightLines: true` of `nature` will highlight code lines that start with `*`, or are wrapped in `{{ }}`, or have trailing comments `#<<`; ```yaml output: xaringan::moon_reader: nature: highlightLines: true ``` See examples on the next page. --- # Some Tips .pull-left[ An example using a leading `*`: ```r if (TRUE) { ** message("Very important!") } ``` Output: ```r if (TRUE) { * message("Very important!") } ``` This is invalid R code, so it is a plain fenced code block that is not executed. ] .pull-right[ An example using `{{}}`: ```` ```{r tidy=FALSE} if (TRUE) { *{{ message("Very important!") }} } ``` ```` Output: ```r if (TRUE) { * message("Very important!") } ``` ``` ## Very important! ``` It is valid R code so you can run it. Note that `{{}}` can wrap an R expression of multiple lines. ] --- # Some Tips An example of using the trailing comment `#<<` to highlight lines: ````markdown ```{r tidy=FALSE} library(ggplot2) ggplot(mtcars) + aes(mpg, disp) + geom_point() + #<< geom_smooth() #<< ``` ```` Output: ```r library(ggplot2) ggplot(mtcars) + aes(mpg, disp) + * geom_point() + * geom_smooth() ``` --- # Some Tips When you enable line-highlighting, you can also use the chunk option `highlight.output` to highlight specific lines of the text output from a code chunk. For example, `highlight.output = TRUE` means highlighting all lines, and `highlight.output = c(1, 3)` means highlighting the first and third line. ````md ```{r, highlight.output=c(1, 3)} head(iris) ``` ```` ``` *## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## 1 5.1 3.5 1.4 0.2 setosa *## 2 4.9 3.0 1.4 0.2 setosa ## 3 4.7 3.2 1.3 0.2 setosa ## 4 4.6 3.1 1.5 0.2 setosa ## 5 5.0 3.6 1.4 0.2 setosa ## 6 5.4 3.9 1.7 0.4 setosa ``` Question: what does `highlight.output = c(TRUE, FALSE)` mean? (Hint: think about R's recycling of vectors) --- # Some Tips - To make slides work offline, you need to download a copy of remark.js in advance, because **xaringan** uses the online version by default (see the help page `?xaringan::moon_reader`). - You can use `xaringan::summon_remark()` to download the latest or a specified version of remark.js. By default, it is downloaded to `libs/remark-latest.min.js`. - Then change the `chakra` option in YAML to point to this file, e.g. ```yaml output: xaringan::moon_reader: chakra: libs/remark-latest.min.js ``` - If you used Google fonts in slides (the default theme uses _Yanone Kaffeesatz_, _Droid Serif_, and _Source Code Pro_), they won't work offline unless you download or install them locally. The Heroku app [google-webfonts-helper](https://google-webfonts-helper.herokuapp.com/fonts) can help you download fonts and generate the necessary CSS. --- # Macros - remark.js [allows users to define custom macros](https://github.com/yihui/xaringan/issues/80) (JS functions) that can be applied to Markdown text using the syntax `![:macroName arg1, arg2, ...]` or ``. For example, before remark.js initializes the slides, you can define a macro named `scale`: ```js remark.macros.scale = function (percentage) { var url = this; return '<img src="' + url + '" style="width: ' + percentage + '" />'; }; ``` Then the Markdown text ```markdown  ``` will be translated to ```html <img src="image.jpg" style="width: 50%" /> ``` --- # Macros (continued) - To insert macros in **xaringan** slides, you can use the option `beforeInit` under the option `nature`, e.g., ```yaml output: xaringan::moon_reader: nature: beforeInit: "macros.js" ``` You save your remark.js macros in the file `macros.js`. - The `beforeInit` option can be used to insert arbitrary JS code before `remark.create()`. Inserting macros is just one of its possible applications. --- # CSS Among all options in `xaringan::moon_reader`, the most challenging but perhaps also the most rewarding one is `css`, because it allows you to customize the appearance of your slides using any CSS rules or hacks you know. You can see the default CSS file [here](https://github.com/yihui/xaringan/blob/master/inst/rmarkdown/templates/xaringan/resources/default.css). You can completely replace it with your own CSS files, or define new rules to override the default. See the help page `?xaringan::moon_reader` for more information. --- # CSS For example, suppose you want to change the font for code from the default "Source Code Pro" to "Ubuntu Mono". You can create a CSS file named, say, `ubuntu-mono.css`: ```css @import url(https://fonts.googleapis.com/css?family=Ubuntu+Mono:400,700,400italic); .remark-code, .remark-inline-code { font-family: 'Ubuntu Mono'; } ``` Then set the `css` option in the YAML metadata: ```yaml output: xaringan::moon_reader: css: ["default", "ubuntu-mono.css"] ``` Here I assume `ubuntu-mono.css` is under the same directory as your Rmd. See [yihui/xaringan#83](https://github.com/yihui/xaringan/issues/83) for an example of using the [Fira Code](https://github.com/tonsky/FiraCode) font, which supports ligatures in program code. --- # Themes Don't want to learn CSS? Okay, you can use some user-contributed themes. A theme typically consists of two CSS files `foo.css` and `foo-fonts.css`, where `foo` is the theme name. Below are some existing themes: ```r names(xaringan:::list_css()) ``` ``` ## [1] "chocolate-fonts" "chocolate" "default-fonts" "default" "duke-blue" ## [6] "fc-fonts" "fc" "hygge-duke" "hygge" "kunoichi" ## [11] "lucy-fonts" "lucy" "metropolis-fonts" "metropolis" "middlebury-fonts" ## [16] "middlebury" "ninjutsu" "rladies-fonts" "rladies" "robot-fonts" ## [21] "robot" "rutgers-fonts" "rutgers" "shinobi" "tamu-fonts" ## [26] "tamu" "uo-fonts" "uo" "uol-fonts" "uol" ``` --- # Themes To use a theme, you can specify the `css` option as an array of CSS filenames (without the `.css` extensions), e.g., ```yaml output: xaringan::moon_reader: css: [default, metropolis, metropolis-fonts] ``` If you want to contribute a theme to **xaringan**, please read [this blog post](https://yihui.name/en/2017/10/xaringan-themes). --- class: inverse, middle, center background-image: url(https://upload.wikimedia.org/wikipedia/commons/3/39/Naruto_Shiki_Fujin.svg) background-size: contain # Naruto --- background-image: url(https://upload.wikimedia.org/wikipedia/commons/b/be/Sharingan_triple.svg) background-size: 100px background-position: 90% 8% # Sharingan The R package name **xaringan** was derived<sup>1</sup> from **Sharingan**, a dōjutsu in the Japanese anime _Naruto_ with two abilities: - the "Eye of Insight" - the "Eye of Hypnotism" I think a presentation is basically a way to communicate insights to the audience, and a great presentation may even "hypnotize" the audience.<sup>2,3</sup> .footnote[ [1] In Chinese, the pronounciation of _X_ is _Sh_ /ʃ/ (as in _shrimp_). Now you should have a better idea of how to pronounce my last name _Xie_. [2] By comparison, bad presentations only put the audience to sleep. [3] Personally I find that setting background images for slides is a killer feature of remark.js. It is an effective way to bring visual impact into your presentations. ] --- # Naruto terminology The **xaringan** package borrowed a few terms from Naruto, such as - [Sharingan](http://naruto.wikia.com/wiki/Sharingan) (写輪眼; the package name) - The [moon reader](http://naruto.wikia.com/wiki/Moon_Reader) (月読; an attractive R Markdown output format) - [Chakra](http://naruto.wikia.com/wiki/Chakra) (查克拉; the path to the remark.js library, which is the power to drive the presentation) - [Nature transformation](http://naruto.wikia.com/wiki/Nature_Transformation) (性質変化; transform the chakra by setting different options) - The [infinite moon reader](http://naruto.wikia.com/wiki/Infinite_Tsukuyomi) (無限月読; start a local web server to continuously serve your slides) - The [summoning technique](http://naruto.wikia.com/wiki/Summoning_Technique) (download remark.js from the web) You can click the links to know more about them if you want. The jutsu "Moon Reader" may seem a little evil, but that does not mean your slides are evil. --- class: center # Hand seals (印) Press `h` or `?` to see the possible ninjutsu you can use in remark.js.  --- class: center, middle # Thanks! Slides created via the R package [**xaringan**](https://github.com/yihui/xaringan). The chakra comes from [remark.js](https://remarkjs.com), [**knitr**](http://yihui.name/knitr), and [R Markdown](https://rmarkdown.rstudio.com).