Mi odisea de obtener metadatos scRNAseq

Quiero seleccionar un conjunto de datos de scRNAseq público de este documento Los análisis de células individuales revelan subconjuntos de células inmunitarias clave asociados con la respuesta al bloqueo de PD-L1 en el cáncer de mama triple negativo

ffq

Primero intenté ffqpero me dio errores.

ffq obtiene información de metadatos de las siguientes bases de datos:

  • GEO: ómnibus de expresión génica,
  • SRA: archivo de lectura de secuencia,
  • EMBL-EBI: Instituto Europeo de Bioinformática del Laboratorio Europeo de Biología Molecular,
  • DDBJ: Banco de datos de ADN de Japón,
  • NIH Biosample: materiales de origen biológico utilizados en ensayos experimentales,
  • ENCODE: La Enciclopedia de Elementos de ADN.
(base) ➜  ~ ffq GSE169246
[2022-06-06 14:32:22,716]    INFO Parsing GEO GSE169246
[2022-06-06 14:32:24,418]    INFO Finding supplementary files for GEO GSE169246
[2022-06-06 14:32:28,001] WARNING There are 83 samples for GSE169246
[2022-06-06 14:32:28,001]    INFO Parsing GSM GSM5188367
[2022-06-06 14:32:33,047]    INFO Finding supplementary files for GSM GSM5188367
[2022-06-06 14:32:34,409]    INFO No supplementary files found for GSM5188367
[2022-06-06 14:32:39,659]    INFO Getting sample for GSM5188367
[2022-06-06 14:32:41,847] WARNING No sample found. Either the provided GSM accession is invalid or raw data was not provided for this record

obtener los metadatos usando GEOquery

otros métodos: pysradb https://github.com/saketkc/pysradb

luego probé GEOquery

library(GEOquery)
library(tidyverse)
library(here)


meta<- getGEO(GEO= "GSE169246",GSEMatrix=FALSE)

los meta la ranura gsms del objeto contiene información útil


[email protected][email protected]$characteristics_ch1
#> [1] "tissue: blood"               "disease state: TNBC patient"
#> [3] "group: Progression"          "treatment: Chemo"

[email protected][email protected]$title 
#> [1] "Prog_P008_b"

[email protected][email protected]$source_name_ch1
#> [1] "P008"

[email protected] es una lista de listas. purrr::map() es una herramienta perfecta para ordenarlo.

puedes aprender más purrr de https://jennybc.github.io/purrr-tutorial/


names([email protected])
#>  [1] "GSM5188367" "GSM5188368" "GSM5188369" "GSM5188370" "GSM5188371"
#>  [6] "GSM5188372" "GSM5188373" "GSM5188374" "GSM5188375" "GSM5188376"
#> [11] "GSM5188377" "GSM5188378" "GSM5188379" "GSM5188380" "GSM5188381"
#> [16] "GSM5188382" "GSM5188383" "GSM5188384" "GSM5188385" "GSM5188386"
#> [21] "GSM5188387" "GSM5188388" "GSM5188389" "GSM5188390" "GSM5188391"
#> [26] "GSM5188392" "GSM5188393" "GSM5188394" "GSM5188395" "GSM5188396"
#> [31] "GSM5188397" "GSM5188398" "GSM5188399" "GSM5188400" "GSM5188401"
#> [36] "GSM5188402" "GSM5188403" "GSM5188404" "GSM5188405" "GSM5188406"
#> [41] "GSM5188407" "GSM5188408" "GSM5188409" "GSM5188410" "GSM5188411"
#> [46] "GSM5188412" "GSM5188413" "GSM5188414" "GSM5188415" "GSM5188416"
#> [51] "GSM5188417" "GSM5188418" "GSM5188419" "GSM5188420" "GSM5188421"
#> [56] "GSM5188422" "GSM5188423" "GSM5188424" "GSM5188425" "GSM5188426"
#> [61] "GSM5188427" "GSM5188428" "GSM5188429" "GSM5188430" "GSM5188431"
#> [66] "GSM5188432" "GSM5188433" "GSM5188434" "GSM5188435" "GSM5188436"
#> [71] "GSM5188437" "GSM5188438" "GSM5188439" "GSM5188440" "GSM5188441"
#> [76] "GSM5188442" "GSM5188443" "GSM5188444" "GSM5188445" "GSM5585280"
#> [81] "GSM5585281" "GSM5585282" "GSM5585283"
[email protected][email protected]
#> $channel_count
#> [1] "1"
#> 
#> $characteristics_ch1
#> [1] "tissue: blood"               "disease state: TNBC patient"
#> [3] "group: Pre-treatment"        "treatment: anti-PDL1+Chemo" 
#> 
#> $contact_address
#> [1] "Yiheyuan Road"
#> 
#> $contact_city
#> [1] "Beijing"
#> 
#> $contact_country
#> [1] "China"
#> 
#> $contact_department
#> [1] "BIOPIC"
#> 
#> $contact_email
#> [1] "[email protected]"
#> 
#> $contact_institute
#> [1] "Peking Univerisity"
#> 
#> $contact_laboratory
#> [1] "Zemin Zhang's Lab"
#> 
#> $contact_name
#> [1] "Yuanyuan,,Zhang"
#> 
#> $contact_state
#> [1] "Beijing"
#> 
#> $`contact_zip/postal_code`
#> [1] "100871"
#> 
#> $data_processing
#> [1] "For 10X data, Cell Ranger 3.0.0 was used to quantify gene expression level, identify TCR sequences and quantify ATAC-seq peaks ."
#> [2] "Genome_build: GRCh38 for RNA expression,TCR data and ATAC peaks."                                                                
#> 
#> $data_row_count
#> [1] "0"
#> 
#> $description
#> [1] "10X Genomics platform"
#> 
#> $extract_protocol_ch1
#> [1] "Peripheral blood mononuclear cells (PBMCs) were isolated with HISTOPAQUE-1077 (Sigma-Aldrich) solution. In brief, 3 ml fresh peripheral blood was collected at baseline, 4 weeks after treatment initiation and disease progression in EDTA anticoagulant tubes and subsequently layered onto HISTOPAQUE-1077. Tumor biopsy samples were first stored in RNAlater RNA stabilization reagent (QIAGEN) after collection and kept on ice to avoid RNA degradation. Samples were then collected in the RPMI-1640 medium (Gibco) with 10% FBS (Gibco), and enzymatically digested with gentle MACS Tumor Dissociation Kit (Miltenyi Biotec) for 60 min on a rotor at 37 °C according to the manufacturer’s protocol. The dissociated cells were next passed through a 40-µm cell-strainer (BD) in the RPMI-1640 medium (Invitrogen) with 10% FBS until uniform cell suspensions were obtained. Subsequently, the suspended cells were passed through cell strainers and centrifuged at 400 g for 10 min. Red blood cells were removed via the same procedure described above. After washing twice with 1x PBS (Invitrogen), the cell pellets were re-suspended in sorting buffer (PBS supplemented with 1% FBS)."
#> [2] "single cells were sorted into 1.5 ml tubes (Eppendorf) and counted manually under the microscope. The concentration of single cell suspensions was adjusted to 500-1200 cells/ul. Cells were loaded between 7,000 and 15,000 cells/chip position using the 10x Chromium Single cell 5’ Library, Gel Bead & Multiplex Kit and Chip Kit (10x Genomics, V3 barcoding chemistry) according to the manufacturer’s instructions. All the subsequent steps were performed following the standard manufacturer’s protocols. Purified libraries were analyzed by an Illumina Hiseq X Ten sequencer with 150-bp paired-end reads."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
#> [3] "10x Chromium Single cell 5’ + VDJ Library, and 10x Chromium Single cell ATAC-seq Library"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
#> 
#> $geo_accession
#> [1] "GSM5188367"
#> 
#> $instrument_model
#> [1] "HiSeq X Ten"
#> 
#> $last_update_date
#> [1] "Sep 17 2021"
#> 
#> $library_selection
#> [1] "cDNA"
#> 
#> $library_source
#> [1] "transcriptomic"
#> 
#> $library_strategy
#> [1] "RNA-Seq"
#> 
#> $molecule_ch1
#> [1] "polyA RNA"
#> 
#> $organism_ch1
#> [1] "Homo sapiens"
#> 
#> $platform_id
#> [1] "GPL20795"
#> 
#> $relation
#> [1] "BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN18383051"
#> 
#> $series_id
#> [1] "GSE169246"
#> 
#> $source_name_ch1
#> [1] "P007"
#> 
#> $status
#> [1] "Public on Sep 15 2021"
#> 
#> $submission_date
#> [1] "Mar 19 2021"
#> 
#> $supplementary_file_1
#> [1] "NONE"
#> 
#> $taxid_ch1
#> [1] "9606"
#> 
#> $title
#> [1] "Pre_P007_b"
#> 
#> $treatment_protocol_ch1
#> [1] "Half of 22 TNBC patients were treated with paclitaxel monotherapy and the other half were treated with paclitaxel plus atezolizumab as first-line treatment"
#> 
#> $type
#> [1] "SRA"

purrr::map([email protected], [email protected]$characteristics_ch1)[1:5]
#> $GSM5188367
#> [1] "tissue: blood"               "disease state: TNBC patient"
#> [3] "group: Pre-treatment"        "treatment: anti-PDL1+Chemo" 
#> 
#> $GSM5188368
#> [1] "tissue: lymph_node"          "disease state: TNBC patient"
#> [3] "group: Pre-treatment"        "treatment: anti-PDL1+Chemo" 
#> 
#> $GSM5188369
#> [1] "tissue: blood"               "disease state: TNBC patient"
#> [3] "group: Pre-treatment"        "treatment: anti-PDL1+Chemo" 
#> 
#> $GSM5188370
#> [1] "tissue: lymph_node"          "disease state: TNBC patient"
#> [3] "group: Pre-treatment"        "treatment: anti-PDL1+Chemo" 
#> 
#> $GSM5188371
#> [1] "tissue: blood"               "disease state: TNBC patient"
#> [3] "group: Pre-treatment"        "treatment: anti-PDL1+Chemo"

Además, ¿has usado el impresionante stack y unstack función en base R? Hay tantas buenas funciones ocultas.


meta1<- purrr::map([email protected], [email protected]$characteristics_ch1) %>%
        stack() %>%
        separate(values, into = c("condition", "value"), sep= ": ")%>%
        pivot_wider(names_from= condition, values_from = value) %>%
        janitor::clean_names()
        

meta2<- purrr::map([email protected], [email protected]$title) %>%
        stack() %>%
        dplyr::rename(sample_id = values)


meta3<- purrr::map([email protected], [email protected]$source_name_ch1) %>%
        stack() %>%
        dplyr::rename(patient_id = values)



head(meta1)
#> 
#>   ind        tissue     disease_state group         treatment      
#>   <fct>      <chr>      <chr>         <chr>         <chr>          
#> 1 GSM5188367 blood      TNBC patient  Pre-treatment anti-PDL1+Chemo
#> 2 GSM5188368 lymph_node TNBC patient  Pre-treatment anti-PDL1+Chemo
#> 3 GSM5188369 blood      TNBC patient  Pre-treatment anti-PDL1+Chemo
#> 4 GSM5188370 lymph_node TNBC patient  Pre-treatment anti-PDL1+Chemo
#> 5 GSM5188371 blood      TNBC patient  Pre-treatment anti-PDL1+Chemo
#> 6 GSM5188372 blood      TNBC patient  Pre-treatment Chemo
head(meta2)
#>    sample_id        ind
#> 1 Pre_P007_b GSM5188367
#> 2 Pre_P007_t GSM5188368
#> 3 Pre_P012_b GSM5188369
#> 4 Pre_P012_t GSM5188370
#> 5 Pre_P014_b GSM5188371
#> 6 Pre_P023_b GSM5188372
head(meta3)
#>   patient_id        ind
#> 1       P007 GSM5188367
#> 2       P007 GSM5188368
#> 3       P012 GSM5188369
#> 4       P012 GSM5188370
#> 5       P014 GSM5188371
#> 6       P023 GSM5188372


meta<- left_join(meta1, meta2) %>%
        left_join(meta3)

head(meta)
#> 
#>   ind        tissue     disease_state group       treatment sample_id patient_id
#>   <fct>      <chr>      <chr>         <chr>       <chr>     <chr>     <chr>     
#> 1 GSM5188367 blood      TNBC patient  Pre-treatm… anti-PDL… Pre_P007… P007      
#> 2 GSM5188368 lymph_node TNBC patient  Pre-treatm… anti-PDL… Pre_P007… P007      
#> 3 GSM5188369 blood      TNBC patient  Pre-treatm… anti-PDL… Pre_P012… P012      
#> 4 GSM5188370 lymph_node TNBC patient  Pre-treatm… anti-PDL… Pre_P012… P012      
#> 5 GSM5188371 blood      TNBC patient  Pre-treatm… anti-PDL… Pre_P014… P014      
#> 6 GSM5188372 blood      TNBC patient  Pre-treatm… Chemo     Pre_P023… P023

meta<- meta %>%
        filter(!str_detect(sample_id, "ATAC-seq"))

Obtener los datos de respuesta

Ahora, tendré que ir a la publicación y encontrar la tabla complementaria que contiene los datos de respuesta. Se parece a esto:

Todo el mundo debería leer esto https://datacarpentry.org/spreadsheet-ecology-lesson/02-common-mistakes/ y Organización de datos en hoja de cálculo.

Usar Excel sabiamente no es una ciencia espacial, pero usarlo correctamente puede hacer que la vida del bioinformático sea mucho más fácil.

clinical_data<- readxl::read_xlsx("~/Downloads/1-s2.0-S1535610821004992-mmc2.xlsx", skip = 1)


print(clinical_data, n = 40)
#> # A tibble: 27 × 19
#>    Treatment     `Patient ID` `Biopsied lesi…` `Diameter of b…` `Size of targe…`
#>    <chr>         <chr>        <chr>            <chr>                       <dbl>
#>  1 <NA>          <NA>         <NA>             <NA>                           NA
#>  2 Anti-PD-L1+ … P019         Lymph Node       15                             36
#>  3 <NA>          P010         Lung             21                             35
#>  4 <NA>          P012         Lymph Node       28                             28
#>  5 <NA>          P007         Lymph Node       22                             22
#>  6 <NA>          P017         Lymph Node       16                             45
#>  7 <NA>          P001         -                -                              45
#>  8 <NA>          P002         Chest Wall       48                             48
#>  9 <NA>          P014         -                -                              11
#> 10 <NA>          P004         Chest Wall       35                             35
#> 11 <NA>          P005         Liver            87                             97
#> 12 <NA>          P016         Chest Wall       24                             24
#> 13 Chemo         P022         Breast           33                             48
#> 14 <NA>          P011         -                -                              30
#> 15 <NA>          P020         Breast           37                             55
#> 16 <NA>          P008         Lung             22                             22
#> 17 <NA>          P013         Liver            36                            152
#> 18 <NA>          P025         Breast           26                             26
#> 19 <NA>          P018         Breast           48                             48
#> 20 <NA>          P023         Breast           26                             42
#> 21 <NA>          P024         Breast           72                             95
#> 22 <NA>          P003         Chest Wall       11                             30
#> 23 <NA>          P028         -                26                             57
#> 24 N,Not availa… <NA>         <NA>             <NA>                           NA
#> 25 * Negative v… <NA>         <NA>             <NA>                           NA
#> 26 # PR,partial… <NA>         <NA>             <NA>                           NA
#> 27 $The + sign … <NA>         <NA>             <NA>                           NA
#> # … with 14 more variables: Age <dbl>, `Stage (TNM )` <chr>, `PD-L1` <chr>,
#> #   `TILs (%)` <chr>, Tumor <chr>, ...11 <chr>, ...12 <chr>, Blood <chr>,
#> #   ...14 <chr>, ...15 <chr>,
#> #   `Relative change of target lesions (8 weeks after treatment initiation vs. Pre-treatment)*` <chr>,
#> #   `Relative change of biopsied lesions (8 weeks after treatment initiation vs. Pre-treatment)*` <chr>,
#> #   `Clinical efficacy#` <chr>, `Time to progression (months)$` <chr>

clinical_data<- clinical_data %>%
        janitor::clean_names() %>%
        slice(2:23) %>%
        tidyr::fill(treatment) %>%
        dplyr::select(!(tumor:x15)) 

sí, y N para no disponible y, a veces, significa no disponible. arreglemos eso.

clinical_data<- clinical_data %>%
        
        mutate_all(~str_trim(.x, side ="both")) %>%
        mutate_at(vars(-pd_l1), ~str_replace(.x , "^-$", NA_character_)) %>%
        mutate_all( ~str_replace(.x , "^N$", NA_character_))

fusionar las dos tablas

meta<- meta %>%
        left_join(clinical_data) %>%
        arrange(patient_id, biopsied_lesion) %>%
        select(-biopsied_lesion)

print(meta, n = 20)
#> 
#>    ind        tissue     disease_state group      treatment sample_id patient_id
#>    <fct>      <chr>      <chr>         <chr>      <chr>     <chr>     <chr>     
#>  1 GSM5188381 blood      TNBC patient  Pre-treat… anti-PDL… Pre_P001… P001      
#>  2 GSM5188414 blood      TNBC patient  Post-trea… anti-PDL… Post_P00… P001      
#>  3 GSM5188376 blood      TNBC patient  Pre-treat… anti-PDL… Pre_P002… P002      
#>  4 GSM5188377 chest_wall TNBC patient  Pre-treat… anti-PDL… Pre_P002… P002      
#>  5 GSM5188410 blood      TNBC patient  Post-trea… anti-PDL… Post_P00… P002      
#>  6 GSM5188411 chest_wall TNBC patient  Post-trea… anti-PDL… Post_P00… P002      
#>  7 GSM5188437 blood      TNBC patient  Progressi… anti-PDL… Prog_P00… P002      
#>  8 GSM5188401 chest_wall TNBC patient  Post-trea… Chemo     Post_P00… P003      
#>  9 GSM5585280 blood      TNBC patient  Pre-treat… Anti-PD-… Pre_P004… P004      
#> 10 GSM5585281 chest_wall TNBC patient  Pre-treat… Anti-PD-… Pre_P004… P004      
#> 11 GSM5585282 blood      TNBC patient  Post-trea… Anti-PD-… Post_P00… P004      
#> 12 GSM5585283 blood      TNBC patient  Progressi… Anti-PD-… Prog_P00… P004      
#> 13 GSM5188398 blood      TNBC patient  Pre-treat… anti-PDL… Pre_P005… P005      
#> 14 GSM5188399 liver      TNBC patient  Pre-treat… anti-PDL… Pre_P005… P005      
#> 15 GSM5188430 blood      TNBC patient  Post-trea… anti-PDL… Post_P00… P005      
#> 16 GSM5188431 liver      TNBC patient  Post-trea… anti-PDL… Post_P00… P005      
#> 17 GSM5188367 blood      TNBC patient  Pre-treat… anti-PDL… Pre_P007… P007      
#> 18 GSM5188368 lymph_node TNBC patient  Pre-treat… anti-PDL… Pre_P007… P007      
#> 19 GSM5188402 blood      TNBC patient  Post-trea… anti-PDL… Post_P00… P007      
#> 20 GSM5188433 blood      TNBC patient  Progressi… anti-PDL… Prog_P00… P007      
#> 
#> 
#> 
#> 
#> 
#> 

Eso es lo que se necesita para obtener los metadatos de nivel de muestra de un conjunto de datos público. Es difícil de automatizar por completo y uno tiene que ir a GEO y tablas complementarias, y ajustar los datos al formato deseado. Si me preguntas, te diré que esta es la bioinformática de la vida real.

Happy BioinFORMAtics

Fuente del artículo

Deja un comentario