Commit 5723ada

jeffwong-nflx authored and wesm committed
ARROW-3731: MVP to read parquet in R library
I am contributing to [Arrow 3731](https://issues.apache.org/jira/browse/ARROW-3731). This PR has the minimum functionality to read parquet files into an arrow::Table, which can then be converted to a tibble. Multiple parquet files can be read inside `lapply`, and then concatenated at the end.

Steps to compile
1) Build arrow and parquet c++ projects
2) In R run `devtools::load_all()`

What I could use help with:
The biggest challenge for me is my lack of experience with pkg-config. The R library has a `configure` file which uses pkg-config to figure out what c++ libraries to link to. Currently, `configure` looks up the Arrow project and links to -larrow only. We need it to also link to -lparquet. I do not know how to modify pkg-config's metadata to let it know to link to both -larrow and -lparquet

Author: Jeffrey Wong <jeffreyw@netflix.com>
Author: Romain Francois <romain@purrple.cat>
Author: jeffwong-nflx <jeffreyw@netflix.com>

Closes apache#3230 from jeffwong-nflx/master and squashes the following commits:

c67fa3d <jeffwong-nflx> Merge pull request #3 from jeffwong-nflx/cleanup
1df3026 <Jeffrey Wong> don't hard code -larrow and -lparquet
8ccaa51 <Jeffrey Wong> cleanup
75ba5c9 <Jeffrey Wong> add contributor
56adad2 <jeffwong-nflx> Merge pull request #2 from romainfrancois/3731/parquet-2
7d6e64d <Romain Francois> read_parquet() only reading one parquet file, and gains a `as_tibble` argument
e936b44 <Romain Francois> need parquet on travis too
ff260c5 <Romain Francois> header was too commented, renamed to parquet.cpp
9e1897f <Romain Francois> styling etc ...
456c5d2 <Jeffrey Wong> read parquet files
22d89dd <Jeffrey Wong> hardcode -larrow and -lparquet
1 parent 66f0d39 commit 5723ada

11 files changed

Lines changed: 131 additions & 48 deletions

.travis.yml

Lines changed: 2 additions & 0 deletions
@@ -326,6 +326,8 @@ matrix:
 language: r
 cache: packages
 latex: false
+env:
+- ARROW_TRAVIS_PARQUET=1
 before_install:
 # Have to copy-paste this here because of how R's build steps work
 - eval `python $TRAVIS_BUILD_DIR/ci/detect-changes.py`
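
The new `env` entry only sets a variable; something in the CI scripts has to act on it. As a rough sketch of that step, and assuming the variable is meant to switch on the same `-DARROW_PARQUET=ON` flag that this commit adds to the README build command, a build script might do something like the following. `CMAKE_FLAGS` is a hypothetical variable name used for illustration; it is not taken from this commit.

```shell
#!/bin/sh
# Hypothetical sketch of how a CI build script might react to the new
# ARROW_TRAVIS_PARQUET variable. CMAKE_FLAGS is an illustrative name only.

# In the R job this value comes from the env entry in .travis.yml.
ARROW_TRAVIS_PARQUET=1

# Base flags copied from the README build command.
CMAKE_FLAGS="-DCMAKE_BUILD_TYPE=Release -DARROW_BOOST_USE_SHARED:BOOL=Off"

# Enable the Parquet build only when the job asked for it.
if [ "$ARROW_TRAVIS_PARQUET" = "1" ]; then
  CMAKE_FLAGS="-DARROW_PARQUET=ON $CMAKE_FLAGS"
fi

# Print the command instead of running it, so the sketch has no side effects.
echo "cmake .. $CMAKE_FLAGS"
```

Printing the command rather than running it keeps the sketch safe to try outside a checkout of the Arrow sources.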

r/DESCRIPTION

Lines changed: 2 additions & 0 deletions
@@ -4,6 +4,7 @@ Version: 0.11.0.9000
 Authors@R: c(
     person("Romain", "François", email = "romain@rstudio.com", role = c("aut", "cre")),
     person("Javier", "Luraschi", email = "javier@rstudio.com", role = c("ctb")),
+    person("Jeffrey", "Wong", email = "jeffreyw@netflix.com", role = c("ctb")),
     person("Apache Arrow", email = "dev@arrow.apache.org", role = c("aut", "cph"))
   )
 Description: R Integration to 'Apache' 'Arrow'.
@@ -62,6 +63,7 @@ Collate:
     'memory_pool.R'
     'message.R'
     'on_exit.R'
+    'parquet.R'
     'read_record_batch.R'
     'read_table.R'
     'reexports-bit64.R'

r/NAMESPACE

Lines changed: 1 addition & 0 deletions
@@ -123,6 +123,7 @@ export(read_arrow)
 export(read_csv_arrow)
 export(read_feather)
 export(read_message)
+export(read_parquet)
 export(read_record_batch)
 export(read_schema)
 export(read_table)

r/R/RcppExports.R

Lines changed: 4 additions & 0 deletions
Some generated files are not rendered by default.

r/R/parquet.R

Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+#' Read parquet file from disk
+#'
+#' @param file a file path
+#' @param as_tibble should the [arrow::Table][arrow__Table] be converted to a tibble.
+#' @param ... currently ignored
+#'
+#' @return a [arrow::Table][arrow__Table], or a data frame if `as_tibble` is `TRUE`.
+#'
+#' @export
+read_parquet <- function(file, as_tibble = TRUE, ...) {
+  tab <- shared_ptr(`arrow::Table`, read_parquet_file(file))
+  if (isTRUE(as_tibble)) {
+    tab <- as_tibble(tab)
+  }
+  tab
+}

r/README.Rmd

Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@ git clone https://github.com/apache/arrow.git
 cd arrow/cpp && mkdir release && cd release
 
 # It is important to statically link to boost libraries
-cmake .. -DCMAKE_BUILD_TYPE=Release -DARROW_BOOST_USE_SHARED:BOOL=Off
+cmake .. -DARROW_PARQUET=ON -DCMAKE_BUILD_TYPE=Release -DARROW_BOOST_USE_SHARED:BOOL=Off
 make install
 ```
 

r/README.md

Lines changed: 16 additions & 45 deletions
@@ -14,7 +14,7 @@ git clone https://github.com/apache/arrow.git
 cd arrow/cpp && mkdir release && cd release
 
 # It is important to statically link to boost libraries
-cmake .. -DCMAKE_BUILD_TYPE=Release -DARROW_BOOST_USE_SHARED:BOOL=Off
+cmake .. -DARROW_PARQUET=ON -DCMAKE_BUILD_TYPE=Release -DARROW_BOOST_USE_SHARED:BOOL=Off
 make install
 ```
 
@@ -38,48 +38,19 @@ tf <- tempfile()
 #> # A tibble: 10 x 2
 #> x y
 #> <int> <dbl>
-#> 1 1 -0.255
-#> 2 2 -0.162
-#> 3 3 -0.614
-#> 4 4 -0.322
-#> 5 5 0.0693
-#> 6 6 -0.920
-#> 7 7 -1.08
-#> 8 8 0.658
-#> 9 9 0.821
-#> 10 10 0.539
-arrow::write_arrow(tib, tf)
-
-# read it back with pyarrow
-pa <- import("pyarrow")
-as_tibble(pa$open_file(tf)$read_pandas())
-#> # A tibble: 10 x 2
-#> x y
-#> <int> <dbl>
-#> 1 1 -0.255
-#> 2 2 -0.162
-#> 3 3 -0.614
-#> 4 4 -0.322
-#> 5 5 0.0693
-#> 6 6 -0.920
-#> 7 7 -1.08
-#> 8 8 0.658
-#> 9 9 0.821
-#> 10 10 0.539
-```
-
-## Development
-
-### Code style
-
-We use Google C++ style in our C++ code. Check for style errors with
-
-```
-./lint.sh
-```
-
-You can fix the style issues with
-
+#> 1 1 0.0855
+#> 2 2 -1.68
+#> 3 3 -0.0294
+#> 4 4 -0.124
+#> 5 5 0.0675
+#> 6 6 1.64
+#> 7 7 1.54
+#> 8 8 -0.0209
+#> 9 9 -0.982
+#> 10 10 0.349
+# arrow::write_arrow(tib, tf)
+
+# # read it back with pyarrow
+# pa <- import("pyarrow")
+# as_tibble(pa$open_file(tf)$read_pandas())
 ```
-./lint.sh --fix
-```

r/configure

Lines changed: 2 additions & 2 deletions
@@ -26,13 +26,13 @@
 # R CMD INSTALL --configure-vars='INCLUDE_DIR=/.../include LIB_DIR=/.../lib'
 
 # Library settings
-PKG_CONFIG_NAME="arrow"
+PKG_CONFIG_NAME="arrow parquet"
 PKG_DEB_NAME="arrow"
 PKG_RPM_NAME="arrow"
 PKG_CSW_NAME="arrow"
 PKG_BREW_NAME="apache-arrow"
 PKG_TEST_HEADER="<arrow/api.h>"
-PKG_LIBS="-larrow"
+PKG_LIBS="-larrow -lparquet"
 
 # Use pkg-config if available
 pkg-config --version >/dev/null 2>&1
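
The commit message asks how to make pkg-config report both libraries. One way to read the change above: pkg-config accepts several module names in a single call, so listing `arrow parquet` in `PKG_CONFIG_NAME` is enough and no metadata has to be edited. The sketch below is a simplified, hypothetical stand-in for the lookup in `r/configure`, not the script itself. It assumes that the hard-coded `PKG_LIBS` acts as the fallback when pkg-config is missing or does not know both modules.

```shell
#!/bin/sh
# Hypothetical, simplified stand-in for the library lookup in r/configure.
PKG_CONFIG_NAME="arrow parquet"   # modules to query, as in the diff above
PKG_LIBS="-larrow -lparquet"      # fallback when pkg-config cannot help

# Query pkg-config only if it is installed and knows both modules.
# $PKG_CONFIG_NAME is deliberately unquoted so it splits into two arguments.
if pkg-config --exists $PKG_CONFIG_NAME 2>/dev/null; then
  PKG_CFLAGS=$(pkg-config --cflags $PKG_CONFIG_NAME)
  PKG_LIBS=$(pkg-config --libs $PKG_CONFIG_NAME)
fi

echo "PKG_CFLAGS=${PKG_CFLAGS:-}"
echo "PKG_LIBS=$PKG_LIBS"
```

On a machine without Arrow installed this prints the fallback flags. With both `.pc` files on `PKG_CONFIG_PATH`, it prints whatever linker flags pkg-config reports for the two modules.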

r/man/read_parquet.Rd

Lines changed: 21 additions & 0 deletions
Some generated files are not rendered by default.

r/src/RcppExports.cpp

Lines changed: 12 additions & 0 deletions
Some generated files are not rendered by default.
