Introduce GoogleDrive Fetcher for tika-pipes #2077

Open
bartek wants to merge 4 commits into apache:main from bartek:bartek/tika-fetcher-google

bartek commented Dec 6, 2024

This introduces a Google Drive fetcher into the tika-pipes project.

@bartek bartek mentioned this pull request Dec 6, 2024

bartek commented Dec 6, 2024

Author

@tballison Cleaned up version of #2074 now that I sorted out what's going on with the tika-grpc-3x-features branch (it will require cleanup)

THausherr commented Dec 7, 2024

Contributor

The pom.xml has several properties that are not used, e.g. kiota, wiremock, nimbus, etc. You probably copied them from another (older) pom.xml. I think you only need the first one.

bartek commented Dec 9, 2024

Author

> The pom.xml has several properties that are not used, e.g. kiota, wiremock, nimbus, etc. You probably copied them from another (older) pom.xml. I think you only need the first one.

Thanks, I was indeed copying and did not review the pom.xml too deeply. Cleaned up: 880a2ea

@THausherr

Contributor

There are still 3 that you don't need, one that isn't used and two that are defined in the parent.

bartek force-pushed the bartek/tika-fetcher-google branch from 880a2ea to 0884a58 on December 9, 2024 10:46

bartek commented Dec 9, 2024

Author

> There are still 3 that you don't need, one that isn't used and two that are defined in the parent.

I think I got them in the final push. Is there a way to identify these during the build? I didn't notice them in the output.

THausherr changed the title from "Introduce GoogleDriver Fetcher for tika-pipes" to "Introduce GoogleDrive Fetcher for tika-pipes" Dec 9, 2024
@THausherr

Contributor

I got these by looking at the source code. This is just me, I like smaller pom.xml files that are easier to understand and maintain.

Is it possible for you to create some sort of unit test, or is this impossible because one would need Google Drive access?
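
On the earlier question of spotting unused pom.xml properties during a build: bartek saw nothing in the build output, and the check was done by reading the source. A rough text check can approximate that manual review. The sketch below assumes a single pom.xml with one `<properties>` block; `find_unused_props` is a hypothetical helper name, not part of Tika or Maven. It lists properties that are never referenced as `${name}` in the same file. It cannot tell whether a property is already defined in the parent, used by a child module, or consumed implicitly by a plugin (for example `maven.compiler.source`), so its output is only a list of candidates to review by hand.

```shell
# Sketch (hypothetical helper): print properties declared in a pom.xml's
# <properties> block that are never referenced as ${name} in the same file.
find_unused_props() {
    local pom=$1 name
    # Keep only the <properties> section, then extract opening-tag names.
    sed -n '/<properties>/,/<\/properties>/p' "$pom" \
        | grep -oE '<[A-Za-z][A-Za-z0-9._-]*>' \
        | tr -d '<>' \
        | grep -vx 'properties' \
        | sort -u \
        | while read -r name; do
              # A property counts as used if the literal ${name} appears in the file.
              grep -qF "\${${name}}" "$pom" || echo "$name"
          done
}
```

After sourcing the function, it could be run as `find_unused_props path/to/pom.xml`; an empty result means every declared property is referenced somewhere in that file.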

bartek commented Dec 9, 2024

Author

> I got these by looking at the source code. This is just me, I like smaller pom.xml files that are easier to understand and maintain.

Thanks for the commits and notes. I'm not too familiar with the project and am entering through tika-pipes and its fetcher requirements, so I appreciate your patience.

> Is it possible for you to create some sort of unit test, or is this impossible because one would need Google Drive access?

I imagine we could mock the response from Google Drive, so at least we test happy/sad paths. Let me have a try at it.

THausherr commented Dec 9, 2024

Contributor

So I was able to fix the Google Drive fetcher pom.xml, but now the pipes gRPC server is failing with dependency convergence errors. 😬 I'll try to fix that too.

@THausherr

Contributor

I managed to do a complete build locally, mostly by moving the dependencyManagement stuff I introduced to the parent. I'll do another test locally and then add this here.

bartek commented Dec 9, 2024

Author

@THausherr Looks like you got this building. Are you happy with your changes? If so I will squash them into a single commit.

@THausherr

Contributor

Yes!

bartek commented Dec 9, 2024

Author

@THausherr Great. Btw, since these changes, I am unable to build tika-pipes (which is what I am building, not the whole project). It looks like the pom.xml that was previously expected is no longer there. Are you able to help?

Here's the error:

[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  02:33 min
[INFO] Finished at: 2024-12-09T12:47:12-04:00
[INFO] ------------------------------------------------------------------------
+ mvn dependency:copy-dependencies -f /Users/bartek/workspace/tika/tika-pipes/tika-grpc/example-dockerfile/../../../tika-pipes/tika-grpc
[INFO] Scanning for projects...
[ERROR] [ERROR] Some problems were encountered while processing the POMs:
[FATAL] Non-readable POM /Users/bartek/workspace/tika/tika-pipes/tika-grpc/example-dockerfile/../../../tika-pipes/tika-grpc/pom.xml: /Users/bartek/workspace/tika/tika-pipes/tika-grpc/example-dockerfile/../../../tika-pipes/tika-grpc/pom.xml (No such file or directory) @ 

And here's my build script:

set -x

TAG_NAME=$1

if [ -z "${TAG_NAME}" ]; then
    echo "Single command line argument is required which will be used as the -t parameter of the docker build command"
    exit 1
fi

SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
TIKA_SRC_PATH=${SCRIPT_DIR}/../../..
OUT_DIR=${TIKA_SRC_PATH}/tika-pipes/tika-grpc/target/tika-docker

mvn clean install -Dossindex.skip -DskipTests=true -Denforcer.skip=true -Dossindex.skip=true -f "${TIKA_SRC_PATH}" || exit
mvn dependency:copy-dependencies -f "${TIKA_SRC_PATH}/tika-pipes/tika-grpc" || exit
rm -rf "${OUT_DIR}"
mkdir -p "${OUT_DIR}"

project_version=$(mvn help:evaluate -Dexpression=project.version -q -DforceStdout -f "${TIKA_SRC_PATH}")

cp -r "${TIKA_SRC_PATH}/tika-pipes/tika-grpc/target/dependency" "${OUT_DIR}/libs"
cp -r "${TIKA_SRC_PATH}/tika-pipes/tika-fetchers/tika-fetcher-gcs/target/tika-fetcher-gcs-${project_version}.jar" "${OUT_DIR}/libs"
cp -r "${TIKA_SRC_PATH}/tika-pipes/tika-fetchers/tika-fetcher-az-blob/target/tika-fetcher-az-blob-${project_version}.jar" "${OUT_DIR}/libs"
cp -r "${TIKA_SRC_PATH}/tika-pipes/tika-fetchers/tika-fetcher-http/target/tika-fetcher-http-${project_version}.jar" "${OUT_DIR}/libs"
cp -r "${TIKA_SRC_PATH}/tika-pipes/tika-fetchers/tika-fetcher-microsoft-graph/target/tika-fetcher-microsoft-graph-${project_version}.jar" "${OUT_DIR}/libs"
cp -r "${TIKA_SRC_PATH}/tika-pipes/tika-fetchers/tika-fetcher-s3/target/tika-fetcher-s3-${project_version}.jar" "${OUT_DIR}/libs"

cp "${TIKA_SRC_PATH}/tika-pipes/tika-grpc/target/tika-grpc-${project_version}.jar" "${OUT_DIR}/libs"
cp "${TIKA_SRC_PATH}/tika-pipes/tika-grpc/src/test/resources/log4j2.xml" "${OUT_DIR}"
cp "${TIKA_SRC_PATH}/tika-pipes/tika-grpc/src/test/resources/tika-pipes-test-config.xml" "${OUT_DIR}/tika-config.xml"
cp "${TIKA_SRC_PATH}/tika-pipes/tika-grpc/example-dockerfile/Dockerfile" "${OUT_DIR}/Dockerfile"

cd "${OUT_DIR}" || exit

# build single arch
#docker build "${OUT_DIR}" -t "${TAG_NAME}"

# Or we can build multi-arch - https://www.docker.com/blog/multi-arch-images/
docker buildx create --name tikabuilder
# see https://askubuntu.com/questions/1339558/cant-build-dockerfile-for-arm64-due-to-libc-bin-segmentation-fault/1398147#1398147
docker run --rm --privileged tonistiigi/binfmt --install amd64
docker run --rm --privileged tonistiigi/binfmt --install arm64
docker buildx build --builder=tikabuilder "${OUT_DIR}" -t "${TAG_NAME}" --platform linux/amd64,linux/arm64 --push
docker buildx stop tikabuilder

bartek and others added 3 commits December 9, 2024 12:51
This allows the fetching of items using files.get from Google Drive
bartek force-pushed the bartek/tika-fetcher-google branch from f5e9dbe to c8d3ea7 on December 9, 2024 16:52

THausherr commented Dec 9, 2024

Contributor

I didn't touch tika-grpc/pom.xml at all.

Your script has "tika-pipes/tika-grpc"; however, "tika-grpc" is at the top level.
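
The failure in the log above came from passing `-f` a module directory that contains no pom.xml on this branch. A guard along the following lines would surface that before Maven starts. This is a minimal sketch: `require_pom` is a hypothetical helper name, and the top-level `tika-grpc` location in the usage note is taken from the comment above rather than verified against the repository.

```shell
# Sketch (hypothetical helper): fail early when a Maven module directory
# does not contain a pom.xml, instead of letting mvn report a non-readable POM.
require_pom() {
    local module_dir=$1
    if [ ! -f "${module_dir}/pom.xml" ]; then
        echo "No pom.xml in ${module_dir}; check the module path for this branch" >&2
        return 1
    fi
}
```

In the build script this could precede the Maven call, for example `require_pom "${TIKA_SRC_PATH}/tika-grpc" || exit`, followed by `mvn dependency:copy-dependencies -f "${TIKA_SRC_PATH}/tika-grpc" || exit`.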

bartek commented Dec 10, 2024

Author

@THausherr Thanks, I sorted that out. Looks like my paths are based on tika-grpc-3x-features branch paths.

@nddipiazza

Contributor

I am porting this into https://github.com/nddipiazza/tika-pipes, starting now.
