Skip to content

Error handling for Grobid when not responding#35

Closed
Sanakhamassi wants to merge 6 commits into
ScienciaLAB:mainfrom
Sanakhamassi:fix/Display-an-error-message-when-grobid-is-not-responding
Closed

Error handling for Grobid when not responding#35
Sanakhamassi wants to merge 6 commits into
ScienciaLAB:mainfrom
Sanakhamassi:fix/Display-an-error-message-when-grobid-is-not-responding

Conversation

@Sanakhamassi

@Sanakhamassi Sanakhamassi commented Apr 23, 2026

Copy link
Copy Markdown
Contributor

Related to issue #11

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds explicit error handling for Grobid failures so the Streamlit UI can surface a clear “please try later” message instead of failing ambiguously (issue #11).

Changes:

  • Introduce GrobidServiceError and raise it when Grobid errors or returns non-200.
  • Catch GrobidServiceError in the Streamlit upload/embedding flow and display an error message.
  • Add a (currently redundant) guard in DocumentQAEngine for missing Grobid output.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File Description
streamlit_app.py Catches Grobid failures during embedding creation and shows a user-facing error.
document_qa/grobid_processors.py Defines GrobidServiceError and raises it from Grobid processing on failure/non-200.
document_qa/document_qa_engine.py Imports/raises GrobidServiceError when Grobid structure is missing.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread streamlit_app.py Outdated
Comment thread streamlit_app.py Outdated
Comment thread document_qa/grobid_processors.py
Comment thread document_qa/document_qa_engine.py
Comment thread document_qa/grobid_processors.py

@lfoppiano lfoppiano left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a missing space, the rest looks fine. I did not test it, so please make sure you tested it before merge/squash.

Comment thread streamlit_app.py Outdated

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

Comment thread streamlit_app.py Outdated
Comment on lines +319 to +327
tmp_file = NamedTemporaryFile()
tmp_file.write(bytearray(binary))
st.session_state['binary'] = binary

st.session_state['doc_id'] = hash = st.session_state['rqa'][model].create_memory_embeddings(
tmp_file.name,
chunk_size=chunk_size,
perc_overlap=0.1
)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, @Sanakhamassi here you need to either add tempFile in the with () or handle that somehow

Comment thread streamlit_app.py Outdated
st.session_state['doc_id'] = None
st.session_state['loaded_embeddings'] = False
st.session_state['uploaded'] = False
st.error(f"{message} Please try later.")
Comment on lines 221 to 223
if grobid_url:
self.grobid_processor = GrobidProcessor(grobid_url)
self.grobid_processor = GrobidProcessor(grobid_url, ping_server=False)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Sanakhamassi why was this removed?

Comment on lines +108 to +125
try:
pdf_file, status, text = self.grobid_client.process_pdf("processFulltextDocument",
input_path,
consolidate_header=True,
consolidate_citations=False,
segment_sentences=False,
tei_coordinates=coordinates,
include_raw_citations=False,
include_raw_affiliations=False,
generateIDs=True)
except Exception as exc:
raise GrobidServiceError("Grobid service did not respond.") from exc

if status != 200:
return
raise GrobidServiceError(
f"Grobid service returned status {status}.",
status_code=status
)

@lfoppiano lfoppiano left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better than before, however there are few changes that needs to be done.

Comment on lines 221 to 223
if grobid_url:
self.grobid_processor = GrobidProcessor(grobid_url)
self.grobid_processor = GrobidProcessor(grobid_url, ping_server=False)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Sanakhamassi why was this removed?

Comment thread streamlit_app.py Outdated
Comment thread streamlit_app.py Outdated
Comment on lines +319 to +327
tmp_file = NamedTemporaryFile()
tmp_file.write(bytearray(binary))
st.session_state['binary'] = binary

st.session_state['doc_id'] = hash = st.session_state['rqa'][model].create_memory_embeddings(
tmp_file.name,
chunk_size=chunk_size,
perc_overlap=0.1
)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, @Sanakhamassi here you need to either add tempFile in the with () or handle that somehow

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

document_qa/document_qa_engine.py:266

  • The type annotation says Tuple[List[Document], list], but this method actually returns a list of strings (context_as_text) plus coordinates. Update the annotation to match the returned value (e.g., Tuple[List[str], list]) or return List[Document] if that’s what callers should get.
    def query_storage(self, query: str, doc_id, context_size=4) -> Tuple[List[Document], list]:
        """
        Returns the context related to a given query
        """
        documents, coordinates = self._get_context(doc_id, query, context_size)

Comment thread streamlit_app.py
Comment on lines +325 to +331
st.session_state['binary'] = binary

st.session_state['doc_id'] = hash = st.session_state['rqa'][model].create_memory_embeddings(
tmp_path,
chunk_size=chunk_size,
perc_overlap=0.1
)
Comment on lines 229 to +233
output_parser=None,
context_size=4,
extraction_schema=None,
verbose=False
) -> (Any, str):
) -> Tuple[Any, str]:
return parsed_output

def _run_query(self, doc_id, query, context_size=4) -> (List[Document], list):
def _run_query(self, doc_id, query, context_size=4) -> Tuple[List[Document], list]:
Comment on lines +116 to 120
include_raw_affiliations=False,
generateIDs=True)
except Exception as exc:
raise GrobidServiceError("Grobid service did not respond.") from exc

grobid_url=None,
memory=None
memory=None,
ping_grobid_server: bool = False
lfoppiano added a commit that referenced this pull request Jun 7, 2026
Finalizes PR #35 (issue #11). Resolves the outstanding Copilot and
reviewer remarks and adds handling for Grobid responses that are empty
or truncated (HTTP 200 with no usable body), which were previously
undetected.

grobid_processors.process_structure (single validation chokepoint):
- Catch only requests.exceptions.RequestException instead of bare
  Exception, so local/usage errors (bad path, parser bugs) keep their
  real traceback instead of being mislabelled "Grobid did not respond".
- Raise GrobidServiceError for empty body, malformed/truncated XML, and
  well-formed XML with no extractable text.

document_qa_engine:
- ping_grobid_server now defaults to True (fail-fast for library users);
  Streamlit passes ping_grobid_server=False to degrade gracefully.
- Remove the now-dead "if not structure" guard.
- Fix return-type hints (query_document now consistently returns a
  3-tuple; query_storage -> tuple[List[str], list]; _run_query ->
  tuple[Any, list]); drop unused Tuple import; fix verbose hash print.

streamlit_app:
- Drop the `hash` alias that shadowed the builtin.
- Clean the error message (no duplicated status, correct punctuation).

tests:
- Use requests exceptions; assert local errors are not masked; add empty
  / malformed / no-extractable-text cases. 19 passed.
lfoppiano added a commit that referenced this pull request Jun 7, 2026
Finalizes PR #35 (issue #11). Resolves the outstanding Copilot and
reviewer remarks and adds handling for Grobid responses that are empty
or truncated (HTTP 200 with no usable body), which were previously
undetected.

grobid_processors.process_structure (single validation chokepoint):
- Catch only requests.exceptions.RequestException instead of bare
  Exception, so local/usage errors (bad path, parser bugs) keep their
  real traceback instead of being mislabelled "Grobid did not respond".
- Raise GrobidServiceError for empty body, malformed/truncated XML, and
  well-formed XML with no extractable text.

document_qa_engine:
- ping_grobid_server now defaults to True (fail-fast for library users);
  Streamlit passes ping_grobid_server=False to degrade gracefully.
- Remove the now-dead "if not structure" guard.
- Fix return-type hints (query_document now consistently returns a
  3-tuple; query_storage -> tuple[List[str], list]; _run_query ->
  tuple[Any, list]); drop unused Tuple import; fix verbose hash print.

streamlit_app:
- Drop the `hash` alias that shadowed the builtin.
- Clean the error message (no duplicated status, correct punctuation).

tests:
- Use requests exceptions; assert local errors are not masked; add empty
  / malformed / no-extractable-text cases. 19 passed.
lfoppiano added a commit that referenced this pull request Jun 7, 2026
This PR extended the work from @Sanakhamassi in #35
@lfoppiano

Copy link
Copy Markdown
Collaborator

Thanks @Sanakhamassi this PR has been merged through #43. There were some changes to be made and I simply done that in a local branch.

@lfoppiano lfoppiano closed this Jun 7, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants