Commit bcdaefc

Merge pull request #319 from korpling/less-frequent-corpusstorage-status
Improve logging
2 parents 423d4e8 + e4b71f2 commit bcdaefc

22 files changed

Lines changed: 459 additions & 257 deletions

CHANGELOG.md

Lines changed: 24 additions & 16 deletions
@@ -5,10 +5,18 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
-## Fixed
+### Added
+
+- New optional `file` option for the `[logging]` section in the webservice
+  configuration. Can be used to additionally output all log messages to the given
+  file.
+- `Graph::ensure_loaded_parallel` returns the actually loaded components that did
+  exist.
+
+### Fixed
 
-- Crash could occur when finding inversed connected nodes in PrePost graph
-  storage due to a subtraction resulting in negative number.
+- Less frequent corpus cache status updates in log. Before, every corpus access
+  could trigger an entry into the log which is not desired under heavy load.
 
 ## [3.7.1] - 2025-04-14
 
@@ -53,7 +61,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ### Fixed
 
 - Fixed out of bounds error parsing legacy meta queries with multiple
-  alternatives (https://github.com/korpling/graphANNIS/pull/308)
+  alternatives (https://github.com/korpling/graphANNIS/pull/308)
 
 ## [3.5.0] - 2024-09-02
 
@@ -75,7 +83,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ### Fixed
 
 - Do not use recursion to calculate the indirect coverage edges in the model
-  index, since this could fail for deeply nested structures.
+  index, since this could fail for deeply nested structures.
 
 ## [3.3.3] - 2024-07-12
 
@@ -85,7 +93,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
   importer.
 - Fix `FileTooLarge` error when searching for token precedence where the
   statistics indicate that this search is impossible.
-
+
 ## [3.3.2] - 2024-07-04
 
 ### Fixed
@@ -295,7 +303,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - Compile releases on Ubuntu 20.04 instead of 18.04, which means the minimal
   GLIBC version is 2.31. This is necessary, since GitHub actions deprecated this
   Ubuntu version.
-
+
 
 ### Fixed
 
@@ -375,7 +383,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
   first token of context regions in `subgraph` when the returned context regions
   do not overlap. This allows sorting the context regions that belong to the
   same data source but are not connected by ordinary `Ordering/annis/` edges.
-
+
 
 ## [2.2.2] - 2022-07-26
 
@@ -487,7 +495,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
   C-API), so this release is not technically backwards-compatible. Adapting to
   the updated API should be restricted to handle the errors returned by the
   functions.
-- The changes to the error handling also affect the C-API. The following
+- The changes to the error handling also affect the C-API. The following
   functions now have an `ErrorList` argument:
   * `annis_cs_list_node_annotations`
   * `annis_cs_list_edge_annotations`
@@ -525,7 +533,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - RelANNIS version 3.3 files with segmentation might also have a missing "span" column.
   In case the "span" column is null, always attempt to reconstruct the actual value from
   the corresponding node annotation instead of failing directly.
-
+
 ### Changed
 
 - Avoid unnecessary compacting of disk tables when collecting graph updates during import.
@@ -542,7 +550,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
   adjacency lists. This improves search for tokens because the Coverage components
   are typically adjacency lists, and we need to make sure the token nodes don't
   have any outgoing edges.
-- Fixed miscalculation of whitespace string capacity which could lead to
+- Fixed miscalculation of whitespace string capacity which could lead to
   `memory allocation failed` error.
 
 ## [1.4.0] - 2021-12-03
@@ -554,9 +562,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ### Fixed
 
 - Limit the used main memory cache per `DiskTable` by only using a disk block cache for the C1 table.
-  Since we use a lot of disk-based maps during import of relANNIS files, the previous behavior could
+  Since we use a lot of disk-based maps during import of relANNIS files, the previous behavior could
   add up to > 1GB easily, which amongst other issues caused #205 to happen.
-  With this change, during relANNIS import the main memory usage should be limited to be less than 4GB,
+  With this change, during relANNIS import the main memory usage should be limited to be less than 4GB,
   which seems more reasonable than the previous 20+GB
 - Reduce memory footprint during import when corpus contains a lot of escaped strings (as in #205)
 - Avoid creating small fragmented main memory when importing corpora from relANNIS to help to fix #205
@@ -592,7 +600,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Added
 
-- Added generic operator negation without existence assumption,
+- Added generic operator negation without existence assumption,
   if only one side of the negated operator is optional (#187).
 
 ## [1.1.0] - 2021-09-09
@@ -674,7 +682,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Removed
 
-- Replaced the `update_statistics` function in `CorpusStorage` with the more general `reoptimize_implementation` function.
+- Replaced the `update_statistics` function in `CorpusStorage` with the more general `reoptimize_implementation` function.
   The new function is available via the `re-optimize` command in the CLI.
 
 ### Added
@@ -684,7 +692,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Fixed
 
-- Importing a relANNIS corpus could fail because the integer would wrap around from negative to a large value when calculating the `tok-whitespace-after` annotation value. This large value would then be used to allocate memory, which will fail.
+- Importing a relANNIS corpus could fail because the integer would wrap around from negative to a large value when calculating the `tok-whitespace-after` annotation value. This large value would then be used to allocate memory, which will fail.
 - Adding `\$` to the escaped input sequence in the relANNIS import, fixing issues with some old SFB 632 corpora
 - Unbound near-by-operator (`^*`) was not limited to 50 in quirks mode
 - Workaround for duplicated document names when importing invalid relANNIS corpora

cli/tests/cli.rs

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ fn standard_filter() -> Settings {
     // Filter out the time stamps
     settings.add_filter("[0-9]+:[0-9]+:[0-9]+ ", "12:00:00");
     // The loaded and also total available RAM size can vary
-    settings.add_filter("[0-9.]+ [MG]B / [0-9.]+ [MG]B", "100 / 300 MB");
+    settings.add_filter("[0-9.]+[MG]B / [0-9.]+[MG]B", "100MB / 300MB");
     // The loading and time can vary
     settings.add_filter("in [0-9]+ ms", "in 10 ms");
     settings

cli/tests/snapshots/cli__list_corpora_fully_loaded.snap

Lines changed: 1 addition & 3 deletions
@@ -15,8 +15,7 @@ success: true
 exit_code: 0
 ----- stdout -----
 12:00:00[INFO] Loaded corpus sample-disk-based-3.3
-12:00:00[INFO] Total cache size is 100 / 300 MB and loaded corpora are: sample-disk-based-3.3.
-12:00:00[INFO] Total cache size is 100 / 300 MB and loaded corpora are: sample-disk-based-3.3.
+12:00:00[INFO] Corpus cache after preloading sample-disk-based-3.3: 100MB / 300MB - loaded corpora [sample-disk-based-3.3]
 12:00:00[INFO] Preloaded corpus in 10 ms
 sample-disk-based-1.5 (not loaded)
 sample-disk-based-3.2 (not loaded)
@@ -27,4 +26,3 @@ sample-memory-based-3.3 (not loaded)
 graphANNIS says good-bye!
 
 ----- stderr -----
-

cli/tests/snapshots/cli__list_corpora_partially_loaded.snap

Lines changed: 1 addition & 3 deletions
@@ -15,8 +15,7 @@ success: true
 exit_code: 0
 ----- stdout -----
 12:00:00[INFO] Loaded corpus sample-disk-based-3.3
-12:00:00[INFO] Total cache size is 100 / 300 MB and loaded corpora are: sample-disk-based-3.3.
-12:00:00[INFO] Total cache size is 100 / 300 MB and loaded corpora are: sample-disk-based-3.3.
+12:00:00[INFO] Corpus cache after loading components: 100MB / 300MB - loaded corpora [sample-disk-based-3.3]
 12:00:00[INFO] Executed query in 10 ms
 result: 44 matches in 4 documents
 sample-disk-based-1.5 (not loaded)
@@ -28,4 +27,3 @@ sample-memory-based-3.3 (not loaded)
 graphANNIS says good-bye!
 
 ----- stderr -----
-

cli/tests/snapshots/cli__show_corpus_info.snap

Lines changed: 1 addition & 3 deletions
@@ -15,8 +15,7 @@ success: true
 exit_code: 0
 ----- stdout -----
 12:00:00[INFO] Loaded corpus sample-disk-based-3.3
-12:00:00[INFO] Total cache size is 100 / 300 MB and loaded corpora are: sample-disk-based-3.3.
-12:00:00[INFO] Total cache size is 100 / 300 MB and loaded corpora are: sample-disk-based-3.3.
+12:00:00[INFO] Corpus cache after preloading sample-disk-based-3.3: 100MB / 300MB - loaded corpora [sample-disk-based-3.3]
 12:00:00[INFO] Preloaded corpus in 10 ms
 Status: "fully loaded"
 Token search shortcut possible: true
@@ -75,4 +74,3 @@ Status: "fully loaded"
 graphANNIS says good-bye!
 
 ----- stderr -----
-
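
The new status line in the snapshots above names the triggering action, the used and maximum cache size, and the loaded corpora. A formatter for such a line could look roughly like the following. This is a minimal sketch, not the graphANNIS implementation: `format_size`, `cache_status`, the decimal unit thresholds, the one-decimal rounding, and the `, ` separator between corpus names are all assumptions made for illustration.

```rust
/// Format a byte count without a space before the unit, e.g. `100.0MB`,
/// so that it matches the test filter `[0-9.]+[MG]B`.
/// Hypothetical helper: thresholds and rounding are assumptions.
fn format_size(bytes: u64) -> String {
    const MB: f64 = 1_000_000.0;
    const GB: f64 = 1_000_000_000.0;
    let b = bytes as f64;
    if b >= GB {
        format!("{:.1}GB", b / GB)
    } else {
        format!("{:.1}MB", b / MB)
    }
}

/// Build one status line in the shape used by the updated snapshots.
/// `reason` describes what triggered the update, e.g. "loading components".
fn cache_status(reason: &str, used: u64, max: u64, loaded: &[&str]) -> String {
    format!(
        "Corpus cache after {}: {} / {} - loaded corpora [{}]",
        reason,
        format_size(used),
        format_size(max),
        loaded.join(", ")
    )
}
```

Emitting this line only after an action that actually changes the cache (preloading a corpus, loading components), rather than on every corpus access, is what keeps the log quiet under heavy load.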

core/src/graph/mod.rs

Lines changed: 10 additions & 3 deletions
@@ -927,8 +927,13 @@ impl<CT: ComponentType> Graph<CT> {
     }
 
     /// Ensure that the graph storage for the given component is loaded and ready to use.
-    /// Loading is done in paralell.
-    pub fn ensure_loaded_parallel(&mut self, components_to_load: &[Component<CT>]) -> Result<()> {
+    /// Loading is done in parallel.
+    ///
+    /// Returns the list of actually loaded (and existing) components.
+    pub fn ensure_loaded_parallel(
+        &mut self,
+        components_to_load: &[Component<CT>],
+    ) -> Result<Vec<Component<CT>>> {
         // We only load known components, so check the map if the entry exists
         // and that is not loaded yet.
         let components_to_load: Vec<_> = components_to_load
@@ -959,11 +964,13 @@ impl<CT: ComponentType> Graph<CT> {
             .collect();
 
         // insert all the loaded components
+        let mut result = Vec::with_capacity(loaded_components.len());
         for (c, gs) in loaded_components {
             let gs = gs?;
             self.components.insert(c.clone(), Some(gs));
+            result.push(c.clone());
         }
-        Ok(())
+        Ok(result)
     }
 
     pub fn optimize_impl(&mut self, disk_based: bool) -> Result<()> {
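
The changed contract of `ensure_loaded_parallel` (return only the components that existed and were loaded by this call) can be illustrated with a simplified, sequential model. This is a sketch under stated assumptions, not the graphANNIS code: `ToyGraph`, `Storage`, the string component names, and the in-memory "loader" are hypothetical stand-ins, and the parallel loading step is left out.

```rust
use std::collections::BTreeMap;

/// Stand-in for a loaded graph storage (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct Storage(String);

/// Toy graph: a `None` value means the component is known but not loaded yet.
struct ToyGraph {
    components: BTreeMap<String, Option<Storage>>,
}

impl ToyGraph {
    /// Load the requested components that are known and not loaded yet,
    /// and return exactly the components loaded by this call.
    fn ensure_loaded(&mut self, requested: &[&str]) -> Result<Vec<String>, String> {
        // Keep only known components whose storage is still missing.
        let to_load: Vec<String> = requested
            .iter()
            .filter(|c| matches!(self.components.get(**c), Some(None)))
            .map(|c| c.to_string())
            .collect();

        let mut result = Vec::with_capacity(to_load.len());
        for c in to_load {
            // Hypothetical loader; the real code reads the storage from disk.
            let gs = Storage(format!("storage for {}", c));
            self.components.insert(c.clone(), Some(gs));
            result.push(c);
        }
        Ok(result)
    }
}
```

A caller can use the returned list to decide whether anything changed: an empty list means the cache contents are the same as before, so no status line needs to be written.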

docs/src/rest/configuration.md

Lines changed: 13 additions & 7 deletions
@@ -17,6 +17,9 @@ cache = {PercentOfFreeMemory = 25.0}
 
 [logging]
 debug = false
+# Optional path to a logging file.
+# If not given, only log to stdout/stderr
+file = "/var/log/graphannis.log"
 
 [auth]
 anonymous_access_all_corpora = false
@@ -38,25 +41,28 @@ A new database file will be created at this path when the service is started and
 Also, you can decide if you want to prefer disk-based storage of annotations by setting the value for the `disk_based` key to `true`.
 
 You can configure how much memory is used by the service for caching loaded corpora with the `cache` key.
-There are two types of strategies:
+There are two types of strategies:
 
-- `PercentOfFreeMemory` estimates the free space of memory for the system during startup and only uses the given value (as percent) of the available free space.
+- `PercentOfFreeMemory` estimates the free space of memory for the system during startup and only uses the given value (as percent) of the available free space.
 - `FixedMaxMemory` will use at most the given value in Megabytes.
 
 For example, setting the configuration value to
 ```toml
 cache = {PercentOfFreeMemory = 80.0}
-```
-will use 80% of the available free memory and
+```
+will use 80% of the available free memory and
 ```toml
 cache = {FixedMaxMemory = 8000}
-```
+```
 at most 8 GB of RAM.
 
 ## [logging] section
 
-Per default, graphANNIS will only output information, warning and error messages.
-To also enable debug output, set the value for the `debug` field to `true`.
+Per default, graphANNIS will only output information, warning and error
+messages. To also enable debug output, set the value for the `debug` field to
+`true`. You can set the optional value `file` to a file path to also add the log
+messages to the given file. **The log file is not emptied automatically; you
+have to clean it regularly**, e.g. with `logrotate` on a Linux server.
 
 ## [auth] section
 
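
The behaviour described for the `file` option (log to the console as before and additionally append to a file that is never truncated) can be sketched with the standard library alone. This is an illustration under assumptions, not how the webservice sets up its logger: `log_line` and its signature are hypothetical.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Write one log line to the console writer and, if a path is configured,
/// append the same line to that file (hypothetical helper).
fn log_line<W: Write>(console: &mut W, file: Option<&Path>, line: &str) -> io::Result<()> {
    writeln!(console, "{}", line)?;
    if let Some(path) = file {
        // Append mode keeps existing content, so the file only ever grows.
        let mut f = OpenOptions::new().create(true).append(true).open(path)?;
        writeln!(f, "{}", line)?;
    }
    Ok(())
}
```

Because the file is opened in append mode, nothing in this sketch ever shrinks it, which is why an external tool such as `logrotate` is needed to keep its size in check.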