Releases · openzim/gutenberg · GitHub

24 Nov 16:41

benoit74

3.0.1 Latest

Added

Add CLI flag to customize ZIM Name (#340)

Fixed

Add missing flags in offliner definition (#339)

28 Oct 13:50

benoit74

3.0.0

Breaking

Remove all optimization logic and use of the S3 cache, dropping the --optimization-cache and --use-any-optimized-version flags (#300)
Move to another default mirror and add the --mirror_url CLI flag to select the mirror to use, dropping the --rdf-url flag (#301)
Reengineer the scraper (#312)
- Source list of books from CSV at https://gutenberg.pglaf.org/cache/epub/feeds/pg_catalog.csv.gz for quicker startup
- Stop downloading the big RDF archive and download only the individual RDF files needed (saves a lot of time when only a few books are requested; the penalty appears negligible when many books are needed)
- Get rid of the SQLite database used to persist data across runs: too much maintenance effort and filesystem impact for limited benefit
- Get rid of the option to generate one ZIM per language; the scraper now always produces a single ZIM
- Data is now directly transferred to the ZIM, without touching the filesystem
- Get rid of the "steps" approach, not used (anymore?) in production and difficult to maintain
- Many CLI options removed: --use-any-optimized-version, --zim, --download, --parse, --prepare, -m/--one-language-one-zim, --dlc, -d/--dl-folder, -e/--static-folder
Drop support for Gutenberg bookshelves and add support for LibraryOfCongress (#265)
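
As an illustration of the CSV-based book listing described under "Reengineer the scraper" above, here is a minimal sketch, not the scraper's actual code. The function name `list_books` is hypothetical, and the column names (`Text#`, `Type`, `Language`) and the `;` separator for multi-language books are assumptions about the catalog layout. The sample data is made up so the example runs without network access.

```python
from __future__ import annotations

import csv
import gzip
import io


def list_books(catalog_gz: bytes, languages: set[str] | None = None) -> list[dict[str, str]]:
    """Return catalog rows of type "Text", optionally filtered by language code.

    catalog_gz is the raw gzip-compressed CSV content (e.g. a downloaded
    pg_catalog.csv.gz). Column names are assumptions, see the note above.
    """
    text = gzip.decompress(catalog_gz).decode("utf-8")
    books = []
    for row in csv.DictReader(io.StringIO(text)):
        if row.get("Type") != "Text":
            continue  # skip non-book entries
        # Assumption: multi-language books list their codes separated by ";"
        row_languages = {code.strip() for code in row.get("Language", "").split(";")}
        if languages is None or row_languages & languages:
            books.append(row)
    return books


# Tiny in-memory catalog standing in for the downloaded file
sample = gzip.compress(
    b"Text#,Type,Title,Language\n"
    b"11,Text,Alice's Adventures in Wonderland,en\n"
    b"9999,Sound,Some Recording,en\n"
    b"4650,Text,Candide,fr\n"
)
french = list_books(sample, {"fr"})
print([book["Text#"] for book in french])
```

Because only the catalog is read up front, the per-book RDF files can then be requested just for the IDs that survive the filter.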

Added

Add standard --output option to control where ZIM files are written (#314)
Add ability to generate smaller selection (#184)
Log failing book ID in case of fatal error (#333)

Changed

Finalize implementation of scraper progress (#289)
Move to another default mirror + add CLI flag to select mirror to use (#301)
Split build logic in Dockerfile to separate dependencies layer from scraper code layer (#302)
Prepare structure for TranslateWiki (#312)
Remove « .html » extension (#166)
Configure libzim verbosity based on --debug flag (#326)

Fixed

Stop ignoring HTML illustrations containing cover in their name (#270)
Fix JS/JSON files generation (#297, #298)
Fix navigation to bookshelves with special characters (#305)
Fix bookshelves with special characters that could not be opened (#306)
Fix internationalization of the "Copyrighted" license label (#253)
Properly compute (including sorting) ZIM Language items + allow to override with --zim-languages (#323)
Fix support for ZIMs without full-text index (#326)

06 Jun 16:06

benoit74

2.2.0

Added

Add support for --debug flag to output debug logs
Add support for -L long_description flags
Add request timeout for util.py (#197)
Add Booklanguage DB to support multi-languages books (#218)
Add RTL support to UI (#248)
Add language filter to combobox for requested languages (#249)

Changed

Simplify Gutenberg scraping (no more rsync, no more fallback URLs / filenames) (#97)
Prefer EPUB 3 to EPUB (#235)
Do not force the presence of PDF format for all books (#160)
Replace usage of os.path and path.py with pathlib.Path (#195)
Finalize ZIM metadata title translations and multilingual detection (#229)
Replaced magic number with named constant and clarified comment regarding book ID URL rules (#196)
Replace print and pp calls with logger (#192)
Update to Python 3.13
Update python-scraperlib to 5.1.1 and dependencies (#188)
Rename Book DB table fields (#199)
Update multi-resolution favicons (#165)

Fixed

Fix regression on missing HTML content (#219)
Simplify the logger name (now gutenberg2zim instead of gutenberg2zim.constants) (#206)
Add retry logic on book downloads (#254)
Fix UI and navigation glitches on bookshelves (#262)
Remove dependencies on binaries + buggy pngquant (#257)

17 Jan 13:45

benoit74

2.1.1

Added

Publisher ZIM metadata can now be customized at CLI (#210)

Changed

Publisher ZIM metadata default value is changed to openZIM instead of Kiwix (#210)

Fixed

Do not fail if temporary directory already exists (#207)
Typo in Scraper ZIM metadata (#212)
Adapt to hatchling v1.19.0 which mandates packages setting (#211)

18 Aug 15:22

benoit74

2.1.0

Changed

Fixed regression with broken filters on multiple-languages ZIM (#175)
Fixed Name metadata that was incorrectly including period (#177)
Fixed Language metadata (and filename) for multilang ZIMs (#174)
Using zimscraperlib 2.1.0
Using localized Title and Description metadata (#148)
Fixed regression with epub files stored as application/zip (#181)
Adopt Python bootstrap conventions, especially migration to hatch instead of setuptools and GitHub CI Workflows adaptations (#190)
Removed inline JavaScript in HTML files (#145)

Fixed

Support single quotes in author names (#162)
Migrated to another Gutenberg server (#187)
Removed useless file languages_06_2018 (#180)

Removed

Removed Datatables JS code from the repository; it is now fetched online (#116)
Dropped Python 2 support (#191)

21 Feb 08:46

rgaudin

2.0.0

Added

Progress report using --stats-filename

Changed

Updated dependencies, including zimscraperlib (2.0)
Now creating no-namespace ZIM with Illustration
Fixed/reduced sqlite timeouts
Better handling of rsync'd list of URLs
RDF files are not extracted to disk anymore (faster on selections)
Remove all URLs from DB before processing rsync'd ones
Fixed --concurrency short flag (now -c)
Docker image now uses python3.11
DB doesn't use a separate Format table anymore

Removed

Dependency on zimwriterfs binary
-r/--rdf-folder flag: rdf not extracted to disk anymore
--export: HTML files not written to disk first anymore
--dev: HTML files not written to disk first anymore
Binaries from docker images: jpegoptim, pngquant, gifsicle, zip, curl, p7zip
