- Updated `log4j2` to 2.17.1 (was 2.4.1); these are provided dependencies.
- Updated `spring-boot` to 1.5.22.RELEASE (was 1.3.8.RELEASE).
- Updated `spring-core` to 4.3.9.RELEASE (was 4.2.8.RELEASE).
- Fixed Google library conflicts with `guava` and `gson`.
- Various code changes to allow compilation and build on Java 11.
- Updated `hotels-oss-parent` version to 6.2.1 (was 5.0.0).
- Fixed issue where the rename table operation would be incorrect if the tables are in different databases.
- Added check in `delete` operation so that it doesn't try to delete empty lists of keys.
- Fixed issue where external Avro schemas generated lots of copy jobs. See #203.
- Added fields `sourceTable` and `sourcePartitions` to the `CopierContext` class.
- Added method `newInstance(CopierContext)` to `com.hotels.bdp.circustrain.api.copier.CopierFactory`. This provides Copiers with more configuration information in a future-proof manner. See #195.
- Deprecated other `newInstance()` methods on `com.hotels.bdp.circustrain.api.copier.CopierFactory`.
- Updated `hive.version` to 2.3.7 (was 2.3.2). This allows Circus Train to be used on JDK >= 9.
- Added replication mode `FULL_OVERWRITE`, which overwrites a previously replicated table and deletes its data. Useful for incompatible schema changes.
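A hedged sketch of selecting this mode in the YAML configuration (table names and locations are placeholders; check the Circus Train README for the authoritative keys):

```
table-replications:
  - replication-mode: FULL_OVERWRITE
    source-table:
      database-name: source_db
      table-name: my_table
    replica-table:
      database-name: replica_db
      table-name: my_table
      table-location: s3://replica-bucket/my_table/
```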
- Updated `S3S3Copier` to have a configurable maximum number of threads to pass to the `TransferManager`.
- Fixed `AssumeRoleCredentialProvider` not auto-renewing credentials on expiration.
- Fixed issue where replication breaks if struct columns have changed. See #173.
- Minimum supported Java version is now 8 (was 7).
- Updated `hotels-oss-parent` version to 5.0.0 (was 4.3.1).
- Updated property `aws-jdk.version` to 1.11.728 (was 1.11.505).
- Updated property `httpcomponents.httpclient.version` to 4.5.11 (was 4.5.5).
- When replicating tables with large numbers of partitions, `Replica.updateMetadata` now calls add/alter partition in batches of 1000. See #166.
- The Avro schema copier now re-uses the normal 'data' copier instead of its own. See #162.
- Changed the order of the generated partition filter used by "HiveDiff": it is now reverse natural order (which means new partitions first when partitions are date/time strings). When in doubt, use the circus-train-tool script `check-filters.sh` to see what would be generated.
- Fixed issue where `partition-limit` was not correctly applied when generating a partition filter. See #164.
- Default `avro-serde-options` must now be included within `transform-options`. This is a backwards-incompatible change to the configuration file. Please see Avro Schema Replication for more information.
- Updated `jackson` version to 2.10.0 (was 2.9.10).
- Updated `hotels-oss-parent` version to 4.2.0 (was 4.0.0). Contains updates to the copyright header.
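To illustrate the backwards-incompatible `avro-serde-options` move described above, a before/after sketch (the `base-url` option and bucket path are assumptions for illustration; see the Avro Schema Replication docs for the full set of options):

```
# Old (no longer supported):
avro-serde-options:
  base-url: s3://replica-bucket/avro-schemas/

# New:
transform-options:
  avro-serde-options:
    base-url: s3://replica-bucket/avro-schemas/
```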
- Table properties can now be added to default transformations.
- Added `copier-options.assume-role` to assume a role when using the S3S3 copier.
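As a hedged example of the new option (the role ARN is a placeholder):

```
copier-options:
  assume-role: arn:aws:iam::123456789012:role/example-replication-role
```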
- Table transformation to add custom properties to tables during a replication.
- If a user doesn't specify `avro-serde-options`, Circus Train will still copy the external schema over to the target table. See #131.
- Added `copier-options.assume-role` to assume a role when using the S3MapReduceCp copier class. See README.md for details.
- Excluded `org.pentaho:pentaho-aggdesigner-algorithm` from build.
- Fixed bug in `AbstractAvroSerDeTransformation` where the config state wasn't refreshed on every replication.
- Updated `jackson` version to 2.9.10 (was 2.9.8).
- Updated `beeju` version to 2.0.0 (was 1.2.1).
- Updated `circus-train-minimal.yml.template` to include the required `housekeeping` configuration for using the default schema with H2.
- Updated `housekeeping` version to 3.1.0 (was 3.0.6). Contains various housekeeping fixes.
- Updated `housekeeping` version to 3.0.6 (was 3.0.5). This change modifies the default script for creating a housekeeping schema (from `classpath:/schema.sql` to empty string) and can cause errors for users that use the schema provided by default. To fix the errors, the property `housekeeping.db-init-script` can be updated to `classpath:/schema.sql`, which uses a file provided by default by Circus Train.
- Updated `hotels-oss-parent` version to 4.0.0 (was 2.3.5).
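Per the note above, users depending on the default schema can restore the old behaviour with something like the following (YAML nesting assumed from the property name `housekeeping.db-init-script`):

```
housekeeping:
  db-init-script: classpath:/schema.sql
```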
- Clear partitioned state correctly for `SnsListener`. See #104.
- Fixed issue where in certain cases the table location of a partitioned table would be scheduled for housekeeping.
- Removed default script for creating a housekeeping schema to allow the use of schemas that are already created. See #111.
- Upgraded AWS SDK to remove deprecation warning. See #102.
- Upgraded `hcommon-hive-metastore` version to 1.3.0 (was 1.2.4) to fix Thrift compatibility bug. See #115.
- Added a configurable retry mechanism to handle flaky AWS S3 to S3 copying. See #56.
- Refactored project to remove checkstyle and findbugs warnings.
- Upgraded `hotels-oss-parent` version to 2.3.5 (was 2.3.3).
- Upgraded `housekeeping` version to 3.0.5 (was 3.0.0).
- Upgraded `jackson` version to 2.9.8 (was 2.9.7).
- Support for getting AWS Credentials within a FARGATE instance in ECS. See #109.
- Added replication-strategy configuration that can be used to support propagating deletes (drop table/partition operations). See README.md for more details.
- Ability to specify an S3 canned ACL via `copier-options.canned-acl`. See #99.
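For example (a sketch; `bucket-owner-full-control` is one of the standard S3 canned ACL names, chosen here purely for illustration):

```
copier-options:
  canned-acl: bucket-owner-full-control
```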
- Upgraded `hcommon-hive-metastore` version to 1.2.4 to fix an issue where the wrong exception was being propagated in the compatibility layer.
- Housekeeping can be configured to control query batch size, which controls memory usage. See #40.
- Housekeeping readme moved to Housekeeping project. See #31.
- Upgraded Housekeeping library to also store replica database and table name in Housekeeping database. See #30.
- Upgraded `hotels-oss-parent` pom to 2.3.3 (was 2.0.6). See #97.
- Narrowed component scanning to be internal base packages instead of `com.hotels.bdp.circustrain`. See #95. Note this change is not backwards compatible for any Circus Train extensions that are in the `com.hotels.bdp.circustrain` package - these were in effect being implicitly scanned and loaded but won't be now. Instead, these extensions will now need to be added using Circus Train's standard extension loading mechanism.
- Upgraded `jackson.version` to 2.9.7 (was 2.6.6), `aws-jdk.version` to 1.11.431 (was 1.11.126) and `httpcomponents.httpclient.version` to 4.5.5 (was 4.5.2). See #91.
- Refactored general metastore tunnelling code to leverage hcommon-hive-metastore libraries. See #85.
- Refactored the remaining code in `core.metastore` from `circus-train-core` to leverage hcommon-hive-metastore libraries.
- circus-train-gcp: avoid a temporary copy of the key file to `user.dir` when using an absolute path to the Google Cloud credentials file, by transforming it into a relative path.
- circus-train-gcp: a relative path can now be provided in the configuration for the Google Cloud credentials file.
- circus-train-vacuum-tool moved into Housekeeping project under the module housekeeping-vacuum-tool.
- Configuration classes moved from Core to API sub-project. See #78.
- Refactored general purpose Hive metastore code to leverage hcommon-hive-metastore libraries. See #72.
- Fixed issue where Avro schemas were not being replicated when an `avro.schema.url` without a scheme was specified. See #74.
- Fixed issue where Avro schemas were not being replicated when an HA NameNode is configured and the Avro replication feature is used. See #69.
- Added SSH timeout and SSH strict host key checking capabilities. See #64.
- Using the hcommon-ssh 1.0.1 dependency to fix an issue where metastore exceptions were lost and not propagated properly over tunnelled connections.
- Replaced SSH support with the hcommon-ssh library. See #46.
- Housekeeping was failing when attempting to delete a path which no longer exists on the replica filesystem. Upgraded Circus Train's Housekeeping dependency to a version which fixes this bug. See #61.
- Ability to select Copier via configuration. See #55.
- Clearer replica-check exception message. See #47.
- S3-S3 Hive Diff calculating incorrect checksum on folders. See #49.
- SNS message now indicates if message was truncated. See #41.
- Exclude Guava 17.0 in favour of Guava 20.0 for Google Cloud library compatibility.
- Add dependency management bom for Google Cloud dependencies.
- Backwards compatibility with Hive 1.2.x.
- Added ability to configure AWS Server Side Encryption for `S3S3Copier` via the `copier-options.s3-server-side-encryption` configuration property.
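A hedged sketch of enabling this (the value shown is an assumption; consult the README for the values the property actually accepts):

```
copier-options:
  s3-server-side-encryption: true
```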
- Upgrade housekeeping to version 1.0.2.
- Fixed issue where Google FileSystem classes were not being placed onto the `mapreduce.application.classpath` in S3MapReduceCp and DistCp mapreduce jobs.
- Google FileSystem and S3 FileSystems added to the `mapreduce.application.classpath` in circus-train-gcp and circus-train-aws respectively.
- Fixed Housekeeping failing due to missing credentials. See #23.
- Added `replicaTableLocation`, `replicaMetastoreUris` and `partitionKeys` to the SNS message.
- SNS message `protocolVersion` changed from "1.0" to "1.1".
- Updated documentation for the circus-train-aws-sns module (full reference of SNS message format, more examples).
- Fixed references to README.md in command line runner help messages to point to correct GitHub locations.
- Upgraded Hive version from 1.2.1 to 2.3.2 (changes are backwards compatible).
- Upgraded Spring Platform version from 2.0.3.RELEASE to 2.0.8.RELEASE.
- Replaced the `TunnellingMetaStoreClient` "concrete" implementation with a Java reflection `TunnellingMetaStoreClientInvocationHandler`.
- Replicating a partitioned table containing no partitions will now succeed instead of silently not replicating the table metadata.
- Most functionality from Housekeeping module moved to https://github.com/HotelsDotCom/housekeeping.
- Maven group ID changed to com.hotels.
- Exclude logback in parent POM.
- First open source release.
- Various small code cleanups.
- `S3S3Copier` captures cross-region replications from US-Standard AWS regions.
- Mock S3 end-point for HDFS-S3 and S3-S3 replications.
- New `S3MapreduceCp` properties to control the size of the buffer used by the S3 `TransferManager` and to control the upload retries of the S3 client. Refer to README.md for details.
- `EventIdExtractor` regex changed so that it captures new event IDs and legacy event IDs.
- Added a read limit to prevent the AWS library from trying to read more data than the size of the buffer provided by the Hadoop `FileSystem`.
- circus-train-housekeeping: support for storing housekeeping data in JDBC-compliant SQL databases.
- circus-train-parent updated to inherit from hww-parent version 12.1.3.
- Support for replication of Hive views.
- Removed circus-train-aws dependency on internal patched hadoop-aws.
- Fixed error when replicating partitioned tables with empty partitions.
- Removed circus-train-aws dependency from circus-train-core.
- Circus Train Tools to be packaged as TGZ.
- Updated to parent POM 12.1.0 (latest versions of dependencies and plugins).
- Relocate only Guava rather than Google Cloud Platform dependencies + Guava.
- Removed the `CircusTrainContext` interface as it was a legacy leftover way to make CT pluggable.
- Fixed broken Circus Train Tool Scripts.
- S3/HDFS to GS Hive replication.
- Support for users to be able to specify a list of `extension-packages` in their YAML configuration. This adds the specified packages to Spring's component scan, thereby allowing the loading of extensions via standard Spring annotations, such as `@Component` and `@Configuration`.
- Added circus-train-package module that builds a TGZ file with a runnable circus-train script.
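A minimal sketch of such a configuration entry (the package name is a placeholder; shown as a YAML list per the description above):

```
extension-packages:
  - com.example.circustrain.extensions
```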
- Changed public interface `com.hotels.bdp.circustrain.api.event.TableReplicationListener` for easier error handling.
- RPM module has been pulled out to a top-level project.
- New and improved HDFS-to-S3 copier.
- Changed default `instance.home` from `$user.dir` to `$user.home`. If you relied on `$user.dir`, please set the `instance.home` variable in the YAML config.
- Fixed issue in LoggingListener where the wrong number of altered partitions was reported.
- Fixed issue with the circus-train-aws-sns module.
- Added S3 to S3 replication.
- Should have been a major release: the public `com.hotels.bdp.circustrain.api.copier.CopierFactory.supportsSchemes` method has changed signature. Please adjust your code if you rely on this for Circus Train extensions.
- Clean up of AWS Credential Provider classes.
- Fixed tables in documentation to display correctly.
- Added support for skipping missing partition folder errors.
- Added support for transporting AvroSerDe files when the avro.schema.url is specified within the SERDEPROPERTIES rather than the TBLPROPERTIES.
- Fixed bug where table-location matching avro schema base-url causes an IOException to be thrown in future reads of the replica table.
- Added transformation config per table replication (used in the Avro SerDe transformation).
- Replaced usage of reflections.org with Spring's scanning provider.
- Documented methods for implementing custom copiers and data transformations.
- Cleaned up copier options configuration classes.
- Support for composite copiers.
- Extensions for `HiveMetaStoreClientFactory` to allow integration with AWS Athena.
- Updated to extend hdw-parent 9.2.2, which in turn upgrades `hive.version` to 1.2.1000.2.4.3.3-2.
- Support for downloading Avro schemas from the URI on the source table and uploading it to a user specified URL on replication.
- Multiple transformations can now be loaded onto the classpath for application on replication rather than just one.
- Update to new parent with HDP dependency updates for the Hadoop upgrade.
- Fixed bug where incorrect table name resulted in a NullPointerException.
- Added new "replication mode: METADATA_UPDATE" feature which provides the ability to replicate metadata only which is useful for updating the structure and configuration of previously replicated tables.
- Added new "replication mode: METADATA_MIRROR" feature which provides the ability to replicate metadata only, pointing the replica table at the original source data locations.
- Replication and Housekeeping can now be executed in separate processes.
- Add the option `--modules=replication` to the script `circus-train.sh` to perform replication only.
- Use the scripts `housekeeping.sh` and `housekeeping-rush.sh` to perform housekeeping in its own process.
- Made scripts and code workable with HDP-2.4.3.
- Fixed issue where wrong replication was listed as the failed replication.
- Support for replication from AWS to on-premises when running on on-premises cluster.
- Configuration element `replica-catalog.s3` is now `security`. The following is an example of how to migrate your configuration files to this new version:
Old configuration file
```
...
replica-catalog:
name: my-replica
hive-metastore-uris: thrift://hive.localdomain:9083
s3:
credential-provider: jceks://hdfs/<hdfs-location-of-jceks-file>/my-credentials.jceks
...
```
New configuration file
```
...
replica-catalog:
name: my-replica
hive-metastore-uris: thrift://hive.localdomain:9083
security:
credential-provider: jceks://hdfs/<hdfs-location-of-jceks-file>/my-credentials.jceks
...
```
- Exit codes based on success or error.
- Ignoring params that seem to be added in the replication process.
- Support sending `S3_DIST_CP_BYTES_REPLICATED`/`DIST_CP_BYTES_REPLICATED` metrics to Graphite for running (S3)DistCp jobs.
- Support for SSH tunnelling on the source catalog.
- Fixes for filter partition generator.
- Enabled the possibility to generate partition filters for incremental replication.
- Introduction of the transformations API: users can now provide a metadata transformation function for tables, partitions and column statistics.
- Fixed issue with deleted paths.
- Added some stricter preconditions to the vacuum tool so that data is not unintentionally removed from tables with inconsistent metadata.
- Added the 'vacuum' tool for removing data orphaned by a bug in circus-train versions earlier than 2.0.0.
- Moved the 'filter tool' into a 'tools' sub-module.
- Fixed issue where housekeeping would fail when two processes deleted the same entry.
- SSH tunnels with multiple hops. The property `replica-catalog.metastore-tunnel.user` has been replaced with `replica-catalog.metastore-tunnel.route` and the property `replica-catalog.metastore-tunnel.private-key` has been replaced with `replica-catalog.metastore-tunnel.private-keys`. Refer to README.md for details.
- The executable script has been split to provide both non-RUSH and RUSH executions. If you are not using RUSH then keep using `circus-train.sh`; if you are using RUSH then you can either change your scripts to invoke `circus-train-rush.sh` instead or add the new parameter `rush` in the first position when invoking `circus-train.sh`.
- Removal of property `graphite.enabled`.
- Improvements and fixes to the housekeeping process that manages old data deletion:
  - Binds `S3AFileSystem` to `s3[n]://` schemes in the tool for housekeeping.
  - Only remove housekeeping entries from the database if:
    - the leaf path described by the record no longer exists AND another sibling exists who can look after the ancestors,
    - OR the ancestors of the leaf path no longer exist.
  - Stores the eventId of the deleted path along with the path.
  - If an existing record does not have the previous eventId then reconstruct it from the path (to support legacy data for the time being).
- `DistCP` temporary path is now set per task.