Text Classification module

Content management Bundled: DX Core

Download	Multiple submodules
Edition	DX Core
License	MLA
Issues	TXTREC
Maven site	Text Classification
Latest	1.1.6

The Text Classification module uses the Amazon Comprehend service to analyze and tag your text content. Amazon Comprehend is a natural language processing (NLP) service that uses machine learning to find insights and relationships in text. Magnolia uses the AWS Key Phrases service (BatchDetectKeyPhrases) to detect key phrases in your content during the classification process.

A key phrase is a string containing a noun phrase that describes a particular thing. It generally consists of a noun and the modifiers that distinguish it. For example, "day" is a noun; "a beautiful day" is a noun phrase that includes an article ("a") and an adjective ("beautiful"). Each key phrase includes a score that indicates the level of confidence that Amazon Comprehend has that the string is a noun phrase. You can use the score to determine if the detection has high enough confidence for your application.

Module structure

artifactID	Description
`magnolia-text-classification-parent`	Parent reactor.
`magnolia-text-classification`	Provides the text classification module and service.
`magnolia-text-classification-api`	Provides an API to classify text.
`magnolia-amazon-text-classification`	Provides functionality to classify text via Amazon Comprehend.
`magnolia-pages-content-tags-integration`	Provides functionality to integrate content tags and the text classification service using decorations in the Pages app.
`magnolia-pages-content-tags-integration-compatibility`	Magnolia 6.2 compatibility submodule that provides the `CompatibilityPageTextAggregator`. This text aggregator gathers texts according to both old and new API dialogs of page areas and components.

Installing with Maven

Bundled modules are automatically installed for you.

If the module is unbundled, add the following dependencies to your bundle, in both your project's <dependencyManagement> section and your webapp's <dependencies> section. If the module is unbundled but its version is managed by the parent POM, add them only to your webapp's <dependencies> section.

<dependency>
  <groupId>info.magnolia.ai.text</groupId>
  <artifactId>magnolia-text-classification</artifactId>
  <version>1.1.6</version> (1)
</dependency>

1	Should you need to specify the module version, do it using `<version>`.

<dependency>
  <groupId>info.magnolia.ai.text</groupId>
  <artifactId>magnolia-text-classification-api</artifactId>
  <version>1.1.6</version> (1)
</dependency>

1	Should you need to specify the module version, do it using `<version>`.

<dependency>
  <groupId>info.magnolia.ai.text</groupId>
  <artifactId>magnolia-amazon-text-classification</artifactId>
  <version>1.1.6</version> (1)
</dependency>

1	Should you need to specify the module version, do it using `<version>`.

<dependency>
  <groupId>info.magnolia.ai.text</groupId>
  <artifactId>magnolia-pages-content-tags-integration</artifactId>
  <version>1.1.6</version> (1)
</dependency>

1	Should you need to specify the module version, do it using `<version>`.

<dependency>
  <groupId>info.magnolia.ai.text</groupId>
  <artifactId>magnolia-pages-content-tags-integration-compatibility</artifactId>
  <version>1.1.6</version> (1)
</dependency>

1	Should you need to specify the module version, do it using `<version>`.

Configuration

If you want to turn off the text classification feature, modify the file /text-classification/config.yaml in the Resource Files app, so that enabled is set to false. By default, it is set to true.

# turn off the module with this property
enabled: false (1)
aggregateDefinition:
  fieldTypes: [text, textField, richText, richTextField, composite, compositeField, switchable, switchableField]
termFilteringDefinition:
  excludedTerms: []

1	The text classification feature is disabled.

Turning off text classification can be helpful if you have Amazon credentials stored in the Passwords app, which could otherwise cause text classification to be enabled or generate warning messages in the log.

When using our out-of-the-box solution:

The pages-content-tags-integration submodule brings the content-tags functionality to the Pages app and handles aggregating text from the website workspace.
The magnolia-amazon-text-classification submodule provides an out-of-the-box implementation to use Amazon Comprehend.

This solution is straightforward to configure:

- Configure the connection to the Amazon Comprehend classification service.
- Configure text aggregation for the Pages app (website workspace) by specifying:
  - The field types to be aggregated (aggregateDefinition).
  - Any terms you want to blacklist (termFilteringDefinition). For example, you may want to filter out your company name.
- Adjust the minConfidence property to change the minimum classification confidence score.

If you so require, you can also write:

Your own text aggregator implementation to run text classification on a custom content app.
Your own text classifier implementation to use another third-party text classification service to classify and tag your content.

Amazon Comprehend service

AWS service permissions

First, make sure that you have acquired appropriate permissions for the service in the Amazon IAM Management Console.

Add your security credentials to Magnolia

You need an AWS access key to make secure REST requests to the Amazon Comprehend API. Access keys consist of two parts:

Access Key ID
Secret Access Key

Generate the key in the security credentials section of the Amazon IAM Management Console. (In the navigation bar on the upper right, choose your user name, and then choose My Security Credentials.)

Add the two parts of the key to your Magnolia instance in the Passwords app using the following names:

aws-credentials
- aws_access_key_id
- aws_secret_access_key

For more information about the key, see Understanding and Getting Your Security Credentials.

Configuring the service

Under /amazon-text-classification/config.yaml, you must configure the following properties for the classification service:

region:
  name: your_aws_region_name
languageCode: en
minConfidence: 0.85

Properties

Property Description

region name

required

Label designating a regional endpoint to which the text classification service connects, such as eu-west-1.

You must set a region name to configure the Amazon Comprehend service in Magnolia.

To reduce data latency, AWS offers several regional endpoints. Each of the endpoints can be referred to in service configurations by a region name, for example eu-west-1. Note that if you pick a region that does not support this service, you may get erratic results.

For a list of available regions and labels, see docs.aws.amazon.com/general/latest/gr/rande.html#comprehend_region.

languageCode

required, default is `en`

The language of the input documents. You can specify any of the primary languages supported by Amazon Comprehend: German (de), English (en), Spanish (es), French (fr), Italian (it), or Portuguese (pt). All documents must be in the same language.

Amazon Comprehend can perform text analysis on English, French, German, Italian, Portuguese, and Spanish texts.

minConfidence

required, default is `0.85`

The minimum confidence score that a key phrase must reach to be kept as a tag.

A decimal value between 0 and 1. The filter drops the tags with a confidence score lower than the value of this property.

The Amazon Comprehend solution returns a confidence score for each key phrase tag. Tags with a confidence score lower than the value of the minConfidence property are dropped.

Setting the value higher usually results in fewer key phrase tags being returned for your content. A higher confidence score means that the tag more correctly describes the text.
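The effect of minConfidence can be sketched in plain Java. This is an illustration only, not the module's code: the class MinConfidenceFilterSketch, the method keepConfident, and the sample phrases and scores are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical illustration; not part of the Text Classification module.
class MinConfidenceFilterSketch {

    // Keeps the phrases whose score is at least minConfidence, in input order.
    static List<String> keepConfident(Map<String, Double> scoredPhrases, double minConfidence) {
        return scoredPhrases.entrySet().stream()
                .filter(entry -> entry.getValue() >= minConfidence)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up key phrases and confidence scores.
        Map<String, Double> scored = new LinkedHashMap<>();
        scored.put("a beautiful day", 0.99);
        scored.put("the lake", 0.91);
        scored.put("kind of", 0.42);

        System.out.println(keepConfident(scored, 0.85)); // default threshold
        System.out.println(keepConfident(scored, 0.95)); // stricter threshold, fewer tags
    }
}
```

With the default of 0.85, two of the three sample phrases survive; raising the threshold to 0.95 leaves only one, which matches the behavior described above.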

Configuring text aggregators

The pages-content-tags-integration module brings the content-tags functionality to the Pages app and handles aggregating text from the website workspace out-of-the-box.

The pages-content-tags-integration-compatibility module handles aggregating text for both the legacy Magnolia 5 UI Pages app and the new Magnolia 6 UI Pages app.

Text aggregators collect and aggregate the content that the classification service analyzes and generates tags from. You can specify from which field types content should be taken in the text aggregator configuration.

Defining field types

By default, the text aggregator for the Pages app gathers text from text, rich text, composite and switchable field types.

text-classification/src/main/resources/text-classification/config.yaml

aggregateDefinition:
  fieldTypes: [text, textField, richText, richTextField, composite, compositeField, switchable, switchableField]
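As a rough illustration of what restricting aggregation to certain field types means, the following Java sketch joins the values of only those fields whose type appears in fieldTypes. It is not the module's aggregator, which works on page content in the website workspace; the Field class, the aggregate method, and the sample values are hypothetical.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical illustration; not the module's text aggregator.
class FieldTypeAggregationSketch {

    // Stand-in for a dialog field: its configured type and its stored value.
    static final class Field {
        final String type;
        final String value;

        Field(String type, String value) {
            this.type = type;
            this.value = value;
        }
    }

    // Joins the values of all fields whose type is listed in fieldTypes.
    static String aggregate(List<Field> fields, Set<String> fieldTypes) {
        return fields.stream()
                .filter(field -> fieldTypes.contains(field.type))
                .map(field -> field.value)
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        Set<String> fieldTypes = Set.of("text", "textField", "richText", "richTextField");
        List<Field> fields = List.of(
                new Field("textField", "Welcome to Basel."),
                new Field("linkField", "/travel/basel"), // type not listed, so it is skipped
                new Field("richTextField", "The old town is worth a visit."));

        System.out.println(aggregate(fields, fieldTypes));
    }
}
```

Only the text and rich text values reach the aggregated string that would be sent for classification; the link value is ignored because its type is not in the list.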

Excluding terms from the classification tags

You can blacklist the terms you do not want to appear in your tags. For example, you may want to exclude your company name.

To do so, open /text-classification/config.yaml in the Resource Files app and add comma-separated terms to the excludedTerms list. In this example, the words ACME, corporation and coyote are excluded:

text-classification/src/main/resources/text-classification/config.yaml

termFilteringDefinition:
  excludedTerms: [ACME, corporation, coyote]

Note that the blacklist is case insensitive.
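The following Java sketch shows one way case-insensitive exclusion could behave. It assumes that a tag is dropped only when the whole tag equals an excluded term; how the module actually matches terms is not specified here. The class ExcludedTermsSketch, the method filterTags, and the sample tags are hypothetical.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical illustration; not the module's implementation.
class ExcludedTermsSketch {

    // Drops every tag that equals an excluded term, ignoring case.
    static List<String> filterTags(List<String> tags, List<String> excludedTerms) {
        Set<String> excluded = excludedTerms.stream()
                .map(term -> term.toLowerCase(Locale.ROOT))
                .collect(Collectors.toSet());
        return tags.stream()
                .filter(tag -> !excluded.contains(tag.toLowerCase(Locale.ROOT)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> excludedTerms = List.of("ACME", "corporation", "coyote");
        List<String> tags = List.of("acme", "Coyote", "road runner", "desert");

        System.out.println(filterTags(tags, excludedTerms)); // [road runner, desert]
    }
}
```

Both "acme" and "Coyote" are removed even though their casing differs from the configured terms.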

Creating custom content app text aggregators

If you want to run text classification on a custom content app, you must write your own text aggregator implementation.

To do so:

Implement the TextAggregator interface.
TextAggregator uses multi-binding so you must annotate it with @Multibinding and add it to the module descriptor as a component for injection. For example, see pages-content-tags-integration/src/main/resources/META-INF/magnolia/pages-content-tags-integration.xml.
Decorate the text-classification configuration file, for example:

customModule-content-tags-integration/decorations/text-classification/config/config.yaml

workspaceClassificationConfigurations:
  website:
    textAggregatorClassName: info.magnolia.ai.text.YourTextAggregator
    workspace: yourworkspace
    nodeType: mgnl:yournodetype

Properties

Property	Description
`workspaceClassificationConfigurations`	required
`website`	required Arbitrary, unique name for the decoration configuration.
`textAggregatorClassName`	required Fully qualified classname for your text aggregator. Example: `info.magnolia.ai.text.PageTextAggregator`
`workspace`	required The workspace where the content to be analyzed is stored.
`nodeType`	required The name of the JCR node type for storing an item of the given content type. Example: `mgnl:page`

Creating custom text classifiers

The magnolia-amazon-text-classification submodule provides an out-of-the-box implementation to use Amazon Comprehend.

However, if you want to use another third-party text classification service to classify and tag your content, you can write your own custom text classifier implementation.

Before configuring the text classifier, make sure you have administrator access to your third-party classification service, including the API documentation.

To create a custom text classifier you must implement the info.magnolia.ai.text.TextClassifier interface.

Note that you can inject the TextClassifier interface as a component in any running instance of Magnolia.

/**
 * Commons interface to classify text.
 */
public interface TextClassifier {

    /**
     * Takes a {@link String text} as parameter and returns a {@link Collection collection}
     * of {@link TextLabel Text label}s as output.
     *
     * <p>
     * Returns empty collection for the cases below:
     * <li>Upon exception</li>
     * <li>Text couldn't be classified</li>
     * </p>
     */
    Collection<TextLabel> classify(String text);

    /**
     * Takes a collection containing the text of the input documents as a parameter.
     *
     * @param texts
     * A collection containing the text of the input documents.
     * @return Returns a {@link Map map} where keys are input texts, values are {@link Collection collection}s of detected {@link TextLabel Text label}s
     * for the input text or empty collections if an error occurs while processing the input text.
     * The returned map preserves the order of the texts in the input collection.
     *
     * <p>
     * Returns an empty map for the cases below:
     * <li>Input {@link Collection collection} is null or empty</li>
     * <li>All documents in input {@link Collection collection} are processed with an error</li>
     * </p>
     */
    default Map<String, Collection<TextLabel>> classify(Collection<String> texts) {
        if (CollectionUtils.isNotEmpty(texts)) {
            return texts.stream()
                    .collect(Collectors.toMap(mapper -> mapper, this::classify));
        }
        return Collections.emptyMap();
    }
}

Only one TextClassifier should be used in a Magnolia instance. Remove the out-of-the-box AmazonTextClassifier if you choose to implement your own.

If you have more than one module that specifies the TextClassifier implementation in the module class, the TextClassifier from the module that was started last is used.

See the following files for an example implementation:

info.magnolia.ai.text.amazon.AmazonTextClassifier
META-INF/magnolia/amazon-text-classification.xml

Running text classification

Text classification and tagging are executed during the startup of the author instance. You can also trigger the action manually in the Pages app by selecting one or more pages and clicking the Run classification action.

Pages that have already been tagged are marked as such using a JCR property called lastTaggingAttemptDateByTextClassifier. Executing the manual classification action forces new tags to be set even if the content was previously tagged.

The text classification feature is available only on author instances.

Removing tags

Once a page has been tagged, you can remove some or all of the tags by selecting the page and clicking the Modify tags action in the Pages app.

In the dialog box that opens, you can remove individual tags or click Remove all tags.

Note that content tagging currently has an issue when creating tags of words with accented characters. For example, Genève is tagged as Gen-ve. This means that searching for the tag Geneve or Genève will not return any results. The issue is being tracked in CONTTAGS-69 (No support of special characters).
