Text Generation in Customer Service (Part 2)

Retrieval Performance

Generation Quality (Standard Scores)

Figure 1: Test set results of the proposed response generation model on DC and TT dataset. Automatic scoring metrics are: AverageSentenceSimilarity (S), BLEU-4 (B), NIST (N), METEOR (M), ROUGE-L (R) and CIDEr (C).

For the Retrieve Only method we consider only the fetched historical response (without refinement) as the hypothesis. For the baseline, RetRef, and Hybrid methods, we consider hypotheses produced with the (top-k, top-p with temperature) decoding setup, as its corpus-level score was better than that of other combinations. In the ranking-enforced versions of RetRef and Hybrid, instead of generating anew, for each query we picked the hypothesis with the highest rank score for evaluation.
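The (top-k, top-p with temperature) filtering behind this decoding setup can be sketched as follows. This is a minimal illustration of the standard truncation logic, not our exact generation configuration; the function name and cutoff values are illustrative assumptions:

```python
import numpy as np

def sample_filter(logits, top_k=50, top_p=0.9, temperature=0.7):
    """Turn raw logits into a truncated sampling distribution:
    keep the top_k highest-scoring tokens, then keep the smallest
    prefix of tokens whose cumulative probability reaches top_p."""
    logits = np.asarray(logits, dtype=float) / temperature
    # top-k: mask everything below the k-th largest logit
    k = min(top_k, logits.size)
    kth = np.sort(logits)[-k]
    logits = np.where(logits < kth, -np.inf, logits)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # top-p (nucleus): keep the smallest set of tokens covering mass top_p
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    probs[order[np.searchsorted(cum, top_p) + 1:]] = 0.0
    return probs / probs.sum()
```

Sampling then draws the next token from the returned distribution instead of the full softmax, which suppresses the low-probability tail that tends to produce incoherent responses.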

To assess our model, we use commonly adopted metrics: Average Sentence Similarity, BLEU, NIST, METEOR, ROUGE-L and CIDEr. The average sentence similarity score measures the semantic similarity between the reference response and the hypothesis. We use sentence-BERT [1], a Siamese BERT network trained to produce semantically meaningful sentence embeddings, to encode a reference and a hypothesis, and then calculate the cosine similarity of the resulting embeddings. The final similarity score is the mean value over the test set.
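Given sentence embeddings for the references and hypotheses (in our setup these come from sentence-BERT; the toy vectors below merely stand in for such embeddings), the metric reduces to a mean cosine similarity:

```python
import numpy as np

def avg_sentence_similarity(ref_embs, hyp_embs):
    """Mean cosine similarity between paired reference/hypothesis
    sentence embeddings (one row per test example)."""
    ref = np.asarray(ref_embs, dtype=float)
    hyp = np.asarray(hyp_embs, dtype=float)
    # normalize each row, then the dot product is the cosine
    ref = ref / np.linalg.norm(ref, axis=1, keepdims=True)
    hyp = hyp / np.linalg.norm(hyp, axis=1, keepdims=True)
    return float(np.mean(np.sum(ref * hyp, axis=1)))
```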

For both datasets, our proposed retrieval-based response generation model (RetRef+Rank) outperforms all other baselines on all metrics (Figure 1). Specifically, it achieves an average improvement across all metrics of 16% on the DC dataset and 7% on the TT dataset over the fine-tuned GPT-2 baseline without retrieval. Understandably, knowledge retrieval plays a key role in this improvement. On the other hand, without refinement, the Retrieve Only approach yields the worst scores. The Hybrid version can switch between the baseline and RetRef depending on the availability of suitable retrieved responses, and it is evaluated including such test cases. Nevertheless, it outperformed the baseline model by a significant margin across all metrics and datasets.

To measure the extent to which the model incorporates retrieved knowledge during generation, following previous work we measure the word overlap between generated and retrieved responses. The results show that our RetRef+Rank model retained more than 70% of the words from the retrieved information in 51% and 57% of the test generations on the DC and TT datasets, respectively. This is a clear improvement over the baseline and the basic RetRef model, which show such overlap less frequently.
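The overlap statistic can be computed along these lines; the exact tokenization and the direction of the overlap ratio are assumptions in this sketch:

```python
def retained_fraction(generated: str, retrieved: str) -> float:
    """Fraction of the retrieved response's word types that
    reappear in the generated response."""
    gen = set(generated.lower().split())
    ret = set(retrieved.lower().split())
    return len(gen & ret) / len(ret) if ret else 0.0

def high_overlap_rate(pairs, threshold=0.7):
    """Share of (generated, retrieved) test pairs whose generation
    retains more than `threshold` of the retrieved words."""
    hits = sum(retained_fraction(g, r) > threshold for g, r in pairs)
    return hits / len(pairs) if pairs else 0.0
```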

Qualitative Analysis

Ablation Study

Examples and Discussion

Table 1: Sample response generation using our RetRef+Rank model

Table 1 provides a randomly selected generation. It suggests that our model’s response is aligned with the customer letter type: the customer inquiry is answered with the appropriate clarification, appreciation and information. In addition, having access to historical knowledge, our model is not only capable of producing an informed response but also refines it according to the query. Limitations of our model include its inability to verify time-sensitive historical information and to handle multiple questions in a single message. Additionally, the way it offers a coupon or commits to a follow-up with a customer poses a risk. To resolve these issues, a risk- or confidence-measuring system could be introduced that requests human inspection before dispatching a risky response. We leave this as future work.

Paraphrasing Outcome

Table 2: Question and options settings for our human evaluation task

For each question there are four choices, ordered from bad to good. The first two questions (a and b) evaluate the quality of the generated response (hypothesis) from two perspectives: 1. relevance of the response without knowledge of the actual information, and 2. informativeness of the response with access to the true information. The last two questions (c and d) assess the performance and the impact of the retrieval system, respectively. Each case is rated by three different raters, and the final score is the average of their ratings.

Figure 2: Distribution of the options selected by the annotators

Based on the Fleiss’ kappa score, the raters’ agreement on the choices for questions a) and c) was fair, while for questions b) and d) it was moderate and good, respectively. In around 75% of the cases for both questions c) and d), the raters chose ‘option 4’. This indicates that in those cases the raters found the retrieved query very similar to the corresponding customer query and the generated responses aligned with the retrieved information.
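Fleiss’ kappa can be computed directly from the item-by-option count matrix of the annotators’ choices; the following is a self-contained sketch of the standard formula (not our evaluation code):

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa for an (items x categories) matrix of rating
    counts, assuming the same number of raters n for every item."""
    counts = np.asarray(counts, dtype=float)
    n_items, _ = counts.shape
    n = counts[0].sum()                       # raters per item
    p_j = counts.sum(axis=0) / (n_items * n)  # per-category proportions
    # per-item observed agreement, then chance-corrected overall score
    p_i = (np.sum(counts ** 2, axis=1) - n) / (n * (n - 1))
    p_bar, p_e = p_i.mean(), np.sum(p_j ** 2)
    return float((p_bar - p_e) / (1.0 - p_e))
```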

The evaluation also shows that in around 59% of the cases, the raters found the generated response plausible or relevant without the reference. With respect to the reference, the raters considered the model’s response at least equivalent to the reference in 61% of the cases.

Table 3: Comparison of distributions between copy and non-copy cases

In 75% of the cases, the model’s generated responses are partly or fully copies of the retrieved response. Of these cases, 39% of the generated responses lexically match the corresponding reference responses. According to the human evaluation, in 65% of the cases the copy-driven generated responses are equivalent or preferable to the reference response. These human-supported 65% of cases include almost all of the machine-evaluated matched cases (38% out of the aforementioned 39%). Furthermore, the human- and machine-evaluated labels of a generated response (whether it is equivalent to the reference response) co-occur in 71.98% of all copy-based generations.

On the other hand, in around 25% of the cases the generated responses are lexically different from (not copied from) the corresponding retrieved responses. 18% of such generated responses lexically match the reference response, whereas the human evaluation considers 54% of them equivalent or preferable to the reference response. The human and machine evaluations of the generated responses match in 80.99% of the cases.
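The human–machine co-occurrence figures above are simple agreement rates between the lexical-match label and the human “equivalent or preferable” label; a sketch with hypothetical boolean labels:

```python
def label_agreement(machine_match, human_equivalent):
    """Fraction of generations on which the machine lexical-match
    label and the human equivalence label agree."""
    pairs = list(zip(machine_match, human_equivalent))
    return sum(m == h for m, h in pairs) / len(pairs)
```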

Identification of Applicable Cases

We observe the following scenarios in which the model tends to struggle:

  • The retrieved query is not a good match to the current query.
  • The retrieved response is only loosely connected to the current query.
  • The current query is complex or infrequent (less experienced); the same holds for a less generic, complex retrieved response.

In the above scenarios, the model tends to craft responses instead of following the retrieved response, and in the process it often generates generic responses that it was frequently exposed to during training for the corresponding reason code. On the other hand, in the absence of these three scenarios, the model is often seen to follow the retrieved response, which typically results in a human-favored response.

Therefore, we want to capture the above three scenarios with the following measures:

  • The retrieval score.
  • An entailment model trained between query and response.
  • Query length as a feature for query complexity, since complexity has been observed in longer queries. However, simple queries from infrequent reason codes can also make the model hallucinate. For retrieved responses, a larger length does not prevent generating a faithful response if the response is a generic one.
Table 4: Relation between the generation quality and query-response similarities

Based on the previous three observations about when the model typically succeeds, we check whether the data supports them. In Table 4, generated response quality is defined as the sum of the raters’ ratings and the semantic and lexical similarities between the reference and the generated response. Each row gives the correlation and p-value (level of significance) between the item in the first column and the generated response quality. We notice that the correlation of the retrieval score (query–retrieved query similarity) is highly significant (p-value << 0.05) in determining the model’s generation quality. The other significant correlation, between generation quality and retrieved response length, indicates that the model is typically right when it copies long, generic retrieved responses.

Query–retrieved response similarity is not exactly entailment, and its correlation is not significant either. Finally, query length is used as a heuristic for complexity; however, its correlation with response quality is also not strong.
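Each correlation in Table 4 is a standard Pearson coefficient between a candidate feature and the quality score; a minimal sketch follows (the p-value reported in the table would come from the usual t-test on r, which is omitted here):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation between a feature (e.g. retrieval score
    or query length) and the generation-quality score."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))
```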

To further improve the predictability of when the model can be useful, we apply three entailment models to the human-rated dataset. A cross-encoder model is used to predict a human-like rating of whether the entailment of a query–retrieved response pair or a query–hypothesis pair is likely to yield a good final output.
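Once an entailment model yields a score per pair, predicting usefulness reduces to thresholding that score and measuring accuracy against the human labels. A sketch with hypothetical scores (the cross-encoder scoring itself, and the threshold value, are not reproduced from our setup):

```python
def usefulness_accuracy(entailment_scores, human_good, threshold=0.5):
    """Accuracy of predicting 'good final output' by thresholding an
    entailment score for each query-response (or query-hypothesis) pair."""
    preds = [s >= threshold for s in entailment_scores]
    hits = sum(p == g for p, g in zip(preds, human_good))
    return hits / len(preds)
```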

We compared the results of the three cross-encoder models for both pair types.

Table 5: Performance of entailment models on query-retrieved response and query-hypothesis pairs

Table 5 lists the accuracies of the three entailment models. The results show that our BERT-based cross-encoder improves the accuracy by around 100% over the other two models, indicating that we can effectively estimate the generation’s usefulness in around 70% of cases.
