Guidelines for the Analysis and Design of Argumentation-Based Recommendation Systems

Recommender systems study the characteristics of their users and, by applying different kinds of processing to the available data, find a subset of items that may be of interest to a given user in a specific situation. Argumentation-based tools offer the possibility of analyzing complex and dynamic domains by generating and analyzing arguments for and against recommending a specific item based on the users' preferences. This approach allows us to analyze the qualitative and quantitative characteristics of the recommended items, and to provide explanations to increase transparency. In this article, we develop a set of software engineering guidelines for the analysis and design of recommender systems leveraging this approach.

EXISTING RECOMMENDER SYSTEMS ("RS," for short) cannot formally address the defeasible nature of user preferences in complex environments.1 Decisions about preferences are driven mainly by heuristics, which are typically based on classifying the choices of previous users or on gathering information from other users with similar interests. In addition, as discussed by Briguez et al.,2 existing RS do not have a clear underlying model, making it difficult to provide users with a simple explanation of how the system arrived at its recommendations. Another problem is that modeling users' preference criteria is not an easy task, since it generally requires dealing with incomplete and potentially inconsistent knowledge.
Tools developed in the area of argumentation-based reasoning offer the possibility of analyzing complex and dynamic domains by studying the arguments for and against recommending a specific item based on user preferences. Specifically, defeasible argumentation leverages models that contain inconsistency, evaluating arguments that support contradictory conclusions and deciding which ones to keep. An argument supports a conclusion from a set of premises;3 a conclusion C constitutes a piece of tentative information that an agent is willing to accept. If the agent then acquires new information, the conclusion C, along with the arguments that support it, could be invalidated. The validity of a conclusion C is guaranteed when there is an undefeated argument that provides justification for C. This process involves the construction of an argument A for C, and the analysis of counterarguments that are possible defeaters of A; as these defeaters are arguments, it must be verified that they are not themselves defeated, and so on. This analysis has a valuable byproduct: the set of all arguments can be used to provide explanations for recommendations provided by the system, increasing its transparency.
There is a large body of work on frameworks to carry out this kind of reasoning; the most closely related to this work are those based on rules, which consider the structure of the arguments that model a discussion. [4][5][6] Figure 1 presents a brief outline of their basic elements.
Such systems have a knowledge base (KB) that allows storing information expressed in a logical language. Inference rules allow us to leverage certain information (antecedents) to derive new information (consequents). Other elements of the KB include facts or presumptions, representing evidence obtained from the environment; such evidence typically plays a central role in firing rules and thus building arguments, which are then evaluated via an exhaustive analysis to decide which are accepted and which conclusions can be guaranteed from the current knowledge. A key property of argumentation-based reasoning is nonmonotonicity-the incorporation of new information can generate new arguments that contradict existing ones and, therefore, invalidate statements that were previously guaranteed.
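A minimal Python sketch (purely illustrative; it does not model any particular argumentation framework) shows this nonmonotonic behavior: a tentative conclusion is withdrawn once contradicting evidence enters the KB.

```python
# Toy KB: evidence from the environment, encoded as ground literals.
facts = {"likes(alice, song1)", "same_artist(song1, song2)"}

def conclusions(facts):
    """Fire a defeasible rule: recommend song2 unless something opposes it."""
    concs = set()
    if {"likes(alice, song1)", "same_artist(song1, song2)"} <= facts:
        # Defeasible step: holds only while no contradicting evidence exists.
        if "dislikes(alice, song2)" not in facts:
            concs.add("recommend(alice, song2)")
    return concs

assert "recommend(alice, song2)" in conclusions(facts)

# Nonmonotonicity: new information invalidates the previous conclusion.
facts.add("dislikes(alice, song2)")
assert "recommend(alice, song2)" not in conclusions(facts)
```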
In the domain of RS, the frameworks developed in defeasible argumentation offer the possibility of analyzing complex and dynamic situations by studying arguments for/against recommending an item based on user preferences, focusing on both qualitative and quantitative features. Though the process of obtaining recommendations in this manner is very different from traditional approaches, they share the same main idea: establish a similarity between items and users, and use that similarity to derive recommendations. The main difference is that traditional methods establish similarities through purely numerical analyses, while argumentation-based recommender systems (ABRS) establish similarity via a dialectical process resembling how human beings debate; in particular, different elements can be prioritized depending on users' preferences. Thus, based on the set of rules that define the behavior of the recommender system together with knowledge of the domain, the system will establish the set of arguments for recommending an item, carry out the dialectical process, and execute the corresponding actions.

STATE OF THE ART
There are several works that propose using argumentation to enhance recommendations. Early work includes Chesñevar et al.1 Briguez et al.2,7 present an application in the domain of film recommendation, stressing the importance of considering both qualitative and quantitative aspects; furthermore, explanations that support the recommendations are generated in natural language. Other recent efforts leveraging argumentation-based tools are those of Rago et al.8,9 Finally, other relevant work applies data-driven approaches, as discussed by Portugal et al.10,11 None of these efforts focuses on software development methodologies; this article thus presents a set of guidelines to support knowledge and software engineers in the analysis and design of ABRS, filling this gap in the current intelligent systems development literature.

ANALYSIS AND DESIGN OF ABRS
When executing the analysis and design of an RS with these characteristics, it is desirable to focus on four central aspects: 1) KB design, 2) recommendation mechanism, 3) design and presentation of explanations, and 4) design of user interactions. We now introduce a series of methodological guidelines defined around these aspects. Figure 2 illustrates the relationships that exist among these different tasks.

Stages 1-2: Domain Analysis and DB Design
The two main entities in RS are items and users; it is essential to analyze the relationships between them, since they are central to user preferences. Depending on the domain, it is necessary to refine their description to obtain more information from the participating entities. Examples of relationships include "the number of times the user listened to a song and the rating given" and "the number of times that a recommended article was shared or rated." The result of Stage 1 is a detailed and clear description of the domain, defining the attributes associated with the entities that are most relevant for the task.
The next step is to design the database; there are three basic options: 1) create and populate the database that feeds the system, 2) reuse an existing dataset related to the domain, or 3) extend an existing dataset or merge several datasets to enhance the available information. The latter two options allow us to develop the recommendation mechanism without having to design and populate the underlying database, which can be a very complex and expensive process until enough relevant data are obtained.
Stages 1-2 can be mapped directly to those of traditional methods.

Stage 3: KB Generation
The KB is the structure where knowledge of the domain is formally represented. Its generation can be carried out in three steps: 1) analyze the domain and establish the criteria to be used in the generation of recommendations; 2) specify a preference criterion to apply in case the rules established in the previous step generate contradictory results; and 3) specify the KB in a formal logical language.
During the first step, we create statements in natural language that express how items should (not) be recommended. Though here it may be necessary to appeal to domain experts to generate rules and priorities, it is also possible to leverage existing tools to process large volumes of data, such as data mining, machine learning, genetic algorithms, and information retrieval. 10,12 For instance, with association rule mining it is possible to find which characteristics best describe similarity between items, and prioritize rules. In the second step, a preference criterion is established among the criteria to reflect the domain's characteristics and the users' preferences. Finally, one must specify the patterns in a formal logical language to be interpreted, analyzed, and manipulated by the reasoner. Traditional methods cannot be applied directly in this stage, since they seek to formalize metrics based on quantitative aspects to characterize similar elements, while in ABRS similar elements are characterized through rules. It is possible to adapt traditional approaches to generate a more general representation of such metrics so that they can be mapped into rules to feed dialectical processes.
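As a toy illustration of the data-driven route (the ratings, item names, and support threshold below are hypothetical; production systems would use a dedicated mining library), co-occurrence counts over liked items can suggest candidate "users who like X also like Y" rules:

```python
from collections import Counter
from itertools import combinations

# Hypothetical data: user -> set of liked items. In practice, these would
# come from the transaction database designed in Stage 2.
liked = {
    "u1": {"s1", "s2", "s3"},
    "u2": {"s1", "s2"},
    "u3": {"s2", "s3"},
}

# Count how often pairs of items are liked together; frequent pairs are
# candidates for defeasible "like X, so probably like Y" rules.
pair_support = Counter()
for items in liked.values():
    for a, b in combinations(sorted(items), 2):
        pair_support[(a, b)] += 1

# Keep pairs co-liked by at least 2 users (the threshold is an assumption).
candidates = {pair for pair, n in pair_support.items() if n >= 2}
# candidates == {("s1", "s2"), ("s2", "s3")}
```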

Stage 4: Analysis and Design of the Reasoner
The reasoner is the system's main component: it interprets available knowledge, creates a model based on that knowledge, analyzes the relationships between arguments, resolves conflicts between them, and ultimately issues recommendations.
The first component of an argumentative reasoner is the Inference Engine, which provides the ability to analyze domain knowledge and infer new knowledge to be used in the recommendation process. The literature highlights three alternatives to represent and formalize logic-based arguments: as a proof tree based on the premises,13 as a sequence of proofs (or derivations),14 or as a conclusion-premises pair implying that there is a proof for the conclusion from the premises in the underlying logic.3,4 The second component is the Relationship Interpreter. One of the essential definitions for any argumentative system is the definition of conflict (also known as counter-argumentation or attack) among arguments, which characterizes disagreement. This component is responsible for interpreting the relationships between arguments, creating a model, and establishing a preference order among arguments. Argumentative systems typically parameterize the comparison criterion between arguments, which is generally specified by the knowledge engineer in relation to the application domain.
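To make these roles concrete, the following sketch (illustrative only; real systems such as DeLP build more elaborate dialectical trees) derives a defeat relation from symmetric conflicts plus a comparison criterion, and then checks acceptability recursively: an argument is warranted when none of its defeaters is warranted.

```python
# Arguments are abstracted to names; conflicts and strengths are hypothetical.
conflicts = [("A", "B"), ("B", "C")]   # symmetric disagreement between arguments
strength = {"A": 1, "B": 2, "C": 3}    # comparison criterion (higher wins)

# The stronger argument in each conflict defeats the weaker one.
defeaters = {arg: [] for arg in strength}
for x, y in conflicts:
    winner, loser = (x, y) if strength[x] > strength[y] else (y, x)
    defeaters[loser].append(winner)

def warranted(arg):
    """An argument is acceptable iff every one of its defeaters is defeated."""
    return all(not warranted(d) for d in defeaters[arg])

assert warranted("C")        # C has no defeaters
assert not warranted("B")    # B is defeated by the warranted C
assert warranted("A")        # A's only defeater, B, is itself defeated
```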
The last component is the Semantic Analyzer, which determines the acceptability of arguments by considering their interactions: given an argument, it considers its defeaters, the defeaters of its defeaters, and so on. The definition of a mechanism for deciding the acceptability of arguments determines precisely how inferences are obtained. The literature contains various proposals for this; in particular, either a declarative approach or a procedural approach can be followed. The former establishes conditions that a set of acceptable arguments must meet,15,16 while in the latter a specific algorithm is provided.4,14

Stage 5: Evaluation
The performance of the proposed system should now be evaluated regarding its capability to characterize users' preferences, i.e., to determine whether it is capable of predicting the items that its users would like, including additional considerations such as variety and surprise.
Shani and Gunawardana 17 present a study that emphasizes the evaluation of RS, defining three possible experimental setups: offline experiments, user studies, and large-scale online experiments. They also describe the most important properties that systems must satisfy, together with their satisfaction criteria. Typical properties are: accuracy of predictions, coverage of recommendations offered, cold start capability, confidence, credibility, novelty and originality, diversity, usefulness, risk, robustness, privacy, adaptability, scalability, and performance. The selection of experiments, and the subset of properties to emphasize in the evaluation, is part of the activities carried out by knowledge engineers based on their evaluation of the domain.

Stage 6: Design of Outputs
Once all the machinery for producing recommendations is in place, the next step is to design their presentation to users; this consists of designing the system interface, along with any justifications for the selected recommendations. In particular, the knowledge engineer must establish how the entities that support the recommendations will be analyzed and represented, to facilitate their understanding and generate a satisfactory explanation. According to Lacave and Diez,18 an explanation must be "understandable," allow the interlocutor "to improve knowledge," and be "satisfactory" in the sense of fulfilling the interlocutor's expectations; an understandable explanation is a coherent justification of the kind requested by the interlocutor. On the other hand, Walton19 defined an explanation as "a transfer of knowledge from one interlocutor to another within the context of a dialog." Finally, according to Moulin et al.,20 an explanation "has to be planned and then transmitted in an appropriate way"; i.e., it is an "object to be designed" and a "communicative act" to be achieved. The agent who acts as interlocutor has only partial knowledge of the subject in question, for which it requests an explanation in the hope that the agent generating it can fill their knowledge gaps.
In the domain of argumentation, structured approaches provide important advantages in the task of translating the structure of arguments into natural language propositions. Another interesting aspect is the possibility they afford to visualize, through tree-based structures, the dialectical process generated to support a recommendation, which can be part of an extended explanation.
This stage is modular: it depends on the domain, since explaining and showing the process used to generate a recommendation is very useful for certain domains (such as investments, medical diagnosis, or risk analysis), but may prove to be overwhelming and unhelpful in more mundane ones (such as multimedia).

Stage 7: Design of User Interactions
This stage is common to the design of user interactions in any system: when the interface with the user is well designed, the user "slides through" the interaction smoothly and effortlessly. In RS, the GUI is focused on what the user sees when requesting a recommendation and how the system outputs (including possible explanations) are presented.

CASE STUDY: MUSIC RECOMMENDATIONS
Methodologies aiding in the successful engineering of software systems must be properly validated. In this section, we implement our proposal to analyze and design an ABRS in the music domain, comparing the results obtained with a baseline RS that resembles those designed under classical schemes. A music RS aims to suggest songs, videos, albums, or artists that appeal to its users; well-known examples include Spotify, YouTube, Last.fm, Pandora, and Genius, among others.
For reasons of space, we only present a reduced version of the analysis carried out in each stage, focusing on showing the intermediate results toward obtaining the final system.

Case Study: Stages 1-2
The main attributes that describe a song are: ID, Title, Author, Album, Genre; for our purposes, these attributes suffice to build reasoning patterns to support issuing recommendations. In most RS, users are represented by their personal information: Name, LastName, Sex, Age, Country.
Relationships represent the interactions between users and items, which are also called transactions and can be encoded via tuples ⟨User ID, Song ID, Score, #reproductions⟩. The last attribute is a secondary parameter measuring the degree to which the user liked the song.
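Such a transaction could be represented, for instance, as follows (the field names and sample values are illustrative, not part of the original dataset):

```python
from typing import NamedTuple

class Transaction(NamedTuple):
    """One user-item interaction, as described in Stages 1-2."""
    user_id: int
    song_id: int
    score: str          # e.g., "l" (love) or "b" (bad)
    reproductions: int  # play count: secondary signal of how much it was liked

t = Transaction(user_id=7, song_id=42, score="l", reproductions=15)
```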
Once the domain of the recommender system has been defined, the second stage of the process involves designing how available information is stored in a database.

Case Study: Stage 3
As described in the section "Stage 3: KB Generation," the generation of the KB will be carried out in three steps.
Step 1: Analyze the Domain
The system will reason based on the following general criteria:
If a user likes a song by a certain artist, they can also be expected to like another song by the same artist. For greater specificity, we can extend this to sharing the same artist and tags.
If a user likes a song of a certain genre, they may be expected to like another song of the same genre.
If a particular song is liked by many users in the system, a given user can also be expected to like it.
Given two similar users, a song liked by one of them can be expected to be liked by the other.
In general, the dual of each of these criteria can also be applied; for instance, if a user does not like a song from a particular genre, they can be expected to not like other songs in that genre. Based on these general statements, criteria were defined to represent that knowledge and direct the recommendation process.
Next, we establish the events that must be considered to trigger reasoning chains. These events are:
E1: Two songs are by the same artist.
E2: A song and an artist have similar tags whenever they share a number of tags greater than or equal to half the tags assigned to the entity with the fewest tags.
E3: Two songs have similar tags (genre), determined as above.
E4: A song is considered to be "good" whenever the ratio between the number of times it is played and the number of listeners is greater than or equal to 6; otherwise, the song is considered to be "bad."
Clearly, some of the parameters used here can be readjusted or modified as needed. For example, a weaker similarity notion can be used to specify closeness between two songs or artists, or the threshold for determining song quality can be adjusted.
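Events E2-E4 reduce to simple arithmetic predicates. A possible Python rendering (with the thresholds from the case study, kept as easily adjustable parameters) is:

```python
def similar_tags(tags_a, tags_b, fraction=0.5):
    """E2/E3: entities are similar if they share at least `fraction` of the
    tags of the entity with the fewest tags (0.5 in the case study)."""
    a, b = set(tags_a), set(tags_b)
    return len(a & b) >= fraction * min(len(a), len(b))

def good_song(plays, listeners, threshold=6):
    """E4: a song is 'good' if plays/listeners >= threshold (6 here)."""
    return listeners > 0 and plays / listeners >= threshold

assert similar_tags({"rock", "indie"}, {"rock", "pop", "indie"})  # shares 2 of 2
assert not similar_tags({"jazz"}, {"rock", "pop"})                # shares none
assert good_song(plays=1200, listeners=150)       # ratio 8 >= 6
assert not good_song(plays=300, listeners=100)    # ratio 3 < 6
```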
Once these foundations are defined, the conditions that trigger the recommendations are derived as follows.
Recommendations based on artist:
R1: A user may like a given song if there is another song by the same artist that was positively valued by them.
R2: A user may not like a given song if there is another song by the same artist that was negatively valued by them.

Recommendations based on artist and tags:
R3: A user may like a particular song if there is another song by the same artist valued positively by them, and this artist has the same tags as the song to be recommended.
R4: A user may not like a particular song if there is another song by the same artist valued negatively by them, and this artist has the same tags as the song to be recommended.
As explained above, we can derive analogous guidelines based on genre, ratings, and user similarity.
These criteria model knowledge of the application domain; they are the building blocks of arguments for/against specific recommendations. Note that they combine qualitative and quantitative approaches, taking advantage of the flexibility offered by argumentation-based reasoning.
Step 2: Establish Priorities
We consider that the preference criterion that best responds to this particular domain is rule priority. Therefore, in this step an order of priorities must be defined among the criteria established above. We use the symbol "≻" to denote "has greater priority than"; an example using the rules defined above is R1 ≻ R4. The next step involves expressing these priorities in a formal logical language.
Step 3: Specify the KB
We now present a subset of the rules and priorities formalized in DeLP, whose language is based on logic programming, where basic concepts such as variables and functions are defined in the usual way. Literals are atoms that can be preceded by the symbol "~," denoting strict negation; facts are positive literals. A strict rule is an ordered pair L0 ← L1, ..., Ln, where the first component, L0, is a literal and the second component, L1, ..., Ln, is a finite and nonempty set of literals. Similarly, a defeasible rule is an ordered pair L0 -< L1, ..., Ln, where the first component, L0, is a literal and the second component, L1, ..., Ln, is a finite and nonempty set of literals. Strict rules are used to represent incontrovertible information, while defeasible rules represent defeasible knowledge (that is, tentative information that can be used if nothing opposes it).
In this formalism, the state of the domain is modeled through a defeasible logic program, essentially a set of facts, strict rules, and defeasible rules. Given a defeasible logic program P, the subset of strict rules and facts is denoted Π, and the subset of defeasible rules Δ; in this way, a program P can be denoted (Π, Δ). Since the set Π represents nondefeasible information, it must be noncontradictory. We use the convention that names of variables begin with capital letters, while constants and names of predicates begin with lowercase letters. Finally, given a program P and a query Q, the reasoner must provide a response based on the domain knowledge.

Priorities:
We use better_rule(rule1, rule2) to encode that rule1 has priority over rule2.

General rules:
Based on all the rules defined, some further general rules are formalized:
recommend(Track, User) -< likes_by_artist(Track, User)
recommend(Track, User) -< likes_by_artist_track(Track, User)
The formalization of all the criteria, and the priorities over them, defines the logic program (model) used by the reasoner to guide the recommendation process.

Case Study: Stage 4
The reasoning machinery provided by Defeasible Logic Programming (DeLP)4 is used as the argumentation-based tool in the system. An important design aspect is the integration between DeLP programs and relational databases, which provides the information about entities needed to create arguments. In our proposal, such integration is provided by the Database Integration for DeLP (DBI-DeLP) framework.21 A DBI-DeLP program is a DeLP program extended with information obtained from one or more databases. An important point is the need to consider the possible presence of contradictory information linked to the use of several databases, which could, for instance, lead to reasons both in favor of recommending an element and against it.
Since facts in DeLP cannot be contradictory, here we adopt the notion of presumption to represent "defeasible" information.
In DBI-DeLP, the tuples of the database (in our case, the information from the dataset) are represented as a particular type of presumptions called operative presumptions, which are literals of the form predicate(q1, ..., qm) -< true. A DBI-DeLP program is a DeLP program with a set of operative presumptions, associated with the dataset records used in the RS, which are retrieved at the request of the system to answer a particular query associated with a recommendation and then discarded. In summary, to obtain data relevant to the argumentation process, elements from the literal that the dialectical process is trying to warrant are used to determine relevant records in the database, and the corresponding SQL queries are issued. Finally, all the recovered results are transformed into operative presumptions; this dynamic search for relevant information is crucial in adequately leveraging available datasets.
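The retrieval step can be sketched as follows. This is only an illustration of the idea, not the actual DBI-DeLP implementation; the table schema and predicate name are hypothetical.

```python
import sqlite3

# A tiny in-memory stand-in for the transaction database of Stage 2.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listen (user TEXT, track TEXT, artist TEXT, score TEXT)")
conn.execute("INSERT INTO listen VALUES ('u1', 't2', 'a1', 'l')")

def operative_presumptions(user):
    """Fetch the tuples a query about `user` may need, rendered as
    operative presumptions of the form predicate(q1, ..., qm) -< true."""
    rows = conn.execute(
        "SELECT user, track, artist, score FROM listen WHERE user = ?", (user,)
    )
    return [f"listen_artist({u}, {t}, {a}, {s}) -< true" for u, t, a, s in rows]

presumptions = operative_presumptions("u1")
# presumptions == ["listen_artist(u1, t2, a1, l) -< true"]
```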
The use of DeLP as the argumentation engine was chosen based on our familiarity with its implementation, which simplified the development of the prototype used for this case study; we also consider that DeLP is a powerful tool based on intuitive concepts. Note, however, that this choice is modular and there are alternative structured argumentation systems in the literature, such as ASPIC+, 5 ABA, 6 and hybrid methods. 8 In this last work, the authors present a method for making predictions in RS, and show experimentally that it is competitive in the movie domain; they also illustrate how it can be used to generate effective explanations, which is a valuable byproduct of many dialectical processes, not just DeLP.

Case Study: Stage 5
The goal of the evaluation is to determine if the system is able to make good predictions regarding items that users like. Here, we report on the results of an offline experiment to evaluate our case study.

Experiment Setup
The dataset consists of 1200 valuations, selected at random with the sole condition that they come from different users; our goal was to avoid the possible introduction of biases stemming from the behavior of specific users. There may be repeated songs in the ratings, since the same song may appear in multiple ratings from different users.
The experiment consists of issuing a series of queries to the system to evaluate its capability of predicting whether or not a given user likes a song. Such queries are of the form recommend(Track, User)? Each answer was then classified into one of the following categories, according to the response obtained and the rating that the user gave to that song within the dataset:*
True Positive (TP): recommended song rated "l" (love) by the user;
True Negative (TN): song not recommended, and the user rated it "b" (bad);
False Positive (FP): recommended song rated "b" (bad);
False Negative (FN): song not recommended, and the user rated it "l" (love); and
Undecided (U): the system neither suggests nor denies the recommendation.
The last case occurs when the system cannot guarantee either recommendation or nonrecommendation of a song to a user, thus arriving at an undecided position. To carry out the test, 1200 queries were made to the recommender system (one for each valuation in the dataset).
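Given such counts, classical metrics follow directly. The sketch below uses invented counts purely for illustration (they are not the figures reported in Figure 3), and adopts one possible convention for undecided answers, counting them in the total but in neither the correct nor the incorrect classes:

```python
def metrics(tp, tn, fp, fn, undecided):
    """Precision, recall, and accuracy for a classifier that may abstain."""
    total = tp + tn + fp + fn + undecided
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return precision, recall, accuracy

# Hypothetical counts summing to the 1200 queries of the experiment.
p, r, a = metrics(tp=600, tn=240, fp=60, fn=120, undecided=180)
# p ≈ 0.909, r ≈ 0.833, a = 0.7
```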
The experiment was run twice, once for each DeLP program (mixed and quantitative models), in order to compare the performance of the two approaches; the DeLP programs were derived based on the guidelines presented above. The first program follows the mixed approach, and is composed of all the previously defined preference criteria, so the system is able to work with both quantitative and qualitative features of the entities that participate in the domain. The second program is based on criteria that consider purely quantitative features, as in traditional RS.

Results
The results are presented in Figure 3; on the left, we show how each approach performed regarding answer categories, and on the right we show the classical metrics derived from these values. As we can see, the mixed approach outperforms the purely quantitative one in all metrics; its only disadvantage is observed in the true negative category, but this is offset by its performance in the rest. These results clearly show the advantages of applying a mixed approach, which allows us to "refine" the answer given by the quantitative approach based on a more complete analysis of the available information.

Case Study: Stage 6
We propose explanations based on the approach by Briguez et al.,2 where explanations are derived from the structures of the arguments involved, via a simple replacement of the logical structure with its corresponding colloquial interpretation. As an example, we have:
Argument structure: recommend(Track1, User) -< likes_by_artist(Track1, User), listen_artist(User, Track2, Artist, l), Track1 ≠ Track2, same_artist(Track1, Track2)
Explanation: "User, the song Track1 was recommended since you liked another song by the same artist."
Such explanations can then be offered to users upon request.
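Such template substitution can be sketched as follows (the template table and predicate names are hypothetical; a real system would cover every rule in the KB):

```python
# One colloquial template per rule; the argument's variable bindings
# fill the slots.
TEMPLATES = {
    "likes_by_artist": (
        "{user}, the song {track} was recommended since you liked "
        "another song by the same artist."
    ),
}

def explain(rule, bindings):
    """Render the colloquial explanation for a warranted argument."""
    return TEMPLATES[rule].format(**bindings)

msg = explain("likes_by_artist", {"user": "Alice", "track": "Track A"})
# msg == "Alice, the song Track A was recommended since you liked
#         another song by the same artist."
```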

*To ensure that the triple (Track, User, Rating) had no influence on the prediction made by the system, the corresponding record was removed from the test dataset.

Case Study: Stage 7
For reasons of space, and since the user interface does not differ from typical ones in similar systems, we do not include details of this stage here.

CONCLUSION AND FUTURE WORK
The main contribution of this article lies in the development of methodological guidelines for the analysis and design of an ABRS capable of: 1) making recommendations to its users from an incomplete or inconsistent KB, 2) providing the possibility of adapting the knowledge-based analysis that is carried out according to its users' preferences, 3) analyzing qualitative and quantitative information, and 4) providing explanations. The state of the art of ABRS focuses on studying the characteristics of such systems and on the development of prototypes, without applying a methodology to guide the process; i.e., it exploits argumentation-based tools in RS without referring to the specific design choices made. Our work is therefore a first approach to the investigation of software development methodologies tailored to this type of system. Future work involves further evolving these guidelines toward a formal and solid software development methodology for creating high-quality ABRS.