Abstract
Data repairing is a key problem in data cleaning that aims to uncover and rectify data errors. Traditional methods depend on data dependencies to check for the existence of errors in data, but they fail to rectify those errors. To overcome this limitation, recent methods define repairing rules on which they rely to detect and fix errors. However, all existing data repairing rules are provided by experts, which is expensive in time and effort. Besides, rule-based data repairing methods need an external verified data source or user verifications; otherwise, they are incomplete and can repair only a small number of errors. In this paper, we define weighted matching rectifying rules (WMRRs) based on similarity matching to capture more errors. We propose a novel algorithm to discover WMRRs automatically from the dirty data in-hand. We also develop an automatic algorithm for rule inconsistency resolution. Additionally, based on WMRRs, we propose an automatic data repairing algorithm (WMRR-DR) that uncovers a large number of errors and rectifies them dependably. We experimentally verify our method on both real-life and synthetic data. The experimental results prove that our method can discover effective WMRRs from the dirty data in-hand and perform dependable, fully automatic repairing based on the discovered WMRRs, with higher accuracy than the existing dependable methods.
1 Introduction
Data quality is one of the most crucial problems in data management. Database systems usually concentrate on data size, aiming to create, maintain, and control great volumes of data. However, real-life data are often dirty and of poor quality; about 30% of firms’ data could be dirty [1]. Dirty data are very expensive, costing the US economy more than 3 trillion dollars [2]. In addition, high-quality data are critical to decision making. These facts emphasize the necessity of data cleaning for organizations. Data repairing is a key problem in data cleaning: uncovering data errors and rectifying them.
Different data dependencies have been proposed for data repairing, such as functional dependencies (FDs) [3], conditional functional dependencies (CFDs) [4], matching dependencies (MDs) [5], and recently, conditional matching dependencies (CMDs) [6]. Although data dependencies can judge whether errors exist in the data, they fail to determine the wrong values, and worse, they cannot fix them. For that reason, various types of rules have been defined to detect and fix errors, namely editing rules [7, 8], fixing rules [9] and Sherlock rules [10]. Even though the rule-based data repairing methods [7, 8, 10] outperform dependency-based methods, they need an external trustworthy data source or user verifications. In contrast, the fixing rules-based method [9] performs data repairing without using master data or involving users, but it can repair only a small number of data errors. Furthermore, all proposed rules for data repairing are provided by domain experts, which is a time-consuming, impractical and costly task.
We explain the limitations of existing methods in Example 1.
Example 1
Consider the data set \(D_{RES}\) in Table 1 of researchers including eight tuples \(t_1,t_2,\dots ,t_8\), where each tuple \(t_i\) refers to a researcher, identified by Name, Department (Dept), Nationality (Nation) and Capital. The symbol “\(*\)” marks all errors, whose corrections are given between brackets; for example, \(t_2\)(Capital)= “HongKong” is an error, whose correction is “Beijing.”
Suppose a functional dependency fd: RES (Nation \(\rightarrow \) Capital) over \(D_{RES}\), which indicates that Nation uniquely determines Capital. The pair \((t_1,t_2)\) violates fd since \(t_1(Nation)=t_2(Nation)\) but \(t_1(Capital)\ne t_2(Capital)\). Thus, fd confirms that there must be errors among the values \(t_1(Nation)\), \(t_2(Nation)\), \(t_1(Capital)\), \(t_2(Capital)\), but it cannot determine which values are wrong or how they can be fixed.
Consider a tuple s in a master data \(D_m\) as follows:
| TupleID | Country | Capital |
|---------|---------|---------|
| s       | China   | Beijing |
On both data sets (\(D_{RES}, D_m\)), experts can define an editing rule er: \(((Nation, Country) \rightarrow (Capital,\) \(Capital),t_p=())\). It denotes that for an input tuple \(t \in D_{RES}\) in Table 1, if t(Nation) is correct (a user verifies it) and \(\exists s \in D_m\) where \(t(Nation)= s(Country)\) and \(t(Capital) \ne s(Capital)\), then t(Capital) is wrong and is fixed to s(Capital). Accordingly, er detects the errors \(t_2(Capital)\) and \(t_4(Capital)\) and updates them to “Beijing.” However, since er depends on exact matching, it cannot fix \(t_6(Capital)\), \(t_3(Nation)\), and \(t_6(Nation)\).
Similarly, consider a Sherlock rule sr: ((Nation, Country), (Capital, \(\bot \), Capital) \(\rightarrow (=,\not \approx ,=))\), defined on (\(D_{RES},\) \(D_m\)). It depends on similarity instead of equality to compare t(Nation) with s(Country), \(\forall t\in D_{RES}, s\in D_m\). Therefore, for Table 1, it can annotate each correct value as positive, each wrong value as negative, and update wrong values to correct ones. However, Sherlock rules, like editing rules, are provided by experts using master data.
In contrast, consider a fixing rule, which requires neither master data to define nor user verification to apply, fr: ((Nation, “China”),(Capital, {“Hongkong”, “Shanghai”}) \(\rightarrow \) “Beijing”), also provided by experts. fr requires evidence on the Nation attribute with a correct value, e.g., “China”, to fix wrong values in Capital, e.g., “Hongkong” and “Shanghai”, updating them to “Beijing.” Therefore, it can fix \(t_2(Capital)\) and \(t_4(Capital)\), but it fails to detect or fix \(t_6(Capital)\), \(t_3(Nation)\) and \(t_6(Nation)\). \(\square \)
Note that fixing rules, the only existing kind of rules for automated and dependable data repairing, require valid evidence from some attributes to detect and fix errors in other related attributes. So, if there is even a typo in the evidence, the errors in the related attributes, as well as that typo, cannot be detected and fixed.
This paper introduces weighted matching rectifying rules to overcome the previous limitations. Example 2 discusses the cases that these rules can cover.
Example 2
Consider the data set \(D_{RES}\) in Table 1, with Nation and Capital as two related attributes based on fd. We notice, first, that for a particular value of Nation, e.g., “China,” the correct value of Capital, i.e., “Beijing”, occurs more frequently than the wrong ones, e.g., “Hongkong” and “Shanghai.” Second, using approximately valid values of a set of attributes can help us detect and fix more errors in the related attributes. Based on these two observations, we generate a weighted matching rectifying rule r : ((Nation \(\approx \) “China”), (Capital \(\in \) {“Hongkong”,“Shanghai”}) \(\Rightarrow \) (“China”) \(\wedge \) (“Beijing”)). This rule can detect and rectify errors not only in the related attribute, Capital, but also in the evidence attribute, Nation, as follows:
- \(t_2(Nation)=\) “China” and \(t_2(Capital) \in \) {“Shanghai”, “Hongkong”}, then \(t_2(Capital)\) is wrong and we update it to “Beijing.” Similarly, \(t_4(Capital)\) is rectified.
- \(t_6(Nation) \approx \) “China” and \(t_6(Capital) \in \) {“Shanghai”, “Hongkong”}, then \(t_6(Capital)\) is wrong and we update it to “Beijing”, and \(t_6(Nation)\) is wrong and we update it to “China.”
- \(t_3(Nation) \approx \) “China” and \(t_3(Capital)\)= “Beijing”, then \(t_3(Nation)\) is wrong and we update it to “China.” \(\square \)
The two examples above raise the following challenges for developing a matching rectifying rules-based data repairing method:

- How to define weighted matching rectifying rules over the dirty data in-hand, with flexible matching, so that they can detect and fix different errors dependably and automatically?
- How to discover these rules automatically from the dirty data in-hand?
- What is an effective method to apply these rules for automated and dependable data repairing?
Considering these challenges, the main contributions of this paper are summarized as follows:

- We define weighted matching rectifying rules (WMRRs) that can cover and fix more data errors dependably and automatically.
- We propose an automatic rule discovery algorithm based on the dirty data in-hand and their functional dependencies. To the best of our knowledge, it is the first automatic rule discovery method for data repairing.
- We study fundamental problems of WMRRs and develop an automatic algorithm to check rule consistency and resolve rule inconsistency.
- We propose an effective data repairing method based on a consistent set of WMRRs.
- We conduct comprehensive experiments on two data sets, which verify the effectiveness of our proposed method.
The rest of this paper is organized as follows. Section 2 studies related work. Section 3 defines WMRRs. Section 4 presents the automatic rule discovery algorithm. Section 5 studies the fundamental problems of WMRRs. Section 6 demonstrates the automatic algorithm for rules inconsistency resolution. Section 7 presents the automatic repairing algorithm based on WMRRs. Section 8 reports our experimental results, and finally, the paper is concluded in Sect. 9.
2 Related work
Many studies have addressed data cleaning problems, especially data repairing; they can be classified as follows.
Dependencies-based Data Repairing Heuristic data repairing based on data dependencies has been widely studied [3, 11, 12]. These methods address the problem of finding a consistent database with a minimum difference from the original database [13]. They use various cost functions to fix errors and employ different dependencies such as FDs [14, 15], CFDs [16, 17], CFDs and MDs [18], and DCs [19]. However, consistent data are not necessarily correct. Therefore, the repairs proposed by these methods cannot guarantee correctness.
Rule-based Data Repairing Unlike dependencies-based methods, rule-based methods are more dependable and conservative. Therefore, rules have been developed for different data cleaning problems, such as ER-rules for entity resolution [20, 21], and editing rules [8], fixing rules [9] and Sherlock rules [10] for data repairing. Fixing rules can be discovered by user interactions [22] or provided by experts [9], like editing rules [8] and Sherlock rules [10]. In contrast, our proposed rules, WMRRs, are discovered automatically based on the data in-hand. From another side, editing rules [8] depend on master data and user verifications to perform reliable repairing, while Sherlock rules depend on master data for automatic repairing. Closest to this study, fixing rules [9] perform reliable and automatic repairing without external data sources. However, the repairing process based on fixing rules is incomplete, i.e., only a small number of errors can be fixed, since fixing rules focus on repairing correctness at the expense of repairing completeness. In contrast, weighted matching rectifying rules focus on both completeness and correctness of repairing. Moreover, WMRR-based data repairing is reliable and automatic. We will show experimentally (Sect. 8) how WMRR-based data repairing can significantly improve the recall while maintaining the precision of repairing.
Data Repairing using Knowledge Bases Some methods have utilized knowledge bases for data repairing. KATARA [23] used knowledge bases and crowdsourcing to detect correct and wrong values, so it is a nonautomatic method. For rule discovery, [24] used knowledge bases to generate deductive rules (DRs) that can identify correct and wrong values, and fix errors only if there is enough evidence. However, rule discovery needs enough correct and wrong record examples to investigate the right and error semantics of the data from the knowledge base, and expensive expert knowledge is still necessary to validate the extracted semantics. For data repairing, [24] requires designing effective semantic links between dirty databases and knowledge bases, which is user-guided, i.e., nonautomatic. In contrast, we aim at automatic rule discovery and automatic data repairing based on the data in-hand, utilizing correct data to fix wrong data without any external source.
User-Guided Data Repairing Since users and experts can help to perform reliable repairing, they have been involved in various data repairing methods [25,26,27], even in rule-based methods [10, 22, 24], as discussed before. However, depending on users is commonly costly in terms of effort and time, and worse, error-prone, while domain experts are not always available with the required knowledge. Accordingly, automatic data repairing is needed, which we target in this work.
Machine Learning and Statistical Data Repairing Machine learning is also employed by some data repairing methods, such as [28, 29]. These methods are mainly supervised, since they require training data and rely on the chosen features. Other methods perform statistical repairing by applying probabilistic inference to derive the correct data [30,31,32]. Our method, as a rule-based method, differs from this class of methods in that it is a declarative method that determines correct values and repairs wrong values automatically based on the data in-hand.
Indeed, rule-based data cleaning methods are often preferred by end-users because rules are explainable and simple to rectify and refine [33]. As a result, they have spread in industry and business, for example as ETL (Extract-Transform-Load) rules [34]. However, there is still an essential need for automatic rule discovery and for automatic rule-based repairing methods, which this work introduces.
3 Weighted matching rectifying rules
In this section, we introduce weighted matching rectifying rules for data repairing, WMRRs. Consider a data set D over a schema S with a set of attributes \(A=\{a_1, a_2,\dots ,a_n\}\), where each attribute \(a_i\) has a finite domain \(dom(a_i)\). We first define the syntax of the rules. Then, we describe the semantics of the rules.
3.1 Rule syntax
A matching rectifying rule r defined on a schema S has the following syntax:
\([X \approx _X DP(X)] \wedge [y \in WP(y)]\Rightarrow [DP(X)] \wedge [cp(y)]\), where
- \(X \subset A\) is a set of attributes in A, and \(y \in A \setminus X\) is an attribute in A but not in X;
- DP(X) is a pattern over the attributes X, called the director pattern on X, such that \(\forall a \in X, DP(a) \in dom(a)\) is a constant value in the domain of attribute a;
- \(WP(y)\subset dom(y)\) is a set of constant values in the domain of attribute y, called the wrong patterns of y;
- \(cp(y) \in dom(y)\setminus WP(y)\) is a constant value in the domain of y but not in WP(y), called the correct pattern of y;
- \(\approx _X\) is a similarity metric on the X attributes that identifies the similarity between the X values in a rule, i.e., DP(X), and the corresponding values of X in a tuple, i.e., t(X). \(DP(X) \approx _{X} t(X)\) iff \(DP(x_i) \approx _{x_i} t(x_i)\) \(\forall x_i \in X\). Formally, \(\approx _{x_i}\) evaluates to true or false as follows.
$$DP(x_i) \approx _{x_i} t(x_i)={\left\{ \begin{array}{ll} \text {true}, & \text {if } dist(DP(x_i),t(x_i))<\vartheta \\ \text {false}, & \text {otherwise,} \end{array}\right. } \quad (1)$$
where \(dist(DP(x_i),t(x_i))\) is a distance function, and \(\vartheta \) is a threshold for this function.
Similarity Measure \(\approx _X\) can use domain-specific similarity operators or any distance function, such as Edit distance, Jaccard distance, Cosine similarity or Euclidean distance, with a predefined threshold \(\vartheta \). By default, Edit distance is used to check similarity for attributes with string values and Euclidean distance for attributes with numeric values [35]. In this paper, we formally have

$$dist(DP(x_i),t(x_i))={\left\{ \begin{array}{ll} EditDistance(DP(x_i),t(x_i)), & \text {if } t(x_i) \text { is a string value} \\ EuclideanDistance(DP(x_i),t(x_i)), & \text {if } t(x_i) \text { is a numeric value.} \end{array}\right. } \quad (2)$$
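As an illustration of this predicate, here is a minimal Python sketch of the similarity test of Eq. (1) for string attributes, assuming a length-normalized Edit distance (the function names and the normalization are our own choices, not fixed by the paper):

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming (Levenshtein) edit distance."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (a[i - 1] != b[j - 1]))      # substitution
            prev = cur
    return dp[n]

def similar(dp_value: str, t_value: str, theta: float = 0.4) -> bool:
    """Eq. (1): true iff dist(DP(x_i), t(x_i)) < theta (normalized by length)."""
    dist = edit_distance(dp_value, t_value) / max(len(dp_value), len(t_value), 1)
    return dist < theta

# e.g., similar("China", "Chena"): edit distance 1, normalized 0.2 < 0.4 -> True
```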
Rule Weights Since we discover rules from dirty data, the rules could, in turn, be dirty. We assign two weights to each rule r, \(w_1(r)\) and \(w_2(r)\), to assure good performance of the rules on D for data repairing. \(w_1(r)\), which is used for rule discovery (Sect. 4) and rule inconsistency resolution (Sect. 6), measures the validity of the rule. \(w_2(r)\), which is used for rule-based data repairing (Sect. 7), measures the ratio of the tuples with the correct values of the rule for both attributes X and y to all tuples in the data set D. We define the weights of a rule r as follows:

$$w_1(r)=\frac{|DP(X) \cap cp(y)|_D}{|DP(X)|_D} \quad (3)$$

$$w_2(r)=\frac{|DP(X) \cap cp(y)|_D}{|D|} \quad (4)$$

where \(|DP(X) \cap cp(y)|_D\) denotes the number of tuples in D with the DP(X) and cp(y) values for the X and y attributes, respectively, \(|DP(X)|_D\) denotes the number of tuples in D with the DP(X) values for the X attributes, and |D| denotes the size of D in terms of tuples.
\(w_1(r) \in \) [0,1] is (a) the probability that, when a tuple \(t \in D\) matches r, y has a wrong value in t that will be rectified to cp(y), or X has one or more attributes with a typo in t that will be rectified to DP(X); or (b) the probability that a tuple \(t \in D\) matching r has the correct value cp(y) for y and the correct value DP(X) for X.
\(w_2(r) \in \) [0,1] is the probability that the correct values DP(X) and cp(y) of attributes X and y, respectively, appear together in data set tuples.
For example, \(w_1(r)=2/3\) and \(w_2(r)=1/2\) are the weights of the rule r in Example 2 based on Eqs. (3) and (4), respectively.
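As a quick sanity check, the following sketch recomputes these weights from the pattern counts of Example 4 (the variable names are ours):

```python
# Counts from Example 4: DP(X)="China" occurs 6 times in Table 1; among those,
# cp(y)="Beijing" co-occurs 4 times; the data set has |D| = 8 tuples.
dp_count = 6          # |DP(X)|_D
dp_cp_count = 4       # |DP(X) ∩ cp(y)|_D
data_size = 8         # |D|

w1 = dp_cp_count / dp_count    # Eq. (3): 4/6 = 2/3
w2 = dp_cp_count / data_size   # Eq. (4): 4/8 = 1/2
```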
3.2 Rule semantics
Let t be a tuple in a data set D, R be a consistent set of weighted matching rectifying rules, and \(r \in R\) be a rule with the syntax in Sect. 3.1. Intuitively, X and y are semantically correlated. The following definitions describe the semantics of applying a rule r and a rule set R.
Definition 1
t matches r, denoted by \(t\vdash r\), if (1) \(t(X)\approx _X DP(X)\), and \(t(y)\in WP(y)\), or (2) \(t(X)\approx _X DP(X)\), and \(t(y)=cp(y)\).
Definition 2
r is applied to t if t matches r, changing t to \(\acute{t}\), denoted by \(t \rightarrow _{r} \acute{t}\), where \(\acute{t}(x)=DP(x)\) \(\forall x \in X\) and \(\acute{t}(y)=cp(y)\). This includes: (1) r rectifies X if \(\exists x \in X; t(x)\ne DP(x)\), (2) r rectifies y if \(t(y)\in WP(y)\), (3) r verifies \(x \in X\) if \(t(x)=DP(x)\), then \(\acute{t}(x)=t(x)\), and (4) r verifies y if \(t(y)=cp(y)\), then \(\acute{t}(y)=t(y)\).
Therefore, r can rectify wrong values and verify correct values of t(X) and t(y) when t matches r.
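To make these semantics concrete, the following minimal Python sketch models a rule with the match test of Definition 1 and the application of Definition 2 (the class layout and names are our own; `similar` stands for the predicate of Eq. (1)):

```python
from dataclasses import dataclass

@dataclass
class WMRR:
    X: tuple            # evidence attributes, e.g., ("Nation",)
    y: str              # related attribute, e.g., "Capital"
    DP: dict            # director pattern: attribute -> constant value
    WP: set             # wrong patterns of y
    cp: str             # correct pattern of y
    w1: float = 0.0     # validity weight, Eq. (3)
    w2: float = 0.0     # support weight, Eq. (4)

    def matches(self, t: dict, similar) -> bool:
        """Definition 1: t(X) ~ DP(X), and t(y) in WP(y) or t(y) = cp(y)."""
        return (all(similar(self.DP[x], t[x]) for x in self.X)
                and (t[self.y] in self.WP or t[self.y] == self.cp))

    def apply(self, t: dict) -> dict:
        """Definition 2: rectify/verify X and y, returning the repaired tuple."""
        fixed = dict(t)
        for x in self.X:
            fixed[x] = self.DP[x]   # rectifies x if t(x) != DP(x), else verifies it
        fixed[self.y] = self.cp     # rectifies y if t(y) in WP(y), else verifies it
        return fixed

# e.g., the rule r of Example 2:
# r = WMRR(X=("Nation",), y="Capital", DP={"Nation": "China"},
#          WP={"Hongkong", "Shanghai"}, cp="Beijing", w1=2/3, w2=0.5)
```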
Example 3
Consider the data set in Table 1 and the rule r in Example 2. \(t_2,t_4,\) and \(t_6\) match r since \(t_i(Nation)\approx \) “China”, and \(t_i(Capital) \in \) {“Hongkong”, “Shanghai”} \(\forall i \in \{2,4,6\}\). \(t_3\) also matches r since \(t_3(Nation) \approx \) “China”, and \(t_3(Capital)\)= “Beijing.” Consequently, r detects and fixes all errors in these tuples as follows: \(t_2\)(Capital), \(t_4\)(Capital), \(t_6\)(Capital) are updated to “Beijing”, and \(t_3(Nation)\), \(t_6(Nation)\) are updated to “China.” r also verifies \(t_2(Nation), t_3(Capital), t_4(Nation)\), as well as the values of Nation and Capital in \(t_1,t_5,t_7,\) and \(t_8\). \(\square \)
Definition 3
Applying \(R=\{r_1,\dots ,r_c\}\) to t, denoted as \(t \rightarrow _{R} \acute{t}\), is to retrieve a unique final repair \(\acute{t}\), \(\forall t \in D\), after a series of modifications \(t\rightarrow _{r_1} t_1 \dots \rightarrow _{r_c} t_c\), regardless of the order in which the rules in R are appropriately applied.
Definition 4
t has a unique repair by R if there is only one \(\acute{t}\) such that \(t \rightarrow _{R} \acute{t}\).
To guarantee a unique final repair of each \(t_i \in D\) by applying R, verified attributes (\(VA_i\)) are defined, whose values cannot be updated by R. Then, we add an additional condition for applying a rule \(r_k \in R\) to \(t_i\), denoted as \(y_k \not \in VA_i \parallel X_k \not \subset VA_i\), which requires that \(y_k\) is not an attribute in \(VA_i\) or \(X_k\) is not a subset of \(VA_i\).
In contrast to fixing rules [9], WMRRs use fuzzy matching with the data tuples, so they can rely on director patterns containing typos to detect and rectify errors in the related attributes, besides rectifying the typos in those director patterns. Thus, WMRRs subsume fixing rules when the similarity threshold equals 1, but they act as automated fixing rules, since WMRRs are generated automatically from dirty data, which is the main strength of our rules (see Sect. 4), while the original fixing rules are provided by experts. Furthermore, the weights of WMRRs allow them to be employed differently for accurate data repairing, as we will explain later in Sect. 7.
4 Weighted matching rectifying rule discovery
In this section, we design our proposed rule discovery algorithm, WMRRD. First, we define the rule discovery problem in the data repairing context. Next, we develop the WMRRD algorithm to create and weight rules automatically from dirty data in-hand. Finally, we study the time complexity of this algorithm.
Problem 1
Given a data set D over a schema S and a set \(\varSigma \) of functional dependencies over D, the rule discovery problem is to discover a WMRR set R automatically from the data D based on \(\varSigma \) without any external data source.
Since every weighted matching rectifying rule is built on semantically correlated attributes, our rule discovery algorithm exploits the violations of the given functional dependencies and creates rules based on the assumption that the correct value of an attribute has a higher frequency than its wrong values (Assumption 1).
In our algorithm WMRRD (shown in Algorithm 1), for each FD \(\varphi _j:X_j \rightarrow y_j\), the discovery process follows these steps to create R.
Step 1 (lines 3–5). We build a hash map \(XY_j\) to index the data tuples of \(X_j \cup \{y_j\}\), which are held in \(VP_j\) by the getVerticalProjection procedure, where \(D(a_i)\) (line 24) refers to the values of an attribute \(a_i\) in all tuples of the data set D, s.t. \(a_i \in X_j \cup \{y_j\}\). To this end, we partition the tuples of \(VP_j\) according to the \(X_j\) patterns using the getHorizontalProjection procedure, where each part \(d_i\) composes an element in \(XY_j\); \(d_i\) is built on a specific pattern \(P_i(X_j)\) and a set of distinct \(y_j\) patterns, \(P_i(y_j)\). Each pattern in \(P_i(y_j)\) is attached with its frequency in \(d_i\). The hash map \(XY_j\) then has the following structure: \(\{(P_1(X_j), P_1(y_j)), \dots , (P_n(X_j), P_n(y_j))\}\), s.t. \(P_i(y_j)=\{(p_1(y_j), freq_{1j}), \dots , (p_{m}(y_j), freq_{mj})\}\), where n is the number of distinct \(X_j\) patterns in \(VP_j\), and m is the number of distinct \(y_j\) patterns in \(d_i\).
Step 2 (lines 6–10). We classify the \(y_j\) patterns in a part \(d_i\) according to their frequency, based on Assumption 1. Thus, the value with the maximum frequency is correct, while the other values are wrong. We adopt only the parts with wrong patterns, enforced by the condition \(|P_i(y_j)| > 1\) (line 7), because we aim at rules with \(WP(y)\ne \emptyset \) according to the syntax of WMRRs.
Step 3 (lines 11–19). A rule \(r_{ij}\) is created as follows: (1) the director pattern is \(P_i(X_j)\), (2) the correct pattern is the \(y_j\) pattern with the maximum frequency, and (3) the wrong patterns are the \(y_j\) patterns with frequencies less than the maximum. Two weights are calculated for \(r_{ij}\) based on Eqs. (3) and (4). \(r_{ij}\) is adopted if \(w_1(r_{ij})\) is no less than a given threshold \(\theta \), in order to assure the validity of the rule.
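The following condensed sketch captures the three steps, assuming the `WMRR` class sketched in Sect. 3.2; the structure loosely mirrors Algorithm 1, but the helper names are our own:

```python
from collections import Counter, defaultdict

def discover_wmrrs(D, fds, theta=0.6):
    """D: list of dicts (tuples); fds: list of (X, y) pairs; returns a WMRR list."""
    rules = []
    for X, y in fds:
        # Step 1: group y-patterns with frequencies by distinct X-pattern (hash map XY_j).
        xy = defaultdict(Counter)
        for t in D:
            xy[tuple(t[x] for x in X)][t[y]] += 1
        for xp, y_freqs in xy.items():
            # Step 2: keep only parts with wrong patterns (|P_i(y_j)| > 1).
            if len(y_freqs) <= 1:
                continue
            # Step 3: the most frequent y-value is the correct pattern (Assumption 1).
            cp, cp_freq = y_freqs.most_common(1)[0]
            wp = set(y_freqs) - {cp}
            w1 = cp_freq / sum(y_freqs.values())   # Eq. (3)
            w2 = cp_freq / len(D)                  # Eq. (4)
            if w1 >= theta:                        # validity threshold
                rules.append(WMRR(X=tuple(X), y=y, DP=dict(zip(X, xp)),
                                  WP=wp, cp=cp, w1=w1, w2=w2))
    return rules

# e.g., discover_wmrrs(D_res, [(["Nation"], "Capital")], theta=0.6)
# would recreate the rule r of Example 2 from Table 1.
```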
Example 4
Consider the data set in Table 1. A hash map XY is built w.r.t. fd in Example 1 to index the data tuples of \(Nation \cup Capital\) as: \(XY=\) {(“China”, {(“Beijing”, 4), (“Hongkong”, 1), (“Shanghai”, 1)}), (“Chiena”, {(“Beijing”, 1), (“Hongkong”, 1)})}. Supposing \(\theta =0.6\), the matching rectifying rule r in Example 2 is created and adopted. \(\square \)
In contrast to all other existing data repairing rules [8,9,10], weighted matching rectifying rules are fully automatically discovered by Algorithm 1, from the dirty data in-hand and without external master data.
Complexity The outer loop (line 3) iterates \(|\varSigma |\) times. Vertical projection (line 4) runs in time linear to |A|, i.e., the number of data attributes. Horizontal Projection (line 5) runs in time \(|D|*|XY_j|\), where \(|XY_j|\) is the number of distinct frequent \(X_j\) patterns in D. Then, line 5 in the worst case runs in time \(|D|^2\). The inner loop (lines 6–17) runs in time \(\sum _{i=1}^{|XY_j|} |P_i(y_j)|\) which equals |D| in the worst case. Accordingly, the total time complexity of Algorithm 1 is \(O(|\varSigma |.(|A|+|D|^2+|D|))\).
Although the time complexity of our rule discovery algorithm in the worst case is quadratic in number of tuples, data sets often have many frequent \(X_j\) and \(y_j\) patterns in practice, which is a common data characteristic in most real-life dirty data sets. So, using the hash map index \(XY_j\) can decrease the time complexity to be approximately linear as we see later in the experiments (Sect. 8.6).
5 Fundamental problems
5.1 Termination
One regular problem for rule-based data repairing methods is termination. Given a data set D and a set of rules R, the termination problem is to determine whether each repairing process on D based on R will end.
Indeed, it is easy to ensure that the repairing process ends by applying a WMRR set to each tuple. Let \(t \in D\) be a data tuple, and R be a WMRR set. According to the rule semantics in Sect. 3.2, repairing each tuple t based on R is a series of modifications which ends up with a final repair \(\acute{t}\).
5.2 Consistency
Given a WMRR set R over a data set D, the consistency problem is to determine whether R is a consistent set, i.e., whether applying R outputs a unique repair regardless of the order in which the applicable rules are applied.
R is consistent iff \(r_i\) and \(r_j\) are consistent \(\forall r_i, r_j \in R\) [9].
Theorem 1
The consistency problem of WMRRs is PTIME.
We prove Theorem 1 by developing a PTIME algorithm in Sect. 6, which checks rule consistency and also solves rule inconsistency.
5.3 Determinism
The determinism problem is to determine whether all possible terminating repairing processes lead to a unique repair.
According to the consistency condition and the rule semantics in Sect. 3.2, a unique final repair \(\acute{t}\) is retrieved by applying a consistent set of WMRRs to each tuple \(t \in D\). Thus, repairing D is deterministic.
5.4 Implication
Given a consistent set R of WMRRs and another rule \(r \not \in R\), the implication problem is to determine whether R implies r, denoted as \(R \models r\).
Definition 5
\(R \models r\) if (1) \(R\cup \{r\}\) is a consistent set, and (2) \(\forall t\in D, t \rightarrow _R \acute{t} \wedge t \rightarrow _{R\cup \{r\}} \acute{t}\). (1) means that there is no conflict between R and r. (2) means that any data tuple is rectified to the same repair by applying either R or \(R \cup \{r\}\), which marks r as a redundant rule.
Theorem 2
In general, the implication problem of WMRRs is coNP-complete, but it is PTIME when the data set is fixed [9].
6 Rule inconsistency resolution
Definition 6
Given a WMRR set R over D, and two different rules \(r_i, r_j \in R\): based on the definition of the consistency problem, \(r_i\) and \(r_j\) are consistent iff \(\forall t\in D\), t is rectified to the same \(\acute{t}\) whether we apply \(r_i\) then \(r_j\), or \(r_j\) then \(r_i\).
We develop an automatic algorithm Inconsis-Res (shown in Algorithm 2) for WMRR inconsistency resolution, which checks the consistency of each pair of rules and resolves any inconsistency automatically. Consider two rules \(r_i,r_j \in R\) of the following forms:
\(r_i: [X_i \approx _X DP_i(X_i)]\wedge [y_i \in WP_i(y_i)]\Rightarrow [DP_i(X_i)] \wedge [cp_i(y_i)]\).
\(r_j: [X_j \approx _X DP_j(X_j)]\wedge [y_j \in WP_j(y_j)]\Rightarrow [DP_j(X_j)]\) \(\wedge [cp_j(y_j)]\).
First, \(r_i\) and \(r_j\) are checked: if the two rules have different X attributes, or similar director patterns for the same X attributes, they can both be matched by the same tuple t (lines 3–5). In that case, \(r_i\) and \(r_j\) are considered inconsistent under four conditions:
- (1) When \(y_i=y_j\): if the rules share wrong patterns without sharing the same correct pattern (lines 6–9).
- (2) When \(y_i \ne y_j\), \(y_j \in X_i\), and \(y_i \not \in X_j\): if the correct pattern of \(y_j\) in \(r_i\) is wrong in \(r_j\) (lines 10, 11). Note that for a matching tuple t, if \(r_i\) is applied first, \(t(y_j)\) is correct; but if \(r_j\) is applied first, \(t(y_j)\) will be modified.
- (3) When \(y_i \ne y_j\), \(y_i \in X_j\), and \(y_j \not \in X_i\): if the correct pattern of \(y_i\) in \(r_j\) is wrong in \(r_i\) (lines 12, 13).
- (4) When \(y_i \ne y_j\), \(y_i \in X_j\), and \(y_j \in X_i\): if the correct pattern of \(y_j\) in \(r_i\) is wrong in \(r_j\), and the correct pattern of \(y_i\) in \(r_j\) is wrong in \(r_i\) (lines 14, 15).
Second, if \(r_i\) and \(r_j\) are inconsistent, the rule with the lower confidence \(w_1\) is excluded (lines 18–21).
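The following sketch illustrates the pairwise check and resolution, assuming the `WMRR` class and `similar` predicate sketched earlier; it is a simplified reading of Algorithm 2, not the exact implementation:

```python
def resolve_inconsistencies(rules, similar):
    """Check each pair of WMRRs (conditions 1-4 above); drop the rule with lower w1."""
    def may_both_match(ri, rj):
        # Rules on different X attributes, or with similar director patterns
        # on their shared X attributes, can both match one tuple (lines 3-5).
        shared = set(ri.X) & set(rj.X)
        return all(similar(ri.DP[x], rj.DP[x]) for x in shared)

    def conflict(ri, rj):
        if not may_both_match(ri, rj):
            return False
        if ri.y == rj.y:                              # condition (1)
            return bool(ri.WP & rj.WP) and ri.cp != rj.cp
        if rj.y in ri.X and ri.y not in rj.X:         # condition (2)
            return ri.DP[rj.y] in rj.WP
        if ri.y in rj.X and rj.y not in ri.X:         # condition (3)
            return rj.DP[ri.y] in ri.WP
        if ri.y in rj.X and rj.y in ri.X:             # condition (4)
            return ri.DP[rj.y] in rj.WP and rj.DP[ri.y] in ri.WP
        return False

    kept = list(rules)
    for i in range(len(kept)):
        for j in range(i + 1, len(kept)):
            if kept[i] is None:
                break
            if kept[j] is None:
                continue
            if conflict(kept[i], kept[j]):
                # Keep the more confident rule (higher w1), lines 18-21.
                if kept[i].w1 < kept[j].w1:
                    kept[i] = None
                else:
                    kept[j] = None
    return [r for r in kept if r is not None]
```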
In contrast to existing data repairing rules such as fixing rules [9], where experts are required to resolve inconsistency, we resolve this problem for WMRRs automatically while keeping the high-quality rules.
Complexity Since Algorithm 2 checks each pair of rules, its time complexity is \(O({|R|}^2)\), where |R| is the rule set size, i.e., the number of rules. However, the algorithm scales better than this bound in our experiments.
7 WMRR-based data repairing
In this section, we present our data repairing algorithm based on weighted matching rectifying rules, WMRR-DR. First, we define the WMRR-based data repairing problem. Then, we develop the WMRR-DR algorithm and explain the repairing process of this algorithm. Finally, we study the time complexity of the algorithm.
Problem 2
Given a data set D over a schema S and a consistent set R of WMRRs over D, the WMRR-based data repairing problem is to retrieve a valid and unique repair \(\acute{D}\) of D by detecting the errors in D and rectifying them uniquely, dependably and automatically, without user verifications.
To use R efficiently in the repairing process, we index it as a hash map IR, in order to quickly determine the candidate rules CR for each tuple, as we will see in the next steps. IR maps an attribute-value pair p(a, v) to a WMRR set \(R_p\), s.t. \(\forall r_k \in R_p: r_k\) matches p, i.e., \(a \in X_k \wedge DP_k(a)=v\).
Our algorithm WMRR-DR (Shown in Algorithm 3) addresses Problem 2 by discovering a unique and valid repair \(\acute{t}_i\) for each tuple \(t_i \in D\), as follows.
Step I (lines 3–11). A candidate rule set \(CR_i\) is identified by detecting the rules of IR that exactly match a pair p in \(t_i\), denoted \(IR_p\). When no such rules are found, \(IR_{\approx p}\) is detected as the rules of IR that similarly match p, i.e., \(a \in X_k \wedge DP_k(a)\approx _a v\), based on Eqs. (1) and (2) in Sect. 3.1.
Step II (lines 12–16). To find the matching rules: (1) \(CR_i\) is classified based on the FDs, where \(\forall r_k \in R_{\varphi _{j}}: X_k=X_j \wedge y_k=y_j\). (2) A matching rule set \(R(t_i)\) is identified by the findMatchingRules procedure. (3) \(R(t_i)\) is filtered to \(\acute{R}(t_i)\) by the filterMatchingRules procedure, in order to assure the correctness of the director pattern of the applied rules. \(\acute{R}(t_i)\) holds the rules with the minimum distance to \(t_i\), where this distance is computed based on Eq. (5) in Definition 7. If \(\acute{R}(t_i)\) has more than one rule, some dirty rules possibly exist, so \(\acute{R}(t_i)\) is filtered again, keeping the rules with the maximum \(w_2\) based on Assumption 1.
Definition 7
The distance between a rule \(r_k\) and a tuple \(t_i\) that matches it is defined as follows:

$$dis(r_k,t_i)=\sum _{x_n \in X_k} dist(DP_k(x_n), t_i(x_n)) \quad (5)$$

where \(dist(DP_k(x_n), t_i(x_n))\) is defined in Eq. (2), Sect. 3.1.
Step III (lines 17-31). For each \(r_k \in \acute{R}(t_i)\) that can be applied to \(t_i\), \(t_i\) is updated and the verified attributes \(VA_i\) are extended accordingly.
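The following sketch summarizes the per-tuple repairing loop, assuming the `WMRR` objects and `similar` predicate sketched earlier; it simplifies Algorithm 3 (e.g., it uses a 0/1 per-attribute distance in place of Eq. (2) and omits the inverted index IR):

```python
def repair(D, rules, similar):
    """Repair each tuple with its best-matching rules (simplified WMRR-DR)."""
    repaired = []
    for t in D:
        fixed, verified = dict(t), set()
        # Steps I/II: collect matching rules, keep those nearest to the tuple,
        # and break ties by the larger support weight w2 (Assumption 1).
        matching = [r for r in rules if r.matches(fixed, similar)]
        if matching:
            def dis(r):  # crude stand-in for the distance of Definition 7
                return sum(r.DP[x] != fixed[x] for x in r.X)
            best = min(dis(r) for r in matching)
            nearest = [r for r in matching if dis(r) == best]
            top_w2 = max(r.w2 for r in nearest)
            nearest = [r for r in nearest if r.w2 == top_w2]
            # Step III: apply rules unless their attributes are already verified.
            for r in nearest:
                if r.y not in verified or not set(r.X) <= verified:
                    fixed = r.apply(fixed)
                    verified |= set(r.X) | {r.y}
        repaired.append(fixed)
    return repaired
```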
The following example explains the importance of filtering rules based on the distance criterion, followed by the weight criterion.
Example 5
Consider the data set in Table 1 and the rule in Example 2 as \(r_1\). Suppose another rule with a wrong director pattern: \(r_2: ((Nation \approx \) “Chena”),(Capital \(\in \) {“Hongkong”}) \(\Rightarrow \) (“Chena”) \(\wedge \) (“Beijing”)). To repair \(t_2\), for example, \(R(t_2)=\{r_1,r_2\}\) and \(\acute{R}(t_2)=\{r_1\}\) since \(dis(r_1,t_2)< dis(r_2,t_2)\). Then, \(t_2(Capital)\) is rectified to “Beijing.” To repair \(t_6\) as another example, \(R(t_6)=\{r_1,r_2\}\) and, at first, \(\acute{R}(t_6)=\{r_1,r_2\}\). Based on Assumption 1, \(w_2(r_1)>w_2(r_2)\) since \(DP_2(Nation)\) is wrong, so \(\acute{R}(t_6)\) is updated to \(\{r_1\}\). Accordingly, \(t_6(Nation)\) is rectified to “China”, and \(t_6(Capital)\) is rectified to “Beijing.” Suppose another tuple \(t_9\) where \(t_9(Nation)\)=“Chena” and \(t_9(Capital)\)=“Hongkong”. Similarly, \(R(t_9)=\{r_1,r_2\}\) and \(\acute{R}(t_9)=\{r_2\}\). Then, \(r_2\) rectifies \(t_9(Capital)\) to “Beijing.” \(\square \)
Note that Example 5 highlights all possible cases when there are rules with a correct DP(X) and corresponding rules with typos in DP(X). However, tuples with the same typos on X are very few, since typos are distributed randomly over the X attributes of the data tuples. Therefore, it is rare to generate rules with typos in DP(X), as such rules hardly exceed the validity threshold (line 13, Algorithm 1). Accordingly, the worst case, when there are rules with typos in DP(X) but no corresponding correct rules, is also very rare, and its effect is negligible compared with the large number of corrections made by the other rules.
Complexity The outer loop (lines 4–31) iterates |D| times to repair all the data, where each iteration rectifies one tuple. The first inner loop (lines 6–11) runs in time linear to |IR|, which in the worst case equals |R|. The second inner loop (lines 12–28) runs in time linear to \(|\varSigma |\), since the sizes of \(|CR_i|, |R_{\varphi _{j}}|, |R(t_i)|\), and \(|\acute{R}(t_i)|\) are small enough to be considered constants. The number of FDs \(|\varSigma |\) is also small compared with the number of rules |R|. Accordingly, the total time complexity of Algorithm 3 is O(|D|.|R|).
8 Experimental results
In this section, we discuss our extensive experiments to evaluate our rule-based data repairing method, including the WMRRD, Inconsis-Res and WMRR-DR algorithms, where WMRR-DR repairs data errors based on a consistent set of WMRRs that were discovered by WMRRD and checked by Inconsis-Res. First, we evaluate the effectiveness of our data repairing method. Next, we study the effect of the threshold \(\theta \) on the accuracy of data repairing and the number of discovered rules. Then, we check the effect of the typo rate on the number of discovered rules and how varying the number of rules affects the data repairing accuracy. After that, we compare the performance of our rules with master data-built rules. Finally, we study the efficiency of our three algorithms.
8.1 Experiments context
We conducted the experiments on a 3.2 GHz Intel(R) Core(TM) i5 processor with 4 GB RAM, running Microsoft Windows 10; all algorithms were implemented in Java.
Data Sets We performed our experiments on both real-life and synthetic data. (1) The Hospital data set (HOSP) is a public data set provided by the US Department of Health and Human Services. It consists of 115K tuples with 17 attributes and 24 FDs. (2) The Address data set (UIS) is a synthetic data set generated by the UIS data set generator (http://www.cs.utexas.edu/users/ml/riddle/data.html). It consists of 15K tuples with 11 attributes and 18 FDs. Table 2 shows the functional dependencies over each data set.
Noise We added two kinds of errors to the attributes on which FDs were defined: (1) typos; (2) active domain errors, where a value v of an attribute \(a_i\) is changed to a different value from the active domain of \(a_i\), i.e., a random value from \(dom(a_i)\setminus v\). In detail, typos are generated at a random location of the attribute value as an insertion, deletion or replacement, except for null values and values with one character, in order not to lose the value semantics. Active domain errors are generated for attributes whose active domain has more than one value. For example, a typo in “Beijing” could be “Beijin”, while an active domain error could be “Ottawa.” The clean data sets were used as ground truth. We generated the two kinds of errors randomly by adding noise with a specific rate (10% by default), where the noise rate is the sum of the typo rate and the active domain error rate. We spread the noise over the data tuples by generating errors in at most two attributes of each tuple.
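For concreteness, a sketch of this kind of noise injection follows (our own helpers; the paper's generator may differ in details):

```python
import random

def add_typo(value: str) -> str:
    """Insert, delete, or replace one character at a random position."""
    if value is None or len(value) <= 1:
        return value  # keep nulls and single-character values intact
    i = random.randrange(len(value))
    op = random.choice(["insert", "delete", "replace"])
    c = random.choice("abcdefghijklmnopqrstuvwxyz")
    if op == "insert":
        return value[:i] + c + value[i:]
    if op == "delete":
        return value[:i] + value[i + 1:]
    return value[:i] + c + value[i + 1:]

def active_domain_error(value, domain):
    """Replace value with a different value from the attribute's active domain."""
    others = [v for v in domain if v != value]
    return random.choice(others) if others else value
```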
Algorithms We implemented the three proposed algorithms: (1) WMRRD: the rule discovery algorithm (Sect. 4); (2) Inconsis-Res: the inconsistency resolution algorithm for the discovered rules (Sect. 6); (3) WMRR-DR: the data repairing algorithm based on the discovered consistent rules (Sect. 7). For comparison, we implemented the dependable and automatic data repairing method, FR-DR, based on fixing rules that were provided by experts [9].
Measuring Quality For a fair comparison with the state-of-the-art methods, we used the accuracy measures precision, recall, and \(f\text{-}measure\): precision is the ratio of correctly rectified attribute values to all rectified attribute values; recall is the ratio of correctly rectified attribute values to all wrong attribute values. Precision assesses the correctness of repairing, while recall assesses the completeness of repairing. \(f\text{-}measure\) is the harmonic mean of precision and recall, defined as follows:

$$f\text{-}measure=\frac{2 \times precision \times recall}{precision + recall}$$
8.2 Effectiveness comparison
In the first experiment, we compared the effectiveness of our repairing method WMRR-DR with FR-DR on both data sets. The comparison results are shown in Table 3 for UIS and Table 4 for HOSP, where we fixed the noise rate at 10%, varied the typo rate from 0% up to 100% in steps of 10% (equivalently, varying the active domain error rate from 100% down to 0%), and reported the recall, the precision and the number of repairs (#Repair). We set the threshold \(\theta = 0.4\) for UIS and \(\theta =0.6\) for HOSP (see Sect. 8.3). Both tables show that our method outperforms FR-DR in recall for all adopted typo rates, while maintaining 100% precision. This is because our method rectifies a greater number of errors than FR-DR, since WMRRs depend on similarity matching to detect and repair more errors. WMRRs are also built on the data that are most likely to be correct and are weighted to ensure their quality.
Furthermore, the recall values are very different on UIS and HOSP because HOSP has more frequent patterns for each FD than UIS, and the existence of these frequent patterns agrees with Assumption 1 (the correct value of an attribute has a higher frequency than its wrong values). Therefore, WMRRs, which are generated based on Assumption 1, can detect and rectify the errors that occur in these frequent patterns.
For the sensitivity to the error type, we can observe that while keeping 100% of precision, the recall increases with the growth of typo rate and the decrease of active domain error rate on HOSP, but it fluctuates on UIS. This is because (1) HOSP has more frequent patterns for each FD than UIS, as we mentioned before; (2) more consistent rules are generated with the growth of typo rate and the decrease of active domain error rate, as we see next in Sect. 8.4.
Since our method has higher recall than FR-DR with the same precision for each adopted typo rate, we measure the improvement in accuracy in terms of the average \(f\text{-}measure\) on both data sets, as shown in Table 5. The results show that our method improves the accuracy by up to 10.8% for UIS and up to 87% for HOSP. These findings verify that our method discovers effective rules and repairs errors based on these rules effectively. In the following experiments, accuracy is evaluated using the \(f\text{-}measure\).
8.3 Effect of threshold \(\theta \)
First, we checked the effect of decreasing the threshold \(\theta \) from 0.9 down to 0 on the discovered consistent rule set size for the two data sets, with a typo rate of 50%. Figure 1a and b reports the rule set size, i.e., the number of rules, on the UIS and HOSP data sets, respectively. We observe the following: (1) The rule set size increases while \(\theta \) decreases, since more rules are discovered and adopted. (2) With the decrease of \(\theta \), the growth of the rule set size is greater for UIS than for HOSP, since the attribute values in UIS are less frequent than those in HOSP; for example, the rule set size is almost the same for both thresholds 0.7 and 0.6 on HOSP, while on UIS the rule set size for \(\theta =0.6\) is more than double that for \(\theta =0.7\). (3) The rule set size reaches its highest value when \(\theta \)=0.6 for HOSP and \(\theta \)=0.4 for UIS.
Second, with the same settings, we studied the accuracy of WMRR-DR for these different thresholds on the two data sets. Since we can guarantee almost 100% precision, we measure the accuracy in terms of the \(f\text{-}measure\) on both data sets. Figure 1c and d reports the \(f\text{-}measure\) results on UIS and HOSP, respectively. They show that the accuracy of our method increases gradually with the drop of \(\theta \), as expected from the growth of the rule set size, until the accuracy reaches 86.8% on HOSP for \(\theta =0.6\), and 10% on UIS when \(\theta =0.4\). Moreover, our method outperforms FR-DR significantly in accuracy for all thresholds, except for \(\theta =0.9\) on UIS, where both methods have the same accuracy since the attribute values in UIS have low frequency. Accordingly, we adopted \(\theta =0.4\) for UIS and \(\theta =0.6\) for HOSP in our experiments.
8.4 Effect of typo rate and rule set size
We investigated the number of discovered consistent WMRRs compared with the number of consistent fixing rules (FRs) at different typo rates. We increased the typo rate from 0% to 100% and reported the number of both kinds of rules on UIS and HOSP in Fig. 2a and b, respectively. The results show that more WMRRs are discovered with more typos on HOSP, while the number of WMRRs on UIS changes with a narrow fluctuation, often decreasing slightly with the growth of the typo rate. This change depends on the extent to which the patterns of each FD are frequent and on how the typos are distributed over these frequent patterns. In contrast, the same number of FRs is used even for different typo rates, since they are provided by experts once. These findings are consistent with the accuracy comparison in Sect. 8.2.
For further performance understanding, we also examined the repairing accuracy of our method WMRR-DR compared with FR-DR for different numbers of rules. We increased the number of rules from 10 to 100 for UIS and from 100 to 1000 for HOSP, with a typo rate of 50% for both data sets. Figure 3a and b reports the \(f\text{-}measure\) on UIS and HOSP, respectively. The results indicate that although both methods achieve better accuracy by using more rules, our method WMRR-DR is more accurate than FR-DR even when using a small number of discovered rules.
8.5 Comparison with master data-built rules
In this section, we compared our rules with editing rules [8] and Sherlock rules [10], which are both built on master data. This comparison depends on the extent to which master data are available, i.e., how many values are available in the master data to match with the rules and the dirty data. Therefore, we considered two cases: (1) master data are available with all required values, which is the best condition for editing rules and Sherlock rules; (2) not all required values are available in the master data. We checked case 2 with 50% of the master data available. For this purpose, we encoded master data values into editing rules and Sherlock rules, making them automatic rules. We excluded the wrong patterns of our rules to simulate editing rules, since these rules do not include data errors, while we kept them to simulate Sherlock rules.
The experimental results are shown in Fig. 4a and b for case 1 and Fig. 4c and d for case 2, on UIS and HOSP, respectively. We varied the typo rate from 0% to 100% in steps of 10% and reported the \(f\text{-}measure\) of repairing based on our rules, editing rules and Sherlock rules, denoted by WMRR-DR, ER-DR, and SR-DR, respectively.
The results of case 1 indicate that WMRRs, as dirty data-built rules, can achieve the same or better accuracy than master data-built rules even under the best condition, when all required master data are available. In particular, WMRR-DR outperforms ER-DR for all typo rates except 0%, while it has the same accuracy as SR-DR. This is because WMRRs and Sherlock rules adopt fuzzy matching, whereas editing rules use equality matching.
The results of case 2 report that WMRR-DR outperforms both ER-DR and SR-DR when not all required master data are available. This indicates that the performance of editing rules and Sherlock rules is dependent on the availability of master data. In contrast, our rules are independent of this factor since they are built on data in-hand.
8.6 Efficiency and scalability
On UIS and HOSP, we evaluated the efficiency of the WMRRD and WMRR-DR algorithms by varying the data size, i.e., the number of tuples, and the efficiency of Inconsis-Res by varying the rule set size, i.e., the number of checked rules.
Figure 5a and b shows the runtime performance of WMRRD on UIS and HOSP, respectively. They report that the runtime of WMRRD is approximately linear in the number of tuples on both data sets. This result shows that although the time complexity of WMRRD in the worst case is quadratic in the number of tuples (Sect. 4), it scales quite well in practice.
Figure 5c and d shows the runtime performance of Inconsis-Res on UIS and HOSP, respectively. The runtime of inconsistency resolution increases linearly on UIS with its small rule set, and nonlinearly on HOSP with its large rule set, where each pair of rules must be checked including all their wrong patterns. This nonlinear result is not surprising, given the large rule set and the large number of negative patterns to be tested, since attribute values are highly frequent in HOSP. However, Inconsis-Res scales well, taking only 2.5 min to check and resolve inconsistency automatically in more than 58K rules on HOSP.
Figure 5e and f depicts a comparison between the runtime of WMRR-DR and FR-DR on UIS and HOSP, respectively. It is not surprising that the runtime of WMRR-DR with a large set of rules is higher than that of FR-DR with a small set of rules. Note that there is a trade-off between the accuracy and efficiency of WMRR-DR and FR-DR. As shown in Fig. 5e and f, the repairing time of FR-DR is significantly lower than that of WMRR-DR; on the other hand, as illustrated in Table 5, the repairing accuracy of WMRR-DR is significantly higher than that of FR-DR. Consequently, users can either repair a small number of data errors based on FRs at a low time cost or repair a large number of data errors based on WMRRs at a higher time cost. As a result, our proposed method is preferable for critical real-world applications that value high-quality data over time cost, sacrificing some time in order to avoid the greater costs caused by errors.
Summary From the experimental results, (1) our method achieves higher accuracy than FR-DR, since it achieves higher recall by rectifying more errors without any loss of repairing precision; (2) more rules are discovered by decreasing the threshold \(\theta \) and, hence, the repairing accuracy improves; (3) our method is more accurate than FR-DR even when using a small number of discovered rules; (4) without master data, our rules achieve comparable or better accuracy than master data-built rules; (5) WMRRD scales approximately linearly with the data size and Inconsis-Res scales well with the rule set size, while there is a trade-off between the accuracy and efficiency of WMRR-DR and FR-DR.
9 Conclusion
In this paper, we introduced weighted matching rectifying rules, a new class of data repairing rules with which reliable data repairing can be performed automatically, based on the data in-hand, without an external data source or user verifications. We proposed three effective algorithms: (1) the rule discovery algorithm WMRRD, which is the first algorithm to discover repairing rules automatically from the dirty data in-hand; (2) the inconsistency resolution algorithm Inconsis-Res, which checks rule consistency and also resolves the captured inconsistencies automatically; and (3) the data repairing algorithm WMRR-DR, which rectifies data errors based on the discovered rules. Our method is reliable, automatic, and effective. We conducted extensive experiments on both real-life and synthetic data sets, and the results demonstrate that our method achieves both high precision and high recall.
This research is the first attempt to discover repairing rules automatically from the data in-hand, utilizing correct values to repair errors without any external source. In future work, we would like to define a new kind of repairing rules to uncover and repair errors in 1-to-n relationships, i.e., where an attribute may have n correct values for a given value of a set of related attributes; our rules, like the other existing rules, detect and repair errors of 1-to-1 relationships with only one correct value for each attribute. Another interesting direction is to investigate how to exploit external resources rather than discarding them, especially for domains with gold standards or benchmarks, to further boost automatic data repairing solutions or to propose new semi-automatic methods, which may reduce the number of discovered rules and enhance repairing efficiency and scalability without loss of quality.
References
Organizations are full of dirty data. http://www.itpro.co.uk/609057/firms-full-of-dirty-data. Accessed 16 May 2019
Dirty data's effect on the US economy. http://www.ringlead.com/dirty-data-costs-economy-3-trillion/. Accessed 22 Apr 2019
Bohannon, P., Fan, W., Flaster, M., Rastogi, R.: A cost-based model and effective heuristic for repairing constraints by value modification. In: Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data. ACM, pp. 143–154 (2005)
Bohannon, P., Fan, W., Geerts, F., Jia, X., Kementsietsidis, A.: Conditional functional dependencies for data cleaning. In: IEEE 23rd International Conference on Data Engineering, 2007. ICDE 2007. IEEE, pp. 746–755 (2007)
Fan, W., Jia, X., Li, J., Ma, S.: Reasoning about record matching rules. Proc. VLDB Endow. 2(1), 407–418 (2009)
Wang, Y., Song, S., Chen, L., Yu, J.X., Cheng, H.: Discovering conditional matching rules. ACM Trans. Knowl. Discov. Data 11(4), 46 (2017)
Fan, W., Li, J., Ma, S., Tang, N., Yu, W.: Towards certain fixes with editing rules and master data. Proc. VLDB Endow. 3(1–2), 173–184 (2010)
Fan, W., Li, J., Ma, S., Tang, N., Yu, W.: Towards certain fixes with editing rules and master data. VLDB J. 21(2), 213–238 (2012)
Wang, J., Tang, N.: Towards dependable data repairing with fixing rules. In: SIGMOD Conference, pp. 457–468 (2014)
Interlandi, M., Tang, N.: Proof positive and negative in data cleaning. In: 2015 IEEE 31st International Conference on Data Engineering. IEEE, pp. 18–29 (2015)
Cong, G., Fan, W., Geerts, F., Jia, X., Ma, S.: Improving data quality: consistency and accuracy. In: Proceedings of the 33rd International Conference on Very Large Data Bases. VLDB Endowment, pp. 315–326 (2007)
Fan, W.: Dependencies revisited for improving data quality. In: Proceedings of the Twenty-Seventh ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. ACM, pp. 159–170 (2008)
Arenas, M., Bertossi, L., Chomicki, J.: Consistent query answers in inconsistent databases. In: PODS, vol. 99. Citeseer, pp. 68–79 (1999)
Kolahi, S., Lakshmanan, L.V.S.: On approximating optimum repairs for functional dependency violations. In: Proceedings of the 12th International Conference on Database Theory. ACM, pp. 53–62 (2009)
Beskales, G., Ilyas, I.F., Golab, L.: Sampling the repairs of functional dependency violations under hard constraints. Proc. VLDB Endow. 3(1–2), 197–207 (2010)
Beskales, G., Ilyas, I.F., Golab, L., Galiullin, A.: Sampling from repairs of conditional functional dependency violations. VLDB J. 23(1), 103–128 (2014)
Fan, W., Geerts, F., Jia, X., Kementsietsidis, A.: Conditional functional dependencies for capturing data inconsistencies. ACM Trans. Database Syst. 33(2), 6 (2008)
Fan, W., Ma, S., Tang, N., Yu, W.: Interaction between record matching and data repairing. J. Data Inf. Qual. 4(4), 16 (2014)
Chu, X., Ilyas, I.F., Papotti, P.: Holistic data cleaning: putting violations into context. In: 2013 IEEE 29th International Conference on Data Engineering (ICDE). IEEE, pp. 458–469 (2013)
Ahmad, H.A., Wang, H.: An effective weighted rule-based method for entity resolution. Distrib. Parallel Databases 36(3), 593–612 (2018)
Li, L., Li, J., Gao, H.: Rule-based method for entity resolution. IEEE Trans. Knowl. Data Eng. 27(1), 250–263 (2014)
He, J., Veltri, E., Santoro, D., Li, G., Mecca, G., Papotti, P., Tang, N.: Interactive and deterministic data cleaning. In: Proceedings of the 2016 International Conference on Management of Data. ACM, pp. 893–907 (2016)
Chu, X., Morcos, J., Ilyas, I.F., Ouzzani, M., Papotti, P., Tang, N., Ye, Y.: Katara: a data cleaning system powered by knowledge bases and crowd sourcing. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, pp. 1247–1261 (2015)
Hao, S., Tang, N., Li, G., Li, J., Feng, J.: Distilling relations using knowledge bases. VLDB J. 27(4), 497–519 (2018)
Raman, V., Hellerstein, J.M.: Potter’s wheel: an interactive data cleaning system. In VLDB, vol. 1, pp. 381–390 (2001)
Heer, J., Hellerstein, J.M., Kandel, S.: Predictive interaction for data transformation. In: CIDR (2015)
Yakout, M., Elmagarmid, A.K., Neville, J., Ouzzani, M., Ilyas, I.F.: Guided data repair. Proc. VLDB Endow. 4(5), 279–289 (2011)
Rekatsinas, T., Chu, X., Ilyas, I.F., Ré, C.: HoloClean: holistic data repairs with probabilistic inference. Proc. VLDB Endow. 10(11), 1190–1201 (2017)
Yakout, M., Berti-Équille, L., Elmagarmid, A.K.: Don’t be scared: use scalable automatic repairing with maximal likelihood and bounded changes. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. ACM, pp. 553–564 (2013)
Shin, J., Wu, S., Wang, F., De Sa, C., Zhang, C., Ré, C.: Incremental knowledge base construction using DeepDive. Proc. VLDB Endow. 8(11), 1310–1321 (2015)
Bach, S.H., Broecheler, M., Huang, B., Getoor, L.: Hinge-loss Markov random fields and probabilistic soft logic. arXiv:1505.04406 (2015)
Niu, F., Ré, C., Doan, A.H., Shavlik, J.: Tuffy: scaling up statistical inference in Markov logic networks using an rdbms. Proc. VLDB Endow. 4(6), 373–384 (2011)
Singh, R., Meduri, V., Elmagarmid, A., Madden, S., Papotti, P., Quiané-Ruiz, J.-A., Solar-Lezama, A., Tang, N.: Generating concise entity matching rules. In: Proceedings of the 2017 ACM International Conference on Management of Data. ACM, pp. 1635–1638 (2017)
Herzog, T.N., Scheuren, F.J., Winkler, W.E.: Data quality and record linkage techniques. Springer, Berlin (2007)
Hao, S., Tang, N., Li, G., He, J., Ta, N., Feng, J.: A novel cost-based model for data repairing. IEEE Trans. Knowl. Data Eng. 29(4), 727–742 (2017)
Acknowledgements
This paper was partially supported by NSFC Grant U1509216, U1866602, 61772157, The National Key Research and Development Program of China 2016YFB1000703, NSFC Grants 61472099, 61602129, National Sci-Tech Support Plan 2015BAH10F01, the Scientific Research Foundation for the Returned Overseas Chinese Scholars of Heilongjiang Province LC2016026. The authors would like to thank Prof. Jiannan Wang for his support in this work and also the anonymous reviewers for their valuable comments that greatly improved this paper.