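I thought I had narrowed this down to a culture issue (Cyrillic characters), which I resolved, but now I am getting "false negatives" (two apparently equal strings are reported as not equal).
I have looked at the following similar questions and tried the comparison approaches below.
Similar SO questions I have checked: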
> Why does my comparison always return false?
> C# string equality operator returns false, but I’m pretty sure it should be true… What?
> String Equals() method fails even though the two strings are same in C#?
> Differences in string compare methods in C#
Here is an example of the strings being compared (title and description):
Feed title: Ellsberg: He’s a hero
Feed desc: Daniel Ellsberg tells CNN’s Don Lemon that NSA leaker Edward Snowden showed courage, has done an enormous service.
db title: Ellsberg: He’s a hero
db desc: Daniel Ellsberg tells CNN’s Don Lemon that NSA leaker Edward Snowden showed courage, has
done an enormous service.
My application compares the values fetched from an RSS feed with the values I have in the database, and should only insert "new" values.

    // fetch existing articles from DB for the current Feed:
    List<Article> thisFeedArticles =
        (from ar in entities.Items
         where ar.ItemTypeId == (int)Enums.ItemType.Article
               && ar.ParentId == Feed.FeedId
               && ar.DatePublished > datelimit
         select new Article
         {
             Title = ar.Title,
             Description = ar.Blurb
         }).ToList();

Every one of the comparisons below reports no match for the Ellsberg title/description, i.e. matches1 to matches6 all have Count() == 0.
(Please excuse the enumerated variable names; they are only for testing.)

    // comparison methods
    CompareOptions compareOptions = CompareOptions.OrdinalIgnoreCase;
    CompareOptions compareOptions2 = CompareOptions.IgnoreSymbols | CompareOptions.IgnoreNonSpace;

    //1
    IEnumerable<Article> matches = thisFeedArticles.Where(b =>
        String.Compare(b.Title.Trim().Normalize(), a.Title.Trim().Normalize(), CultureInfo.InvariantCulture, compareOptions) == 0
        && String.Compare(b.Description.Trim().Normalize(), a.Description.Trim().Normalize(), CultureInfo.InvariantCulture, compareOptions) == 0
    );

    //2
    IEnumerable<Article> matches2 = thisFeedArticles.Where(b =>
        String.Compare(b.Title, a.Title, CultureInfo.CurrentCulture, compareOptions2) == 0
        && String.Compare(b.Description, a.Description, CultureInfo.CurrentCulture, compareOptions2) == 0
    );

    //3
    IEnumerable<Article> matches3 = thisFeedArticles.Where(b =>
        String.Compare(b.Title, a.Title, StringComparison.OrdinalIgnoreCase) == 0
        && String.Compare(b.Description, a.Description, StringComparison.OrdinalIgnoreCase) == 0
    );

    //4
    IEnumerable<Article> matches4 = thisFeedArticles.Where(b =>
        b.Title.Equals(a.Title, StringComparison.OrdinalIgnoreCase)
        && b.Description.Equals(a.Description, StringComparison.OrdinalIgnoreCase)
    );

    //5
    IEnumerable<Article> matches5 = thisFeedArticles.Where(b =>
        b.Title.Trim().Equals(a.Title.Trim(), StringComparison.InvariantCultureIgnoreCase)
        && b.Description.Trim().Equals(a.Description.Trim(), StringComparison.InvariantCultureIgnoreCase)
    );

    //6
    IEnumerable<Article> matches6 = thisFeedArticles.Where(b =>
        b.Title.Trim().Normalize().Equals(a.Title.Trim().Normalize(), StringComparison.OrdinalIgnoreCase)
        && b.Description.Trim().Normalize().Equals(a.Description.Trim().Normalize(), StringComparison.OrdinalIgnoreCase)
    );

    if (matches.Count() == 0 && matches2.Count() == 0 && matches3.Count() == 0
        && matches4.Count() == 0 && matches5.Count() == 0 && matches6.Count() == 0
        && matches7.Count() == 0)
    {
        // insert values
    }

    // this if statement was the first approach
    //if (!thisFeedArticles.Any(b => b.Title == a.Title && b.Description == a.Description))
    // {
    //     insert
    // }

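Obviously I only use one of the options above at a time.
For the most part the options above do work and most duplicates are detected, but some duplicates still slip through the cracks. I just need to understand what the "cracks" are, so any suggestions would be welcome.
I even tried converting the strings to byte arrays and comparing those (I have removed that code for now, sorry).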
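When two strings look identical but compare as different, it can help to locate the first position where they diverge and print the code units on either side. The following is a minimal sketch of that idea, not the removed byte-array code; the class and method names (StringDiff, FirstDifference, Dump) and the two sample values are made up for illustration.

```csharp
using System;
using System.Linq;

public static class StringDiff
{
    // Returns the index of the first UTF-16 code unit at which the two strings
    // differ, or -1 if they are identical under an ordinal comparison.
    public static int FirstDifference(string x, string y)
    {
        int shared = Math.Min(x.Length, y.Length);
        for (int i = 0; i < shared; i++)
        {
            if (x[i] != y[i]) return i;
        }
        // Either the strings are equal or one is a prefix of the other.
        return x.Length == y.Length ? -1 : shared;
    }

    // Renders every UTF-16 code unit as U+XXXX so look-alike characters become visible.
    public static string Dump(string s)
    {
        return string.Join(" ", s.Select(c => "U+" + ((int)c).ToString("X4")));
    }

    public static void Main()
    {
        // Hypothetical sample values: a typographic apostrophe versus an ASCII one.
        string fromFeed = "He\u2019s a hero";
        string fromDb = "He's a hero";

        int at = FirstDifference(fromFeed, fromDb);
        Console.WriteLine("First difference at index " + at);
        Console.WriteLine(Dump(fromFeed.Substring(at, 1)) + " vs " + Dump(fromDb.Substring(at, 1)));
    }
}
```

Running a helper like this against a feed value and its database counterpart shows directly which character is responsible for the mismatch.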
The Article object is as follows:

    public class Article
    {
        public string Title;
        public string Description;
    }

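Update:
I have tried normalizing the strings as well as including the IgnoreSymbols CompareOption, and I still get a false negative (no match). What I have noticed is that apostrophes seem to show up consistently in the false non-matches, so I think it may be a case of apostrophe versus single quote, i.e. ’ vs ' (and so on), but surely IgnoreSymbols should take care of that?
I found a couple more similar SO posts: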
C# string comparison ignoring spaces, carriage return or line breaks
String comparison: InvariantCultureIgnoreCase vs OrdinalIgnoreCase?
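Next step: try using a regex to strip whitespace, as per this answer: https://stackoverflow.com/a/4719009/2261245
Update 2
After the six comparisons STILL returned no matches, I realized there had to be another factor skewing the results, so I tried the following:

    //7
    IEnumerable<Article> matches7 = thisFeedArticles.Where(b =>
        Regex.Replace(b.Title, "[^0-9a-zA-Z]+", "").Equals(Regex.Replace(a.Title, "[^0-9a-zA-Z]+", ""), StringComparison.InvariantCultureIgnoreCase)
        && Regex.Replace(b.Description, "[^0-9a-zA-Z]+", "").Equals(Regex.Replace(a.Description, "[^0-9a-zA-Z]+", ""), StringComparison.InvariantCultureIgnoreCase)
    );
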
This finds the matches that the others miss!
The string below slips past all six comparisons, but not the seventh:
a.Title.Trim().Normalize() and a.Title.Trim() both return:
“Corrigendum: Identification of a unique TGF-β–dependent molecular and
functional signature in microglia”
The value in the DB is:
“Corrigendum: Identification of a unique TGF-ß–dependent molecular and
functional signature in microglia”
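Closer inspection shows that the German "eszett" character in the DB is different from what comes from the feed: β (Greek small letter beta, U+03B2) vs ß (eszett, U+00DF).
I would have expected at least one of comparisons 1-6 to pick that up...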
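To illustrate why the regex approach catches this pair while a case-insensitive comparison does not, here is a small sketch that reduces each title to an ASCII-only key using the same pattern as comparison 7. The names ComparisonKeyDemo and ToKey, and the use of a HashSet for the lookup, are assumptions made for the example rather than part of my actual code.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ComparisonKeyDemo
{
    // Same pattern as comparison 7: drop everything that is not an ASCII letter or digit.
    private static readonly Regex NonAlphanumeric = new Regex("[^0-9a-zA-Z]+");

    // Hypothetical helper: builds a case-insensitive, ASCII-only key for a string.
    public static string ToKey(string s)
    {
        return NonAlphanumeric.Replace(s, "").ToUpperInvariant();
    }

    public static void Main()
    {
        // \u03B2 is Greek beta, \u00DF is eszett, \u2013 is an en dash.
        string fromFeed = "Corrigendum: Identification of a unique TGF-\u03B2\u2013dependent molecular and functional signature in microglia";
        string fromDb = "Corrigendum: Identification of a unique TGF-\u00DF\u2013dependent molecular and functional signature in microglia";

        // Keys of the articles already stored (a single one here, for illustration).
        var existingKeys = new HashSet<string>(StringComparer.Ordinal) { ToKey(fromDb) };

        Console.WriteLine(fromFeed.Equals(fromDb, StringComparison.OrdinalIgnoreCase)); // False
        Console.WriteLine(existingKeys.Contains(ToKey(fromFeed)));                      // True
    }
}
```

The trade-off is that the key throws away every non-ASCII character, so two titles that differ only in such characters would be treated as duplicates.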
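Interestingly, after some performance comparisons, the Regex option is by no means the slowest of the seven. Normalize appears to be more intensive than the Regex!
Here are the Stopwatch durations for all seven when the thisFeedArticles object contains 12077 items: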
Time elapsed: 00:00:00.0000662
Time elapsed: 00:00:00.0000009
Time elapsed: 00:00:00.0000009
Time elapsed: 00:00:00.0000009
Time elapsed: 00:00:00.0000009
Time elapsed: 00:00:00.0000009
Time elapsed: 00:00:00.0000016
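One caveat about readings this small: Where is lazily evaluated, so if a Stopwatch only wraps the Where(...) call, it measures building the query rather than running the comparisons. The timing code is not shown here, so this is only a possibility. The sketch below demonstrates the effect with made-up data; DeferredTimingDemo, BuildQuery and PredicateCalls are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

public static class DeferredTimingDemo
{
    // Counts how often the predicate has actually been executed.
    public static int PredicateCalls;

    // Builds a lazy query; nothing is compared until the result is enumerated.
    public static IEnumerable<string> BuildQuery(string[] titles, string wanted)
    {
        return titles.Where(t => { PredicateCalls++; return t.Normalize() == wanted; });
    }

    public static void Main()
    {
        // Made-up titles, same item count as in the measurements above.
        string[] titles = Enumerable.Range(0, 12077).Select(i => "Title " + i).ToArray();

        var sw = Stopwatch.StartNew();
        IEnumerable<string> query = BuildQuery(titles, "Title 42");
        sw.Stop();
        Console.WriteLine("Building the query: " + sw.Elapsed + ", predicate calls: " + PredicateCalls);

        sw.Restart();
        int found = query.Count(); // enumeration happens here
        sw.Stop();
        Console.WriteLine("Enumerating: " + sw.Elapsed + ", predicate calls: " + PredicateCalls + ", matches: " + found);
    }
}
```

Timing the Count() or ToList() call on each query would give a fairer comparison between the seven options.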
Solution
Try normalizing your strings. For more information, see http://msdn.microsoft.com/en-us/library/System.String.Normalize.aspx
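As a rough illustration of what normalization can and cannot reconcile, the sketch below compares a few pairs after calling Normalize with an explicit form. The helper name EqualAfterNormalize is made up for this example. Normalization unifies different encodings of the same character, but it does not map one letter or punctuation mark onto a different one.

```csharp
using System;
using System.Text;

public static class NormalizeDemo
{
    // Hypothetical helper: normalizes both strings to the given form, then compares ordinally.
    public static bool EqualAfterNormalize(string x, string y, NormalizationForm form)
    {
        return string.Equals(x.Normalize(form), y.Normalize(form), StringComparison.Ordinal);
    }

    public static void Main()
    {
        // Precomposed "é" versus "e" followed by a combining acute accent.
        Console.WriteLine(EqualAfterNormalize("\u00E9", "e\u0301", NormalizationForm.FormC));  // True
        // Greek beta versus German eszett: different letters, not different encodings.
        Console.WriteLine(EqualAfterNormalize("\u03B2", "\u00DF", NormalizationForm.FormKC)); // False
        // Typographic apostrophe versus ASCII apostrophe.
        Console.WriteLine(EqualAfterNormalize("\u2019", "'", NormalizationForm.FormKC));      // False
    }
}
```

So normalization is worth applying for combining-character differences, while cases such as β vs ß or ’ vs ' need an explicit mapping or a stripped comparison key.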