I've noticed that every time a term is added through Calais, a duplicate taxonomy term is created with a dash (-) and a number appended (the first duplicate gets -0, the next -1, and so on), so tags show up multiple times.
For example, I have 2 pages where Barack Obama is mentioned. In Calais, I added the name in the "person" category. Now I have 2 entries for Barack Obama, one plain and one with -0. I have tried using the "Calais Tag Modifications" but multiple tags still show up, and I don't want to have to modify every instance of a tag, right?
I saw a similar issue titled "Duplicate taxonomy terms" so this may not be limited to Calais, but the other issue was related to Views and not exactly the same issue I am having.
Comments
Comment #1
febbraro commented
This should definitely not be happening.
If you save a node twice, does a whole collection of duplicates exist? Or is this just for a few specific tags?
I really want to make sure this is not an issue moving forward, as I'm in the process of revamping the tables/data structures and want to make sure this happens cleanly. When you have these multiple taxonomy terms, can you tell me a few things?
1) Are there also multiple entries in the calais_term table?
2) In the taxonomy term_data table for these duplicate entries, do they have the same guid value? Are there any guid values for those terms?
The Taxonomy Manager module can help you clean up the mess. It lets you merge tags, etc. from the taxonomy side, but let's make sure we kill this from the Calais side too.
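As a rough way to answer question 2 once the rows are exported, here is a minimal Python sketch that groups terms by guid and reports which guids are shared and which terms have none. The row layout (`tid`, `name`, `guid` keys), the sample values, and the function name `find_duplicate_terms` are assumptions for illustration only, not the module's actual schema or code.

```python
from collections import defaultdict


def find_duplicate_terms(rows):
    """Return (duplicates, missing_guid) for a list of term rows.

    duplicates maps each guid shared by more than one term to the list
    of term ids that carry it; missing_guid lists term ids with no guid.
    Each row is a dict with hypothetical keys "tid", "name" and "guid".
    """
    by_guid = defaultdict(list)
    missing_guid = []
    for row in rows:
        guid = row.get("guid")
        if guid:
            by_guid[guid].append(row["tid"])
        else:
            missing_guid.append(row["tid"])
    duplicates = {g: tids for g, tids in by_guid.items() if len(tids) > 1}
    return duplicates, missing_guid


# Made-up sample rows; the guid is a placeholder, not a real Calais value.
sample_rows = [
    {"tid": 101, "name": "Some Person", "guid": "http://example.invalid/guid-a"},
    {"tid": 102, "name": "Some Person", "guid": "http://example.invalid/guid-a"},
    {"tid": 103, "name": "Another Person", "guid": None},
]

duplicates, missing_guid = find_duplicate_terms(sample_rows)
print(duplicates)
print(missing_guid)
```

Terms that land in `duplicates` would be the candidates to merge with Taxonomy Manager; terms in `missing_guid` need a separate look.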
Comment #2
2createwdrupal commented
I can see that this could get fairly complicated.
The duplicates were around without my knowledge until a few weeks ago when I decided to place the "persons" category in a block on the site. At that time, I noticed that every instance of a person was displayed separately.
Since adjusting the configuration in the RDF module, most of the duplicates displayed are gone. Most.
For the 2 blog posts re Barack Obama, there are 2 separate terms in the Calais module and they appear to have the same guid:
http://d.opencalais.com/pershash-1/cfcf1aa2-de05-3939-a7d5-10c9c7b3e87b
http://d.opencalais.com/pershash-1/cfcf1aa2-de05-3939-a7d5-10c9c7b3e87b
However, some have no guid, and some terms return no pages from the site. For example, the author of a book I reviewed has 3 Calais entries (from one blog post) and returns this on a page:
Benyamin Cohen
There are currently no posts in this category.
But now it gets complicated because for Obama, I did put barack-obama-0=barack-obama in "tag modifications > rename." I did this with other names as well.
Sometimes, when the second word in a sentence is capitalized, Calais will treat the phrase as a name. This occurs several times on my site. I've done the rename with some of them, blacklisted some, and have deleted some. One thing that is not clear is whether or not the rename wants case sensitive entries. One phrase kept coming back even though I placed it in a rename line. I've now deleted the phrase as a category term.
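Whether the rename list is matched case-sensitively is not settled here. Purely as an illustration of why it matters, the following Python sketch applies rules written in the same old=new format after lowercasing both the rule and the term. The function names `parse_renames` and `apply_rename` are hypothetical, and this is not how the Calais module is known to behave.

```python
def parse_renames(text):
    """Parse lines of the form old=new into a dict keyed by lowercased old name."""
    rules = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        old, new = line.split("=", 1)
        rules[old.strip().lower()] = new.strip()
    return rules


def apply_rename(term, rules):
    """Return the replacement for term, ignoring case; unchanged if no rule matches."""
    return rules.get(term.strip().lower(), term)


rules = parse_renames("barack-obama-0=barack-obama\n")
print(apply_rename("Barack-Obama-0", rules))
print(apply_rename("benyamin-cohen", rules))
```

With a case-sensitive lookup instead, "Barack-Obama-0" would not match the rule, which is one possible explanation for a phrase that keeps coming back despite a rename line.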
New in 2.2 is "Calais Document Category" which has created a sub category of "other." The page I just created on my site now has a tag of "other" which I do not understand, but I am leaving it as is for the time being.
This is long and still probably not as complete as it could be. I hope it helps. Let me know. The "persons" Calais tagged display is at the bottom of the page on http://www.fromoutoftheblue.com
Comment #3
febbraro commented
I'm not exactly sure how the duplicates are getting in there. I have seen it occasionally, but not nearly consistently enough to nail it down. As of release 3.0 I have done an extensive refactoring of the data structures and tables to be in a much better state for reusability, etc. I even caught a bug or two so that could help.
You might consider trying that upgrade to see how that works for you. You will have to upgrade the RDF module too, so keep that in mind. Also, as always, try in a test environment and backup your database and site directory just to be safe.
Let me know how it goes,
Frank
Comment #4
alfthecat commented
Well, I've got some extra pointers maybe.
Take a look at this site I'm setting up and experimenting with (still under heavy construction): www.myentropy.net.
If you click on any of the categories Health, Business or Politics in the main navigation bar, you'll find all articles are double tagged.
It seems like something goes wrong with calais document categories and topics. I've been trying for a while to fix this but I'm very much stuck.
I'll probably be removing the buttons soon, as I work through my other to-do's....
Comment #5
billsdesk commented
I was noticing the same problem. After spinning my wheels for a whole day, I realized that Calais uses categories. Consequently, the same term can also be a category that groups nodes using the same term. Only after studying the Category module did I understand what was happening, as there are two separate paths, /term and /category. My problem is that I want to manage where the categories appear and where they are positioned. I really don't want to install the Category module, but am still looking for an alternate solution.
Comment #6
febbraro commented