Invelos Forums->DVD Profiler: Contribution Discussion |
|
Credit Name Parsing |
|
|
|
Author |
Message |
| T!M | Profiling since Dec. 2000 |
Registered: March 13, 2007 | Reputation: | Posts: 8,749 |
| Posted: | | | | Quoting hal9g: Quote: In all fairness, I need to point out one downside of this system. Take for instance "Al Smithee". Today, with as credited, "Al Smithee" has been identified to be an alias for more than one person (which is in fact the case) and only the profiles that are associated with the same person will get linked. In the simple linking system, this cannot be done, and all people associated with the name of "Al Smithee" will get linked together.
Personally, I believe this scenario is so infrequent, that the benefits of a simple linking system far outweigh this one small flaw. And knowing this community, it is entirely possible that someone here can come up with a solution to that problem, too.
Unfortunately, this isn't "so infrequent" at all - there are plenty of examples. Think fathers and sons with the same name, for instance: Robert Downey, Sr. and Robert Downey, Jr., who have both been credited as just "Robert Downey". Same for Lon Chaney, Jr. and Sr. (have both been credited as "Lon Chaney"), or, in crew, Roger Heman, Jr. and Sr. (both have been credited as "Roger Heman"). The list goes on and on - IMHO this is too big a problem to ignore. Bottom line: in the current system, I have Robert Downey, Sr.'s credits nicely separated from his son's credits, and I wouldn't like to trade that for a system where their credits are suddenly linked together. | | | Last edited: by T!M |
| Registered: February 23, 2009 | Reputation: | Posts: 1,580 |
| Posted: | | | | Ken, I'm very excited, thrilled and happy that you will tackle this issue in such a structural and profound way. No band aid on an open wound but instead a solution that can withstand the test of time. I gladly offer my full support, however small it may be. If you can achieve all the points you set out to do, it'll lift DVDP to the next level for sure.
Here's my take, summed up in short bullet points:
- like many others, I believe a unique ID per actor/crew member is the way to go
- each ID can have an unlimited number of 'credited as' entries, which are linked to the profile. For example, UPC 000000000 has actor ID 000000 credited as "J. Doe"
- Locally users can set: a) naming order (first, last or last, first) b) display name per ID (choose one of the online credited as names or enter one locally) c) settings for name displaying (either only the local default name for an ID, or only the credited as name or both e.g. Robert Downey [Robert Downey, Jr])
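The bullet points above describe a small data model, and a minimal sketch of it may help. Everything named here (`Credit`, `LocalSettings`, `display`, the `mode` values) is hypothetical and not part of DVD Profiler; the sketch also assumes that in the "both" format the credited name comes first and the local default name follows in brackets, which the example in the post suggests but does not state.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Credit:
    """One cast/crew line in a profile: who (by ID) and how they were credited."""
    upc: str
    person_id: int
    credited_as: str


@dataclass
class LocalSettings:
    """Per-user preferences that never leave the local database (hypothetical)."""
    display_names: dict[int, str] = field(default_factory=dict)  # local default name per ID
    mode: str = "both"  # "local", "credited", or "both"


def display(credit: Credit, settings: LocalSettings) -> str:
    """Render a credit according to the user's local display settings."""
    # Fall back to the credited name when no local default exists for this ID.
    local = settings.display_names.get(credit.person_id, credit.credited_as)
    if settings.mode == "local":
        return local
    if settings.mode == "credited" or local == credit.credited_as:
        return credit.credited_as
    return f"{credit.credited_as} [{local}]"
```

With a local default of "Robert Downey, Jr" for a given ID, a credit entered as "Robert Downey" would render as `Robert Downey [Robert Downey, Jr]` in "both" mode, while the online data keeps only the ID and the credited name.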
That's the ideal I have in mind. Now some things to consider:
1) There will need to be a thorough brainstorming on how we attribute unique ID's: when I, as a contributor, enter a cast listing, submit it online, then at some point, somehow, I'll need to tell the online database which actor has which ID. Things to cover:
- an existing name is in the DB and it's the same actor > same ID
- an existing name with various ID's > I need to choose the correct ID
- an existing name but new actor > new ID needs to be created
- a new name, existing ID > I need to link that new name to the existing ID
- a new name, no existing ID > completely new entry
So there are various possibilities that can occur and we'll need to consider how contributors will need to submit their cast/crew entries, how they'll need to document and how the ID assigning process will work (fully automated, partially automated or not at all automated with screeners doing the linking?)
2) Existing database (which is huge) will need to be automatically converted to ID-based database. For that, I believe some kind of algorithm can do the trick. With that in mind, we'll need to find a way to create unique ID's, using the existing data as a base. For example, we can attribute a number to each letter and add birthyear. Example: John Doe (1968) > 10 15 8 14 4 15 5 1968 > ID= 101581441551968 There would still need to be some clean up done to merge ID's from name variants.
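As an illustration of the letter-to-number idea in point 2, here is a minimal sketch. The function name `letter_id` is made up for this example, and the rule of skipping everything that is not a plain a-z letter (spaces, punctuation, accented characters) is an assumption, since the post does not say how those should be handled.

```python
def letter_id(name: str, birth_year: int) -> str:
    """Concatenate each letter's alphabet position (a=1 ... z=26), then the birth year."""
    positions = (
        str(ord(ch) - ord("a") + 1)
        for ch in name.lower()
        if "a" <= ch <= "z"  # assumption: ignore anything that is not a plain a-z letter
    )
    return "".join(positions) + str(birth_year)


print(letter_id("John Doe", 1968))  # 101581441551968 (J=10, O=15, H=8, N=14, D=4, O=15, E=5)
```

One property worth knowing before relying on such a scheme: because the positions are concatenated without padding, different names can yield the same digits. For instance `letter_id("Al", 1950)` and `letter_id("Kb", 1950)` both give `1121950`. Zero-padding each position to two digits would avoid that particular ambiguity, but two different people with the same name and birth year would still share an ID, so the clean-up step mentioned in the post would remain necessary.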
All things considered, I think this unique ID identification method is the best and only way to move forward, to create something that has all the required functionalities and finally get rid of birthyear, CLT and other problems. I'm also in favor of keeping first and last names, as otherwise we will lose functionality, as Ken stated.
One final request: locally add a tick box for 'reversed name order'. That way, locally we can tick all the Asian names with that and have locally a listing like Tom Cruise (first, last) Kitano Takeshi (last, first) instead of first, last or last, first for all actors.
I'm sorry I don't have more of a programmer's insight, but that's the gist of what I had been contemplating. | | | Blu-ray collection DVD collection My Games My Trophies |
| | Corne | Registered: Nov. 1, 2000 |
Registered: April 5, 2007 | Posts: 1,059 |
| Posted: | | | | Quoting Taro: Quote:
2) Existing database (which is huge) will need to be automatically converted to ID-based database. For that, I believe some kind of algorithm can do the trick. With that in mind, we'll need to find a way to create unique ID's, using the existing data as a base. For example, we can attribute a number to each letter and add birthyear. Example: John Doe (1968) > 10 15 8 14 4 15 5 1968 > ID= 101581441551968 There would still need to be some clean up done to merge ID's from name variants.
That's my main concern. Like you I'm really excited too but I don't want to lose all the effort other users (myself included) have put into the current system. I'm no programmer so is there a way to convert the 'common names' and their variants to an ID without losing the 'credited as' locally? The BY tag is a start, but that would only cover contributed BYs. What about crew that is entered locally as 'other'? Locally I have 'other' crew that isn't in the database yet, so in that case there should be a possibility to add persons manually? | | | Cor |
| Registered: March 13, 2007 | Posts: 2,759 |
| Posted: | | | | Quoting xradman: Quote: If we are going to support adding accent characters, I think it's essential that searching for e returns all variations of e with accents. While this may be some development work, it would be worthwhile. It would solve many discussions before they even appear. BTW Accent insensitive searching would not only improve name searching but title searching and probably any other kind of searching as well. BTW2 While saying searching, filtering should be included as well of course. |
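Accent-insensitive searching, as asked for above, is commonly done by folding both the query and the candidates to an unaccented form before comparing. The following is only a sketch of that general technique using Python's standard `unicodedata` module; the function names `fold` and `search` are illustrative, and nothing here reflects how DVD Profiler actually searches.

```python
import unicodedata


def fold(text: str) -> str:
    """Return a case-folded copy of text with combining accent marks removed."""
    # NFKD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.casefold()


def search(query: str, names: list[str]) -> list[str]:
    """Return the names that contain the query, ignoring accents and case."""
    folded_query = fold(query)
    return [name for name in names if folded_query in fold(name)]
```

Searching for "penelope" would then match "Penélope Cruz", and the same folding could back filtering and title searches. A known limit of this approach is that letters such as "ø" or "ł" have no Unicode decomposition, so they would need an extra mapping table if they should match "o" or "l".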
| Registered: June 12, 2007 | Reputation: | Posts: 2,665 |
| Posted: | | | | The potential changes sound good but I don't see how it will resolve many of the parsing arguments. Most of those are about where to split the given and family names and if we still have two fields split at that point we will still have arguments about how to parse names. | | | Bad movie? You're soaking in it! |
| Registered: March 13, 2007 | Reputation: | Posts: 17,334 |
| Posted: | | | | I agree tweeter... As I said... I understand Ken's thoughts on this matter. But as others said I for one am willing to give up that little bit lost to have a single name field and not have to worry about the parsing.
With this ID system I am still concerned if we would have to download the entire cast and crew database... as I said before... I wouldn't want to have countless entries for people I don't have a single release for. | | | Pete |
| Registered: March 29, 2007 | Reputation: | Posts: 4,479 |
| Posted: | | | | Quoting tweeter: Quote: The potential changes sound good but I don't see how it will resolve many of the parsing arguments.
Parsing requires some thought about where to put the middle part of a name, but it is rarely a real problem. How many real questions have we had in the past? For people unfamiliar with a language, it might be difficult to tell whether a middle part belongs to the given or family name, but users of the same nationality (or a Google search in the actor's language) can give the answer in most cases. If it really is impossible to find the truth, each user will be able to choose how to display the name locally, as we do now. I think it is a good thing that Ken keeps the possibility to sort by surname, which is the way all name listings sort, and I would be very sad to see this possibility disappear. | | | Images from movies | | | Last edited: by surfeur51 |
| | T!M | Profiling since Dec. 2000 |
Registered: March 13, 2007 | Reputation: | Posts: 8,749 |
| Posted: | | | | Quoting Addicted2DVD: Quote: I for one am willing to give up that little bit lost to have a single name field and not have to worry about the parsing. For the record: I agree. But since Ken doesn't seem to want to go there, I'm just happy that we at least eliminate one of the fields. Additionally: if we're going to use unique identifiers for everyone, then I would expect it to be extremely easy for Ken to do some kind of an automated check for double entries like that. Surely having separate ID's for, say Helena Bonham/Carter and Helena/Bonham Carter should be completely out of the question - and if we can't trust the users not to make that mistake, then a filter should be implemented that enforces that it just can't happen. As part of the whole implementation of unique ID's, this particular little issue doesn't seem like the hardest bit to address. Also: with unique ID's I would expect that changing someone's parsing would no longer happen on a per-profile basis, but once for that particular person's ID, after which the change propagates to all users. Quote: With this ID system I am still concerned if we would have to download the entire cast and crew database... as I said before... I wouldn't want to have countless entries for people I don't have a single release for. I, too, like to have a lean, mean and clean database. I have that now, and ideally, I'd like to keep it that way. Then again: if that's going to be the only drawback, then so be it. |
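The automated check for double entries mentioned above could look roughly like this. It is a sketch under the assumption that names are still stored as a given/family pair; `find_parsing_conflicts` is a made-up name, not an existing feature.

```python
from collections import defaultdict


def find_parsing_conflicts(entries):
    """Group (given, family) pairs by full name and report names split in more than one way."""
    splits_by_full_name = defaultdict(set)
    for given, family in entries:
        # Normalise whitespace and case so only the split point differs.
        full_name = " ".join(f"{given} {family}".split()).casefold()
        splits_by_full_name[full_name].add((given, family))
    return {
        full_name: sorted(splits)
        for full_name, splits in splits_by_full_name.items()
        if len(splits) > 1
    }
```

Fed the pairs ("Helena Bonham", "Carter") and ("Helena", "Bonham Carter"), this reports "helena bonham carter" with both splits, which a filter could reject or a screener could review. It only catches differing split points; it does not tell apart two different people who share a name and a parsing, such as a father and son both credited as "Robert Downey".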
| Registered: March 13, 2007 | Reputation: | Posts: 17,334 |
| Posted: | | | | Yeah... but if we are going to have every actor or crew member ever (rightfully) put into profiler... then I have to think performance has to be an issue too. I for one have a slow machine as it is. | | | Pete |
| Registered: February 23, 2009 | Reputation: | Posts: 1,580 |
| Posted: | | | | I have virtually every DVDP actor/crew in my database (due to the headshots database: I always keep all of them, not just the ones in profiles). It runs smooth on my desktop as well as office notebook. I must say that the only time it slows down a bit, is when filtering on actor name or crew name: if I type it in, it goes a bit slow. If I copy-paste the name at once into the search field, it's instant.
Personally, I'd rather have to download the entire ID list and have a good functioning system (no more CLT, birth years, etc), even if it slows down the process a bit, rather than stick with what we have now. But that's my personal opinion, of course. | | | Blu-ray collection DVD collection My Games My Trophies |
| Registered: March 10, 2007 | Posts: 4,282 |
| Posted: | | | | Quoting T!M: Quote: Bottom line: in the current system, I have Robert Downey, Sr.'s credits nicely separated from his son's credits, and I wouldn't like to trade that for a system where their credits are suddenly linked together. Agreed. As mentioned in my post, any replacement system must maintain the abilities of the current system. Among those is the ability to differentiate between two credit entries with the same name referencing different people. I believe we can achieve that. | | | Invelos Software, Inc. Representative | | | Last edited: by Ken Cole |
| Registered: March 13, 2007 | Reputation: | Posts: 6,635 |
| Posted: | | | | Quoting Ken Cole: Quote: Quoting T!M:
Quote: Bottom line: in the current system, I have Robert Downey, Sr.'s credits nicely separated from his son's credits, and I wouldn't like to trade that for a system where their credits are suddenly linked together.
Agreed. As mentioned in my post, any replacement system must maintain the abilities of the current system. Among those is the ability to differentiate between two credit entries with the same name referencing different people. I believe we can achieve that.
I think we can also. What is needed is a variation of the BY concept, since they are literally impossible to find in many cases. We need to come up with a method to uniquely identify different people who have the same name. We do not really need a unique identifier for every name as long as we have a way to link variations of names. In IMDb they use roman numerals to achieve that. We could easily come up with some other scheme, like a 2-character alpha field and a way to assign them and identify who they refer to. The thing that concerns me most is the contribution process, and being able to easily identify which name/ID is the one to use for the person in the film that I am actually contributing. I don't see how a Unique ID system will do this, and it's not necessary for the vast majority of credits. If I am an average contributor, I should be able to simply copy the film credits exactly as I see them and contribute without having to do anything else. I may not be the least bit worried about linking and I may not want to spend the time researching every name to see if there is a variant in the database or if someone else is using the exact same name. If we implement a system that requires people to do a lot of research in order to contribute cast and crew, we are going to lose a lot of contributors. A lot of the linking of name variants can be extracted directly from the existing online db using the "credited as" and common name fields. Some "same name linking" can be extracted by looking at names with BYs attached. That would give us a very strong base to work from. We need a system that makes it easy for people who want to take the time to ensure proper linking.
So when I hit the contribute button, perhaps the system would prompt the contributor and ask them if they want to review potential linking issues. If the contributor answers "no" then the contribution goes in as is. If they say yes, the names in the contribution are "bounced" against the online db and existing links are presented for the contributor to review and decide if they are correct for this profile. The contributor would also have the ability to create "new" links or add "tags" for same name issues. Any changes would be identified in the voting process and need approval of the voters and screeners. For people with the same name (Lon Chaney - is he dad or son?) we need to assign a "tag" (2 character alpha field referred to above). Associated with that "name and tag", we need to have a way to easily know which one of the possibilities it refers to. Perhaps having a short bio attached to these "name and tags" entries would be a possible way to go. One thing we need to be careful of is not to "over-engineer" a solution. The simpler the better, especially when it comes to the contribution process. | | | Hal |
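The optional review step described in this post can be sketched as follows. This is only one possible reading: the `TaggedName` type, the `ONLINE_DB` contents, the tag values "AA"/"AB" and the function name are all hypothetical, chosen to mirror the "name and tag" plus short bio idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedName:
    """A credited name plus a two-letter tag that tells same-named people apart."""
    name: str
    tag: str
    bio: str  # short note so contributors can tell which person is meant


# Hypothetical extract of the online database: name -> known identities.
ONLINE_DB: dict[str, list[TaggedName]] = {
    "Lon Chaney": [
        TaggedName("Lon Chaney", "AA", "the father (Lon Chaney, Sr.)"),
        TaggedName("Lon Chaney", "AB", "the son (Lon Chaney, Jr.)"),
    ],
    "Tom Cruise": [TaggedName("Tom Cruise", "", "")],
}


def names_needing_review(credits: list[str], online: dict[str, list[TaggedName]]) -> dict[str, list[TaggedName]]:
    """Return only the credited names that match more than one known person."""
    return {name: online[name] for name in credits if len(online.get(name, [])) > 1}
```

A contributor who answers "no" to the review prompt would skip this call entirely and submit the credits as typed. One who answers "yes" would only be shown the ambiguous names, here "Lon Chaney" with its two tagged candidates, while unambiguous or unknown names pass through untouched.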
| Registered: March 13, 2007 | Reputation: | Posts: 17,334 |
| Posted: | | | | Quoting Taro: Quote: I have virtually every DVDP actor/crew in my database (due to the headshots database: I always keep all of them, not just the ones in profiles). It runs smooth on my desktop as well as office notebook. I must say that the only time it slows down a bit, is when filtering on actor name or crew name: if I type it in, it goes a bit slow. If I copy-paste the name at once into the search field, it's instant.
Personally, I'd rather have to download the entire ID list and have a good functioning system (no more CLT, birth years, etc), even if it slows down the process a bit, rather than stick with what we have now. But that's my personal opinion, of course. I currently just have cast and crew that are in profiles I own... and what you are talking about is already very slow for me... even with copy & paste. | | | Pete |
| Registered: March 14, 2007 | Reputation: | Posts: 6,749 |
| Posted: | | | | Quoting hal9g: Quote: In IMDb they use roman numerals to achieve that. We could easily come up with some other scheme, like a 2-character alpha field and a way to assign them and identify who they refer to.
The thing that concerns me most is the contribution process, and being able to easily identify which name/ID is the one to use for the person in the film that I am actually contributing. I don't see how a Unique ID system will do this, and it's not necessary for the vast majority of credits.
Just for clarification: IMDb uses a unique ID system, the roman numerals are "just for show". For example: Fred Astaire is nm0000001. The "Silent Bob" Kevin Smith is nm0003620, the "Ares" Kevin Smith is nm0808963. If I wanted to add the first one to a new movie in IMDb and enter the name "Smith, Kevin" it will show me all the Kevin Smiths in the DB, I can check their profile and then decide if it's any of them or an entirely new one. | | | Karsten DVD Collectors Online
| | | Last edited: by DJ Doena |
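The lookup flow described in this post (type a name, see every person already filed under it, then pick one or create a new entry) can be sketched with a small index. The class `PersonIndex` and the `p0000001` style of newly generated IDs are invented for illustration; only the `nm...` identifiers are taken from the post, and this is not a description of IMDb's internals.

```python
import itertools


class PersonIndex:
    """Maps credited names to the person IDs that have used them."""

    def __init__(self):
        self._ids_by_name = {}
        self._counter = itertools.count(1)  # source of new, made-up IDs

    def candidates(self, name):
        """Return every known ID for this name so the contributor can choose."""
        return list(self._ids_by_name.get(name, []))

    def assign(self, name, chosen_id=None):
        """Link name to chosen_id, or create a fresh ID when none is chosen."""
        if chosen_id is None:
            chosen_id = f"p{next(self._counter):07d}"
        ids = self._ids_by_name.setdefault(name, [])
        if chosen_id not in ids:
            ids.append(chosen_id)
        return chosen_id
```

After registering "Kevin Smith" under both nm0003620 and nm0808963, `candidates("Kevin Smith")` returns both IDs for the contributor to check. Calling `assign` with an existing ID and a new spelling links a name variant, and calling it without an ID creates a new person, which together cover the cases listed earlier in the thread for existing and new names.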
| Registered: March 13, 2007 | Reputation: | Posts: 17,334 |
| Posted: | | | | I definitely agree with what Hal said... especially this part... Quoting hal9g: Quote: If I am an average contributor, I should be able to simply copy the film credits exactly as I see them and contribute without having to do anything else. I may not be the least bit worried about linking and I may not want to spend the time researching every name to see if there is a variant in the database or if someone else is using the exact same name. If we implement a system that requires people to do a lot of research in order to contribute cast and crew, we are going to lose a lot of contributors.
I completely agree that if you take away the simplicity of typing in what you see and being able to contribute it from there you will lose contributors. I could see that if it gets too complicated it could lose me from contributing cast. | | | Pete |
| Registered: March 13, 2007 | Reputation: | Posts: 6,635 |
| Posted: | | | | Quoting DJ Doena: Quote: Quoting hal9g:
Quote: In IMDb they use roman numerals to achieve that. We could easily come up with some other scheme, like a 2-character alpha field and a way to assign them and identify who they refer to.
The thing that concerns me most is the contribution process, and being able to easily identify which name/ID is the one to use for the person in the film that I am actually contributing. I don't see how a Unique ID system will do this, and it's not necessary for the vast majority of credits.
Just for clarification: IMDb uses a unique ID system, the roman numerals are "just for show". For example: Fred Astaire is nm0000001. The "Silent Bob" Kevin Smith is nm0003620, the "Ares" Kevin Smith is nm0808963.
If I wanted to add the first one to a new movie in IMDb and enter the name "Smith, Kevin" it will show me all the Kevin Smiths in the DB, I can check their profile and then decide if it's any of them or an entirely new one.
Yes, I understand, however, the Unique ID system is not required for this process to be workable. | | | Hal |