Alpha version of new Schedules Direct lineup information

Discussion about Schedules Direct grabber code and data formats.
hall5714
Posts: 20
Joined: Thu Feb 07, 2013 4:34 pm

Re: Alpha version of new Schedules Direct lineup information

Post by hall5714 »

rkulagow wrote: No. The schedule file itself is fairly lightweight, since it doesn't contain any real program details. The program files are what would make up the bulk of the transfer, and if there are a lot of repeats on a schedule then you only download the prog_id a single time.
I didn't even think that through.
rkulagow wrote: That was investigated, but it took way too long to assemble everything. Since the client is supposed to download the schedule first, determine which programs it needs and then request only those programs, we can't really bundle into one file.
I was completely unclear there... my thinking was that each one would individually be a gzip stream. Here's the process as it is now:
Download lineup -> Unzip -> Read in file(s) 1 by 1 -> Process
Download schedule -> Unzip -> Read in file(s) 1 by 1 -> Process -> Determine programs needing update

What my thought was:
Download lineup (single gzip stream) -> Process
Download schedule (single gzip stream) -> Process -> Compare md5 key to one in database -> If changed:
Download programs (single gzip stream with X programs) -> Process -> Repeat until processed

The unzipping and file read-ins seem to make the process pretty I/O heavy. Obviously the client writers would need to take care downloading program data, since that is where the bulk of this comes from. But even if it didn't work for programs, for whatever reason, if the lineup and schedule are lightweight (which they should be) there's really no reason to add unzipping as a part of the process.
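
A minimal Python sketch of the single-gzip-stream idea, assuming the server returned one gzip-compressed JSON document per request (it does not do this today). The helper name `read_gzip_json` and the sample payload are made up for illustration. The body is decompressed and parsed in memory, with no temporary files:

```python
import gzip
import io
import json


def read_gzip_json(stream):
    """Decompress a gzip stream in memory and parse the JSON inside it."""
    with gzip.GzipFile(fileobj=stream, mode="rb") as gz:
        return json.loads(gz.read().decode("utf-8"))


# Stand-in for an HTTP response body (hypothetical payload). A real client
# would pass the file-like object returned by its HTTP library instead.
sample = {"stationID": "example-station", "programs": ["example-prog-id"]}
body = io.BytesIO(gzip.compress(json.dumps(sample).encode("utf-8")))
print(read_gzip_json(body))
```
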
rkulagow wrote: Maybe, if you make your case strong enough.
The only reason I brought that up is because there is a ton of crossover data between SD's programs and TVDB's series data. Giving the TVDB data in the schedule would let the client choose which source to use for episode/program meta.
rkulagow wrote: You won't get stale data. Each time you get the sched file for a stationID, it will contain the next 12-14 days' worth of programs that are on the stationID. If any program gets updated (say a new guest star is added, or metadata is updated, or whatever) on "today + 4 days", then while the prog_id stays the same, the MD5 will change. The client should look for that. In the same way, if you've downloaded a program and it's in your database, and the program comes up again on the schedule, but the MD5 is the same, then you don't need to download it again.

The schedules are pre-generated, so no, can't make them a shorter duration; each stationID will have the next 12-14 days.
The design is starting to make more sense now, and given the schedule data is so small, it seems moot to download this daily to look for schedule changes.
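
The MD5 check described in the quote might look roughly like this on the client side. This is only a sketch: the `prog_id` and `md5` key names, the shape of the schedule entries and the sample IDs are assumptions, not the actual file format.

```python
def programs_to_download(schedule, cached_md5):
    """Return prog_ids that are new or whose MD5 differs from the cached one."""
    needed = []
    seen = set()
    for entry in schedule:
        prog_id = entry["prog_id"]
        if prog_id in seen:
            continue  # a repeat on the schedule is only requested once
        seen.add(prog_id)
        if cached_md5.get(prog_id) != entry["md5"]:
            needed.append(prog_id)
    return needed


# Hypothetical schedule entries and local cache, for illustration only.
schedule = [
    {"prog_id": "EP0001", "md5": "aaa"},      # unchanged since last download
    {"prog_id": "EP0002", "md5": "bbb-new"},  # metadata was updated
    {"prog_id": "EP0003", "md5": "ccc"},      # never downloaded
    {"prog_id": "EP0001", "md5": "aaa"},      # repeat airing
]
cached_md5 = {"EP0001": "aaa", "EP0002": "bbb"}
print(programs_to_download(schedule, cached_md5))  # ['EP0002', 'EP0003']
```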

I guess the only thing that really leaves is a question of whether or not we could get lineup data in 1 file and schedule data in 1 file. Whether or not they are gzipped or zipped is moot I suppose (since most languages can handle those in-place, but for a single file gzip is really the way to go), but a single file vs multiple files would be a great deal easier to process in place... and as such, a great deal easier to make a Python module, without obligating callback handling on many files.
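
For comparison, here is a rough sketch of how the multi-file case could still be walked in place from Python, using a generator instead of callbacks. The `.json.txt` suffix is taken from the file names the server returns; the helper name `iter_zip_json` and the sample contents are made up.

```python
import io
import json
import zipfile


def iter_zip_json(zip_bytes):
    """Yield (member name, parsed JSON) for each .json.txt member of a zip."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith(".json.txt"):
                yield name, json.loads(archive.read(name).decode("utf-8"))


# Build a small zip in memory as a stand-in for a downloaded response.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("WA11430.json.txt", json.dumps({"name": "example"}))
    archive.writestr("serverID.txt", "not json")

for name, data in iter_zip_json(buffer.getvalue()):
    print(name, data)
```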

Thanks for the response and clarification!

rmeden
SD Board Member
Posts: 1527
Joined: Tue Aug 14, 2007 2:31 pm
Location: Cedar Hill, TX

Re: Alpha version of new Schedules Direct lineup information

Post by rmeden »

hall5714 wrote: I guess the only thing that really leaves is a question of whether or not we could get lineup data in 1 file and schedule data in 1 file. Whether or not they are gzipped or zipped is moot I suppose (since most languages can handle those in-place, but for a single file gzip is really the way to go), but a single file vs multiple files would be a great deal easier to process in place... and as such, a great deal easier to make a Python module, without obligating callback handling on many files.
I'll answer for RobertK, of course correct me if I'm wrong...

The reason you can't download a single zip file with schedule data for all your station-ids is because the zip file is prepared in advance (on a different server) and is shared with anyone else that gets that station. For example, HBO-West is in just about all the lineups in San Francisco, San Diego, Seattle, etc. Preparing a custom zip for each user would be computationally more expensive.

The lineup-zip is basically a suggestion of station-ids. An application doesn't *need* to request all of them (but can).

RobertE

hall5714
Posts: 20
Joined: Thu Feb 07, 2013 4:34 pm

Re: Alpha version of new Schedules Direct lineup information

Post by hall5714 »

rmeden wrote: I'll answer for RobertK, of course correct me if I'm wrong...

The reason you can't download a single zip file with schedule data for all your station-ids is because the zip file is prepared in advance (on a different server) and is shared with anyone else that gets that station. For example, HBO-West is in just about all the lineups in San Francisco, San Diego, Seattle, etc. Preparing a custom zip for each user would be computationally more expensive.

The lineup-zip is basically a suggestion of station-ids. An application doesn't *need* to request all of them (but can).

RobertE
Hmmm... you mean the text files for each station are prepared in advance and the zip file is prepared at call time? I can't imagine the zips being prepared in advance, since the schedule call to the server selects specific schedule IDs. If that's true, then the API is simply grabbing text files and throwing them in a zip file, whereas a single file would involve grabbing all the text files, merging them into a single text file, and then gzipping and sending. The added step shouldn't eat CPU cycles, but it will shift the I/O overhead from client to server (instead of the client reading in the files, the server reads and merges them).

Now how much I/O overhead that is, and whether or not that would put too much strain on the server requires someone smarter than I am to figure out :).

rmeden
SD Board Member
Posts: 1527
Joined: Tue Aug 14, 2007 2:31 pm
Location: Cedar Hill, TX

Re: Alpha version of new Schedules Direct lineup information

Post by rmeden »

hall5714 wrote: Hmmm... you mean the text files for each station are prepared in advance and the zip file is prepared at call time? I can't imagine the zips being prepared in advance, since the schedule call to the server selects specific schedule IDs. If that's true, then the API is simply grabbing text files and throwing them in a zip file, whereas a single file would involve grabbing all the text files, merging them into a single text file, and then gzipping and sending. The added step shouldn't eat CPU cycles, but it will shift the I/O overhead from client to server (instead of the client reading in the files, the server reads and merges them).
Uh oh, maybe I'm wrong... Based on your description, there is one file per station-id with all the days' schedules in it. That's what I meant by "prepared in advance", and it's why you can't request specific days (+1, +2, +14).

The details for a program-id have to be returned "on the fly", but the vast majority of program-id details haven't changed, so they aren't requested again.

Robert

rkulagow
SD Staff
Posts: 917
Joined: Tue Aug 14, 2007 3:15 pm

Re: Alpha version of new Schedules Direct lineup information

Post by rkulagow »

I think the history of the wiki shows what the API and the resulting data used to look like. What happened was that trying to generate everything dynamically was too I/O intensive, and the JSON encode routines in PHP weren't up to the task. As more and more programs were added into a single "program" entity (so that all the programs you asked for were a single file with multiple program entities in an array), PHP got slower and slower no matter how much CPU and memory was thrown at the problem. It wasn't scalable.

This scheme also shifts control to the client. If you can't do 10000 programs, then don't request that many. But if you request 10000, you're going to get a .zip file with 10000 files in it, and each file is going to be a programID that you requested. If you only want to process 1 per second, throttling is on the client side. If you get a headend with 800 channels in it, then only request the schedules for the channels that the user wants; maybe they're only interested in 50 of those 800, so only request those.

And a program ID may be on multiple channels: it's on an HD station ID on channel 256 on a cable set-top box, but it's also on channel 13.1 over-the-air in HD, and it's standard def on channel 27 on the cable set-top box. You only need to request the program ID once, because it's the same program, just different "flavors" on the different stationIDs.
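
As a rough illustration of what that looks like from the client side, the sketch below collects program IDs only for the stations the user wants, drops duplicates that appear on several stationIDs, and splits the result into batches the client is comfortable with. The station names, program IDs and batch size are invented for the example.

```python
def unique_program_ids(schedules, wanted_stations):
    """Collect program IDs for the wanted stations, each ID only once."""
    seen = set()
    ordered = []
    for station_id in wanted_stations:
        for prog_id in schedules.get(station_id, []):
            if prog_id not in seen:
                seen.add(prog_id)
                ordered.append(prog_id)
    return ordered


def batches(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical data: the same program airs on three "flavors" of a station.
schedules = {
    "cable-256-hd": ["EP1", "EP2"],
    "ota-13.1-hd": ["EP1", "EP3"],
    "cable-27-sd": ["EP1"],
    "not-wanted": ["EP9"],
}
wanted = ["cable-256-hd", "ota-13.1-hd", "cable-27-sd"]

program_ids = unique_program_ids(schedules, wanted)
print(program_ids)                    # ['EP1', 'EP2', 'EP3']
print(list(batches(program_ids, 2)))  # [['EP1', 'EP2'], ['EP3']]
```

Each batch would then go out as its own request, at whatever pace the client chooses.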

By having the JSON for each program generated once and stored in the database, that meant that when 10000 people all request the program for "CBS, on 2013-02-11 @ 8PM Eastern Standard Time", because they're all requesting the same program ID, they all get the same data. CBS Boston, CBS NY, CBS Chicago, CBS Omaha, CBS San Francisco etc, are all showing "EP000123455667788" that day. Compute the JSON for "EP000123455667788" once, serve multiple times. And because that JSON is memcached, there's only a database dip for the first request.
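
The "compute once, serve multiple times" idea can be sketched in a few lines of Python. This is not the server code: an in-process `functools.lru_cache` stands in for memcached, a dict stands in for the database, and the program details are invented.

```python
import json
from functools import lru_cache

# Stand-in for the database; the program details are made up.
DATABASE = {"EP000123455667788": {"title": "Example episode"}}
database_reads = 0


@lru_cache(maxsize=None)
def program_json(prog_id):
    """Render the JSON for a program; only a cache miss touches the database."""
    global database_reads
    database_reads += 1
    return json.dumps(DATABASE[prog_id], sort_keys=True)


# 10000 requests for the same program ID are served from the cache.
for _ in range(10000):
    program_json("EP000123455667788")

print(database_reads)  # 1
```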

I don't foresee changing things at this point, at least as far as that goes, because it was already tried and found to not work. The season/episode information in the schedule rather than the program is much more likely to get a second look, because it's feasible.

This isn't a ding, just "Tried it, didn't work."

RobNewton
Posts: 4
Joined: Sun Feb 10, 2013 9:55 pm

Re: Alpha version of new Schedules Direct lineup information

Post by RobNewton »

rkulagow wrote: Programs now include metadata from thetvdb; we have a 46% "hit" rate at this time.
...
The algorithm currently checks the name of the program and the episode title (subtitle) to determine the season / episode information from thetvdb. If the seriesid is incorrect though, please send email to grabber@schedulesdirect.org with the prog_id and the proper seriesid.
If you are still only looking up the TVDB information by show name, might I suggest an alternative method that may increase your hit rate. Forgive me if you are already doing this, I am new to all of this information.

Usage:

Code:

from xml.etree import ElementTree

from TVDB import TVDB

tvdbAPI = TVDB(api_key='YOUR_API_KEY')  # placeholder: use your own TheTVDB API key
airdate = 'YYYY-MM-DD'                  # placeholder: original air date of the episode

# Zap2it ID example: EP01243717 - Into the Universe With Stephen Hawking
tvdbid = tvdbAPI.getIdByZap2it('EP01243717')

# Sometimes GetSeriesByRemoteID does not find by Zap2it, so we use the series name as backup
if tvdbid == 0:
    tvdbid = tvdbAPI.getIdByShowName('Into the Universe With Stephen Hawking')

# Now that we know the show, use the air date to get the episode
if tvdbid > 0:
    # Only way to get a unique lookup is to use TVDB ID and the airdate of the episode
    response = tvdbAPI.getEpisodeByAirdate(tvdbid, airdate)
    # An empty response means the request failed, so there is nothing to parse
    episode = ElementTree.fromstring(response).find("Episode") if response else None
    if episode is not None:
        episodeId = episode.findtext("id")
        seasonNumber = episode.findtext("SeasonNumber")
        episodeNumber = episode.findtext("EpisodeNumber")


TVDB.py

Code:

import re
import urllib.parse
import urllib.request


def debug(msg):
    # Minimal stand-in for the debug helper this class expects
    print(msg)


class TVDB(object):
    # Matches the first numeric <id> element in a TheTVDB XML response
    _ID_RE = re.compile(r'<id>(\d+)</id>')

    def __init__(self, api_key=''):
        self.apikey = api_key
        self.baseurl = 'http://thetvdb.com'

    def __repr__(self):
        return 'TVDB(baseurl=%s, apikey=%s)' % (self.baseurl, self.apikey)

    def _buildUrl(self, cmd, parms=None):
        url = '%s/api/%s?%s' % (self.baseurl, cmd, urllib.parse.urlencode(parms or {}))
        debug(url)
        return url

    def _fetch(self, cmd, parms):
        # Returns the response body as text, or '' if the request fails
        try:
            return urllib.request.urlopen(self._buildUrl(cmd, parms)).read().decode('utf-8')
        except Exception:
            return ''

    def _firstId(self, xml):
        # Returns the first <id> as an int, or 0 if none is found
        match = self._ID_RE.search(xml)
        return int(match.group(1)) if match else 0

    def getIdByZap2it(self, zap2it_id):
        return self._firstId(self._fetch('GetSeriesByRemoteID.php', {'zap2it': zap2it_id}))

    def getEpisodeByAirdate(self, tvdbid, airdate):
        return self._fetch('GetEpisodeByAirDate.php',
                           {'apikey': self.apikey, 'seriesid': tvdbid, 'airdate': airdate})

    def getIdByShowName(self, showName):
        # NOTE: This assumes an exact match. It is possible to get multiple results though. This could be smarter
        return self._firstId(self._fetch('GetSeries.php', {'seriesname': showName}))
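
One way the show-name lookup could be made a little smarter is to parse the XML and pick the series whose name matches exactly, rather than taking the first `<id>` in the response. This is only a sketch: it assumes the response is a `<Data>` element containing `<Series>` children with `<id>` and `<SeriesName>` elements, and the sample XML, names and IDs below are invented.

```python
import xml.etree.ElementTree as ElementTree


def pick_series_id(xml_text, show_name):
    """Return the id of the series whose name matches exactly, else 0."""
    root = ElementTree.fromstring(xml_text)
    wanted = show_name.strip().lower()
    for series in root.findall("Series"):
        name = (series.findtext("SeriesName") or "").strip().lower()
        series_id = series.findtext("id") or ""
        if name == wanted and series_id.isdigit():
            return int(series_id)
    return 0


# Invented response with two candidates; the first <id> is the wrong show.
SAMPLE = """<Data>
  <Series><id>222</id><SeriesName>Example Show (2013)</SeriesName></Series>
  <Series><id>111</id><SeriesName>Example Show</SeriesName></Series>
</Data>"""

print(pick_series_id(SAMPLE, "Example Show"))  # 111
print(pick_series_id(SAMPLE, "Unknown Show"))  # 0
```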

rkulagow
SD Staff
Posts: 917
Joined: Tue Aug 14, 2007 3:15 pm

Re: Alpha version of new Schedules Direct lineup information

Post by rkulagow »

The code that does the matching has a number of algorithms, one of which is to check the original air date. This catches shows where the "Region 1" name doesn't match everyone else, like for the series "'Allo 'Allo".

RobNewton
Posts: 4
Joined: Sun Feb 10, 2013 9:55 pm

Re: Alpha version of new Schedules Direct lineup information

Post by RobNewton »

I guess your hit rate will best mine then. Kudos on the JSON API by the way. It's coming along nicely and should be very efficient.

Slugger
Posts: 77
Joined: Sun Sep 18, 2011 1:22 pm

Re: Alpha version of new Schedules Direct lineup information

Post by Slugger »

Shameless plug :oops: , but if you want something that grabs your EPG data and produces a single zip file then might I suggest my Java grabber/API? My grabber pulls all the separate zip files from the JSON service, processes the streams in-memory (i.e. never needs to write the zips to disk, uncompress, etc.) and produces a single zip file of your EPG data. The zip file my grabber generates can then be fed into the Java API for easy programmatic access to your data within Java code. Alternatively, the raw zip file could be used directly by (non-Java) apps as well, though you would need to be careful with that approach as the format of the zip file I generate could change in the future.
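
For anyone curious what the in-memory approach looks like outside Java, here is a small Python sketch of the same general idea: several zip downloads are merged into one zip without touching the disk. It is not the grabber's code or its output format; the function names and file names are made up.

```python
import io
import zipfile


def merge_zips(zip_blobs):
    """Merge several in-memory zips into one, keeping the first copy of each name."""
    out = io.BytesIO()
    seen = set()
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as merged:
        for blob in zip_blobs:
            with zipfile.ZipFile(io.BytesIO(blob)) as source:
                for name in source.namelist():
                    if name in seen:
                        continue  # the same file may arrive in several downloads
                    seen.add(name)
                    merged.writestr(name, source.read(name))
    return out.getvalue()


def make_zip(files):
    """Helper for the demo: build a zip in memory from a {name: text} dict."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, text in files.items():
            archive.writestr(name, text)
    return buffer.getvalue()


first = make_zip({"a.json.txt": "{}", "serverID.txt": "demo"})
second = make_zip({"b.json.txt": "{}", "serverID.txt": "demo"})
merged = merge_zips([first, second])
print(zipfile.ZipFile(io.BytesIO(merged)).namelist())
# ['a.json.txt', 'serverID.txt', 'b.json.txt']
```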

hall5714
Posts: 20
Joined: Thu Feb 07, 2013 4:34 pm

Re: Alpha version of new Schedules Direct lineup information

Post by hall5714 »

Just realized I never thanked Robert(s) for helping my brain process how the SD API is supposed to work. Thanks to both of you!

As an aside, I noticed when running the following JSON:

{"action":"get","api":20130107,"request":["WA11430","DISH881"],"randhash":"{randhash}","object":"lineups"}

I'm getting a zip file with the WA11430.json.txt and the serverID.txt, but DISH881 is nowhere to be found. I was doing this without any subscribed headends in my account, so I'm assuming I should be getting either neither (since I'm not subscribed) or both (if it doesn't matter). Thought it was curious either way.
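
A quick way to spot this kind of gap from a client is to compare what was requested against the member names in the returned zip. The sketch below builds the same request body as above and assumes, based only on the file names I got back, that each lineup arrives as `<lineup>.json.txt`. The function names are made up.

```python
import json


def build_lineup_request(lineups, randhash):
    """Build the JSON request body for a lineups call."""
    return json.dumps({
        "action": "get",
        "api": 20130107,
        "request": list(lineups),
        "randhash": randhash,
        "object": "lineups",
    })


def missing_lineups(requested, zip_member_names):
    """Return requested lineups with no matching <lineup>.json.txt member."""
    suffix = ".json.txt"
    present = {name[:-len(suffix)] for name in zip_member_names
               if name.endswith(suffix)}
    return [lineup for lineup in requested if lineup not in present]


requested = ["WA11430", "DISH881"]
print(build_lineup_request(requested, "{randhash}"))
print(missing_lineups(requested, ["WA11430.json.txt", "serverID.txt"]))  # ['DISH881']
```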
