SFM is not a set-it-and-forget-it kind of application. Things change constantly on social media platforms like Twitter, so we have to check constantly for these changes and act appropriately. For example, if you haven’t yet read our summary of the lifecycle of a TwitterUser, read it now and come back, you’ll see what we mean.
We have added several commands and tweaks to the data model to account for these changes as we’ve been running SFM for the past few years. Please read through the descriptions below and consider how they should apply in your scenario, as well as what might be missing that you will want to supply yourself for your own environment, or perhaps to add to SFM itself and submit back to the project. There are likely to be more of these to come as more people use the app, and we welcome your ideas.
Once you have successfully installed SFM the first task is to add app users; if at least one other person will be using the app, go to the /admin/ url and sign in with the administrative system account you created during installation. Under “Auth -> Users” you can add one or more additional SFM users (ask them to set their password). Once you’ve saved a new user, you can edit them from the list of Users and give them “superuser” status if you want them to be able to add users like you can with your own admin account. If one of these people ever leaves your organization or stops using the app, you can set their account to inactive by unchecking the “Active” box on their user edit page, too, instead of deleting their account entirely. Note that this functionality is all provided out-of-the-box by Django itself, with no custom SFM code.
An alternate way to add a user is to let them sign in at the / url by authenticating through Twitter. The advantage of this approach is that SFM will save a copy of authorized OAuth tokens for their account, which you can use later to manage a stream-based filter for that user. Once someone logs in successfully this way, you can edit their account under /admin/ just like any other SFM user, but note that you can end up with two different SFM accounts for the same person by accident if you use both methods.
Now that you and your colleagues have accounts for your SFM, you can add TwitterUsers. This is the simplest way to capture data using SFM. From /admin/ under “UI -> Twitter Users” add Twitter accounts to capture by their names, one by one, by entering their account name in the “Name” box. Be careful to spell it correctly! SFM will look up that account by name and verify that it’s a public account, and will then store the Twitter UID. Try adding a few accounts.
Now that there are a few TwitterUsers in your database, to capture their recent tweets, use the user_timeline management command. Run the command once, and you’ll see updates of the data-fetching process on the commandline. As it proceeds, you can go to / in your app and you’ll see the data start to appear in the UI. You can also go to /admin/ and see these same tweets in the admin UI under “UI -> Twitter user items”. Finally, there will be a separate record of the user_timeline “job” you ran under “UI -> Twitter user timeline jobs”.
As you capture tweets this way, you might want to create a record of the urls linked to by shortened urls in tweet text. To do this, use the fetch_urls command.
Note that as you add more and more TwitterUsers and their tweet data, both of these commands can take a long time – even many hours – to run. It takes a while because SFM abides by the rate limits defined by the Twitter API, leaving a little multi-second buffer between every call to the API so the app never goes over the limits. The more users you’re collecting, the longer it will take.
Both the user_timeline and fetch_urls commands are well suited to being automated with something like a cronjob. There are subtle issues to consider here, though, namely that whenever you fetch a user’s tweets, the metadata associated with each tweet will be accurate as of the moment you fetch it, rather than from the moment the tweet was originally published. This means that the first time you grab, say, 500 old tweets from a TwitterUser you just added, every one of those 500 tweets will contain exactly the same follower/following counts on the TwitterUser. Also, if that 500th tweet you capture is only five minutes old, then the retweet count on your capture of that tweet only accounts for the five minutes of that tweet’s existence. Older tweets may have correspondingly higher retweet numbers.
It’s important to understand these issues because how regularly you capture tweets using user_timeline will determine how accurate these numbers are. If it is important to you to see how following/follower counts change tweet by tweet, you’ll want to run user_timeline often. If it’s important to get an accurate retweet count on each tweet, you might want to run it less often. Either way, there will be a bit of a sliding time gap over the range of tweets you capture at any given time because of these implementation details of the Twitter API, and the relative accuracy for a given purpose of the metadata you capture when you’ve captured it will vary accordingly. It also means that when you first capture a TwitterUser’s older tweets you will not be able to see how old tweets affected their follower/following counts. These details might be important to users of the data you collect, so please familiarize yourself with them.
At GW Libraries, as of July 2014, we track about 1,800 TwitterUsers, running the user_timeline command on a cronjob every six hours. We run the fetch_urls command on a cronjob once a day, limiting (with the optional start and end date parameters) to the previous day’s tweets. Each of these jobs takes several hours to complete. Our PostgreSQL database for SFM uses over 6Gb in production, and a complete export of the database to a single file compresses to about 1.5Gb.
Due to the many changes that can occur on a single TwitterUser account (as described in lifecycle of a TwitterUser), you should run update_usernames regularly as well. Because SFM uses the Twitter uid of a TwitterUser rather than the name to capture new data, user_timeline will continue to work if SFM doesn’t have an updated username even after the Twitter account name changes, but it’s best all around if you have a record of the changes over time, and if you’re never too far out of date. At GW Libraries we’ve found that running it once a week during the weekend suffices.
If the user_timeline or update_usernames scripts report errors, such as an account no longer being available, or no longer being public, you can deactivate a TwitterUser the /admin/ UI under “UI -> TwitterUsers”, just search for that account by its name or uid, click on its SFM id when you find it, then uncheck the “Is active” box on the TwitterUser edit page. When a TwitterUser is inactive, user_timeline will no longer check for new tweets, saving time and rate limit capacity. You can always re-activate a TwitterUser later if its account changes again.
If you are using one or more Supervisord-managed streams to capture filtered queries live off the Twitter hose or the sample stream, you will want to establish an appropriate set of scripts to handle the resulting files. SFM has no opinion about how you manage digital content, aside from a bias toward gzipping text files at regular intervals. :) You might want to set up a cronjob pipeline to package up files using BagIt, or move them to another server, or whatever works for you, but keep in mind that these files can grow to fill up gigabytes and terabytes of storage quickly.
SFM does provide the organizedata management command to walk through a set of gzipped stream files and sort them into a year/month/date/hour-based folder structure. This is optional, but we find it convenient to spread files out on a filesystem, and for the scripts we’re working with to post-process files we generate at appropriate time intervals.
These are outside of the scope of SFM proper, but worth keeping in mind.
It is best to establish a regular snapshot backup of the PostgreSQL SFM database, and to rotate those files to a secondary storage environment. This can help both with testing new versions of the software and should you ever otherwise need to restore your database from scratch.
The same logic holds for taking a snapshot backup of your configuration files, such as your local_settings.py and apache config file. These should be relatively easily reproduceable - you can get your OAuth keys back from Twitter, for example - but it can be a pain to have to do so.
At GW Libraries we have a twice-daily cronjob that performs these operations.