Twitter Scraping

With my foray into Twitter, I’ve learned a lot about calls to its API. More interestingly, I’ve found that a single call can retrieve up to the 200 most recent tweets from a given user’s timeline! The code is actually pretty simple:

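import tweepy

# consumerKey, consumerKeySecret, accessToken and accessTokenSecret are the
# app credentials from Twitter, assumed to be defined elsewhere.

def fetch_tweets(user):
    """Return up to the 200 most recent tweets from the user's timeline."""
    print('Fetching tweets from @%s...' % user)

    # Authenticate with the app and account credentials
    auth = tweepy.OAuthHandler(consumerKey, consumerKeySecret)
    auth.set_access_token(accessToken, accessTokenSecret)

    api = tweepy.API(auth)

    # count=200 is the most the API returns in a single call
    tweets = api.user_timeline(screen_name=user, count=200)

    return tweets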

Tweets are stored in JSON files that contain information about their ID, their date of creation, and, most importantly, their content.
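
Here is a minimal sketch of that storage step. It assumes each tweet object exposes `id`, `created_at` (a datetime) and `text` attributes; the helper names `tweets_to_records` and `save_tweets` are made up for illustration, and the sample tweets are stand-ins rather than real API output.

```python
import json
from datetime import datetime, timezone
from types import SimpleNamespace


def tweets_to_records(tweets):
    """Reduce tweet objects to ID, creation date, and content."""
    return [
        {
            "id": tweet.id,
            "created_at": tweet.created_at.isoformat(),
            "text": tweet.text,
        }
        for tweet in tweets
    ]


def save_tweets(tweets, path):
    """Write the reduced records to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(tweets_to_records(tweets), f, ensure_ascii=False, indent=2)


# Stand-in tweets (assumption: real tweet objects carry the same attributes)
sample = [
    SimpleNamespace(id=2, created_at=datetime(2017, 1, 2, tzinfo=timezone.utc), text="second"),
    SimpleNamespace(id=1, created_at=datetime(2017, 1, 1, tzinfo=timezone.utc), text="first"),
]

records = tweets_to_records(sample)
print(json.dumps(records, indent=2))
```

Calling `save_tweets(fetch_tweets(user), 'tweets.json')` would then write one file per user, with the file name being whatever you choose.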

Now I can simply dump the tweets from a user (say, @kanyewest) and train any of my language models on them. My Twitter bot can then essentially emulate any Twitter account.
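
To make "train a language model on the tweets" concrete, here is a toy sketch using a bigram (Markov chain) model. This is only an illustrative stand-in, not necessarily the model my bot uses; the function names, the `<s>`/`</s>` markers and the sample texts are all invented for the example.

```python
import random
from collections import defaultdict


def train_bigram_model(texts):
    """Map each word to the list of words that followed it in the corpus."""
    model = defaultdict(list)
    for text in texts:
        # <s> and </s> mark the start and end of a tweet
        words = ["<s>"] + text.split() + ["</s>"]
        for current, following in zip(words, words[1:]):
            model[current].append(following)
    return model


def generate(model, rng, max_words=30):
    """Walk the chain from the start marker until the end marker or the limit."""
    word, output = "<s>", []
    while len(output) < max_words:
        word = rng.choice(model[word])
        if word == "</s>":
            break
        output.append(word)
    return " ".join(output)


# Made-up tweet texts standing in for a scraped timeline
texts = ["i love my fans", "i love new music", "new music out now"]
model = train_bigram_model(texts)
print(generate(model, random.Random(0)))
```

With only a handful of tweets the output mostly parrots the input, which is exactly why having more tweets per account matters.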

So much power…

Actually, not nearly enough power. The most recent 200 tweets only contain so much information, and there must be a way to retrieve more. Digging further into the tweepy documentation, I found that

tweets = api.user_timeline(screen_name=user, count=200, max_id=max_id)

was a valid call, where max_id is an upper bound on tweet IDs: the call only returns tweets whose ID is less than or equal to max_id. Then it was simply a matter of setting max_id to one less than the ID of the oldest tweet retrieved so far and continuing to fetch. Note that 200 is the hard maximum number of tweets the API returns per call.

import tweepy

def fetch_all_tweets(user):
    """Page backwards through the user's timeline until no tweets are left."""
    print('Fetching tweets from @%s...' % user)

    auth = tweepy.OAuthHandler(consumerKey, consumerKeySecret)
    auth.set_access_token(accessToken, accessTokenSecret)

    api = tweepy.API(auth)

    all_tweets = []

    # First call: the most recent tweets (at most 200)
    current_tweets = api.user_timeline(screen_name=user, count=200)

    # Checking before indexing avoids an IndexError on an empty timeline
    while len(current_tweets) > 0:
        all_tweets.extend(current_tweets)

        # max_id is inclusive, so subtract 1 to skip the tweet we already have
        oldest = all_tweets[-1].id - 1

        current_tweets = api.user_timeline(screen_name=user, count=200, max_id=oldest)

    return all_tweets

However, I ran into a stumbling block here. Retrieving all tweets from @taylorswift13 returned only 3,200 tweets, but Taylor Swift definitely has more than 3,200 tweets.

It turns out the API only exposes the most recent 3,200 tweets of a timeline, and I could not find a workaround. The older tweets have to be stored somewhere, and there are paid services that can track them down for you, but I think 3,200 tweets is a perfectly reasonable upper limit for training a language model, especially if we are combining models.
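
The max_id paging and the 3,200 cap can both be simulated without touching the API. The sketch below is an assumption-laden stand-in: `make_fake_timeline` and `paginate_by_max_id` are hypothetical helpers, the fake timeline uses plain dicts with sequential IDs, and the cap is modeled simply as "only the newest 3,200 tweets are visible".

```python
def paginate_by_max_id(fetch_page, page_size=200):
    """Collect pages until one comes back empty.

    fetch_page(count, max_id) is a hypothetical callable returning tweets
    newest-first with id <= max_id (or the newest tweets if max_id is None).
    """
    all_tweets = []
    max_id = None
    while True:
        page = fetch_page(count=page_size, max_id=max_id)
        if not page:
            break
        all_tweets.extend(page)
        # max_id is inclusive, so step just below the oldest tweet seen so far
        max_id = page[-1]["id"] - 1
    return all_tweets


def make_fake_timeline(n_tweets, cap=3200):
    """Fake timeline with ids n_tweets..1; only the newest `cap` are visible."""
    visible = [{"id": i} for i in range(n_tweets, 0, -1)][:cap]

    def fetch_page(count, max_id=None):
        eligible = [t for t in visible if max_id is None or t["id"] <= max_id]
        return eligible[:count]

    return fetch_page


# An account with 5,000 tweets still only yields the newest 3,200
tweets = paginate_by_max_id(make_fake_timeline(5000))
print(len(tweets))  # 3200
```

Sixteen full pages of 200 come back, the seventeenth is empty, and the loop stops, which matches what I saw with @taylorswift13.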

So, long story short, I can emulate almost any Twitter account pretty well. Now, what would happen if I started emulating my own bot…

Andy Zhang

Programming enthusiast and math geek
