Notes on Scaling Social Media Data Collection

Dylan K. Baker

This is an informal guide outlining how I scaled the task of downloading data from social media sites. Social media companies have little incentive to make this process easy, instead opting to make public-interest research on their platforms difficult or impossible to conduct, so I’m writing up my notes in the hopes that it might be useful to one of the many researchers or journalists out there reinventing this particular wheel.

This certainly isn't the only approach, but it’s what I’ve used for the last few years to gather many tens of millions of posts from Twitter/X, TikTok, Facebook, and YouTube.

And, as with any project, it's always worthwhile to find out if there are existing tools you can use before you build something like this on your own. To this end, the Coalition For Independent Technology Research has been an invaluable resource for me for finding out about existing tools and datasets. That's how I found out about projects like Junkipedia and minet, which are worth checking out before diving into this guide—they might suit your use case!

Are these notes going to be useful for me?

This guide will probably be useful if you’re trying to do some kind of large-scale data collection, and you can figure out how to do it locally on a small scale, but run into issues doing it at the scale you need.

A stick figure gesturing at a laptop that's on fire, exclaiming "oh no"

Specifically, this guide assumes you’re more or less comfortable with moving data around locally, and it’s the scaling and parallelization part that you feel lost about.

Throughout this guide, I use the Twitter Academic API as an example, which has since been closed. Don’t fret, though! Having collected data from Facebook, YouTube, and TikTok, I can confirm that minor variations on this approach will work for all sorts of data downloading tasks.

The approach itself involves using a cloud compute service to set up an HTTP endpoint that does a small piece of data collection each time it’s called, then orchestrating all of the different calls you need to make to this endpoint via a queue manager. If you just want a quick pointer explaining how this particular approach works at a high level, here’s a Medium post that explains it succinctly.

It should be noted that there are many, many ways to manage tasks like this as you scale, and this might not be how you would want to set things up if you were building long-term production data engineering infrastructure— this is only one approach I took as a research data engineer working alone!

Some common tools with a bit more overhead are Apache Airflow or Prefect, which can help orchestrate large and complex data pipelines and centralize things in one UI. If you’re really doing some heavy duty data engineering for the long haul, consider looking up other data workflow orchestration tools and see if those suit your use case.

Where this approach will and won’t work

The approach I’m outlining here was designed for the following situation:

  • You’re trying to collect a pretty large amount of data (e.g. 10M+ tweets)
  • You can collect this data by doing short, repetitive tasks (e.g. a task that makes a different query to the Twitter API every 5 minutes)
  • The place you’re getting data is fairly stable and quick to respond (~seconds to get a response)
  • Optional but good: You have a little money to throw at the problem (otherwise, you will be limited by the free tier of your platform).

This may still be useful, but you’ll need to modify this approach in cases where:

  • You can’t break down your data collection into a bunch of subtasks that can be done in parallel.
    • For example, you’re making a query and scrolling back as far as you can in results, iterating through pagination tokens (I describe a method for doing this below).
  • The API you’re using is laggy, slow, or otherwise unresponsive. Note that an API with rate limits is still fine; it’s just important that the API responds with errors quickly.
    • If this is the case, instead of using cloud functions as I do in this guide, you can run data collection on a virtual machine; it’s much cheaper to leave a VM running than to spin up a bunch of cloud function instances that hang waiting on responses and eventually fail.

I’ll make notes throughout this guide for places I have made modifications as the above situations have cropped up.

Ethical Notes

Before I get into the details, huge disclaimer.

A person standing on a box labeled "soap", as in, on a soapbox.

Don’t use this to take advantage of people’s data.

The goal of this guide is to support public-interest research that helps hold platforms accountable. The fact that data is technically publicly available does not mean that you should use it for whatever you want. Using social media data towards surveillance and profiteering is against the spirit of this guide!

Two doodled outlines of browser windows. One has the phrase "my transition vlog!" under a video player. The other has the header "blab", showing the top of a post with the text "Planning a hate crime, who's in? #dogwhistle".

Even when TOS aren’t violated, dynamics between researchers and data collection subjects are thorny. Like, what does it mean for a non-Black researcher to study Black Twitter, as an outsider? What right do people have for their data to be anonymized, if it’s “already public”? Do most people want their social media data used for research?

This kind of data collection task should only be implemented when there’s a good, documented, and well-scoped reason for it to be happening. Please be thoughtful!

Okay, let’s get started

Below is my order of operations when I’m trying to set up a system like this. Don’t worry about the details— it’ll all be explained as we go.

  1. Mapping out the big picture
  2. Getting your job to work locally
  3. Figuring out where your raw data will go
  4. Setting up your HTTP function
  5. Setting up the queue
  6. Ingesting data
  7. Celebrating and documenting

1. Mapping out the big picture

Get a handle on the whole system

The purpose of this guide is to help scale data downloading, so we’re going to start by thinking about the big picture. First, we come up with a way to break our large data collection task into many smaller subtasks. Then, we set up all of our data collection subtasks to run independently and automatically, handling errors as they go, until we end up with a big pile of raw data. Finally, we process that raw data into a more manageable format and are left with a beautiful dataset!

Here’s the gist of the setup I’ll be outlining in this guide. Don’t worry too much about the technical details; it’ll all get explained further down.

A flowchart with 5 shapes labeled A through E. The stages of the flowchart are, in order, A: Subtask Queue. This points to B: Subtask Executor. B has arrows pointing both to and from C: Data Source. B also points to D: Raw Data. Finally, D points to E: Dataset.

I used the Google Cloud Platform (GCP) to do all of the cloud-hosted components, but Amazon and Microsoft (and probably other places) have comparable services*.

A) Subtask Queue: A queue that handles the execution of each subtask.

  • I used Google Cloud Tasks for Twitter downloading because it handles a lot of the headache of scheduling tasks and retrying them when they fail. This doesn’t need to be a special tool just for task queueing, though— I have also successfully managed a queue with a few tables and a cron job.

B) Subtask Executor: This is an HTTP function that gets invoked by the subtask queue. It’ll take a subtask, execute it, and write the resulting data out.

C) Data Source: This is where you’re getting data from.

  • In this example, I talk about the Twitter Academic API, which is no longer available. This could be any other social media API, though (I’ve had success with APIs for many different platforms, official and otherwise).

D) Raw Data: This is the place where you’re writing raw data, in whatever format you get it.

E) Dataset: This is the final place that your organized data will live.

  • I recommend something well-structured and easy to query. I used a BigQuery table here.

Breaking up the Big Task

So, you have a large data collection task and you need to break it into many little data collection pieces. I’m referring to these as the “Big Task” and “subtasks”.

Ideally, you want subtasks that are independent and small: tasks shouldn’t depend on one another, and each should finish within a few minutes. This is for the following reasons:

  • If tasks are independent, you can run them in parallel, and one failure won’t affect others
  • If tasks are small, you’re less likely to run into some kinds of issues (like timeouts) and it’s easier to isolate errors. It’s fine if tasks fail often— they just need to fail fast.
    • Note: The shortness of subtasks is also particularly important if you’re using cloud functions, as I do in this guide— it’s expensive to leave cloud functions running, and they may impose their own time limits (the v1 GCP cloud function that I use here times out at 9 minutes).

Getting tasks to the right size requires some trial and error, so expect to experiment a bit to find the right subtask breakdown for your situation.

Every case will look a little different here, so I’m going to go into some detail about different ways I’ve approached this.

Three doodled pictures. On the left, a long folded sheet of paper with a bunch of scribbled lines labeled "Big Task". An arrow passes from this paper through a pair of scissors to the second drawing, which is a stack of smaller papers, each with a few lines on it. An arrow connects this drawing to the final one, a cylinder labeled "Subtask queue"

More on choosing subtasks

Here’s an example:

If your Big Task is something along the lines of “Query every instance of ‘#dog’ via the Twitter API from 2018-2022”, the subtask breakdown could be pretty straightforward: you could define a subtask as querying Twitter for all of the ‘#dog’ tweets in a single 24-hour period. That would give you 1826 subtasks, one for each day between January 1, 2018 and December 31, 2022.

The phrase "Big task: 'query every instance of #dog from 2018 to 2022'" points to a folded-up timeline labeled "Jan 2018" on one end and "Dec 2022" on the other. There is an arrow from this timeline through a pair of scissors to a stack of individual dates, the topmost of which is "Jan 1 2018". There's an arrow from this stack of dates to a cylinder labeled "Subtask queue". A small cutout in "Subtask queue" reveals a list of dates.
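
To make that concrete, here is a minimal sketch of generating those daily subtask definitions as plain dictionaries (the parameter names are placeholders; use whatever your executor expects):

from datetime import date, timedelta
from typing import Dict, List


def generate_daily_subtasks(start: date, end: date, query: str) -> List[Dict[str, str]]:
    """Builds one subtask definition per day between start and end (inclusive)."""
    subtasks = []
    current = start
    while current <= end:
        subtasks.append(
            {
                "query": query,
                "start_date": current.isoformat(),
                "end_date": (current + timedelta(days=1)).isoformat(),
            }
        )
        current += timedelta(days=1)
    return subtasks


subtasks = generate_daily_subtasks(date(2018, 1, 1), date(2022, 12, 31), "#dog")
print(len(subtasks))  # 1826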

However, a breakdown this straightforward is not always possible. Some situations you might find yourself in:

You can't get a small amount of data at a time

Not every API will let you make queries specific enough that you’ll only get a handful of results at a time. That is, sometimes you’ll just get a million results all at once and have to scroll back through them. Luckily, when this happens, the results typically come back as a linked list of smaller chunks that you can iterate through one at a time, and if your job fails, you’ll still have a pointer to the last chunk you were reading from.

In this case, you can still kinda think of reading each chunk as a subtask. The difference is that the subtasks are not independent, so you’ll have to use a different approach than the one I outline in this guide. In the appendix below, I give a brief example of what I’ve done in those cases.

You can’t get a small amount of data at a time and can’t pick up where you left off

If you can’t create queries that return a small amount of data at a time, and you have to restart scrolling through all of the data from the beginning every time you query, you might use the “repeat the same subtask” approach below.

In this case, the “same subtask” is just scrolling through the data as far as you can before you get cut off. You’d run it over and over and hope that one of your attempts gets lucky and makes it through a lot of the data before crashing. This is also a situation where you’d want to avoid running things in a time-limited environment like a cloud function— just let it run as long as it will go and keep retrying.

Perhaps there are ways to do a better job with that, but that’s outside of the scope of this guide.

What if I just want to repeat the same subtask over and over?

If your subtask is something that never changes, like “get the latest posts from this account every day”, you can also use the approach I outline in this guide, with a minor change.

2. Getting your job to work locally

A flowchart. A square with a green border labeled "subtask executor" (which has a note next to it saying "this is just any local script that gets data") points to a cloud labeled "data source", which has a pink blob emerging from it labeled "raw data", traveling along an arrow pointing to a cylinder with a green border labeled "anywhere local". A key indicates that blue squiggly borders denote something running on the cloud, while green straight borders indicate something running on a local machine.

Tl;dr: Figure out how to get the data you want to get, then save a small amount of it locally somewhere.

Just get a python script running on your computer that accesses the data you want to access. I think of this as the first version of my Subtask Executor that will ultimately end up running on the cloud.

In my case, this was successfully querying 10 tweets from the Twitter Academic API. At this stage, your code may simply be copy-pasted from some walkthrough or example you found on github. Your credentials key will be sitting as a local file somewhere or in your env. That’s fine! (We’ll address that later)

Save some sample data to a local file in whatever form makes sense (probably .json or .csv). You’ll want this later.
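
As a rough sketch of what this first local version can look like (the endpoint, parameters, and auth here are hypothetical; swap in whatever your platform's API actually expects):

import json
import os

import requests

# Hypothetical API endpoint and parameters; your platform's API will differ.
BEARER_TOKEN = os.environ["BEARER_TOKEN"]  # credentials sitting in your env is fine for now

response = requests.get(
    "https://api.example.com/search",
    params={"query": "#dog", "max_results": 10},
    headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
)
response.raise_for_status()

# Save a small sample of raw data locally; we'll use it later.
with open("sample_data.json", "w") as f:
    json.dump(response.json(), f, indent=2)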

3. Figuring out where your raw data will go

Tl;dr: Figure out where you want to dump your data, and figure out how to put it there.

The Actual Storage

I suggest dumping all of your raw data into some kind of blob storage. I use GCP Storage Buckets for this purpose.

Getting reading and writing to work here on any cloud service will likely involve some amount of fiddling with annoying permissions errors.

You can write to a storage bucket like this (for secret management, see the code here). I used these helper functions:

import json
from typing import Any, Dict

from google.cloud import storage
from google.oauth2 import service_account


def get_service_account_credentials() -> service_account.Credentials:
    """
    Retrieves service account credentials and returns a credentials object.

    Returns:
        service_account.Credentials: The service account credentials.
    """

    # We will replace this with reading credentials from a secret manager soon.
    credentials = service_account.Credentials.from_service_account_file(
        "credentials.json"
    )

    return credentials


def write_to_bucket(
    out_json: Dict[Any, Any],
    output_bucket_name: str,
    output_path: str,
    storage_client: storage.Client,
) -> None:
    """Writes output JSON data to a bucket.

    Args:
        out_json (JSON object): JSON data you wish to write to the bucket.
        output_bucket_name (str): The name of the output bucket.
        output_path (str): The path to write the data to within the output bucket.
        storage_client (storage.Client): The storage client to use for writing the data.
    """

    bucket = storage_client.bucket(output_bucket_name)
    data_blob = bucket.blob(output_path)

    with data_blob.open("w") as f:
        f.write(json.dumps(out_json))


# Set up the storage client once, then pass it into write_to_bucket.
credentials = get_service_account_credentials()
storage_client = storage.Client(credentials=credentials)


So I could write to buckets like this:

write_to_bucket(
    SOME_JSON_DATA,
    output_bucket_name=BUCKET_NAME,
    output_path=OUTPUT_PATH,
    storage_client=storage_client,
)

Data provenance

It’s a good idea to record as much data provenance as possible, and include redundancy wherever it’s reasonable to do so. Including basic metadata that answers questions like “what query did I make to retrieve this particular post?” and “when did I download this?” is not only helpful, but may also be necessary as a research standard.

It’s common for this data to end up somewhere in the file path, although I also like to save it in at least one additional way. You can tack it onto the raw JSON, add some columns to your output CSV, or save a whole separate file or table of metadata—the world is your oyster.

An example:

raw_tiktok_data > query_by_keyword > {tiktok_id} > [media.mp4, metadata.json]

That is:

dataset name > how you got this data > what unique piece of data this is > [the media itself, information about the run that produced this media]
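
For instance, here is a small (hypothetical) sketch of saving a provenance record next to each downloaded item, reusing the write_to_bucket helper from above. tiktok_id, BUCKET_NAME, and storage_client are placeholders or objects from earlier, and the media file itself would be uploaded separately:

from datetime import datetime, timezone

# Record how and when we got this particular item.
metadata = {
    "query_type": "query_by_keyword",
    "keyword": "#dog",
    "downloaded_at": datetime.now(timezone.utc).isoformat(),
}

write_to_bucket(
    metadata,
    output_bucket_name=BUCKET_NAME,
    output_path=f"raw_tiktok_data/query_by_keyword/{tiktok_id}/metadata.json",
    storage_client=storage_client,
)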

Don’t worry too much about your data being perfectly organized; it’s common for data to end up a little disorganized in the early stages of a research project. As long as you keep track of enough metadata, you’ll be able to do cleanup in the ingestion step at the end.

4. Setting up your HTTP function

A box labeled "subtask executor" with a blue wiggly border. Text around the box says "We're taking the local data-getting code and putting it on the cloud". The word "cloud" has a blue wiggly border around it.

Tl;dr: Get the local subtask-executing code to run on the cloud. Also, this is the time to set up your secret manager.

Now it’s time to bundle up all of our code and set it to run on the cloud!

At this stage, my local code did the following:

  1. Configure one subtask: Construct a query to pass to the Twitter API
  2. Execute one subtask: Make the query and retrieve data
  3. Write output data: Write the data to my output storage

To make this code into an HTTP function, I loosely followed this tutorial. I added a main.py file with an entry point function (that had the appropriate @functions_framework.http decorator and took in a request argument). Using request.get_json(), you can access your incoming JSON arguments as a dictionary.

These JSON arguments are where subtask parameters are going to get passed via the task queue (which we haven’t set up yet).

So, now instead of having some local script with hard-coded parameters like this:

batch_query.query_tweets(
    query=["dog"],
    output_bucket_name="dog-twitter-data",
    start_date="2020-01-01",
    end_date="2020-01-02",
)

You’ll just be reading those parameters from the input JSON:

batch_query.query_tweets(
    query=request_json["query"],
    output_bucket_name=request_json["output_bucket_name"],
    start_date=request_json["start_date"],
    end_date=request_json["end_date"],
)

Then, you can make the same call by sending your HTTP function this JSON via a POST request:

{"query": ["dog"], "output_bucket_name": "dog-twitter-data", "start_date": "2020-01-01", "end_date": "2020-01-02"}

In GCP, you can test your function out under the ‘Testing’ tab by putting your JSON in ‘Configure Triggering Event’. You can also use the CLI to run a test.
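
If you'd rather test from a script, you can also just POST the JSON to your function's URL yourself. Here's a minimal sketch (the URL is a placeholder, and this assumes you've either allowed unauthenticated invocations while testing or attached the appropriate identity token):

import requests

# Placeholder URL; use your deployed function's trigger URL
# (or http://localhost:8080 if you're running it locally with functions-framework).
FUNCTION_URL = "https://REGION-PROJECT_ID.cloudfunctions.net/my-subtask-executor"

payload = {
    "query": ["dog"],
    "output_bucket_name": "dog-twitter-data",
    "start_date": "2020-01-01",
    "end_date": "2020-01-02",
}

response = requests.post(FUNCTION_URL, json=payload)
print(response.status_code, response.text)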


Tip: To start out here, I would recommend having your entry point function print out the value of request.get_json() to make sure you can actually pass parameters to your function, like this:

import json

import functions_framework
from google.cloud import logging

# Cloud Logging client; handy once you want more than print statements.
logging_client = logging.Client()
logger = logging_client.logger("main_logs")


@functions_framework.http
def handle_http(request) -> str:
    """Handles incoming HTTP POST requests from the task queue.

    Parameters are expected in the JSON body of the request.

    Args:
        request: HTTP POST request

    Returns:
        String indicating status of the completed request
    """

    request_json = request.get_json(silent=True)

    print("Received parameters: %s" % json.dumps(request_json, indent=2))

    return "Successfully processed request"

Once you’ve made sure you can actually invoke your HTTP function and read the incoming JSON, try executing your actual subtask.

Once set up, my HTTP function did this:

  1. Read subtask parameters: Read in subtask parameters as JSON from the incoming HTTP request
  2. Configure one subtask: Use those parameters to construct a query to pass to Twitter
  3. Execute one subtask: Make the Twitter query and retrieve data
  4. Write output data: Write the data to the output location

Some other code notes to be aware of:

  • For setting up your code on the cloud, I recommend putting your code in a private GitHub repository and syncing it that way.
  • Moving things onto the cloud is a common place to start to run into annoying permissions errors! Don't be dismayed if this happens to you.
  • This is a good time to integrate a logger into your function. At the very least, write yourself helpful print statements, and make sure you know how to find them in the logs.


Secrets and permissions

If your code relies on any keys or tokens lying around in your env, set up a secret manager for them during this stage. For me, I used the secret manager to hold my Twitter API Bearer Token and various service account credentials.

5. Setting up the queue

A cylinder labeled "subtask queue" with an arrow pointing to a rectangle labeled "subtask executor". Both have blue wiggly borders, indicating being run on the cloud. There is a sheet of paper labeled "JSON" on top of the arrow, indicating JSON being passed to the subtask executor. A key indicates that blue squiggly borders denote something running on the cloud, while green straight borders indicate something running on a local machine.

Tl;dr: Write a script that generates subtasks and adds them to a queue.

Each “subtask” is really just a JSON string that gets passed to your HTTP function via a POST request. But how do we actually make all of those POST requests to the function?

There are lots of ways! The way I did it here was using Google Cloud Tasks. It will make all of the POST requests to your HTTP function one at a time, and also do some handy bookkeeping for you: it’ll keep track of when your subtasks fail, automatically reschedule failed tasks, and manage how quickly and at what concurrency your tasks run.

Google Cloud Tasks isn’t the only way to do this, though; you can invoke your HTTP function any number of ways. There are also practical reasons not to use a task manager: for example, if you want to run the same subtask over and over, you wouldn’t want to re-enqueue the same task a million times in a fancy queue like this—instead, you can set up a scheduler that invokes the same subtask on a schedule. See this appendix note for more on that.

As for actually adding tasks to the queue, I wrote a separate python script here to generate subtasks and enqueue them, and ran this locally, since I only needed to enqueue each subtask once. This code generated subtask parameters, formatted them as JSON, and followed this starter code to add that JSON as a payload to the HTTP-invoking task.
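
For reference, a stripped-down version of that enqueueing script might look something like this (project, region, queue name, and function URL are placeholders, and the subtasks list is the kind of thing generated in the earlier daily-subtasks sketch):

import json

from google.cloud import tasks_v2

tasks_client = tasks_v2.CloudTasksClient()
parent = tasks_client.queue_path("my-project", "us-central1", "subtask-queue")
FUNCTION_URL = "https://REGION-PROJECT_ID.cloudfunctions.net/my-subtask-executor"


def enqueue_subtask(subtask_params: dict) -> None:
    """Adds one subtask to the queue as an HTTP POST task."""
    task = {
        "http_request": {
            "http_method": tasks_v2.HttpMethod.POST,
            "url": FUNCTION_URL,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps(subtask_params).encode(),
        }
    }
    tasks_client.create_task(request={"parent": parent, "task": task})


# `subtasks` is a list of parameter dicts, e.g. one per day of the query range.
for subtask in subtasks:
    enqueue_subtask(subtask)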

6. Ingesting data

Tl;dr: Move all of your data from blob storage into something more structured, like a table.

Now you’ve theoretically gathered a bunch of data and it’s sitting in blob storage somewhere! Amazing!

Most of the time for me, the ingestion step is a python script I run periodically that takes all of the raw data collected since ingestion was last run, formats it, and dumps it into a table. You could also run ingestion as a cron job, automatically ingesting data every day.
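
As a rough sketch of that kind of script (bucket, prefix, table ID, and field names are placeholders, and this version naively re-reads everything under the prefix rather than tracking what's new since the last run):

import json

from google.cloud import bigquery, storage

storage_client = storage.Client()
bq_client = bigquery.Client()

TABLE_ID = "my-project.my_dataset.tweets"  # placeholder table

rows = []
for blob in storage_client.list_blobs("dog-twitter-data", prefix="raw_twitter_data/"):
    if not blob.name.endswith(".json"):
        continue
    payload = json.loads(blob.download_as_text())
    for tweet in payload.get("data", []):
        rows.append(
            {
                "id": tweet["id"],
                "text": tweet["text"],
                "created_at": tweet.get("created_at"),
                "source_blob": blob.name,  # a little extra provenance
            }
        )

errors = bq_client.insert_rows_json(TABLE_ID, rows)
if errors:
    print("Some rows failed to insert:", errors)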

Coming up with a data model

In my example, Tweets come in a very structured format, so I made a BigQuery table with the same schema, added a few fields for data provenance, and dumped tweets directly into it. More often than not, though, you’ll need to do a little more planning. What queries do you want to be able to make? How can you structure your data to make those queries easy to make?

A drawing of a bunch of nested rectangles representing a data structure. The largest rectangle is labeled "post" and has smaller rectangles with labels like "id", "text", and "timestamp". One rectangle labeled "media_item" has a subfield called "local_path" which points out to an image labeled "media_id.png"
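
As one hypothetical way to express that structure, here's what a nested BigQuery schema along those lines could look like (the table ID and field names are just examples):

from google.cloud import bigquery

post_schema = [
    bigquery.SchemaField("id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("text", "STRING"),
    bigquery.SchemaField("timestamp", "TIMESTAMP"),
    bigquery.SchemaField(
        "media_items",
        "RECORD",
        mode="REPEATED",
        fields=[
            bigquery.SchemaField("media_id", "STRING"),
            bigquery.SchemaField("local_path", "STRING"),
        ],
    ),
    # Provenance fields
    bigquery.SchemaField("query", "STRING"),
    bigquery.SchemaField("downloaded_at", "TIMESTAMP"),
]

table = bigquery.Table("my-project.my_dataset.posts", schema=post_schema)
bigquery.Client().create_table(table, exists_ok=True)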

7. Celebrating (and documenting)

I know you know this, but I’m saying it anyway: once you have things running smoothly, make sure you’ve written stuff down! Keeping notes as you go will save you an enormous amount of work and heartache, and it’s good to periodically check and make sure you’re keeping those up.

On that note, this concludes my notes on data downloading with cloud functions. I hope this has been helpful, and best of luck in your data collection adventures!

Many thanks to Megan Brown, Alex Hanna, Jason Greenfield, William Pietri, and Skyler Sinclair for their feedback in the development of this piece.

A smiling stick figure saying "hooray" next to a smiling laptop with clouds around it.

Appendix

Scrolling through pagination tokens

Example: I wrote the pagination token (which functions as the pointer in the linked list of results) for each chunk to a file, and each “task” was just iterating through more chunks. The gist was: “Read the last pagination token you saw from a .txt file. Scroll back as far as you can, and when you get stopped, write the last token you saw to a .txt file. If you hit the end of the results, indicate that in the .txt file, notify me, and end this job”. I have done this both via a cloud task with google cloud scheduler and as a plain old cron job running on a cloud vm.
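
Here's a rough sketch of that loop, with a hypothetical API and placeholder helpers (save_chunk stands in for however you write raw data out, e.g. write_to_bucket from earlier):

import requests

TOKEN_FILE = "last_pagination_token.txt"  # placeholder path


def read_last_token():
    try:
        token = open(TOKEN_FILE).read().strip()
        return token or None
    except FileNotFoundError:
        return None


def write_last_token(token: str) -> None:
    with open(TOKEN_FILE, "w") as f:
        f.write(token)


next_token = read_last_token()
while next_token != "DONE":
    params = {"query": "#dog"}  # hypothetical query parameters
    if next_token:
        params["pagination_token"] = next_token
    response = requests.get("https://api.example.com/search", params=params)
    if response.status_code != 200:
        # Rate limited or otherwise cut off: we've already saved our place, so just stop.
        break
    payload = response.json()
    save_chunk(payload)  # placeholder for writing this chunk of raw data out
    next_token = payload.get("meta", {}).get("next_token")
    if next_token is None:
        write_last_token("DONE")  # hit the end of the results; notify yourself here
        break
    write_last_token(next_token)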

Running the same task over and over

If you’re running the same thing over and over— e.g. your subtasks are all identical, like “get the latest posts from this account today”—you can make a really easy substitution here. Instead of having a queue with a bunch of different tasks to execute, you can set up a single task in the Cloud Scheduler. It’ll just have the same JSON parameters every time, rather than a new set for every subtask. You can set it to run, e.g., every day at 1am...

A screenshot of configuring a job with the google cloud scheduler. It has the description "Downloads tweets mentioning dogs" with a schedule "0 1 * * *", which indicates running every day at 1am in cron syntax.

...and have the scheduler make a POST request to the HTTP function with your parameters:

A screenshot of the google cloud scheduler configuration page, now on the "configure the execution" section. It shows a task configured to make a POST request with a JSON body with parameters like "'query': 'dog'".

Setting up a secret manager

If you have any credentials that need managing, such as API keys, passwords, or bearer tokens, I highly recommend using a secret manager to keep track of them.

A cylinder labeled "secrets" on the left and a square labeled "subtask executor" on the right. There is an arrow running from "secrets" to "subtask executor" with a key on it. There's also an arrow running from "subtask executor" back to "secrets". Both have blue wiggly borders, indicating that they're running on the cloud.

Secret managers keep important credentials in one place, and keep track of who has access to which credentials. (A kind of secret management system you may have encountered already: personal password managers, e.g. 1password, LastPass, or one built into your browser.) Cloud infrastructures often have their own secret management systems that allow you to access credentials easily.

Coding notes

I used Google Secret Manager here. It’s very simple but took me an irritatingly long time to debug, so here are some screenshots:

On the cloud console side, I made a new secret for my Twitter bearer token…

A screenshot of the google cloud console secret manager. There is a secret called "twitter_bearer_token" selected.

…made sure my gservice account had access to it…

A screenshot of the google cloud secret manager. The "twitter_bearer_token" secret is open, and there's a red arrow pointing to a button labeled "Grant access".

…and then made sure my gservice account had the “Secret Manager Secret Accessor” role.

A screenshot of the google cloud IAM page for some account whose name is not shown. There is a red box highlighting that a role called "Secret Manager Secret Accessor" has been added to the account.

Once that was set up, my code to access secrets looked roughly like this:

import json
from typing import Any, Dict, Union

from google.cloud import secretmanager


def read_secret(
    project_id: str, secret_name: str, is_json: bool = False
) -> Union[str, Dict[Any, Any]]:
    """
    Reads a secret from a secret manager.

    If is_json is true, will decode and return the secret as JSON.
    Otherwise, the secret will be interpreted as a string.

    Args:
        project_id (str): The project ID.
        secret_name (str): The name of the secret.
        is_json (bool, optional): Whether to interpret the secret as JSON. Defaults to False.

    Returns:
        The secret, either as a string or a JSON object.
    """

    secrets_client = secretmanager.SecretManagerServiceClient()

    request = {"name": f"projects/{project_id}/secrets/{secret_name}/versions/latest"}

    secret_response = secrets_client.access_secret_version(request)

    if is_json:
        output = json.loads(secret_response.payload.data.decode("UTF-8"))
    else:
        output = secret_response.payload.data.decode("UTF-8")

    return output


You can use it like this:

from google.oauth2 import service_account


def get_service_account_credentials() -> service_account.Credentials:
    """
    Retrieves service account credentials from a secret manager and returns a credentials object.

    Returns:
        service_account.Credentials: The service account credentials.
    """

    credentials = service_account.Credentials.from_service_account_info(
        read_secret(PROJECT_ID, secret_name=SERVICE_ACCOUNT_KEY_NAME, is_json=True),
        scopes=["https://www.googleapis.com/auth/cloud-platform"],
    )

    return credentials