Reading a CSV file directly from a URL into Pandas is a common task, especially when dealing with web data.

However, sometimes the data you need requires authentication to access. Fortunately, Python and Pandas provide straightforward methods to handle this scenario. Let's explore how to read a CSV from a URL with authentication using Pandas.

Authentication with requests

If the URL requires authentication, you'll need to provide credentials to access it. This typically involves passing a username and password or an access token. Python requests supports various authentication methods, including HTTP basic authentication and token-based authentication.

For example, if the URL requires basic authentication, you can use the requests library to pass the credentials:

import requests
import pandas as pd
from io import StringIO
from requests.auth import HTTPBasicAuth

url = 'https://example.com/items/data.json'
user = 'username'
password = 'password'

data = requests.get(url, auth=HTTPBasicAuth(user, password))
df = pd.read_json(StringIO(data.text), lines=True)

df

the same applies for read_csv method.

Authentication with custom headers

Pandas offers parameter storage_options for:

  • read_csv
  • read_json methods

We can use custom headers to provide authentication information to pandas by generating authentication header like: {'Authorization': 'Basic xxxx'}

read_json + storage_options

The following example provides how this can be done:

from http.client import HTTPSConnection
from base64 import b64encode
import base64


def basic_auth(username, password):
    token = b64encode(f"{username}:{password}".encode('utf-8')).decode("ascii")
    return f'Basic {token}'

username = "username"
password = "password"

headers = { 'Authorization' : basic_auth(username, password) }
headers

df = pd.read_json(
    "https://example.com/items/data.json",
    storage_options=headers, lines=True
)

read remote CSV file

Below you can find shorter version for read_csv method:

import pandas as pd
from base64 import b64encode

df = pd.read_csv(
     'https://example.com/items/data.csv',
     storage_options={'Authorization': b'Basic %s' % b64encode(b'username:password')})

df

Read files from amazon buckets

public bucket

To read data from amazon public buckets we need first to install library by - s3fs:

pip install s3fs

then we can use:

import pandas as pd
pd.read_csv(
    "s3://ncei-wcsd-archive/data/processed/SH1305/18kHz/SaKe2013"
    "-D20130523-T080854_to_SaKe2013-D20130523-T085643.csv",
    storage_options={"anon": True}
)

private buckets

as alternative we can use Pandas read_csv method with AWS accounts to read remote data:

df = pd.read_csv("s3://my-private-bucket/data.csv")

We can pass the amazon keys and secrets by:

df = pd.read_csv(

    "s3://my-private-bucket/data.csv",
    storage_options={"key": "AKIAIOSFODNN7EXAMPLE", "secret": "SECRET"},
)

or by using boto3 library:

import os
import pandas as pd
import boto3

session = boto3.Session(profile_name="test")

os.environ['AWS_ACCESS_KEY_ID'] = session.get_credentials().access_key
os.environ['AWS_SECRET_ACCESS_KEY'] = session.get_credentials().secret_key
df = pd.read_csv("s3://xxxx.csv")

Conclusion

Reading a CSV from a URL with authentication in Pandas is a straightforward process.

By following the steps outlined above, you can access data hosted on the web securely and leverage the powerful data manipulation capabilities of Pandas for your analysis.

Resources

You can learn more about reading remote files with Pandas here: