# geo-bert-multilingual
| Property | Value |
|---|---|
| Author | Kateryna Lutsai |
| Base Model | bert-base-multilingual-cased |
| Paper | Research Paper |
| Training Data | Twitter dataset with text and metadata |
## What is geo-bert-multilingual?
geo-bert-multilingual is a natural language processing model that predicts geographical locations from short texts of up to 500 words. Built on BERT's multilingual architecture, it outputs predictions as Gaussian Mixture Models (GMMs), providing both point location estimates and uncertainty measures.
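The exact output format depends on the repository's custom code, but conceptually each prediction is a weighted set of Gaussian components over (latitude, longitude). The sketch below, with invented weights, means, and standard deviations, shows how such an output can be read:

```python
import numpy as np

# Hypothetical example of the model's output format: a 5-component
# Gaussian mixture over (latitude, longitude). Component weights sum to 1;
# a tighter sigma indicates higher confidence in that mode.
weights = np.array([0.55, 0.20, 0.12, 0.08, 0.05])   # mixture weights
means = np.array([                                    # (lat, lon) per component
    [50.45, 30.52],   # Kyiv
    [48.85,  2.35],   # Paris
    [40.71, -74.01],  # New York
    [35.68, 139.69],  # Tokyo
    [-33.87, 151.21], # Sydney
])
sigmas = np.array([1.2, 2.5, 3.0, 4.1, 5.0])          # per-component std dev, degrees

# The simplest point estimate is the mean of the highest-weight component.
best = int(np.argmax(weights))
print(f"Point estimate: {means[best]}, weight={weights[best]:.2f}, sigma={sigmas[best]}")
```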
## Implementation Details
The model adds a custom linear regression head on top of BERT's [CLS] (classification) token. Training uses Adam over 3 epochs with a cosine learning rate schedule decaying from 1e-5 to 1e-6; a minimal sketch of this setup follows the list below.
- Handles both text-only and metadata-enhanced predictions
- Outputs 5 candidate locations (GMM components) with associated confidence weights
- Supports multilingual input
- Achieves up to 82% accuracy within 161 km for user location prediction
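The repository's training code is not reproduced here; the following is only a sketch of the architecture and schedule described above. It assumes five components, each parameterized by a 2-D mean, an isotropic log-variance, and a weight logit; the `GeoBertGMM` class, `n_components`, and `steps_per_epoch` are illustrative names, not the author's actual code.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class GeoBertGMM(nn.Module):
    """Sketch of a GMM regression head on top of multilingual BERT."""

    def __init__(self, n_components: int = 5):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-multilingual-cased")
        hidden = self.bert.config.hidden_size
        # One linear layer maps the [CLS] embedding to all GMM parameters:
        # per component, 2 means + 1 log-variance + 1 weight logit.
        self.head = nn.Linear(hidden, n_components * 4)
        self.n_components = n_components

    def forward(self, input_ids, attention_mask):
        cls = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask).last_hidden_state[:, 0]
        params = self.head(cls).view(-1, self.n_components, 4)
        means = params[..., :2]                   # (lat, lon) per component
        variances = params[..., 2].exp()          # exponentiate to keep positive
        weights = params[..., 3].softmax(dim=-1)  # normalize to sum to 1
        return means, variances, weights

# Optimizer mirroring the card: Adam with a cosine schedule decaying
# from 1e-5 to 1e-6 over 3 epochs (steps_per_epoch is a placeholder).
model = GeoBertGMM()
steps_per_epoch = 1000  # depends on the dataset
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=3 * steps_per_epoch, eta_min=1e-6)
```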
## Core Capabilities
- Text-based geolocation prediction with a mean error of 1588 km (see the metric sketch after this list)
- Metadata-enhanced prediction reducing mean error to roughly 800 km
- Gaussian Mixture Model output for uncertainty estimation
- Support for both tweet location and user home location prediction
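Mean error and "accuracy within 161 km" (about 100 miles, a common threshold in tweet-geolocation work) are both derived from great-circle distance. A self-contained sketch of these metrics, using made-up coordinates:

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))  # Earth radius ~6371 km

# Hypothetical predicted and gold coordinates for illustration.
pred = np.array([[50.45, 30.52], [48.90, 2.40]])
gold = np.array([[50.40, 30.60], [41.00, 2.20]])

dists = haversine_km(pred[:, 0], pred[:, 1], gold[:, 0], gold[:, 1])
print("mean error (km):", dists.mean())
print("acc@161:", (dists <= 161).mean())  # fraction of predictions within 161 km
```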
## Frequently Asked Questions
Q: What makes this model unique?
The model's ability to output geographical predictions as probability distributions (GMMs) sets it apart: it provides not just point estimates but also confidence measures for each predicted location, as sketched below.
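One way to turn the distribution into an explicit confidence score is to estimate how much probability mass falls near the top mode. A rough Monte Carlo sketch, with all GMM parameters invented and a flat-earth distance approximation used for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-component slice of a GMM output (weights, means, stds in degrees).
weights = np.array([0.7, 0.3])
means = np.array([[50.45, 30.52], [48.85, 2.35]])
sigmas = np.array([1.0, 3.0])

# Sample from the mixture and measure how much probability mass falls
# within 161 km of the top mode -- a usable "confidence" score.
comp = rng.choice(len(weights), size=20000, p=weights)
samples = means[comp] + rng.normal(size=(20000, 2)) * sigmas[comp, None]

top = means[np.argmax(weights)]
# Crude local approximation: ~111 km per degree, longitude scaled by cos(lat).
km = np.hypot((samples[:, 0] - top[0]) * 111.0,
              (samples[:, 1] - top[1]) * 111.0 * np.cos(np.radians(top[0])))
print("P(within 161 km of top mode):", (km <= 161).mean())
```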
Q: What are the recommended use cases?
The model is well suited to large-scale geo-tagging, particularly of social media content. It performs best when user metadata is available alongside the text, making it especially suitable for Twitter data analysis.