Languages Through Time - Semantic Enigmas.

Sentiment analysis of the Old Babylonian version of the Gilgamesh Epic.

@kaggle.patricklford_language

About this Dataset

Unveiling Ancient Voices: Exploring Language Evolution Through the Epic of Gilgamesh

Languages are the bedrock of human civilisation, weaving intricate tapestries of culture, history, and identity across millennia. From the enigmatic cuneiform inscriptions of Sumer to the elegant hieroglyphs of ancient Egypt, and from the philosophical richness of Sanskrit to the concision of Classical Chinese, each linguistic epoch tells a story of innovation, adaptation, and legacy.

Ancient Languages

Sumerian (c. 3000 BCE)

  • Writing System: Cuneiform, one of the earliest known writing systems.
  • Contributions: Extensive records in administration, literature (e.g., the Sumerian Gilgamesh poems that fed into the later Akkadian Epic of Gilgamesh), and law (e.g., Code of Ur-Nammu).
  • Complexity: Complex system of logograms and phonetic elements.

Egyptian (c. 3200 BCE)

  • Writing System: Hieroglyphics, hieratic, and later demotic scripts.
  • Contributions: Extensive religious texts, monumental inscriptions, medical texts, and administrative documents.
  • Complexity: Combination of logographic and alphabetic elements.

Sanskrit (c. 1500 BCE)

  • Writing System: Brahmi script (earliest form), later evolving into Devanagari.
  • Contributions: Rich literary tradition (Vedas, Upanishads, Mahabharata, Ramayana), scientific and philosophical texts (e.g., works by Aryabhata, Patanjali).
  • Complexity: Highly inflected language with a sophisticated grammatical structure described by Panini in his treatise "Ashtadhyayi."

Classical Chinese (c. 1200 BCE)

  • Writing System: Logographic script, which has evolved into modern Chinese characters.
  • Contributions: Extensive literary, historical, and philosophical texts (e.g., Confucian classics, Daoist texts).
  • Complexity: Complex system of characters, tones, and syntax.

Mediaeval Languages

Latin (c. 700 BCE - Mediaeval Period)

  • Writing System: Latin alphabet.
  • Contributions: Foundation of Western literature, science, law, and administration. Many scientific names and terms are Latin.
  • Complexity: Inflectional language with a rich system of noun cases, verb conjugations, and classical literature.

Arabic (c. 6th century CE)

  • Writing System: Arabic script.
  • Contributions: Extensive contributions to science, mathematics, medicine, and literature. Preservation and expansion of Greek and Roman knowledge during the Islamic Golden Age.
  • Complexity: Root-based morphology, extensive vocabulary, and intricate system of grammatical rules.

Modern Languages

English

  • Writing System: Latin alphabet.
  • Contributions: Dominant language in science, technology, business, and international diplomacy.
  • Complexity: Extensive vocabulary with influences from many languages, flexible grammar, and rich literary tradition.

Mandarin Chinese

  • Writing System: Simplified and traditional Chinese characters.
  • Contributions: Major language of international business, growing influence in technology and science.
  • Complexity: Tonal language with a large number of characters, but relatively simple grammar.

Programming Languages (e.g., Python, C++)

  • Writing System: Varies (syntax of the language).
  • Contributions: Foundation of modern technology, software development, and automation.
  • Complexity: Depends on the language; some have simpler syntax (Python), while others are more complex (C++).

Summary

  • Ancient Sumerian and Egyptian were advanced for their time in developing early writing systems and creating comprehensive records.
  • Sanskrit and Classical Chinese represent significant linguistic complexity and cultural richness.
  • Latin and Arabic had profound impacts on mediaeval scholarship and the preservation and expansion of knowledge.
  • Modern English has become the global lingua franca, while Mandarin Chinese is increasingly influential.
  • Programming languages represent a different kind of linguistic advancement, crucial for the digital age.

The watered-down languages of modernity

The notion that modern languages are "watered down" versions of ancient root languages is a matter of perspective and depends on how one views linguistic evolution. Here are some key points to consider:

Linguistic Evolution

Languages evolve over time, influenced by social, political, cultural, and technological changes. This evolution can lead to simplification in some areas and increased complexity in others.

Simplification:

  • Grammar: Many modern languages have simplified grammatical structures compared to their ancient counterparts. For example, English has significantly less inflection than Old English or Latin.
  • Phonology: Sound systems can become simpler in places. For instance, English has lost some sounds found in Old English, such as the front rounded vowel /y/ and the velar fricative [x] (the sound once spelled h in niht, "night").
  • Vocabulary: Borrowing from other languages often leads to a more streamlined vocabulary. Modern English, for example, has incorporated many loanwords, sometimes replacing native terms.

Complexity:

  • Vocabulary Expansion: Modern languages often have larger vocabularies due to technological, scientific, and cultural advancements. New words are created or borrowed to describe new concepts, tools, and phenomena.
  • Syntax: Sentence structures and word order can become more complex. Modern English, for instance, allows for a variety of syntactic structures that weren't present in Old English.
  • Usage Contexts: Modern languages have to adapt to new contexts such as digital communication, which has led to the creation of new linguistic norms and forms (e.g., internet slang, emojis).

Examples of Language Evolution

English:

  • Old English: Highly inflected with a complex system of noun declensions and verb conjugations.
  • Middle English: Simplified inflectional endings, significant borrowing from Norman French.
  • Modern English: Minimal inflection, extensive vocabulary with many loanwords from Latin, French, and other languages.

Chinese:

  • Classical Chinese: Concise and highly context-dependent, used primarily in formal writing.
  • Middle Chinese: More polysyllabic words and tones developed.
  • Modern Mandarin: Simplified characters (in mainland China), reduced number of tones (compared to some other Chinese dialects), and standardised grammar.

Preservation and Innovation

While modern languages might seem less "pure" or "complex" compared to their ancient roots, this does not mean they are inferior. Languages adapt to the needs of their speakers, and these adaptations often result in more efficient and versatile means of communication.

Preservation:

  • Classical Texts: Ancient languages like Latin, Ancient Greek, Sanskrit, and Classical Chinese are still studied, preserving their literary and scholarly traditions.
  • Cultural Heritage: Many modern languages retain idioms, expressions, and structural elements from their ancient predecessors.

Innovation:

  • Technological Terminology: Modern languages constantly innovate to describe new technologies and scientific discoveries.
  • Global Communication: Languages like English have evolved to become tools of global communication, incorporating elements from many other languages and cultures.

Summary

Modern languages are not necessarily "watered down" versions of their ancient counterparts; they are different. They reflect the dynamic nature of human societies, adapting and evolving to meet new challenges and contexts. While they may simplify certain aspects, they also develop new complexities and capabilities, ensuring that language remains a vital and effective tool for communication.

The Epic of Gilgamesh: A Linguistic Lens

The Epic of Gilgamesh, a cornerstone of ancient literature, transcends time and culture. Its narrative spans ancient Mesopotamia, offering glimpses into the ethos and worldview of its era. This project delves into the epic's text through advanced computational linguistics, employing sentiment analysis and topic modelling techniques to uncover hidden narratives and thematic undercurrents.

Visualisations (Gilgamesh.csv)

Visualisations and Interpretations

Visual representations of sentiment and topic analyses provide insights into the cultural and emotional dimensions of the Epic of Gilgamesh. These analyses not only enrich our understanding of ancient texts but also highlight the enduring relevance of linguistic exploration in deciphering human expression across epochs.

A Markdown document with the R code for all of the visualisations below is linked from this dataset.

Sentiment Analysis:

Using sentiment lexicons such as Bing and NRC helps explore the emotional landscape of the Gilgamesh epic. Visualisations reveal nuanced sentiments embedded within the text, shedding light on the emotional tapestry of its characters and events:

  • Bing Lexicon: The code uses the Bing lexicon to classify words into positive and negative sentiments.
  • NRC Lexicon: The NRC lexicon classifies words into a wider range of sentiments (e.g., joy, anger, fear).

Plots are generated to visualise the sentiment counts.
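As a rough illustration of lexicon-based counting, here is a minimal Python sketch. It is not the author's R code: the two tiny lexicons and the sample sentence are invented for this example, and the real Bing and NRC lexicons are much larger and must be obtained separately.

```python
from collections import Counter
import re

# Hypothetical miniature lexicons, invented for illustration only.
# A Bing-style lexicon maps a word to one polarity label.
BING_LIKE = {
    "joy": "positive",
    "friend": "positive",
    "death": "negative",
    "fear": "negative",
}
# An NRC-style lexicon maps a word to a set of categories.
NRC_LIKE = {
    "joy": {"joy", "positive"},
    "friend": {"trust", "positive"},
    "death": {"fear", "sadness", "negative"},
    "fear": {"fear", "negative"},
}

def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())

def bing_counts(tokens):
    """Count positive and negative tokens using the Bing-style lexicon."""
    return Counter(BING_LIKE[t] for t in tokens if t in BING_LIKE)

def nrc_counts(tokens):
    """Count every category attached to each token in the NRC-style lexicon."""
    counts = Counter()
    for t in tokens:
        counts.update(NRC_LIKE.get(t, ()))
    return counts

# Invented sample sentence, not a quotation from the dataset.
sample = "Gilgamesh wept for his friend, and the fear of death seized him."
tokens = tokenize(sample)
bing = bing_counts(tokens)
nrc = nrc_counts(tokens)
print(dict(bing))
print(dict(nrc))
```

The resulting counts are what a bar plot of sentiment would display, with one bar per label.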

The above chart is a visualisation of the top terms for each topic identified through topic modelling using Latent Dirichlet Allocation (LDA). Here’s an explanation of the process and how the topics are determined:

Topic Modelling

Applying Latent Dirichlet Allocation (LDA), this study identifies distinct topics within the Gilgamesh narrative. Each topic encapsulates clusters of words that resonate with specific themes, offering a structured approach to understanding the epic's multifaceted narrative.

Topic Modelling:

  • Topic modelling uses statistical models to discover the abstract "topics" that occur in a collection of documents.
  • LDA (Latent Dirichlet Allocation) is a popular algorithm for topic modelling. It assumes that each document is a mixture of a small number of topics and that each word in the document is attributable to one of the document's topics.

LDA Process:

  • Document-Term Matrix (DTM): The text is tokenised and converted into a DTM, where each row represents a document, each column represents a term (word), and each cell holds the frequency of that term in that document.
  • Number of Topics: The number of topics (num_topics) is set to 4, meaning the algorithm identifies 4 distinct topics.
  • LDA Model: The LDA algorithm is run on the DTM to assign topics to words. The result is a set of topics, where each topic is a distribution over words, and each document is a distribution over topics.
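The document-term matrix step can be sketched in a few lines of Python. This is an illustrative sketch under simple assumptions (lowercased ASCII tokens, two invented example documents), not the R code used for the project.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())

def document_term_matrix(documents):
    """Return (vocabulary, rows): one row per document, one column per term."""
    token_lists = [tokenize(d) for d in documents]
    # Sorted vocabulary gives every term a stable column position.
    vocabulary = sorted({t for tokens in token_lists for t in tokens})
    rows = []
    for tokens in token_lists:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows

# Invented example documents.
docs = ["Enkidu met Gilgamesh", "Gilgamesh the king, Gilgamesh the hero"]
vocab, dtm = document_term_matrix(docs)
print(vocab)
for row in dtm:
    print(row)
```

Each row can then be passed to a topic model; real corpora usually store the matrix sparsely because most cells are zero.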

Top Terms for Each Topic:

  • After running LDA, you get a matrix called "beta" that gives, for each topic, the probability of each word being generated by that topic.
  • The top_terms dataframe lists the top 10 words (terms) for each topic based on their probability (beta value).
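Selecting the top terms from a beta matrix amounts to sorting each topic's row by probability. The Python sketch below assumes beta is a list of rows (one per topic) aligned with a vocabulary list; the function name, vocabulary and probabilities are hypothetical values chosen for illustration.

```python
def top_terms(beta, vocabulary, n=10):
    """Return {topic_number: [(term, probability), ...]} with the n largest betas.

    beta[k][v] is assumed to be the probability of vocabulary[v] under topic k.
    """
    result = {}
    for k, row in enumerate(beta, start=1):
        ranked = sorted(zip(vocabulary, row), key=lambda pair: pair[1], reverse=True)
        result[k] = ranked[:n]
    return result

# Hypothetical vocabulary and probabilities, not values from Gilgamesh.csv.
vocabulary = ["tablet", "line", "enkidu", "woman"]
beta = [
    [0.50, 0.30, 0.15, 0.05],  # topic 1
    [0.10, 0.20, 0.40, 0.30],  # topic 2
]
top = top_terms(beta, vocabulary, n=2)
for topic, terms in top.items():
    print(topic, terms)
```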

Interpreting the Chart

  • X-axis (Beta): Represents the probability of a term within a topic. Higher beta values indicate a stronger association between the term and that topic.
  • Y-axis (Terms): Represents the top terms for each topic. The terms are reordered within each topic for better visualisation.
  • Facets: Each facet (subplot) represents one topic. Since I set num_topics to 4, there are 4 facets in the chart.
  • Bars: The bars show the beta values of the top terms for each topic. The length of the bar indicates the strength of the association between the term and the topic.

What Determines the Topic

The topics are determined based on the co-occurrence patterns of words across the documents. Here’s a simplified explanation:

  • Co-occurrence Patterns: LDA identifies groups of words that frequently occur together across the documents. These groups of words form the "topics."
  • Distribution: Each topic is a distribution over words, meaning it assigns probabilities to each word based on how likely it is to appear in that topic.
  • Assignment: During the LDA process, words and documents are assigned to topics in a way that maximises the likelihood of the observed data given the model. This involves iterative updating of topic assignments until the algorithm converges.
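To make the iterative assignment concrete, here is a small collapsed Gibbs sampler for LDA in Python. This is a swapped-in teaching technique, not necessarily the estimation method behind the charts (R topic-modelling packages often default to variational EM). The function name, the hyperparameters alpha and eta, and the toy documents are all assumptions, and the sampler runs a fixed number of sweeps rather than testing for convergence.

```python
import random

def lda_gibbs(docs, num_topics, vocab_size, iterations=200,
              alpha=0.1, eta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs is a list of documents, each a list of integer word ids.
    Returns (beta, theta): topic-word and document-topic probabilities.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # tokens of doc d in topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # word w in topic k
    topic_total = [0] * num_topics                              # all tokens in topic k
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Weight each topic by how much the document uses it and
                # how much the topic uses this word.
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + eta)
                    / (topic_total[j] + eta * vocab_size)
                    for j in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    beta = [
        [(topic_word[k][w] + eta) / (topic_total[k] + eta * vocab_size)
         for w in range(vocab_size)]
        for k in range(num_topics)
    ]
    theta = [
        [(doc_topic[d][k] + alpha) / (len(doc) + alpha * num_topics)
         for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    return beta, theta

# Toy corpus: word ids index into this invented vocabulary.
vocab = ["king", "battle", "tablet", "line"]
docs = [[0, 1, 0, 1, 0], [2, 3, 2, 3, 3], [0, 1, 1, 0], [3, 2, 2, 3]]
beta, theta = lda_gibbs(docs, num_topics=2, vocab_size=len(vocab))
for k, row in enumerate(beta, start=1):
    print(k, [round(p, 3) for p in row])
```

Words that keep appearing in the same documents tend to end up sharing a topic, because each resampling step favours topics already common in the document and already associated with the word.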

Example of Interpretation

Let’s say one of the topics (Topic 1) has the top terms: "king", "battle", "hero", "city", "enemy". This topic might be interpreted as relating to warfare or leadership. The exact interpretation requires understanding the context in which these words are used in the text.

By examining the top terms for each topic, you can infer the underlying themes or subjects that the LDA model has identified in your corpus.

Terms for each topic: Gilgamesh.csv

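Given the top terms for each topic, we can infer potential themes or subjects that each topic might represent. Note that the short tokens (e.g., na, sú, ta, ma) appear to be syllables from the transliterated text rather than English words, so the inferences rest mainly on the longer terms. Here are the inferred topics based on the terms: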

Topic 1:

  • Terms: tablet, line, gilgamesh, na, assyrian, version, sú, ta, ma, enkidu
  • Inference: This topic likely revolves around the Assyrian version of the Epic of Gilgamesh, focusing on specific tablets and lines from the text.

Topic 2:

  • Terms: enkidu, gilgamesh, assyrian, line, version, na, sú, lines, ki, ta
  • Inference: This topic appears to emphasise the characters Gilgamesh and Enkidu, and their depiction in the Assyrian version, with a focus on lines from the epic.

Topic 3:

  • Terms: tablet, ma, na, line, sá, lines, gish, form, enkidu, babylonian
  • Inference: This topic may be related to the physical tablets and forms of the epic, particularly focusing on the Babylonian version and its structure.

Topic 4:

  • Terms: version, enkidu, ka, ma, lines, si, woman, ki, la, gish
  • Inference: This topic might address different versions of the Epic of Gilgamesh, possibly with emphasis on the characters and their relationships, including the mention of a woman (likely Shamhat, the temple prostitute who tames Enkidu).

Summary of Inferred Topics

  1. Assyrian Version of the Epic: Focus on specific tablets and lines.
  2. Gilgamesh and Enkidu: Their depiction and relationship in the Assyrian version.
  3. Physical Tablets and Forms: Structure and version, particularly the Babylonian one.
  4. Different Versions and Characters: Emphasis on the variations in the epic, relationships, and possibly specific characters like Shamhat.

Language datasets

Finding a useful language dataset for analysis depends on the specific requirements of your project, such as the type of analysis (e.g., natural language processing, linguistic research), the languages involved, and the nature of the data (e.g., text, audio). Here are some notable sources and types of datasets that can be highly valuable for language analysis:

General Language Datasets

Common Crawl

  • Description: A vast repository of web data collected over many years.
  • Use Cases: Web scraping, large-scale text analysis, training language models.
  • Access: Common Crawl

Google Books Ngram Viewer

  • Description: Provides word frequency data from a vast collection of digitised books spanning several centuries.
  • Use Cases: Historical linguistic analysis, trend analysis, lexical studies.
  • Access: Google Books Ngram Viewer

Project Gutenberg

  • Description: A large collection of free eBooks, primarily consisting of texts that are in the public domain.
  • Use Cases: Literary analysis, training data for natural language processing, linguistic studies.
  • Access: Project Gutenberg

Linguistic and NLP-Specific Datasets

Universal Dependencies (UD)

  • Description: A project that provides a collection of treebanks for various languages, annotated with part-of-speech tags, syntactic dependencies, and morphological features.
  • Use Cases: Syntax and morphology research, multilingual parsing.
  • Access: Universal Dependencies

WordNet

  • Description: A lexical database for the English language that groups words into sets of synonyms and provides short definitions and usage examples.
  • Use Cases: Semantic analysis, word sense disambiguation, lexical research.
  • Access: WordNet

COCO (Common Objects in Context)

  • Description: Primarily an image dataset, but includes extensive annotations and descriptions useful for visual language tasks.
  • Use Cases: Multimodal language models, image captioning.
  • Access: COCO Dataset

Specialised Language Datasets

Europarl Corpus

  • Description: A parallel corpus of the proceedings of the European Parliament, available in multiple languages.
  • Use Cases: Machine translation, multilingual NLP research.
  • Access: Europarl

Tatoeba Project

  • Description: A large database of sentences and translations in many languages.
  • Use Cases: Translation studies, multilingual text analysis.
  • Access: Tatoeba

LibriSpeech

  • Description: A corpus of read English speech, derived from audiobooks.
  • Use Cases: Speech recognition, language modelling.
  • Access: LibriSpeech

Research and Academic Sources

Linguistic Data Consortium (LDC)

  • Description: Provides a wide variety of linguistic resources, including text, speech, and lexicons.
  • Use Cases: Academic research, language technology development.
  • Access: Linguistic Data Consortium

Kaggle Datasets

  • Description: A platform offering a variety of datasets, including those for language processing and analysis.
  • Use Cases: NLP projects, machine learning experiments.
  • Access: Kaggle Datasets

Access and Licensing Considerations

When using these datasets, consider the following:

  • Licensing: Ensure that you comply with the licensing terms of each dataset.
  • Ethical Use: Be mindful of the ethical implications, especially with sensitive or personally identifiable information.
  • Data Cleaning: Some datasets may require significant preprocessing and cleaning to be useful for analysis.
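As an example of the kind of preprocessing the last point refers to, the Python sketch below lowercases text, keeps only alphabetic tokens, and drops stop words and very short tokens. The stop-word list, the minimum length and the sample string are assumptions made for illustration; a real project would use a fuller stop-word list and tune the length threshold.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"the", "of", "and", "a", "in", "to"}

def clean_tokens(text, min_length=3):
    """Normalise, tokenise, and filter a raw string."""
    text = unicodedata.normalize("NFC", text).lower()
    # [^\W\d_]+ matches runs of Unicode letters, so accented forms survive.
    tokens = re.findall(r"[^\W\d_]+", text)
    return [t for t in tokens if len(t) >= min_length and t not in STOP_WORDS]

# Invented sample string, not a line from the dataset.
raw = "Tablet II, line 45: the form of Enkidu (ma-na)."
cleaned = clean_tokens(raw)
print(cleaned)
```

A minimum length of three removes fragments such as "ma" and "na", at the cost of also discarding legitimate short words, so the threshold is a trade-off rather than a rule.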

By leveraging these datasets, you can conduct various forms of linguistic analysis, from syntactic and semantic studies to large-scale NLP tasks.

Conclusion: Embracing Linguistic Diversity

The notion that modern languages are "watered-down" versions of their ancient counterparts is a simplification. Languages evolve, influenced by societal dynamics and technological advancements. While some aspects become streamlined, others become more intricate. We've explored how languages simplify grammar and phonology for efficiency, while simultaneously expanding vocabulary to accommodate technological advancements. The analysis of the Epic of Gilgamesh through sentiment analysis and topic modelling techniques further demonstrates the dynamic nature of language and how its essence can be preserved across time. Each language serves as a unique window into the human experience, reflecting the ingenuity and adaptability of our species.
This project celebrates the diversity of linguistic expression, inviting you, dear reader, to embark on a journey through time and text, discovering the profound legacy of languages in shaping human history.

Patrick Ford 📖
