Context
The Census of India is a rich database that can tell the stories of over a billion Indians. It matters not only for research but also commercially, for organizations that want to understand India's complex yet closely knit heterogeneity.
However, no single database on the web combines district-wise information for all the variables (most include no more than 4-5 of the 50-plus variables). Extracting and using data from the Census of India 2001 is laborious because the data is published as scattered, district-wise PDFs. Individual PDFs can be downloaded from http://www.censusindia.gov.in/(S(ogvuk1y2e5sueoyc5eyc0g55))/Tables_Published/Basic_Data_Sheet.aspx.
Content
This database was extracted from the 2001 Census and covers 590 districts, with around 80 variables each.
If the meaning of a variable is unclear, refer to the following PDF, which shows the variables in their original context: http://censusindia.gov.in/Dist_File/datasheet-2923.pdf
All the extraction work can be found at https://github.com/preetskhalsa97/census2001auto
The final CSV can be found at finalCSV/all.csv
The trick that made it possible to automate most of the extraction was that the URLs of all the PDFs were identical except for four digits (the respective state and district codes).
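The URL pattern described above can be sketched as follows. This is a minimal illustration, not the code from the repository: the base URL is inferred from the example data sheet link earlier in this description, and the assumption that the four digits are a two-digit state code followed by a two-digit district code is a guess that happens to fit that example (2923). The helper names datasheet_url and candidate_urls are hypothetical.

```python
# Assumed URL pattern, inferred from the example link in this description.
# The two-digit state + two-digit district layout is an assumption.
BASE_URL = "http://censusindia.gov.in/Dist_File/datasheet-{state:02d}{district:02d}.pdf"


def datasheet_url(state_code: int, district_code: int) -> str:
    """Build the data sheet PDF URL for one district (hypothetical helper)."""
    if not (0 <= state_code <= 99 and 0 <= district_code <= 99):
        raise ValueError("state and district codes must each fit in two digits")
    return BASE_URL.format(state=state_code, district=district_code)


def candidate_urls(state_code: int, max_district: int) -> list[str]:
    """List candidate URLs for districts 1..max_district of one state."""
    return [datasheet_url(state_code, d) for d in range(1, max_district + 1)]


if __name__ == "__main__":
    # Reproduces the example link given in this description.
    print(datasheet_url(29, 23))
```

Not every generated URL is guaranteed to exist, so a downloader built on this would need to handle missing files.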
A few abbreviations used for states:
AN- Andaman and Nicobar
CG- Chhattisgarh
D_D- Daman and Diu
D_N_H- Dadra and Nagar Haveli
JK- Jammu and Kashmir
MP- Madhya Pradesh
TN- Tamil Nadu
UP- Uttar Pradesh
WB- West Bengal
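When working with the CSV, it may help to expand these abbreviations back to full names. The sketch below uses only the abbreviations listed above; the function name expand_state is hypothetical, and how the state values are actually stored in the CSV is an assumption to check against the file.

```python
# Abbreviations exactly as listed in this description.
STATE_ABBREVIATIONS = {
    "AN": "Andaman and Nicobar",
    "CG": "Chhattisgarh",
    "D_D": "Daman and Diu",
    "D_N_H": "Dadra and Nagar Haveli",
    "JK": "Jammu and Kashmir",
    "MP": "Madhya Pradesh",
    "TN": "Tamil Nadu",
    "UP": "Uttar Pradesh",
    "WB": "West Bengal",
}


def expand_state(name: str) -> str:
    """Return the full state name for a known abbreviation.

    Values that are not in the table (for example, names that were
    never abbreviated) are returned unchanged, minus surrounding spaces.
    """
    cleaned = name.strip()
    return STATE_ABBREVIATIONS.get(cleaned, cleaned)


if __name__ == "__main__":
    print(expand_state("TN"))
```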
A few variables for clarification:
Growth..1991...2001- population growth from 1991 to 2001
X0..4 years- People in age group 0 to 4 years
SC1- Scheduled Caste with the highest population
Acknowledgements
Inspiration
This is a massive dataset that can be used to explore the interplay between education, caste, development, gender and much more.
It can explain a lot about India and propel data-driven research.
Happy Number Crunching!