Sp25: cross-listed with data processing/management with Python; the class is in Python (*not* Stata): https://theaok.github.io/datManPy
56:824:718 data management
(56:834:651 special problems in pub pol and adm)
https://theaok.github.io/dm most current syllabus (class materials
updated continuously)
labs: during office hours
Sp 2024 Tue 6.00-8.50pm BSB-336
instructor
- Adam Okulicz-Kozaryn adam.okulicz.kozaryn@gmail.com
- office: 321 Cooper St, room 302; office hours: Mon 4-5, and by appointment
- or just stop by: this semester I am in most of Mon and Thu
prerequisites
You need to be comfortable using a computer. Knowledge of Python (or
Stata, R, etc.) and data management/computer science is helpful but not
necessary. We will cover the basics.
course description
Most of this class is writing computer programs; if you do not like
programming, this class is not for you... But you may not yet know
whether you like it, and you may start liking it in this class: it
has often happened before!
There are two major components to the class: (1) data management, (2) simple programming.
We will use Python and Stata. I'd discourage R for data management (unless you have small, simple datasets and already use R).
The class teaches tools for data
cleaning, organization, quality control, and automation. Topics
include data types, text/math functions, labeling, recoding,
documentation, merging, reshaping, and programming (macros, loops,
and branching).
This class should be named "programming Stata for social
science" or "intro to data science for social science using Stata".
Stata is excellent software for data
management. But sometimes you need to use a general-purpose
programming language for data management. Python is both powerful and
easy to use. We will use it for file manipulation, text processing, interacting with APIs, and scraping websites. [Python is optional]
learning objectives/outcomes
The key is the mastery of "data storytelling": 1) what the data are
telling, 2) what I want to say, and 3) what the audience needs to know.
You'll learn:
about data (sources, best practices, tips and tricks): this class
is as much about Stata as about data (you'll use data you choose yourself,
which will serve you well beyond this class!!)
the basics of computer programming
the practice of data management (there will be some theory, too)
how to conduct reproducible research
how to automate research by programming
basics of Python for data management [optional]
Git, a version control system
You'll demonstrate mastery of the material by writing code for a
project/paper using learned techniques;
you may cowrite code (up to 2 people), but then
the project should be 2 times better than a single-authored paper
required textbooks and materials
No required textbooks. All required materials (code, readings) will be provided.
recommended course materials
Most of the class is based on: Mitchell,
"Data Management Using Stata: A
Practical Handbook" http://www.Stata.com/bookstore/dmus.html (nice)
A similar book, but with a focus on organization, is: Long,
"The Workflow of Data Analysis Using Stata" http://www.Stata.com/bookstore/wdaus.html (much boilerplate, outdated)
Programming, specifically, is covered in Chris Baum, "An
Introduction to Stata
Programming" https://www.stata.com/bookstore/introduction-stata-programming/ (detailed)
For beginners:
"A Gentle Introduction to Stata, 3rd Edition"
https://www.stata.com/bookstore/gentle-introduction-to-stata/
Also for beginners: "Statistics with STATA: Version 12" by L.
Hamilton https://www.stata.com/bookstore/statistics-with-stata/ (good but overpriced)
But no need to buy books; there is plenty of superb free material online:
UCLA is the best website: https://stats.idre.ucla.edu/stata/
UCLA for data
management: https://stats.idre.ucla.edu/stata/seminars/stata-data-management/
love this guy https://www.princeton.edu/~otorres
and many more links here: http://www.Stata.com/links/resources1.html
software
stata
We will use Stata 16 or 17 (Intercooled/IC or higher: SE or MP).
Some lucky people can download it for free
at https://software.rutgers.edu;
but you probably have to be an RU employee
You can buy a Stata/BE (Basic Edition) perpetual license for $225 https://www.stata.com/order/new/edu/profplus/student-pricing
There is not much difference between versions: Stata >=12 is fine; a new version is out every 1-2 years
Or you can just run it remotely:
https://apps.camden.rutgers.edu/novnc/; note: just hit connect at top right, type your netid (it won't show anything typed), hit enter, type your password, and hit enter;
see the howto
at https://it.camden.rutgers.edu/help/remote-x/
esp. how to resize the geometry to fit your screen, important! e.g.
for 1280×1024: netid:geom=1280x1024
an alternative is bad windows via RU Camden Virtual
Lab https://rcit.rutgers.edu/virtlab
sometimes may need to run those within
VPN vpn1.rutgers.edu, and sometimes may
need to activate apps first (maybe even vpn too): on the left 'Service
Activation' https://netid.rutgers.edu/index.htm
git
howto get started with git (very quick HOWTO on basic use of github.com):
[right before the break so can troubleshoot during the break] (first
make a quick ex1.do with say just 'sysuse auto, clear' and 2nd line 'd')
sign up or login at github.com
(can also use bitbucket.org or something else)
you may see a 'New repository' button on your right: click it; or on
the left go to 'repositories', click 'new'; then pick some name for
your repository, keep
'public' selected, important!: you must check 'Initialize this repository
with a README', and click the 'Create repository' button
now you can simply hit the 'Upload files' button and choose your dofile,
say ps0.do, important: add some meaningful commit message, say: 'first try on importing and exporting data' and hit the 'Commit
changes' button
then hit 'Settings' towards the top-right, and then on the left select the
'Collaborators' tab and add me "theaok" and hit 'Add collaborator', that's it!
I will download it, edit, and upload back
you can click my commit message and see the so-called
diff--the difference between your version and my version
you can
download this latest version first, edit it, and upload it back when
done--don't forget a meaningful commit message--you can keep on
uploading newer versions as many times as you like
note: when you click the file, you can then click 'History' and
see how the file evolved over time :)
a thought about file naming: ps0.do, ps1.do, etc or just
substantive name and keep it updating with new stuff as we go! say "incomeInequalityAcrossCounties.do"
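To make the "diff" mentioned above concrete, here is a minimal Python sketch using the standard library's difflib. The two dofile versions are made up for illustration, and the helper name make_diff is hypothetical; github.com computes and displays its own diff, so this only shows the idea.

```python
import difflib

# Two hypothetical versions of a tiny dofile: the student's upload and the
# instructor's edited copy (contents are made up for illustration).
student_version = [
    "sysuse auto, clear",
    "d",
]
instructor_version = [
    "sysuse auto, clear",
    "describe",
    "summarize price mpg",
]


def make_diff(old_lines, new_lines, name="ex1.do"):
    """Return a unified diff (as a list of lines) between two file versions."""
    return list(
        difflib.unified_diff(
            old_lines,
            new_lines,
            fromfile=f"a/{name}",
            tofile=f"b/{name}",
            lineterm="",
        )
    )


diff_lines = make_diff(student_version, instructor_version)
print("\n".join(diff_lines))
```

Lines starting with "-" were removed, lines starting with "+" were added, and lines starting with a space are unchanged context.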
below are general references on how to get started using it fully,
probably the first two are most useful
http://www.sitepoint.com/git-for-beginners/
http://rogerdudler.github.io/git-guide/
http://stackoverflow.com/questions/315911/git-for-beginners-the-definitive-practical-guide
https://backlogtool.com/git-guide/en/intro/intro1_1.html
recommended software
Python
there will be at least 2 classes in the second part of the
semester about Python
Python is a general-purpose programming language that can do much
more than Stata (statistical software)
Python is among the most user-friendly and easy-to-use general-purpose
programming languages
Stata 16 or 17 can embed Python
advice/requirements
2 keys to success: start early AND ask many questions often. This is a
software class. It is different from all other classes! You will get
stuck often, and whenever you are stuck, email me instead of pulling
your hair out! And stop by my office, too.
There are 6 problem sets (ps), each due the week after it is
posted (unless indicated otherwise; some ps will be due in 2 weeks). You will be asked to write some computer code that
applies something we covered in class to your data. You may
work in groups (<=2), but say who you worked with,
and the more people in the group, the better/longer the code should be.
The final project is like a final paper (doing some useful empirical
quantitative research), except that I only grade code; in fact you can submit
code only.
grading
problem sets 60% (6ps x 10%) [just computer code (dofile)]
final project [just another computer code dofile wrapping previous ps] 40%
calendar
[*] = bonus (extra/not required)
ps0.pdf
intro_to_course.pdf intro.do
replication.pdf
final_project.pdf: just skim through TOC
[*] Data revolution! economist
data data everywhere
[*] "The end of theory" http://www.wired.com/science/discoveries/magazine/16-07/pb_theory
data management
jan27 data formats and conversion (quick stata
lab at 5.30 and will stay 15 min after the class if
needed)
zoom vid pass: Yzcm*6pX
We will talk about different basic data formats, conversion
between them, and how they can be imported to and exported from Stata
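As a rough illustration of format conversion, the sketch below reads a small CSV into Python records, writes them as JSON, and converts back. It uses only the standard library so it runs anywhere; the sample rows and the helper name csv_to_records are assumptions for illustration, and the class materials may use other tools (for example Stata's own import/export commands).

```python
import csv
import io
import json

# A small sample CSV, standing in for a file exported from Stata or Excel.
csv_text = """make,price,mpg
AMC Concord,4099,22
Buick Century,4816,20
"""


def csv_to_records(text):
    """Parse CSV text into a list of dicts, converting digit-only strings to int."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        records.append({k: int(v) if v.isdigit() else v for k, v in row.items()})
    return records


records = csv_to_records(csv_text)
json_text = json.dumps(records, indent=2)  # CSV -> JSON
print(json_text)

# Round trip: JSON -> CSV again
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["make", "price", "mpg"], lineterminator="\n")
writer.writeheader()
writer.writerows(json.loads(json_text))
print(out.getvalue())
```

Note that CSV stores everything as text, so types (numbers vs strings) have to be restored on import; that is one of the main things to check after any conversion.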
ps1.pdf
dataFormats.pdf dataFormats.do
https://stats.idre.ucla.edu/stata/seminars/notes/stata-class-notesentering-data/
present ps0
[*] flip the class: we will flip the last half an hour or so
[*] Mitchell ch2
feb3 [IN PERSON! NO MORE ZOOM!] data manipulation
vid
[old vid]
[zoom vid] pass: rxi@s!x5
again, push stuff early to github and ask for comments!!!
start looking at github and ps1
ps2.pdf
manipulate.pdf manipulate.do
https://stats.idre.ucla.edu/stata/seminars/notes/stata-class-notesmodifying-data
if time or at home: do Example 1 (p12)
from https://www.stata.com/manuals14/dmerge.pdf; at
home: read that whole file and make sure you run examples; try to
come to labs where we can discuss and practice more!
[*] Mitchell ALL
This class covers the key command of the course: merge.
ps3.pdf
(start with a look at git repos) mergeAppendReshape.pdf mergeAppendReshape.do
merge conceptual setup:
https://www.princeton.edu/~otorres/Merge101.pdf
merge practice: (make sure you run examples that start on p12!):
https://www.stata.com/manuals14/dmerge.pdf
a quick overview of what we have done so far and doing today http://dss.princeton.edu/training/DataPrep101.pdf
reshape (and see "help reshape"!) https://stats.idre.ucla.edu/stata/modules/reshaping-data-wide-to-long/ https://stats.idre.ucla.edu/stata/modules/reshaping-data-long-to-wide/
[*] Mitchell ALL
flip the class
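The idea behind a one-to-one merge can be sketched in plain Python. This is only an illustration, not Stata's implementation: the county data are made up, the function name merge_one_to_one is hypothetical, and the match codes mimic Stata's _merge variable (1 = master only, 2 = using only, 3 = matched).

```python
# Hypothetical "master" and "using" datasets keyed by county id (made-up values).
master = {1: {"income": 50}, 2: {"income": 60}, 3: {"income": 55}}
using = {2: {"happiness": 7}, 3: {"happiness": 6}, 4: {"happiness": 8}}


def merge_one_to_one(master, using):
    """Full 1:1 merge on the dict key, with a Stata-like _merge code:
    1 = key only in master, 2 = key only in using, 3 = matched in both."""
    merged = {}
    for key in sorted(set(master) | set(using)):
        # Start from the using row, then let master values win on any conflict.
        row = dict(using.get(key, {}))
        row.update(master.get(key, {}))
        if key in master and key in using:
            row["_merge"] = 3
        elif key in master:
            row["_merge"] = 1
        else:
            row["_merge"] = 2
        merged[key] = row
    return merged


merged = merge_one_to_one(master, using)
for key, row in merged.items():
    print(key, row)

# Tabulate the match codes before moving on.
counts = {}
for row in merged.values():
    counts[row["_merge"]] = counts.get(row["_merge"], 0) + 1
print(counts)
```

Unmatched observations (codes 1 and 2) are usually the first thing to investigate: they often point to key typos or to coverage differences between the datasets.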
feb17 continue with merge: digest, practice; and: organization, documentation
[old vid]
vid
We continue with last class. We covered the key topic, merge, last week, so today we
make sure that everything is crystal clear. We also cover labeling data, variables, and values, and keeping your data
organized.
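The labeling idea can be sketched in Python as well: keep the numeric codes in the data and store variable labels and value labels separately. The rows, labels, and the helper describe_value below are made up for illustration; in Stata the corresponding commands are label variable, label define, and label values.

```python
# Made-up survey rows: 'sex' is stored as a numeric code, as is common in raw data.
rows = [{"id": 1, "sex": 1}, {"id": 2, "sex": 2}, {"id": 3, "sex": 9}]

# Variable labels describe columns; value labels describe codes within a column.
variable_labels = {"id": "respondent id", "sex": "sex of respondent"}
value_labels = {"sex": {1: "male", 2: "female"}}


def describe_value(variable, code):
    """Return the label for a code, or flag codes that have no label yet."""
    return value_labels.get(variable, {}).get(code, f"unlabeled ({code})")


labeled = [describe_value("sex", r["sex"]) for r in rows]
print(variable_labels["sex"], "->", labeled)
```

Flagging unlabeled codes (here 9) is a cheap quality check: such codes are often missing-value markers that should be recoded before analysis.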
if needed: revisit the code from last class
organize.pdf organize.do
present ps2 in the middle of the class
[*] https://stats.idre.ucla.edu/stata/modules/labeling-data/
[*] Scott Long "The Workflow of Data Analysis Using Stata" ALL
[*] Mitchell ch5
if time: wrapup/exercises/tutorials/flip as per next class
feb24 ps3 presentations and revisit/wrap up
vid
Revisit/wrap up what we did so far, esp import/export and
manipulating data; next week new module: visualization/graphs.
do organize.pdf and organize.do from last week
present ps3 15min and 15min discussion
wrap up what we did so far, review, revisit, maybe do http://dss.princeton.edu/training/DataPrep101.pdf
flip the class: flip the last 45min or so; maybe: exercises from
last few classes; and tutorials from links from last classes
mar3 visualization 1 and exporting results
vid
[old vid]
zoom vid pass: yJyr3+k2
NOTE: 2 classes on this important topic of visualization.
Graphics are critical to understanding data, and understanding data is critical in data management.
Results from Stata are data too, and you need to manage them too!
ps4.pdf
c7_graphics.pdf c7_graphics.do
mar10 Visualization 2
vid
[old zoom vid] pass: K#BaXN7%
(traditional tables: table_jargo.do) exp.do
at home see http://dss.princeton.edu/training/Outreg2.pdf AND a great resource for regressions: just follow these examples in
your dofile! https://stats.idre.ucla.edu/stata/webbooks/reg/
theory.pdf
try to do as much as we can: possibly just get to key slides based on
Scott Long in first part as for theory for soc sci
15min zach presentation zachPresentation.do and zachPresentation.pptx
flip the class: we spend most of class on your graphs, so please have some and be ready to present: the more you have, the more help you will get!
mar17 spring break
programming
mar24 programming elements: macros, loops
vid
[old vid]
zoom vid pass: 2T^1pk@B
Introduction to elements of programming in Stata: macros and loops. Macros are the building blocks of Stata programs. Loops are very useful for automating repetitive tasks in Stata.
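The same two ideas carry over to Python, which the class uses later. In this rough sketch a list of variable names plays the role of a Stata local macro holding a varlist, and a for loop repeats one step per variable; the data values and the names varlist and means are chosen for illustration only.

```python
from statistics import mean

# Sample data: one list of values per variable.
data = {
    "price": [4099, 4749, 3799],
    "mpg": [22, 17, 22],
    "weight": [2930, 3350, 2640],
}

# Roughly what a Stata local macro holding a variable list does:
# store the names once, reuse them everywhere.
varlist = ["price", "mpg"]

# The loop repeats the same step for each name, instead of copy-pasting it.
means = {}
for var in varlist:
    means[var] = round(mean(data[var]), 2)
    print(f"mean of {var}: {means[var]}")
```

Adding a variable to the analysis then means editing one line (the list), not every command that uses it.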
ps5.pdf
macrosLoops.pdf macrosLoops.do
ps4 10 min (plus 10min discussion) presentations: focus on graphs
a very basic introduction to programming http://www.ssc.wisc.edu/sscc/pubs/stata_prog1.htm
present ps4/graphs as time allows
[*] https://stats.idre.ucla.edu/stata/faq/how-can-i-reshape-doubly-or-triply-wide-data-to-long
[*] an introduction to programming https://stats.idre.ucla.edu/stata/seminars/stata-programming/
[*] foreach examples https://stats.idre.ucla.edu/stata/modules/working-across-variables-using-foreach
[*] more examples http://fmwww.bc.edu/ec-p/wp612.pdf
[*] Mitchell ch9
[*] Baum first few chapters
mar31 advanced macros and loops; [*] replication/practice using my dofiles
vid
[old vid]
zoom vid pass: $H3=Bc@4
would anyone like to present some loops or macros you've made so
far?
advMacLoo.do
merging and data management project: replicateMiComp.zip
cars and happiness paper: replicateLsCar.zip
[*] work and happiness paper: REPLICATION.tar.bz2
[*] examples of replication materials http://myweb.uiowa.edu/fboehmke/methods.html
[*] Baum first few chapters
flip the class: work on ps5
apr7 text as data and quick start with python
vid
[old vid]
[old zoom vid] pass: !=#7*BtD
ps6.pdf
note: do brief presentations of ps5
stata_text.pdf stata_text.do
we are finishing Stata; anything to revisit, e.g. loops? any
questions? btw did we go too fast or too slow?
a quick dive into python (if time)
basPy.pdf
colab
(probably thru bas des sta)
python
apr14 python for social science data management
vid
[old vid]
general point for ps6 and final project: the code you run must make substantive sense,
too! don't just run stuff for the sake of it... we code to accomplish something
start with py from last class in colab: 'basic descriptive
statistics' and continue throughout
apiPy.pdf api: get data from internet
colab: pulling data from wb, fred
theory.pdf [very important for ps6 and final project!!]
if time: discuss final project for this class; quick look and skim through https://theaok.github.io/dm/final.pdf
[sp22 added new sec PANDAS: start with that]
colab (matplotlib, gis)
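For the API part of this class (pulling data from wb, i.e. the World Bank), a minimal sketch of the two steps involved is building a request URL and parsing the JSON that comes back. The function names are hypothetical, the response layout (a two-element list of metadata and observations) is an assumption to verify against the World Bank API documentation, and the sample payload uses placeholder values, not real data. No network request is made here.

```python
import json
from urllib.parse import urlencode


def build_worldbank_url(country, indicator, start, end):
    """Build a World Bank API v2 request URL (no request is sent here)."""
    base = f"https://api.worldbank.org/v2/country/{country}/indicator/{indicator}"
    query = urlencode({"format": "json", "date": f"{start}:{end}"})
    return f"{base}?{query}"


def parse_observations(payload_text):
    """Turn a [metadata, observations] JSON payload into (year, value) pairs,
    skipping missing values. The payload layout is an assumption."""
    metadata, observations = json.loads(payload_text)
    return sorted(
        (int(obs["date"]), obs["value"])
        for obs in observations
        if obs["value"] is not None
    )


url = build_worldbank_url("US", "SP.POP.TOTL", 2020, 2021)
print(url)

# Hand-written stand-in for a response; the values are placeholders, not real data.
sample_payload = json.dumps([
    {"page": 1, "pages": 1, "total": 3},
    [
        {"date": "2021", "value": 200.0},
        {"date": "2020", "value": 100.0},
        {"date": "2019", "value": None},
    ],
])
print(parse_observations(sample_payload))
```

In a live session the payload text would come from something like urllib.request.urlopen(url).read(); keeping the parsing in its own function makes it testable without internet access.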
>>>note: the following will be updated>>>
apr28 last class presentations and wrap up
[old vid]
ps6 15min (sharp; I'll cut you off) student presentations, focus on bottom line/results (e.g. des sta, substantive findings in your soc
sci research)!, which we will discuss and brainstorm; just code, no need for ppt
student presentations: what data; why? what is special about
those data? any limitations? show a nice chunk of code you're proud of; show some
interesting des stats or graphs or maps or network analysis etc.; also
ask us questions!
have a look at Canvas at your predicted course grade so far;
remember the final project is 40% of the grade
if you get >=9.5 on ps6 AND your total is >=95% on Canvas,
you're done: you can just submit ps6 as the final project and it's an A
check out my paper on happiness and pop growth across us
counties: pdf
and colab
wrap.pdf
final project due on may5 at 6pm
final.pdf
final_project.pdf
just to be safe, delete the data you have posted online, you never know: someone may be picky about it
rules
do not share or link to class videos!
These videocasts and podcasts are the exclusive copyrighted property of Rutgers University and the Professor teaching the course. Rutgers University and the Professor grant you a license only to replay them for your own personal use during the course. Sharing them with others (including other students), reproducing, distributing, or posting any part of them elsewhere -- including but not limited to any internet site -- will be treated as a copyright violation and an offense against the honesty provisions of the Code of Student Conduct. Furthermore, for Law Students, this will be reported by the Law School to the licensing authorities in any jurisdiction in which you may apply to the bar.
attendance
Attendance is recommended. Be advised that you are
responsible for any material covered in the class, whether or not it was in the readings or
lecture notes. You are also responsible for any announcements made in class. For most
students, attendance is simply essential to learning the material. If you do need to miss a
class, be sure to consult with a fellow student to learn what transpired.
incompletes: Generally speaking, the material in this course is best learned as a single unit. I
will grant incompletes only in cases where a substantial change in life circumstances occurs that
is beyond the control of the student, and only with appropriate
documentation.
study groups. You are encouraged to form a regular study group. Many students over the years
have found the study groups to be very helpful. Study groups are permitted and encouraged to
work on the problem sets together. However, each individual student should write up his or her
own answer to hand in, based on his or her own understanding of the material. Do not hand in a
copy of another person’s problem set, even a member of your own group. Writing up your own
answer helps you to internalize the group discussions and is a crucial step in the learning process.
Academic Integrity. I am very serious about this. Make no
mistake--I may appear accommodating and informal--but I am extremely
strict about academic integrity. Violations of academic integrity include cheating on tests or handing in
assignments that do not reflect your own work and/or the work of a study group in which you
actively participated. Handing in your own work that was performed not
for this class (e.g. other class, any other project) is cheating,
too. I have a policy of zero tolerance for cheating. Violations will be referred
to the appropriate university authorities.
For more information see http://fas.camden.rutgers.edu/student-experience/academic-integrity-policy
Accommodating Students with Disabilities.
Any student with a disability affecting performance in the class
should contact the disability office ASAP: http://learn.camden.rutgers.edu/disability/disabilities.html